Exercises ice duration data

In this exercise we will model lake ice duration data using linear regression models. According to the United States Environmental Protection Agency, lake ice duration can be an indicator of climate change. This is because lake ice is dependent on several environmental factors, so changes in these factors will influence the formation of ice on top of lakes. As a result, the study and analysis of lake ice formation can inform scientists about how quickly the climate is changing, and are critical to minimizing disruptions to lake ecosystems. We can examine the ice duration of Lake Mendota and Lake Monona in the are of Madison, WI. The dataset is extracted from the ntl_icecover dataset of the R package lterdatasampler.

You can load the data as follows:

import pandas as pd
import matplotlib.pyplot as plt
df_icecover = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/df_icecover.csv")
df_icecover.head()
         lakeid      ice_on     ice_off  ice_duration  year
0  Lake Mendota  1855-12-18  1856-04-14           118  1855
1  Lake Mendota  1856-12-06  1857-05-06           151  1856
2  Lake Mendota  1857-11-25  1858-03-26           121  1857
3  Lake Mendota  1858-12-08  1859-03-14            96  1858
4  Lake Mendota  1859-12-07  1860-03-26           110  1859

Question 1

Complete the code below to visualize the ice duration over the years for both lakes.

Select the observations corresponding to each lake using boolean indexing and plot the data using matplotlib.

import pandas as pd
import matplotlib.pyplot as plt
df_icecover = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/df_icecover.csv")

# Filter lakes
df_lake_mendota = df_icecover[df_icecover['lakeid'] == "Lake Mendota"]
df_lake_monona  = df_icecover[df_icecover['lakeid'] == "Lake Monona"]

# Colors
mycol = ["#6a5acd", "#e64173"]

# Plot
plt.figure(figsize=(8,5))
plt.plot(df_lake_mendota['year'], df_lake_mendota['ice_duration'], 'o', color=mycol[0], label='Mendota')
plt.plot(df_lake_monona['year'], df_lake_monona['ice_duration'], 'o', color=mycol[1], label='Monona')

plt.xlabel("Year")
plt.ylabel("Ice duration (days)")
plt.title("Ice duration over years")
plt.legend(loc="upper right")
plt.xticks(rotation=45)
plt.grid(True)
plt.show()

Question 2

One of your colleague proposes to consider the following models to analyze the ice duration data: \[ \text{IceCover}_i = \beta_0 + \beta_1 \text{Year}_i + \epsilon_i \] He estimates the model and obtain the following results:

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import numpy as np

df_icecover = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/df_icecover.csv")
mod1 = smf.ols("ice_duration ~ year", data=df_icecover).fit()
print(mod1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           ice_duration   R-squared:                       0.246
Model:                            OLS   Adj. R-squared:                  0.244
Method:                 Least Squares   F-statistic:                     107.2
Date:                Wed, 12 Nov 2025   Prob (F-statistic):           6.14e-22
Time:                        11:11:16   Log-Likelihood:                -1406.9
No. Observations:                 331   AIC:                             2818.
Df Residuals:                     329   BIC:                             2825.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    495.4410     37.926     13.063      0.000     420.834     570.048
year          -0.2027      0.020    -10.356      0.000      -0.241      -0.164
==============================================================================
Omnibus:                       25.409   Durbin-Watson:                   2.150
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               32.243
Skew:                          -0.597   Prob(JB):                     9.97e-08
Kurtosis:                       3.955   Cond. No.                     7.85e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.85e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

He wants to test if the ice duration is reducing over the years, regardless of the lake. What are the correct hyptheses to test this claim?

Question 3

How do you interpret the coefficent associated with \(\beta_0\) in mod1?

Question 4

A second colleague of yours proposes to consider the following model to analyze the ice duration data: \[ \text{IceCover}_i = \beta_0 + \beta_1 \text{Year}_i + \beta_2 \text{LakeID}_i + \epsilon_i \] More precisely, she wants to test if the ice duration is reducing over the years, while including an intercept for each lake.

She estimates the model and obtain the following results:

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import numpy as np
df_icecover = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/df_icecover.csv")
mod2 = smf.ols("ice_duration ~ year + lakeid", data=df_icecover).fit()
print(mod2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           ice_duration   R-squared:                       0.247
Model:                            OLS   Adj. R-squared:                  0.242
Method:                 Least Squares   F-statistic:                     53.69
Date:                Wed, 12 Nov 2025   Prob (F-statistic):           6.74e-21
Time:                        11:11:16   Log-Likelihood:                -1406.7
No. Observations:                 331   AIC:                             2819.
Df Residuals:                     328   BIC:                             2831.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept               494.7583     37.980     13.027      0.000     420.043     569.474
lakeid[T.Lake Monona]     1.1167      1.873      0.596      0.551      -2.568       4.801
year                     -0.2027      0.020    -10.342      0.000      -0.241      -0.164
==============================================================================
Omnibus:                       25.159   Durbin-Watson:                   2.152
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               31.751
Skew:                          -0.595   Prob(JB):                     1.27e-07
Kurtosis:                       3.941   Cond. No.                     7.86e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.86e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Complete the following code to plot the data points and the regression lines for both lakes.

Use the method predict of the fitted model to obtain the predicted values for each lake over a range of years, and plot these predictions as lines on top of the scatter plot of the data points.

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import numpy as np

# Load data
df_icecover = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/df_icecover.csv")

# Fit linear model with lake-specific intercepts
mod2 = smf.ols("ice_duration ~ year + lakeid", data=df_icecover).fit()
print(mod2.summary())

# Separate lakes
df_mendota = df_icecover[df_icecover['lakeid'] == "Lake Mendota"]
df_monona  = df_icecover[df_icecover['lakeid'] == "Lake Monona"]

# Colors
colors = ["#6a5acd", "#e64173"]

# Plot data points
plt.figure(figsize=(8,5))
plt.scatter(df_mendota['year'], df_mendota['ice_duration'], color=colors[0], label="Mendota")
plt.scatter(df_monona['year'], df_monona['ice_duration'], color=colors[1], label="Monona")

# Prepare years for predictions
years = np.linspace(df_icecover['year'].min(), df_icecover['year'].max(), 200)

# Predict for Mendota
pred_mendota = mod2.predict(pd.DataFrame({'year': years, 'lakeid': 'Lake Mendota'}))
plt.plot(years, pred_mendota, color=colors[0], linestyle='-', linewidth=2)

# Predict for Monona
pred_monona = mod2.predict(pd.DataFrame({'year': years, 'lakeid': 'Lake Monona'}))
plt.plot(years, pred_monona, color=colors[1], linestyle='-', linewidth=2)

# Labels, legend, title
plt.xlabel("Year")
plt.ylabel("Ice duration (days)")
plt.title("Ice duration over years with regression lines")
plt.legend()
plt.grid(True)
plt.show()

Question 5

A third colleague of yours proposes to consider the following model to analyze the ice duration data: \[ \text{IceCover}_i = \beta_0 + \beta_1 \text{Year}_i + \beta_2 \text{LakeID}_i + \beta_3 \text{Year}_i \times \text{LakeID}_i + \epsilon_i \] What is the correct formula to specify this model in statsmodels?

Question 6

She estimates the model and obtain the following results:

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import numpy as np
df_icecover = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/df_icecover.csv")
mod3 = smf.ols("ice_duration ~ year + lakeid + year:lakeid", data=df_icecover).fit()
print(mod3.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           ice_duration   R-squared:                       0.247
Model:                            OLS   Adj. R-squared:                  0.240
Method:                 Least Squares   F-statistic:                     35.76
Date:                Wed, 12 Nov 2025   Prob (F-statistic):           5.18e-20
Time:                        11:11:16   Log-Likelihood:                -1406.6
No. Observations:                 331   AIC:                             2821.
Df Residuals:                     327   BIC:                             2836.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
Intercept                    479.2000     54.027      8.870      0.000     372.915     585.485
lakeid[T.Lake Monona]         31.9296     76.027      0.420      0.675    -117.635     181.494
year                          -0.1946      0.028     -6.980      0.000      -0.249      -0.140
year:lakeid[T.Lake Monona]    -0.0159      0.039     -0.405      0.685      -0.093       0.061
==============================================================================
Omnibus:                       25.821   Durbin-Watson:                   2.154
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               32.984
Skew:                          -0.601   Prob(JB):                     6.88e-08
Kurtosis:                       3.972   Cond. No.                     2.06e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.06e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

This colleague wants to test whether the ice duration is evolving at different rates for the two lakes. What would be the appropriate hypotheses for this test?

Question 7

What is the p-value associated with this test according to the model summary above?

Question 8

What can you conclude regarding the difference in ice duration between the two lakes (considering \(\alpha=0.05\))?

Question 9

Complete the following code to plot the data points and the regression lines for both lakes according to the third model.

Use the method predict of the fitted model to obtain the predicted values for each lake over a range of years, and plot these predictions as lines on top of the scatter plot of the data points.

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import numpy as np

# Load data
df_icecover = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/df_icecover.csv")

# Fit linear model with lake-specific intercepts
mod3 = smf.ols("ice_duration ~ year + lakeid + year:lakeid", data=df_icecover).fit()
print(mod3.summary())

# Separate lakes
df_mendota = df_icecover[df_icecover['lakeid'] == "Lake Mendota"]
df_monona  = df_icecover[df_icecover['lakeid'] == "Lake Monona"]

# Colors
colors = ["#6a5acd", "#e64173"]

# Plot data points
plt.figure(figsize=(8,5))
plt.scatter(df_mendota['year'], df_mendota['ice_duration'], color=colors[0], label="Mendota")
plt.scatter(df_monona['year'], df_monona['ice_duration'], color=colors[1], label="Monona")

# Prepare years for predictions
years = np.linspace(df_icecover['year'].min(), df_icecover['year'].max(), 200)

# Predict for Mendota
pred_mendota = mod3.predict(pd.DataFrame({'year': years, 'lakeid': 'Lake Mendota'}))
plt.plot(years, pred_mendota, color=colors[0], linestyle='-', linewidth=2)

# Predict for Monona
pred_monona = mod3.predict(pd.DataFrame({'year': years, 'lakeid': 'Lake Monona'}))
plt.plot(years, pred_monona, color=colors[1], linestyle='-', linewidth=2)

# Labels, legend, title
plt.xlabel("Year")
plt.ylabel("Ice duration (days)")
plt.title("Ice duration over years with regression lines")
plt.legend()
plt.grid(True)
plt.show()

Question 10

A climatologist uses mod3 to predict the ice duration in the year \(2500\) for both lakes and obtains the following results:

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import numpy as np

# Load data
df_icecover = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/df_icecover.csv")
# Fit linear model with lake-specific intercepts and trend
mod3 = smf.ols("ice_duration ~ year + lakeid + year:lakeid", data=df_icecover).fit()
# Define levels for the categorical variable
lake_levels = df_icecover['lakeid'].astype('category').cat.categories
# Predict for Mendota
pred_mendota = mod3.predict(pd.DataFrame({
    'year': [2500], 
    'lakeid': pd.Categorical(['Lake Mendota'], categories=lake_levels)
}))
# Predict for Monona
pred_monona = mod3.predict(pd.DataFrame({
    'year': [2500], 
    'lakeid': pd.Categorical(['Lake Monona'], categories=lake_levels)
}))

print(f"Predicted ice duration in 2500 for Lake Mendota: {pred_mendota.values[0]:.2f} days")
Predicted ice duration in 2500 for Lake Mendota: -7.37 days
print(f"Predicted ice duration in 2500 for Lake Monona: {pred_monona.values[0]:.2f} days")
Predicted ice duration in 2500 for Lake Monona: -15.21 days

Is this prediction reliable? Justify your answer.

Question 11

Based on the AIC values, which model would you select among the three models fitted above?

import pandas as pd
import statsmodels.formula.api as smf
import numpy as np
df_icecover = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/df_icecover.csv")
mod1 = smf.ols("ice_duration ~ year", data=df_icecover).fit()
mod2 = smf.ols("ice_duration ~ year + lakeid", data=df_icecover).fit()
mod3 = smf.ols("ice_duration ~ year + lakeid + year:lakeid", data=df_icecover).fit()
aic_values = pd.DataFrame({
    'Model': ['mod1', 'mod2', 'mod3'],
    'AIC': [mod1.aic, mod2.aic, mod3.aic]
})
print(aic_values)
  Model          AIC
0  mod1  2817.758574
1  mod2  2819.400099
2  mod3  2821.233773

Question 12

An expert says that the data are actually not independent because the ice duration in a given year might be correlated with the ice duration in the previous year. Is this a problem to perform inference with the models fitted above?