
Exercises ice duration data
In this exercise we will model lake ice duration data using linear regression models. According to the United States Environmental Protection Agency, lake ice duration can be an indicator of climate change. This is because lake ice is dependent on several environmental factors, so changes in these factors will influence the formation of ice on top of lakes. As a result, the study and analysis of lake ice formation can inform scientists about how quickly the climate is changing, and are critical to minimizing disruptions to lake ecosystems. We can examine the ice duration of Lake Mendota and Lake Monona in the are of Madison, WI. The dataset is extracted from the ntl_icecover dataset of the R package lterdatasampler.
You can load the data as follows:
import pandas as pd
import matplotlib.pyplot as plt
df_icecover = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/df_icecover.csv")
df_icecover.head() lakeid ice_on ice_off ice_duration year
0 Lake Mendota 1855-12-18 1856-04-14 118 1855
1 Lake Mendota 1856-12-06 1857-05-06 151 1856
2 Lake Mendota 1857-11-25 1858-03-26 121 1857
3 Lake Mendota 1858-12-08 1859-03-14 96 1858
4 Lake Mendota 1859-12-07 1860-03-26 110 1859
Question 1
Complete the code below to visualize the ice duration over the years for both lakes.
Select the observations corresponding to each lake using boolean indexing and plot the data using matplotlib.
import pandas as pd
import matplotlib.pyplot as plt
df_icecover = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/df_icecover.csv")
# Filter lakes
df_lake_mendota = df_icecover[df_icecover['lakeid'] == "Lake Mendota"]
df_lake_monona = df_icecover[df_icecover['lakeid'] == "Lake Monona"]
# Colors
mycol = ["#6a5acd", "#e64173"]
# Plot
plt.figure(figsize=(8,5))
plt.plot(df_lake_mendota['year'], df_lake_mendota['ice_duration'], 'o', color=mycol[0], label='Mendota')
plt.plot(df_lake_monona['year'], df_lake_monona['ice_duration'], 'o', color=mycol[1], label='Monona')
plt.xlabel("Year")
plt.ylabel("Ice duration (days)")
plt.title("Ice duration over years")
plt.legend(loc="upper right")
plt.xticks(rotation=45)
plt.grid(True)
plt.show()Question 2
One of your colleague proposes to consider the following models to analyze the ice duration data: \[ \text{IceCover}_i = \beta_0 + \beta_1 \text{Year}_i + \epsilon_i \] He estimates the model and obtain the following results:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import numpy as np
df_icecover = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/df_icecover.csv")
mod1 = smf.ols("ice_duration ~ year", data=df_icecover).fit()
print(mod1.summary()) OLS Regression Results
==============================================================================
Dep. Variable: ice_duration R-squared: 0.246
Model: OLS Adj. R-squared: 0.244
Method: Least Squares F-statistic: 107.2
Date: Wed, 12 Nov 2025 Prob (F-statistic): 6.14e-22
Time: 11:11:16 Log-Likelihood: -1406.9
No. Observations: 331 AIC: 2818.
Df Residuals: 329 BIC: 2825.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 495.4410 37.926 13.063 0.000 420.834 570.048
year -0.2027 0.020 -10.356 0.000 -0.241 -0.164
==============================================================================
Omnibus: 25.409 Durbin-Watson: 2.150
Prob(Omnibus): 0.000 Jarque-Bera (JB): 32.243
Skew: -0.597 Prob(JB): 9.97e-08
Kurtosis: 3.955 Cond. No. 7.85e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.85e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
He wants to test if the ice duration is reducing over the years, regardless of the lake. What are the correct hyptheses to test this claim?
Question 3
How do you interpret the coefficent associated with \(\beta_0\) in mod1?
Question 4
A second colleague of yours proposes to consider the following model to analyze the ice duration data: \[ \text{IceCover}_i = \beta_0 + \beta_1 \text{Year}_i + \beta_2 \text{LakeID}_i + \epsilon_i \] More precisely, she wants to test if the ice duration is reducing over the years, while including an intercept for each lake.
She estimates the model and obtain the following results:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import numpy as np
df_icecover = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/df_icecover.csv")
mod2 = smf.ols("ice_duration ~ year + lakeid", data=df_icecover).fit()
print(mod2.summary()) OLS Regression Results
==============================================================================
Dep. Variable: ice_duration R-squared: 0.247
Model: OLS Adj. R-squared: 0.242
Method: Least Squares F-statistic: 53.69
Date: Wed, 12 Nov 2025 Prob (F-statistic): 6.74e-21
Time: 11:11:16 Log-Likelihood: -1406.7
No. Observations: 331 AIC: 2819.
Df Residuals: 328 BIC: 2831.
Df Model: 2
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
Intercept 494.7583 37.980 13.027 0.000 420.043 569.474
lakeid[T.Lake Monona] 1.1167 1.873 0.596 0.551 -2.568 4.801
year -0.2027 0.020 -10.342 0.000 -0.241 -0.164
==============================================================================
Omnibus: 25.159 Durbin-Watson: 2.152
Prob(Omnibus): 0.000 Jarque-Bera (JB): 31.751
Skew: -0.595 Prob(JB): 1.27e-07
Kurtosis: 3.941 Cond. No. 7.86e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.86e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Complete the following code to plot the data points and the regression lines for both lakes.
Use the method predict of the fitted model to obtain the predicted values for each lake over a range of years, and plot these predictions as lines on top of the scatter plot of the data points.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import numpy as np
# Load data
df_icecover = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/df_icecover.csv")
# Fit linear model with lake-specific intercepts
mod2 = smf.ols("ice_duration ~ year + lakeid", data=df_icecover).fit()
print(mod2.summary())
# Separate lakes
df_mendota = df_icecover[df_icecover['lakeid'] == "Lake Mendota"]
df_monona = df_icecover[df_icecover['lakeid'] == "Lake Monona"]
# Colors
colors = ["#6a5acd", "#e64173"]
# Plot data points
plt.figure(figsize=(8,5))
plt.scatter(df_mendota['year'], df_mendota['ice_duration'], color=colors[0], label="Mendota")
plt.scatter(df_monona['year'], df_monona['ice_duration'], color=colors[1], label="Monona")
# Prepare years for predictions
years = np.linspace(df_icecover['year'].min(), df_icecover['year'].max(), 200)
# Predict for Mendota
pred_mendota = mod2.predict(pd.DataFrame({'year': years, 'lakeid': 'Lake Mendota'}))
plt.plot(years, pred_mendota, color=colors[0], linestyle='-', linewidth=2)
# Predict for Monona
pred_monona = mod2.predict(pd.DataFrame({'year': years, 'lakeid': 'Lake Monona'}))
plt.plot(years, pred_monona, color=colors[1], linestyle='-', linewidth=2)
# Labels, legend, title
plt.xlabel("Year")
plt.ylabel("Ice duration (days)")
plt.title("Ice duration over years with regression lines")
plt.legend()
plt.grid(True)
plt.show()Question 5
A third colleague of yours proposes to consider the following model to analyze the ice duration data: \[
\text{IceCover}_i = \beta_0 + \beta_1 \text{Year}_i + \beta_2 \text{LakeID}_i + \beta_3 \text{Year}_i \times \text{LakeID}_i + \epsilon_i
\] What is the correct formula to specify this model in statsmodels?
Question 6
She estimates the model and obtain the following results:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import numpy as np
df_icecover = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/df_icecover.csv")
mod3 = smf.ols("ice_duration ~ year + lakeid + year:lakeid", data=df_icecover).fit()
print(mod3.summary()) OLS Regression Results
==============================================================================
Dep. Variable: ice_duration R-squared: 0.247
Model: OLS Adj. R-squared: 0.240
Method: Least Squares F-statistic: 35.76
Date: Wed, 12 Nov 2025 Prob (F-statistic): 5.18e-20
Time: 11:11:16 Log-Likelihood: -1406.6
No. Observations: 331 AIC: 2821.
Df Residuals: 327 BIC: 2836.
Df Model: 3
Covariance Type: nonrobust
==============================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------
Intercept 479.2000 54.027 8.870 0.000 372.915 585.485
lakeid[T.Lake Monona] 31.9296 76.027 0.420 0.675 -117.635 181.494
year -0.1946 0.028 -6.980 0.000 -0.249 -0.140
year:lakeid[T.Lake Monona] -0.0159 0.039 -0.405 0.685 -0.093 0.061
==============================================================================
Omnibus: 25.821 Durbin-Watson: 2.154
Prob(Omnibus): 0.000 Jarque-Bera (JB): 32.984
Skew: -0.601 Prob(JB): 6.88e-08
Kurtosis: 3.972 Cond. No. 2.06e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.06e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
This colleague wants to test whether the ice duration is evolving at different rates for the two lakes. What would be the appropriate hypotheses for this test?
Question 7
What is the p-value associated with this test according to the model summary above?
Question 8
What can you conclude regarding the difference in ice duration between the two lakes (considering \(\alpha=0.05\))?
Question 9
Complete the following code to plot the data points and the regression lines for both lakes according to the third model.
Use the method predict of the fitted model to obtain the predicted values for each lake over a range of years, and plot these predictions as lines on top of the scatter plot of the data points.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import numpy as np
# Load data
df_icecover = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/df_icecover.csv")
# Fit linear model with lake-specific intercepts
mod3 = smf.ols("ice_duration ~ year + lakeid + year:lakeid", data=df_icecover).fit()
print(mod3.summary())
# Separate lakes
df_mendota = df_icecover[df_icecover['lakeid'] == "Lake Mendota"]
df_monona = df_icecover[df_icecover['lakeid'] == "Lake Monona"]
# Colors
colors = ["#6a5acd", "#e64173"]
# Plot data points
plt.figure(figsize=(8,5))
plt.scatter(df_mendota['year'], df_mendota['ice_duration'], color=colors[0], label="Mendota")
plt.scatter(df_monona['year'], df_monona['ice_duration'], color=colors[1], label="Monona")
# Prepare years for predictions
years = np.linspace(df_icecover['year'].min(), df_icecover['year'].max(), 200)
# Predict for Mendota
pred_mendota = mod3.predict(pd.DataFrame({'year': years, 'lakeid': 'Lake Mendota'}))
plt.plot(years, pred_mendota, color=colors[0], linestyle='-', linewidth=2)
# Predict for Monona
pred_monona = mod3.predict(pd.DataFrame({'year': years, 'lakeid': 'Lake Monona'}))
plt.plot(years, pred_monona, color=colors[1], linestyle='-', linewidth=2)
# Labels, legend, title
plt.xlabel("Year")
plt.ylabel("Ice duration (days)")
plt.title("Ice duration over years with regression lines")
plt.legend()
plt.grid(True)
plt.show()Question 10
A climatologist uses mod3 to predict the ice duration in the year \(2500\) for both lakes and obtains the following results:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import numpy as np
# Load data
df_icecover = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/df_icecover.csv")
# Fit linear model with lake-specific intercepts and trend
mod3 = smf.ols("ice_duration ~ year + lakeid + year:lakeid", data=df_icecover).fit()
# Define levels for the categorical variable
lake_levels = df_icecover['lakeid'].astype('category').cat.categories
# Predict for Mendota
pred_mendota = mod3.predict(pd.DataFrame({
'year': [2500],
'lakeid': pd.Categorical(['Lake Mendota'], categories=lake_levels)
}))
# Predict for Monona
pred_monona = mod3.predict(pd.DataFrame({
'year': [2500],
'lakeid': pd.Categorical(['Lake Monona'], categories=lake_levels)
}))
print(f"Predicted ice duration in 2500 for Lake Mendota: {pred_mendota.values[0]:.2f} days")Predicted ice duration in 2500 for Lake Mendota: -7.37 days
print(f"Predicted ice duration in 2500 for Lake Monona: {pred_monona.values[0]:.2f} days")Predicted ice duration in 2500 for Lake Monona: -15.21 days
Is this prediction reliable? Justify your answer.
Question 11
Based on the AIC values, which model would you select among the three models fitted above?
import pandas as pd
import statsmodels.formula.api as smf
import numpy as np
df_icecover = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/df_icecover.csv")
mod1 = smf.ols("ice_duration ~ year", data=df_icecover).fit()
mod2 = smf.ols("ice_duration ~ year + lakeid", data=df_icecover).fit()
mod3 = smf.ols("ice_duration ~ year + lakeid + year:lakeid", data=df_icecover).fit()
aic_values = pd.DataFrame({
'Model': ['mod1', 'mod2', 'mod3'],
'AIC': [mod1.aic, mod2.aic, mod3.aic]
})
print(aic_values) Model AIC
0 mod1 2817.758574
1 mod2 2819.400099
2 mod3 2821.233773
Question 12
An expert says that the data are actually not independent because the ice duration in a given year might be correlated with the ice duration in the previous year. Is this a problem to perform inference with the models fitted above?