Example reading data: improving our model

We have seen in the previous section that our model didn’t appear suitable for our data. In the lecture slides, we propose the following alternative model:

\[ \text{Score}_i = \beta_0 + \beta_1\,\text{Group}_i + \beta_2\,\text{Age}_i + \beta_3\,\text{Group}_i\,\text{Age}_i + \epsilon_i, \quad \epsilon_i \stackrel{iid}{\sim} \mathcal{N}\left(0, \sigma^2\right) \]

The parameters of this second model can be estimated as follows:

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import numpy as np
import scipy.stats as stats

reading = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/reading.csv")

# Fit the linear model with interaction: score ~ group*age
mod2 = smf.ols('score ~ group*age', data=reading).fit()
print(mod2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  score   R-squared:                       0.910
Model:                            OLS   Adj. R-squared:                  0.906
Method:                 Least Squares   F-statistic:                     189.6
Date:                Wed, 12 Nov 2025   Prob (F-statistic):           2.70e-29
Time:                        11:11:34   Log-Likelihood:                -159.45
No. Observations:                  60   AIC:                             326.9
Df Residuals:                      56   BIC:                             335.3
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                 -3.2489      2.790     -1.164      0.249      -8.839       2.341
group[T.Treatment]       -36.0307      7.539     -4.779      0.000     -51.134     -20.928
age                        6.1207      0.311     19.693      0.000       5.498       6.743
group[T.Treatment]:age     5.9539      1.047      5.688      0.000       3.857       8.051
==============================================================================
Omnibus:                        0.210   Durbin-Watson:                   2.013
Prob(Omnibus):                  0.901   Jarque-Bera (JB):                0.005
Skew:                           0.013   Prob(JB):                        0.998
Kurtosis:                       3.036   Cond. No.                         145.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
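A note on the formula: `group*age` is shorthand that expands to `group + age + group:age`, i.e. both main effects plus their interaction, which is why the summary shows four coefficients. As a minimal sketch on synthetic data (not the reading dataset), we can confirm that both spellings produce the same fit, and that individual coefficients can be extracted programmatically via `params`:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the reading data (illustration only)
rng = np.random.default_rng(0)
n = 40
df = pd.DataFrame({
    "age": rng.uniform(6, 12, n),
    "group": rng.choice(["Control", "Treatment"], n),
})
df["score"] = 5 * df["age"] + rng.normal(0, 2, n)

# 'group*age' expands to main effects plus the interaction
m_star = smf.ols("score ~ group*age", data=df).fit()
m_full = smf.ols("score ~ group + age + group:age", data=df).fit()

# Same four coefficients under both spellings
print(list(m_star.params.index))
print(np.allclose(m_star.params.values, m_full.params.values))  # True
```

The coefficient names in `params` match the rows of the summary table, so e.g. `m_star.params["age"]` returns the estimated slope for the reference group.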

To assess its adequacy, we compare the predictions and data as follows:

age_to_predict = np.arange(6, 13)
data_control = pd.DataFrame({"age": age_to_predict, "group": "Control"})
data_treatment = pd.DataFrame({"age": age_to_predict, "group": "Treatment"})

prediction_control = mod2.predict(data_control)
prediction_treatment = mod2.predict(data_treatment)

# Define custom colors
colors = {
    'Control': '#c4bedf',    # soft purple-blue
    'Treatment': '#f6aed5'   # soft pink
}

# Map colors to groups
group_col = reading["group"].map(colors)

plt.figure(figsize=(7, 5))
plt.scatter(reading["age"], reading["score"], c=group_col, label=None)
plt.plot(age_to_predict, prediction_control, color=colors['Control'], label="Control")
plt.plot(age_to_predict, prediction_treatment, color=colors['Treatment'], label="Treatment")
plt.xlabel("Age")
plt.ylabel("Score")
plt.legend()
plt.title("Model 2 Predictions vs Data")
plt.show()
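Beyond point predictions, `statsmodels` can also return confidence intervals for the predicted mean via `get_prediction`, which is useful for drawing uncertainty bands around the fitted lines. A minimal sketch on synthetic data (the `df` below is illustrative, not the reading data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 50
df = pd.DataFrame({
    "age": rng.uniform(6, 12, n),
    "group": rng.choice(["Control", "Treatment"], n),
})
df["score"] = 6 * df["age"] + rng.normal(0, 3, n)
m = smf.ols("score ~ group*age", data=df).fit()

new = pd.DataFrame({"age": np.arange(6, 13), "group": "Control"})
# 'mean' is the point prediction; 'mean_ci_lower'/'mean_ci_upper' bound it
pred = m.get_prediction(new).summary_frame(alpha=0.05)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper"]])
```

These bounds could be passed to `plt.fill_between` to shade a 95% confidence band around each group's line.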

Based on this graph, our second model appears to be more adequate than our first model. We can also compare the AIC of these two models as follows:

#  Compare AICs of Model 1 and Model 2
print("AIC(mod1):", round(mod1.aic,2))
AIC(mod1): 352.27
print("AIC(mod2):", round(mod2.aic,2))
AIC(mod2): 326.91
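For reference, the `aic` attribute for an OLS fit in `statsmodels` is computed as \(-2\log L + 2k\), where \(k\) counts the estimated regression coefficients (`df_model` plus the intercept, not \(\sigma^2\)). A quick sketch on synthetic data verifying this against the reported value:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({"x": rng.normal(size=60)})
df["y"] = 2 * df["x"] + rng.normal(size=60)
m = smf.ols("y ~ x", data=df).fit()

# statsmodels: AIC = -2*logL + 2*k, with k = df_model + 1
k = m.df_model + 1
aic_manual = -2 * m.llf + 2 * k
print(np.isclose(aic_manual, m.aic))  # True
```

This matches the summary output above: with a log-likelihood of -159.45 and four coefficients, \(2 \times 4 + 2 \times 159.45 \approx 326.9\).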

This comparison also suggests that the second model is more adequate. As we did previously, we can construct the model diagnostic graphs as follows:

#  residuals vs fitted values for Model 2
colors = {
    'Control': '#c4bedf',
    'Treatment': '#f6aed5'
}
group_col = reading["group"].map(colors)

plt.figure(figsize=(6, 5))
plt.scatter(mod2.fittedvalues, mod2.resid, c=group_col)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs Fitted (Model 2)")
plt.show()

#  Q–Q plot for residuals (Model 2) 
plt.figure(figsize=(6, 5))
(osm, osr), (slope, intercept, r) = stats.probplot(mod2.resid, dist="norm")
# Black dots
plt.scatter(osm, osr, color='black', s=20)

# Black reference line
plt.plot(osm, slope * osm + intercept, color='black', linewidth=1.5)
plt.title("Normal Q–Q Plot of Residuals")
plt.xlabel("Theoretical Quantiles")
plt.ylabel("Sample Quantiles")
plt.show()
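The visual Q-Q check can be complemented with a formal normality test on the residuals, for example the Shapiro-Wilk test from `scipy.stats`; a large p-value means there is no evidence against the normality assumption. A sketch using synthetic residuals as a stand-in for `mod2.resid`:

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(3)
resid = rng.normal(0, 1, 60)  # stand-in for mod2.resid (illustration only)

# Shapiro-Wilk: H0 is that the sample comes from a normal distribution
stat, pvalue = stats.shapiro(resid)
print(f"W = {stat:.3f}, p = {pvalue:.3f}")
```

With the actual model, `stats.shapiro(mod2.resid)` would be used instead; given the near-zero skew and kurtosis close to 3 in the summary above, a large p-value would be expected.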

Both graphs suggest that the model is a reasonable approximation for the data we are considering. However, in the lecture slides we consider a third model, which is given by:

\[ \text{Score}_i = \beta_0 + \beta_1\left(\text{Age}_i - 6\right) + \beta_2\,\text{Group}_i\left(\text{Age}_i - 6\right) + \epsilon_i, \quad \epsilon_i \stackrel{iid}{\sim} \mathcal{N}\left(0, \sigma^2\right) \]

To consider this model we first need to construct the variable “Age - 6”, which can be done as follows:

# Create the variable age_minus_6
reading["age_minus_6"] = reading["age"] - 6
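Shifting a predictor like this changes only the interpretation of the intercept (it becomes the expected score at age 6, the youngest age in the design), not the slope. A quick sketch with synthetic data confirming this invariance:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({"age": rng.uniform(6, 12, 50)})
df["score"] = 3 + 6 * df["age"] + rng.normal(0, 2, 50)
df["age_minus_6"] = df["age"] - 6

m_raw = smf.ols("score ~ age", data=df).fit()
m_shift = smf.ols("score ~ age_minus_6", data=df).fit()

# Slopes agree; the shifted intercept is the fitted value at age = 6
print(np.isclose(m_raw.params["age"], m_shift.params["age_minus_6"]))  # True
print(np.isclose(m_shift.params["Intercept"],
                 m_raw.params["Intercept"] + 6 * m_raw.params["age"]))  # True
```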

Then, the parameters of this third model can be estimated as follows:

# Fit the model
mod3 = smf.ols("score ~ age_minus_6 + group:age_minus_6", data=reading).fit()
print(mod3.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  score   R-squared:                       0.910
Model:                            OLS   Adj. R-squared:                  0.907
Method:                 Least Squares   F-statistic:                     289.2
Date:                Wed, 12 Nov 2025   Prob (F-statistic):           1.42e-30
Time:                        11:11:35   Log-Likelihood:                -159.47
No. Observations:                  60   AIC:                             324.9
Df Residuals:                      57   BIC:                             331.2
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                         33.3504      0.790     42.192      0.000      31.768      34.933
age_minus_6                        6.1522      0.260     23.625      0.000       5.631       6.674
group[T.Treatment]:age_minus_6     5.8103      0.716      8.119      0.000       4.377       7.243
==============================================================================
Omnibus:                        0.192   Durbin-Watson:                   2.013
Prob(Omnibus):                  0.909   Jarque-Bera (JB):                0.008
Skew:                           0.027   Prob(JB):                        0.996
Kurtosis:                       3.020   Cond. No.                         5.84
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

To assess the model adequacy, we compare the predictions and data as follows (note that this model is based on age_minus_6 and not age):

#  Compare predictions and data 
age_to_predict = np.arange(0, 7)
data_control = pd.DataFrame({
    "age_minus_6": age_to_predict,
    "group": "Control"
})
data_treatment = pd.DataFrame({
    "age_minus_6": age_to_predict,
    "group": "Treatment"
})

prediction_control = mod3.predict(data_control)
prediction_treatment = mod3.predict(data_treatment)

plt.figure(figsize=(7, 5))
plt.scatter(reading["age"], reading["score"], c=group_col, label=None)
plt.plot(age_to_predict + 6, prediction_control, color=colors['Control'], label="Control")
plt.plot(age_to_predict + 6, prediction_treatment, color=colors['Treatment'], label="Treatment")
plt.xlabel("Age")
plt.ylabel("Score")
plt.legend()
plt.title("Model 3 Predictions vs Data")
plt.show()

Based on this graph, our third model appears to be a better fit compared to the second model. We can also compare the AIC of the models we have considered as follows:

# Compare AICs of the three models
print("AIC(mod1):", round(mod1.aic,2))
AIC(mod1): 352.27
print("AIC(mod2):", round(mod2.aic,2))
AIC(mod2): 326.91
print("AIC(mod3):", round(mod3.aic,2))
AIC(mod3): 324.95
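Since the third model is the second model with one linear restriction imposed (expanding model 3 in terms of Age shows the group main effect is tied to the interaction so that the two groups coincide at age 6), the pair can also be compared with a likelihood-ratio test on one degree of freedom. A sketch on synthetic data (not the reading dataset):

```python
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 60
df = pd.DataFrame({
    "age": rng.uniform(6, 12, n),
    "group": rng.choice(["Control", "Treatment"], n),
})
df["age_minus_6"] = df["age"] - 6
slope = np.where(df["group"] == "Treatment", 12.0, 6.0)
df["score"] = 33 + slope * df["age_minus_6"] + rng.normal(0, 3, n)

full = smf.ols("score ~ group*age", data=df).fit()
restricted = smf.ols("score ~ age_minus_6 + group:age_minus_6", data=df).fit()

# LR statistic: twice the log-likelihood gap; one restriction => 1 df
lr = 2 * (full.llf - restricted.llf)
pvalue = stats.chi2.sf(lr, df=1)
print(f"LR = {lr:.3f}, p = {pvalue:.3f}")
```

A large p-value would indicate the restricted (third) model is not significantly worse than the full (second) model, in line with the AIC comparison above.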

This approach suggests that the third model is the most adequate among all three models we have considered. Finally, we can construct the model diagnostic graphs as follows:

#  Diagnostic plots for Model 3 
plt.figure(figsize=(6, 5))
plt.scatter(mod3.fittedvalues, mod3.resid, c=group_col)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs Fitted (Model 3)")
plt.show()

#  QQ plot for residuals of Model 3 
plt.figure(figsize=(6, 5))
(osm, osr), (slope, intercept, r) = stats.probplot(mod3.resid, dist="norm")
# Black dots
plt.scatter(osm, osr, color='black', s=20)

# Black reference line
plt.plot(osm, slope * osm + intercept, color='black', linewidth=1.5)
plt.title("Normal Q–Q Plot of Residuals")
plt.xlabel("Theoretical Quantiles")
plt.ylabel("Sample Quantiles")
plt.show()