Example reading data: estimation

In this exercise, we will replicate the analysis of the reading dataset presented in the lecture slides. Our first step is to load the data, which can be done as follows:

# Import data
import pandas as pd
reading = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/reading.csv")

We can now take a look at the data.

# Preview data
n = 5
print(reading.head(n))  # print the first n rows of the dataset
   id      score        age    group
0   1  59.988809  10.937378  Control
1   2  60.394867  10.261197  Control
2   3  70.023947  11.795009  Control
3   4  41.477072   6.471665  Control
4   5  29.154968   6.321822  Control
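Beyond head(), it is often useful to check the dimensions, column types, and group sizes before any analysis. A minimal sketch, shown here on a small illustrative DataFrame with the same columns as reading.csv (the values are made up for the example):

```python
import pandas as pd

# Illustrative stand-in for the real reading.csv
reading = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "score": [59.99, 60.39, 70.02, 41.48],
    "age": [10.94, 10.26, 11.80, 6.47],
    "group": ["Control", "Control", "Treatment", "Treatment"],
})

print(reading.shape)                    # (number of rows, number of columns)
print(reading.dtypes)                   # type of each column
print(reading["group"].value_counts())  # observations per group
```

On the real dataset, value_counts() confirms how many children fall in each group before we compare them.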

Comparing groups

To compare the two groups (i.e., Control and Treatment), we can start by comparing their empirical distributions using boxplots, similar to the graph presented in slide 2. For this purpose, we will construct two new variables, called control and treatment, which contain the scores of the respective groups. This can be done as follows:

control = reading['score'][reading['group'] == 'Control']
treatment = reading['score'][reading['group'] == 'Treatment']
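This boolean filtering splits the score column into two Series, one per group. A quick sanity check is that the two pieces add up to the full sample; a sketch on illustrative data (values made up for the example):

```python
import pandas as pd

# Illustrative stand-in for the real reading.csv
reading = pd.DataFrame({
    "score": [60.0, 55.0, 70.0, 72.0],
    "group": ["Control", "Control", "Treatment", "Treatment"],
})

# Boolean filtering as in the exercise
control = reading["score"][reading["group"] == "Control"]
treatment = reading["score"][reading["group"] == "Treatment"]

print(len(control), len(treatment))        # sizes of the two groups
print(control.mean(), treatment.mean())    # group means
```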

Representing the distribution of the score between groups

Using these variables, we can easily represent the distribution of the score within each group as follows:

import matplotlib.pyplot as plt

# Create the boxplot
plt.figure(figsize=(6, 5))
plt.boxplot([control, treatment], tick_labels=['Control', 'Treatment']);

# Add labels and title
plt.ylabel('Score')
plt.title('Comparison of Reading Scores by Group')

# Show the plot
plt.show()
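The quantities a boxplot displays (quartiles, median, minimum, maximum) can also be obtained numerically with groupby and describe. A sketch on illustrative data (values made up for the example):

```python
import pandas as pd

# Illustrative stand-in for the real reading.csv
reading = pd.DataFrame({
    "score": [60.0, 55.0, 70.0, 45.0, 72.0, 68.0],
    "group": ["Control", "Control", "Control",
              "Treatment", "Treatment", "Treatment"],
})

# Per-group count, mean, quartiles, min and max
print(reading.groupby("group")["score"].describe())
```

On the real dataset, this table gives the exact numbers behind the boxes and whiskers in the plot above.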

Modelling

As suggested in slide 9, a possible model is given by: \[ \text{Score}_i = \beta_0 + \beta_1\,\text{Group}_i + \beta_2\,\text{Age}_i + \epsilon_i, \quad \epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2) \] We will refer to this model as “Model 1”. The parameters of this first model can be estimated using the function smf.ols as follows:

import statsmodels.formula.api as smf

# Fit the linear model
mod1 = smf.ols('score ~ group + age', data=reading).fit()
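Note that the formula interface dummy-codes the categorical group variable automatically: the first level in alphabetical order (Control) becomes the reference category, which is why the summary reports a coefficient labelled group[T.Treatment]. The same coding can be inspected directly with pandas:

```python
import pandas as pd

# Treatment (dummy) coding of a categorical variable, as the formula
# interface does internally: Control is the reference level, so only
# a Treatment indicator column remains
g = pd.Series(["Control", "Treatment", "Control"], name="group")
dummies = pd.get_dummies(g, drop_first=True)
print(dummies)
```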

The summary of the fitted model can then be printed with:

# Display the summary
print(mod1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  score   R-squared:                       0.859
Model:                            OLS   Adj. R-squared:                  0.854
Method:                 Least Squares   F-statistic:                     173.0
Date:                Wed, 12 Nov 2025   Prob (F-statistic):           6.16e-25
Time:                        11:11:21   Log-Likelihood:                -173.13
No. Observations:                  60   AIC:                             352.3
Df Residuals:                      57   BIC:                             358.6
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept             -7.8639      3.324     -2.366      0.021     -14.520      -1.208
group[T.Treatment]     6.3771      1.393      4.578      0.000       3.587       9.167
age                    6.6457      0.370     17.985      0.000       5.906       7.386
==============================================================================
Omnibus:                        0.633   Durbin-Watson:                   2.068
Prob(Omnibus):                  0.729   Jarque-Bera (JB):                0.386
Skew:                           0.196   Prob(JB):                        0.824
Kurtosis:                       3.013   Cond. No.                         50.7
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
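Individual quantities from this summary (coefficients, confidence intervals, R-squared) are also available as attributes of the fitted model, which is convenient for further computations. A minimal sketch, fitted here on a small synthetic dataset with the same structure so it runs standalone (the values are made up for the example):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the real reading.csv
df = pd.DataFrame({
    "score": [40.0, 55.0, 62.0, 48.0, 66.0, 75.0],
    "age": [6.0, 9.0, 10.0, 6.5, 9.5, 11.0],
    "group": ["Control"] * 3 + ["Treatment"] * 3,
})

mod = smf.ols("score ~ group + age", data=df).fit()

print(mod.params)      # estimated coefficients (Intercept, group, age)
print(mod.conf_int())  # 95% confidence intervals
print(mod.rsquared)    # coefficient of determination
```

On the real dataset, mod1.params and mod1.conf_int() reproduce the coef and [0.025, 0.975] columns of the table above.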