Earth Sciences & Pharmaceutical Sciences, University of Geneva 🇨🇭
Statistical methods rest on several fundamental concepts, the most central of which is to regard the available information (the data) as the outcome of a random process.
As such, the data represent a random sample from a population that is either actually or only conceptually accessible.
Statistical inference then allows us to infer the properties of this population from the observed sample. This includes deriving estimates and testing hypotheses.
Source: luminousmen
\[H_0: \theta\hspace{0.1cm} {\color{#eb078e}{\in}}\hspace{0.1cm}\Theta_0 \quad \text{and} \quad H_a: \theta\hspace{0.1cm} {\color{#eb078e}{\not\in}}\hspace{0.1cm}\Theta_0\]
\[H_0: \mu_{\text{drug}}\hspace{0.1cm} {\color{#eb078e}{=}}\hspace{0.1cm}\mu_{\text{control}} \ \text{ and } \ H_a: \mu_{\text{drug}}\hspace{0.1cm} {\color{#eb078e}{<}}\hspace{0.1cm}\mu_{\text{control}}.\]
| Outcome | \(H_0\) is true | \(H_0\) is false |
|---|---|---|
| Can’t reject \(H_0\) | ✅ Correct decision (prob=\(1-\alpha\)) | ⚠️ Type II error (prob=\(\beta\)) |
| Reject \(H_0\) | ⚠️ Type I error (prob=\(\alpha\)) | ✅ Correct decision (prob=\(1-\beta\)) |
In statistics, the p-value is defined as the probability of observing a test statistic that is at least as extreme as actually observed, assuming that the null hypothesis \(H_0\) is true.
Informally, a p-value can be understood as a measure of plausibility of the null hypothesis given the data. A small p-value indicates strong evidence against \(H_0\). Note, however, that the p-value is not the probability that \(H_0\) is true.
When the p-value is small enough (i.e., smaller than the significance level \(\alpha\)), the test based on the null and alternative hypotheses is considered significant, meaning we reject the null hypothesis in favor of the alternative. This is generally what we want because it “verifies” our (research) hypothesis.
When the p-value is not small enough, with the available data, we cannot reject the null hypothesis, so nothing can be concluded. 🤔
The obtained p-value summarizes the incompatibility between the data and the model constructed under the set of assumptions.
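As a concrete (hypothetical) illustration, suppose a test yields an observed standardized statistic \(z_{\text{obs}} = 1.96\) whose distribution under \(H_0\) is standard normal; the two-sided p-value is then the probability of a value at least as extreme in either tail. A minimal sketch (the value 1.96 is assumed for illustration):

```python
# Two-sided p-value for an observed z statistic (hypothetical value)
from scipy import stats

z_obs = 1.96  # observed standardized test statistic (assumed)
# P(|Z| >= |z_obs|) under H_0, i.e. both tails of the N(0, 1) distribution
p_value = 2 * stats.norm.sf(abs(z_obs))
print(round(p_value, 4))
```

With the conventional \(\alpha = 0.05\), this p-value sits right at the significance threshold.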
“Absence of evidence is not evidence of absence.” 👋
👋 From the British Medical Journal.
👋 If you want to know more, have a look here.
P-values have been misused many times because understanding what they mean is not intuitive!
👋 If you want to know more, have a look here.
\[Y\sim N(\mu,\sigma^{2}), \ \ \ \ \ f_{Y}(y) = \frac{1}{\sqrt{2\pi{\color{drawColor6} \sigma}^{2}}}\ e^{-\frac{(y-{\color{drawColor6} \mu})^{2}}{2{\color{drawColor6} \sigma}^{2}}}\]
\[\mathbb{E}[Y] = \mu, \ \ \ \ \ \ \ \ \ \ \ \ \text{Var}[Y] = \sigma^{2},\]
\[Z = \frac{Y-\mu}{\sigma} \sim N(0,1), \ \ \ \ \ f_{Z}(z) = \frac{1}{\sqrt{2\pi}}\ e^{-\frac{z^{2}}{2}}.\]
Probability density function of a normal distribution:
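The density and the standardization above can be checked numerically; a small sketch with assumed values \(\mu = 2\), \(\sigma = 1.5\), \(y = 3\):

```python
# Evaluate the N(mu, sigma^2) density and standardize (assumed mu, sigma, y)
from scipy import stats

mu, sigma = 2.0, 1.5
y = 3.0
f_y = stats.norm.pdf(y, loc=mu, scale=sigma)  # f_Y(y)
z = (y - mu) / sigma                          # standardized value
# consistency check: f_Y(y) = f_Z(z) / sigma
print(f_y, z, stats.norm.pdf(z) / sigma)
```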
In practice, we often encounter problems where our goal is to compare the means (or locations) of two samples. For example,
A scientist is interested in comparing the vaccine efficacy of the Pfizer-BioNTech and the Moderna vaccine.
A bank wants to know which of two proposed plans will most increase the use of its credit cards.
A psychologist wants to compare male and female college students’ impression on a selected webpage.
We will discuss three two-sample location tests:
Two independent sample Student’s t-test
Two independent sample Welch’s t-test
Two independent sample Mann-Whitney-Wilcoxon test
This test considers the following assumed model for groups A and B
\[X_{i(g)} = {\color{#eb078e}{\mu_{g}}}\hspace{0.1cm}+\hspace{0.1cm} \varepsilon_{i(g)} = \mu + {\color{#eb078e}{\delta_{g}}} \hspace{0.1cm}+\hspace{0.1cm} \varepsilon_{i(g)},\]
where \(g=A,B\), \(i=1,...,n_{g}\), \(\varepsilon_{i(g)} \overset{iid}{\sim} N(0,{\color{#eb078e}{\sigma^{2}}})\) and \(\sum n_{g}\delta_{g} =0\).
📝 \(\color{#6A5ACD}{n_A}\) \(=\) sample size of group A, \(\color{#6A5ACD}{\mu_{A} = \mu + \delta_A}\) \(=\) population mean of group A, \(\color{#06bcf1}{n_B}\) and \(\color{#06bcf1}{\mu_{B} = \mu + \delta_B}\) are similarly defined for group B.
Hypotheses:
\[H_0: {\color{#6A5ACD}{\mu_A}} - {\color{#06bcf1}{\mu_B}} {\color{#eb078e}{=}} \mu_0\ \ \ \ \text{and} \ \ \ \ H_a: {\color{#6A5ACD}{\mu_A}} - {\color{#06bcf1}{\mu_B}} \ \big[ {\color{#eb078e}{>}} \text{ or } {\color{#eb078e}{<}} \text{ or } {\color{#eb078e}{\neq}} \big] \ \mu_0.\]
Test statistic’s distribution under \(H_0\):
\[ \color{#b4b4b4}{T = \frac{(\overline{X}_{A}-\overline{X}_{B})-\mu_0}{s_{p}\sqrt{n_{A}^{-1}+n_{B}^{-1}}} \ {\underset{H_0}{\sim}} \ \text{Student}(n_{A}+n_{B}-2),} \] where \(\color{#b4b4b4}{s_{p}^2 = \frac{(n_A-1)s_A^2+(n_B-1)s_B^2}{n_A+n_B-2}}\) is the pooled sample variance.
Python function:
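No code was shown here; a minimal sketch using `scipy.stats.ttest_ind` on simulated data (the group means, spreads, and seed are assumptions for illustration; with `equal_var=True` this is the pooled-variance Student's t-test of \(\mu_A - \mu_B = 0\)):

```python
# Two independent sample Student's t-test (simulated data, mu_0 = 0)
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
xA = rng.normal(loc=5.0, scale=2.0, size=25)  # group A
xB = rng.normal(loc=6.0, scale=2.0, size=25)  # group B

# equal_var=True -> pooled variance (Student's t-test);
# alternative="two-sided" matches H_a: mu_A - mu_B != 0
res = stats.ttest_ind(xA, xB, equal_var=True, alternative="two-sided")
print(res.statistic, res.pvalue)
```

`alternative="less"` or `"greater"` would give the one-sided versions of \(H_a\).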
This test strongly relies on the assumed absence of outliers. If outliers appear to be present, the Mann-Whitney-Wilcoxon test (see later) is probably a better option.
For moderate and small sample sizes, the sample distribution should be at least approximately normal with no strong skewness to ensure the reliability of the test.
The assumption of equal variances is hard to verify in practice, so we recommend avoiding this test.
This test considers the following assumed model for groups A and B
\[X_{i(g)} = {\color{#eb078e}{\mu_{g}}}\hspace{0.1cm}+\hspace{0.1cm} \varepsilon_{i(g)} = \mu + {\color{#eb078e}{\delta_{g}}} \hspace{0.1cm}+\hspace{0.1cm} \varepsilon_{i(g)},\]
where \(g=A,B\), \(i=1,...,n_{g}\), \(\varepsilon_{i(g)} \overset{iid}{\sim} N(0,{\color{#eb078e}{\sigma_g^{2}}})\) and \(\sum n_{g}\delta_{g} =0\).
📝 \(\color{#6A5ACD}{n_A}\) \(=\) sample size of group A, \(\color{#6A5ACD}{\mu_{A} = \mu + \delta_A}\) \(=\) population mean of group A, \(\color{#06bcf1}{n_B}\) and \(\color{#06bcf1}{\mu_{B} = \mu + \delta_B}\) are similarly defined for group B.
Hypotheses:
\[H_0: {\color{#6A5ACD}{\mu_A}} \hspace{0.1cm}-\hspace{0.1cm} {\color{#06bcf1}{\mu_B}} {\color{#eb078e}{=}} \hspace{0.1cm}\mu_0\ \ \ \ \text{and} \ \ \ \ H_a: {\color{#6A5ACD}{\mu_A}}\hspace{0.1cm}-\hspace{0.1cm} {\color{#06bcf1}{\mu_B}} \ \hspace{0.1cm}\hspace{0.1cm} \big[ {\color{#eb078e}{>}}\hspace{0.1cm}\hspace{0.1cm} \text{ or } {\color{#eb078e}{<}}\hspace{0.1cm}\hspace{0.1cm} \text{ or } {\color{#eb078e}{\neq}}\big]\hspace{0.1cm}\mu_0.\]
Test statistic’s distribution under \(H_0\):
\[ \color{#b4b4b4}{T = \frac{(\overline{X}_{A}-\overline{X}_{B})-\mu_0}{\sqrt{s^2_A/n_{A} + s^2_B/n_{B}}} \ {\underset{H_0}{\sim}} \ \text{Student}(df).} \]
Python function:
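Again as a sketch on simulated data (all values assumed): the same SciPy function with `equal_var=False` gives Welch's t-test:

```python
# Two independent sample Welch's t-test (simulated, unequal variances)
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
xA = rng.normal(loc=5.0, scale=1.0, size=20)  # group A, small spread
xB = rng.normal(loc=6.0, scale=3.0, size=35)  # group B, large spread

res = stats.ttest_ind(xA, xB, equal_var=False)  # Welch's t-test
print(res.statistic, res.pvalue)
```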
This test strongly relies on the assumed absence of outliers. If outliers appear to be present, the Mann-Whitney-Wilcoxon test (see later) is probably a better option.
For moderate and small sample sizes, the sample distribution should be at least approximately normal with no strong skewness to ensure the reliability of the test.
This test does not require the variances of the two groups to be equal. If the variances of the two groups are the same (which is rather unlikely in practice), Welch’s t-test loses a little power compared to Student’s t-test.
The computation of \(df\) (i.e. the degrees of freedom of the distribution under the null) is beyond the scope of this class.
This test considers the following assumed model for groups A and B
\[X_{i(g)} = {\color{#eb078e}{\theta_{g}}} + \varepsilon_{i(g)} = \theta + {\color{#eb078e}{\delta_{g}}} + \varepsilon_{i(g)},\]
where \(g=A,B\), \(i=1,...,n_{g}\), the \(\varepsilon_{i(g)}\) are iid from a continuous distribution, and \(\sum n_{g}\delta_{g} =0\).
📝 \(\color{#6A5ACD}{n_A}\) \(=\) sample size of group A, \(\color{#6A5ACD}{\theta_{A} = \theta + \delta_A}\) \(=\) population location of group A, \(\color{#06bcf1}{n_B}\) and \(\color{#06bcf1}{\theta_{B} = \theta + \delta_B}\) are similarly defined for group B.
Hypotheses:
\[H_0: {\color{#6A5ACD}{\theta_A}} - {\color{#06bcf1}{\theta_B}} {\color{#eb078e}{=}} \theta_0\ \ \ \ \text{and} \ \ \ \ H_a: {\color{#6A5ACD}{\theta_A}} - {\color{#06bcf1}{\theta_B}} \ \big[ {\color{#eb078e}{>}} \text{ or } {\color{#eb078e}{<}} \text{ or } {\color{#eb078e}{\neq}} \big] \ \theta_0.\]
Test statistic’s distribution under \(H_0\):
\[ \color{#b4b4b4}{Z = \frac{\sum_{i=1}^{n_{B}}R_{i(B)}-[n_{B}(n_{A}+n_{B}+1)/2]}{\sqrt{n_{A}n_{B}(n_{A}+n_{B}+1)/12}},} \] where \(\color{#b4b4b4}{R_{i(g)}}\) denotes the global rank of the \(\color{#b4b4b4}{i}\)-th observation of group \(\color{#b4b4b4}{g}\).
Python function:
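A sketch using `scipy.stats.mannwhitneyu` on simulated data (values assumed; one gross outlier is included to illustrate the robustness point below):

```python
# Mann-Whitney-Wilcoxon test (simulated data with one gross outlier)
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
xA = np.append(rng.normal(loc=5.0, scale=1.0, size=20), 30.0)  # outlier
xB = rng.normal(loc=6.0, scale=1.0, size=20)

res = stats.mannwhitneyu(xA, xB, alternative="two-sided")
print(res.statistic, res.pvalue)
```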
This test is “robust” in the sense that (unlike the t-tests) it is not overly affected by outliers.
For the Mann–Whitney–Wilcoxon test to be comparable to the t-tests (i.e. testing for the mean), we need to assume symmetric distributions and equal variances.
Then, we have \(\color{#6A5ACD}{\theta_A = \mu_A}\) and \(\color{#06bcf1}{\theta_B = \mu_B}\).
Compared to the t-tests, the Mann–Whitney–Wilcoxon test is less powerful when the t-tests’ requirements (normality and, possibly, equal variances) are met.
The distribution of this test statistic under the null is complicated and can be obtained by different methods (e.g. exact, asymptotic normal, …).
The details are beyond the scope of this class.
# Import data
import pandas as pd
diet = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/diet.csv")
n = 5
print(diet.head(n))  # print the first n rows of the dataset
   id  gender  age  height diet.type  initial.weight  final.weight
0 1 Female 22 159 A 58 54.2
1 2 Female 46 192 A 60 54.0
2 3 Female 55 170 A 64 63.3
3 4 Female 33 171 A 64 61.1
4 5 Female 50 170 A 65 62.2
In practice, we often encounter situations where we need to compare the means of more than 2 groups. For example, we want to compare the weight loss efficacy of several diets, say diets A, B, C. Your theory could, for example, be the following: \(0 < \mu_A = \mu_B < \mu_C\). A possible approach to evaluate its validity:
Show that \(\mu_C\) is greater than \(\mu_A\) and \(\mu_B\) (Test 1: \(H_0:\) \(\mu_A\) \(=\) \(\mu_C\), \(H_a:\) \(\mu_A\) \(<\) \(\mu_C\); Test 2: \(H_0:\) \(\mu_B\) \(=\) \(\mu_C\), \(H_a:\) \(\mu_B\) \(<\) \(\mu_C\)). Here we hope to reject \(H_0\) in both cases.
Show that \(\mu_A\) and \(\mu_B\) are greater than 0 (Test 3: \(H_0:\) \(\mu_A\) \(=0\), \(H_a:\) \(\mu_A\) \(>0\); Test 4: \(H_0:\) \(\mu_B\) \(=0\), \(H_a:\) \(\mu_B\) \(>0\)). Here we also hope to reject \(H_0\) in both cases.
Compare \(\mu_A\) and \(\mu_B\) (Test 5: \(H_0:\) \(\mu_A\) \(=\) \(\mu_B\), \(H_a:\) \(\mu_A\) \(\neq\) \(\mu_B\)). Here we hope not to reject \(H_0\). ⚠️ This does not imply that \(\mu_A\) \(=\) \(\mu_B\) is true, but at least the result would not contradict our theory.
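The five tests above can be sketched with SciPy on simulated data whose true means actually satisfy \(0 < \mu_A = \mu_B < \mu_C\) (the group sizes, means, and seed are all illustrative assumptions):

```python
# The five tests of the approach above, on simulated weight-loss data
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
A = rng.normal(loc=2.0, scale=1.0, size=30)
B = rng.normal(loc=2.0, scale=1.0, size=30)
C = rng.normal(loc=4.0, scale=1.0, size=30)

# Tests 1 & 2: mu_C greater than mu_A and mu_B (one-sided Welch t-tests)
p1 = stats.ttest_ind(A, C, equal_var=False, alternative="less").pvalue
p2 = stats.ttest_ind(B, C, equal_var=False, alternative="less").pvalue
# Tests 3 & 4: mu_A and mu_B greater than 0 (one-sided one-sample t-tests)
p3 = stats.ttest_1samp(A, popmean=0, alternative="greater").pvalue
p4 = stats.ttest_1samp(B, popmean=0, alternative="greater").pvalue
# Test 5: mu_A vs mu_B (two-sided); here we hope NOT to reject H_0
p5 = stats.ttest_ind(A, B, equal_var=False).pvalue
print([round(p, 4) for p in (p1, p2, p3, p4, p5)])
```

Remember that performing five tests at level \(\alpha\) inflates the overall error rate, which is exactly the multiple testing issue discussed next.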
Are jelly beans causing acne? Maybe… but why only green ones? 🤨
Source: xkcd
👋 If you want to know more about this comic, have a look here.
For \(k\) independent tests, each performed at significance level \(\alpha\), the probability of rejecting at least one true null hypothesis is
\[\begin{align} \alpha_k &= \Pr(\text{reject } H_0 \text{ at least once}) \\ &= 1 - \Pr(\text{not reject } H_0 \text{ test 1}) \times \ldots \times \Pr(\text{not reject } H_0 \text{ test } k)\\ &= 1 - (1-\alpha) \times \ldots \times (1-\alpha) = 1 - (1-\alpha)^k. \end{align}\]
Suppose that we are interested in performing \(k\) tests and that we want the probability of rejecting the null at least once (assuming the null hypotheses to be correct for all tests), denoted \(\alpha_k\), to be equal to \(\alpha\) (typically 5%). Instead of using \(\alpha\) for the individual tests we will use a corrected level \(\alpha_c\). Then, for \(k\) (potentially dependent) tests we have
\[\begin{align} \alpha_k &= \alpha = \Pr(\text{reject } H_0 \text{ at least once}) \\ &= \Pr(\text{reject } H_0 \text{ test 1} \text{ OR } \ldots \text{ OR } \text{reject } H_0 \text{ test k})\\ &{\color{#eb078e}{\leq}} \sum_{i = 1}^k \Pr(\text{reject } H_0 \text{ test i}) = \alpha_c \times k. \end{align}\]
Solving for \(\alpha_c\) we obtain \(\color{#6A5ACD}{\alpha_c = \alpha/k}\), which is called the Bonferroni correction. By making use of Boole’s inequality, this approach requires no assumptions about the dependence among the tests or about how many of the null hypotheses are true.
\[ \alpha_k = \Pr(\text{reject } H_0 \text{ at least once}) \hspace{0.1cm} {\color{#eb078e}{\leq}} \hspace{0.1cm} 1 - (1 - \alpha_c)^k. \]
Solving for \(\alpha_c\) we obtain \(\color{#6A5ACD}{\alpha_c = 1 - (1 - \alpha)^{1/k}}\), which is called the Dunn–Šidák correction. This correction is (slightly) less stringent than the Bonferroni correction (since \(1 - (1 - \alpha)^{1/k} > \alpha/k\) for \(k \geq 2\)).
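A quick numerical check of the two corrections, for an assumed \(\alpha = 0.05\) and \(k = 10\) tests:

```python
# Family-wise error rate and the two corrected levels (alpha=0.05, k=10)
alpha, k = 0.05, 10
alpha_k = 1 - (1 - alpha) ** k        # FWER for k independent tests
bonferroni = alpha / k                # Bonferroni: alpha / k
sidak = 1 - (1 - alpha) ** (1 / k)    # Dunn-Sidak: 1 - (1 - alpha)^(1/k)
print(round(alpha_k, 4), round(bonferroni, 6), round(sidak, 6))
```

Note that \(\alpha_k \approx 0.40\): with ten uncorrected tests there is roughly a 40% chance of at least one false rejection, and the Šidák level is indeed slightly larger (less stringent) than the Bonferroni one.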
There exist many other methods for multiple testing correction. It is important to mention that when \(k\) is large (say \(> 100\)) the Bonferroni and Dunn–Šidák corrections become inapplicable, and methods based on the idea of the False Discovery Rate should be preferred. However, these methods are beyond the scope of this class.
To compare the means of several populations, a standard approach is to start the analysis with a multiple-sample location test, which assesses whether all group means (or locations) are equal.
We will discuss three multiple-sample location tests:
This test considers the following assumed model for G groups \[X_{i(g)} = {\color{#eb078e}{\mu_{g}}} + \varepsilon_{i(g)} = \mu + {\color{#eb078e}{\delta_{g}}} + \varepsilon_{i(g)},\] where \(g=1,\ldots, G\), \(i=1,...,n_{g}\), \(\varepsilon_{i(g)} \overset{iid}{\sim} N(0,{\color{#eb078e}{\sigma^{2}}})\) and \(\sum n_{g}\delta_{g}=0\).
📝 \(n_i =\) sample size of group \(i\), \(\mu_i = \mu + \delta_i =\) population mean of group \(i\), \(i=1,\ldots, G\).
Hypotheses: \[H_0: {\color{#6A5ACD}{\mu_1}} {\color{#eb078e}{=}} {\color{#06bcf1}{\mu_2}} {\color{#eb078e}{=}} \ldots {\color{#eb078e}{=}} {\color{#8bb174}{\mu_G}} \ \ \ \ \text{and} \ \ \ \ H_a: \mu_i {\color{#eb078e}{\neq}} \mu_j \ \ \text{for at least one pair of} \ \ (i,j).\]
Test statistic’s distribution under \(H_0\):
\(\color{#b4b4b4}{F = \frac{N s^2_{\overline{X}}}{s_p^2} \ {\underset{H_0}{\sim}} \ \text{Fisher}(G-1, N-G),}\) where \(\small \color{#b4b4b4}{s^2_{\overline{X}} = \frac{1}{G-1} \sum_{g=1}^G \frac{n_g}{N} (\overline{X}_g - \overline{X})^2}\),
\(\small \color{#b4b4b4}{s_p^2 = \frac{1}{N-G} \sum_{g=1}^G (n_g-1)s_g^2}\), \(\small \color{#b4b4b4}{N = \sum_{g=1}^G n_g}\), and \(\small \color{#b4b4b4}{\overline{X} = \frac{1}{N} \sum_{g=1}^G n_g \overline{X}_g}\).
Python function:
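A sketch using `scipy.stats.f_oneway` on three simulated groups (all values assumed):

```python
# Fisher's one-way ANOVA (three simulated groups, common variance)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
g1 = rng.normal(loc=0.0, scale=1.0, size=30)
g2 = rng.normal(loc=0.0, scale=1.0, size=30)
g3 = rng.normal(loc=1.0, scale=1.0, size=30)

F, p = stats.f_oneway(g1, g2, g3)
print(F, p)
```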
This test considers the following assumed model for G groups \[X_{i(g)} = {\color{#eb078e}{\mu_{g}}} + \varepsilon_{i(g)} = \mu + {\color{#eb078e}{\delta_{g}}} + \varepsilon_{i(g)},\] where \(g=1,\ldots, G\), \(i=1,...,n_{g}\), \(\varepsilon_{i(g)} \overset{iid}{\sim} N(0,{\color{#eb078e}{\sigma_g^{2}}})\) and \(\sum n_{g}\delta_{g}=0\).
Hypotheses: \[H_0: {\color{#6A5ACD}{\mu_1}} {\color{#eb078e}{=}} {\color{#06bcf1}{\mu_2}} {\color{#eb078e}{=}} \ldots {\color{#eb078e}{=}} {\color{#8bb174}{\mu_G}} \ \ \ \ \text{and} \ \ \ \ H_a: \mu_i {\color{#eb078e}{\neq}} \mu_j \ \ \text{for at least one pair of} \ \ (i,j).\]
Test statistic’s distribution under \(H_0\): \[\color{#b4b4b4}{F^* = \frac{s^{*^2}_{\overline{X}}}{1+\frac{2(G-2)}{3\Delta}} \ {\underset{H_0}{\sim}} \ \text{Fisher}(G-1, \Delta),}\] where \(\color{#b4b4b4}{s^{*^2}_{\overline{X}} = \frac{1}{G-1} \sum_{g=1}^G w_g (\overline{X}_g - \overline{X}^*)^2}\), \(\color{#b4b4b4}{\Delta = \big[\frac{3}{G^2-1} \sum_{g=1}^G \frac{1}{n_g-1} \big(1-\frac{w_g}{\sum_{h=1}^G w_h}\big)^2\big]^{-1}}\), \(\color{#b4b4b4}{w_g = \frac{n_g}{s_g^2}}\),
and \(\small \color{#b4b4b4}{\overline{X}^* = \sum_{g=1}^G \frac{w_g\overline{X}_g}{\sum_{g=1}^G w_g}}\).
Python function:
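SciPy has no dedicated function for Welch's one-way ANOVA (the `pingouin` package offers `welch_anova` if installed); as an alternative, the statistic above can be computed directly. A minimal sketch, with simulated data, for illustration only:

```python
# Welch's one-way ANOVA implemented from the formulas above (sketch)
import numpy as np
from scipy import stats

def welch_anova(*groups):
    G = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    m = np.array([np.mean(g) for g in groups])
    v = np.array([np.var(g, ddof=1) for g in groups])
    w = n / v                                   # weights w_g = n_g / s_g^2
    xbar = np.sum(w * m) / np.sum(w)            # weighted grand mean
    s2 = np.sum(w * (m - xbar) ** 2) / (G - 1)  # between-group term
    lam = 3 / (G**2 - 1) * np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    delta = 1 / lam                             # denominator df
    F = s2 / (1 + 2 * (G - 2) / (3 * delta))
    return F, stats.f.sf(F, G - 1, delta)       # statistic and p-value

rng = np.random.default_rng(1)
g1 = rng.normal(0.0, 1.0, 20)
g2 = rng.normal(0.5, 2.0, 25)
g3 = rng.normal(1.0, 3.0, 30)
F, p = welch_anova(g1, g2, g3)
print(F, p)
```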
This test strongly relies on the assumed absence of outliers. If outliers appear to be present, the Kruskal-Wallis test (see later) is probably a better option.
For moderate and small sample sizes, the sample distribution should be at least approximately normal with no strong skewness to ensure the reliability of the test.
This test does not require the variances of the groups to be equal. If the variances of all the groups are the same (which is rather unlikely in practice), Welch’s one-way ANOVA loses a little power compared to Fisher’s one-way ANOVA.
This test considers the following assumed model for G groups \[X_{i(g)} = {\color{#eb078e}{\theta_{g}}} + \varepsilon_{i(g)} = \theta + {\color{#eb078e}{\delta_{g}}} + \varepsilon_{i(g)},\] where \(g=1,\ldots, G\), \(i=1,...,n_{g}\), the \(\varepsilon_{i(g)}\) are iid from a continuous distribution, and \(\sum n_{g}\delta_{g}=0\).
📝 \(n_i =\) sample size of group \(i\), \(\theta_i = \theta + \delta_i =\) population location of group \(i\), \(i=1,\ldots, G\).
Hypotheses: \[H_0: {\color{#6A5ACD}{\theta_1}} {\color{#eb078e}{=}} {\color{#06bcf1}{\theta_2}} {\color{#eb078e}{=}} \ldots {\color{#eb078e}{=}} {\color{#8bb174}{\theta_G}} \ \ \ \ \text{and} \ \ \ \ H_a: \theta_i {\color{#eb078e}{\neq}} \theta_j \ \ \text{for at least one pair of} \ \ (i,j).\]
Test statistic’s distribution under \(H_0\):
\[\color{#b4b4b4}{H = \frac{\frac{12}{N(N+1)} \sum_{g=1}^G n_g \overline{R}_g^2 - 3(N+1)}{1-\frac{\sum_{v=1}^V (t_v^3 - t_v)}{N^3 - N}} \ {\underset{H_0}{\sim}} \ \chi^2(G-1),}\]
where \(\color{#b4b4b4}{\overline{R}_g = \frac{1}{n_g} \sum_{i=1}^{n_g} R_{i(g)}}\) with \(\color{#b4b4b4}{R_{i(g)}}\) denoting the global rank of the \(\color{#b4b4b4}{i}\)-th observation of group \(\color{#b4b4b4}{g}\), \(\color{#b4b4b4}{V}\) is the number of distinct values/levels in \(\color{#b4b4b4}{X}\), and \(\color{#b4b4b4}{t_v}\) denotes the number of times value \(\color{#b4b4b4}{v}\) occurs in \(\color{#b4b4b4}{X}\).
Python function:
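A sketch using `scipy.stats.kruskal` on simulated groups (values assumed; one gross outlier is included, since this rank-based test is not overly affected by it):

```python
# Kruskal-Wallis test (simulated groups; rank-based, tie-corrected)
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
g1 = rng.normal(loc=0.0, scale=1.0, size=25)
g2 = rng.normal(loc=0.5, scale=1.0, size=25)
g3 = np.append(rng.normal(loc=1.0, scale=1.0, size=24), 50.0)  # outlier

H, p = stats.kruskal(g1, g2, g3)
print(H, p)
```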
# Import data
import pandas as pd
diet = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/diet.csv")
diet["weight.loss"] = diet["initial.weight"] - diet["final.weight"]
# Variable of interest
dietA = diet["weight.loss"][diet["diet.type"]=="A"]
dietB = diet["weight.loss"][diet["diet.type"]=="B"]
dietC = diet["weight.loss"][diet["diet.type"]=="C"]
# Create data frame
dat = pd.DataFrame({
"response": list(dietA) + list(dietB) + list(dietC),
"groups": (["A"] * len(dietA)) + (["B"] * len(dietB)) + (["C"] * len(dietC))
})
This document was prepared with the help of Lionel Voirol and Filippo Salmaso.