Earth Sciences & Pharmaceutical Sciences, University of Geneva 🇨🇭
Statistical methods rest on several fundamental concepts, the most central of which is to regard the available information (the data) as the outcome of a random process.
As such, the data represent a random sample from a population that is either actually or only conceptually accessible.
Statistical inference then allows us to infer the properties of this population from the observed sample. This includes deriving estimates and testing hypotheses.
Source: luminousmen
\[H_0: \theta\hspace{0.1cm} {\color{#eb078e}{\in}}\hspace{0.1cm}\Theta_0 \quad \text{and} \quad H_a: \theta\hspace{0.1cm} {\color{#eb078e}{\not\in}}\hspace{0.1cm}\Theta_0\]
\[H_0: \mu_{\text{drug}}\hspace{0.1cm} {\color{#eb078e}{=}}\hspace{0.1cm}\mu_{\text{control}} \ \text{ and } \ H_a: \mu_{\text{drug}}\hspace{0.1cm} {\color{#eb078e}{<}}\hspace{0.1cm}\mu_{\text{control}}.\]
| Outcome | \(H_0\) is true | \(H_0\) is false |
|---|---|---|
| Can’t reject \(H_0\) | ✅ Correct decision (prob=\(1-\alpha\)) | ⚠️ Type II error (prob=\(\beta\)) |
| Reject \(H_0\) | ⚠️ Type I error (prob=\(\alpha\)) | ✅ Correct decision (prob=\(1-\beta\)) |
In statistics, the p-value is defined as the probability of observing a test statistic that is at least as extreme as actually observed, assuming that the null hypothesis \(H_0\) is true.
Informally, a p-value can be understood as a measure of plausibility of the null hypothesis given the data. A small p-value indicates strong evidence against \(H_0\). Note, however, that the p-value is not the probability that \(H_0\) is true.
When the p-value is small enough (i.e., smaller than the significance level \(\alpha\)), the test based on the null and alternative hypotheses is considered significant, meaning we reject the null hypothesis in favor of the alternative. This is generally what we want because it “verifies” our (research) hypothesis.
When the p-value is not small enough, with the available data, we cannot reject the null hypothesis, so nothing can be concluded. 🤔
The obtained p-value summarizes the incompatibility between the data and the model constructed under the set of assumptions.
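As a concrete (hypothetical) illustration, suppose a test yields an observed standardized statistic \(z_{\text{obs}} = 1.96\) whose distribution under \(H_0\) is standard normal; the two-sided p-value is then the probability of a value at least as extreme in either tail. A minimal sketch (the value 1.96 is assumed for illustration):

```python
# Two-sided p-value for an observed z statistic (hypothetical value)
from scipy import stats

z_obs = 1.96  # observed standardized test statistic (assumed)
# P(|Z| >= |z_obs|) under H_0, i.e. both tails of the N(0, 1) distribution
p_value = 2 * stats.norm.sf(abs(z_obs))
print(round(p_value, 4))
```

With the conventional \(\alpha = 0.05\), this p-value sits right at the significance threshold.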
“Absence of evidence is not evidence of absence.” 👋
👋 From the British Medical Journal.
👋 If you want to know more, have a look here.
P-values have been misused many times because understanding what they mean is not intuitive!
👋 If you want to know more, have a look here.
\[Y\sim N(\mu,\sigma^{2}), \ \ \ \ \ f_{Y}(y) = \frac{1}{\sqrt{2\pi{\color{drawColor6} \sigma}^{2}}}\ e^{-\frac{(y-{\color{drawColor6} \mu})^{2}}{2{\color{drawColor6} \sigma}^{2}}}\]
\[\mathbb{E}[Y] = \mu, \ \ \ \ \ \ \ \ \ \ \ \ \text{Var}[Y] = \sigma^{2},\]
\[Z = \frac{Y-\mu}{\sigma} \sim N(0,1), \ \ \ \ \ f_{Z}(z) = \frac{1}{\sqrt{2\pi}}\ e^{-\frac{z^{2}}{2}}.\]
Probability density function of a normal distribution:
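The density and the standardization above can be checked numerically; a small sketch with assumed values \(\mu = 2\), \(\sigma = 1.5\), \(y = 3\):

```python
# Evaluate the N(mu, sigma^2) density and standardize (assumed mu, sigma, y)
from scipy import stats

mu, sigma = 2.0, 1.5
y = 3.0
f_y = stats.norm.pdf(y, loc=mu, scale=sigma)  # f_Y(y)
z = (y - mu) / sigma                          # standardized value
# consistency check: f_Y(y) = f_Z(z) / sigma
print(f_y, z, stats.norm.pdf(z) / sigma)
```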
In practice, we often encounter problems where our goal is to compare the means (or locations) of two samples. For example,
A scientist is interested in comparing the vaccine efficacy of the Pfizer-BioNTech and the Moderna vaccine.
A bank wants to know which of two proposed plans will most increase the use of its credit cards.
A psychologist wants to compare male and female college students’ impression on a selected webpage.
We will discuss three two-sample location tests:
Two independent sample Student’s t-test
Two independent sample Welch’s t-test
Two independent sample Mann-Whitney-Wilcoxon test
This test considers the following assumed model for groups A and B
\[X_{i(g)} = {\color{#eb078e}{\mu_{g}}}\hspace{0.1cm}+\hspace{0.1cm} \varepsilon_{i(g)} = \mu + {\color{#eb078e}{\delta_{g}}} \hspace{0.1cm}+\hspace{0.1cm} \varepsilon_{i(g)},\]
where \(g=A,B\), \(i=1,...,n_{g}\), \(\varepsilon_{i(g)} \overset{iid}{\sim} N(0,{\color{#eb078e}{\sigma^{2}}})\) and \(\sum n_{g}\delta_{g} =0\).
📝 \(\color{#6A5ACD}{n_A}\) \(=\) sample size of group A, \(\color{#6A5ACD}{\mu_{A} = \mu + \delta_A}\) \(=\) population mean of group A, \(\color{#06bcf1}{n_B}\) and \(\color{#06bcf1}{\mu_{B} = \mu + \delta_B}\) are similarly defined for group B.
Hypotheses:
\[H_0: {\color{#6A5ACD}{\mu_A}} - {\color{#06bcf1}{\mu_B}} {\color{#eb078e}{=}} \mu_0\ \ \ \ \text{and} \ \ \ \ H_a: {\color{#6A5ACD}{\mu_A}} - {\color{#06bcf1}{\mu_B}} \ \big[ {\color{#eb078e}{>}} \text{ or } {\color{#eb078e}{<}} \text{ or } {\color{#eb078e}{\neq}} \big] \ \mu_0.\]
Test statistic’s distribution under \(H_0\):
\[ \color{#b4b4b4}{T = \frac{(\overline{X}_{A}-\overline{X}_{B})-\mu_0}{s_{p}\sqrt{n_{A}^{-1}+n_{B}^{-1}}} \ {\underset{H_0}{\sim}} \ \text{Student}(n_{A}+n_{B}-2),} \] where \(\color{#b4b4b4}{s_{p}^2 = \frac{(n_A-1)s_A^2+(n_B-1)s_B^2}{n_A+n_B-2}}\) is the pooled sample variance.
Python function:
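No code was shown here; a minimal sketch using `scipy.stats.ttest_ind` on simulated data (the group means, spreads, and seed are assumptions for illustration; with `equal_var=True` this is the pooled-variance Student's t-test of \(\mu_A - \mu_B = 0\)):

```python
# Two independent sample Student's t-test (simulated data, mu_0 = 0)
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
xA = rng.normal(loc=5.0, scale=2.0, size=25)  # group A
xB = rng.normal(loc=6.0, scale=2.0, size=25)  # group B

# equal_var=True -> pooled variance (Student's t-test);
# alternative="two-sided" matches H_a: mu_A - mu_B != 0
res = stats.ttest_ind(xA, xB, equal_var=True, alternative="two-sided")
print(res.statistic, res.pvalue)
```

`alternative="less"` or `"greater"` would give the one-sided versions of \(H_a\).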
This test strongly relies on the assumed absence of outliers. If outliers appear to be present, the Mann-Whitney-Wilcoxon test (see later) is probably a better option.
For moderate and small sample sizes, the sample distribution should be at least approximately normal with no strong skewness to ensure the reliability of the test.
The assumption of equal variances is hard to verify in practice, so we recommend avoiding this test.
This test considers the following assumed model for groups A and B
\[X_{i(g)} = {\color{#eb078e}{\mu_{g}}}\hspace{0.1cm}+\hspace{0.1cm} \varepsilon_{i(g)} = \mu + {\color{#eb078e}{\delta_{g}}} \hspace{0.1cm}+\hspace{0.1cm} \varepsilon_{i(g)},\]
where \(g=A,B\), \(i=1,...,n_{g}\), \(\varepsilon_{i(g)} \overset{iid}{\sim} N(0,{\color{#eb078e}{\sigma_g^{2}}})\) and \(\sum n_{g}\delta_{g} =0\).
📝 \(\color{#6A5ACD}{n_A}\) \(=\) sample size of group A, \(\color{#6A5ACD}{\mu_{A} = \mu + \delta_A}\) \(=\) population mean of group A, \(\color{#06bcf1}{n_B}\) and \(\color{#06bcf1}{\mu_{B} = \mu + \delta_B}\) are similarly defined for group B.
Hypotheses:
\[H_0: {\color{#6A5ACD}{\mu_A}} \hspace{0.1cm}-\hspace{0.1cm} {\color{#06bcf1}{\mu_B}} {\color{#eb078e}{=}} \hspace{0.1cm}\mu_0\ \ \ \ \text{and} \ \ \ \ H_a: {\color{#6A5ACD}{\mu_A}}\hspace{0.1cm}-\hspace{0.1cm} {\color{#06bcf1}{\mu_B}} \ \hspace{0.1cm}\hspace{0.1cm} \big[ {\color{#eb078e}{>}}\hspace{0.1cm}\hspace{0.1cm} \text{ or } {\color{#eb078e}{<}}\hspace{0.1cm}\hspace{0.1cm} \text{ or } {\color{#eb078e}{\neq}}\big]\hspace{0.1cm}\mu_0.\]
Test statistic’s distribution under \(H_0\):
\[ \color{#b4b4b4}{T = \frac{(\overline{X}_{A}-\overline{X}_{B})-\mu_0}{\sqrt{s^2_A/n_{A} + s^2_B/n_{B}}} \ {\underset{H_0}{\sim}} \ \text{Student}(df).} \]
Python function:
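Again as a sketch on simulated data (all values assumed): the same SciPy function with `equal_var=False` gives Welch's t-test:

```python
# Two independent sample Welch's t-test (simulated, unequal variances)
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
xA = rng.normal(loc=5.0, scale=1.0, size=20)  # group A, small spread
xB = rng.normal(loc=6.0, scale=3.0, size=35)  # group B, large spread

res = stats.ttest_ind(xA, xB, equal_var=False)  # Welch's t-test
print(res.statistic, res.pvalue)
```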
This test strongly relies on the assumed absence of outliers. If outliers appear to be present, the Mann-Whitney-Wilcoxon test (see later) is probably a better option.
For moderate and small sample sizes, the sample distribution should be at least approximately normal with no strong skewness to ensure the reliability of the test.
This test does not require the variances of the two groups to be equal. If the variances of the two groups are the same (which is rather unlikely in practice), Welch’s t-test loses a little power compared to Student’s t-test.
The computation of \(df\) (i.e. the degrees of freedom of the distribution under the null) is beyond the scope of this class.
This test considers the following assumed model for groups A and B
\[X_{i(g)} = {\color{#eb078e}{\theta_{g}}} + \varepsilon_{i(g)} = \theta + {\color{#eb078e}{\delta_{g}}} + \varepsilon_{i(g)},\]
where \(g=A,B\), \(i=1,...,n_{g}\), the \(\varepsilon_{i(g)}\) are iid from a continuous distribution, and \(\sum n_{g}\delta_{g} =0\).
📝 \(\color{#6A5ACD}{n_A}\) \(=\) sample size of group A, \(\color{#6A5ACD}{\theta_{A} = \theta + \delta_A}\) \(=\) population location of group A, \(\color{#06bcf1}{n_B}\) and \(\color{#06bcf1}{\theta_{B} = \theta + \delta_B}\) are similarly defined for group B.
Hypotheses:
\[H_0: {\color{#6A5ACD}{\theta_A}} - {\color{#06bcf1}{\theta_B}} {\color{#eb078e}{=}} \theta_0\ \ \ \ \text{and} \ \ \ \ H_a: {\color{#6A5ACD}{\theta_A}} - {\color{#06bcf1}{\theta_B}} \ \big[ {\color{#eb078e}{>}} \text{ or } {\color{#eb078e}{<}} \text{ or } {\color{#eb078e}{\neq}} \big] \ \theta_0.\]
Test statistic’s distribution under \(H_0\):
\[ \color{#b4b4b4}{Z = \frac{\sum_{i=1}^{n_{B}}R_{i(B)}-[n_{B}(n_{A}+n_{B}+1)/2]}{\sqrt{n_{A}n_{B}(n_{A}+n_{B}+1)/12}},} \] where \(\color{#b4b4b4}{R_{i(g)}}\) denotes the global rank of the \(\color{#b4b4b4}{i}\)-th observation of group \(\color{#b4b4b4}{g}\).
Python function:
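A sketch using `scipy.stats.mannwhitneyu` on simulated data (values assumed; one gross outlier is included to illustrate the robustness point below):

```python
# Mann-Whitney-Wilcoxon test (simulated data with one gross outlier)
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
xA = np.append(rng.normal(loc=5.0, scale=1.0, size=20), 30.0)  # outlier
xB = rng.normal(loc=6.0, scale=1.0, size=20)

res = stats.mannwhitneyu(xA, xB, alternative="two-sided")
print(res.statistic, res.pvalue)
```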
This test is “robust” in the sense that (unlike the t-tests) it is not overly affected by outliers.
For the Mann–Whitney–Wilcoxon test to be comparable to the t-tests (i.e. testing for the mean), we need to assume symmetric distributions and equal variances.
Then, we have \(\color{#6A5ACD}{\theta_A = \mu_A}\) and \(\color{#06bcf1}{\theta_B = \mu_B}\).
Compared to the t-tests, the Mann–Whitney–Wilcoxon test is less powerful when the t-tests’ requirements (normality and, possibly, equal variances) are met.
The distribution of this test statistic under the null is complicated and can be obtained by different methods (e.g. exact, asymptotic normal, …).
The details are beyond the scope of this class.
# Import data
import pandas as pd
diet = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/diet.csv")
n = 5
print(diet.head(n))  # print the first n rows of the dataset
   id  gender  age  height diet.type  initial.weight  final.weight
0 1 Female 22 159 A 58 54.2
1 2 Female 46 192 A 60 54.0
2 3 Female 55 170 A 64 63.3
3 4 Female 33 171 A 64 61.1
4 5 Female 50 170 A 65 62.2
In practice, we often encounter situations where we need to compare the means of more than 2 groups. For example, we want to compare the weight loss efficacy of several diets, say diets A, B, C. Your theory could, for example, be the following: \(0 < \mu_A = \mu_B < \mu_C\). A possible approach to evaluate its validity:
Show that \(\mu_C\) is greater than \(\mu_A\) and \(\mu_B\) (Test 1: \(H_0:\) \(\mu_A\) \(=\) \(\mu_C\), \(H_a:\) \(\mu_A\) \(<\) \(\mu_C\); Test 2: \(H_0:\) \(\mu_B\) \(=\) \(\mu_C\), \(H_a:\) \(\mu_B\) \(<\) \(\mu_C\)). Here we hope to reject \(H_0\) in both cases.
Show that \(\mu_A\) and \(\mu_B\) are greater than 0 (Test 3: \(H_0:\) \(\mu_A\) \(=0\), \(H_a:\) \(\mu_A\) \(>0\); Test 4: \(H_0:\) \(\mu_B\) \(=0\), \(H_a:\) \(\mu_B\) \(>0\)). Here we also hope to reject \(H_0\) in both cases.
Compare \(\mu_A\) and \(\mu_B\) (Test 5: \(H_0:\) \(\mu_A\) \(=\) \(\mu_B\), \(H_a:\) \(\mu_A\) \(\neq\) \(\mu_B\)). Here we hope not to reject \(H_0\). ⚠️ This does not imply that \(\mu_A\) \(=\) \(\mu_B\) is true, but at least the result would not contradict our theory.
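The five tests above can be sketched with SciPy on simulated data whose true means actually satisfy \(0 < \mu_A = \mu_B < \mu_C\) (the group sizes, means, and seed are all illustrative assumptions):

```python
# The five tests of the approach above, on simulated weight-loss data
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
A = rng.normal(loc=2.0, scale=1.0, size=30)
B = rng.normal(loc=2.0, scale=1.0, size=30)
C = rng.normal(loc=4.0, scale=1.0, size=30)

# Tests 1 & 2: mu_C greater than mu_A and mu_B (one-sided Welch t-tests)
p1 = stats.ttest_ind(A, C, equal_var=False, alternative="less").pvalue
p2 = stats.ttest_ind(B, C, equal_var=False, alternative="less").pvalue
# Tests 3 & 4: mu_A and mu_B greater than 0 (one-sided one-sample t-tests)
p3 = stats.ttest_1samp(A, popmean=0, alternative="greater").pvalue
p4 = stats.ttest_1samp(B, popmean=0, alternative="greater").pvalue
# Test 5: mu_A vs mu_B (two-sided); here we hope NOT to reject H_0
p5 = stats.ttest_ind(A, B, equal_var=False).pvalue
print([round(p, 4) for p in (p1, p2, p3, p4, p5)])
```

Remember that performing five tests at level \(\alpha\) inflates the overall error rate, which is exactly the multiple testing issue discussed next.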
Are jelly beans causing acne? Maybe… but why only green ones? 🤨
Source: xkcd
👋 If you want to know more about this comic, have a look here.
For \(k\) independent tests, each performed at significance level \(\alpha\), the probability of rejecting at least one true null hypothesis is
\[\begin{align} \alpha_k &= \Pr(\text{reject } H_0 \text{ at least once}) \\ &= 1 - \Pr(\text{not reject } H_0 \text{ test 1}) \times \ldots \times \Pr(\text{not reject } H_0 \text{ test } k)\\ &= 1 - (1-\alpha) \times \ldots \times (1-\alpha) = 1 - (1-\alpha)^k. \end{align}\]
Suppose that we are interested in performing \(k\) tests and that we want the probability of rejecting the null at least once (assuming the null hypotheses to be correct for all tests), denoted \(\alpha_k\), to be equal to \(\alpha\) (typically 5%). Instead of using \(\alpha\) for the individual tests we will use a corrected level \(\alpha_c\). Then, for \(k\) (potentially dependent) tests we have
\[\begin{align} \alpha_k &= \alpha = \Pr(\text{reject } H_0 \text{ at least once}) \\ &= \Pr(\text{reject } H_0 \text{ test 1} \text{ OR } \ldots \text{ OR } \text{reject } H_0 \text{ test k})\\ &{\color{#eb078e}{\leq}} \sum_{i = 1}^k \Pr(\text{reject } H_0 \text{ test i}) = \alpha_c \times k. \end{align}\]
Solving for \(\alpha_c\) we obtain \(\color{#6A5ACD}{\alpha_c = \alpha/k}\), which is called the Bonferroni correction. By making use of Boole’s inequality, this approach requires no assumptions about the dependence among the tests or about how many of the null hypotheses are true.
\[ \alpha_k = \Pr(\text{reject } H_0 \text{ at least once}) \hspace{0.1cm} {\color{#eb078e}{\leq}} \hspace{0.1cm} 1 - (1 - \alpha_c)^k. \]
Solving for \(\alpha_c\) we obtain \(\color{#6A5ACD}{\alpha_c = 1 - (1 - \alpha)^{1/k}}\), which is called the Dunn–Šidák correction. This correction is (slightly) less stringent than the Bonferroni correction (since \(1 - (1 - \alpha)^{1/k} > \alpha/k\) for \(k \geq 2\)).
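A quick numerical check of the two corrections, for an assumed \(\alpha = 0.05\) and \(k = 10\) tests:

```python
# Family-wise error rate and the two corrected levels (alpha=0.05, k=10)
alpha, k = 0.05, 10
alpha_k = 1 - (1 - alpha) ** k        # FWER for k independent tests
bonferroni = alpha / k                # Bonferroni: alpha / k
sidak = 1 - (1 - alpha) ** (1 / k)    # Dunn-Sidak: 1 - (1 - alpha)^(1/k)
print(round(alpha_k, 4), round(bonferroni, 6), round(sidak, 6))
```

Note that \(\alpha_k \approx 0.40\): with ten uncorrected tests there is roughly a 40% chance of at least one false rejection, and the Šidák level is indeed slightly larger (less stringent) than the Bonferroni one.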
There exist many other methods for multiple testing correction. It is important to mention that when \(k\) is large (say \(> 100\)) the Bonferroni and Dunn–Šidák corrections become inapplicable, and methods based on the idea of the False Discovery Rate should be preferred. However, these methods are beyond the scope of this class.
To compare the means of several populations, a standard approach is to start the analysis with a multiple-sample location test, which assesses whether all group means (or locations) are equal.
We will discuss three multiple-sample location tests:
This test considers the following assumed model for G groups \[X_{i(g)} = {\color{#eb078e}{\mu_{g}}} + \varepsilon_{i(g)} = \mu + {\color{#eb078e}{\delta_{g}}} + \varepsilon_{i(g)},\] where \(g=1,\ldots, G\), \(i=1,...,n_{g}\), \(\varepsilon_{i(g)} \overset{iid}{\sim} N(0,{\color{#eb078e}{\sigma^{2}}})\) and \(\sum n_{g}\delta_{g}=0\).
📝 \(n_i =\) sample size of group \(i\), \(\mu_i = \mu + \delta_i =\) population mean of group \(i\), \(i=1,\ldots, G\).
Hypotheses: \[H_0: {\color{#6A5ACD}{\mu_1}} {\color{#eb078e}{=}} {\color{#06bcf1}{\mu_2}} {\color{#eb078e}{=}} \ldots {\color{#eb078e}{=}} {\color{#8bb174}{\mu_G}} \ \ \ \ \text{and} \ \ \ \ H_a: \mu_i {\color{#eb078e}{\neq}} \mu_j \ \ \text{for at least one pair of} \ \ (i,j).\]
Test statistic’s distribution under \(H_0\):
\(\color{#b4b4b4}{F = \frac{N s^2_{\overline{X}}}{s_p^2} \ {\underset{H_0}{\sim}} \ \text{Fisher}(G-1, N-G),}\) where \(\small \color{#b4b4b4}{s^2_{\overline{X}} = \frac{1}{G-1} \sum_{g=1}^G \frac{n_g}{N} (\overline{X}_g - \overline{X})^2}\),
\(\small \color{#b4b4b4}{s_p^2 = \frac{1}{N-G} \sum_{g=1}^G (n_g-1)s_g^2}\), \(\small \color{#b4b4b4}{N = \sum_{g=1}^G n_g}\), and \(\small \color{#b4b4b4}{\overline{X} = \frac{1}{N} \sum_{g=1}^G n_g \overline{X}_g}\).
Python function:
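A sketch using `scipy.stats.f_oneway` on three simulated groups (all values assumed):

```python
# Fisher's one-way ANOVA (three simulated groups, common variance)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
g1 = rng.normal(loc=0.0, scale=1.0, size=30)
g2 = rng.normal(loc=0.0, scale=1.0, size=30)
g3 = rng.normal(loc=1.0, scale=1.0, size=30)

F, p = stats.f_oneway(g1, g2, g3)
print(F, p)
```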
This test considers the following assumed model for G groups \[X_{i(g)} = {\color{#eb078e}{\mu_{g}}} + \varepsilon_{i(g)} = \mu + {\color{#eb078e}{\delta_{g}}} + \varepsilon_{i(g)},\] where \(g=1,\ldots, G\), \(i=1,...,n_{g}\), \(\varepsilon_{i(g)} \overset{iid}{\sim} N(0,{\color{#eb078e}{\sigma_g^{2}}})\) and \(\sum n_{g}\delta_{g}=0\).
Hypotheses: \[H_0: {\color{#6A5ACD}{\mu_1}} {\color{#eb078e}{=}} {\color{#06bcf1}{\mu_2}} {\color{#eb078e}{=}} \ldots {\color{#eb078e}{=}} {\color{#8bb174}{\mu_G}} \ \ \ \ \text{and} \ \ \ \ H_a: \mu_i {\color{#eb078e}{\neq}} \mu_j \ \ \text{for at least one pair of} \ \ (i,j).\]
Test statistic’s distribution under \(H_0\): \[\color{#b4b4b4}{F^* = \frac{s^{*^2}_{\overline{X}}}{1+\frac{2(G-2)}{3\Delta}} \ {\underset{H_0}{\sim}} \ \text{Fisher}(G-1, \Delta),}\] where \(\color{#b4b4b4}{s^{*^2}_{\overline{X}} = \frac{1}{G-1} \sum_{g=1}^G w_g (\overline{X}_g - \overline{X}^*)^2}\), \(\color{#b4b4b4}{\Delta = \big[\frac{3}{G^2-1} \sum_{g=1}^G \frac{1}{n_g-1} \big(1-\frac{w_g}{\sum_{h=1}^G w_h}\big)^2\big]^{-1}}\), \(\color{#b4b4b4}{w_g = \frac{n_g}{s_g^2}}\),
and \(\small \color{#b4b4b4}{\overline{X}^* = \sum_{g=1}^G \frac{w_g\overline{X}_g}{\sum_{g=1}^G w_g}}\).
Python function:
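SciPy has no dedicated function for Welch's one-way ANOVA (the `pingouin` package offers `welch_anova` if installed); as an alternative, the statistic above can be computed directly. A minimal sketch, with simulated data, for illustration only:

```python
# Welch's one-way ANOVA implemented from the formulas above (sketch)
import numpy as np
from scipy import stats

def welch_anova(*groups):
    G = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    m = np.array([np.mean(g) for g in groups])
    v = np.array([np.var(g, ddof=1) for g in groups])
    w = n / v                                   # weights w_g = n_g / s_g^2
    xbar = np.sum(w * m) / np.sum(w)            # weighted grand mean
    s2 = np.sum(w * (m - xbar) ** 2) / (G - 1)  # between-group term
    lam = 3 / (G**2 - 1) * np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    delta = 1 / lam                             # denominator df
    F = s2 / (1 + 2 * (G - 2) / (3 * delta))
    return F, stats.f.sf(F, G - 1, delta)       # statistic and p-value

rng = np.random.default_rng(1)
g1 = rng.normal(0.0, 1.0, 20)
g2 = rng.normal(0.5, 2.0, 25)
g3 = rng.normal(1.0, 3.0, 30)
F, p = welch_anova(g1, g2, g3)
print(F, p)
```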
This test strongly relies on the assumed absence of outliers. If outliers appear to be present, the Kruskal-Wallis test (see later) is probably a better option.
For moderate and small sample sizes, the sample distribution should be at least approximately normal with no strong skewness to ensure the reliability of the test.
This test does not require the variances of the groups to be equal. If the variances of all the groups are the same (which is rather unlikely in practice), Welch’s one-way ANOVA loses a little power compared to Fisher’s one-way ANOVA.
This test considers the following assumed model for G groups \[X_{i(g)} = {\color{#eb078e}{\theta_{g}}} + \varepsilon_{i(g)} = \theta + {\color{#eb078e}{\delta_{g}}} + \varepsilon_{i(g)},\] where \(g=1,\ldots, G\), \(i=1,...,n_{g}\), the \(\varepsilon_{i(g)}\) are iid from a continuous distribution, and \(\sum n_{g}\delta_{g}=0\).
📝 \(n_i =\) sample size of group \(i\), \(\theta_i = \theta + \delta_i =\) population location of group \(i\), \(i=1,\ldots, G\).
Hypotheses: \[H_0: {\color{#6A5ACD}{\theta_1}} {\color{#eb078e}{=}} {\color{#06bcf1}{\theta_2}} {\color{#eb078e}{=}} \ldots {\color{#eb078e}{=}} {\color{#8bb174}{\theta_G}} \ \ \ \ \text{and} \ \ \ \ H_a: \theta_i {\color{#eb078e}{\neq}} \theta_j \ \ \text{for at least one pair of} \ \ (i,j).\]
Test statistic’s distribution under \(H_0\):
\[\color{#b4b4b4}{H = \frac{\frac{12}{N(N+1)} \sum_{g=1}^G n_g \overline{R}_g^2 - 3(N+1)}{1-\frac{\sum_{v=1}^V (t_v^3 - t_v)}{N^3 - N}} \ {\underset{H_0}{\sim}} \ \chi^2(G-1),}\]
where \(\color{#b4b4b4}{\overline{R}_g = \frac{1}{n_g} \sum_{i=1}^{n_g} R_{i(g)}}\) with \(\color{#b4b4b4}{R_{i(g)}}\) denoting the global rank of the \(\color{#b4b4b4}{i}\)-th observation of group \(\color{#b4b4b4}{g}\), \(\color{#b4b4b4}{V}\) is the number of distinct values/levels in \(\color{#b4b4b4}{X}\), and \(\color{#b4b4b4}{t_v}\) denotes the number of times value \(\color{#b4b4b4}{v}\) occurs in \(\color{#b4b4b4}{X}\).
Python function:
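A sketch using `scipy.stats.kruskal` on simulated groups (values assumed; one gross outlier is included, since this rank-based test is not overly affected by it):

```python
# Kruskal-Wallis test (simulated groups; rank-based, tie-corrected)
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
g1 = rng.normal(loc=0.0, scale=1.0, size=25)
g2 = rng.normal(loc=0.5, scale=1.0, size=25)
g3 = np.append(rng.normal(loc=1.0, scale=1.0, size=24), 50.0)  # outlier

H, p = stats.kruskal(g1, g2, g3)
print(H, p)
```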
# Import data
import pandas as pd
diet = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/diet.csv")
diet["weight.loss"] = diet["initial.weight"] - diet["final.weight"]
# Variable of interest
dietA = diet["weight.loss"][diet["diet.type"]=="A"]
dietB = diet["weight.loss"][diet["diet.type"]=="B"]
dietC = diet["weight.loss"][diet["diet.type"]=="C"]
# Create data frame
dat = pd.DataFrame({
"response": list(dietA) + list(dietB) + list(dietC),
"groups": (["A"] * len(dietA)) + (["B"] * len(dietB)) + (["C"] * len(dietC))
})
This document was prepared with the help of Lionel Voirol and Filippo Salmaso.