Exercises diet data – ELSTE Data Science Master Course

Question 1

How to plot the data?

Complete the code below to obtain a boxplot that compares the empirical distributions of the weight loss under the two diets:

Use the boxplot() function from matplotlib.pyplot:

import pandas as pd
import matplotlib.pyplot as plt

diet = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/diet.csv")

# compute weight loss
diet["weight.loss"] = diet["initial.weight"] - diet["final.weight"]

# Variable of interest
dietA = diet["weight.loss"][diet["diet.type"]=="A"]
dietC = diet["weight.loss"][diet["diet.type"]=="C"]

# Side-by-side boxplot
plt.figure(figsize=(6,4))
plt.boxplot([dietA, dietC],
            labels=["Diet A", "Diet C"],
            patch_artist=True,  # color the boxes
            boxprops=dict(facecolor="lightblue", color="blue"),
            medianprops=dict(color="red"))

plt.ylabel("Weight loss (kg)")
plt.title("Comparison of Weight Loss for Diets A and C")
plt.show()

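To read the boxplot, it can help to see which numbers it draws. The sketch below computes them by hand, assuming matplotlib's default settings (quartiles from np.percentile, whiskers reaching the most extreme observations within 1.5 times the interquartile range). The helper name boxplot_summary and the example values are made up for illustration and are not the course data.

```python
import numpy as np

def boxplot_summary(values):
    """Statistics drawn by a default boxplot (hypothetical helper)."""
    values = np.sort(np.asarray(values, dtype=float))
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations inside the fences
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    outside = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "whisker_low": float(inside.min()),
        "whisker_high": float(inside.max()),
        "outliers": outside.tolist(),
    }

# Made-up weight losses in kg, not the course data
example = [1, 2, 3, 4, 5, 6, 7, 8, 30]
summary = boxplot_summary(example)
print(summary)
```

In this made-up sample the value 30 lies above the upper fence, so a boxplot would draw it as an isolated point rather than extend the whisker to it.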

Question 2

Which test should we use?

Based on the graph you produce, which test appears to be the most appropriate to test if diet C leads to a larger average weight loss than diet A?

Student's t-test.

Welch's t-test.

Wilcoxon test.
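The visual impression from the boxplot can be complemented with a few numeric diagnostics. The sketch below is only an illustration on synthetic data (the group sizes, means and spreads are assumptions, not the course data): it reports the sample standard deviation, the skewness, a Shapiro-Wilk normality test per group, and a Levene test for equal variances. Such tests have little power in small samples, so they support the plot rather than replace it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two groups (assumed values, not the course data)
group_a = rng.normal(loc=3.0, scale=2.0, size=24)
group_c = rng.normal(loc=5.0, scale=2.5, size=27)

for name, group in [("A", group_a), ("C", group_c)]:
    w_stat, p_normal = stats.shapiro(group)  # H0: the sample is normal
    print(f"Group {name}: n={group.size}, sd={group.std(ddof=1):.2f}, "
          f"skewness={stats.skew(group):.2f}, Shapiro-Wilk p={p_normal:.3f}")

# H0: both groups have the same variance
lev_stat, p_equal_var = stats.levene(group_a, group_c)
print(f"Levene test for equal variances: p={p_equal_var:.3f}")
```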

Question 3

How to perform the test you selected?

Complete the code below to test whether diet C leads to a larger average weight loss than diet A. Remember that the alternative hypothesis is one-sided: we are interested in whether diet C leads to a larger weight loss than diet A, not merely a different one.

import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

diet = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/diet.csv")
# compute weight loss
diet["weight.loss"] = diet["initial.weight"] - diet["final.weight"]

# Variable of interest
dietA = diet["weight.loss"][diet["diet.type"]=="A"]
dietC = diet["weight.loss"][diet["diet.type"]=="C"]

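# Wilcoxon rank-sum test (Mann–Whitney U), one-sided:
# the alternative is that weight loss under diet A tends to be smaller than under diet C
stat, p_value = stats.mannwhitneyu(dietA, dietC, alternative="less")

print("Mann-Whitney U statistic:", round(stat, 4))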
print("p-value:", round(p_value, 4))
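To see what the reported statistic measures, the sketch below recomputes it by counting pairs: the U statistic for the first sample is the number of pairs in which the first-sample value exceeds the second-sample value, with ties counted as one half. The helper name u_statistic and the two small samples are made up for illustration.

```python
import numpy as np
from scipy import stats

def u_statistic(x, y):
    """Mann-Whitney U for sample x, by pair counting (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    diff = x[:, None] - y[None, :]  # all pairwise differences x_i - y_j
    return float(np.sum(diff > 0) + 0.5 * np.sum(diff == 0))

# Made-up values, not the course data
x = [2.1, 3.4, 1.0, 4.2, 2.8]
y = [3.9, 5.1, 2.8, 6.0, 4.4, 5.5]

u_manual = u_statistic(x, y)
u_scipy = stats.mannwhitneyu(x, y, alternative="less").statistic
print(u_manual, u_scipy)
```

A small U for the first sample means that its values rarely exceed those of the second sample, which is the direction tested with alternative="less".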

Question 4

What can we conclude?

Based on the test you performed and considering a significance level of \(5\%\), we can conclude that:

We can reject the null hypothesis at the significance level of \(5\%\) and conclude that diet A leads to a larger average weight loss than diet C.

We can be sure that diet C leads to a larger average weight loss than diet A.

We can reject the null hypothesis at the significance level of \(5\%\) and conclude that diet C leads to a larger average weight loss than diet A.

We cannot reject the null hypothesis at the significance level of \(5\%\).

Question 5

Is diet A really so bad?

A consultant who was hired by the company and who promotes diet A is unhappy with your analysis, since it indicates that diet C (promoted by a competing firm) is better in terms of weight loss. The consultant therefore argues as follows: to claim that one diet is more effective than the other, the average difference in weight loss should exceed one kg; otherwise the statistical difference is meaningless. Below we perform the test that the consultant is interested in:

import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

diet = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/diet.csv")
# compute weight loss
diet["weight.loss"] = diet["initial.weight"] - diet["final.weight"]

# Variable of interest
dietA = diet["weight.loss"][diet["diet.type"]=="A"]
dietC = diet["weight.loss"][diet["diet.type"]=="C"]


# Shift diet A by 1 kg: tests whether diet C exceeds diet A by more than 1 kg
stat, p_value = stats.mannwhitneyu(dietA + 1, dietC, alternative="less")
print(f"Mann-Whitney U test statistic: {stat}")
print(f"P-value: {p_value}")
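The shift trick used above can be wrapped in a small function. The sketch below assumes a location-shift reading of the test: adding delta to the diet A values before comparing turns the question into whether diet C exceeds diet A by more than delta. The helper name shifted_mannwhitney and the synthetic samples are assumptions for illustration, not the course data.

```python
import numpy as np
from scipy import stats

def shifted_mannwhitney(a, c, delta):
    """One-sided test of whether c exceeds a by more than delta (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    c = np.asarray(c, dtype=float)
    return stats.mannwhitneyu(a + delta, c, alternative="less")

rng = np.random.default_rng(1)
# Synthetic weight losses in kg (assumed values, not the course data)
a = rng.normal(3.0, 2.0, size=24)
c = rng.normal(5.0, 2.0, size=27)

for delta in (0.0, 0.5, 1.0):
    res = shifted_mannwhitney(a, c, delta)
    print(f"delta = {delta:.1f} kg: U = {res.statistic:.1f}, p-value = {res.pvalue:.4f}")
```

With delta = 0 the function reduces to the unshifted one-sided test; larger values of delta make the null hypothesis harder to reject.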

Question 6

This test yields a p-value of \(5.747\%\) (which might make you suspect that the consultant picked one kg for a good reason… 🤔). The consultant therefore claims that it is statistically proven that the two diets are equally effective in terms of weight loss. What do you think of this argument?

The consultant showed that the average difference in terms of weight loss is equal to one kg, so his claim is incorrect.

The consultant is correct, there is indeed not much difference between the two diets.

The claim of the consultant is incorrect for the following reasons. First, there is no good reason to compare a difference of one kg and second, we can never accept the null hypothesis!

The consultant showed that the average difference in terms of weight loss is equal to one kg, so his claim is correct.
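Why a non-significant result does not establish that two groups are equal can be illustrated with a small simulation. The sketch below uses assumed settings (normal data, 10 observations per group, a true shift of 0.5, 500 repetitions) in which the null hypothesis is false by construction, and counts how often the one-sided test still fails to reject it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Assumed simulation settings, chosen only for illustration
n_sim, n_per_group, true_shift, alpha = 500, 10, 0.5, 0.05

not_rejected = 0
for _ in range(n_sim):
    a = rng.normal(0.0, 1.0, size=n_per_group)
    c = rng.normal(true_shift, 1.0, size=n_per_group)  # c is truly larger
    p = stats.mannwhitneyu(a, c, alternative="less").pvalue
    not_rejected += int(p > alpha)

rate = not_rejected / n_sim
print(f"H0 not rejected in {rate:.0%} of simulations although it is false")
```

Under these settings the test misses a real difference in a large share of the repetitions, so failing to reject is not evidence that the difference is zero (or equal to any particular value).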

Question 7

The competing company that promotes diet C is unhappy about the claims made by the consultant. So they hire their own consultant, who selects a difference of \(950\) grams. Complete the code below to perform the test this second consultant has in mind:

import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

diet = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/diet.csv")
# compute weight loss
diet["weight.loss"] = diet["initial.weight"] - diet["final.weight"]

# Variable of interest
dietA = diet["weight.loss"][diet["diet.type"]=="A"]
dietC = diet["weight.loss"][diet["diet.type"]=="C"]


# Shift diet A by 0.95 kg: tests whether diet C exceeds diet A by more than 950 g
stat, p_value = stats.mannwhitneyu(dietA + 0.95, dietC, alternative="less")
print(f"Mann-Whitney U test statistic: {stat}")
print(f"P-value: {p_value}")

Question 8

The second consultant then claims that diet C leads to an average weight loss that is significantly larger, by more than \(950\) grams, than that of diet A. What do you think of this argument?

The claim is correct because the p-value is smaller than 5%.

The claim is incorrect because we can't accept the null hypothesis.

The claim is incorrect because the hypothesis was constructed after looking at the known results.
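How strongly the conclusion depends on the chosen threshold can be seen by scanning a grid of candidate thresholds. The sketch below does this on synthetic data (the samples and the grid are assumptions, not the course data): the p-value grows as the threshold grows, so anyone who picks the threshold after seeing the data can land on whichever side of \(5\%\) suits them.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Synthetic weight losses in kg (assumed values, not the course data)
a = rng.normal(3.0, 2.0, size=24)
c = rng.normal(5.0, 2.0, size=27)

# Candidate thresholds from 0 to 2 kg in steps of 0.25 kg
deltas = np.round(np.arange(0.0, 2.01, 0.25), 2)
p_values = [stats.mannwhitneyu(a + d, c, alternative="less").pvalue for d in deltas]

for d, p in zip(deltas, p_values):
    flag = "significant" if p < 0.05 else "not significant"
    print(f"delta = {d:.2f} kg: p-value = {p:.4f} ({flag})")
```

A threshold fixed before looking at the data avoids this problem, because it cannot be tuned to the observed result.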