Exercises diet data
In the previous section, we compared diets A and B. In this section, we consider the difference in average weight loss between diets A and C. In particular, we want to test whether diet C leads to a larger average weight loss than diet A.
Question 1
How to plot the data?
Complete the code below to draw side-by-side boxplots comparing the empirical distributions of weight loss for the two diets.
Consider using the boxplot() function from matplotlib.pyplot:
import pandas as pd
import matplotlib.pyplot as plt
diet = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/diet.csv")
# compute weight loss
diet["weight.loss"] = diet["initial.weight"] - diet["final.weight"]
# Variable of interest
dietA = diet["weight.loss"][diet["diet.type"]=="A"]
dietC = diet["weight.loss"][diet["diet.type"]=="C"]
# Side-by-side boxplot
plt.figure(figsize=(6,4))
plt.boxplot([dietA, dietC],
            labels=["Diet A", "Diet C"],  # renamed tick_labels in Matplotlib 3.9+
            patch_artist=True,  # color the boxes
            boxprops=dict(facecolor="lightblue", color="blue"),
            medianprops=dict(color="red"))
plt.ylabel("Weight loss (kg)")
plt.title("Comparison of Weight Loss for Diets A and C")
plt.show()
Question 2
Which test should we use?
Based on the graph you produced, which test appears most appropriate for testing whether diet C leads to a larger average weight loss than diet A?
Question 3
How to perform the test you selected?
Complete the code below to test whether diet C leads to a larger average weight loss than diet A.
Consider using the mannwhitneyu() function from scipy.stats. Make sure to specify the correct alternative hypothesis.
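In mannwhitneyu(), the direction of the alternative is stated relative to the first sample passed to the function. The sketch below illustrates this on synthetic data only; the arrays low and high, their parameters, and the seed are hypothetical and are not the diet data.

```python
import numpy as np
from scipy import stats

# Synthetic, hypothetical data: "low" is drawn with a smaller mean than "high".
rng = np.random.default_rng(0)
low = rng.normal(loc=2.0, scale=2.0, size=30)
high = rng.normal(loc=6.0, scale=2.0, size=30)

# alternative="less": H1 says the first sample tends to be smaller than the second.
_, p_less = stats.mannwhitneyu(low, high, alternative="less")

# alternative="greater": H1 says the first sample tends to be larger than the second.
_, p_greater = stats.mannwhitneyu(low, high, alternative="greater")

print("p-value (less):   ", round(p_less, 4))
print("p-value (greater):", round(p_greater, 4))
```

With the smaller group passed first, only alternative="less" should give a small p-value. Swapping the order of the two samples would require swapping the alternative as well.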
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
diet = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/diet.csv")
# compute weight loss
diet["weight.loss"] = diet["initial.weight"] - diet["final.weight"]
# Variable of interest
dietA = diet["weight.loss"][diet["diet.type"]=="A"]
dietC = diet["weight.loss"][diet["diet.type"]=="C"]
# Wilcoxon rank-sum test (Mann–Whitney U)
stat, p_value = stats.mannwhitneyu(dietA, dietC, alternative ="less")
print("Wilcoxon statistic:", round(stat, 4))
print("p-value:", round(p_value, 4))
Question 4
What can we conclude?
Based on the test you performed and considering a significance level of \(5\%\), we can conclude that:
Question 5
Is diet A really so bad?
A consultant hired by the company that promotes diet A is unhappy with your analysis, as it indicates that diet C (promoted by a competing firm) is better in terms of weight loss. The consultant therefore constructs the following argument: to claim that one diet is more effective than another in terms of weight loss, the average difference should be larger than one kg; otherwise the statistical difference is meaningless. Below we perform the test that the consultant is interested in:
Consider using the mannwhitneyu() function from scipy.stats. In Python, scipy.stats.mannwhitneyu() has no argument for specifying the hypothesized difference between the groups, but you can reproduce the same test by shifting one sample by that hypothesized difference.
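The shifting idea can be sketched on synthetic data as follows. The helper shifted_mwu, the arrays group_a and group_c, and all numeric values are hypothetical and chosen for illustration only; this is a minimal sketch, not the diet analysis itself.

```python
import numpy as np
from scipy import stats

# Synthetic, hypothetical weight losses (kg) for two groups.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=3.0, scale=2.0, size=40)
group_c = rng.normal(loc=5.0, scale=2.0, size=40)

def shifted_mwu(x, y, delta):
    """One-sided rank-sum test of whether y exceeds x by more than delta.

    Adding delta to x and testing "x + delta is less than y" is the same
    as asking whether y is larger than x by more than delta.
    """
    return stats.mannwhitneyu(x + delta, y, alternative="less")

stat0, p0 = shifted_mwu(group_a, group_c, 0.0)  # no hypothesized difference
stat1, p1 = shifted_mwu(group_a, group_c, 1.0)  # hypothesized difference of 1 kg

# Equivalent formulation: subtract the difference from the other sample.
stat_alt, p_alt = stats.mannwhitneyu(group_a, group_c - 1.0, alternative="less")

print(f"delta = 0: U = {stat0}, p-value = {p0:.4f}")
print(f"delta = 1: U = {stat1}, p-value = {p1:.4f}")
print(f"delta = 1 (shifting the other sample): p-value = {p_alt:.4f}")
```

Raising the hypothesized difference can only increase the U statistic of the shifted sample, so the p-value for this one-sided test cannot decrease as delta grows.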
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
diet = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/diet.csv")
# compute weight loss
diet["weight.loss"] = diet["initial.weight"] - diet["final.weight"]
# Variable of interest
dietA = diet["weight.loss"][diet["diet.type"]=="A"]
dietC = diet["weight.loss"][diet["diet.type"]=="C"]
stat, p_value = stats.mannwhitneyu(dietA+1, dietC, alternative="less")
print(f"Mann-Whitney U test statistic: {stat}")
print(f"P-value: {p_value}")
Question 6
Based on this test, we obtain a p-value of \(5.747\%\) (which could make you think that the consultant picked one kg for a good reason…🤔). Therefore, the consultant claims that it is statistically proven that the two diets are equally effective in terms of weight loss. What do you think of this argument?
Question 7
The competing company that promotes diet C is unhappy about the claims made by the consultant, so they hire their own consultant, who selects a difference of \(950\) grams. Please complete the code for the test this second consultant wants to perform:
Consider using the mannwhitneyu() function from scipy.stats. In Python, scipy.stats.mannwhitneyu() has no argument for specifying the hypothesized difference between the groups, but you can reproduce the same test by shifting one sample by that hypothesized difference.
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
diet = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/diet.csv")
# compute weight loss
diet["weight.loss"] = diet["initial.weight"] - diet["final.weight"]
# Variable of interest
dietA = diet["weight.loss"][diet["diet.type"]=="A"]
dietC = diet["weight.loss"][diet["diet.type"]=="C"]
stat, p_value = stats.mannwhitneyu(dietA+0.95, dietC, alternative="less")
print(f"Mann-Whitney U test statistic: {stat}")
print(f"P-value: {p_value}")
Question 8
Then, the (second) consultant claims that diet C leads to an average weight loss that is significantly larger, by \(950\) grams, than that of diet A. What do you think of this argument?
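The two consultants' tests differ only in the hypothesized difference. To see how sensitive the p-value is to that choice, the sketch below sweeps a grid of hypothesized differences on synthetic data. The arrays, the grid deltas, and the seed are hypothetical; the numbers it prints say nothing about the actual diet data.

```python
import numpy as np
from scipy import stats

# Synthetic, hypothetical weight losses (kg) for two groups.
rng = np.random.default_rng(2)
group_a = rng.normal(loc=3.0, scale=2.0, size=40)
group_c = rng.normal(loc=5.0, scale=2.0, size=40)

# Grid of hypothesized differences (kg), from 0 to 2 in steps of 0.25.
deltas = np.arange(0.0, 2.01, 0.25)

# One-sided shifted rank-sum test for each hypothesized difference.
p_values = [
    stats.mannwhitneyu(group_a + d, group_c, alternative="less").pvalue
    for d in deltas
]

for d, p in zip(deltas, p_values):
    print(f"hypothesized difference = {d:.2f} kg -> p-value = {p:.4f}")
```

Because the p-value moves smoothly with the hypothesized difference, a threshold chosen after looking at the data can be placed on either side of the \(5\%\) level, which is worth keeping in mind when assessing both consultants' arguments.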