
Exercises abundance data
In this section, we use data from the Environmental Data Initiative (EDI) on Coastal Giant Salamanders collected in the HJ Andrews Experimental Forest, Oregon. Measurements were taken in two sections of Mack Creek: one located in old-growth forest (OG) and the other in a clear-cut area (CC) harvested in 1963. The dataset is extracted from the and_vertebrates dataset of the R package lterdatasampler.
Forest harvesting can influence stream conditions, which may affect salamander growth and body size. A scientist is interested in testing whether there is a difference in average weight of Coastal Giant Salamanders between OG and CC. We will use a two-sample t-test to assess whether salamander size differs significantly between the two sites.
Question 1
Let \(\mu_{OG}\) and \(\mu_{CC}\) denote the expected weight of salamanders from the old-growth forest and from the clear-cut area, respectively. Which hypotheses should you consider to assess the claim stated above?
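As a point of reference, and assuming the claim is read as a two-sided one ("is there a difference", with no direction specified), one natural formulation of the hypotheses is:

\[
H_0: \mu_{OG} = \mu_{CC} \quad \text{versus} \quad H_a: \mu_{OG} \neq \mu_{CC}
\]

A one-sided alternative would only be appropriate if the direction of the difference had been specified before looking at the data.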
Question 2
As previously mentioned, we want to compare the two locations (OG and CC) to assess whether the average size of Coastal Giant Salamanders differs between forest types. Using this information, please complete the code below to construct the variables of interest:
Select the weights corresponding to each section using boolean indexing.
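Boolean indexing can be sketched on a small made-up table before applying it to the real data. The DataFrame `toy` and its values below are hypothetical; only the column names `section` and `weight_g` mirror the dataset used in this exercise.

```python
import pandas as pd

# Hypothetical toy data mimicking the two columns used in this exercise
toy = pd.DataFrame({
    "section": ["CC", "OG", "CC", "OG", "CC"],
    "weight_g": [10.5, 8.2, 12.1, 9.9, 7.4],
})

# Boolean mask: True where the row belongs to the clear-cut section
is_cc = toy["section"] == "CC"

# Indexing a column with the mask keeps only the rows where the mask is True
toy_weight_cc = toy["weight_g"][is_cc]
# ~ negates the mask; this selects OG only because the toy data has two sections
toy_weight_og = toy["weight_g"][~is_cc]

print(toy_weight_cc.tolist())
print(toy_weight_og.tolist())
```

The same pattern, with the mask written inline, is what the exercise code uses on `df_salamander`.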
import pandas as pd
import matplotlib.pyplot as plt
df_salamander = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/salamander_weights.csv")
df_salamander.head()
# Variable of interest
weight_CC = df_salamander["weight_g"][df_salamander["section"]=="CC"]
weight_OG = df_salamander["weight_g"][df_salamander["section"]=="OG"]
Question 3
Next, we wish to visualize the empirical distribution of the data in order to select a suitable test. Complete the code below to construct boxplots comparing the two groups:
Consider using the function boxplot() from matplotlib.pyplot.
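To help read the plot, the quantities a boxplot displays can be computed by hand. The sketch below uses made-up numbers (not the salamander data) and assumes matplotlib's default whisker rule of 1.5 times the interquartile range.

```python
import numpy as np

# Hypothetical right-skewed sample with one extreme value
toy_weights = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

# The box spans the first to third quartile; the line inside is the median
q1, median, q3 = np.percentile(toy_weights, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = toy_weights[(toy_weights < lower_fence) | (toy_weights > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}], outliers: {outliers.tolist()}")
```

Many points above the upper fence, or a median sitting far from the center of the box, suggest a skewed distribution, which matters when choosing a test.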
import pandas as pd
import matplotlib.pyplot as plt
df_salamander = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/salamander_weights.csv")
df_salamander.head()
# Variable of interest
weight_CC = df_salamander["weight_g"][df_salamander["section"]=="CC"]
weight_OG = df_salamander["weight_g"][df_salamander["section"]=="OG"]
# Side-by-side boxplot
plt.figure(figsize=(6,4))
plt.boxplot([weight_CC, weight_OG],
            labels=["CC", "OG"],
            patch_artist=True,  # color the boxes
            boxprops=dict(facecolor="lightblue", color="blue"),
            medianprops=dict(color="red"))
plt.ylabel("Weight (g)")
plt.show()
Question 4
One of your colleagues ran the following test to compare the two locations (OG and CC):
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
df_salamander = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/salamander_weights.csv")
# Variable of interest
weight_CC = df_salamander["weight_g"][df_salamander["section"]=="CC"]
weight_OG = df_salamander["weight_g"][df_salamander["section"]=="OG"]
# Perform independent two-sample t-test (assuming unequal variances)
t_stat, p_val = stats.ttest_ind(weight_CC, weight_OG, equal_var=False)
print(f"T-statistic: {t_stat:.3f}")  # Output: T-statistic: 1.078
print(f"p-value: {p_val:.3f}")  # Output: p-value: 0.282
He concludes that there is no evidence from the data to suggest that the average weight of Coastal Giant Salamanders differs between the two locations. What do you think of this procedure?
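To judge the procedure, it helps to recall what `ttest_ind(..., equal_var=False)` computes, namely Welch's t-test. The sketch below reproduces it by hand on simulated normal samples; the samples `a` and `b` and their parameters are made up for illustration and are unrelated to the salamander data.

```python
import numpy as np
from scipy import stats

# Hypothetical samples with different sizes and variances
rng = np.random.default_rng(0)
a = rng.normal(10.0, 2.0, size=30)
b = rng.normal(11.0, 4.0, size=45)

# Squared standard error of each sample mean (sample variance uses ddof=1)
se2_a = a.var(ddof=1) / a.size
se2_b = b.var(ddof=1) / b.size

# Welch's t-statistic: difference of means over the combined standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)

# Welch-Satterthwaite approximation of the degrees of freedom
dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (a.size - 1) + se2_b ** 2 / (b.size - 1))

# Two-sided p-value from the Student t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(f"manual: t={t_manual:.4f}, p={p_manual:.4f}")
print(f"scipy:  t={t_scipy:.4f}, p={p_scipy:.4f}")
```

The statistic depends only on sample means and variances, both of which are sensitive to skewness and extreme values, so the shape of the data seen in the boxplots is relevant when assessing this test.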
Question 5
Based on the empirical distribution of salamander weights at the two locations, which of the following tests appears to be the most suitable?
Question 6
Complete the code below to test whether the average size of Coastal Giant Salamanders differs between the two locations (OG and CC) using the appropriate test you selected in Question 5.
Consider using the function mannwhitneyu() from scipy.stats.
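As background, the Mann-Whitney U statistic is based on ranks rather than raw values. The sketch below computes it by hand on two small made-up samples and compares it with SciPy; it assumes a recent SciPy version, where the returned statistic is the U value of the first sample.

```python
import numpy as np
from scipy import stats

# Hypothetical samples without ties
x = np.array([1.1, 2.3, 3.7, 9.4])
y = np.array([0.5, 4.2, 5.8, 6.1, 7.9])

# Rank all observations together (rank 1 is the smallest value)
ranks = stats.rankdata(np.concatenate([x, y]))

# Sum of the ranks belonging to the first sample
rank_sum_x = ranks[: x.size].sum()

# U for the first sample: its rank sum minus the smallest rank sum possible
u_manual = rank_sum_x - x.size * (x.size + 1) / 2

result = stats.mannwhitneyu(x, y, alternative="two-sided")
print(f"manual U = {u_manual}, scipy U = {result.statistic}")
```

Because only ranks enter the calculation, a single very heavy individual affects the statistic no more than any other largest observation would.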
import pandas as pd
import scipy.stats as stats
df_salamander = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/salamander_weights.csv")
df_salamander.head()
# Variable of interest
weight_CC = df_salamander["weight_g"][df_salamander["section"]=="CC"]
weight_OG = df_salamander["weight_g"][df_salamander["section"]=="OG"]
stat, p_value = stats.mannwhitneyu(weight_CC, weight_OG)
print(f"Mann-Whitney U test statistic: {stat}")
print(f"P-value: {p_value}")
Question 7
Based on the p-value you obtained, what can you conclude (assuming \(\alpha=0.05\))?
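The decision rule itself can be written down as a short sketch. The helper `decide` is hypothetical, and the example call uses the Welch p-value of 0.282 reported in Question 4, since the Mann-Whitney p-value depends on running the code above.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the test decision at significance level alpha."""
    if p_value < alpha:
        return "reject H0"
    # A large p-value is absence of evidence against H0, not proof of H0
    return "fail to reject H0"


# p-value reported for the Welch t-test in Question 4
print(decide(0.282))
```

Failing to reject the null hypothesis does not show that the two locations are identical; it only means the data do not provide sufficient evidence of a difference at the chosen level.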