Exercises: abundance data

In this section, we use data from the Environmental Data Initiative (EDI) on Coastal Giant Salamanders collected in the HJ Andrews Experimental Forest, Oregon. Measurements were taken in two sections of Mack Creek: one located in old-growth forest (OG) and the other in a clear-cut area (CC) harvested in 1963. The data are extracted from the and_vertebrates dataset of the R package lterdatasampler.

Forest harvesting can influence stream conditions, which may affect salamander growth and body size. A scientist is interested in testing whether there is a difference in the average weight of Coastal Giant Salamanders between OG and CC. We will use a two-sample t-test to assess whether salamander weight differs significantly between the two sites.

Question 1

Define \(\mu_{OG}\) and \(\mu_{CC}\) as the expected weights of salamanders from the old-growth forest and from the clear-cut area, respectively. Which hypotheses should you consider to assess the previously mentioned claim?

\( H_0: \mu_{OG} = \mu_{CC} \), \( H_a: \mu_{OG} \neq \mu_{CC} \)
\( H_0: \mu_{OG} \neq \mu_{CC} \), \( H_a: \mu_{OG} = \mu_{CC} \)
\( H_0: \mu_{OG} = \mu_{CC} \), \( H_a: \mu_{OG} > \mu_{CC} \)
\( H_0: \mu_{OG} = \mu_{CC} \), \( H_a: \mu_{OG} < \mu_{CC} \)

Question 2

As previously mentioned, we want to compare the two locations (OG and CC) to assess whether the average size of Coastal Giant Salamanders differs between forest types. Using this information, please complete the code below to construct the variables of interest:

import pandas as pd
import matplotlib.pyplot as plt

df_salamander = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/salamander_weights.csv")
df_salamander.head()

# Variable of interest
weight_CC = df_salamander["weight_g"][df_salamander["section"]=="CC"]
weight_OG = df_salamander["weight_g"][df_salamander["section"]=="OG"]
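
The selection above uses a boolean mask on the section column. As a minimal sketch, the same subsetting can be written with .loc, and group sizes and means can be checked in one call. The DataFrame toy below is made up for illustration and does not contain the real salamander measurements; only the column names section and weight_g are taken from the exercise.

```python
import pandas as pd

# Made-up toy data for illustration only (not the real salamander measurements)
toy = pd.DataFrame({
    "section": ["CC", "OG", "CC", "OG", "CC"],
    "weight_g": [4.2, 7.9, 5.1, 6.3, 3.8],
})

# Boolean-mask selection, written as in the exercise code
mask_cc = toy["weight_g"][toy["section"] == "CC"]

# Equivalent selection with .loc (row condition, then column label)
loc_cc = toy.loc[toy["section"] == "CC", "weight_g"]

# Number of observations and mean weight per section
summary = toy.groupby("section")["weight_g"].agg(["count", "mean"])
print(summary)
```

Both selections return the same Series; .loc is often preferred because it picks rows and column in a single indexing step.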

Question 3

Next, we wish to visualize the empirical distribution of the data in order to select a suitable test. Complete the code below to construct boxplots comparing the two groups:

import pandas as pd
import matplotlib.pyplot as plt

df_salamander = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/salamander_weights.csv")
df_salamander.head()

# Variable of interest
weight_CC = df_salamander["weight_g"][df_salamander["section"]=="CC"]
weight_OG = df_salamander["weight_g"][df_salamander["section"]=="OG"]

# Side-by-side boxplot
plt.figure(figsize=(6,4))
plt.boxplot([weight_CC, weight_OG],
            labels=["CC", "OG"],
            patch_artist=True,  # color the boxes
            boxprops=dict(facecolor="lightblue", color="blue"),
            medianprops=dict(color="red"))

plt.ylabel("Weight (g)")
plt.show()
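
Points drawn beyond the whiskers of a boxplot are, with matplotlib's default settings, observations lying more than 1.5 times the interquartile range away from the quartiles. The sketch below counts such points numerically. The helper count_boxplot_outliers and the values in toy_weights are hypothetical and only illustrate the rule; they are not part of the exercise data.

```python
import numpy as np

def count_boxplot_outliers(values, whis=1.5):
    """Count points lying more than whis * IQR beyond the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return int(np.sum((values < lower) | (values > upper)))

# Made-up weights with one clearly extreme value
toy_weights = [2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 40.0]
n_outliers = count_boxplot_outliers(toy_weights)
print(n_outliers)
```

Applied to weight_CC and weight_OG, a helper like this gives a numerical complement to the visual impression from the boxplots.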

Question 4

One of your colleagues runs the following test to compare the two locations (OG and CC):

import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
df_salamander = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/salamander_weights.csv")

# Variable of interest
weight_CC = df_salamander["weight_g"][df_salamander["section"]=="CC"]
weight_OG = df_salamander["weight_g"][df_salamander["section"]=="OG"]
# Perform independent two-sample t-test (assuming unequal variances)
t_stat, p_val = stats.ttest_ind(weight_CC, weight_OG, equal_var=False)

print(f"T-statistic: {t_stat:.3f}")

T-statistic: 1.078

print(f"p-value: {p_val:.3f}")

p-value: 0.282

He concludes that there is no evidence from the data to suggest that the average weight of Coastal Giant Salamanders differs between the two locations. What do you think of this procedure?

The procedure is appropriate. The t-test is robust to outliers and non-normality.
The t-test may not be appropriate because there are outliers.
The t-test is wrong because it assumes equal variances, which is not the case here.
The test should compare the variances instead of the means.
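
To see what stats.ttest_ind computes when equal_var=False (Welch's t-test), the statistic can be reproduced by hand. The sketch below does this on synthetic normal samples, not on the salamander data; the function name welch_t and the simulated arrays x and y are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def welch_t(a, b):
    """Welch's t statistic, degrees of freedom and two-sided p-value."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard errors of the two sample means
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    p = 2 * stats.t.sf(abs(t), df)
    return t, df, p

# Synthetic samples with different spreads (illustration only)
rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=30)
y = rng.normal(loc=11.0, scale=5.0, size=45)

t_manual, df_manual, p_manual = welch_t(x, y)
t_scipy, p_scipy = stats.ttest_ind(x, y, equal_var=False)
print(f"manual: t={t_manual:.3f}, p={p_manual:.3f}")
print(f"scipy:  t={t_scipy:.3f}, p={p_scipy:.3f}")
```

The statistic depends on the data only through the sample means and variances, which is worth keeping in mind when looking at the boxplots from Question 3.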

Question 5

Based on the empirical distribution of salamander weights at the two locations, which of the following tests appears to be the most suitable?

Student's t-test.
Welch's t-test.
Wilcoxon test.

Question 6

Complete the code below to test whether the average size of Coastal Giant Salamanders differs between the two locations (OG and CC) using the appropriate test you selected in Question 5.

import pandas as pd
import scipy.stats as stats

df_salamander = pd.read_csv("https://raw.githubusercontent.com/ELSTE-Master/Data-Science/main/Data/salamander_weights.csv")
df_salamander.head()

# Variable of interest
weight_CC = df_salamander["weight_g"][df_salamander["section"]=="CC"]
weight_OG = df_salamander["weight_g"][df_salamander["section"]=="OG"]

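# Two-sided Mann-Whitney U test (also known as the Wilcoxon rank-sum test)
stat, p_value = stats.mannwhitneyu(weight_CC, weight_OG, alternative="two-sided")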
print(f"Mann-Whitney U test statistic: {stat}")
print(f"P-value: {p_value}")
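
The Mann-Whitney U statistic uses only the ranks of the pooled observations, not their actual values. As a sketch, assuming a recent SciPy version in which the reported statistic corresponds to the first sample, U can be recomputed from the rank sum of that sample. The helper u_from_ranks and the values in toy_x and toy_y are made up for illustration.

```python
import numpy as np
from scipy import stats

def u_from_ranks(x, y):
    """Mann-Whitney U for sample x, computed from the pooled ranks."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Rank all observations together (ties receive average ranks)
    ranks = stats.rankdata(np.concatenate([x, y]))
    rank_sum_x = ranks[: x.size].sum()
    return rank_sum_x - x.size * (x.size + 1) / 2

# Made-up samples (illustration only)
toy_x = [1.2, 3.4, 2.2, 8.9, 4.1]
toy_y = [2.8, 5.6, 7.3, 9.4, 6.0, 10.2]

u_manual = u_from_ranks(toy_x, toy_y)
u_scipy = stats.mannwhitneyu(toy_x, toy_y, alternative="two-sided").statistic
print(u_manual, u_scipy)
```

Because only ranks enter the statistic, replacing the largest observation by an even larger one leaves U unchanged.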

Question 7

Based on the p-value you obtained, what can you conclude (assuming \(\alpha=0.05\))?

We accept the null hypothesis that the median sizes are equal between OG and CC.
We reject the null hypothesis and conclude that the median salamander size differs between OG and CC.
We conclude that salamanders from OG are heavier than those from CC.
We cannot conclude anything because the test only works for normal data.
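
The comparison between a p-value and the significance level can be sketched as a small helper. The function decide is hypothetical, and the convention used here (reject when the p-value is strictly below \(\alpha\)) is one common choice. The example value 0.282 is the p-value printed in Question 4, not the result of Question 6.

```python
def decide(p_value, alpha=0.05):
    """Return the test decision at significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    return "reject H0" if p_value < alpha else "fail to reject H0"

# p-value reported for the Welch t-test in Question 4
decision = decide(0.282)
print(decision)
```

Note that the helper returns "fail to reject H0" rather than "accept H0": a large p-value indicates a lack of evidence against the null hypothesis, not evidence in its favor.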