Confidence Intervals
Tibbles print numbers with three significant digits by default. This activity needs more precision so we can match computed values against hand-calculated values, so we will increase the default to 7 significant digits.
Press Run Code on the code chunk below to set this for the session.
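The original code chunk is not reproduced here; a minimal version uses the `pillar.sigfig` option, which controls how many significant digits tibbles print:

```r
# Increase tibble printing to 7 significant digits for this session
options(pillar.sigfig = 7)
```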
Building on our candy example, earlier we defined the population parameter of interest as the total weight of all 100 candies in the bag. Since we have the whole population, we calculated the population parameter directly from the data.
We simulated the sampling distribution for this parameter (n = 25).
We then plotted the sampling distribution using a density plot.
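The simulation and density plot could be sketched as follows. This is an illustrative reconstruction, not the activity's exact code: the column name `weight`, the number of simulated samples (1000), and the seed are all assumptions; `bag_of_candies` is the dataset named later in the activity.

```r
library(dplyr)
library(purrr)
library(ggplot2)

set.seed(1234)  # illustrative seed for reproducibility

# Draw 1000 random samples of n = 25 (without replacement) from the bag.
# Each sample's estimated bag weight scales the sample mean up to 100 candies.
sampling_dist <- map_dfr(1:1000, \(i) {
  s <- slice_sample(bag_of_candies, n = 25)
  tibble(sample = i, est_weight = mean(s$weight) * 100)
})

# Density plot of the simulated sampling distribution
ggplot(sampling_dist, aes(x = est_weight)) +
  geom_density() +
  labs(x = "Estimated bag weight (g)")
```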
What sample estimates mark the 2.5th and 97.5th percentiles of the sampling distribution?
The quantile() function in R calculates the value(s) at specified percentiles of a given variable. A quantile represents the point in the data below which a certain percentage of the data falls. For the 2.5th percentile, it computes the value below which 2.5% of the estimated weights fall; similarly, for the 97.5th percentile, it calculates the value below which 97.5% of the estimated weights fall.
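Using the simulated sampling distribution (here assumed to be stored as `sampling_dist` with an `est_weight` column, as in the sketch above), the call would look like:

```r
# Values marking the 2.5th and 97.5th percentiles of the estimated weights
quantile(sampling_dist$est_weight, probs = c(0.025, 0.975))
```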
Let’s overlay our computed percentiles onto the graph of the sampling distribution. Here, we see that 95% of the random samples drawn from the population produce an estimated weight for the bag of candies between about 1119 grams and 2216 grams.
Using our simulated sampling distribution, we can also calculate the standard deviation of the estimated weights.
The standard deviation of the sampling distribution is called the standard error (SE). The standard error tells us how much we expect our sample estimates to vary from sample to sample.
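Because the standard error is just the standard deviation of the simulated estimates, it can be computed directly (again assuming the names `sampling_dist` and `est_weight`):

```r
# Standard error = SD of the sampling distribution of estimates
sd(sampling_dist$est_weight)
```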
With the standard error, we can also compute an estimate of the middle 95% of the distribution. We expect 95% of samples to produce an estimate within ~1.96 standard errors of the population parameter:
\[ 1623.9 \pm 1.96 \times 283.9 = (1067.5,\ 2180.3) \]

The same result can be computed using qnorm():
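qnorm() returns the quantiles of a normal distribution, so passing the mean and standard error from the sampling distribution gives the same interval:

```r
# Middle 95% of a normal distribution with this mean and SE
qnorm(c(0.025, 0.975), mean = 1623.9, sd = 283.9)
# ≈ 1067.5 and 2180.3
```

Note that qnorm() uses the exact critical value 1.959964 rather than the rounded 1.96, so the bounds may differ very slightly from the hand calculation.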
In our candy example, we have access to the whole population and we can sample from it to derive the sampling distribution.
But, that’s not typical. What if we only had access to one single sample?
Bootstrap resampling is a statistical technique used to estimate the variability (such as the standard error) of a sample statistic by generating many random resamples from the original data. Here’s how it works:
Resampling with Replacement: From your original sample, you randomly draw new samples (called bootstrap resamples), but with replacement, meaning each data point can be chosen multiple times or not at all in any given resample. Each resample is the same size as your original sample.
Computing Statistics: For each bootstrap resample, you compute the statistic of interest (e.g., mean, sum, regression coefficient).
Repeating the Process: This process is repeated many times (often 1000s of resamples), resulting in a distribution of the computed statistic.
Estimating Variability: The variability of this bootstrap distribution provides an estimate of the standard error, confidence intervals, or other measures of uncertainty for the original sample statistic. It’s mimicking the sampling distribution.
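The four steps above can be sketched in R. This is a hedged illustration: `candy_sample` is an assumed name for the single sample of 25 candies, and the column name, resample count, and seed are all illustrative.

```r
library(dplyr)
library(purrr)

set.seed(1234)  # illustrative seed

# 1000 bootstrap resamples, each the same size as the original sample
boot_dist <- map_dfr(1:1000, \(i) {
  # Step 1: resample WITH replacement
  rs <- slice_sample(candy_sample, n = 25, replace = TRUE)
  # Step 2: compute the statistic of interest for this resample
  tibble(resample = i, est_weight = mean(rs$weight) * 100)
})  # Step 3: repeating 1000 times yields a bootstrap distribution

# Step 4: use the bootstrap distribution to estimate uncertainty
mean(boot_dist$est_weight)                              # bootstrap mean
sd(boot_dist$est_weight)                                # bootstrap SE
quantile(boot_dist$est_weight, probs = c(0.025, 0.975)) # middle 95%
```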
Notice here that some candies are included multiple times in this resample, while other candies are not included in this resample.
Using our bootstrap resamples, we can compute the mean, standard error, and the middle 95% of the distribution.
The interval between the 2.5th and 97.5th percentiles gives us a 95% confidence interval (CI) for the total weight of the 100 candies (our population parameter). This means that, based on the resampling process:
We can expect that 95% of confidence intervals constructed through the bootstrap method will contain the true population parameter. In other words, if you were to repeat the sampling process many times, approximately 95% of the resulting intervals would include the true parameter, while 5% would not.
This is a frequentist approach. Here, the probability is interpreted as the long-run frequency of events. It assumes that the parameter is fixed but unknown, and that data are random and come from repeated sampling. In frequentist statistics, all conclusions are based on the idea of how frequently certain events or outcomes would occur in repeated trials of the study.
When calculating a 95% confidence interval for a numeric variable, we use the Student’s t-distribution instead of the normal distribution. This is because the t-distribution accounts for the additional uncertainty in estimating the population standard deviation.
For a 95% CI, we need the 2.5th and 97.5th percentiles of the Student's t-distribution with df = 24. When estimating the mean we lose 1 degree of freedom; therefore, df = n − 1 = 25 − 1 = 24.
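The t critical values come from qt(), the quantile function of the t-distribution:

```r
# t critical values for a 95% CI with df = 25 - 1 = 24
qt(c(0.025, 0.975), df = 24)
# ≈ -2.0639 and 2.0639
```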
Now, we can compute the estimated bag weight and the corresponding standard error, then calculate the 95% CI.
Compute the Standard Error (SE):
\[ SE_{\text{bag}} = \frac{\text{SD}_{\text{bag}}}{\sqrt{n}} = \frac{1518.3}{\sqrt{25}} = 303.7 \]
Compute the 95% Confidence Interval (CI):
\[ CI = Weight_{\text{bag}} \pm t_{\alpha/2, \, df} \times SE_{\text{bag}} \]
Substituting the values:
\[ CI = 1548.4 \pm 2.0639 \times 303.7 \]
Lower and Upper Bounds of the CI:
\[ \text{Lower Bound} = 1548.4 - 2.0639 \times 303.7 = 922 \, \text{grams} \]
\[ \text{Upper Bound} = 1548.4 + 2.0639 \times 303.7 = 2175 \, \text{grams} \]
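The whole calculation can be sketched in R. The object name `candy_sample` and the column `weight` are assumptions; the comments tie each quantity back to the values in the text.

```r
n <- 25
est_weight <- mean(candy_sample$weight) * 100  # estimated bag weight (1548.4 in the text)
sd_bag     <- sd(candy_sample$weight) * 100    # SD scaled to the bag (1518.3 in the text)
se_bag     <- sd_bag / sqrt(n)                 # standard error (303.7)
t_crit     <- qt(0.975, df = n - 1)            # t critical value (2.0639)

est_weight + c(-1, 1) * t_crit * se_bag        # 95% CI: about 922 to 2175 grams
```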
Let’s imagine that using our theory-based/parametric approach, we drew 100 random samples of size 25 and with each sample we computed the 95% CI using the parametric approach. The plot below is a simulation of this proposal:
In this code we simulate taking 100 random samples of size 25 from a dataset called bag_of_candies, calculating the estimated total weight of 100 candies for each sample, and constructing a 95% confidence interval (CI) for each one. The mutate() function creates a new column, called "included", which records whether the true population bag weight (population_weight) falls within the calculated confidence interval. If it does, the sample is marked "yes"; otherwise, it is marked "no." This new variable is then used to color the CIs based on whether the true population parameter is included in the CI.
On average, we expect 95% of the 95% CIs to include the true population parameter.
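A sketch of the simulation just described, under the same assumptions as the earlier snippets (column name `weight`, seed, and plotting choices are illustrative; `bag_of_candies` and `population_weight` are the names used in the text):

```r
library(dplyr)
library(purrr)
library(ggplot2)

set.seed(1234)  # illustrative seed

population_weight <- sum(bag_of_candies$weight)  # true population parameter

# 100 random samples of size 25, each with a parametric 95% CI
ci_sim <- map_dfr(1:100, \(i) {
  s   <- slice_sample(bag_of_candies, n = 25)
  est <- mean(s$weight) * 100
  se  <- sd(s$weight) * 100 / sqrt(25)
  tibble(sample = i,
         lower = est - qt(0.975, df = 24) * se,
         upper = est + qt(0.975, df = 24) * se)
}) |>
  # Does each CI include the true population parameter?
  mutate(included = if_else(lower <= population_weight &
                            population_weight <= upper, "yes", "no"))

# One vertical segment per CI, colored by whether it captured the parameter
ggplot(ci_sim, aes(x = sample, ymin = lower, ymax = upper, color = included)) +
  geom_linerange() +
  geom_hline(yintercept = population_weight, linetype = "dashed")
```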
If normality assumptions are not valid, or you're dealing with outliers or skewed distributions, the bootstrap is typically more reliable.
If normality assumptions are valid and sample sizes are large, the parametric approach will be more efficient.
Population Characteristics: Given the right skew of the population data, I would recommend drawing a larger sample to better capture the variability in the population, especially with the presence of larger candy weights.
Stratification: To account for potential differences in candy sizes, I would stratify by candy type (e.g., full-size candy, mid-size candy, small pieces). This means dividing the population into distinct groups (strata) based on candy size, then drawing samples from each group proportionally. Stratification ensures that all candy types are properly represented in the sample, which reduces bias and yields more accurate estimates of the total candy weight.
Bootstrap for Uncertainty: Due to the skewness and potential outliers in the data, I would use a bootstrap approach to estimate the standard error and construct the confidence interval (CI). Bootstrapping is more robust to non-normality and provides a more accurate representation of the uncertainty in the estimates.
Larger sample sizes help mitigate the effects of skewness and provide more reliable estimates.
Stratification improves representation across different candy types, leading to more accurate inferences.
Bootstrapping handles non-normal distributions well, making it a suitable choice for estimating variability and constructing CIs in this context.