The Central Limit Theorem
Building on our candy example: earlier we defined the population parameter of interest as the weight of all 100 candies in the bag, and since we had the whole population, we simply calculated that parameter directly.
In the last activity, we discovered that random sampling gave us a better estimate of the population parameter than a biased sampling approach did.
Now, let’s dig into the concept of a sampling distribution.
The sampling distribution is the distribution of a statistic (such as a sample mean, a sum, or a proportion) calculated from multiple random samples of the same size, taken from the same population.
Let’s create a sampling distribution by simulating 5000 random samples of size 25.
Replace XXX with the appropriate value.
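If it helps to see what this simulation looks like once the value is filled in, here is a minimal sketch in R. The `candy_weights` vector and its values are hypothetical stand-ins for the actual bag of 100 candies used in this activity.

```r
# Hypothetical population: weights (in grams) of all 100 candies in the bag
set.seed(123)
candy_weights <- round(runif(100, min = 2, max = 12), 1)

true_weight <- sum(candy_weights)   # the population parameter: total bag weight

# Draw 5000 random samples of size 25; each sample gives an estimated bag weight
est_weights <- replicate(
  5000,
  mean(sample(candy_weights, size = 25)) * length(candy_weights)
)

# The collection of 5000 estimates is the sampling distribution
hist(est_weights, breaks = 50,
     main = "Sampling distribution of the estimated bag weight (n = 25)",
     xlab = "Estimated bag weight (g)")
abline(v = true_weight, col = "red", lwd = 2)
```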
Centered on the Population Parameter: The sampling distribution tends to cluster around the true population parameter. For a statistic like the sample mean, the average of the estimates across many random samples is essentially equal to the true parameter.
Variability: The spread of the sampling distribution reflects how much the statistic fluctuates around the population value.
Normality (Central Limit Theorem): With a sufficiently large sample size, the sampling distribution is approximately normal, regardless of the population’s shape.
Sample Size Effect: Larger samples reduce this variability, making the sample statistic a more accurate estimate of the population parameter (a quick numerical check of these properties follows this list).
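As a quick numerical check of these properties, again using the hypothetical `candy_weights` population from the sketch above:

```r
# Property 1: the sampling distribution is centred on the population parameter
est_weights <- replicate(5000, mean(sample(candy_weights, 25)) * length(candy_weights))
mean(est_weights)      # close to ...
sum(candy_weights)     # ... the true bag weight

# Property 4: larger samples give a tighter sampling distribution
sd(replicate(5000, mean(sample(candy_weights, 5))  * length(candy_weights)))   # wider
sd(replicate(5000, mean(sample(candy_weights, 50)) * length(candy_weights)))   # narrower
```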
To guide our explorations, click Run Code on the code chunk below to create a function that allows us to explore how the sampling distribution changes as we vary the sample size.
Use the function to vary the sample size. See what happens as you change the sample size to:
2, 5, 10, 20, 30, 50
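In case it helps to see one way such a function could be written, here is a minimal sketch; the function name `plot_sampling_distribution` and the `candy_weights` population are hypothetical, and the code chunk referenced earlier defines the version actually used in this activity.

```r
# Sketch: simulate the sampling distribution of the estimated bag weight
# for a chosen sample size and plot its density
plot_sampling_distribution <- function(population, sample_size, n_samples = 5000) {
  estimates <- replicate(
    n_samples,
    mean(sample(population, size = sample_size)) * length(population)
  )
  plot(density(estimates),
       main = paste("Estimated bag weight, sample size =", sample_size),
       xlab = "Estimated bag weight (g)")
  abline(v = sum(population), col = "red", lwd = 2)
  invisible(estimates)
}

# For example:
plot_sampling_distribution(candy_weights, sample_size = 2)
plot_sampling_distribution(candy_weights, sample_size = 30)
```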
Convergence to Normal Distribution: As the sample size increases, the distribution of the sample estimates will increasingly resemble a normal distribution, even if the underlying population distribution is not normal. This is a direct consequence of the Central Limit Theorem (CLT), which states that with a large enough sample size, the distribution of the sample estimates across many, many random samples approaches normality.
Less Variability in Sample Estimates: With larger sample sizes, the variability (spread) of the sample estimates decreases. This means the density plot of the estimated bag weights becomes more concentrated around the true population weight, resulting in a sharper, narrower distribution.
The Central Limit Theorem (CLT) is a fundamental principle in probability theory and statistics that describes the behavior of sample estimates drawn from a population. It states that:
If you repeatedly take independent random samples of a given size from a population, regardless of the population’s underlying distribution (whether it’s normal, skewed, or something else), the distribution of the sample estimates will approach a normal (Gaussian) distribution as the sample size becomes large.
This holds true even if the original population is not normally distributed, provided the sample size is sufficiently large.
The mean of the sampling distribution of the sample estimates will equal the population parameter, and the standard deviation of the sampling distribution (often called the standard error) will equal the population standard deviation (\(\sigma\)) divided by the square root of the sample size (\(n\)):
\[ \text{Standard Error (SE)} = \frac{\sigma}{\sqrt{n}} \]
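As a quick simulation check of this relationship, using the hypothetical `candy_weights` population from the earlier sketches (and sampling with replacement so the formula applies exactly):

```r
set.seed(123)
candy_weights <- round(runif(100, min = 2, max = 12), 1)

n     <- 25
sigma <- sqrt(mean((candy_weights - mean(candy_weights))^2))   # population SD

# Simulated SE: spread of the sample means across many random samples
sample_means <- replicate(10000, mean(sample(candy_weights, n, replace = TRUE)))
sd(sample_means)      # simulated standard error

sigma / sqrt(n)       # theoretical standard error from the formula above
```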
The CLT is crucial because it allows statisticians to make inferences about a population based on sample data, even when the population distribution is unknown. It is the theoretical foundation for many statistical methods, including confidence intervals and hypothesis tests, because it justifies using the normal distribution to describe the behavior of sample means, sample sums, sample proportions, and many other statistics.
Let’s define an “extreme” sample in our candy example as one whose sample statistic falls 50% or more away from the true population parameter.
The graph below presents the proportion of random samples that are classified as “extreme” as a function of sample size.
Very small sample sizes can lead to high variability in the sample estimates, increasing the likelihood of drawing extreme samples that deviate significantly from the true population parameter. As demonstrated in this candy example, when the sample size is small, a large proportion of the samples produce estimates that are 50% or more away from the true population weight. This variability decreases as the sample size increases, but for small sample sizes, the sampling distribution is much wider and can lead to misleading conclusions about the population.
Therefore, using small sample sizes can be risky because they often result in highly inaccurate estimates. Larger sample sizes provide more stable and reliable estimates that are closer to the population parameter, reducing the likelihood of extreme results and enhancing the precision of our inferences about the population.
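For reference, a proportion like the one graphed above could be computed along these lines. This is a sketch using the hypothetical `candy_weights` population from the earlier sketches, not necessarily how the graph in this activity was produced.

```r
# Sketch: estimate the proportion of "extreme" samples (estimated bag weight
# off by 50% or more from the true bag weight) for a range of sample sizes
set.seed(123)
candy_weights <- round(runif(100, min = 2, max = 12), 1)
true_weight   <- sum(candy_weights)

prop_extreme <- function(sample_size, n_samples = 5000) {
  estimates <- replicate(
    n_samples,
    mean(sample(candy_weights, size = sample_size)) * length(candy_weights)
  )
  mean(abs(estimates - true_weight) / true_weight >= 0.5)
}

sizes <- c(2, 5, 10, 20, 30, 50)
data.frame(sample_size = sizes,
           proportion_extreme = sapply(sizes, prop_extreme))
```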