PSY 652: Research Methods in Psychology I

The Central Limit Theorem

Kimberly L. Henry: kim.henry@colostate.edu

The population parameter

Building on our candy example: earlier, we defined the population parameter of interest as the total weight of all 100 candies in the bag. And since we have the whole population, we simply calculated the population parameter directly.
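To make this concrete, here is a minimal sketch in R (the course tutorials run R code chunks; the vector name `candy_weights` and the stand-in data below are hypothetical, for illustration only):

```r
# Hypothetical stand-in for the real data: the weight (in grams) of each
# of the 100 candies in the bag
set.seed(652)
candy_weights <- round(runif(100, min = 2, max = 12), 1)

# The population parameter: the total weight of all 100 candies
population_parameter <- sum(candy_weights)
population_parameter
```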

Simple random sampling

In the last activity, we discovered that we could do a better job of estimating the population parameter through random sampling than if we employed a biased sampling approach.
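As a quick sketch of what a single simple random sample looks like in R (assuming the hypothetical `candy_weights` vector from above):

```r
# Draw one simple random sample of 25 candies (without replacement)
one_sample <- sample(candy_weights, size = 25)

# Scale the sample total up to estimate the weight of the whole 100-candy bag
sum(one_sample) * (100 / 25)
```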

What is the sampling distribution?

Now, let’s dig into the concept of a sampling distribution.

The sampling distribution is the distribution of a statistic (like a sample mean, or a sum, or a proportion) calculated from multiple random samples of the same size, taken from the same population.

Simulate random sampling from the population

Let’s create a sampling distribution: we’ll simulate 5,000 random samples of size 25 and compute a parameter estimate from each.
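A sketch of how this simulation might look in R (again assuming the hypothetical `candy_weights` vector; the object name `estimates` is ours):

```r
# Draw 5,000 random samples of 25 candies and, from each, estimate the
# total weight of the 100-candy bag
estimates <- replicate(5000, {
  s <- sample(candy_weights, size = 25)
  sum(s) * (100 / 25)  # scale the sample total up to the full bag
})
```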

Create a density plot of the parameter estimates

Replace XXX with the appropriate value.
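For reference, one way such a density plot might be built with ggplot2 (a sketch assuming the `estimates` vector simulated above; it is not necessarily the exercise's intended answer, and the dashed line marking the true parameter is our addition):

```r
library(ggplot2)

# Density plot of the 5,000 estimated bag weights, with the true
# population parameter marked by a dashed vertical line
ggplot(data.frame(estimate = estimates), aes(x = estimate)) +
  geom_density(fill = "steelblue", alpha = 0.4) +
  geom_vline(xintercept = population_parameter, linetype = "dashed") +
  labs(x = "Estimated bag weight", y = "Density")
```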

Our sampling distribution

Some key properties of the sampling distribution

  • Centered on the Population Parameter: The sampling distribution clusters around the true population parameter. For an unbiased statistic such as the sample mean, the average of the estimates across all possible random samples equals the true parameter, whatever the sample size.

  • Variability: The spread of the sampling distribution reflects how much the statistic fluctuates around the population value.

  • Normality (Central Limit Theorem): With a sufficiently large sample size, the sampling distribution is approximately normal, regardless of the population’s shape.

  • Sample Size Effect: Larger samples reduce variability, making the sample statistic a more accurate estimate of the population parameter.

Let’s explore the Sample Size Effect

To guide our exploration, click Run Code on the code chunk below to create a function that lets us see how the sampling distribution changes as we vary the sample size.
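A hedged sketch of what such a function might look like (the name `explore_sample_size`, its arguments, and the base-R plotting are our assumptions; the tutorial's actual chunk may differ):

```r
# Hypothetical helper: simulate n_samples random samples of size n,
# estimate the bag's total weight from each, and plot the sampling distribution
explore_sample_size <- function(n, n_samples = 5000) {
  estimates <- replicate(n_samples, {
    s <- sample(candy_weights, size = n)
    sum(s) * (100 / n)  # scale up to the full 100-candy bag
  })
  plot(density(estimates),
       main = paste("Sampling distribution, n =", n),
       xlab = "Estimated bag weight")
  abline(v = population_parameter, lty = 2)  # mark the true parameter
}
```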

Explore!

Use the function to vary the sample size. See what happens as you change the sample size to:

2, 5, 10, 20, 30, 50
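For example, one way to step through those values with the hypothetical function sketched above:

```r
# Plot the sampling distribution at each sample size in turn
for (n in c(2, 5, 10, 20, 30, 50)) {
  explore_sample_size(n)
}
```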

A summary of the effect of sample size

What does changing sample size do?

  • Convergence to a Normal Distribution: As the sample size increases, the distribution of the sample estimates increasingly resembles a normal distribution, even if the underlying population distribution is not normal. This is a direct consequence of the Central Limit Theorem (CLT), which states that, with a large enough sample size, the distribution of sample estimates across many random samples approaches normality.

  • Less Variability in Sample Estimates: With larger sample sizes, the variability (spread) of the sample estimates decreases. The density plot of the estimated bag weights becomes more concentrated around the true population weight, producing a sharper, narrower distribution.

The Central Limit Theorem (CLT)

The Central Limit Theorem (CLT) is a fundamental principle in probability theory and statistics that describes the behavior of sample estimates drawn from a population. It states that:

  • If you repeatedly take independent random samples of a given size from a population, regardless of the population’s underlying distribution (whether it’s normal, skewed, or something else), the distribution of the sample estimates will approach a normal (Gaussian) distribution as the sample size becomes large.

  • This holds true even if the original population is not normally distributed, provided the sample size is sufficiently large.

  • The mean of the sampling distribution of the sample estimates will equal the population parameter, and the standard deviation of the sampling distribution (often called the standard error) will equal the population standard deviation (\(\sigma\)) divided by the square root of the sample size (\(n\)):

\[ \text{Standard Error (SE)} = \frac{\sigma}{\sqrt{n}} \]
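We can verify this formula numerically with a quick simulation (a self-contained sketch, independent of the candy data, using a deliberately skewed exponential population):

```r
set.seed(652)
population <- rexp(1e5, rate = 1/5)  # a clearly non-normal population
n <- 25

# Sampling distribution of the sample mean across 5,000 random samples
sample_means <- replicate(5000, mean(sample(population, size = n)))

sd(sample_means)          # empirical SD of the sampling distribution
sd(population) / sqrt(n)  # theoretical standard error, sigma / sqrt(n)
```

The two values should agree closely, and both shrink as \(n\) grows.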

Why care about the Central Limit Theorem?

The CLT is crucial because it allows statisticians to make inferences about a population based on sample data, even when the population distribution is unknown. It is the theoretical foundation for many statistical methods, including confidence intervals and hypothesis tests, because it justifies using the normal distribution to describe the behavior of sample means, sample sums, sample proportions, and many other statistics.

The potential danger of small sample sizes

Let’s define an “extreme” sample in our candy example as one that produces a sample statistic that is 50% or more away from the true population parameter.

How many samples are extreme given sample size?

The graph below presents the proportion of random samples that are classified as “extreme” as a function of sample size.
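A sketch of how such proportions might be computed (assuming the hypothetical `candy_weights` and `population_parameter` objects from earlier; the 50% threshold follows the definition above):

```r
sizes <- c(2, 5, 10, 20, 30, 50)

# For each sample size, the proportion of 5,000 random samples whose
# bag-weight estimate is 50% or more away from the true total weight
prop_extreme <- sapply(sizes, function(n) {
  estimates <- replicate(5000, sum(sample(candy_weights, size = n)) * (100 / n))
  mean(abs(estimates - population_parameter) / population_parameter >= 0.50)
})
data.frame(sizes, prop_extreme)
```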

Take home message

Very small sample sizes can lead to high variability in the sample estimates, increasing the likelihood of drawing extreme samples that deviate substantially from the true population parameter. As demonstrated in this candy example, when the sample size is small, a large proportion of the samples produce estimates that are 50% or more away from the true population weight. This variability decreases as the sample size increases, but for small sample sizes the sampling distribution is much wider and can lead to misleading conclusions about the population.

Therefore, using small sample sizes can be risky because they often result in highly inaccurate estimates. Larger sample sizes provide more stable and reliable estimates that are closer to the population parameter, reducing the likelihood of extreme results and enhancing the precision of our inferences about the population.