A webR tutorial
The potential danger of small sample sizes
Introduction
You and your colleague work for a professional polling firm. A presidential candidate has asked you to conduct a poll in very remote, rural areas of the United States. The candidate’s national polling shows 58% support, with a margin of error (MOE) of 3 percentage points (the MOE tells us how far the results of a poll could plausibly differ from the true support in the population). Polling in these remote areas is difficult because of logistical challenges and sparse populations, but the candidate wants assurance that they are doing well there too.
Simulate the population
For this example, we’ll assume that the true proportion of support for the candidate in these remote, rural areas is 55%. To begin, we’ll simulate a data frame representing all potential respondents in the remote, rural areas of interest.
Press Run Code on the code chunk below to create the population data frame.
This script simulates a population of 100,000 voters from remote, rural areas, with each voter having a 55% chance of supporting the candidate. The variable “SupportsCandidate” is binary: 1 indicates support and 0 indicates no support.
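A minimal sketch of what such a script could look like in base R. The seed value, the object name `population`, and the `VoterID` column are assumptions made for illustration; only the population size, the 55% support probability, and the `SupportsCandidate` variable come from the text.

```r
# Illustrative sketch: the seed, `population`, and `VoterID` are assumed names/values.
set.seed(2024)

N <- 100000      # population size stated in the text
p_true <- 0.55   # assumed true probability of support

# One row per voter; SupportsCandidate is 1 (support) or 0 (no support)
population <- data.frame(
  VoterID = seq_len(N),
  SupportsCandidate = rbinom(N, size = 1, prob = p_true)
)

head(population)
```

Because each voter's support is drawn at random, the share of 1s in the simulated population will be close to 0.55 but usually not exactly equal to it.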
The true proportion of remote rural voters who support the candidate is the population parameter of interest. Press Run Code to compute this proportion.
This script calculates the true proportion of remote rural voters who support the candidate.
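One way this calculation might be written, as a sketch: since `SupportsCandidate` is coded 0/1, its mean is the proportion of supporters. The population is re-created here so the block runs on its own; the seed and the names `population` and `true_prop` are assumptions.

```r
# Illustrative sketch: re-create the simulated population (seed and names are assumed).
set.seed(2024)
population <- data.frame(
  SupportsCandidate = rbinom(100000, size = 1, prob = 0.55)
)

# The mean of a 0/1 variable is the proportion of 1s
true_prop <- mean(population$SupportsCandidate)
true_prop
```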
Two sampling strategies
You and your colleague have two different approaches to conducting this poll. One of you is considering a small sample because sampling is so difficult, while the other is considering a larger sample because getting the poll right is critical. You decide to conduct a simulation study to understand how sample size affects the sampling distribution. In particular, you want to determine the proportion of “extreme” polls that could be drawn, that is, polls in which the estimated support deviates substantially from the true population proportion.
Within your pair, one of you should work through Scenario 1 (small sample) and the other through Scenario 2 (large sample).
Scenario 1
You decide to conduct a small poll with just 50 respondents due to the difficulty of reaching voters in these remote areas.
Simulating the Sampling Distribution:
You will simulate 1,000 small-sample polls (n = 50), calculate the estimated support in each poll, and create a plot of the sampling distribution.
You will also calculate the proportion of polls in which the estimated support deviates substantially from the true population parameter. We’ll define a substantial deviation as an estimate below 0.50 or above 0.60 (i.e., fewer than 50% or more than 60% of sampled participants support the candidate).
Press Run Code on the code chunk below to simulate the sampling distribution for small samples.
Using the 1,000 samples, you estimate the proportion of polls that produce an extreme estimate, defined as support below 0.50 or above 0.60.
In this simulation, 378 of the 1,000 random samples of size 50 (about 38%) produced an estimated proportion that would be classified as an “extreme sample” under your definition.
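The small-sample simulation can be sketched in base R as follows. This is an illustration under assumptions, not the tutorial's own chunk: the seed and all object names are hypothetical, and a different seed or implementation will give a count near, but not identical to, the one reported above.

```r
# Illustrative sketch: seed and object names are assumptions.
set.seed(2024)
population <- data.frame(
  SupportsCandidate = rbinom(100000, size = 1, prob = 0.55)
)

n_small <- 50     # respondents per poll
n_polls <- 1000   # number of simulated polls

# Each replicate draws one poll (without replacement) and
# returns the proportion of respondents who support the candidate
small_props <- replicate(
  n_polls,
  mean(sample(population$SupportsCandidate, size = n_small))
)

# Plot the sampling distribution; the dashed line marks the assumed true value
hist(small_props, breaks = 20,
     main = "Sampling distribution (n = 50)",
     xlab = "Estimated proportion of support")
abline(v = 0.55, lty = 2)

# Share of polls with an estimate below 0.50 or above 0.60
extreme_small <- mean(small_props < 0.50 | small_props > 0.60)
extreme_small
```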
Scenario 2
You decide to conduct a larger poll with 500 respondents. You believe this is important despite the difficulty in conducting the poll in these remote rural areas.
Simulating the Sampling Distribution:
You will simulate 1,000 polls of size 500, calculate the estimated support in each poll, and create a plot of the sampling distribution.
You will also calculate the proportion of polls in which the estimated support deviates substantially from the true population parameter. We’ll define a substantial deviation as an estimate below 0.50 or above 0.60 (i.e., fewer than 50% or more than 60% of sampled participants support the candidate).
Press Run Code on the code chunk below to simulate the sampling distribution for larger samples.
Using the 1,000 samples, you estimate the proportion of polls that produce an extreme estimate, defined as support below 0.50 or above 0.60.
In this simulation, 20 of the 1,000 random samples of size 500 (about 2%) produced an estimated proportion that would be classified as an “extreme sample” under your definition.
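The large-sample simulation can be sketched the same way in base R, changing only the poll size. As before, the seed and object names are assumptions, so the exact count of extreme polls will vary from run to run and from the figure reported above.

```r
# Illustrative sketch: seed and object names are assumptions.
set.seed(2024)
population <- data.frame(
  SupportsCandidate = rbinom(100000, size = 1, prob = 0.55)
)

n_large <- 500    # respondents per poll
n_polls <- 1000   # number of simulated polls

# Each replicate draws one poll (without replacement) and
# returns the proportion of respondents who support the candidate
large_props <- replicate(
  n_polls,
  mean(sample(population$SupportsCandidate, size = n_large))
)

# Plot the sampling distribution on the 0.3-0.8 range for easy
# comparison with a small-sample histogram
hist(large_props, breaks = 20, xlim = c(0.3, 0.8),
     main = "Sampling distribution (n = 500)",
     xlab = "Estimated proportion of support")
abline(v = 0.55, lty = 2)

# Share of polls with an estimate below 0.50 or above 0.60
extreme_large <- mean(large_props < 0.50 | large_props > 0.60)
extreme_large
```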
Discuss your results
Discussion Guide: Comparing Small and Large Sample Polls
Once you have both completed your scenarios, share your results with one another. Please use the following as a guide for additional discussion:
Impact of Sample Size on Variability:
How did the shape of the sampling distribution differ between the small sample (n = 50) and the large sample (n = 500)?
Why do you think the sampling distribution for the larger sample was narrower and more centered around the true population proportion (55%)?
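As a starting point for these two questions, the spread of a sample proportion can be computed directly. The sketch below uses the standard error formula sqrt(p(1 - p)/n) and a binomial model for the number of supporters in a poll. It assumes p is exactly 0.55 and ignores the finite population size, so it approximates the simulations and does not replace them. All object names are hypothetical.

```r
# Illustrative sketch: assumes p = 0.55 exactly and a binomial model for each poll.
p <- 0.55
n <- c(50, 500)   # small and large poll sizes

# Standard error of a sample proportion
se <- sqrt(p * (1 - p) / n)

# "Extreme" polls: estimate below 0.50 (fewer than n/2 supporters)
# or above 0.60 (more than 3n/5 supporters)
lower_tail <- pbinom(n / 2 - 1, size = n, prob = p)                   # P(X < n/2)
upper_tail <- pbinom(3 * n / 5, size = n, prob = p, lower.tail = FALSE) # P(X > 3n/5)
p_extreme <- lower_tail + upper_tail

data.frame(n = n, se = se, p_extreme = p_extreme)
```

Because the standard error shrinks with the square root of n, a tenfold increase in sample size narrows the sampling distribution by a factor of about 3.2.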
Real-World Implications for Polling in Remote Areas:
Given the logistical difficulties of conducting large polls in remote areas, what are the potential risks of using small samples to estimate voter support in these regions?
How might pollsters address these risks when conducting real-world polls?
Trade-offs in Polling:
In practice, what are the trade-offs between using a smaller sample (which may be easier to collect) and a larger sample (which is more resource-intensive)?
Relevance of Sample Size to Your Own Work:
How does the consideration of adequate sample size apply to your own research or professional work?