A webR tutorial
A very brief intro to statistical power
Recall the presidential polling example
You and your colleague work for a professional polling firm and have been contacted by a presidential candidate to conduct a poll in very remote, rural areas of the United States. For this example, we’ll assume that the true proportion of support for the candidate in these remote, rural areas is 55%. To begin, we’ll simulate a data frame representing all of the potential respondents from the remote, rural regions of interest.
Press Run Code on the code chunk below to create the population data frame.
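The interactive chunk itself isn’t reproduced in this text, but here is a minimal sketch of what it might contain. The object name `population`, the population size of 100,000, and the exact construction are assumptions for illustration; the key feature is that exactly 55% of the simulated voters support the candidate.

```r
# Simulate a population in which exactly 55% of voters support the candidate.
# The population size (100,000) and object name are illustrative assumptions.
population <- data.frame(
  voter_id = 1:100000,
  support  = sample(rep(c(1, 0), times = c(55000, 45000)))
)

# Confirm the true population proportion of support
mean(population$support)  # 0.55
```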
Draw a random sample
Let’s draw a single random sample of 1000 voters and compute the proportion of voters who support the candidate.
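The tutorial’s chunk handles this step for you; a sketch of what it might look like, assuming the `population` data frame created above, is shown here. Because the draw is random, the proportion you get depends on the seed; the 56.3% reported next came from the tutorial’s particular draw.

```r
# Draw a simple random sample of 1000 voters from the simulated population
set.seed(456)  # arbitrary seed; the tutorial's draw (and result) may differ

poll_sample <- population[sample(nrow(population), size = 1000), ]

# Proportion of sampled voters who support the candidate
p_hat <- mean(poll_sample$support)
p_hat
```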
In this random sample, 56.3% of the respondents support the candidate.
The Standard Error (SE)
Knowing just the proportion of people who support the candidate isn’t enough. Because this estimate comes from a single random sample, it’s important to recognize that the proportion could vary across different random samples. To account for this sample-to-sample variability, we need a method to quantify the uncertainty in our estimate. The solution is to compute the standard error.
There is a formula we can use to determine the standard error for this scenario. We need the proportion who support the candidate in the sample (0.563) and the sample size:
\[ SE = \sqrt{\frac{{p}(1 - {p})}{n}} = \sqrt{\frac{0.563 \times 0.437}{1000}} \approx 0.016 \]
Where:
\({p}\) is the sample proportion who support the candidate.
\(n\) is the sample size.
The Margin of Error (MOE)
In polling (as well as other applications), you’ll often hear the term “Margin of Error” or MOE. The margin of error represents the maximum expected difference between the sample statistic (like a sample proportion) and the true population parameter, at a specific confidence level.
The margin of error for a proportion can be calculated using the following formula:
\[ \text{Margin of Error} = z \times SE = 1.96 \times 0.016 = 0.031 \]
Where:
\(z\) is the z-value from a standard normal distribution corresponding to the desired confidence level (for a 95% confidence level, \(z = 1.96\)).
\(SE\) is the standard error.
Compute SE, MOE, and CI for our example
Using the information described above, compute the standard error, MOE, and 95% CI for this example in the code chunk below.
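Here is one way the chunk could be filled in (a sketch, using the sample proportion of 0.563 and n = 1000 from above):

```r
p_hat <- 0.563   # sample proportion who support the candidate
n     <- 1000    # sample size
z     <- 1.96    # z-value for a 95% confidence level

se  <- sqrt(p_hat * (1 - p_hat) / n)   # standard error
moe <- z * se                          # margin of error
ci  <- c(p_hat - moe, p_hat + moe)     # 95% confidence interval

round(se, 3)   # ~0.016
round(moe, 3)  # ~0.031
round(ci, 3)   # ~(0.532, 0.594)
```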
Description of parameter estimates
Based on the sample of 1000 voters, and a confidence level of 95%, the margin of error is 0.031, or expressed as a percentage, 3.1%. By adding and subtracting the MOE from the estimate, we arrive at the 95% confidence interval:
\[ 0.563 \pm 0.031 = (0.532, 0.594) \]
Frequentist interpretation
How should we interpret this CI?
\[ 0.563 \pm 0.031 = (0.532,0.594) \]
If we were to take many, many random samples of size 1000 from the population and compute a confidence interval for each sample, 95% of those confidence intervals would contain the true population proportion of voters who support the candidate.
MOE and CI for a proportion
The MOE represents the maximum expected difference between the sample statistic (e.g., the sample proportion) and the true population parameter.
The confidence interval is the range around the sample estimate, and it is computed as the sample estimate plus or minus the MOE.
\[ CI = {p} \pm MOE \]
BONUS MATERIAL
If you want to stretch a bit, please check out the rest of this activity.
How big a sample do you need?
Let’s imagine that the candidate you are working for returns to your office a few weeks after participating in a big debate. They want you to conduct another poll. The candidate wants to ensure that the 95% CI doesn’t include 50% (signifying a tie), but because money and time are tight, they want to know if they can confidently get away with a smaller sample.
Use the formula for the CI to solve this problem
To determine the smallest sample size for this scenario, we can use the formula for the confidence interval (CI) for a proportion:
\[ CI = {p} \pm z \times \sqrt{\frac{{p}(1-{p})}{n}} \]
Where:
\({p}\) is the sample proportion.
\(z\) is the z-score corresponding to the desired confidence level for a standard normal distribution (for 95%, \(z = 1.96\)).
\(n\) is the sample size.
Note that the margin of error (MOE) is \(z \times \sqrt{\frac{{p}(1-{p})}{n}}\).
The candidate wants to ensure that the 95% CI doesn’t include 50%. This means we need to find the sample size \(n\) such that the margin of error is small enough to keep the entire CI above 0.50.
Solve for necessary MOE
To ensure that the confidence interval does not include 50%, we need the margin of error to be small enough. Given that \({p} = 0.55\), the margin of error should be at most the difference between \({p}\) and 0.50:
\[ \text{MOE} = {p} - 0.50 = 0.55 - 0.50 = 0.05 \]
Solving for sample size:
Rearranging the margin of error formula to solve for \(n\):
\[ n = \frac{z^2 \times {p}(1-{p})}{\text{MOE}^2} \]
Plugging in values:
- \(z = 1.96\),
- \({p} = 0.55\),
- \(\text{MOE} = 0.05\).
\[ n = \frac{(1.96)^2 \times (0.55 \times 0.45)}{(0.05)^2} \]
Simplifying:
\[ n = \frac{3.8416 \times 0.2475}{0.0025} = \frac{0.9508}{0.0025} \approx 380.3 \]
Therefore, according to this calculation, the minimum sample size needed is 381 respondents (rounding up). If you like, you can obtain the same value using a commonly used online power calculator.
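For a quick check of this arithmetic in R (a sketch; `ceiling()` rounds up because a sample size must be a whole number):

```r
z   <- 1.96   # z-value for a 95% confidence level
p   <- 0.55   # anticipated proportion of support
moe <- 0.05   # largest acceptable margin of error

n_raw <- z^2 * p * (1 - p) / moe^2
n_raw            # ~380.3
ceiling(n_raw)   # 381
```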
Checking the adequacy of this sample size
It’s interesting to study this sample size assertion with a simulation study. I don’t expect you to be able to code this now — you’ll get the chance to study power analysis thoroughly in PSY653. Here’s a little preview that might help grow your intuition for sample size estimation now.
Press Run Code to set up a function to determine the proportion of 95% CIs that would include the target that we want to avoid (e.g., 0.50). That is, the candidate wants to ensure that the 95% CI excludes 0.50.
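The tutorial’s chunk defines this function for you; the sketch below shows one way such a function could be written. The function name `ci_includes_target()` and its argument names are illustrative assumptions, not the tutorial’s actual code.

```r
# Estimate the proportion of CIs that include a target value (e.g., 0.50)
# when polling n respondents from a population with true support true_p.
# Function and argument names are illustrative assumptions.
ci_includes_target <- function(n, true_p, target = 0.50,
                               conf_level = 0.95, n_sims = 10000) {
  z <- qnorm(1 - (1 - conf_level) / 2)  # e.g., 1.96 for a 95% CI

  includes_target <- replicate(n_sims, {
    # Draw one simulated poll of size n and compute its CI
    p_hat <- mean(rbinom(n, size = 1, prob = true_p))
    se    <- sqrt(p_hat * (1 - p_hat) / n)
    lower <- p_hat - z * se
    upper <- p_hat + z * se
    target >= lower & target <= upper   # does this CI include the target?
  })

  mean(includes_target)  # proportion of simulated CIs that include the target
}
```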
Now, let’s use the function
The specifications below set up a version of the experiment that matches our initial calculation: we expect 381 people are needed, the proportion in the population who support the candidate is 0.55, and we want a 95% CI that doesn’t include 0.50.
Press Run Code to see what proportion of CIs, given those specifications, would include the target we want to avoid.
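With the assumed `ci_includes_target()` function sketched above, the call might look like this (the seed is arbitrary, chosen only so the result is reproducible):

```r
set.seed(653)  # arbitrary seed for reproducibility

# Sample size of 381, true support of 0.55, target to avoid = 0.50
ci_includes_target(n = 381, true_p = 0.55, target = 0.50)
```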
Yikes! About half of the simulated samples produce a 95% CI that includes 0.50.
What’s happening? The formula-based calculation chose a sample size for which the margin of error exactly equals the 0.05 gap between the anticipated proportion (0.55) and 0.50, and it treats the sample proportion as if it will land exactly on 0.55. In practice, each sample proportion varies: roughly half the time the observed proportion falls below 0.55, and in most of those samples the confidence interval stretches past 0.50. Real-world data is messy, and that sample-to-sample randomness is why the simulation study questions whether a sample size of 381 is really adequate.

While the formula-based method gives us a good starting point (and it’s faster), simulations give us a deeper understanding of how results vary in practice. In this case, we can see that a larger sample size is likely necessary if our candidate wants to confidently rely on the poll results to exclude 50% from the CI in most scenarios.
I invite you to experiment with the specifications using the simulation to see how changing parameters like sample size, true proportion, and the confidence level affects the results. You can modify these in the code chunk above and then rerun the simulation.
- Try increasing the sample size: What happens to the proportion of CIs that include 50% when we bump up the sample size from 381 to, say, 500? How about 1,000? (A starting point for this one appears after the list.)
- Change the true proportion: What if the true proportion in the population is closer to 60% instead of 55%?
- Adjust the confidence level: What happens if we change the confidence level from 95% to 90%? (Note: lower confidence means narrower intervals but with less certainty.)
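As a starting point for the first suggestion, here is a sketch (again using the assumed `ci_includes_target()` function from above) that loops over several sample sizes:

```r
set.seed(653)

# Proportion of 95% CIs that include 0.50 at several sample sizes
sample_sizes <- c(381, 500, 1000, 1500)

sapply(sample_sizes, function(n) {
  ci_includes_target(n = n, true_p = 0.55, target = 0.50)
})
```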
If you’re interested, here’s an article by the Pew Research Center on some of the mechanics of presidential polling.