Estimating Uncertainty in Descriptive Statistics
In a 2017 study, Dr. Raj Chetty and colleagues analyzed intergenerational income mobility at 2,202 U.S. colleges using data from over 30 million students.
Key variables for our analysis today:
College name (name): The name of the institution. Each row of data represents a college.
Students’ median income (k_median): Median individual earnings at age 34, measured in 2014, for the students who attended the college.
Our parameter of interest: The average median income across these 2,202 institutions.
We will define the population parameter of interest as the average of the 2,202 median incomes across colleges. Since, in this unique setting, we have data from all colleges, we can calculate the population parameter directly.
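As a minimal sketch (assuming the full dataset is already loaded as a data frame named `colleges` with a `k_median` column; these names are placeholders, not shown in the notes):

```r
# The population parameter: the mean of the 2,202 college-level median incomes.
true_mean <- mean(colleges$k_median, na.rm = TRUE)
true_mean
```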
Let’s imagine that we were conducting this study on our own. We don’t have the resources to determine the median income at age 34 for all 2,202 colleges — so instead we draw a random sample of size 250.
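A hedged sketch of that step, reusing the hypothetical `colleges` data frame from above (the seed and object names are illustrative, not from the original analysis):

```r
# Draw a random sample of 250 colleges without replacement,
# then estimate the parameter with the mean of their median incomes.
set.seed(2017)                                   # illustrative seed for reproducibility
sample_rows    <- sample(nrow(colleges), size = 250)
college_sample <- colleges[sample_rows, ]
sample_mean    <- mean(college_sample$k_median, na.rm = TRUE)
sample_mean
```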
This is our parameter estimate based on the drawn sample.
Key questions we need to answer:
How close is our sample estimate to the true population parameter?
If we drew a different sample of 250 colleges, how much would our estimate change?
What range of values might reasonably contain the true population parameter?
How can we quantify our uncertainty about this estimate?
A confidence interval will help us:
Acknowledge the uncertainty inherent in sampling.
Provide a range of plausible values for the population parameter, based on our drawn sample.
Bootstrap resampling lets us simulate drawing many samples by resampling from our original sample with replacement. This produces a bootstrap distribution that approximates the sampling distribution.
Let’s see what one bootstrap resample looks like:
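A sketch using the hypothetical `college_sample` from the earlier step:

```r
# One bootstrap resample: draw 250 rows from our sample of 250 colleges,
# WITH replacement, so some colleges appear more than once and others not at all.
boot_resample <- college_sample[sample(nrow(college_sample), size = 250, replace = TRUE), ]
mean(boot_resample$k_median, na.rm = TRUE)       # the estimate from this single resample
```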
We need many, many bootstrap resamples to simulate the sampling distribution.
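One way to do this in base R, again using the hypothetical objects from the sketches above, is to wrap the resampling step in replicate():

```r
# Repeat the resampling 1000 times, keeping the mean of each resample.
boot_means <- replicate(1000, {
  resample <- college_sample[sample(nrow(college_sample), size = 250, replace = TRUE), ]
  mean(resample$k_median, na.rm = TRUE)
})

# The bootstrap distribution approximates the sampling distribution of the mean.
hist(boot_means, main = "Bootstrap distribution of the sample mean")
```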
Calculate a 95% confidence interval
Using our 1000 bootstrap estimates, we find the middle 95% of the distribution by calculating the 2.5th and 97.5th percentiles with the quantile() function.
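For example, applied to the hypothetical `boot_means` vector from the sketch above:

```r
# Percentile method: the 2.5th and 97.5th percentiles bound the middle 95%
# of the 1000 bootstrap means.
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))
boot_ci
```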
Think of it this way: If 95% of our bootstrap estimates fall between these bounds, then these represent the typical range of values we’d expect. The outer 5% (2.5% on each tail) represent unusual estimates that are less likely to occur.
This interval captures the range of plausible values for our population parameter, based on our sample.
Now that you understand how bootstrap confidence intervals work, let’s explore what happens when we repeat this process multiple times. I created a custom function called bootstrap_and_plot() that automates the entire process we just learned.
What the function does:
Draws a fresh random sample from our college population (default is n = 250)
Creates bootstrap resamples from that sample (default is 1000 bootstrap resamples)
Calculates a confidence interval using the percentile method (default is a 95% interval)
Checks whether the CI captures the true population parameter
Shows you a histogram of the bootstrap distribution with the CI and population parameter marked
Run the function 20 times with default settings and keep a simple tally of whether or not the CI includes the true population parameter.
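The return value of bootstrap_and_plot() isn’t documented in these notes; the sketch below assumes, purely for illustration, that it returns TRUE when the interval captures the true parameter and FALSE otherwise, so the 20 results can be tallied.

```r
# Hypothetical tally, assuming bootstrap_and_plot() returns TRUE/FALSE
# for whether the CI captured the true population parameter.
captures <- replicate(20, bootstrap_and_plot())
table(captures)    # counts of captures (TRUE) and misses (FALSE)
mean(captures)     # proportion of the 20 intervals that captured the parameter
```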
On average, 95% of trials should capture the population parameter
Individual Student Expectations (each student conducts 20 trials)
Expect 19 captures (but 15-20 is normal)
Expect about 1 miss
Some students may get all 20 captures, others might get only 16-17
Individual variation is expected and completely normal
Collated Across Students (about 400 trials: 20 students × 20 trials each)
Expect ~380 captures
Expect ~20 misses
Law of Large Numbers: The collated class results will be much closer to the nominal 95% capture rate than any individual student’s results.
So far we’ve used bootstrapping to estimate confidence intervals. But when certain conditions are met, we can use mathematical shortcuts based on known probability distributions.
Key difference:
Bootstrap: Uses resampling to discover what the sampling distribution looks like
Parametric: Uses mathematical theory that tells us what the sampling distribution should look like
When can we use the shortcut? When the standardized sample mean follows a t-distribution, which happens when our data are roughly normal or our sample is large.
The standard error (SE) measures how much our sample mean would typically vary if we took many samples of the same size.
Think of it as: “How precise is our sample mean as an estimate?”
Formula: \[SE = \frac{s}{\sqrt{n}}\] where \(s\) is the sample standard deviation and \(n\) is the sample size.
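Using the hypothetical `college_sample` from earlier:

```r
# Standard error of the sample mean: sample SD divided by the square root of n.
n  <- nrow(college_sample)                            # 250
se <- sd(college_sample$k_median, na.rm = TRUE) / sqrt(n)
se
```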
For estimating a population mean: \(df = n - 1\)
Why lose 1 degree of freedom? When we estimate the population standard deviation using our sample, we “use up” one piece of information. Think of it as the price we pay for not knowing the true population standard deviation.
For our example: \(df = 250 - 1 = 249\)
For a 95% interval, we need the t-score that puts 2.5% in each tail.
Intuition: We’re asking “how many standard errors away from the center do we need to go to capture 95% of possible sample means?”
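In R, qt() gives that t-score; with 249 degrees of freedom:

```r
# Critical t-score leaving 2.5% in the upper tail (and 2.5% in the lower tail).
t_star <- qt(0.975, df = 249)
t_star                                                # roughly 1.97
```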
The recipe: \[\text{Sample Mean} \pm \text{(t-score)} \times \text{(Standard Error)}\]
In symbols: \[ CI = \bar{x} \pm t_{df} \times SE \]
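Putting the pieces together with the hypothetical objects defined above (the exact bounds depend on the particular random sample drawn):

```r
# Parametric 95% confidence interval for the mean.
x_bar <- mean(college_sample$k_median, na.rm = TRUE)
ci <- c(lower = x_bar - t_star * se,
        upper = x_bar + t_star * se)
ci
```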
What this means: Based on our sample, we’re 95% confident that the true average median income at age 34 across all 2,202 colleges is between $35,896 and $39,096.
The fine print: If we repeated this process 100 times with different samples, about 95 of our intervals would capture the true population mean. We don’t know if this particular interval does, but the method works 95% of the time.
Try changing the conf_level argument to 0.90 and 0.99.
Notice how the 90% CI is narrower, but it comes with less certainty that the interval captures the true parameter.
Notice how the 99% CI is wider, but it provides greater certainty that the interval contains the true parameter.
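The function that takes the conf_level argument isn’t shown in this section, but the effect on interval width can be seen directly from the critical t-scores:

```r
# Higher confidence level -> larger critical t-score -> wider interval.
qt(0.950, df = 249)   # 90% CI multiplier, about 1.65
qt(0.975, df = 249)   # 95% CI multiplier, about 1.97
qt(0.995, df = 249)   # 99% CI multiplier, about 2.60
```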