Estimating Uncertainty in Descriptive Statistics
In a 2017 study, Dr. Raj Chetty and colleagues analyzed intergenerational income mobility at 2,202 U.S. colleges using data from over 30 million students.
Key variables for our analysis today:
College name (name): The name of the institution. Each row of data represents a college.
Students’ median income (k_median): Median individual earnings at age 34, measured in 2014, for the students who attended the college.
Our parameter of interest: The average median income across these 2,202 institutions.
We will define the population parameter of interest as the average of the 2,202 median incomes across colleges. Since, in this unique setting, we have data from all colleges, we can calculate the population parameter directly.
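As a minimal sketch (assuming the full dataset is already loaded as a data frame named `colleges` with a `k_median` column; these names are placeholders, not shown in the notes):

```r
# The population parameter: the mean of the 2,202 college-level median incomes.
true_mean <- mean(colleges$k_median, na.rm = TRUE)
true_mean
```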
Let’s imagine that we were conducting this study on our own. We don’t have the resources to determine the median income at age 34 for all 2,202 colleges — so instead we draw a random sample of size 250.
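A hedged sketch of that step, reusing the hypothetical `colleges` data frame from above (the seed and object names are illustrative, not from the original analysis):

```r
# Draw a random sample of 250 colleges without replacement,
# then estimate the parameter with the mean of their median incomes.
set.seed(2017)                                   # illustrative seed for reproducibility
sample_rows    <- sample(nrow(colleges), size = 250)
college_sample <- colleges[sample_rows, ]
sample_mean    <- mean(college_sample$k_median, na.rm = TRUE)
sample_mean
```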
This is our parameter estimate based on the drawn sample.
Key questions we need to answer:
How close is our sample estimate to the true population parameter?
If we drew a different sample of 250 colleges, how much would our estimate change?
What range of values might reasonably contain the true population parameter?
How can we quantify our uncertainty about this estimate?
A confidence interval will help us:
Acknowledge the uncertainty inherent in sampling.
Provide a range of plausible values for the population parameter, based on our drawn sample.
Bootstrap resampling lets us simulate drawing many samples by resampling from our original sample with replacement. This produces a bootstrap distribution that approximates the sampling distribution.
Let’s see what one bootstrap resample looks like:
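A sketch using the hypothetical `college_sample` from the earlier step:

```r
# One bootstrap resample: draw 250 rows from our sample of 250 colleges,
# WITH replacement, so some colleges appear more than once and others not at all.
boot_resample <- college_sample[sample(nrow(college_sample), size = 250, replace = TRUE), ]
mean(boot_resample$k_median, na.rm = TRUE)       # the estimate from this single resample
```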
We need many, many bootstrap resamples to simulate the sampling distribution.
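One way to do this in base R, again using the hypothetical objects from the sketches above, is to wrap the resampling step in replicate():

```r
# Repeat the resampling 1000 times, keeping the mean of each resample.
boot_means <- replicate(1000, {
  resample <- college_sample[sample(nrow(college_sample), size = 250, replace = TRUE), ]
  mean(resample$k_median, na.rm = TRUE)
})

# The bootstrap distribution approximates the sampling distribution of the mean.
hist(boot_means, main = "Bootstrap distribution of the sample mean")
```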
Calculate a 95% confidence interval
Using our 1000 bootstrap estimates, we find the middle 95% of the distribution by calculating the 2.5th and 97.5th percentiles with the quantile() function.
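For example, applied to the hypothetical `boot_means` vector from the sketch above:

```r
# Percentile method: the 2.5th and 97.5th percentiles bound the middle 95%
# of the 1000 bootstrap means.
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))
boot_ci
```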
Think of it this way: If 95% of our bootstrap estimates fall between these bounds, then these represent the typical range of values we’d expect. The outer 5% (2.5% on each tail) represent unusual estimates that are less likely to occur.
This interval captures the range of plausible values for our population parameter, based on our sample.
Now that you understand how bootstrap confidence intervals work, let’s explore what happens when we repeat this process multiple times. I created a custom function called bootstrap_and_plot() that automates the entire process we just learned.
What the function does:
Draws a fresh random sample from our college population (default is n = 250)
Creates bootstrap resamples from that sample (default is 1000 bootstrap resamples)
Calculates a confidence interval using the percentile method (default is a 95% interval)
Checks whether the CI captures the true population parameter
Shows you a histogram of the bootstrap distribution with the CI and population parameter marked
Run the function 20 times with default settings and keep a simple tally of whether or not the CI includes the true population parameter.
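The return value of bootstrap_and_plot() isn’t documented in these notes; the sketch below assumes, purely for illustration, that it returns TRUE when the interval captures the true parameter and FALSE otherwise, so the 20 results can be tallied.

```r
# Hypothetical tally, assuming bootstrap_and_plot() returns TRUE/FALSE
# for whether the CI captured the true population parameter.
captures <- replicate(20, bootstrap_and_plot())
table(captures)    # counts of captures (TRUE) and misses (FALSE)
mean(captures)     # proportion of the 20 intervals that captured the parameter
```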
On average, 95% of trials should capture the population parameter
Individual Student Expectations (each student conducts 20 trials)
Expect 19 captures (but 15-20 is normal)
Expect about 1 miss
Some students may get all 20 captures, others might get only 16-17
Individual variation is expected and completely normal
Collated Across Students (about 400 trials: 20 students × 20 trials each)
Expect ~380 captures
Expect ~20 misses
Law of Large Numbers: The collated class results will be much closer to the nominal 95% capture rate than any individual student’s results.
So far we’ve used bootstrapping to estimate confidence intervals. But when certain conditions are met, we can use mathematical shortcuts based on known probability distributions.
Key difference:
Bootstrap: Uses resampling to discover what the sampling distribution looks like
Parametric: Uses mathematical theory that tells us what the sampling distribution should look like
When can we use the shortcut? When the standardized sample mean follows a t-distribution, which happens when our data are roughly normal or our sample is large.
The standard error (SE) measures how much our sample mean would typically vary if we took many samples of the same size.
Think of it as: “How precise is our sample mean as an estimate?”
Formula: \[SE = \frac{s}{\sqrt{n}}\] where \(s\) is the sample standard deviation and \(n\) is the sample size.
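Using the hypothetical `college_sample` from earlier:

```r
# Standard error of the sample mean: sample SD divided by the square root of n.
n  <- nrow(college_sample)                            # 250
se <- sd(college_sample$k_median, na.rm = TRUE) / sqrt(n)
se
```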
For estimating a population mean: \(df = n - 1\)
Why lose 1 degree of freedom? When we estimate the population standard deviation using our sample, we “use up” one piece of information. Think of it as the price we pay for not knowing the true population standard deviation.
For our example: \(df = 250 - 1 = 249\)
For a 95% interval, we need the t-score that puts 2.5% in each tail.
Intuition: We’re asking “how many standard errors away from the center do we need to go to capture 95% of possible sample means?”
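In R, qt() gives that t-score; with 249 degrees of freedom:

```r
# Critical t-score leaving 2.5% in the upper tail (and 2.5% in the lower tail).
t_star <- qt(0.975, df = 249)
t_star                                                # roughly 1.97
```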
The recipe: \[\text{Sample Mean} \pm \text{(t-score)} \times \text{(Standard Error)}\]
In symbols: \[ CI = \bar{x} \pm t_{df} \times SE \]
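Putting the pieces together with the hypothetical objects defined above (the exact bounds depend on the particular random sample drawn):

```r
# Parametric 95% confidence interval for the mean.
x_bar <- mean(college_sample$k_median, na.rm = TRUE)
ci <- c(lower = x_bar - t_star * se,
        upper = x_bar + t_star * se)
ci
```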
What this means: Based on our sample, we’re 95% confident that the true average median income at age 34 across all 2,202 colleges is between $35,896 and $39,096.
The fine print: If we repeated this process 100 times with different samples, about 95 of our intervals would capture the true population mean. We don’t know if this particular interval does, but the method works 95% of the time.
Try changing the conf_level argument to 0.90 and 0.99.
Notice how the 90% CI is narrower, but it comes with less certainty that the interval captures the true parameter.
Notice how the 99% CI is wider, but it provides greater certainty that the interval contains the true parameter.
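The function that takes the conf_level argument isn’t shown in this section, but the effect on interval width can be seen directly from the critical t-scores:

```r
# Higher confidence level -> larger critical t-score -> wider interval.
qt(0.950, df = 249)   # 90% CI multiplier, about 1.65
qt(0.975, df = 249)   # 95% CI multiplier, about 1.97
qt(0.995, df = 249)   # 99% CI multiplier, about 2.60
```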