Uncertainty for Regression Estimates
In a 2017 study, Dr. Raj Chetty and colleagues analyzed intergenerational income mobility at 2,202 U.S. colleges using data from over 30 million students. They calculated the median parental income for students attending college in the early 2000s and the median income of these students at age 34. Parental income was defined as the average pre-tax household income when the child was 15-19, adjusted to 2015 dollars. Children’s income was based on individual earnings in 2014, ranked within their birth cohort, as were parents’ incomes. Each row of data represents a college/university.
The variables in the data frame that we’ll consider include:
- The name of the college.
- The median income of parents (as described above; called par_median).
- The median income of children/students at age 34 (called k_median).
We will explore parent median income as a predictor of child median income across the colleges and universities. The scatter plot below presents the data and the best-fit line.
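A plot along those lines can be sketched with ggplot2 (assuming the data live in a data frame called college_mobility, the name used in the sampling code below):

```r
library(tidyverse)

# Scatter plot of child vs. parent median income, with the best-fit line
ggplot(college_mobility, aes(x = par_median, y = k_median)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Parent median income", y = "Child median income at age 34")
```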
The population equation that relates a predictor variable (X) to an outcome variable (Y) may be written as follows:
\[ Y_i = \beta_0 + \beta_1X_i + \epsilon_i \]
Here, \(\beta_0\) is the population intercept, \(\beta_1\) is the population slope, and \(\epsilon_i\) is the error term for college \(i\): the deviation of that college's outcome from the population regression line.
In practice, we rarely have access to population data. Instead, we draw a random sample and use the sample to make inferences about the population.
Different samples can lead to different estimates due to sampling variability.
We must quantify this sample-to-sample variability to express the uncertainty in our regression parameter estimates.
In practice, we don’t know the true values of \(\beta_0\), \(\beta_1\), or \(\epsilon_i\), so we estimate them using sample data. The sample equation, derived from our data, is:
\[ y_i = b_0 + b_1x_i + e_i \]
where \(\hat{y}_i = b_0 + b_1x_i\) is the fitted value and \(e_i = y_i - \hat{y}_i\) is the residual.
We can simulate the process of collecting a random sample from our population of colleges. To begin, I drew one random sample of size 500 and fit the regression model within this sample.
```r
library(tidyverse)
library(broom) # for tidy()

set.seed(123) # For reproducibility

# Draw a random sample of size 500
college_sample <- college_mobility |>
  slice_sample(n = 500, replace = FALSE)

# Fit a regression model to the sample
sample_model <- lm(k_median ~ par_median, data = college_sample)

# View the coefficients
sample_model |>
  tidy() |>
  select(term, estimate) |>
  mutate(across(where(is.numeric), ~ round(.x, 3)))
```
The table below compares the estimates from our sample to the population parameters obtained earlier.
We can also imagine drawing many random samples, thereby simulating the sampling distribution of the regression parameter estimates.
First, take a look at the results from 9 random samples of size 500 drawn from the population.
Regression estimates for 9 samples of size 500

| resample | intercept | slope | r.squared | sigma |
|---|---|---|---|---|
| 1 | 9038.694 | 0.3575964 | 0.6204930 | 8290.792 |
| 2 | 11009.120 | 0.3360804 | 0.5468148 | 8610.070 |
| 3 | 9475.741 | 0.3494894 | 0.5454179 | 9761.214 |
| 4 | 13586.836 | 0.3007477 | 0.4406880 | 9854.353 |
| 5 | 8887.693 | 0.3664174 | 0.6003932 | 8377.026 |
| 6 | 9805.564 | 0.3439857 | 0.6029556 | 7802.560 |
| 7 | 10976.750 | 0.3285892 | 0.5199490 | 8882.052 |
| 8 | 10637.709 | 0.3366391 | 0.5825492 | 7477.149 |
| 9 | 9162.389 | 0.3585532 | 0.6213307 | 8414.825 |
Here, I produced scatter plots and best-fit lines for each of the 9 random samples.
Next, I went big: I drew 1000 random samples of size 500, fit the regression model in each, and collated the results into a data frame.
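Here is a sketch of that simulation loop, assuming the same college_mobility data frame; the residual standard deviation (sigma, from glance()) is stacked in as an extra term because the prediction-interval code at the end of this section needs it:

```r
set.seed(123) # reproducibility

# Draw 1000 samples of size 500; fit the model in each and keep the
# intercept, slope, and residual standard deviation (sigma)
sampling_distribution <-
  map_dfr(1:1000, \(i) {
    samp <- slice_sample(college_mobility, n = 500, replace = FALSE)
    fit  <- lm(k_median ~ par_median, data = samp)
    bind_rows(
      tidy(fit) |> select(term, estimate),
      glance(fit) |> transmute(term = "sigma", estimate = sigma)
    ) |>
      mutate(resample = i)
  })
```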
I extracted the 1000 regression slopes (the rows where term == "par_median" in the data frame of estimates) and plotted them in a density plot to visualize the sampling distribution of the regression slope.
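A minimal sketch of that plot, saved as an object so the reference lines in the next step can be layered on:

```r
# Density of the 1000 slope estimates
slope_density <-
  sampling_distribution |>
  filter(term == "par_median") |>
  ggplot(aes(x = estimate)) +
  geom_density() +
  labs(x = "Slope estimate (par_median)", y = "Density")

slope_density
```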
I then summarized the slope estimates across the 1000 samples, computing the mean, the SD, and the 2.5th and 97.5th percentiles (to capture the middle 95%).
```r
middle95 <-
  sampling_distribution |>
  filter(term == "par_median") |>
  summarize(n = n(),
            mean = mean(estimate),
            sd = sd(estimate),
            lower = quantile(estimate, probs = .025),
            upper = quantile(estimate, probs = .975))

middle95
```
95% of the samples produce a slope estimate between the lower and upper bounds reported in this table. Notice that the mean slope across the samples essentially matches the true population slope that we fit earlier (\(\beta_1 = 0.334\)).
Then, I overlaid the true population slope and the percentiles onto the density graph of the sampling distribution for the regression slope.
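For example, layering onto the slope_density object sketched above (0.334 is the population slope reported earlier):

```r
# Population slope (solid) and the middle-95% bounds (dashed)
slope_density +
  geom_vline(xintercept = 0.334, color = "red") +
  geom_vline(xintercept = c(middle95$lower, middle95$upper),
             linetype = "dashed")
```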
This plot shows how much the estimated effect of parent median income on child median income varies from sample to sample: 95% of the samples produce a slope estimate that falls between the two dashed lines.
I also plotted the 1000 regression lines from the samples on a single plot.
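One way to sketch that figure: widen the estimates to one row per sample, then draw each fitted line with geom_abline, using heavy transparency so the band of plausible lines shows through:

```r
# One row per sample; after pivoting, the par_median column holds the slope
lines_df <-
  sampling_distribution |>
  filter(term %in% c("(Intercept)", "par_median")) |>
  pivot_wider(names_from = term, values_from = estimate)

ggplot(college_mobility, aes(x = par_median, y = k_median)) +
  geom_point(alpha = 0.1) +
  geom_abline(data = lines_df,
              aes(intercept = `(Intercept)`, slope = par_median),
              alpha = 0.05, color = "steelblue")
```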
I also queried and visualized the sampling distribution for the intercept of the regression model.
Lower and upper bounds when the intercept is defined at par_median = $0 (the initial model).
Lower and upper bounds when the intercept is defined at par_median = $77,695 (the mean of par_median).
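The second version of the intercept corresponds to mean-centering the predictor before fitting, along these lines (sketched on the full data; in the simulation the same re-fit happens within each sample):

```r
# Mean-center par_median so the intercept becomes the predicted child
# median income at the average parent median income (about $77,695)
centered_model <- lm(k_median ~ I(par_median - mean(par_median)),
                     data = college_mobility)
```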
Last, I used the fitted regression models to calculate predicted values.
The first step is to calculate, for each of the 1000 samples, the predicted child median income (yhat) at a parent median income of $100,000.
```r
predicted_scores <-
  sampling_distribution |>
  filter(term %in% c("(Intercept)", "par_median", "sigma")) |>
  pivot_wider(names_from = term, values_from = estimate) |>
  # After pivoting, the par_median column holds each sample's slope estimate;
  # yhat is the predicted child median income at par_median = $100,000
  mutate(yhat = `(Intercept)` + par_median * 100000)

predicted_scores |> summarize(mean_yhat = mean(yhat))
```
Across the 1000 samples, the average predicted score is $44,372. But, of course, there is sample-to-sample variability in this estimate.
The CI gives us an estimate of where the average median child income lies, while the PI broadens this to account for college-to-college variability in median child income at a given parent median income. This distinction is crucial:
- Confidence Interval (CI): Reflects the uncertainty around the mean outcome, giving us a range for the average predicted score that we might expect.
- Prediction Interval (PI): Incorporates both the uncertainty around the mean and the variability of individual data points (colleges, in our example), providing a broader range that accounts for the fact that individual college medians vary widely even among colleges with the same parent median income.
Calculation of 95% Confidence Interval
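Because yhat is each sample's estimate of the mean child median income at par_median = $100,000, the 95% CI can be computed as the middle 95% of the 1000 yhat values (mirroring the PI calculation below):

```r
# 95% CI for the mean child median income at par_median = 100000
ci_bounds <-
  predicted_scores |>
  summarize(lower = quantile(yhat, probs = .025),
            upper = quantile(yhat, probs = .975))

ci_bounds
```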
Calculation of 95% Prediction Interval
```r
# Add extra variability to yhat to account for individual (college-to-college)
# variability, drawing from a normal distribution with that sample's sigma
adjusted_predictions <-
  predicted_scores |>
  rowwise() |>
  mutate(adjusted_yhat = yhat + rnorm(1, mean = 0, sd = sigma)) |>
  ungroup()

# Calculate the 95% PI for an individual college's child median income
# at par_median = 100000
pi_bounds <-
  adjusted_predictions |>
  summarize(lower = quantile(adjusted_yhat, probs = .025),
            upper = quantile(adjusted_yhat, probs = .975))

pi_bounds
```