Uncertainty for Regression Estimates
In a 2017 study, Dr. Raj Chetty and colleagues analyzed intergenerational income mobility at 2,202 U.S. colleges using data from over 30 million students. They calculated the median parental income for students attending college in the early 2000s and the median income of these students at age 34. Parental income was defined as the average pre-tax household income when the child was 15-19, adjusted to 2015 dollars. Children’s income was based on individual earnings in 2014, ranked within their birth cohort, as were parents’ incomes. Each row of data represents a college/university.
The variables in the data frame that we’ll consider include:
- The name of the college.
- The median income of parents (as described above; called par_median).
- The median income of children/students at age 34 (called k_median).
We will explore parent median income as a predictor of child median income across the colleges and universities. The scatter plot below presents the data and the best-fit line.
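A plot along those lines can be sketched with ggplot2 (assuming the data live in a data frame called college_mobility, the name used in the sampling code below):

```r
library(tidyverse)

# Scatter plot of child vs. parent median income, with the best-fit line
ggplot(college_mobility, aes(x = par_median, y = k_median)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Parent median income", y = "Child median income at age 34")
```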
The population equation that relates a predictor variable (X) to an outcome variable (Y) may be written as follows:
\[ Y_i = \beta_0 + \beta_1X_i + \epsilon_i \]
Here, \(\beta_0\) is the population intercept, \(\beta_1\) is the population slope, and \(\epsilon_i\) is the error term for college \(i\): the deviation of that college's outcome from the population regression line.
In practice, we rarely have access to population data. Instead, we draw a random sample and use the sample to make inferences about the population.
Different samples can lead to different estimates due to sampling variability.
We must quantify this sample-to-sample variability to express the uncertainty in our regression parameter estimates.
In practice, we don’t know the true values of \(\beta_0\), \(\beta_1\), or \(\epsilon_i\), so we estimate them using sample data. The sample equation, derived from our data, is:
\[ y_i = b_0 + b_1x_i + e_i \]
where \(\hat{y}_i = b_0 + b_1x_i\) is the fitted value and \(e_i = y_i - \hat{y}_i\) is the residual.
We can simulate the process of collecting a random sample from our population of colleges. To begin, I drew one random sample of size 500 and fit the regression model within this sample.
```r
library(tidyverse)
library(broom) # for tidy()

set.seed(123) # For reproducibility

# Draw a random sample of size 500
college_sample <- college_mobility |>
  slice_sample(n = 500, replace = FALSE)

# Fit a regression model to the sample
sample_model <- lm(k_median ~ par_median, data = college_sample)

# View the coefficients
sample_model |>
  tidy() |>
  select(term, estimate) |>
  mutate(across(where(is.numeric), ~ round(.x, 3)))
```
The table below compares the estimates from our sample to the population parameters obtained earlier.
We can also imagine drawing many random samples, thereby simulating the sampling distribution of the regression parameter estimates.
First, take a look at the results from 9 random samples of size 500 drawn from the population.
Regression estimates for 9 samples of size 500

| resample | intercept | slope | r.squared | sigma |
|---|---|---|---|---|
| 1 | 9038.694 | 0.3575964 | 0.6204930 | 8290.792 |
| 2 | 11009.120 | 0.3360804 | 0.5468148 | 8610.070 |
| 3 | 9475.741 | 0.3494894 | 0.5454179 | 9761.214 |
| 4 | 13586.836 | 0.3007477 | 0.4406880 | 9854.353 |
| 5 | 8887.693 | 0.3664174 | 0.6003932 | 8377.026 |
| 6 | 9805.564 | 0.3439857 | 0.6029556 | 7802.560 |
| 7 | 10976.750 | 0.3285892 | 0.5199490 | 8882.052 |
| 8 | 10637.709 | 0.3366391 | 0.5825492 | 7477.149 |
| 9 | 9162.389 | 0.3585532 | 0.6213307 | 8414.825 |
Here, I produced scatter plots and best-fit lines for each of the 9 random samples.
Next, I went big: I drew 1000 random samples of size 500, fit the regression model in each, and collated the results into a data frame.
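Here is a sketch of that simulation loop, assuming the same college_mobility data frame; the residual standard deviation (sigma, from glance()) is stacked in as an extra term because the prediction-interval code at the end of this section needs it:

```r
set.seed(123) # reproducibility

# Draw 1000 samples of size 500; fit the model in each and keep the
# intercept, slope, and residual standard deviation (sigma)
sampling_distribution <-
  map_dfr(1:1000, \(i) {
    samp <- slice_sample(college_mobility, n = 500, replace = FALSE)
    fit  <- lm(k_median ~ par_median, data = samp)
    bind_rows(
      tidy(fit) |> select(term, estimate),
      glance(fit) |> transmute(term = "sigma", estimate = sigma)
    ) |>
      mutate(resample = i)
  })
```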
I extracted the 1000 regression slopes (the rows where term == "par_median" in the data frame of estimates) and plotted them in a density plot to visualize the sampling distribution of the regression slope.
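A minimal sketch of that plot, saved as an object so the reference lines in the next step can be layered on:

```r
# Density of the 1000 slope estimates
slope_density <-
  sampling_distribution |>
  filter(term == "par_median") |>
  ggplot(aes(x = estimate)) +
  geom_density() +
  labs(x = "Slope estimate (par_median)", y = "Density")

slope_density
```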
I then summarized the slope estimates across the 1000 samples, computing the mean, the SD, and the 2.5th and 97.5th percentiles (to capture the middle 95%).
```r
middle95 <-
  sampling_distribution |>
  filter(term == "par_median") |>
  summarize(n = n(),
            mean = mean(estimate),
            sd = sd(estimate),
            lower = quantile(estimate, probs = .025),
            upper = quantile(estimate, probs = .975))

middle95
```
95% of the samples produce a slope estimate between the lower and upper bounds reported in this table. Notice that the mean slope across the samples essentially matches the true population slope that we fit earlier (\(\beta_1 = 0.334\)).
Then, I overlaid the true population slope and the percentiles onto the density graph of the sampling distribution for the regression slope.
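For example, layering onto the slope_density object sketched above (0.334 is the population slope reported earlier):

```r
# Population slope (solid) and the middle-95% bounds (dashed)
slope_density +
  geom_vline(xintercept = 0.334, color = "red") +
  geom_vline(xintercept = c(middle95$lower, middle95$upper),
             linetype = "dashed")
```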
This plot shows how much the estimated effect of parent median income on child median income varies from sample to sample: 95% of the samples produce a slope estimate that falls between the two dashed lines.
I also plotted the 1000 regression lines from the samples on a single plot.
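One way to sketch that figure: widen the estimates to one row per sample, then draw each fitted line with geom_abline, using heavy transparency so the band of plausible lines shows through:

```r
# One row per sample; after pivoting, the par_median column holds the slope
lines_df <-
  sampling_distribution |>
  filter(term %in% c("(Intercept)", "par_median")) |>
  pivot_wider(names_from = term, values_from = estimate)

ggplot(college_mobility, aes(x = par_median, y = k_median)) +
  geom_point(alpha = 0.1) +
  geom_abline(data = lines_df,
              aes(intercept = `(Intercept)`, slope = par_median),
              alpha = 0.05, color = "steelblue")
```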
I also queried and visualized the sampling distribution for the intercept of the regression model.
Lower and upper bounds when the intercept is defined at par_median = $0 (the initial model).
Lower and upper bounds when the intercept is defined at par_median = $77,695 (the mean of par_median).
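The second version of the intercept corresponds to mean-centering the predictor before fitting, along these lines (sketched on the full data; in the simulation the same re-fit happens within each sample):

```r
# Mean-center par_median so the intercept becomes the predicted child
# median income at the average parent median income (about $77,695)
centered_model <- lm(k_median ~ I(par_median - mean(par_median)),
                     data = college_mobility)
```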
Last, I used the fitted regression models to calculate predicted values.
The first step is to calculate, for each of the 1000 samples, the predicted child median income (yhat) at a parent median income of $100,000.
```r
predicted_scores <-
  sampling_distribution |>
  filter(term %in% c("(Intercept)", "par_median", "sigma")) |>
  pivot_wider(names_from = term, values_from = estimate) |>
  # After pivoting, the par_median column holds each sample's slope estimate;
  # yhat is the predicted child median income at par_median = $100,000
  mutate(yhat = `(Intercept)` + par_median * 100000)

predicted_scores |> summarize(mean_yhat = mean(yhat))
```
Across the 1000 samples, the average predicted score is $44,372. But, of course, there is sample-to-sample variability in this estimate.
The CI gives us an estimate of where the average median child income lies, while the PI broadens this to account for college-to-college variability in median child income at a given parent median income. This distinction is crucial:
- Confidence Interval (CI): Reflects the uncertainty around the mean outcome, giving us a range for the average predicted score that we might expect.
- Prediction Interval (PI): Incorporates both the uncertainty around the mean and the variability of individual data points (colleges, in our example), providing a broader range that accounts for the fact that individual college medians vary widely even among colleges with the same parent median income.
Calculation of 95% Confidence Interval
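Because yhat is each sample's estimate of the mean child median income at par_median = $100,000, the 95% CI can be computed as the middle 95% of the 1000 yhat values (mirroring the PI calculation below):

```r
# 95% CI for the mean child median income at par_median = 100000
ci_bounds <-
  predicted_scores |>
  summarize(lower = quantile(yhat, probs = .025),
            upper = quantile(yhat, probs = .975))

ci_bounds
```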
Calculation of 95% Prediction Interval
```r
# Add extra variability to yhat to account for individual (college-to-college)
# variability, drawing from a normal distribution with that sample's sigma
adjusted_predictions <-
  predicted_scores |>
  rowwise() |>
  mutate(adjusted_yhat = yhat + rnorm(1, mean = 0, sd = sigma)) |>
  ungroup()

# Calculate the 95% PI for an individual college's child median income
# at par_median = 100000
pi_bounds <-
  adjusted_predictions |>
  summarize(lower = quantile(adjusted_yhat, probs = .025),
            upper = quantile(adjusted_yhat, probs = .975))

pi_bounds
```