A webR tutorial

A parametric/theory-based approach to regression inference

Background

In a 2017 study, Dr. Raj Chetty and colleagues analyzed intergenerational income mobility at 2,202 U.S. colleges using data from over 30 million students. They calculated the median parental income for students attending college in the early 2000s and the median income of these students at age 34. Parental income was defined as the average pre-tax household income when the child was 15-19, adjusted to 2015 dollars. Children’s income was based on individual earnings in 2014, ranked within their birth cohort, as were parents’ incomes. In the data frame, each row of data represents one of the 2,202 colleges.

We’ll explore Chetty and colleagues’ data in this webR activity.

Import the data

We’ll use data provided by the authors, downloaded from the Opportunity Insights data repository. We’ll focus on the following key variables within the data frame:

  • The name of the college.

  • The median income of parents (as described above and called par_median).

  • The median income of children/students at age 34 (called k_median).

Press Run Code on the code chunk below to import the data.
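
The import chunk itself isn’t reproduced here, but a minimal sketch of what it might contain is below. The file path and the colleges object name are placeholders chosen for this sketch; the column names follow the variables described above.

  # Load the tidyverse for importing and wrangling the data
  library(tidyverse)

  # Read the college-level file (placeholder path; use the file supplied
  # with this activity) and keep only the variables we need
  colleges <- read_csv("chetty_college_mobility.csv") |>
    select(name, par_median, k_median)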

Goal of this activity

In this activity, we’ll treat the 2,202 colleges in Chetty’s dataset as our “population.” To mimic the process of a real-world study, we’ll start by imagining that we’ve randomly selected a single sample of 100 colleges from this population. Using this sample, our goal will be to examine the relationship between parent median income and child median income among the students.

To estimate the effect of median parent income on median child income, we’ll apply a regression model to our sample data. In doing so, we will use a parametric (or theory-based) approach to assess the uncertainty in our parameter estimates. This approach relies on statistical theory to derive estimates of standard errors and confidence intervals, allowing us to quantify the uncertainty around our point estimates within this sample.

Draw a single random sample

Press Run Code on the code chunk below to draw a random sample of size 100. We’ll set the seed so everyone in the class obtains the same estimates.
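
That chunk likely looks something like the sketch below. The seed value shown is a placeholder for the one used in class, and colleges is the data frame name assumed in the import sketch above.

  # Fix the random number generator so everyone draws the same sample
  set.seed(1234)  # placeholder; use the seed value from the activity

  # Draw a simple random sample of 100 colleges into a data frame named df
  df <- colleges |>
    slice_sample(n = 100)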

Fit the linear regression model

Please perform the following tasks in the code chunk below:

  • First, add a new variable to the df data frame that centers par_median at its sample mean. Call this new variable par_median_centered.

  • Then, fit the linear regression model regressing child median income across the 100 colleges (k_median) on the centered version of parent median income (par_median_centered). Name the model object my_model.

  • Finally, request the tidy() output as usual, but enhance it to include the 95% Confidence Interval (CI) for the parameter estimates. This is accomplished by adding the following arguments to the tidy() function: my_model |> tidy(conf.int = TRUE, conf.level = 0.95). A sketch of the completed chunk follows this list.
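
Here is a sketch of what the completed chunk might look like, assuming the broom package is available for tidy():

  # broom's tidy() turns model output into a data frame
  library(broom)

  # Center parent median income at its sample mean
  df <- df |>
    mutate(par_median_centered = par_median - mean(par_median))

  # Regress child median income on centered parent median income
  my_model <- lm(k_median ~ par_median_centered, data = df)

  # Tidy output, including 95% confidence intervals for the estimates
  my_model |>
    tidy(conf.int = TRUE, conf.level = 0.95)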

Here’s what the results will look like in RStudio. I’ve intentionally suppressed the printing of two quantities in the tidy() output — specifically, the statistic and p-value. We’ll revisit these metrics in Module 16, but for now, let’s focus on quantifying sampling variability (which is the point of the standard error and CI). This variability, or uncertainty in our estimates, is captured by the std.error (standard error) and the confidence interval bounds (conf.low and conf.high), which give us a range of plausible values for our population parameters.

Where are these confidence interval bounds coming from?

When we calculate a confidence interval (CI) for a regression parameter, we’re defining a range within which we can be reasonably confident that the true population parameter lies. In this case, we are constructing a 95% CI for both the intercept and the slope of our regression model. This means that if we were to repeat the random sampling process many times, 95% of the 95% confidence intervals we construct would contain the true population parameter.

This approach aligns with what we observed in simulations, offering an efficient way to estimate where the true parameter likely falls in the population.

Finding the Critical t-Value

To build a 95% CI by hand, we need to determine the critical t-value that reflects the variability of our sample estimates due to sampling error. For a 95% CI, we want to capture the central 95% of the t-distribution, leaving 2.5% in each tail. This t-value depends on our sample size and, specifically, on our degrees of freedom (df), which is calculated as follows for a regression model:

\[ \text{df} = n - 1 - \text{number of predictors} \]

For our sample of 100 colleges and a single predictor, we have:

\[ \text{df} = 100 - 1 - 1 = 98 \]

In R, we can obtain these critical t-values using the command:
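
  # Critical t-values that cut off 2.5% in each tail of a t-distribution
  # with 98 degrees of freedom
  qt(c(0.025, 0.975), df = 98)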

This command will yield the critical t-values needed for a 95% CI, which are approximately ±1.984.

Building the Confidence Interval

The confidence interval is based on the idea that our sample estimate (e.g., the slope or intercept) has some error, expressed as the standard error (SE). The general formula for a confidence interval for a parameter estimate, like the intercept or slope, is:

\[ CI = \hat{\beta} \pm t_{\alpha/2, \text{df}} \times SE_{\hat{\beta}} \]

where:

  • \(\hat{\beta}\) is the estimated parameter (intercept or slope),

  • \(t_{\alpha/2, \text{df}}\) is the critical t-value for our chosen confidence level (95%),

  • \(SE_{\hat{\beta}}\) is the standard error of the parameter estimate.

CI for the Intercept

For our model’s intercept, the confidence interval calculation becomes¹:

\(95\% \text{ CI for } \beta_0 = 36648.000 \pm 1.984 \times 745.876\)

This results in:

\(95\% \text{ CI for } \beta_0 = (35167.833, 38128.167)\)

CI for the Slope

Similarly, for the slope of the model, the confidence interval is calculated as:

\(95\% \text{ CI for } \beta_1 = 0.358 \pm 1.984 \times 0.024\)

This results in:

\(95\% \text{ CI for } \beta_1 = (0.312, 0.405)\)
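
As a quick check, here is a sketch of those two hand calculations in R, plugging in the rounded estimates shown above; the results will differ slightly from the intervals reported here, which were computed with all decimal places (see the footnote).

  # 95% CI for the intercept: estimate ± t * SE
  36648.000 + c(-1, 1) * 1.984 * 745.876

  # 95% CI for the slope: estimate ± t * SE
  0.358 + c(-1, 1) * 1.984 * 0.024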

Interpreting these Confidence Intervals

Each confidence interval represents a plausible range for our parameter estimates based on our sample data. Importantly, in the frequentist framework, a 95% confidence interval means that if we were to repeat this study and draw new samples many times, then estimate the 95% CI in each of those samples, 95% of those intervals would contain the true population parameter.

For instance:

  • The CI for the intercept suggests that if this sampling process were repeated many times, 95% of the resulting 95% confidence intervals would capture the true average child median income (when parent median income is at the mean). Based on our single sample, this plausible range is approximately 35,168 to 38,128 USD.

  • Similarly, the CI for the slope implies that, on average, each additional dollar of parent median income is associated with an increase in child median income, with a plausible range from 0.312 to 0.405 USD. If we repeated the study and recalculated this slope interval many times, 95% of those intervals would capture the true slope in the population.

These confidence intervals help us quantify the precision of our estimates, reflecting the sampling variability that arises from analyzing only a sample rather than the entire population. Thus, they provide insight into the range in which the true parameter values would lie across repeated samples, assuming our sample is representative of the population and meets certain assumptions. In particular, the validity of these intervals relies on assumptions tied to the Central Limit Theorem (CLT): namely, that the errors are independently and identically distributed and that the sample size is large enough for the sampling distribution of the estimates to approximate a normal distribution. When these assumptions hold, our confidence intervals provide a reliable estimate of the uncertainty around the true population parameters. We’ll dig further into assumption checking and remediation measures for regression models in Module 17.

Uncertainty for the regression line

When fitting a regression line, we not only get an estimate of the relationship between our predictor (parent median income) and outcome (child median income), but we can also assess the uncertainty of this relationship across the entire range of predictor values. The shaded area around the regression line is called the confidence band.

In other words, this confidence band provides an area depicting a plausible range for where the true regression line could lie. In frequentist terms, a 95% confidence band means that if we were to repeatedly draw samples and refit the model, 95% of those confidence bands would contain the true population regression line.

Press Run Code on the code chunk below to create the confidence band.
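
That chunk is essentially a ggplot2 scatterplot with geom_smooth(); here is a sketch of what it might contain:

  # Scatterplot of the sample with the fitted line and its 95% confidence band
  ggplot(df, aes(x = par_median_centered, y = k_median)) +
    geom_point() +
    geom_smooth(method = "lm", se = TRUE, level = 0.95)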

If the confidence band is narrow, we have a higher certainty about the predicted means (i.e., the y-hats) along the regression line. If it’s wider, there’s more uncertainty about the line’s true position.

Uncertainty for predicted scores

In regression analysis, we can predict outcomes based on our model and examine both confidence intervals (CI) and prediction intervals (PI) for those predictions. A confidence interval provides a range within which the average predicted outcome is expected to fall for a given value of the predictor variable. In contrast, a prediction interval gives a wider range that accounts for individual variation, estimating where 95% of individual outcomes will likely fall.

In this example, we’ll predict the child median income for a college where the median parent income is one standard deviation below the mean of parent median income in our sample. We’ll use the predict() function to generate both the confidence and prediction intervals for this prediction.

Step 1: Calculate Standard Deviation of parent median income

First, we calculate the standard deviation of par_median in our sample and then define the exact predictor values at which we want predictions from the model. In this case, we’re creating predict_df to include a specific value of par_median_centered — specifically, one standard deviation below the mean. (The mean is implicit because par_median_centered is centered at zero, which represents the sample mean.)
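
A sketch of that chunk is below; sd_par is simply a name chosen here for the sample standard deviation.

  # Standard deviation of parent median income in the sample
  sd_par <- sd(df$par_median)

  # One SD below the mean on the centered scale (the mean is 0 after centering)
  predict_df <- tibble(par_median_centered = -sd_par)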

Step 2: Generate the 95% Confidence Interval (CI)

We’ll use the predict() function to obtain the predicted value (called fit in the output) and the lower and upper bounds of the Confidence Interval (CI) (called lwr and upr in the output).

Press Run Code to obtain these estimates.
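
The chunk amounts to a call to predict() with interval = "confidence"; a sketch:

  # Predicted mean child income (fit) with 95% CI bounds (lwr, upr)
  predict(my_model, newdata = predict_df,
          interval = "confidence", level = 0.95)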

The 95% confidence interval (CI) is approximately (23,162, 27,359), centered around a predicted value of 25,261. This means that if we were to repeatedly take samples of 100 colleges and recalculate this CI, about 95% of those intervals would contain the true average median child income for colleges where the median parent income is one standard deviation below the mean. Importantly, this CI doesn’t guarantee that the true mean lies within this specific interval — it just reflects the interval’s reliability across repeated sampling.

Step 3: Generate the 95% Prediction Interval (PI)

We also use the predict() function to obtain the predicted value and the lower and upper bounds of the Prediction Interval (PI).

Press Run Code to obtain these estimates.
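
The only change from the previous chunk is interval = "prediction"; a sketch:

  # Predicted child income for a single college with 95% PI bounds (lwr, upr)
  predict(my_model, newdata = predict_df,
          interval = "prediction", level = 0.95)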

The 95% prediction interval (PI) is approximately (10,311, 40,211). This wider range accounts for the additional uncertainty in predicting specific outcomes for individual colleges. Here, if we repeatedly sampled from the population and recalculated this PI, 95% of such intervals would contain the child median income for a single college where the median parent income is one standard deviation below the mean.

The PI is broader than the CI because it incorporates not only the uncertainty in estimating the mean income but also the natural variability around that mean across individual colleges. This is crucial for understanding that while we can be relatively confident about the mean outcome within a narrower range, individual outcomes (in this case, individual colleges) will span a wider range due to case-to-case (i.e., college-to-college) variability.

Your turn

In the code chunk below, please write the code to calculate the 99% CI and PI for the predicted outcome when parent median income is 1/2 of a standard deviation above the mean in the sample.

Footnotes

  1. The CIs were calculated using all decimal places in the R model objects and may not precisely match a hand calculation.