Estimating Uncertainty in Regression Models
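We simulated data on sleep efficiency predicted by alcohol consumption (in grams) and Sleep Hygiene Index (SHI) score.
We fit a simple linear regression (SLR) and a multiple linear regression (MLR) using data from one sample.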
SLR: \(se_i = b_0 + b_1 \times \text{alcohol}_i + e_i\)
MLR: \(se_i = b_0 + (b_1 \times \text{alcohol}_i) + (b_2 \times \text{shi}_i) + e_i\)
Today’s focus: Those parameter estimates came from one sample. How much uncertainty surrounds them?
Let’s recreate the SLR (here are the results from the prior activity)
Every parameter estimate from your model is just one realization from a sampling distribution.
Estimates that vary from sample to sample include the intercept (\(b_0\)) and the slope (\(b_1\)).
Question: If we collected new data tomorrow, how different would these estimates be?
Let’s demonstrate this uncertainty by simulating multiple samples and fitting the SLR model repeatedly.
For now, we’ll focus on the simple model (just alcohol predicting sleep efficiency).
Press Run Code on the code chunk below to set up a function to create not one, but many, simulated data sets. This will let us visualize sample-to-sample variability that could be expected.
Now, we can use the function to create 1000 simulated data frames, fit the model (i.e., regression of se on alcohol) in each data frame, and retain the parameter estimates.
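A minimal sketch of these two steps in base R. The function name `simulate_sleep_data()` and the population values (intercept 80, slope -0.3, residual SD 10, alcohol spread uniformly over 0 to 60 grams) are assumptions chosen for illustration, not the activity's own code or data. Only the sample size of 225 and the 1000 replicates come from the text.

```r
# Hypothetical data-generating function (illustrative, not the activity's code).
# Assumed population model: se = 80 - 0.3 * alcohol + noise, with noise SD = 10.
simulate_sleep_data <- function(n = 225, b0 = 80, b1 = -0.3, sigma = 10) {
  alcohol <- runif(n, min = 0, max = 60)          # grams of alcohol (assumed range)
  se <- b0 + b1 * alcohol + rnorm(n, sd = sigma)  # sleep efficiency (%)
  data.frame(alcohol = alcohol, se = se)
}

set.seed(2024)  # make the simulation reproducible

# Fit the SLR in each of 1000 simulated samples and keep both estimates.
estimates <- t(replicate(
  1000,
  coef(lm(se ~ alcohol, data = simulate_sleep_data()))
))
colnames(estimates) <- c("intercept", "slope")

head(estimates)          # one row of estimates per simulated sample
apply(estimates, 2, sd)  # spread of each simulated sampling distribution
```

Each row of `estimates` is what one hypothetical research team would have reported. The column standard deviations are a simulation-based stand-in for the standard errors.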
Let’s visualize the sampling distribution for each parameter estimate.
Our single sample gave us one set of estimates.
But there’s an entire sampling distribution for each parameter.
The spread of these distributions = uncertainty
Now: How do we quantify this uncertainty in practice?
Today we’ll focus on #2: Parametric approach
(You studied all three in Module 12!)
For the slope in SLR:
\[ SE_{b_1} = \sqrt{ \frac{ \text{SSE} / (n - 2) }{ \sum_{i=1}^{n} (x_i - \bar{x})^2 } } \]
For the intercept:
\[ SE_{b_0} = \sqrt{ \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right) \cdot \frac{ \text{SSE} }{ n - 2 } } \]
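Where \(\text{SSE} = \sum_{i=1}^{n} e_i^2\) is the sum of squared residuals, \(n\) is the sample size, and \(\bar{x}\) is the sample mean of the predictor.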
The standard error is an estimate of the standard deviation of the sampling distribution for that parameter.
Smaller SE → More precise estimate
Larger SE → More uncertain estimate
Key: SEs allow us to construct confidence intervals and conduct hypothesis tests (a task we’ll take on in Module 16)!
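The slope and intercept formulas above can be checked by hand against what `lm()` reports. This is a sketch on a hypothetical simulated sample (the population values are assumptions for illustration), so the numbers will not match the activity's output.

```r
set.seed(1)

# Hypothetical sample (illustrative values, not the activity's data).
n <- 225
alcohol <- runif(n, min = 0, max = 60)
se <- 80 - 0.3 * alcohol + rnorm(n, sd = 10)  # sleep efficiency (%)

model <- lm(se ~ alcohol)

# Pieces of the formulas
sse <- sum(resid(model)^2)               # sum of squared residuals
sxx <- sum((alcohol - mean(alcohol))^2)  # sum of squared deviations of x
mse <- sse / (n - 2)                     # estimated residual variance

se_b1 <- sqrt(mse / sxx)                              # slope SE
se_b0 <- sqrt((1 / n + mean(alcohol)^2 / sxx) * mse)  # intercept SE

# Compare with the standard errors reported by summary()
by_hand  <- c(se_b0, se_b1)
reported <- unname(summary(model)$coefficients[, "Std. Error"])
all.equal(by_hand, reported)
```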
SLR intercept standard error (1.53): The standard error of 1.53 tells us that if we repeatedly sampled 225 people and fit this model each time, the intercept estimates would typically vary by about 1.53 percentage points from sample to sample. This represents our uncertainty about the true average sleep efficiency when alcohol consumption is zero.
Slope standard error (0.076): The standard error of 0.076 tells us that if we repeatedly sampled 225 people and fit this model each time, the slope estimates (the effect of alcohol on sleep efficiency) would typically vary by about 0.076 percentage points per gram of alcohol from sample to sample. This represents our uncertainty about the true relationship between alcohol consumption and sleep efficiency.
MLR intercept standard error (1.68): The standard error of 1.68 tells us that if we repeatedly sampled 225 people and fit this model each time, the intercept estimates would typically vary by about 1.68 percentage points from sample to sample. This represents our uncertainty about the true average sleep efficiency when both alcohol consumption and the SHI score are zero.
Alcohol slope standard error (0.055): The standard error of 0.055 tells us that if we repeatedly sampled 225 people and fit this model each time, the slope estimates for alcohol would typically vary by about 0.055 percentage points per gram of alcohol from sample to sample. This represents our uncertainty about the relationship between alcohol consumption and sleep efficiency, holding sleep hygiene constant. Notice this SE (0.055) is smaller than in the simple linear regression (0.076), suggesting that accounting for sleep hygiene has helped us estimate the alcohol effect more precisely.
Sleep hygiene slope standard error (0.050): The standard error of 0.050 tells us that if we repeatedly sampled 225 people and fit this model each time, the slope estimates for sleep hygiene would typically vary by about 0.050 percentage points per unit of the SHI from sample to sample. This represents our uncertainty about the relationship between sleep hygiene and sleep efficiency, holding alcohol consumption constant.
General formula for a regression coefficient:
\[ \text{CI} = \hat{\beta} \pm t_{\alpha/2, \, df} \times SE(\hat{\beta}) \]
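Where \(\hat{\beta}\) is the estimated coefficient, \(t_{\alpha/2, \, df}\) is the critical value from the \(t\) distribution, and \(SE(\hat{\beta})\) is the coefficient's standard error.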
For a 95% confidence interval, \(\alpha = 0.05\), so \(\alpha/2 = 0.025\), representing the probability in each tail of the \(t\) distribution. The \(t\) value depends on the chosen confidence level and the degrees of freedom (\(df\)).
\(\alpha\) represents the total probability that the confidence interval does not contain the true population parameter — in other words, the proportion of times we’d expect the interval to miss the true value if we repeated the study many times.
For our MLR with 225 observations and 2 predictors:
\[df = 225 - 2 - 1 = 222\]
For a 95% CI, we want the value that captures the middle 95%:
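One way to obtain that value in base R is the \(t\) quantile function `qt()`. The variable names here are illustrative.

```r
df_resid <- 225 - 2 - 1             # n - number of predictors - 1
t_crit <- qt(0.975, df = df_resid)  # cuts off the upper 2.5% tail
t_crit                              # compare with the 1.970707 used in the 95% CI
```

Asking for the 0.975 quantile leaves 2.5% in each tail, which is the middle 95% of the distribution.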
From tidy() output:
95% CI:
\[-0.3176377 \pm 1.970707 \times 0.05471130\]
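Carrying the arithmetic through, with the margin of error rounded to four decimals and the bounds to three:

\[ 1.970707 \times 0.05471130 \approx 0.1078 \]

\[ -0.3176377 - 0.1078 \approx -0.425, \qquad -0.3176377 + 0.1078 \approx -0.210 \]

So the 95% CI for the alcohol slope is approximately \((-0.425, \; -0.210)\).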
tidy() with conf.int = TRUE
Procedure-based (most technically correct):
Practical shorthand (commonly used):
When predicting \(y\) at a specific value of \(x\):
1. Confidence Interval (CI) for the mean response
2. Prediction Interval (PI) for a new observation
\(\hat{y} = 60.4131667 + (-0.3176377 \times \text{alcohol}) + (0.7907787 \times \text{shi})\)
\(\hat{y} = 60.4131667 + (-0.3176377 \times 30) + (0.7907787 \times 10)\)
\(\hat{y} = 60.4131667 - 9.529131 + 7.907787\)
\(\hat{y} = 58.7918227\)
So the predicted sleep efficiency is approximately 58.8% for someone who consumed 30 grams of alcohol and has a sleep hygiene index score of 10.
Specify desired levels of the predictors:
Use model results and new_data to compute prediction:
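A sketch of both steps in base R. The data frame `sleep` and its generating values are hypothetical stand-ins (assumed for illustration), so the resulting intervals will differ from the ones reported in this activity. The `predict()` arguments `newdata`, `interval`, and `level` are the standard ones for `lm` objects.

```r
set.seed(42)

# Hypothetical stand-in for the activity's data (illustrative values only).
n <- 225
sleep <- data.frame(
  alcohol = runif(n, min = 0, max = 60),  # grams of alcohol
  shi     = runif(n, min = 0, max = 30)   # sleep hygiene index score
)
sleep$se <- 60 - 0.3 * sleep$alcohol + 0.8 * sleep$shi + rnorm(n, sd = 5)

model <- lm(se ~ alcohol + shi, data = sleep)

# Desired levels of the predictors
new_data <- data.frame(alcohol = 30, shi = 10)

# 95% CI for the mean response at those predictor values
conf_int <- predict(model, newdata = new_data,
                    interval = "confidence", level = 0.95)

# 95% PI for one new individual at those predictor values
pred_int <- predict(model, newdata = new_data,
                    interval = "prediction", level = 0.95)

conf_int  # columns: fit, lwr, upr
pred_int  # same fit, wider bounds
```

Both calls return the same point prediction in the `fit` column. Only the `lwr` and `upr` bounds change with the `interval` argument.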
Interpretation: We are 95% confident that the average sleep efficiency for people with alcohol = 30 grams and a SHI score of 10 lies between 57.4% and 60.2%.
Interpretation: We are 95% confident that one individual with alcohol = 30 grams and SHI = 10 will have sleep efficiency between 49.5% and 68.0%.
Much wider! Accounts for person-to-person variability.
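The extra width can be seen in the standard errors behind the two intervals. In the standard least-squares results (the symbols \(s^2\) and \(p\) are notation introduced here, with \(p\) the number of predictors):

\[ SE_{\text{mean}} = SE(\hat{y}), \qquad SE_{\text{pred}} = \sqrt{ SE(\hat{y})^2 + s^2 }, \qquad s^2 = \frac{\text{SSE}}{n - p - 1} \]

The CI uses only the uncertainty in the fitted mean, while the PI adds the residual variance \(s^2\). Collecting more data shrinks \(SE(\hat{y})\), but \(s^2\) stays roughly the same, so a prediction interval never narrows to zero.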
Confidence Interval
Prediction Interval
Sampling variability: Parameter estimates change from sample to sample
Standard errors: Quantify uncertainty in coefficient estimates
Confidence intervals:
tidy(model, conf.int = TRUE)
Prediction intervals:
predict() with interval = "confidence" or "prediction"