Lecture Materials for Week 13

November 17, 2025

Exam Day

Study Guide for Final Exam

R Functions for Model Building & Summarizing

Be able to:

  • Define what each function does (a sketch tying them together follows this list):
    • lm() - Fits linear models
    • tidy() (from broom) - Extracts parameter estimates (coefficients, standard errors, p-values)
    • glance() (from broom) - Provides model-level statistics (R², sigma, F-statistic, etc.)
    • augment() (from broom) - Adds fitted values and residuals to the original data
    • set_variable_labels() (from labelled) - Assigns descriptive labels to variables
    • tbl_summary() (from gtsummary) - Creates publication-ready descriptive tables
    • Functions from the marginaleffects package: predictions(), plot_slopes()
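
A minimal sketch tying these functions together; the mtcars data set and variable choices here are illustrative, not from the course:

  library(broom)   # tidy(), glance(), augment()

  fit <- lm(mpg ~ wt + hp, data = mtcars)   # fit a linear model

  tidy(fit)      # one row per coefficient: estimate, std.error, statistic, p.value
  glance(fit)    # one row of model-level stats: r.squared, sigma, statistic (F), ...
  augment(fit)   # original data plus .fitted values and .resid residuals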

Study Design Types

Understand the difference between:

  • Descriptive studies: Summarizing patterns, prevalence, or characteristics in a population
  • Predictive studies: Forecasting future outcomes using existing data
  • Causal inference/explanation: Establishing cause-and-effect relationships (typically RCTs or well-designed observational studies)

Core Regression Concepts

Review:

  • Confidence Interval (CI) vs. Prediction Interval (PI) (see the predict() sketch after this list):

    • CI: estimates the mean of Y|X (narrower)
    • PI: estimates an individual Y|X (wider, accounts for both mean uncertainty and individual variability)
    • Both are narrowest at the mean of X
  • sigma: The standard deviation of residuals (from glance())

    • sigma is typically smaller than SD(Y) because the model explains some variance
    • Lower sigma indicates better model fit
  • Train/test splits: Provide unbiased estimates of out-of-sample predictive accuracy

  • Standard error: The standard deviation of the sampling distribution of a parameter estimate

  • Adding confidence bands in ggplot: geom_smooth(method="lm", se=TRUE, level=.95)
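
A short sketch contrasting the two intervals and adding the ggplot confidence band; the model and values are illustrative:

  fit <- lm(mpg ~ wt, data = mtcars)
  new <- data.frame(wt = 3)

  predict(fit, new, interval = "confidence")   # CI for the mean of Y at wt = 3 (narrower)
  predict(fit, new, interval = "prediction")   # PI for an individual Y at wt = 3 (wider)

  # Confidence band around the fitted line:
  library(ggplot2)
  ggplot(mtcars, aes(wt, mpg)) +
    geom_point() +
    geom_smooth(method = "lm", se = TRUE, level = .95)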

Null Hypothesis Significance Testing (NHST)

Know:

  • Null hypothesis (H₀): Baseline assumption of no effect/difference

  • Alternative hypothesis (Hₐ): Statement of an effect or difference

  • Alpha (α): Pre-set Type I error rate (e.g., .05)

  • p-value: P(data or more extreme | H₀)

    • If p < α, reject H₀
    • p-value is NOT the probability that H₀ is true
  • Type I error: Rejecting a true H₀ (false positive)

  • Type II error: Failing to reject a false H₀ (false negative, missing a real effect)

  • Two-tailed tests: Used when no directional hypothesis is specified

  • Statistical vs. practical significance: A result can be statistically significant but not meaningful in practice

  • Rejection regions: If test statistic falls in rejection region, p < α

  • Permutation tests: Approximate the null distribution by shuffling/permuting labels (see the sketch after this list)

  • Overall F-test: Tests whether the model as a whole explains a significant portion of variance in Y (H₀: all slope coefficients equal 0)
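
A minimal permutation-test sketch for a two-group mean difference; the data are simulated purely for illustration:

  set.seed(1)
  y     <- c(rnorm(20, mean = 10), rnorm(20, mean = 11))   # outcome
  group <- rep(c("A", "B"), each = 20)                     # group labels

  obs <- diff(tapply(y, group, mean))   # observed mean difference (B - A)

  perm <- replicate(5000, {
    shuffled <- sample(group)           # shuffling breaks any real association
    diff(tapply(y, shuffled, mean))
  })

  mean(abs(perm) >= abs(obs))           # two-sided p-value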

Confidence Intervals from Model Output

Be able to:

  • Calculate a 95% CI using: estimate ± (critical t-value × SE)
  • Use qt() to find critical values for the appropriate degrees of freedom
  • Interpret CIs: If CI excludes the null value (typically 0), reject H₀
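
A sketch of the by-hand calculation, assuming a slope estimate and SE read off tidy() output (the numbers are made up):

  est <- 1.8    # example coefficient estimate
  se  <- 0.45   # its standard error
  df  <- 98     # residual degrees of freedom (n - k - 1)

  t_crit <- qt(.975, df)          # critical t for a 95% CI
  est + c(-1, 1) * t_crit * se    # lower and upper bounds
  # If this interval excludes 0, reject H0 at alpha = .05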

Variance Decomposition & R²

Understand:

  • SST (Total Sum of Squares): Total variance in Y
  • SSR (Regression Sum of Squares): Variance explained by the model
  • SSE (Error Sum of Squares): Unexplained variance (residual variance)
  • R² = SSR/SST: Proportion of variance explained
  • Venn diagrams: Visual representation of shared and unique variance
    • In multiple regression, predictors can share variance (overlap) and have unique contributions

Review:

  • R² increases (or stays the same) when adding predictors
  • sigma decreases when model fit improves
  • OLS chooses coefficients to minimize SSE; SSE would be zero only if every fitted value equaled its observed value (a perfect fit)
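
A sketch verifying the decomposition by hand (the model is illustrative):

  fit <- lm(mpg ~ wt, data = mtcars)

  sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares
  sse <- sum(resid(fit)^2)                        # error (residual) sum of squares
  ssr <- sst - sse                                # regression sum of squares

  ssr / sst   # R-squared; matches summary(fit)$r.squared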

Bootstrap Hypothesis Testing

Be able to:

  • Compute a two-sided p-value from bootstrap results
  • Interpret the standard deviation of the bootstrap distribution as the standard error
  • Calculate how many SEs the observed statistic is from the null
  • Make decisions based on p-value and α
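
A minimal bootstrap sketch for a regression slope, using a normal approximation to get the two-sided p-value (the model and null value of 0 are illustrative):

  set.seed(1)
  boot_slopes <- replicate(2000, {
    d <- mtcars[sample(nrow(mtcars), replace = TRUE), ]   # resample rows with replacement
    coef(lm(mpg ~ wt, data = d))["wt"]
  })

  obs     <- coef(lm(mpg ~ wt, data = mtcars))["wt"]
  se_boot <- sd(boot_slopes)       # bootstrap SE = SD of the bootstrap distribution
  z       <- (obs - 0) / se_boot   # how many SEs the observed slope is from the null
  2 * pnorm(-abs(z))               # approximate two-sided p-value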

Interaction Models

Understand:

  • Interaction term: Tests whether the effect of one predictor depends on another predictor
  • Simple slopes: The effect of X₁ on Y at specific values of X₂
  • Centering predictors: Makes interpretation easier (intercept = predicted Y when all predictors are at their mean)
  • Interpreting coefficients:
    • Main effects in the presence of an interaction are “conditional” effects: each is the slope when the other predictor equals 0 (or its mean, if centered)
    • The interaction coefficient shows how one predictor’s slope changes per 1-unit increase in the other

Be able to:

  • Compute simple slopes from model output
  • Determine if an interaction is statistically significant
  • Interpret what an interaction means substantively
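
A sketch of fitting an interaction and probing simple slopes with marginaleffects; the variable choices are illustrative:

  library(marginaleffects)

  fit <- lm(mpg ~ wt * hp, data = mtcars)   # main effects of wt and hp plus their interaction
  summary(fit)                              # the wt:hp row tests whether wt's slope depends on hp

  # Simple slopes of wt at chosen values of hp:
  slopes(fit, variables = "wt", newdata = datagrid(hp = c(100, 200)))
  plot_slopes(fit, variables = "wt", condition = "hp")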

Transformed Outcomes

Log-transformed Y:

  • Slope interpretation: A 1-unit increase in X is associated with approximately a (slope × 100)% change in Y (the approximation is good for small slopes)
  • Use 100 * (exp(slope) - 1) to convert to the exact percentage change
  • Back-transform with exp(): exp(log(Y)) = Y
  • Intercept: predicted log(Y) when X = 0
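
A sketch of the slope conversion (the model is illustrative):

  fit <- lm(log(mpg) ~ wt, data = mtcars)
  b1  <- coef(fit)["wt"]

  100 * (exp(b1) - 1)   # exact percentage change in Y per 1-unit increase in X
  100 * b1              # quick approximation, close when the slope is small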

Quadratic models (Y ~ X + X²):

  • Model curvilinear relationships
  • Vertex: The x-value where the curve reaches its maximum or minimum
    • Vertex x-coordinate = -b₁/(2 × b₂) where b₁ is the linear term and b₂ is the quadratic term
  • U-shaped: Positive quadratic term
  • Inverted U-shaped: Negative quadratic term
  • Interpret: The effect of X on Y changes across the range of X
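
A sketch of locating the vertex from model output (the model is illustrative):

  fit <- lm(mpg ~ hp + I(hp^2), data = mtcars)
  b   <- coef(fit)

  -b["hp"] / (2 * b["I(hp^2)"])   # vertex x-coordinate = -b1 / (2 * b2)
  # A negative quadratic coefficient gives an inverted U (a maximum at the vertex)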

Multiple Regression & Confounding

Key concepts:

  • Confounders: Variables that affect both the predictor and outcome, creating spurious associations
  • Adjusted effects: The effect of X on Y, holding other variables constant
  • Change in R²: Additional variance explained by adding predictors
    • ΔR² = R²(full model) - R²(reduced model)
  • Parallel slopes model: Model with multiple predictors but no interaction
    • Lines for different groups are parallel (same slope, different intercepts)
  • Residualized gain: Using baseline as a covariate when analyzing change

In RCTs:

  • Randomization ensures confounders are balanced across groups (in expectation)
  • Treatment effect can be interpreted causally
  • Baseline covariates improve precision but aren’t confounders

Be able to:

  • Compare unadjusted vs. adjusted effects
  • Identify whether a variable is a confounder
  • Interpret whether controlling for a variable changes conclusions
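
A sketch comparing unadjusted and adjusted effects; am stands in for a potential confounder, and the variables are illustrative:

  unadj <- lm(mpg ~ wt, data = mtcars)        # unadjusted effect of wt
  adj   <- lm(mpg ~ wt + am, data = mtcars)   # effect of wt, holding am constant

  coef(unadj)["wt"]
  coef(adj)["wt"]   # a marked change suggests am was distorting the unadjusted estimate

  summary(adj)$r.squared - summary(unadj)$r.squared   # change in R² from adding am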

Using qt() and Degrees of Freedom

Know:

  • For simple regression: df = n - 2
  • For multiple regression: df = n - k - 1 (where k = number of predictors)
  • Use qt(c(.025, .975), df) for two-sided 95% CI critical values
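
For example, with n = 100 observations and k = 1 predictor:

  qt(c(.025, .975), df = 100 - 1 - 1)   # two-sided 95% critical values with df = 98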

Computing Predictions from Models

Be able to:

  • Use regression equation to predict Y for given X values
  • Predicted Y = intercept + (slope₁ × X₁) + (slope₂ × X₂) + …
  • For categorical predictors coded 0/1, the coefficient is the difference between groups
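
A sketch of plugging values into a fitted equation; the coefficients are made up for illustration:

  # Suppose the model output gives: intercept = 50, slope1 = -2, slope2 = 0.5
  x1 <- 3
  x2 <- 10
  50 + (-2 * x1) + (0.5 * x2)   # predicted Y = 50 - 6 + 5 = 49

  # With a fitted model object, predict() does the same arithmetic:
  # predict(fit, newdata = data.frame(x1 = 3, x2 = 10))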

Practice Tips

  • Work through regression outputs: practice interpreting coefficients, SEs, R², and sigma
  • Calculate confidence intervals by hand using estimate ± (critical t × SE)
  • Practice identifying study types (descriptive, predictive, causal)
  • Draw Venn diagrams to understand variance decomposition
  • Interpret interaction models: compute simple slopes and understand what the interaction means
  • Practice transformations: know how to interpret log-transformed outcomes and quadratic terms
  • Work through examples of confounding: compare models with and without potential confounders
  • Use the bootstrap examples to practice computing p-values and making decisions
  • Understand the relationship between CIs, hypothesis tests, and p-values

What Won’t be on the Exam?

  • Bayesian statistics
  • Writing R code from scratch
  • Complex derivations

The exam focuses on interpreting output and applying concepts, not programming.

Test Structure

  • 100 questions total
  • Mix of True/False, Multiple Choice, and Matching
  • Bring a pencil and basic calculator
  • You can use four 3×5 index cards with handwritten notes (You’ll receive these in lecture)
  • Focus on understanding concepts and being able to interpret statistical output