Multiple Linear Regression

Module 11

Artwork by @allison_horst

Learning objectives

  • Understand the benefits of multiple predictors in enhancing predictions
  • Formulate and understand the Multiple Linear Regression (MLR) equation
  • Implement MLR in R
  • Comprehend and properly interpret the meaning of intercept and slope values in MLR
  • Calculate and interpret predicted outcomes and residuals using the MLR model
  • Grasp the significance of \(R^2\) in a MLR context

Overview

Building on your foundational knowledge of Simple Linear Regression (SLR), this Module will expand your understanding to scenarios where more than one predictor variable is used to explain variation in an outcome variable.

In social and behavioral sciences, our phenomena of interest are often predicted or influenced by a combination of factors. Multiple linear regression allows us to explore these complex relationships by including multiple predictor (i.e., X) variables in our models. This enables us to control for potential confounding variables, assess the relative importance of different predictors, and/or improve the accuracy of our predictions.

In this Module we will continue working with Hibbs’s Bread and Peace Model, which asserts that in a US presidential election, the likelihood that the incumbent party maintains power is dependent on the economic growth experienced during the prior term and the loss of military troops due to war. The former increases favor for the incumbent party, while the latter decreases favor.

In Module 10, we fit a SLR (i.e., one predictor) to determine how well growth in income of US residents during the preceding presidential term predicts the share of the vote that the incumbent party receives. The growth predictor constitutes the bread component of the Bread and Peace Model. In this Module, we will build on the SLR and add fatalities as an additional predictor. The fatalities predictor constitutes the peace component of the Bread and Peace Model.

Recap of the data

The data frame for this Module, compiled by Drew Thomas (2020), is called bread_peace.Rds. Each row of the data represents a US presidential election. The following variables are in the data frame:

  • year: Presidential election year
  • vote: Percentage share of the two-party vote received by the incumbent party’s candidate
  • growth: The quarter-on-quarter percentage rate of growth of per capita real disposable personal income, expressed at annual rates
  • fatalities: The cumulative number of American military fatalities per million of the US population
  • wars: A list of the wars of the term if fatalities > 0
  • inc_party_candidate: The name of the incumbent party candidate
  • other_party_candidate: The name of the other party candidate
  • inc_party: An indicator of the incumbent party (D = Democrat, R = Republican)

To get started, we need to load the following packages:

library(gtsummary)
library(skimr)
library(here)
library(broom)
library(tidyverse)

Now, we can import the data called bread_peace.Rds.

bp <- read_rds(here("data", "bread_peace.Rds"))
bp

Each row of the data frame represents a different presidential election — starting in 1952 and ending in 2016. There are 17 elections in total to consider.

Let’s obtain some additional descriptive statistics with skim().

bp |> skim()
Data summary
Name bp
Number of rows 17
Number of columns 8
_______________________
Column type frequency:
character 4
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
wars 0 1 4 18 0 6 0
inc_party_candidate 0 1 4 11 0 15 0
other_party_candidate 0 1 4 11 0 17 0
inc_party 0 1 1 1 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 1984.00 20.20 1952.00 1968.00 1984.00 2000.00 2016.00 ▇▆▆▆▇
vote 0 1 51.99 5.44 44.55 48.95 51.11 54.74 61.79 ▅▇▃▁▃
growth 0 1 2.34 1.28 0.17 1.43 2.19 3.26 4.39 ▆▆▇▇▆
fatalities 0 1 23.64 62.89 0.00 0.00 0.00 4.30 205.60 ▇▁▁▁▁

Bread and Peace Model 1 — A simple linear regression

Let’s take a moment to revisit the model and results that we fit in Module 10, which we called bp_mod1. In this model, vote was regressed on growth via a SLR.

bp_mod1 <- lm(vote ~ growth, data = bp)

bp_mod1 |> tidy() |> select(term, estimate)
bp_mod1 |> glance() |> select(r.squared, sigma)
bp |> 
  ggplot(mapping = aes(x = growth, y = vote)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x,  se = FALSE, color = "#2F9599") +
  theme_minimal() +
  labs(title = "Bread and Peace Voting in US Presidential Elections 1952 - 2016", 
       x = "Annualized per capita real income growth over the term (%)", 
       y = "Incumbent party share of two-party vote (%)")

Recall that the model with only growth in income explains about 49% of the variability in the share of the vote garnered by the incumbent party. That’s a sizable proportion of variance, but we might be able to boost that variance explained, and thus our ability to predict the outcome, by considering the second element of the Bread and Peace model. The second element considers the number of war-related fatalities of American troops that took place during the prior term. This new predictor is called fatalities.

Building a MLR for Bread and Peace

Building on the SLR model, let’s begin our exploration by taking a look at the plot below. I’ve changed a couple of things from the scatter plot of vote and growth that we just observed. First, I added an additional aesthetic by mapping the number of war-related fatalities (per million US population) to the size of the points (the variable is called fatalities). The larger the point, the more fatalities. For years in which fatalities was greater than 0, I also labeled the point with the corresponding war(s).

It’s informative to evaluate this augmented plot alongside the residuals from our first regression model (the model with growth as the only predictor — called bp_mod1). In the code below I extract these scores (using the augment() function) and then sort them by the size of the residual. Take a look at the data table below.

pts <- bp_mod1 |> 
  augment(data = bp) 

pts |> 
  select(year, vote, growth, fatalities, wars, .resid) |> 
  arrange(.resid) 

Recall that the residual is the difference between the case’s observed score and the case’s predicted score. If the model perfectly predicts the case’s score, the predicted and observed values are identical, and the residual is zero. The further the residual is from zero (in either the positive or negative direction), the larger the discrepancy between what our model predicts the score will be and the actual observed score.
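We can confirm this definition directly with the pts data frame created above, by computing the observed score minus the fitted score and comparing it to the .resid column:

```r
# Sanity check: .resid should equal the observed vote minus the fitted value
pts |> 
  mutate(check = vote - .fitted) |> 
  select(year, vote, .fitted, .resid, check)
```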

Notice that the two largest residuals (which are for 1952 and 1968) are both negative — indicating the model based just on growth in income over-predicts the vote share that the incumbent party garnered in these years. I colored these points in yellow in the figure above to highlight them. Viewing the figure and the table above, we see that these two elections had something in common — during the preceding presidential term, there was a large loss of American military troops to war. Specifically, in the Korean war in 1952 and the Vietnam war in 1968.

The Bread and Peace model asserts that this large loss of American troops will drive down the public’s affinity for the incumbent party during the next election. Thus, both growth in income and the loss of American soldiers’ lives should be considered when predicting the incumbent party’s vote share.

We’re going to build on our SLR model (bp_mod1), by adding an additional predictor — fatalities (we’ll call this model bp_mod2). In order to compare the predicted and residual values across models, I am going to request these values from the first model, but I am going to save them in a new data frame, and rename the values so that they have a name that is specific to model 1. Specifically, I will give them a mod1 extension, and save the result as a data frame called bp_mod1_pred to be used later.

bp_mod1_pred <- 
  bp_mod1 |> 
  augment(data = bp) |> 
  rename(.fitted_mod1 = .fitted, .resid_mod1 = .resid) |> 
  select(year, .fitted_mod1, .resid_mod1)

bp_mod1_pred

Bread and Peace Model 2 — A multiple linear regression

Now, let’s fit a second model that adds fatalities to the regression model as an additional predictor. Adding fatalities to our regression model changes our equation. The equations for Model 1 (without fatalities), and Model 2 (with fatalities) are presented below, where \(x_1\) is growth and \(x_2\) is fatalities.

Model 1: \(\hat{y_i} = b_{0} + b_{1}x_{1i}\)

Model 2: \(\hat{y_i} = b_{0} + b_{1}x_{1i} + b_{2}x_{2i}\)

Notice that in Model 2 we now have two predictors \(x_1\) and \(x_2\) — representing growth and fatalities respectively. Therefore, Model 2 has three regression parameter estimates:

  • \(b_{0}\) is the intercept — the predicted value of Y when all X variables equal 0 (i.e., both growth and fatalities)
  • \(b_{1}\) is the fitted slope for the first predictor, growth
  • \(b_{2}\) is the fitted slope for the second predictor, fatalities

Fit a multiple linear regression model

In order to add an additional predictor to our lm() model in R, we just list the new predictor after the first with a + sign in between.

bp_mod2 <- lm(vote ~ growth + fatalities, data = bp)

Obtain the overall model summary

To begin, let’s examine the \(R^2\), printed in the glance() output. Recall that the \(R^2\) indicates the proportion of the variability in the outcome that is explained by the predictors in the model.

bp_mod2 |> glance() |> select(r.squared, sigma)

Notice that in comparing the SLR model (bp_mod1) to the MLR model (bp_mod2), the \(R^2\) has increased substantially with the addition of fatalities, going from about 49% in the SLR model to about 77% in the MLR model. That is, the model with fatalities explains quite a bit more of the variability in the vote shares than growth in income alone.

Obtain the regression parameter estimates

Now, let’s take a look at the estimates of the intercept and slope for Model 2.

bp_mod2 |> tidy() |> select(term, estimate)


With one predictor, our SLR model yields a regression line which is described by an intercept and a slope. With two predictors, our multiple linear regression model yields a regression plane. Each case has a score on both predictor variables, and these are used together to predict the outcome.

\[ \hat{y_i} = b_{0} + b_{1} \text{growth}_{i} + b_{2} \text{fatalities}_{i} \]

\[ \hat{y_i} = 44.952 + 3.477 \text{growth}_{i} + (-.047) \text{fatalities}_{i} \]

  • The intercept (44.952) represents the predicted vote share for the incumbent party when all predictors are zero. Thus, if there was 0 growth in income during the preceding term and 0 war-related fatalities (i.e., growth = 0 and fatalities = 0), then our model predicts that the incumbent party will win 44.95% of the votes.

With more than one predictor in the regression model, each slope is interpreted as the effect of the corresponding predictor variable, holding constant all other predictors in the model.

  • Consider the slope estimate for the effect of growth. Holding constant fatalities (i.e., comparing elections when fatalities are the same), each 1 unit increase in growth is associated with a 3.477 unit increase in the vote share for the incumbent party. In Model 1, our model without fatalities, the slope for growth was 2.974. Once fatalities are taken into account, we anticipate more of a boost in votes for the incumbent party with each 1 unit increase in income growth.

  • Now consider the slope estimate for the effect of fatalities. Holding constant income growth (i.e., comparing elections when growth is the same), each 1 unit increase in fatalities is associated with a .047 unit decrease in the vote share for the incumbent party. In other words, with growth in income being equal, more war-related fatalities tends to have a negative impact on votes garnered by the incumbent party — more deaths leads to fewer votes.
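One way to see the “holding constant” interpretation in action is to compare model predictions for two hypothetical elections that differ by exactly one unit of growth while fatalities stays fixed (the scenario values below are made up purely for illustration):

```r
# Two hypothetical elections: growth differs by 1 unit, fatalities held at 0
scenarios <- tibble(growth = c(2, 3), fatalities = c(0, 0))
predict(bp_mod2, newdata = scenarios) |> diff()
# The difference between the two predictions equals the growth slope (about 3.477)
```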

Obtain the predicted values and residuals

We can also examine the predicted values and residuals to see how they’ve shifted with the addition of fatalities. As with Model 1, I will save the predicted and residual values, and rename them to be specific to Model 2 (i.e., specify a mod2 extension and store them in a data frame called bp_mod2_pred).

bp_mod2_pred <- 
  bp_mod2 |> 
  augment(data = bp) |> 
  rename(.fitted_mod2 = .fitted, .resid_mod2 = .resid) |> 
  select(year, .fitted_mod2, .resid_mod2)

Note that the calculation of the predicted score (i.e., .fitted_mod2) requires plugging in the case’s score for both predictor variables. For example, to obtain the predicted score for 1952 (a year where growth = 3.03 and fatalities = 205.6), we use:

\[ \hat{y_i} = 44.952 + (3.477\times3.03) + (-.047\times205.6) = 45.85 \]

As with a SLR, the residual is simply the difference between the predicted and observed score. For 1952, that is:

\[ 44.55 - 45.85 = -1.30. \]

To make sure you understand where these numbers come from, practice calculating the predicted score and residual for a few of the other years.
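If you’d like to check your hand calculations in R, here is a sketch using the rounded estimates reported above (expect small rounding differences relative to values computed from the full-precision coefficients):

```r
# Hand-compute the 1952 prediction and residual from rounded Model 2 estimates
b0 <- 44.952
b1 <- 3.477
b2 <- -0.047
y_hat_1952 <- b0 + b1 * 3.03 + b2 * 205.6
resid_1952 <- 44.55 - y_hat_1952
c(predicted = y_hat_1952, residual = resid_1952)
```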

Now that the predicted and residual scores are saved for each model, we can merge them together. The first two columns represent the observed variables in the data frame, the last four represent the predicted and residual scores for Models 1 and 2 (those ending in mod1 are from Model 1, and those ending in mod2 are from Model 2).

compare_bp <- 
  bp |> 
  left_join(bp_mod1_pred, by = "year") |> 
  left_join(bp_mod2_pred, by = "year") |> 
  select(year, vote, .fitted_mod1, .fitted_mod2, .resid_mod1, .resid_mod2) 

compare_bp 

Take a look at 1952 and 1968, the two years with the largest residuals in Model 1. In both cases, the predicted value from Model 2 (labeled .fitted_mod2) is much closer to the observed value (vote), and thus the residuals are also drastically reduced as compared to Model 1. That is, the residuals move closer to zero (e.g., from -9.49 to -1.30 for 1952, and from -5.14 to 1.46 for 1968). This indicates that once we add fatalities to the model, we can more accurately predict the outcome. Comparison of the two models clearly indicates that both growth in income and fatalities of American troops should be included in the regression model to predict the incumbent party’s vote share in presidential elections.

Consider adjusted \(R^2\)

The \(R^2\) in a MLR model can be calculated using the same approach described for a SLR model in Module 10. It can also be calculated by squaring the correlation between the observed outcome values (vote) and the values predicted by the model (.fitted).

compare_bp |> 
  select(vote, .fitted_mod2) |> 
  corrr::correlate()


This correlation squared yields the \(R^2\) that glance() gave us for Model 2: \(.88^2 = .77\).
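We can carry out this check directly in R:

```r
# Square the correlation between observed and fitted values to recover R^2
r <- cor(compare_bp$vote, compare_bp$.fitted_mod2)
r^2  # matches the r.squared value from glance(bp_mod2)
```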

In a MLR, there is another version of \(R^2\) that glance() provides that is of interest — it’s the adjusted-\(R^2\).

bp_mod2 |> glance() |> select(r.squared, adj.r.squared)

Adjusted-\(R^2\) is a modification of the \(R^2\) statistic that takes into account the number of predictors in a model.

\[ \text{Adjusted-}R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1} \]

where n is the sample size and k is the number of predictors.
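We can verify this formula by hand. For our model, n = 17 elections and k = 2 predictors:

```r
# Compute adjusted R^2 by hand and compare to the value reported by glance()
r2 <- glance(bp_mod2)$r.squared
n <- 17  # elections
k <- 2   # predictors (growth and fatalities)
1 - ((1 - r2) * (n - 1)) / (n - k - 1)
```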

While \(R^2\) represents the proportion of the variance in the outcome variable that’s explained by the predictor variables in a linear regression model, simply adding more predictors to the model can artificially inflate the \(R^2\) value, even if those predictors don’t have a genuine relationship with the outcome. The adjusted-\(R^2\) penalizes the inclusion of unnecessary or irrelevant predictors, making it a more robust measure, especially in models with many predictors. If a predictor doesn’t improve the model substantially, the adjusted-\(R^2\) may decrease, signaling that the predictor might not be a meaningful addition to the model. A couple of caveats to note: first, unlike regular \(R^2\), adjusted-\(R^2\) is not interpretable as a proportion of variance explained. Second, adjusted-\(R^2\) may not be accurate with small sample sizes (including the sample size for this example).

Substantive interpretation (fit with the theory)

Returning to the premise of the Bread and Peace Model, we see that our fitted model is an excellent match to the theory: In a US presidential election, the likelihood that the incumbent party maintains power is dependent on the economic growth experienced during the prior term and the loss of military troops due to war. The former increases favor for the incumbent party, while the latter decreases favor.

Create a publication-ready table of results

We can use the tbl_regression() function from the gtsummary package to present the results of our model in a publication-ready table. I’m mostly using the defaults here, but there are a few changes that I specify in the code below. First, I request intercept = TRUE, which prints the intercept of the model (the default is to suppress it from the output). Second, I use the modify_header() function to update two items — I change the label of terms in the model to “Term” (the default is “Characteristic”) and I change the header for the regression estimate to “Estimate” (the default is “Beta”). As we progress through the next few Modules, you’ll learn about Confidence Intervals (CI) and p-values — for now, we’ll ignore those aspects of the table.

bp_mod2 |> tbl_regression(
  intercept = TRUE,
  label = list(growth ~ "Per capita income growth",
               fatalities ~ "Military fatalities per million US population")) |> 
  modify_header(label ~ "**Term**", estimate ~ "**Estimate**") |> 
  as_gt() |> 
  gt::tab_header(
    title = gt::md("Table 1. Fitted regression model to predict percent of votes going to incumbent candidate in US presidential elections")
    )
Table 1. Fitted regression model to predict percent of votes going to incumbent candidate in US presidential elections
Term Estimate 95% CI1 p-value
(Intercept) 45 42, 48 <0.001
Per capita income growth 3.5 2.3, 4.7 <0.001
Military fatalities per million US population -0.05 -0.07, -0.02 0.001
1 CI = Confidence Interval

Consider linearity in a MLR

Understanding the linearity of relationships in a regression model is fundamental because the primary assumption of linear regression, whether simple or multiple, is that there exists a straight-line relationship between the predictor(s) and the outcome variable. This assumption of linearity informs how we estimate the model parameters, make predictions, and interpret the results. If this assumption is violated, our estimates may be biased or inefficient, leading to incorrect predictions and misleading interpretations. Therefore, checking and ensuring linearity is a crucial step in regression analysis.

In a SLR model, where we have one predictor variable, the relationship between the predictor and the outcome is visually straightforward to understand. We can create a scatter plot of the predictor variable against the outcome variable and see if a straight line effectively captures the relationship between the two.

However, in a MLR model, where we have more than one predictor variable, visualizing the relationship becomes more complex. This is because we are trying to understand a relationship in a multi-dimensional space that cannot be easily plotted or visualized. The assumption of linearity in MLR refers to the linearity in parameters, which means each predictor variable is linearly related to the outcome variable, holding other predictors constant.

In a MLR analysis, an Added Variable Plot offers a powerful way to visualize the relationship between the outcome variable (Y — vote in our example) and one of the predictors (say, X1 — growth in our example), while accounting for the effects of other predictor(s) (X2 — in this case, fatalities). Also known as partial regression plots, these plots essentially show the residuals of Y from regressing Y on X2 against the residuals of X1 from regressing X1 on X2. If the relationship between Y and X1, adjusting for X2, is linear, the points on the added variable plot will roughly follow a straight line.

The process of creating an added variable plot involves the following steps:

  1. Perform a simple linear regression of Y on X2, and save the residuals (we’ll name these y_resid). This is often denoted as Y|X2, meaning Y given X2 (i.e., Y after accounting for X2).

  2. Perform a simple linear regression of X1 on X2, and save the residuals (we’ll name these x_resid). This is often denoted as X1|X2, meaning X1 given X2 (i.e., X1 after accounting for X2).

  3. Plot y_resid against x_resid. If the linearity assumption is valid, the points should scatter around a straight line.

Recall that a residual captures the variability in the outcome that is not accounted for by the predictor. Therefore, y_resid captures the variability in Y that is NOT ACCOUNTED for by X2, and x_resid captures the variability in X1 that is NOT ACCOUNTED for by X2.

Let’s see how this works with our example. First, let’s create an Added Variable Plot for vote and growth.

y_resid <- 
  lm(vote ~ fatalities, data = bp) |> 
  augment(data = bp) |> 
  select(year, .resid) |> 
  rename(y_resid = .resid)

x_resid <- 
  lm(growth ~ fatalities, data = bp) |> 
  augment(data = bp) |> 
  select(year, .resid) |> 
  rename(x_resid = .resid)

check <- 
  y_resid |> 
  left_join(x_resid, by = "year") 

check |> 
  ggplot(mapping = aes(y = y_resid, x = x_resid)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "#A7226E") +
  theme_minimal() +
  labs(title = "Added Variable Plot for vote and growth controlling for fatalities",
       x = "(growth | fatalities)",
       y = "(vote | fatalities)")

The added variable plot confirms that the linear relationship between vote and growth holds even after controlling for fatalities.

Now, let’s create an Added Variable Plot for vote and fatalities.

y_resid <- 
  lm(vote ~ growth, data = bp) |> 
  augment(data = bp) |> 
  select(year, .resid) |> 
  rename(y_resid = .resid)

x_resid <- 
  lm(fatalities ~ growth, data = bp) |> 
  augment(data = bp) |> 
  select(year, .resid) |> 
  rename(x_resid = .resid)

check <- 
  y_resid |> 
  left_join(x_resid, by = "year") 

check |> 
  ggplot(mapping = aes(y = y_resid, x = x_resid)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "#A1C181") +
  theme_minimal() +
  labs(title = "Added Variable Plot for vote and fatalities controlling for growth",
       x = "(fatalities | growth)",
       y = "(vote | growth)")

The Added Variable Plot also confirms that the linear relationship between vote and fatalities holds after controlling for growth. However, the presence of extreme outliers on fatalities makes the linearity of this Added Variable Plot somewhat more difficult to assess.

Another useful way to examine the assumption of linearity is a scatter plot of the .fitted values against the .resid values from your fitted model. Here’s an example for our Bread and Peace MLR model with both predictors:

bp_mod2 |> 
  augment(data = bp) |> 
  ggplot(mapping = aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 1, color = "#EC2049") +
  labs(x = "Fitted Values", y = "Residuals", title = "Fitted Values vs Residuals") +
  theme_bw()

A graph plotting the fitted values against the residuals helps us understand the linearity assumption in multiple linear regression (MLR). If the residuals appear randomly scattered around zero with no clear pattern, it indicates that the linearity assumption is reasonably met. This randomness suggests that the model captures the linear relationship well. However, if the residuals display a systematic pattern or trend, such as a curve, it indicates that the relationship between the predictors and the outcome may be nonlinear, and the linearity assumption is violated. Such patterns suggest a misspecified model, indicating that a more complex, nonlinear model may better describe the relationship between the predictors and the outcome variable.

Based on the Added Variable Plots, it appears that both the relationship between vote and growth, adjusting for fatalities, and the relationship between vote and fatalities, adjusting for growth, exhibit a linear trend. This suggests that there is evidence to support the linearity assumption in the multiple linear regression model, indicating that the predictor variables growth and fatalities have linear relationships with the outcome variable when accounting for the effects of the other predictor. Thus, we can feel confident in our fitted model.

Non-linear relationships

If the relationship between the variables is not linear — for example, if the data points form a curve — then a straight line would not accurately represent the relationship and an alternative type of model is necessary. Moreover, forcing a straight line through a scatter plot with a nonlinear relationship can lead to incorrect conclusions and predictions. Therefore, it’s critical to properly assess the pattern of the data before deciding on the appropriate model to describe the relationship.

For example, consider the following graph that depicts the relationship between X and Y.

The best linear fit for these data is depicted in the graph below.

This best fit line suggests there is no relationship between X and Y, that is, the slope of the line is 0. However, there is clearly a relationship between X and Y — it’s just not a linear one.

In fact, there is a strong curvilinear relationship between X and Y.
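The pattern described above is easy to reproduce with simulated data (the data below are made up purely for illustration): fitting a straight line to a quadratic relationship yields a slope estimate near zero, while adding a quadratic term captures the curve.

```r
# Simulated illustration: a straight line badly misfits a quadratic relationship
set.seed(11)
sim <- tibble(
  x = seq(-3, 3, length.out = 50),
  y = x^2 + rnorm(50, sd = 0.5)
)

lm(y ~ x, data = sim) |> tidy()           # slope estimate near zero
lm(y ~ x + I(x^2), data = sim) |> tidy()  # quadratic term is large and positive
```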

Fitting a straight line to a curvilinear relationship can lead to several issues and misconceptions, the most fundamental being a poor fit. This misrepresentation of the data often results in inaccurate predictions and can also obscure meaningful relationships between variables. Linear regression models assume a constant rate of change, which is not the case with curvilinear relationships.

Additionally, as you will learn in later Modules, it can undermine the validity of statistical tests and confidence intervals derived from the model because these are based on the premise that the model is a good fit for the data. It also leaves much of the variability in the outcome variable unexplained, diminishing the model’s explanatory power.

Fitting a straight line to curvilinear data can even lead to incorrect inferences about the relationship between the variables. For example, if a relationship is quadratic, a linear model might suggest a positive or negative trend where there is actually a peak or trough, or even suggest no relationship where one exists. Therefore, proper exploratory data analysis is crucial before deciding the nature of the relationship and type of regression model to be used.

Non-linear models can be fit in a regression framework with a few tweaks. We’ll learn about these models and how to fit them in Module 15.

Prediction vs. Causation

When working with Multiple Linear Regression (MLR), it’s crucial to distinguish between prediction and causation. Although MLR is a powerful tool for understanding relationships between variables and making predictions, it is essential to understand its limitations regarding causal inferences.

Prediction in MLR

Prediction involves using a statistical model to estimate the future or unknown values of an outcome variable based on known values of predictor variables. The primary goal of prediction is to achieve accuracy in forecasting or estimating these unknown values. In the context of MLR, prediction focuses on developing a model that can reliably predict the outcome variable (Y) using a combination of predictor variables (X1, X2, X3, … Xn).

  • Example: In our Bread and Peace Model, we use economic growth and military fatalities to predict the vote share for the incumbent party in presidential elections. Here, the primary objective is to build a model that can accurately forecast future vote shares based on these predictors.

Causation in MLR

Causation goes beyond prediction to explain why changes in one variable lead to changes in another. Establishing causation requires demonstrating that changes in the predictor variable directly cause changes in the outcome variable. This involves ruling out alternative explanations, such as confounding variables, and demonstrating a consistent and replicable relationship.

  • Example: If we wanted to establish that economic growth directly causes changes in the vote share, we would need to conduct a more rigorous analysis to control for other potential confounders and establish a causal link. For example, general public sentiment about the government’s performance could impact both economic growth (through consumer confidence and spending) and vote share. That is, public sentiment may act as a confounder — causing both economic growth and vote share garnered by the incumbent party.

Key Differences Between Prediction and Causation

  1. Objective:

    • Prediction: The goal is to accurately forecast or estimate the outcome variable based on the predictor variables.

    • Causation: The goal is to determine whether changes in the predictor variable directly cause changes in the outcome variable.

  2. Methodology:

    • Prediction: Relies on fitting the best possible model to the data to minimize prediction error. It focuses on the model’s predictive power.

    • Causation: Requires more rigorous analysis, including controlling for confounders, establishing temporal relationships, and often using experimental or quasi-experimental designs to isolate causal effects.

  3. Interpretation:

    • Prediction: A strong predictive relationship does not imply causation. For instance, if economic growth predicts vote share well, it doesn’t mean economic growth causes changes in vote share.

    • Causation: Demonstrating causation means providing evidence that changes in the predictor variable directly lead to changes in the outcome variable.

  4. Use Cases:

    • Prediction: Useful for forecasting and making informed decisions based on anticipated outcomes.

    • Causation: Crucial for understanding the underlying mechanisms and for making interventions or policy decisions aimed at changing the outcome by manipulating the predictor.

Importance of the Distinction

Understanding the distinction between prediction and causation is vital for interpreting the results of MLR models correctly. Misinterpreting predictive relationships as causal can lead to incorrect conclusions and potentially harmful decisions. While MLR can provide valuable insights into how variables are related and help make accurate predictions, establishing causation requires careful and rigorous analysis.

Looking Ahead

Later in this course, we will delve deeper into causal inference methods, which are designed to address the challenges of establishing causation. These methods will introduce you to the tools and techniques necessary to make stronger causal claims and understand the true drivers of changes in your outcome variables.

For now, remember that while MLR is a powerful tool for making predictions, caution is needed when interpreting these predictions as causal relationships. By maintaining this distinction, we can make more informed and accurate interpretations of our statistical analyses.

What predictors should I include in my model?

Determining the type and number of predictor variables to include in a model requires careful consideration to strike a balance between model complexity and model performance. Including too few predictor variables may lead to an oversimplified model that fails to capture important relationships, while including too many predictor variables may result in overfitting and reduced generalizability. Here are some guidelines to help you with variable selection:

  1. Prior knowledge and theory: Start by considering the theoretical underpinnings of your research and existing knowledge in the field. Identify the variables that are conceptually and theoretically relevant to the phenomenon you are studying. This can provide a foundation for selecting a set of meaningful predictor variables to include in your model.

  2. Purpose of the model: Clarify the specific goals and objectives of your model. Are you aiming for prediction, or are you trying to identify the causal effect of a particular variable? For prediction, you may consider including a broader set of variables that are potentially predictive. We’ll work on an Apply and Practice Activity soon that demonstrates model building for prediction. For identification of a causal effect, you must carefully consider which potential confounders need to be controlled in order to obtain an unbiased estimate of the causal effect of interest. We’ll consider models for estimation of causal effects in observational studies in Module 18.

  3. Sample size: Consider the number of observations in your data frame. As a general rule, the number of predictor variables should be small relative to the number of observations. A commonly used guideline is to have at least 10 observations per predictor variable to ensure stable and reliable estimation.

  4. Collinearity: Assess the presence of collinearity or multicollinearity among your predictor variables. High correlation between predictors can lead to unstable parameter estimates and make it challenging to interpret the individual effects of each variable. Consider excluding or combining variables that exhibit strong collinearity to avoid redundancy.

  5. Model performance and simplicity: Evaluate the performance of your model using appropriate metrics such as goodness-of-fit measures (e.g., adjusted \(R^2\)) or prediction accuracy. Strive for a balance between model complexity and simplicity: an additional predictor variable should substantially improve the model’s performance or explanatory power, rather than introduce unnecessary noise or complexity.

  6. Practical considerations: Consider the practicality and cost-effectiveness of including additional predictor variables. Are the additional variables easily measurable or obtainable in practice? Including variables that are difficult or expensive to collect may not be practical unless they have a compelling theoretical or practical justification.

  7. Cross-validation and external validation: Validate your model using techniques such as cross-validation or external validation on independent data sets. This can help assess the stability and generalizability of the model’s performance and provide insights into the robustness of the selected predictor variables. You’ll get a chance to practice this in a later Apply and Practice Activity.

Remember that the determination of the best set of predictor variables is not a fixed rule but rather a decision based on careful judgment and considerations specific to your research context. It is advisable to consult with domain experts, statisticians, or research peers to gain additional insights and perspectives on selecting the most reasonable and informative set of predictor variables for your model.
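To make guideline 4 concrete, here is a minimal sketch of a collinearity check in R. It assumes a data frame called `hibbs` with the module’s variables `vote`, `growth`, and `fatalities` (these object and column names are assumptions for illustration, not part of this section):

```r
# Sketch of a collinearity check (assumed data frame `hibbs`
# with columns `vote`, `growth`, and `fatalities`)

# Pairwise correlation between the two predictors
cor(hibbs$growth, hibbs$fatalities)

# Variance inflation factors from the fitted MLR;
# values well above roughly 5-10 are often taken to
# signal problematic collinearity
mlr_fit <- lm(vote ~ growth + fatalities, data = hibbs)
car::vif(mlr_fit)
```

Note that `car::vif()` requires the car package to be installed.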
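Guideline 5 can be sketched with a direct model comparison. Again assuming a data frame `hibbs` with `vote`, `growth`, and `fatalities` (hypothetical names), we can compare adjusted \(R^2\) values for the SLR and MLR:

```r
# Sketch: does adding a predictor earn its keep?
# (assumed data frame `hibbs` with vote, growth, fatalities)
slr_fit <- lm(vote ~ growth, data = hibbs)
mlr_fit <- lm(vote ~ growth + fatalities, data = hibbs)

# Adjusted R^2 penalizes each added predictor, so keep the
# extra predictor only if it meaningfully raises this value
summary(slr_fit)$adj.r.squared
summary(mlr_fit)$adj.r.squared
```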
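Finally, one simple version of the cross-validation mentioned in guideline 7 is leave-one-out cross-validation, where each observation is held out in turn and predicted from a model fit to the rest. A minimal sketch, again assuming the hypothetical `hibbs` data frame:

```r
# Sketch of leave-one-out cross-validation for the MLR
# (assumed data frame `hibbs` with vote, growth, fatalities)
n <- nrow(hibbs)
loo_pred <- numeric(n)
for (i in seq_len(n)) {
  # Refit the model without observation i, then predict it
  fit_i <- lm(vote ~ growth + fatalities, data = hibbs[-i, ])
  loo_pred[i] <- predict(fit_i, newdata = hibbs[i, ])
}
# Root mean squared prediction error on held-out elections
sqrt(mean((hibbs$vote - loo_pred)^2))
```

Comparing this out-of-sample error across candidate models gives a sense of which set of predictors generalizes best.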

Failed and successful predictions

To finish up this module, please watch the following two videos from Crash Course Statistics on Prediction.



Wrap up

In this Module on Multiple Linear Regression (MLR), we built on the concepts from simple linear regression to introduce and explore the benefits of using multiple predictors in statistical modeling. We covered a number of important topics, including:

  1. Understanding the Benefits of Multiple Predictors: We began by discussing how multiple predictors can enhance the accuracy and explanatory power of our models by accounting for various factors that influence the outcome variable.

  2. Formulating and Understanding the MLR Equation: We introduced the MLR equation, highlighting how it extends simple linear regression to include multiple independent variables. This was illustrated using the Bread and Peace Model, where both economic growth and war-related fatalities were used to predict the incumbent party’s vote share in U.S. presidential elections.

  3. Implementing and Interpreting MLR in R: We provided step-by-step instructions on how to build MLR models using R and specialty packages, focusing on practical implementation and interpretation of the results. This included fitting the model, evaluating the \(R^2\) value, and carefully interpreting the regression coefficients.

  4. Calculating and Interpreting Predicted Outcomes: We demonstrated how to calculate predicted outcomes using the MLR model and interpret these predictions in relation to the observed data.

  5. Importance of \(R^2\) in MLR: We discussed the importance of \(R^2\) in assessing the proportion of variability in the outcome variable explained by the predictors, and introduced the concept of adjusted-\(R^2\) to account for the number of predictors in the model.

Throughout this module, we used the Bread and Peace Model to illustrate these concepts, demonstrating how adding war-related fatalities to a model that already included economic growth significantly improved the model’s explanatory power.

In the next Module, you will learn how to quantify uncertainty in the key regression parameters and to construct confidence intervals for the estimates derived from your model.

Additionally, in future Modules, we will explore more advanced techniques in regression analysis, including:

  • Incorporation of Categorical Predictors: Learning how to include and interpret categorical variables in our models.

  • Non-Linear Effects: Investigating polynomial terms and log transformations to capture non-linear relationships.

  • Interaction Effects: Examining how the relationship between variables changes at different levels of other variables.

  • Diagnostics and Remediation: Thoroughly examining your fitted model to ensure all assumptions are met.

These advanced topics will further expand your analytical toolkit, enabling you to handle more complex data and research questions in the social and behavioral sciences.