Simple Linear Regression

Module 10

Artwork by @allison_horst

Learning objectives

  • Understand the concept and utility of using a straight line for relating two variables
  • Grasp the least squares criterion for determining the best fit line in simple linear regression (SLR)
  • Construct and understand the SLR equation as defined by an intercept and slope
  • Interpret the meaning of intercept and slope values
  • Define residuals in SLR and their importance
  • Calculate and interpret predicted outcomes using the SLR model
  • Implement and interpret SLR in R, including the intercept, slope, and \(R^2\) values
  • Plot and visually interpret SLR results
  • Define and comprehend the correlation coefficient and its relation to the SLR slope

Overview of Simple Linear Regression (SLR)

In the social sciences, we often want to know whether two variables are associated with one another. By associated we mean that knowing the value of one variable tells us something about the value of the other variable. In this Module, we will focus on modeling the relationship between two continuous variables. We will use a technique called simple linear regression (SLR) to study this type of relationship.

Recall this example from Module 3: The Centers for Disease Control and Prevention compiled these data to determine whether countries that adopted more COVID-19 mitigation policies had fewer deaths.

Simple linear regression (or SLR) helps us to understand the extent to which two variables are associated by estimating the straight line that relates them to one another. SLR is a statistical technique that allows the analyst to predict scores for some continuous outcome Y using a single predictor X. Each case (denoted by subscript \(i\)) in our data frame has a score for both Y (i.e., \(y_i\)) and X (i.e., \(x_i\)). In the CDC COVID-19 example, the cases represent countries, Y represents mortality from COVID-19 on June 30, 2020, and X represents the stringency index for mitigation policies and practices.

In the plot below, there appears to be a negative relationship: more mitigation strategies were associated with fewer deaths in the early months of the COVID-19 pandemic.

Before we dig into the content, to build your intuition about regression, please watch the following Crash Course Statistics video on Regression. Please note that the video includes an overview of fitting regression models, as well as inference testing (with topics such as degrees of freedom, test statistics, and test distributions). You don’t need to worry about the inference parts for now — we’ll revisit these topics and techniques later in the course.



In this Module, we will use linear models to describe the relationship between two variables. Though you may not have studied or used linear models for prediction before, we often use models to predict outcomes in everyday life. In particular we often use mathematical models to solve problems or make conversions. Mathematical models are deterministic. Once the “rule” is known, the mathematical model can be used to perfectly fit the data. That is, we can perfectly predict the outcome.

Examples of mathematical models include:

  1. Perimeter of a square \(= 4 \times (length\,of\,side)\)

  2. Area of a square \(= (length\,of\,side)^2\)

  3. Convert Fahrenheit to Celsius: \(C = (F - 32)\cdot\frac{5}{9}\)

Let's explore the first example. The data below represent 25 squares. For each square, the length of a side and the perimeter are recorded.
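The squares data frame itself is not displayed here, so if you would like to follow along, here is a minimal sketch of how such a data frame could be constructed (assuming the tidyverse is loaded, as it is elsewhere in this Module); the actual data may of course differ.

# hypothetical stand-in for the squares data: 25 squares with a side length and a perimeter
squares <- tibble(side = 1:25,           # length of a side
                  perimeter = 4 * side)  # recorded perimeter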

Now, let’s use the formula for finding the perimeter based on the length of a side to predict the perimeter for each square.

squares <- squares |> 
  mutate(predicted_perimeter = side * 4)  # apply the mathematical model: perimeter = 4 * side

squares

Not surprisingly, we see that the predicted perimeter that we obtain by applying the formula is precisely the same as the perimeter measurement. This is because this formula is a mathematical model.

While mathematical models are extremely useful, in this course, we will focus on statistical models. Unlike mathematical models, statistical models are not deterministic. They take into account that we usually don’t have all important predictors, that we rarely perfectly measure the variables that we do have, and that people (or organizations, schools, animals, etc.) act differently.

Statistical models allow for:

  1. Excluded variables

  2. Measurement error

  3. Individual variation

The formula for a statistical model has a residual to account for variation from these sources:

Outcome = Systematic Component + Residual

Statistical techniques serve as a powerful tool to decompose the variation in our results into two key components — one that can be explained by our selected predictors, often referred to as the systematic component, and another that remains unexplained, typically known as the residual.

For example, in the scatterplot of COVID-19 mitigation policies and COVID-19 deaths across European countries, we observed a pattern where more mitigation policies were associated with fewer deaths. This is the systematic component. However, not all data points fell on the best-fit line, indicating that mitigation policies were not perfect predictors of deaths. There was significant variability in deaths that policies did not account for, captured by the residual. Other factors (excluded variables) likely contributed to this variability, such as the age of residents and healthcare quality. Additionally, both policy and death measurements could have errors. Each country also has unique characteristics that might influence both the policies implemented and the death rate, leading to variability.

In this Module, we will explore the fundamental concepts of Simple Linear Regression (SLR) to understand the relationship between two continuous variables. By examining the association between variables, such as the correlation between COVID-19 mitigation policies and cumulative mortality, SLR enables us to predict outcomes and identify patterns. Unlike deterministic mathematical models, statistical models in SLR account for excluded variables, measurement errors, and individual variations, making them more realistic and applicable to social sciences. Through statistical techniques, we can decompose variation in our data into systematic components explained by our predictors and residuals that capture unexplained variability, offering a comprehensive understanding of the factors influencing our results and the inherent uncertainties.

Introduction to the data

In this Module, we will examine a predictive model for forecasting outcomes of US Presidential elections. The individuals we elect to public offices wield substantial influence on the health, well-being, and prosperity of our society. They hold the reins of power that can either foster or dismantle programs, practices, and policies aimed at ensuring health, welfare, and prosperity for all.

Considering the profound impact of governance and policy on our nation's status, deciphering the motivations behind voters' choices in electing public officials becomes a matter of paramount importance. In this Module, we will delve into the understanding and application of a well-acknowledged, albeit relatively straightforward, model for predicting election results: the Bread and Peace Model, formulated by Douglas A. Hibbs. He describes his model in detail here; the gist of his model is described below:

Postwar US presidential elections can for the most part be interpreted as a sequence of referendums on the incumbent party’s record during its four-year mandate period. In fact aggregate two-party vote shares going to candidates of the party holding the presidency during the postwar era are well explained by just two fundamental determinants: (1) Positively by weighted-average growth of per capita real disposable personal income over the term. (2) Negatively by cumulative US military fatalities (scaled to population) owing to unprovoked, hostile deployments of American armed forces in foreign wars.

In other words, in a US presidential election, the likelihood that the incumbent party maintains power is dependent on the economic growth experienced during the prior term and the loss of military troops due to war. The former increases favor for the incumbent party, while the latter decreases favor.

We’ll start simple. In this Module we will consider a single outcome (election results of US presidential elections) and a single predictor (growth in personal income over the preceding presidential term). To estimate the Bread and Peace model, we will use data compiled by Drew Thomas (2020).

To get started, we need to import the libraries for this Module:

library(gtsummary)
library(skimr)
library(here)
library(broom)
library(tidyverse)

In the data frame that we will use, called bread_peace.Rds, each row of data considers a US presidential election. The following variables are in the data frame:

Variable Description
year: Presidential election year
vote: Percentage share of the two-party vote received by the incumbent party's candidate
growth: Quarter-on-quarter percentage rate of growth of per capita real disposable personal income, expressed at annual rates
fatalities: Cumulative number of American military fatalities per million of the US population
wars: A list of the wars of the term, if fatalities > 0
inc_party_candidate: Name of the incumbent party candidate
other_party_candidate: Name of the other party candidate
inc_party: Indicator of the incumbent party (D = Democrat, R = Republican)

In this Module, we will estimate how well growth in income of US residents during the preceding presidential term predicts the share of the vote that the incumbent party receives. That is, we will determine if growth is predictive of vote. In the equations and descriptions below, I will refer to the predictor (growth) as the \(x_i\) variable, and the outcome (vote) as the \(y_i\) variable.

Let’s begin by importing the data called bread_peace.Rds.

bp <- read_rds(here("data", "bread_peace.Rds"))
bp
bp |> write_csv(here("data", "bread_peace.csv"))

Each row of the data frame represents a different presidential election — starting in 1952 and ending in 2016. There are 17 elections in total to consider.

Let’s obtain some additional descriptive statistics with skim().

bp |> skim()
Data summary
Name bp
Number of rows 17
Number of columns 8
_______________________
Column type frequency:
character 4
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
wars 0 1 4 18 0 6 0
inc_party_candidate 0 1 4 11 0 15 0
other_party_candidate 0 1 4 11 0 17 0
inc_party 0 1 1 1 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 1984.00 20.20 1952.00 1968.00 1984.00 2000.00 2016.00 ▇▆▆▆▇
vote 0 1 51.99 5.44 44.55 48.95 51.11 54.74 61.79 ▅▇▃▁▃
growth 0 1 2.34 1.28 0.17 1.43 2.19 3.26 4.39 ▆▆▇▇▆
fatalities 0 1 23.64 62.89 0.00 0.00 0.00 4.30 205.60 ▇▁▁▁▁


The key predictor (growth) for our initial exploration ranges from 0.17% (in 2008, just after the financial crisis) to 4.4% (in 1964, one of the most prosperous times in US history, particularly for the wealthy). The outcome (vote) ranges from 44.6% (in 1952, when the US was in the middle of the Korean War) to 61.8% (in 1972, when, riding a strong economy and moving toward the end of the Vietnam War, Richard Nixon won by a landslide).
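If you would like to confirm which election years sit at these extremes, a quick check (not part of the original analysis) with slice_min() and slice_max() from dplyr will do it:

# elections with the smallest and largest values of growth and vote
bp |> slice_min(growth) |> select(year, growth)  # 2008
bp |> slice_max(growth) |> select(year, growth)  # 1964
bp |> slice_min(vote) |> select(year, vote)      # 1952
bp |> slice_max(vote) |> select(year, vote)      # 1972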

Bread and Peace Model 1 - A simple linear regression

Let’s take a look at the relationship between growth (our primary predictor) and vote (our primary outcome) using a scatterplot. We will map growth (the predictor) to the x-axis, and vote (the outcome) to the y-axis.

bp |> 
  ggplot(mapping = aes(x = growth, y = vote)) +
  geom_point() +
  ggrepel::geom_label_repel(mapping = aes(label = year),
                            color = "grey35", fill = "white", size = 2, box.padding =  0.4, 
                            label.padding = 0.1) +
  theme_minimal() +
  labs(title = "Bread and Peace Voting in US Presidential Elections 1952 - 2016", 
       x = "Annualized per capita real income growth over the term (%)", 
       y = "Incumbent party share of two-party vote (%)")

When two continuous variables are considered, a scatterplot is an excellent tool for visualization. It’s important to determine if the relationship appears linear (as opposed to curvilinear) and to note any possible outliers or other strange occurrences in the data.

If we want to summarize the relationship between the variable on the x-axis (growth) and the variable on the y-axis (vote), one option is to draw a straight line through the data points. This is accomplished in the plot below.

bp |> 
  ggplot(mapping = aes(x = growth, y = vote)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "#2F9599") +
  ggrepel::geom_label_repel(mapping = aes(label = year),
                            color = "grey35", fill = "white", size = 2, box.padding =  0.4, 
                            label.padding = 0.1) +
  theme_minimal() +
  labs(title = "Bread and Peace Voting in US Presidential Elections 1952 - 2016", 
       x = "Annualized per capita real income growth over the term (%)", 
       y = "Incumbent party share of two-party vote (%)")

This straight line does a reasonable job of describing the relationship. Many relationships can be explained with a straight line, and this is the primary purpose of a linear regression model. However, it is critical to note that imposing a straight line to describe the relationship between two variables is only appropriate if the relationship between the variables is indeed linear. A linear relationship means that a change in one variable is consistently associated with a proportional change in the other variable. In other words, if the points on the scatterplot appear to rise or fall at a steady rate, a straight line can adequately describe the trend. Otherwise, other, more complex, models need to be considered.

For examples of scatterplots that show non-linear relationships between X and Y, please see the graphs below. For each of these, we can visually see that X and Y are related to one another, but a linear model is not a good fit.

If a straight line to relate the \(x_i\) scores to the \(y_i\) scores is viable, and we draw a straight line through a set of data points, the line can be defined by an intercept and a slope.

The intercept is the value of Y when X is zero, representing the point where the line crosses the y-axis.

In looking at our graph, there were actually no years during the observation period in which there was 0 growth. The closest is 2008, which had about 0.2% growth. By extrapolating back a bit, it looks like the straight line relating growth and vote (i.e., the linear model) would predict a score of about 45 (the incumbent party receives 45% of the vote share) if percent growth were 0. See the red dot on the graph above.

While the intercept marks the predicted score for vote when growth equals 0, the slope indicates the rate of change in the Y variable for each unit increase in the X variable.

Specifically, the slope is defined as the “rise over the run.” That is, the rise (i.e., change) in vote share for the incumbent party for a one unit increase (i.e., run) in economic growth. Since it’s a straight line, we can pick any one-unit increase along the x-axis.

Let's choose a "run" from 2 to 3, that is, going from a 2 to a 3 on the x-axis. Take a look at the graph below: when growth (the value on the x-axis) is 2, the corresponding value of vote looks to be about 51. When growth is 3, the corresponding value of vote looks to be about 54.

That means as we go from an income growth of 2 to an income growth of 3 (a run of 1), our line relating the two variables asserts that vote share tends to increase by about 3 percentage points (54 minus 51 or a rise of 3 with respect to the y-axis). So the rise over the run = 3/1, corresponding to a slope of about 3. For each one unit increase in growth, we predict the percent vote share going to the incumbent party to increase by about 3 percentage points.

You might be wondering: "How was the location of the line on this graph determined?" If I asked you to manually draw the best fit line through these points, each student would likely draw a line that was close, but not exactly the same as anyone else's, and not exactly the same as the line that ggplot2 drew onto the graph for us. It is possible to fit an infinite number of straight lines to the data points in the scatterplot. However, we want to find the best fitting line.

The method that we use to find the best fit line is called the least squares criterion. We will study this method in the next part of this module. If you’d like to build your intuition a bit more before studying the mathematics behind estimation of the best fit line — check out this engaging application.

Estimate the best fit line

We can imagine many reasonable lines that could be drawn through the data points to relate the predictor to the outcome. For example, each of the lines on the graph below seems reasonable.

How do we determine the best fit line for the data? For any given X and Y variable that are linearly related, there is indeed one best fitting line. The best fitting line is the line that allows the \(x_i\) scores (growth) to predict the \(y_i\) scores (vote) with the highest amount of accuracy.

To find the best fit line, we use the least squares criterion. The least squares criterion dictates that the best fitting line is the line that results in the smallest sum of squared errors (or residuals) (i.e., each case’s residual is squared and then all squared residuals are summed across cases). Here’s a simplified explanation of how least squares works, and we will dig into these ideas later in the Module.

  1. Begin by plotting the data: Plot your two continuous variables on a scatterplot. One variable goes on the y-axis, and the other goes on the x-axis. Each point on the scatterplot represents an observation in your data.

  2. Draw a line: Imagine drawing a line through the scatterplot. The line can be placed in an infinite number of ways, but we want to find the “best” way.

  3. Calculate the distance from each point to the line you drew: For each point, calculate the vertical distance from the point to the line. See the graph below for a depiction. These distances are called residuals or prediction errors. Some of these will be positive (when the point is above the line), and some will be negative (when the point is below the line). This distance represents how “off” our prediction is for each point (i.e., the difference between a case’s observed score, and what we predict the case’s score to be given the drawn line).

  4. Square each distance and add them up: Because negative and positive residuals could cancel each other out, we square each distance, which makes all the distances positive, then add them together. This sum is called the Sum of Squared Errors (or Sum of Squared Residuals). You’ll also hear it referred to as the Error Sum of Squares (or the Residual Sum of Squares).

  5. Adjust the line to minimize the sum: The best fit line is the one where this sum of squared errors/residuals is as small as possible. The method of finding this line is called least squares because it minimizes the sum of the squares of the errors/residuals.

In summary, the best fit line is found by minimizing the sum of the squared distances from each point to the line. This line represents the best linear approximation of the relationship between the two variables, assuming that the relationship is indeed linear.
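To build intuition for the least squares criterion, the short sketch below brute-forces the search: it computes the SSE for a grid of candidate intercepts and slopes and keeps the pair with the smallest SSE. The grid boundaries are arbitrary choices for illustration only; as we will see shortly, lm() solves for the best fit line directly, so no search is actually required.

# brute-force illustration of the least squares criterion (for intuition only)
candidate_lines <- expand_grid(b0 = seq(40, 50, by = 0.1),   # candidate intercepts
                               b1 = seq(0, 6, by = 0.05))    # candidate slopes

candidate_lines |> 
  mutate(SSE = map2_dbl(b0, b1, \(b0, b1) sum((bp$vote - (b0 + b1 * bp$growth))^2))) |> 
  slice_min(SSE)  # the winning pair lands very close to the estimates we obtain later with lm()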

Let’s consider a few candidate lines. First, consider the line on the graph below as a candidate best fit line. This line has an intercept of 46.500 and a slope of 2.750.

Given this line, for each data point we can calculate the election's predicted score (the point on the orange line given the election's score on X, i.e., growth). The distance between each election's predicted score and observed score represents its residual (the dashed lines on the graph). These residuals are recorded in the table below the graph under residual. If we squared each election's residual, we'd get the value recorded as residual_sq. Then, if we summed the squared residuals across all 17 elections, we would obtain the sum of squared errors (we'll call this SSE) for our candidate line.

year residual residual_sq
1952 -10.289375 105.8712379
1956 4.145100 17.1818540
1960 1.833800 3.3628224
1964 2.776400 7.7083970
1968 -5.887100 34.6579464
1972 3.810125 14.5170525
1976 -1.489100 2.2174188
1980 -3.871125 14.9856088
1984 1.341375 1.7992869
1988 -0.549425 0.3018678
1992 -3.735100 13.9509720
1996 3.070825 9.4299662
2000 -5.400975 29.1705310
2004 -1.271500 1.6167122
2008 -0.655025 0.4290578
2012 0.580400 0.3368642
2016 -0.601525 0.3618323


The SSE for this line is 257.9, which is recorded on the graph below.
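As a check, here is a short sketch that reproduces the residuals, squared residuals, and SSE for this candidate line directly in R:

# residuals and SSE for the candidate line with intercept 46.500 and slope 2.750
bp |> 
  mutate(predicted = 46.5 + 2.75 * growth,  # predicted vote under the candidate line
         residual = vote - predicted,       # observed minus predicted
         residual_sq = residual^2) |>       # squared residual
  summarize(SSE = sum(residual_sq))         # about 257.9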


Now, let’s consider the mean of Y as a potential line. Here the intercept is the mean of vote (51.991) and the slope is 0.

Clearly, this line is not so great in explaining the relationship between the two variables — the SSE is very large.
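Just how large? A quick sketch makes this concrete. With an intercept equal to the mean of vote and a slope of 0, every prediction is simply the mean, so the SSE is the sum of squared deviations of vote from its mean, which is far larger than the 257.9 obtained for the previous candidate line:

# SSE when the candidate "line" is simply the mean of vote
bp |> 
  mutate(residual = vote - mean(vote)) |> 
  summarize(SSE_mean_line = sum(residual^2))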


How about the best fit line drawn by ggplot2 using the geom_smooth() function? The intercept is 45.021 and the slope is 2.975.

This is the best one yet. As expected, using method = "lm" in our ggplot() call to regress vote on growth (formula = y ~ x) displays the best fit line. That is, geom_smooth() with method = "lm" fits a linear model for y regressed on x (y ~ x) and overlays the best fit line on the scatterplot. This is depicted in the graph below.

bp |> 
  ggplot(mapping = aes(x = growth, y = vote)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x,  se = FALSE, color = "#2F9599") +
  ggrepel::geom_label_repel(mapping = aes(label = year),
                            color = "grey35", fill = "white", size = 2, box.padding =  0.4, 
                            label.padding = 0.1) +
  theme_minimal() +
  labs(title = "Bread and Peace Voting in US Presidential Elections 1952 - 2016", 
       x = "Annualized per capita real income growth over the term (%)", 
       y = "Incumbent party share of two-party vote (%)")

An equation for the best fit line

The best fitting line is defined by an equation consisting of:

  • An intercept — the predicted value of vote (the Y variable) when growth (the X variable) = 0.

  • A slope — the predicted change in vote (the Y variable) for each one unit increase in growth (the X variable). This equation is depicted below. For this depiction, the Y and X scores for each election year are subscripted with an i to denote that each case (i.e., election year in our example) has a score for \(y_i\) and a score for \(x_i\).

The best fit line for our Bread and Peace data is represented by the following equation. The intercept, that is, the predicted score for vote when growth = 0, is 45.021. The slope, that is, the predicted change in vote for each one unit increase in growth, is 2.975.

\[ \hat{y_i} = b_{0}+b_{1}x_i \]

\[ \hat{y_i} = 45.021+2.975x_i \]

Note that sometimes you will also see the equation of a line written as \(\hat{y_i} = m{x_i} + b\), where b is the intercept (i.e., \({b_0}\)) and m is the slope (i.e., \({b_1}\)). Notice also that the \(y_i\) scores in the equation have a hat over them (\(\hat{y_i}\)); as we'll learn about in just a bit, this denotes the predicted (or fitted) scores for \(y_i\) given the equation. So, for any given score on growth, we can compute the predicted score on vote. We'll produce predicted scores in the next section.

Predicted scores based on the equation

Let’s use this equation to obtain a predicted score. We’ll use the 1988 election as an example.

In this election, George Bush Sr. ran against Michael Dukakis. Coming off of Ronald Reagan's second term, Bush was the incumbent party candidate. The economy was in pretty good shape; income growth was 2.89%. Using our equation, we'd predict Bush to garner 53.6% of the vote share.

\[ \hat{y_i} = b_{0}+b_{1}x_i \]

\[ \hat{y_i} = 45.021+2.975x_i \]

\[ \hat{y_i} = 45.021+2.975\times2.89=53.62 \]

Take a look at the graph below: we see that this prediction (marked with an orange dot) is very close to what Bush actually garnered (53.90%, see the dot marked with the label 1988). Note that I changed the best fit line to a dashed gray line to help you see the orange dot.

We can calculate the predicted score (also referred to as the fitted value or y-hat) for every case in the data frame. Doing so yields the table below. The column labeled .fitted provides the predicted score for every election year.

To practice and solidify this concept, please use the equation above to calculate the predicted score by hand for a few of the years.
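To check your hand calculations, the short sketch below applies the equation (with the rounded intercept and slope) to every election at once; the results match the .fitted column up to rounding.

# predicted vote for every election, computed straight from the equation
bp |> 
  mutate(predicted_vote = 45.021 + 2.975 * growth) |> 
  select(year, growth, vote, predicted_vote)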


Let’s consider another example — the 1968 election between Hubert Humphrey (Democrat) and Richard Nixon (Republican).

Humphrey was the incumbent party's candidate, and the US was in the midst of the Vietnam War. During the incumbent party's term, economic growth was 3.26%, yet Humphrey secured only 49.59% of the vote and ultimately lost to Nixon. Based on our best fit line, we would have predicted Humphrey to receive 54.73% of the vote.

\[ \hat{y_i} = 45.021+2.975\times3.26=54.73 \]

This predicted score is marked by the orange dot in the graph below and labeled \(\hat{y_i}\). Humphrey’s actual vote total (see the point marked with the label 1968) was much poorer than our best fit line would have predicted — that is, given the relatively good economy, he did not perform nearly as well as we would have expected based on our linear model. To illustrate this point, I connected the predicted and observed score for vote for the 1968 election with a dotted black line in the graph below.

Residual scores based on the equation

Using the observed score and the predicted score for each case, we can calculate each case’s residual. The residual, denoted \(e_i\), is simply the difference between a case’s observed score and their predicted score:

\[ e_i = y_i - \hat{y}_i \]

Specifically, we subtract each case’s predicted score from their observed score. The residual is denoted by \({e_i}\) in the figure below. For the 1968 election, the residual is -5.14 (calculated as 49.59 - 54.73).


In the same way that we calculated the predicted score (.fitted) for each case, we can also calculate the residual for each case. These are all displayed in the table below, under the column .resid. Note the calculated predicted/fitted value and the residual for the 1968 election — and match them up to the work we just did to compute them.

To solidify these concepts, please calculate a few predicted scores and residuals for election years yourself, and map them onto the graph above.
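A small extension of the earlier sketch computes both quantities for every election, again using the rounded intercept and slope, so you can compare your hand calculations to the .fitted and .resid columns:

# predicted scores and residuals computed directly from the equation
bp |> 
  mutate(predicted_vote = 45.021 + 2.975 * growth,  # y-hat
         residual = vote - predicted_vote) |>       # e = y minus y-hat
  select(year, vote, predicted_vote, residual)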



Each case has a residual — this represents the difference between what our model predicts the score will be and the actual observed score. With the residual, we can write an alternative version of our regression equation that incorporates the residual:

\[ {y_i} = b_0 + b_1x_i + e_i \]

Here, we replace \(\hat{y_i}\) with the actual observed score (i.e., \({y_i}\)), and we add the case's residual (represented as \({e_i}\)). In this way, the actual observed score for Y for each case can be calculated using the intercept and slope for the best fit line (to obtain \(\hat{y_i}\)) and adding the case's residual score. For example, to reproduce the observed score for vote (\({y_i}\)) for 1952, we use the following equation: \[{y_i} = 45.021 + (2.975\times3.03) + (-9.49) = 44.55\]

Please use the same technique to recover vote for a couple additional years to be sure you understand where the numbers come from.

At this point, you might be worried about the time and energy needed to test a whole lot of lines in order to find the one that results in the smallest sum of squared residuals. Fortunately, R can solve for the best fit line, so we don't have to do the tedious work of fitting and testing candidate lines. Let's see how this is done in the next section.

Fit a simple linear regression model

We use the lm() function in R to fit a linear model. In the code below I first define the name of the R object that will store our linear model results (bp_mod1). Then, I specify the desired equation in the lm() function.

We want to regress the outcome (i.e., the \(y_i\) scores) on the predictor (i.e., the \(x_i\) scores), that is, we want to regress vote on growth to determine if growth is a predictor of vote. This is coded in the lm() function by writing vote ~ growth. Think of the tilde in this instance as saying “regressed on.” We also need to indicate the data frame to use.

bp_mod1 <- lm(vote ~ growth, data = bp)

Obtain the regression parameter estimates

The tidy() function from the broom package can be used to view the regression parameter estimates (i.e., the intercept and the slope). The broom package has a number of functions that we’ll use — they focus on converting the output of various statistical models in R into tidy data frames. That is, broom provides functions that tidy up the results of model fitting, making it easier to work with and analyze the results using other tidyverse tools.

To use the tidy() function to see the regression results, we just need to feed in the R object that we designated to store our linear model results (bp_mod1).

bp_mod1 |> tidy() |> select(term, estimate)

For now, we are only going to be concerned about the first two columns of the tidy output — the columns labeled term and estimate. To avoid distractions, I’ve extracted just these two columns (we’ll consider the other columns produced by tidy() later in the course). Let’s map these estimates onto our equation. The term labeled intercept provides the intercept of the regression line and the term labeled growth provides the slope of the regression line.

\[ \hat{y_i} = b_{0}+b_{1}x_i \]

\[ \hat{y_i} = 45.021+2.975x_i \]

As defined earlier, the intercept is the predicted value of Y when X = 0, therefore, it is the predicted vote percentage for the incumbent party when growth in income equals 0.

The slope is the predicted change in Y for each one unit increase in X. That is, for each one percentage point increase in growth (e.g., going from 1% to 2%, or 2% to 3%), we expect the percent of vote share garnered by the incumbent party, vote, to increase by 2.975 points.

Obtain the predicted values and residuals for each case

We can obtain the predicted values and residuals using the augment() function from the broom package. In the code snippet below, bp_mod1 (the object holding our model results) is piped into the augment() function, which augments the data frame (bp) with additional information from the linear model, including the predicted values (.fitted) and the residuals (.resid). As you'll learn later in the course, the augment() function adds more information than just the predicted values and residuals; it also adds numerous statistics for assessing the assumptions of the linear regression model. Since we won't be concerned about these until later in the course, the select() function from dplyr is used to select only the columns that we need for now.

bp_mod1 |> 
  augment(data = bp) |> 
  select(year, vote, growth, .fitted, .resid)

Obtain the overall model summary

We can also obtain additional summary information about the model using the glance() function from broom.

bp_mod1 |> glance() |> select(r.squared)

For now, we’ll only be concerned about the r.squared column. This value is .491. It represents a commonly considered term in regression modeling called R-squared or \(R^2\). \(R^2\) is the proportion of the variability in the outcome (vote in our example) that is explained by the predictor(s) (growth in our example).

It is calculated by considering the amount of variability explained by our model (the systematic part of the model), and the amount not explained by our model (the error, also called the residual).

  • The systematic part is often called the Sum of Squares Regression (SSR) — and is calculated by taking the difference between the predicted Y for each case (\(\hat{y_i}\)) and the mean of Y (\(\bar{y}\)), squaring it, and then summing across all cases in the sample. In equation form this is \(SSR = \sum(\hat{y_i} - \bar{y})^2\). Note that \(\sum\), the Greek capital letter sigma, means to sum. SSR represents how much our predictor(s) can move our prediction away from simply predicting the average outcome (\(\bar{y}\)) for every observation (which is what we would do if we had no predictors at all). In other words, SSR captures how much better our model's predictions are than just predicting the average outcome (e.g., average vote, (\(\bar{vote}\))).

  • The error part is often called the Sum of Squares Error (SSE) — and is calculated by taking the difference between the observed Y for each case (\(y_i\)) and the predicted Y for each case (\(\hat{y_i}\)) (which is the residual), squaring it, and then summing across all cases in the sample. In equation form this is \(SSE = \sum({y_i} - \hat{y_i})^2\). As mentioned earlier, the goal of a linear regression model is to create a line that best fits the data, which means it minimizes the differences between observed values and predicted values. However, no model is perfect, and there will always be some level of error. These errors in prediction are the differences between the observed values and the fitted/predicted values. The SSE gives us a single measure of the total error of the model's predictions. So, the SSE essentially represents the unexplained variance in the outcome variable. You will also hear this quantity referred to as Sum of Squares Residual or Residual Sum of Squares.

To further build our intuition, let’s calculate these quantities with some code.

First, let's get the needed quantities for each case, that is, the squared difference between the predicted \(y_i\) and the mean of the \(y_i\) scores (for SSR) and the squared residual (for SSE).

get_rsq <- 
  bp_mod1 |> 
  augment(data = bp) |> 
  mutate(for_SSR = .fitted - mean(vote), # calculate the difference between each predicted score and the mean of y
         for_SSR2 = for_SSR^2) |>  # square the difference
  mutate(for_SSE2 = .resid^2) |>  # square the residual 
  select(year, vote, growth, .fitted, .resid, for_SSR, for_SSR2, for_SSE2)
get_rsq       

Now, we can sum the squared quantities that we just calculated across all cases (all 17 elections).

SSR_SSE <- get_rsq |> 
  summarize(SSR = sum(for_SSR2),
            SSE = sum(for_SSE2))

SSR_SSE

Finally, to calculate the \(R^2\), we take SSR divided by the sum of SSR and SSE. This denominator is also called Sum of Squares Total (SST) as it represents the total variability in the outcome.

\[ R^2 = \frac{\text{SSR}}{\text{SSR} + \text{SSE}} \]

SSR_SSE |> 
  mutate(r_squared = SSR/(SSR + SSE))

Of course, this is the same value of \(R^2\) that glance() calculated for us. You can multiply the proportion by 100 to express it as a percentage, which indicates that about 49% of the variability in the share of the vote received by the incumbent party is explained by the income growth that US residents achieved during the prior term. The remaining 51% might be just random error in the model, or this remaining variability might be accounted for by other variables. For example, in Module 11 we will determine if additional variability in vote can be accounted for by adding fatalities (i.e., loss of U.S. military troops) as an additional predictor to the regression model. Hibbs's Bread and Peace Model asserts that it should.

Recall from the beginning of the Module, we learned that an outcome can be modeled as:

Outcome = Systematic Component + Residual

And, that statistical techniques allow us to explain variation in our outcome (the systematic component) in the context of what remains unexplained (the residual). The \(R^2\) gives us the quantity of the systematic component — the proportion of the variance in vote that can be predicted by growth. The pie chart below depicts this for our example.


The \(R^2\) is very useful, but it’s important to recognize that it has limitations. While a higher \(R^2\) value indicates more of the variability in the outcome can be predicted by the predictor(s), it does not necessarily mean the model is good. A model can have a high \(R^2\) value but still be a poor predictive model if it’s overly complex (overfitting) or if key assumptions of the regression analysis are violated. These are topics we will discuss later in the course.

The standard error of the residual

There is one other output from glance() that is of interest to us now — and that is sigma, also referred to as the residual standard error.

\[SD_y = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n-1}} = 5.44\]

\[SD_{y|x} = \sqrt{\frac{\sum (y_i - \hat{y_i})^2}{n-2}} = 4.01\]

Sigma is available in the glance() output.

bp_mod1 |> glance() |> select(sigma)

Sigma (\(\sigma\)) represents the standard deviation of the residuals, which are the differences between the observed values and the predicted values of the outcome variable, Y. Ideally, we want to compare \(\sigma\) to the standard deviation of the outcome variable, Y. Our goal is that after accounting for the predictors, the standard deviation of the residuals (\(\sigma\)) is smaller than the standard deviation of Y. This reduction would indicate that the predictor(s) provide meaningful information to predict the outcome.
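To connect the two formulas above to this output, here is a short sketch that computes the standard deviation of vote and the residual standard error by hand from the residuals (using augment(), introduced earlier):

# standard deviation of the outcome versus the standard deviation of the residuals
bp |> summarize(sd_vote = sd(vote))  # about 5.44

bp_mod1 |> 
  augment(data = bp) |> 
  summarize(sigma_by_hand = sqrt(sum(.resid^2) / (n() - 2)))  # about 4.01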

The illustration above vividly captures this concept. On the graph's left side, you'll see a large distribution titled "full distribution of Y." Taking the outcome vote as an example, this wide spread showcases the complete variability of the variable vote. The standard deviation of Y, which is the square root of its variance, measures this spread and is calculated without considering any predictors.

Contrasting this, when we introduce X as a predictor, we aim for the smaller distributions of the \(\hat{y_i}\) scores given distinct values of \(x_i\) to be narrower than the overall distribution of Y. If these smaller distributions are indeed narrower, it reaffirms that X is providing significant insights into the prediction of Y.

Use the equation to forecast

We can use our equation to forecast “out of sample” predictions. That is, predict what might happen for years not included in the data frame based on the fitted model. For example, 2020 data was not included in Thomas’s data frame that we analyzed.

Let's consider the 2020 election between Donald Trump (Republican, the incumbent) and Joe Biden (Democrat). I applied the same method used in the Thomas paper to calculate growth for 2020 and obtained an estimate of 3.39. We can plug this number into the equation that we obtained using years 1952 to 2016 to get a predicted score for vote.

\[\hat{y_i} = 45.021+2.975\times3.39 = 55.11\]

This forecast means that, based on income growth alone, we would have expected Donald Trump to garner about 55.1% of the vote share. There were 81,285,571 votes cast for Biden and 74,225,038 votes cast for Trump in the 2020 election. Thus, the actual vote score for the 2020 election (i.e., the percentage of votes garnered by the incumbent candidate) was 47.7%, producing a residual of about -7.4. The graph below overlays the 2020 data point onto our graph.
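The same forecast can be reproduced with the predict() function by feeding in the 2020 growth estimate as new data; the residual then follows from the observed 47.7%. This is just a quick sketch of the calculation described above.

# out-of-sample forecast for 2020 and its residual
forecast_2020 <- predict(bp_mod1, newdata = tibble(growth = 3.39))
forecast_2020          # about 55.1
47.7 - forecast_2020   # residual of about -7.4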

Building a regression model and using it to forecast future outcomes can be an incredibly valuable technique. By identifying and quantifying the relationship between variables based on historical data, we can make informed predictions about what might happen under similar conditions in the future. This is the cornerstone of many fields, including finance, economics, medicine, and machine learning, to name just a few.

Correlation

For two numeric variables (e.g., vote and growth in the Bread and Peace data frame) we can also calculate the correlation coefficient. A correlation coefficient, often abbreviated as \(r\), quantifies the strength of the linear relationship between two numerical variables using a standardized metric. A correlation coefficient ranges from -1 to 1, where -1 indicates a perfect negative relationship (as one value increases, the other decreases), and +1 indicates a perfect positive relationship (as one value increases, the other increases). A correlation coefficient of 0 indicates no linear relationship between the two variables. A correlation coefficient captures how tightly knit the data points are about the best fit line, as well as the direction of the line.

To grow your intuition for correlation coefficients, check out this interactive application.

The correlate() function in the corrr package calculates the correlation coefficient and puts it in a nice matrix form. Note that a correlation matrix displays the correlations between each pair of variables in a data frame. The correlations are listed in both the upper and lower diagonals to provide a symmetrical and redundant representation. This redundancy allows for easier interpretation, as the user can find the correlation between any two variables regardless of their position in the matrix, ensuring clarity and reducing the likelihood of errors in reading the values.

bp |> 
  select(growth, vote) |> 
  corrr::correlate()
Correlation computed with
• Method: 'pearson'
• Missing treated using: 'pairwise.complete.obs'

The correlation between vote and growth is .70, denoting a strong, positive association. Of note, the square of the correlation coefficient equals the \(R^2\) in an SLR (i.e., \(.70^2 = .49\)).
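A quick check with base R's cor() function confirms this relationship between the correlation coefficient and \(R^2\):

# the squared correlation equals the R-squared from the simple linear regression
cor(bp$vote, bp$growth)    # about .70
cor(bp$vote, bp$growth)^2  # about .49, matching r.squared from glance()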

While the regression slope is in the metric of the Y variable, and is dependent on the measurement scale of both the X and Y variables (i.e., a one unit increase in the \(x_i\) scores is associated with a \({b_1}\) unit change in the \(y_i\) scores), a correlation coefficient is in the metric of standardized units.

An interesting property of linear models is that if you create z-scores of your X and Y variables, and then fit a simple linear regression, the slope for the regression line will equal the correlation. Take a look at the code below, and then see the estimate for growth_z in the output. It matches the correlation coefficient. Therefore, we can also interpret the correlation coefficient as “the expected change in the standardized \(y_i\) scores for a one standard deviation increase in the \(x_i\) scores.”

bp_z <- bp |> 
  select(vote, growth) |> 
  mutate(vote_z = (vote - mean(vote))/sd(vote)) |> 
  mutate(growth_z = (growth - mean(growth))/sd(growth)) 

bp_mod_z <- lm(vote_z ~ growth_z, data = bp_z)

bp_mod_z |> 
  tidy() |> 
  select(term, estimate) |> 
  mutate(across(where(is.numeric), ~round(., 3))) 

Note: the extra code below tidy() in the code above rounds the numeric estimates to three decimal places, making it easier to read. Otherwise, it would be printed out using scientific notation.

Correlation is a statistical measure that quantifies the size and direction of a relationship between two variables. As we learned in this module, estimating the relationship between two variables can be extremely useful. However, it is critical to realize that finding a correlation between two variables does not mean that one variable causes the other variable. That is, correlation does not equal causation. Causation indicates that one event is the result of the occurrence of another event — in other words, that one event causes the other. We’ll learn more about correlation and causation in psychological studies later in the course.

For now, please watch the following Crash Course Statistics video that introduces these issues.



Wrap up

In this module on Simple Linear Regression (SLR), we explored the foundational concepts and applications of modeling relationships between two continuous variables. Through practical examples and implementation in R, we demonstrated how SLR can predict outcomes and elucidate how change in one variable is associated with change in another. The core of our learning centered on the best fit line — defined by its intercept and slope — which is determined using the least squares criterion to minimize the sum of squared errors between observed values and predicted values. This best fit line helps in predicting and interpreting the relationship between variables; specifically, the slope indicates the expected change in Y for every one-unit increase in X, while the intercept represents the expected value of Y when X is zero.

We delved into the significance of the SLR equation components, discussing how they contribute to the model’s predictive capabilities and our understanding of the data. The residuals — differences between observed and predicted values — were highlighted for their critical role in evaluating the model’s fit.

Additionally, we examined the correlation coefficient, a key statistical measure that quantifies the strength and direction of a linear relationship between two variables. This coefficient, along with the \(R^2\) value derived from the model, helps quantify how much of the variability in the dependent variable can be explained by the independent variable, providing a robust tool for assessing the effectiveness of our models.

By integrating these concepts with hands-on R exercises and visualizations, this Module not only strengthened your statistical analysis skills but also prepared you for applying these techniques in broader contexts. From this foundation — we will build to consider more complex models in the coming Modules.