Checking Assumptions and Remediation Measures
Module 17
Learning objectives
- Describe the assumptions underlying a linear regression model
- Practice creating diagnostic plots to assess the assumptions
- Describe tactics to remedy violated assumptions
Overview
In this Module, our focus is on checking the assumptions of a linear regression model. This critical topic might seem mundane, but it’s truly a cornerstone in understanding how to use linear regression responsibly and interpret its results correctly. As we journey through this Module, you’ll become familiar with the key assumptions that underpin every linear regression analysis. You’ll understand why they are important, what happens when they aren’t met, and how to check if your data abide by these assumptions.
At its heart, linear regression is a powerful statistical tool that allows us to understand the relationships between different variables. However, for the equation that relates some set of \(X\) variables to some \(Y\) outcome to accurately tell us about these relationships, certain conditions, or assumptions, need to be met by our data. If these assumptions aren’t met, our model could lead us to make inaccurate predictions or conclusions. This Module will provide you with the necessary knowledge and tools to determine if your model is built on a solid foundation.
The core assumptions
In their book Regression and Other Stories, renowned statisticians Drs. Andrew Gelman, Jennifer Hill, and Aki Vehtari outline six key assumptions of linear regression models. Although all are important, they present them in decreasing order of importance.
1. Validity of the model
The most critical factor in linear regression analysis is ensuring the validity of your model. Your model should correctly reflect the research question you’re trying to answer. The most valuable task a data scientist can undertake is to invest substantial effort in defining (and refining) the question that is being posed so that it is as clear and precise as possible. Once this question is meticulously defined, it paves the way for identifying the correct measures to evaluate the key constructs and devise an appropriate model to answer your question.
2. Representativeness of the sample
Secondly, your data should be representative of the larger population to which you want to generalize your results. As an example, consider a study on the impact of a public health intervention, like a new fitness initiative, on community health. It’s crucial that your sample includes individuals from different age groups, ethnicities, fitness levels, and socioeconomic statuses. If your sample overly represents a specific group, like people who already lead active lifestyles, it could skew your results. This imbalance might lead to an overestimation of the initiative’s effectiveness, as those who are already active may respond differently to the intervention than those who are less active. Thus, ensuring representativeness is a key component of ensuring the validity of your results.
3. Linearity and additivity
Third, linear regression assumes that the relationship between your predictors and the outcome is linear and additive. This means, for instance, if you’re using age and diet to predict heart disease risk, you assume the effect of each predictor on the outcome (adjusting for the other predictors in the model) is linear (i.e., a straight line is a good representation of the relationship). And, you assume that the effect of age on heart disease risk is the same at all levels of diet, and vice versa (this is additivity). If this is not the case, you might need to modify your model by transforming your predictors or outcome (e.g., applying a non-linear transformation), using other methods to account for non-linearity (e.g., polynomial terms), including interaction terms between predictors, or choosing a different type of model that provides a better fit to the data.
4. Independence of errors
The fourth assumption is that the errors (or residuals) are independent. This means the error associated with one observation is not related to the error of any other observation. This assumption can be violated in cases where study participants are related to one another or when the data is structured in a particular way that induces dependency. For instance, if you are studying the academic achievement of pre-kindergarten students across 100 different classrooms, the students within each classroom might share similar learning experiences, teaching styles, and educational resources, thus inducing a correlation in their academic achievements.
This clustering effect may result in the violation of the independence assumption as the errors associated with students in the same classroom are likely to be more similar to each other than to students from different classrooms. This situation necessitates the use of more sophisticated statistical methods, like mixed models or multilevel linear models, which account for such intra-group correlation.
5. Equal variance of errors
Next, linear regression assumes that the variance of errors is constant across all levels of your predictors. This is known as homoscedasticity. If the variance changes (i.e., heteroscedasticity), your model might still provide useful predictions, but it might not be as efficient, and the standard errors (and subsequent significance tests) might not be accurate.
6. Normality of errors
The last assumption is that the residuals (or errors) of your model are normally distributed. This means if you were to plot a histogram of residuals, you would see a normal, bell-shaped curve. However, the violation of this assumption often has less severe implications compared to the others, and the model can still provide useful predictions.
Keep in mind, these are ideal conditions. In reality, your data might not perfectly meet these assumptions, but that doesn’t necessarily mean you can’t use linear regression. Instead, you should be aware of these assumptions and the potential implications if they’re not met. Lastly, when checking these assumptions, always interpret your results within the context of your research question and the data you’re using. These assumptions are guidelines, not hard rules. They should aid your analysis, not dictate it.
An example
Throughout this course, we have endeavored to replicate the findings presented by Dr. Irene Blair and her colleagues in their paper, “The Influence of Afrocentric Facial Features in Criminal Sentencing”. For the remainder of this module, we will examine the assumptions underpinning what they regard as their ultimate fitted model — Model 3.
We focus here on assumptions 3, 5, and 6 — as the first two assumptions should be examined and dealt with prior to conducting the study, and assumption 4 is primarily about the design of one’s study. Assumptions 3, 5, and 6 pertain specifically to checking the adequacy of the fitted model.
The data include the following variables:
variable_name | label | notes |
---|---|---|
id | inmate ID | NA |
years | sentence length in years | NA |
black | race (1 = black, 0 = white) | NA |
afro | rating for afrocentric features | inmate photo rated by ~35 CU undergraduate students; raters assigned a single, global assessment of the degree to which each face had features that are typical of African Americans, using a scale from 1 (not at all) to 9 (very much) |
primlev | seriousness of primary offense | based on Florida’s rating system, higher numbers indicate more serious felonies |
seclev | seriousness of secondary offense | based on Florida’s rating system, higher numbers indicate more serious felonies |
nsecond | number of secondary offenses | NA |
anysec | indicator for any secondary offenses (1 = yes, 0 = no) | NA |
priorlev | seriousness of prior offenses | based on Florida’s rating system, higher numbers indicate more serious felonies |
nprior | number of prior offenses | NA |
anyprior | indicator for any prior offenses (1 = yes, 0 = no) | NA |
attract | rating for attractiveness | inmate photo rated by ~35 CU undergraduate students |
babyface | rating for babyface features | inmate photo rated by ~35 CU undergraduate students |
We will need the following packages for our work in this Module:
library(here)
library(skimr)
library(broom)
library(boot)
library(tidyverse)
Let’s import the data frame.
df <- read_csv(here("data", "sentence.csv"))

df |> head()
And, get some descriptive statistics to refresh our memories about the variables.
df |> skim()
Name | df |
Number of rows | 216 |
Number of columns | 13 |
Column type frequency: | |
numeric | 13 |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1 | 108.50 | 62.50 | 1.00 | 54.75 | 108.50 | 162.25 | 216.00 | ▇▇▇▇▇ |
years | 0 | 1 | 6.84 | 15.54 | 0.42 | 1.56 | 2.67 | 4.67 | 99.00 | ▇▁▁▁▁ |
black | 0 | 1 | 0.46 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
afro | 0 | 1 | 4.53 | 1.77 | 1.49 | 2.83 | 4.47 | 6.09 | 7.86 | ▇▇▆▇▆ |
primlev | 0 | 1 | 6.55 | 2.07 | 1.00 | 5.00 | 7.00 | 8.00 | 11.00 | ▁▇▇▆▁ |
seclev | 0 | 1 | 3.40 | 2.57 | 0.00 | 0.00 | 3.65 | 5.00 | 9.00 | ▇▆▇▃▂ |
nsecond | 0 | 1 | 2.41 | 3.82 | 0.00 | 1.00 | 1.00 | 3.00 | 41.00 | ▇▁▁▁▁ |
anysec | 0 | 1 | 0.75 | 0.44 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▃▁▁▁▇ |
priorlev | 0 | 1 | 1.43 | 2.29 | 0.00 | 0.00 | 0.00 | 3.00 | 9.00 | ▇▂▂▁▁ |
nprior | 0 | 1 | 0.95 | 1.90 | 0.00 | 0.00 | 0.00 | 1.00 | 13.00 | ▇▁▁▁▁ |
anyprior | 0 | 1 | 0.33 | 0.47 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▃ |
attract | 0 | 1 | 3.21 | 0.91 | 1.43 | 2.57 | 3.16 | 3.71 | 6.51 | ▃▇▅▁▁ |
babyface | 0 | 1 | 4.04 | 1.10 | 1.66 | 3.18 | 3.91 | 4.82 | 7.09 | ▂▇▇▅▁ |
Please take a moment to peruse the output to refamiliarize yourself with the variables.
Prepare the data for analysis
From working with these data in prior activities, you know that there is a bit of data wrangling that is necessary to prepare the data to fit the models outlined in the paper. This includes taking the natural log of the outcome, centering the continuous predictors at their mean, forming polynomial terms, and creating an effect code for the indicator for race. The code below accomplishes these tasks.
sentence <-
  df |>
  mutate(lnyears = log(years),
         primlev_m = primlev - mean(primlev),
         primlev2 = primlev_m^2,
         seclev_m = seclev - mean(seclev),
         seclev2 = seclev_m^2,
         priorlev_m = priorlev - mean(priorlev),
         priorlev2 = priorlev_m^2,
         nsecond_m = nsecond - mean(nsecond),
         nprior_m = nprior - mean(nprior),
         afro_m = afro - mean(afro),
         babyface_m = babyface - mean(babyface),
         attract_m = attract - mean(attract)) |>
  mutate(black_m = case_when(black == 0 ~ -1, black == 1 ~ 1)) |>
  select(id,
         years, lnyears,
         primlev_m, primlev2, seclev_m, seclev2, nsecond_m, priorlev_m, priorlev2, nprior_m,
         black, black_m,
         afro, afro_m, attract_m, babyface_m)
Fit the model
As outlined in the published paper, in Model 3 the authors test their primary hypothesis of interest. That is, they determine whether Afrocentric features predict log sentence length, holding constant race and all of the criminal record variables. The authors describe this step as follows:
In a third model, we added the degree to which the inmates manifested Afrocentric features as a predictor of sentence length, controlling for the race of the inmates and the seriousness and number of offenses they had committed.
Here, the null hypothesis is that, holding constant criminal record and race, Afrocentric features are not related to sentence length. The alternative hypothesis is that, holding constant criminal record and race, Afrocentric features are related to sentence length.
The code below fits Model 3 from the paper.
mod3 <- lm(lnyears ~ primlev_m + primlev2 + seclev_m + seclev2 + nsecond_m + priorlev_m + priorlev2 + nprior_m +
             black_m +
             afro_m, data = sentence)

mod3 |> tidy(conf.int = TRUE, conf.level = .95)
Based on Model 3, Blair and colleagues reject the null hypothesis. The coefficient for Afrocentric features (afro_m) is .092. The test statistic is 2.306 and the p-value is .022. The 95% CI is .01 to .17. It’s critical to note that the accuracy and validity of this estimate depends on the assumptions of our linear regression model being met. Let’s determine if this is the case.
Create a data frame with the fit statistics
For many of the graphs that we will produce below, we’ll need statistics from the augment() function (part of the broom package). Let’s create a data frame that holds these values. These statistics will be described as we move through the Module.
eval <-
  mod3 |>
  augment(data = sentence) |>
  select(id, years, lnyears, black, afro, afro_m, .fitted, .resid, .std.resid, .hat, .cooksd)

eval |> head()
Evaluate key assumptions of the linear regression model
Linearity and additivity
Fitted values vs residuals plot
To check linearity and additivity, we will begin by examining a scatter plot of the .fitted values (i.e., y-hat or the predicted values) against the .resid values (i.e., the residuals) for our fitted model.
In linear regression, we aim to model the relationship between our predictor(s) and the outcome. The fitted values (.fitted) represent the predicted values of the outcome variable according to our model - they’re the model’s best guess for what the outcome should be based on the information we provided it about the predictors.
On the other hand, the residuals (.resid) represent the difference between the actual observed values and the predicted values (i.e., the errors in our model’s predictions). Ideally, if our model is perfect, these residuals would all be zero because the model’s predictions match perfectly with the observed data. However, in reality, this is never the case and there’s some degree of error present.
Now, when we plot fitted values against residuals, we’re essentially investigating how the residuals are distributed across different levels of our predicted outcome. If the model did an adequate job of using the predictors to explain variability in the outcome, then every way in which the predictors are related to the outcome should be swept up in the systematic part of the model — and therefore, the residuals should be unrelated to the fitted values. If, instead, the residuals are related to the fitted values, then that is a sign of model misspecification.
In a well-specified model, the residuals should not show any systematic pattern when plotted against the fitted values (reflecting the linearity assumption), and the spread or dispersion of the residuals should be constant across the range of fitted values (reflecting the assumption of homoscedasticity). If residuals are related to the fitted values in a systematic way (e.g., they show a clear pattern or their spread varies with the fitted values), this suggests some form of model misspecification. It could be that the model is missing important predictor(s), the relationship between predictors and the outcome is not linear, or there are missing interactions between variables. It could also suggest that the errors are not independent, another key assumption of linear regression models.
Let’s create a scatter plot of the fitted values (.fitted) and the residuals (.resid) resulting from Model 3 — this is called a Fitted Values vs Residuals Plot.
eval |>
  ggplot(mapping = aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 1, color = "#F05676") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "#2F9599") +
  labs(x = "Fitted values", y = "Residuals", title = "Fitted values vs residuals") +
  theme_bw()
In this graph, the pink dashed line is simply marking the center of the residuals, that is, it’s a reference line representing zero residual error. If there was no error in predicting the outcome (i.e., we perfectly predicted the outcome for each case), then all the points would fall on this line. Of course, that’s not realistic. Instead, we hope to see that the residuals are randomly scattered around this line.
The blue curve produced by the geom_smooth() function is a loess smoother and serves to visually depict any patterns in the residuals. Ideally, this line should be approximately horizontal and near the pink dashed line. If it shows any pronounced curve or wave-like form, this is an indication of non-linearity. In this case, the blue line isn’t exactly on top of the pink dashed line — but the difference is relatively minor. The area of largest departure is for large fitted values (i.e., people for whom our model predicts a long sentence length). There are few people out in the right-hand tail for sentence length, and therefore, the loess smooth curve doesn’t have a lot of data to go on — therefore, it’s typical to see more wobble or variation in the shape of the curve for extreme values, and we shouldn’t read too much into this.
One common reason for a pattern to emerge in this type of plot is that the relationship between the outcome and one or more of the predictors is non-linear (i.e., curvilinear). In this case — the model should be re-specified. The analyst should consider the need for non-linear transformations or for non-linear effects to be specified. The possibility that there are interactions between variables should also be considered if a systematic pattern is observed in the Fitted Values vs Residuals Plot.
Recall that Blair and colleagues implemented a number of techniques for dealing with the possibility of a non-linear relationship. First, noting the severe skew of the outcome — sentence length in years — they applied a non-linear transformation to this variable. That is, they took the natural log of years and then used log sentence length as the outcome. Second, as part of their theory for the manner in which the crime severity variables should be lawfully related to sentence length, they included polynomial terms for the severity of crime predictors.
Let’s take a peek at what the Fitted Values vs Residuals Plot would look like had they not taken these actions:
mod3_naive <- lm(years ~ primlev_m + seclev_m + nsecond_m + priorlev_m + nprior_m +
                   black_m +
                   afro_m, data = sentence)

mod3_naive |>
  augment(data = sentence) |>
  ggplot(mapping = aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 1, color = "#F05676") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "#2F9599") +
  labs(x = "Fitted values", y = "Residuals", title = "Fitted values vs residuals") +
  theme_bw()
In this version of the graph — you can clearly see that the data points are not evenly distributed around the pink dashed line — and the blue line clearly shows a systematic relationship with the residuals. Therefore, the authors’ strategy to apply the non-linear transformation to the outcome, and to include polynomial terms to allow for a curvilinear effect of severity of crimes on log sentence length was effective.
Added variable plot
Besides just looking at the Fitted Values vs Residuals Plot, it’s also worthwhile to take a look at the Added Variable Plot for our key predictor (Afrocentric features) and the outcome (log sentence length) to ensure that the relationship between these two variables is indeed linear after accounting for the control variables. This is important because the interpretation of the effect of Afrocentric features on log sentence length is assumed to be linear controlling for the covariates — that is, each one-unit increase in Afrocentric features has the same effect on log sentence length whether we’re considering a difference between 1 and 2, 4 and 5, or 8 and 9.
Let’s take a look at the Added Variable Plot for afro_m and lnyears.
y_resid <- lm(lnyears ~ primlev_m + primlev2 + seclev_m + seclev2 +
                nsecond_m + priorlev_m + priorlev2 + nprior_m +
                black_m, data = sentence) |>
  augment(data = sentence) |>
  select(id, .resid) |>
  rename(y_resid = .resid)

x_resid <- lm(afro ~ primlev_m + primlev2 + seclev_m + seclev2 +
                nsecond_m + priorlev_m + priorlev2 + nprior_m +
                black_m, data = sentence) |>
  augment(data = sentence) |>
  select(id, .resid) |>
  rename(x_resid = .resid)

check <-
  y_resid |>
  left_join(x_resid, by = "id")

check |>
  ggplot(mapping = aes(y = y_resid, x = x_resid)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "#2F9599") +
  theme_minimal() +
  labs(title = "Added variable plot for lnyears and afro, adjusting for all control variables")
As we learned in Module 11, the logic behind an Added Variable Plot (AVP) is that it allows you to visually assess the linearity and the strength of the relationship between your predictor of interest and the outcome, while taking into account the effects of other predictors. In terms of checking for linearity, the AVP should show a linear pattern; the points should scatter around a straight line (either positive or negative slope depending on the direction of the relationship). If there is a nonlinear pattern in the scatter, it suggests that the relationship between the predictor and the outcome is not linear and the linearity assumption is violated.
The AVP above provides evidence that the relationship is indeed linear — the best fit line (the blue line) seems to do a good job of capturing the unique relationship between Afrocentric features and log sentence length after adjusting for the control variables.
To consider the possibility that interactions between variables are needed to address violations of additivity, let’s consider the possibility that there is an interaction between race and Afrocentric features in predicting log sentence length. Recall that Blair and colleagues tested for this possibility (this was Model 4 in their paper) — but did not find evidence of an interaction. Still, let’s create a plot that allows us to consider the possibility that the effect of Afrocentric features on log sentence length, adjusting for the control variables, could differ by race. Notice that I’m fitting the same two models as before for the AVP to obtain the y- and x- residuals — but, I’m excluding the predictor black this time. Excluding this variable will facilitate the creation of two AVPs — one for White inmates and one for Black inmates.
y_resid <- lm(lnyears ~ primlev_m + primlev2 + seclev_m + seclev2 +
                nsecond_m + priorlev_m + priorlev2 + nprior_m,
              data = sentence) |>
  augment(data = sentence) |>
  select(id, black, .resid) |>
  rename(y_resid = .resid)

x_resid <- lm(afro ~ primlev_m + primlev2 + seclev_m + seclev2 +
                nsecond_m + priorlev_m + priorlev2 + nprior_m,
              data = sentence) |>
  augment(data = sentence) |>
  select(id, .resid) |>
  rename(x_resid = .resid)

check <-
  y_resid |>
  left_join(x_resid, by = "id")

check |>
  mutate(black.f = case_when(black == 0 ~ "White", black == 1 ~ "Black")) |>
  ggplot(mapping = aes(y = y_resid, x = x_resid)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, color = "#2F9599") +
  facet_wrap(~ black.f) +
  theme_minimal() +
  labs(title = "Added variable plot for lnyears and afro, stratified by race and adjusting for all control variables")
This plot shows that the relationship between Afrocentric features and log sentence length, adjusting for the control variables, is similar for Black and White offenders. If the best fit lines were different from one another (e.g., in one group the relationship was positive, and in another negative) — then that would suggest the need for an interaction between race and Afrocentric features — however, as verified by Blair and colleagues, and as observed in these graphs, there is not evidence that the additivity assumption is violated. That is, the effect of Afrocentric features on log sentence length after adjusting for control variables is similar for Black and White offenders.
In sum, based on all of these plots, there is no clear sign of curvature or a non-linear pattern — suggesting that the assumptions of both linearity and additivity are met in the specified Model 3.
Remediation efforts
If the assumptions of linearity and additivity are violated in a regression model, there are several strategies you can consider:
Data transformation: One common approach is to transform the variables in your model. This can sometimes make the relationship between the predictors and the outcome more linear and can also help to meet the assumption of additivity. Logarithmic, square root, or inverse transformations are commonly used.
- Mosteller and Tukey’s Rule of the Bulge might help to identify an appropriate transformation. To apply the rule, plot the data and identify whether there is a “bulge” (that is, a curvature) in the scatter plot. Depending on the direction of the bulge, apply a transformation to one or both of the variables. The ladder and graph below depict the Rule. On the right are four different patterns of nonlinearity (i.e., the blue curves). Identify which curve most closely matches the non-linear pattern of the data, then apply one of the suggested transformations (e.g., up in x, down in x, up in y, or down in y). The ladder figure to the left offers suggestions for the up and down transformations. For example, a square is an up transformation (it makes the values larger), while a log is a down transformation (it makes the values smaller). Apply the suggested transformation, re-plot the data with the transformed variable(s), and repeat until the scatter plot looks more linear.
Adding interaction terms: If the relationship between predictors and the outcome variable is not additive (that is, the effect of one predictor on the outcome changes depending on the level of another predictor), you might consider adding interaction terms to your model. This allows the effect of one predictor to depend on the level of another predictor.
Polynomial regression: You can add polynomial terms of your predictors to capture non-linear effects. For example, if the plot of residuals versus the fitted values shows a U-shaped pattern, this could be addressed by adding a squared term of the predictor to the model.
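To make these options concrete, here is a minimal sketch of what each remedy looks like in lm() syntax. The variables y, x1, and x2 and the data frame my_data are hypothetical placeholders, not part of the sentencing data.

# 1. Transform the outcome (e.g., a log) to help linearize the relationship
m_log <- lm(log(y) ~ x1 + x2, data = my_data)

# 2. Add an interaction term to relax the additivity assumption
m_int <- lm(y ~ x1 * x2, data = my_data) # expands to x1 + x2 + x1:x2

# 3. Add a polynomial (squared) term to capture curvature
m_poly <- lm(y ~ x1 + I(x1^2) + x2, data = my_data)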
Remember, it’s essential to consider the purpose of your model and the trade-offs associated with each method.
Homoscedasticity
Homoscedasticity is an important assumption in regression analysis that refers to the uniformity of the variances of the residuals, or error terms, across the range of predicted outcome values. This assumption means that the spread or dispersion of the residuals is constant for all levels of the predictor variables. In other words, the variability of the error terms should be roughly the same at all values of your predictor variables and the predicted value.
When homoscedasticity holds, the error variance is constant across all levels of the predictor variables, which supports reliable statistical inference from the model, like hypothesis testing and the creation of confidence intervals.
The violation of the homoscedasticity assumption is known as heteroscedasticity. This occurs when the size of the error term differs across values of a predictor variable. For example, you might see a pattern where the scatter or spread of residuals increases or decreases systematically along the range of the predictor.
Fitted values vs residuals plot
To examine this assumption the Fitted Values vs Residuals Plot that we created when assessing the linearity assumption can be used. Let’s take another look at this plot.
eval |>
  ggplot(mapping = aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 1, color = "#F05676") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "#2F9599") +
  labs(x = "Fitted values", y = "Residuals", title = "Fitted values vs residuals") +
  theme_bw()
If the assumption of homoscedasticity is met in your regression model, the scatter plot of residuals versus fitted values will exhibit a random scattering of points around zero, forming a sort of cloud that is spread evenly across all levels of the fitted values. This means the variability in residuals stays roughly the same as you move along the range of fitted values. Plotting the residuals against the fitted values provides a “global” look at homoscedasticity — but you can also plot the residuals against each of your individual predictors. In all cases, you should find that the residuals are evenly spread across the range of scores for your predictor.
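As a quick sketch of that per-predictor check (not a plot from the original analysis), the code below plots the residuals against the centered Afrocentric features rating using the eval data frame created earlier; the spread of the points should look roughly even across the range of the predictor.

eval |>
  ggplot(mapping = aes(x = afro_m, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 1, color = "#F05676") +
  labs(x = "Afrocentric features (centered)", y = "Residuals",
       title = "Residuals vs a single predictor") +
  theme_bw()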
In our example testing Blair’s Model 3, the scatter plot of residuals versus fitted values exhibits a random scattering of points around zero, forming a cloud that is spread evenly across all levels of the fitted values. Therefore, we have evidence that the assumption of homoscedasticity is met.
If the assumption of homoscedasticity is violated, you’ll see patterns in the scatter plot that indicate heteroscedasticity. Here are a few typical forms this could take:
Funnel Shape or Cone Shape: The scatter of points widens or narrows as you move along the x-axis. This could mean that the variability of the residuals is increasing or decreasing as the value of the predictor increases. The funnel can open in either direction: narrow at the low end of the fitted values and wide at the high end, or vice versa.
U-Shape or Inverted U-Shape: The residuals spread wider in the middle of the fitted values range or at the extremes. This is often a sign that a quadratic (i.e., polynomial) term might be missing from your model.
Wave-like or Curvy Patterns: Systematic wave-like patterns in residuals suggest non-linear relationships between predictors and the outcome.
Remember, the scatter plot should ideally look like a random cloud of points with no discernible pattern. Any strong pattern suggests a violation of the homoscedasticity assumption, indicating that your model may not adequately capture the data’s structure, and statistical inference from the model could be unreliable. Importantly, the same problems that cause issues of non-linearity also cause heteroscedasticity of variance — thus, fixing up the linearity and additivity issues very often also solves heteroscedasticity issues.
Heteroscedasticity can cause issues for the standard errors of your coefficients, which can in turn lead to incorrect conclusions about your results.
Remediation efforts
If heteroscedasticity is detected in your data, there are several strategies to address it:
Data Transformation: The simplest way to deal with heteroscedasticity is to transform your outcome variable. Common transformations include taking the log, square root, or reciprocal of the outcome variable. However, it’s important to note that these transformations also alter the interpretation of your model.
Weighted Least Squares: If you have some idea about the nature or structure of the heteroscedasticity, you might be able to use this information to give more weight to the observations with smaller error variances in your model. This is the idea behind weighted least squares (WLS). The challenge here is accurately specifying the weights. Click here for a tutorial on how to conduct WLS in R.
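As an illustration only, a simple two-step version of this idea for Model 3 might look like the sketch below. The weighting scheme (the inverse of the squared predicted spread of the absolute residuals) is one crude choice among many, not something taken from the published analysis.

# model the absolute residuals as a function of the fitted values
spread_fit <- lm(abs(residuals(mod3)) ~ fitted(mod3))

# weights: inverse of the squared predicted spread (floored to avoid non-positive values)
wts <- 1 / pmax(fitted(spread_fit), 1e-6)^2

mod3_wls <- lm(lnyears ~ primlev_m + primlev2 + seclev_m + seclev2 + nsecond_m +
                 priorlev_m + priorlev2 + nprior_m + black_m + afro_m,
               data = sentence, weights = wts)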
Heteroscedasticity-Consistent Standard Errors: If the main concern is the reliability of inference (e.g., hypothesis tests or confidence intervals), another common approach is to calculate heteroscedasticity-consistent standard errors, often known as “robust standard errors”. This method does not transform the data or the model but instead adjusts the standard errors of the coefficients to account for heteroscedasticity. Click here for directions on fitting this type of model in R.
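For example, assuming the sandwich and lmtest packages are installed (they are not loaded elsewhere in this Module), robust standard errors for Model 3 could be obtained with a sketch like this:

# recompute the coefficient tests for mod3 using an HC3 robust covariance matrix
lmtest::coeftest(mod3, vcov. = sandwich::vcovHC(mod3, type = "HC3"))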
Bootstrap Resampling: Bootstrapping confidence intervals is another option to consider when dealing with heteroscedasticity. The bootstrap is a resampling method that can be used to obtain robust estimates of standard errors and confidence intervals. This method does not require any assumptions about the form or nature of the heteroscedasticity, which can make it a good option in many cases. You’ve had a few opportunities to learn about bootstrapping in this course. Additionally, there is an example of a bootstrap regression model at the end of this Module.
Adding explanatory variables or interaction terms: In some cases, heteroscedasticity may be a symptom of omitted variables or interaction terms in the model. If some subgroups have more variance than others, or if the variance is related to the value of one of the predictors, adding the right variables or terms to your model may help.
Each method has its own strengths and weaknesses, and the best choice often depends on the specifics of the data frame and the research question. If multiple methods seem feasible, it’s not a bad idea to try several and see if the results are sensitive to the choice of method. In many cases, the most important thing is to be transparent and thoughtful about the issue, rather than simply ignoring it.
Normally-distributed residuals
The assumption of normality of residuals in a linear regression model is the idea that the residuals, or error terms, follow a normal distribution. This assumption is important because many of the inferential statistics associated with linear regression, such as the t-statistic for significance of coefficients, are based on the assumption of normality. These statistics may not have a valid interpretation if the residuals are not normally distributed.
Histogram of residuals
To check this assumption, we can create a histogram of the residuals:
eval |>
  ggplot(mapping = aes(x = .resid)) +
  geom_histogram(bins = 30) +
  theme_minimal() +
  labs(title = "A check of the normality of the residuals from the fitted model")
This plot looks reasonably normally distributed. A plot with clear skew in one direction or the other, or with extreme scores far from the others, would be a sign of a potential problem.
Remediation efforts
If the normality assumption is violated, there are a few strategies you might consider:
Data transformation: For example, if your outcome variable has a skewed distribution, applying a logarithm, square root, or inverse transformation might help to make the residuals more normally distributed.
Non-linear models: If the relationship between your predictors and outcome is not linear, and transformations do not work, then linear regression might not be the best modeling choice. This means that the assumption of a straight-line relationship between the predictors (independent variables) and the outcome (dependent variable) does not hold true. In such cases, a linear model will not adequately capture the underlying patterns in the data, leading to poor predictions and potentially misleading conclusions. Depending on your specific research question and data, a generalized linear model (GLM) might be more appropriate than a general linear model (i.e., the model we fit with the lm() function). GLMs extend linear models to allow for response variables that have error distributions other than a normal distribution. For instance:
Logistic Regression: If your outcome variable is binary (e.g., success/failure, yes/no), logistic regression is a type of GLM that models the probability of the outcome as a function of the predictors. This is commonly used in medical research to model the presence or absence of a disease based on various risk factors.
Poisson Regression: If your outcome variable represents count data (e.g., the number of times an event occurs), Poisson regression is suitable. This is often used in fields like epidemiology to model the number of occurrences of a disease or condition within a specific period.
Exponential Regression: When modeling time-to-event data (e.g., survival analysis), exponential regression can be used to model the relationship between predictors and the time until an event occurs.
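As a rough sketch (the outcome, predictors, and data frame below are hypothetical placeholders, not variables from the sentencing data), the first two of these models are fit in R with glm() and an appropriate family argument:

# logistic regression for a binary outcome
m_logit <- glm(disease ~ age + diet, family = binomial, data = my_data)

# Poisson regression for a count outcome
m_pois <- glm(n_events ~ age + diet, family = poisson, data = my_data)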
In summary, if non-linear transformations of your predictors do not result in a linear relationship with the outcome, you should consider generalized linear models that are more suited to the nature of your data and research question.
Robust methods: There are a number of statistical methods that are robust to violations of the normality assumption. These include non-parametric tests, bootstrap methods, and robust regression methods.
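For instance, a robust regression version of Model 3 could be fit with rlm() from the MASS package (assumed to be installed); this is a sketch for illustration, not part of the published analysis. rlm() iteratively down-weights observations with large residuals:

mod3_rlm <- MASS::rlm(lnyears ~ primlev_m + primlev2 + seclev_m + seclev2 + nsecond_m +
                        priorlev_m + priorlev2 + nprior_m + black_m + afro_m,
                      data = sentence)

summary(mod3_rlm)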
Remember that the assumption of normality applies to the residuals, not the variables themselves. There is a common misconception that the variables (predictors and outcomes) must be normally distributed — but that is not the case. Moreover, the Central Limit Theorem assures us that, with a large sample size, the sampling distributions of the regression coefficients will be approximately normal, regardless of the distribution of the residuals in the sample. Therefore, in large samples, minor violations of the normality assumption are often not a major concern.
Evaluate other potential issues
Influential cases
In regression analysis, it’s important to identify influential observations or cases that could unduly affect the model’s estimates and predictions. These cases can significantly skew the direction and strength of the relationships between predictors and the outcome variable. Influential cases may arise due to various reasons such as outliers in the predictor or outcome variables (i.e., extreme scores), data entry errors, or a small subset of the data that doesn’t align well with the overall pattern.
By identifying these influential cases, you can decide how to handle them and better understand the robustness of your model. For instance, you might investigate the data further to identify potential errors or unique aspects about these cases that warrant additional consideration. In some instances, these influential cases might lead you to modify your model or reevaluate its assumptions.
A Residuals vs Leverage Plot is a diagnostic tool that helps us detect observations that might be unduly influencing the fitted model, also known as influential observations.
Let’s create the plot, then learn how to interpret it. All of the values needed for this plot are created with the augment() function results.
p <- length(coef(mod3))
n <- nrow(sentence)

eval |>
  ggplot(mapping = aes(x = .hat, y = .std.resid)) +
  geom_point(mapping = aes(size = .cooksd)) +
  geom_hline(yintercept = c(-2, 2), linetype = "dashed") +
  geom_vline(xintercept = 2*(p/n), linetype = "dashed") +
  labs(x = "Leverage", y = "Standardized residuals",
       title = "Residuals vs leverage") +
  theme_bw()
On the x-axis, we have leverage (called .hat in the augment() output), which measures how far the values of the predictor variables for an observation are from the mean of the predictor variables. The vertical dashed line marks the cut-off for high leverage, calculated as 2*p/n, where p is the number of estimated coefficients (the number of predictors plus one for the intercept, which is how p is defined in the code above) and n is the sample size. Observations located to the right of this line are considered as having high leverage.
On the y-axis, we have standardized residuals — called .std.resid in the augment() output. This shows the distance between the actual and predicted values for each observation, but standardized to a mean of 0 and a standard deviation of 1 (i.e., a z-score). The horizontal dashed lines represent a cut-off of -2 and 2 standard deviations. Residuals falling outside this range can be considered large residuals — though remember from the Empirical Rule that we expect about 5% of cases in a normal distribution to fall outside of plus or minus 2 standard deviations from the mean. So, some points are fine — too many may be a problem.
The size of the points represents Cook’s distance (called .cooksd in augment()), which is a measure of the influence of each observation. Large values of Cook’s distance identify cases that could have a substantial impact on the location of the regression line. As a rule of thumb, a Cook’s distance larger than 1 is considered influential — but any value that is unusually large for the data frame should be inspected.
Together, the plot gives us a way to visualize the impact of each observation on the model. Observations that are in the top right or bottom right of the plot (i.e., cases with high leverage and a large residual) can be particularly influential in the regression analysis, and thus should be examined closely. Observations with large Cook’s distance (large points) are also worthy of further investigation.
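A tabular companion to the plot can also help. The sketch below simply lists the cases that fall outside the rule-of-thumb cut-offs used in the plot, sorted by Cook’s distance; it relies on the p and n objects defined for the plot above.

eval |>
  filter(.hat > 2 * (p / n) | abs(.std.resid) > 2) |>
  arrange(desc(.cooksd))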
In looking at our plot, there is one case that stands out as having a very large leverage and a large Cook’s distance. Let’s re-create the plot and label the points by their ID.
p <- length(coef(mod3))
n <- nrow(sentence)

eval |>
  ggplot(mapping = aes(x = .hat, y = .std.resid)) +
  geom_point(mapping = aes(size = .cooksd)) +
  geom_hline(yintercept = c(-2, 2), linetype = "dashed") +
  geom_vline(xintercept = 2*(p/n), linetype = "dashed") +
  ggrepel::geom_label_repel(mapping = aes(label = id),
                            color = "grey35", fill = "white", size = 2, box.padding = 0.4,
                            label.padding = 0.1) +
  labs(x = "Leverage", y = "Standardized residuals",
       title = "Residuals vs leverage") +
  theme_bw()
The case with ID 204 is the influential case. Let’s take a closer look at their data:
df |> filter(id == 204)
Scanning through this individual’s raw data — you should see something that stands out — they have an unusually large score for nsecond — the number of secondary offenses. As we saw in our skim table earlier, the mean for this variable is 2.4, and the median is 1. Therefore, this person’s score is highly unusual. If this were our own study — we would want to check this case to ensure that there wasn’t a data entry error. If it’s a real, accurate value — then transforming the nsecond variable might be a viable choice. That is, given its extremely skewed distribution, taking a log transformation (like was done for sentence length) could be a prudent choice.
It’s important to note that identifying an observation as influential doesn’t mean it should be removed — it might be a correct and important part of your data. But it does mean you should check these points carefully for data errors, and consider the impact they are having on your results.
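To illustrate the transformation idea raised above (this is a sketch, not part of the published models), one option is to refit Model 3 with a log-transformed count of secondary offenses. Because nsecond contains zeros, log1p(), the log of 1 plus the value, is used in place of a plain log:

# create a centered, log-transformed version of nsecond and merge it in by id
sentence2 <- df |>
  mutate(lnsecond_m = log1p(nsecond) - mean(log1p(nsecond))) |>
  select(id, lnsecond_m) |>
  left_join(sentence, by = "id")

mod3_t <- lm(lnyears ~ primlev_m + primlev2 + seclev_m + seclev2 + lnsecond_m +
               priorlev_m + priorlev2 + nprior_m + black_m + afro_m,
             data = sentence2)

# the residuals vs leverage plot could then be re-created for mod3_t to see
# whether case 204 remains influential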
Multicollinearity
Multicollinearity is a statistical phenomenon in which two or more predictor variables in a multiple regression model are redundant. In other words, one predictor variable can be linearly predicted from the others with a substantial degree of accuracy.
This situation can lead to several problems:
Coefficient estimates can be unreliable: With highly correlated predictors, the variance of the estimated regression coefficients can be quite large. This means that the estimates may change significantly with the addition or deletion of a predictor, making them unstable and difficult to interpret.
Significance tests can be misleading: The standard errors of the coefficients are inflated, which in turn can cause the associated t-statistics to be smaller. This can lead to failing to reject the null hypothesis — thus, incorrectly concluding that a variable is not contributing to the model, when it in fact is.
Interpretation becomes more difficult: When predictors are correlated, it means they are explaining some of the same variance in the response variable. This makes it difficult to assess the individual contribution of each predictor. In some cases this is precisely what you want — i.e., when controlling for a confounder. In other cases, it may present a challenge to interpret coefficients (i.e., when assessing the effect of each variable in your model).
Overfitting: Highly correlated predictors can cause overfitting in the model, meaning the model might fit the sample data very well but fail to generalize to new, unseen data.
Despite these issues, it’s important to note that multicollinearity is not a problem if the goal of your model is prediction, not interpretation of the coefficients. Also, the presence of multicollinearity doesn’t affect the model’s ability to accurately capture the joint effect of the predictors on the outcome variable.
In the planning stage of your analysis, inspection of the correlation matrix of your outcome and predictors is important. If you have predictors that are very highly correlated, then you may consider excluding one of them from the fitted model as they are likely redundant.
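For example, a quick, sketch-level look at the correlations among the predictors used in Model 3 could be obtained from the sentence data frame:

# pairwise correlations among the (centered) Model 3 predictors
sentence |>
  select(primlev_m, seclev_m, nsecond_m, priorlev_m, nprior_m, black_m, afro_m) |>
  cor() |>
  round(2)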
Once you fit the model, you can check the Variance Inflation Factor (VIF) for each predictor. The VIF can be calculated for each predictor by regressing the predictor on all the other predictors. The VIF is then calculated using the \(R^2\) from the resulting model with the following formula:
\[\text{VIF} = \frac{1}{1-R^2}\]
Larger VIFs indicate higher multicollinearity.
For example, if only 1% of the variability in X1 could be predicted by the other predictors, \(R^2 = 0.01\), then the VIF would be \(\frac{1}{1-0.01} \approx 1.01\).
If 80% of the variability in X1 could be predicted by the other predictors \(R^2 = 0.80\), then the VIF would be \(\frac{1}{1-0.80} = 5\).
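To connect the formula to our example, a sketch of the “by hand” VIF for afro_m (regressing it on the other Model 3 predictors) would be:

# R-squared from regressing afro_m on the remaining predictors
r2_afro <- summary(lm(afro_m ~ primlev_m + primlev2 + seclev_m + seclev2 + nsecond_m +
                        priorlev_m + priorlev2 + nprior_m + black_m,
                      data = sentence))$r.squared

# VIF for afro_m
1 / (1 - r2_afro)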
There are many rules of thumb for interpreting VIF values. Most commonly, a VIF of 5 is considered potentially problematic, and a VIF larger than 10 indicates serious multicollinearity issues. However, some experts set a lower threshold, such as 2.5, to indicate potential problems.
The performance package has a function called check_collinearity() that will calculate the VIFs.
mod3 |> performance::check_collinearity() |> select(Term, VIF) |> arrange(VIF)
In our example, nearly all of the VIFs are low (not of concern). The two that reach a moderate level are the polynomial terms of the severity of the prior offense. The VIF is inflated when polynomial terms are included in a regression model. This is because polynomial terms (like \(x_i\) and \(x_i^2\)) are correlated with each other by their very nature. The same goes for interaction terms, which are essentially the product of two or more variables. So the multicollinearity in these instances doesn’t have adverse consequences. Another instance where VIF may be high, but not of concern, is for dummy-coded variables.
If multicollinearity is found to be a problem, you should go back to the correlation matrix and determine if there are very highly correlated predictors that you can exclude without eroding your theory or conceptual model.
Bootstrap resampling
As you’ve learned throughout this course, bootstrap resampling is a powerful tool that can assist in addressing certain violations of regression model assumptions. It is a non-parametric approach, which means it makes fewer assumptions about the data compared to its parametric counterparts.
With regard to regression, bootstrap resampling can be employed to construct robust confidence intervals and hypothesis tests. This is particularly helpful when the assumptions of homoscedasticity (constant error variance) and normality of residuals are violated. Bootstrap achieves this by repeatedly resampling the original data with replacement, essentially generating numerous ‘alternative’ data frames. The regression model is then fitted to each of these resampled data frames, producing a distribution of estimates for the regression coefficients. This distribution can provide a more accurate depiction of the uncertainty associated with these coefficients when standard assumptions are not satisfied.
However, it is crucial to underscore that bootstrap resampling cannot rectify violations of the assumptions of linearity or additivity. These assumptions relate to the structure and form of the relationship between predictors and the response variable. Violations of linearity or additivity suggest that the model’s form might be incorrect or incomplete. In such situations, we might need to contemplate alternative model forms, such as incorporating interaction terms, polynomial terms, or deploying non-linear regression techniques.
The code below demonstrates how to implement bootstrap resampling to formulate confidence intervals for Model 3 from the Blair and colleagues study.
# define a function to calculate the statistics you want on each bootstrap sample
boot_fn <- function(data, indices) {
  sample <- data[indices, ]
  fit <- lm(lnyears ~ primlev_m + primlev2 + seclev_m + seclev2 + nsecond_m + priorlev_m + priorlev2 + nprior_m +
              black_m +
              afro_m, data = sample)
  coef(fit) # returns the vector of estimated regression coefficients
}
# use the boot() function to perform the bootstrap resampling
set.seed(123) # for reproducibility
boot_results <- boot(data = sentence, statistic = boot_fn, R = 5000)
# calculate bootstrap confidence intervals using the boot.ci() function.
boot.ci(boot.out = boot_results, type = "bca", index = 11) # regression estimate for afro_m
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 5000 bootstrap replicates
CALL :
boot.ci(boot.out = boot_results, type = "bca", index = 11)
Intervals :
Level BCa
95% ( 0.0127, 0.1716 )
Calculations and Intervals on Original Scale
If you compare the 95% CI for afro_m produced here via bootstrap resamples (.01, .17) to the CI produced using the traditional method — you’ll find that they are nearly identical to one another — indicating that the key assumptions for inference are likely met here.
Wrap up
In this module, we have delved into the critical task of checking and addressing the assumptions underlying linear regression models. Here’s a synthesis of the key takeaways:
Importance of Assumptions: Linear regression relies on several assumptions to provide valid and reliable results. These include the validity of the model, representativeness of the sample, linearity and additivity of relationships, independence and equal variance of errors, and normality of residuals. Considering and evaluating these assumptions is crucial for the accuracy and interpretability of the model.
Diagnostic Tools and Strategies: We utilized various diagnostic plots, such as Fitted Values vs. Residuals Plot, Added Variable Plot, and Residuals vs. Leverage Plot, to visually assess the assumptions. These tools help identify potential issues like non-linearity, heteroscedasticity, and influential cases.
Addressing Violations: When assumptions are violated, several strategies can be employed:
- Transformations: Applying logarithmic, square root, or other transformations to variables can help address non-linearity and heteroscedasticity.
- Model Adjustments: Adding interaction terms or polynomial terms can improve model specification and address violations of linearity and additivity.
- Robust Methods: Using robust standard errors, weighted least squares, or bootstrap resampling can provide reliable inference even when some assumptions are not met.
Practical Application: We applied these concepts to the example study on the influence of Afrocentric features in criminal sentencing. Through diagnostic plots and model adjustments, we ensured that the assumptions were reasonably met, leading to valid conclusions about the predictors’ effects.
In conclusion, checking and addressing the assumptions of linear regression is essential for responsible and accurate analysis. By using diagnostic tools and applying appropriate remedies, we can build robust models that provide reliable insights into the relationships between variables. This Module has equipped you with the knowledge and skills to perform these critical tasks, ensuring the integrity of your regression analyses.
Credits
I drew from the excellent book called Regression and Other Stories by Drs. Andrew Gelman, Jennifer Hill, and Aki Vehtari to create this module, particularly Chapter 11 on assumptions, diagnostics, and model evaluation.