Linear Regression with Categorical Predictors

Module 13


Learning objectives

  • Identify the differences between continuous and categorical variables in regression analysis
  • Explain the purpose and process of dummy-coding categorical variables
  • Construct dummy-coded variables for categorical data with two or more categories
  • Interpret the coefficients of dummy-coded variables in a regression model
  • Incorporate categorical predictors into regression models using R
  • Calculate and interpret within-group and between-group variability in an outcome

Overview

Regression models are powerful tools for analyzing relationships between variables, allowing researchers and analysts to understand how changes in one or more predictors are associated with changes in an outcome variable. Typically, regression models are first introduced to students with continuous predictors only, such as age or income. However, real-world data often include categorical variables — variables that represent distinct groups or categories, such as gender, ethnicity, or treatment condition. Integrating categorical variables into regression models therefore broadens the range of questions these models can address.

Incorporating categorical predictors into regression models allows for the analysis of differences among the levels or groups of a categorical variable. For instance, in a medical study examining the effectiveness of a new drug, the inclusion of a categorical variable for treatment type enables the model to estimate and compare the effects of different treatment arms (e.g., placebo versus drug) on some outcome of interest. This allows a question like “Are symptoms decreased for the group that receives the drug versus the group that receives the placebo?” to be answered.

Categorical variables, unlike continuous variables, do not inherently carry a numerical value and usually represent different groups or classifications. To include categorical variables in regression models, we typically recode these categorical predictors into a format suitable for analysis. The most common recoding method is dummy-coding (a close relative of one-hot encoding, which retains an indicator for every category rather than dropping one), where one category is left out to serve as the reference group and the others are coded as binary indicators to represent each group. This process transforms the categorical variable into a format that can be used in the mathematical equations of regression models. Dummy-coding is the method we will focus on in this Module. While there are other coding schemes you might encounter, such as effect coding, understanding dummy-coding provides a solid foundation that simplifies learning additional methods.

Dummy-coding translates a qualitative variable with k categories into k-1 binary indicators. For instance, a variable like employment status, which has two categories (employed or unemployed), is represented by a single dummy indicator. Similarly, a variable like ABO blood type, which has four categories (A, B, AB, O), is represented by three dummy indicators.

Why do we use k-1 indicators? This approach is used because one category must act as the reference group (also referred to as the “baseline for comparison”) against which all other categories are compared. This reference group is assigned a value of 0 across all dummy indicators. Each of the remaining categories is represented by setting one of the dummy indicators to 1 (indicating presence) and the rest to 0 (indicating absence).
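To make this concrete, here is a minimal sketch of dummy-coding in R, using a small hypothetical vector of blood types (not data we analyze later) with type A serving as the reference group:

library(tidyverse)

# hypothetical blood types; A is the reference group (0 on all three indicators)
blood <- tibble(type = c("A", "B", "AB", "O", "B"))

blood |> 
  mutate(ind_B  = case_when(type == "B" ~ 1, TRUE ~ 0),
         ind_AB = case_when(type == "AB" ~ 1, TRUE ~ 0),
         ind_O  = case_when(type == "O" ~ 1, TRUE ~ 0))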


The group coded 0 on the dummy indicator (or on all dummy indicators, if there are more than two groups) serves as the reference group. This group forms the baseline for comparisons in our regression model. Understanding which group serves as the baseline is crucial for correctly interpreting the coefficients of a regression model, as these coefficients measure the expected change in the outcome variable relative to this reference group.


Introduction to the data

In this Module we will use NHANES data. Rather than working with a pre-constructed data frame, we will pull data using the nhanesA R package. This useful package is designed to help researchers with data retrieval and analysis of National Health and Nutrition Examination Survey (NHANES) data. The nhanes() function from the package is used to pull data frames of interest from the NHANES data repository.

The publicly available NHANES survey data is grouped into five primary data categories:

  • Demographics
  • Dietary
  • Examination
  • Laboratory
  • Questionnaire

This CDC website provides more information about what these categories include, and for which years they are available. By looking at the website, you can identify the name of the data frame that contains the data you desire. For example, we will pull the 2017-2018 demographic variables from the Demographics category and blood pressure data from the Examination category.

By perusing the CDC website of available 2017-2018 NHANES data, I found that the demographic data are stored in a data frame named DEMO_J and the blood pressure data are stored in a data frame called BPX_J.

Let’s begin by loading the packages we will need for this Module.

library(gt)
library(here)
library(skimr)
library(broom)
library(janitor)
library(marginaleffects)
library(tidyverse)

To pull a certain data frame into our R session, we can use the nhanes() function. For example, the code below pulls in the two files mentioned above and stores them as data frames (bmx2018 and demo2018) in our session.

library(nhanesA)

bmx2018  <- nhanes("BPX_J")
demo2018 <- nhanes("DEMO_J")

Let’s take a look at the beginning of each data frame:

bmx2018 |> head()
demo2018 |> head()

Once the data frames of interest are pulled into the R session, we can work with them as we would any other data frame. The codebooks included on the CDC website that correspond with the data frame name provide information about what each variable represents and how it is coded. These aren’t needed for our simple examples here — but if you have an interest in the NHANES data, you can peruse these on your own.

Let’s take a look at an example code chunk to wrangle the data we pulled from the NHANES site, and produce a data frame that is ready to analyze.

nhanes2018 <- 
  demo2018 |> 
  left_join(bmx2018, by = "SEQN") |> 
  janitor::clean_names() |> 
  mutate(SBP = rowMeans(pick(bpxsy1, bpxsy2, bpxsy3), na.rm = TRUE)) |> 
  rename(sex = riagendr) |> 
  rename(ethnic = ridreth1) |> 
  rename(age = ridageyr) |> 
  select(seqn, 
         age, sex, ethnic,
         SBP) 

This code chunk accomplishes quite a lot, all of which should be familiar to you from the prior Modules — can you read through and figure out what’s happening in each step?

Let’s take a glimpse() of the constructed data frame; please make note of the variable types.

nhanes2018 |> glimpse()
Rows: 9,254
Columns: 5
$ seqn   <dbl> 93703, 93704, 93705, 93706, 93707, 93708, 93709, 93710, 93711, …
$ age    <dbl> 2, 2, 66, 18, 13, 66, 75, 0, 56, 18, 67, 54, 71, 61, 22, 45, 13…
$ sex    <fct> Female, Male, Female, Male, Male, Female, Female, Female, Male,…
$ ethnic <fct> Other Race - Including Multi-Racial, Non-Hispanic White, Non-Hi…
$ SBP    <dbl> NaN, NaN, 202.0000, 111.3333, 128.0000, 141.0000, 118.6667, NaN…

We’re going to primarily work with two categorical variables — both of which are already specified as factors:

  • sex represents the biological sex of the respondent (with levels “Male”, “Female”)

  • ethnic represents the race/ethnicity of the respondent (with levels “Mexican American”, “Other Hispanic”, “Non-Hispanic White”, “Non-Hispanic Black”, and “Other Race - Including Multi-Racial”).

Let’s take a look at the number of people (labeled n) in each category of these two variables:

nhanes2018 |> count(sex) 
nhanes2018 |> count(ethnic)

Dummy-code examples

Categorical variable with two categories

To begin our exploration, we will consider whether SBP differs as a function of sex among people age 50 and older. Let’s start simple, and compute the average SBP by sex. In this data frame, the variable called SBP represents the systolic blood pressure reading for each participant.

nhanes2018_sex <-
  nhanes2018 |> 
  filter(age >= 50) |> 
  select(SBP, sex) |> 
  drop_na()

nhanes2018_sex |> 
  group_by(sex) |> 
  skim()
Data summary
Name group_by(nhanes2018_sex, …
Number of rows 2762
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables sex

Variable type: numeric

skim_variable sex n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
SBP Male 0 1 133.30 19.94 72.67 119.33 130.67 144.00 234 ▁▇▅▁▁
SBP Female 0 1 136.27 21.65 88.00 121.33 133.33 149.33 238 ▃▇▃▁▁

The output indicates that the average SBP for females (136.3) is slightly higher than the average SBP for males (133.3). The median (labeled P50) is also slightly higher for females. Studying the distribution of SBP for males and females will help us to further understand these data. Let’s produce a density plot that will allow us to compare the distribution of SBP for males and females.

# set color palette for sex
levels <- c("Male", "Female")
colors <- c("#2F9599", "#F7DB4F") 
my_colors_sex <- setNames(colors, levels)

nhanes2018_sex |> 
  ggplot(mapping = aes(x = SBP, fill = sex)) +
  geom_density(alpha = .75) +
  scale_fill_manual(values = my_colors_sex) + 
  theme_minimal() +
  labs(title = "Does SBP differ by sex?",
       x = "Systolic Blood Pressure (mmHg)",
       y = "Density",
       fill = "Sex")

The distribution of SBP is slightly further to the right for females compared to males — this is in line with the means/medians just examined. To further explore, let’s create a jittered scatter plot to take an additional look at the distribution of SBP as a function of sex.

nhanes2018_sex |> 
  ggplot(mapping = aes(x = sex, y = SBP, color = sex)) + 
  geom_jitter(width = .1, show.legend = FALSE) +
  theme_minimal() +
  scale_color_manual(values = my_colors_sex) + 
  stat_summary(fun = mean, size = 3, geom = "point", colour = "black", show.legend = FALSE) + 
  labs(title = "Does SBP differ as a function of sex?",
       subtitle = "Mean of each group is marked with a black circle",
       x = "",
       y = "Systolic Blood Pressure (mmHg)")

This jittered plot shows the distribution of each individual data point (jittered from left to right in each group so we can see the density of points in each section) — and includes the mean of SBP for each group overlaid with a black circle. We can see that the black circle for females is slightly higher (3.0 mmHg) than for males.

Create a dummy indicator

On our way to using a categorical variable as a predictor in a regression model, our first step is to turn the categorical variable into a set of dummy-coded indicators. Recall that for a categorical variable with k categories, k - 1 dummy-coded indicators are needed. In this first simple example, our categorical variable, sex, has 2 categories; therefore, only one dummy indicator is needed (k - 1 = 2 - 1 = 1).

Let’s create a dummy indicator to represent sex — we’ll call this new variable female, and code female participants as 1, and male participants as 0. As a result, males will serve as the reference group. To create the dummy-coded indicator, the case_when() function inside of the mutate() function is used.

nhanes2018_sex <-
  nhanes2018_sex |> 
  mutate(female = case_when(sex == "Male"   ~ 0,
                            sex == "Female" ~ 1))

nhanes2018_sex |> 
  select(sex, female) |> 
  head(n = 25)

Notice that when sex equals Female, the dummy-coded indicator (female) is equal to 1. Otherwise, when sex equals Male, the dummy-coded indicator (female) is equal to 0.
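As a quick sanity check, we can cross-tabulate sex against the new indicator; every Male should land in the 0 column and every Female in the 1 column. Here is a small sketch using the tabyl() function from janitor (already loaded above):

nhanes2018_sex |> tabyl(sex, female)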

Add a dummy indicator to a regression model

Once the dummy indicator is created, then we can enter it into a linear regression model as follows:

lm_sex <- lm(SBP ~ female, data = nhanes2018_sex)

lm_sex |> 
  tidy(conf.int = TRUE, conf.level = .95) |> 
  select(term, estimate, std.error, conf.low, conf.high)

How should the model results be interpreted?
Let’s focus on the columns labeled term and estimate. The intercept of the regression model is, as usual, the predicted score when \(x_i\) = 0, that is, the predicted SBP for males (recall that female is equal to 0 for males). In other words, the average SBP for males in the sample is 133.3 (notice that this is precisely the mean we computed earlier using skim()).

The slope of the regression model (i.e., the estimate for the female dummy indicator) is, as usual, the predicted change in the outcome (i.e., SBP) for a one-unit increase in the predictor (i.e., female). Therefore, we can interpret this as the expected difference in SBP for females as compared to males. An increase of one unit for the female dummy indicator means contrasting a female (1) to a male (0). This means that the average SBP for females is 3.0 units higher than the average SBP for males.

The equation for the fitted regression model may be written as follows:

\[\hat{y} = b_{0}+(b_{1}\times {female_i})\]

Plugging in our estimates from the fitted regression model yields:

\[\hat{y} = 133.3+(3.0\times {female_i})\]


Further, we can solve for the predicted (i.e., average) SBP for each group by plugging in female = 1 for female participants and female = 0 for male participants.

  • For males: \(\hat{y} = 133.3+(3.0\times 0) = 133.3\)
  • For females: \(\hat{y} = 133.3+(3.0\times 1) = 136.3\)
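We can also ask R to do this arithmetic for us. As a quick check (a sketch using the lm_sex object fitted above), passing the two values of the dummy indicator to predict() reproduces the two group means:

# predicted SBP for males (female = 0) and females (female = 1)
predict(lm_sex, newdata = tibble(female = c(0, 1)))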

Let’s map all of this onto the jitter plot that we created earlier. The dots marking the group means are in the same place, but I’ve removed the individual data points for clarity, and I recolored the mean markers (previously black circles) with the designated color for each sex. Note that the slope of the line, defined as the rise (3.0 mmHg for females compared to males) over the run (1, because the dummy-coded indicator contrasts males at 0 with females at 1), is 3.0.

Notice from the tidy() output that the 95% confidence interval for the slope of the female dummy indicator does not include 0. This suggests that there is likely a meaningful difference in systolic blood pressure (SBP) between the sexes, with females having a somewhat higher SBP than males. If 0 were included in the 95% CI, that would suggest that 0 is a plausible value for the difference in SBP between males and females — and that there may be no reliable difference between the groups.

Women tend to have higher blood pressure than men in later life due to several factors, including hormonal changes. Menopause significantly impacts blood pressure in women. Estrogen, a hormone that helps protect the cardiovascular system, decreases after menopause. This reduction in estrogen can lead to an increase in blood pressure. Click here for more information on this topic.

Add sex as a factor to the regression model

When regressing an outcome on a factor variable like sex, R has the capacity to conveniently manage the encoding process for you. By default, R transforms the factor into dummy indicators, using the first specified level as the reference group. This automatic handling simplifies the modeling process by allowing us to directly interpret the effects of each non-reference category relative to the reference category.

First, let’s check how R has the levels of sex stored.

nhanes2018_sex |> pull(sex) |> levels() 
[1] "Male"   "Female"

Since the first level listed is “Male”, males will serve as the reference group. This is the same reference group we selected in our manual dummy code, so the results will be identical.

Let’s take a look at the fitted model; notice that I replace female with sex in the code below:

lm_sex_f1 <- lm(SBP ~ sex, data = nhanes2018_sex)

lm_sex_f1 |> 
  tidy(conf.int = TRUE, conf.level = .95) |> 
  select(term, estimate, std.error, conf.low, conf.high)

R has essentially created a dummy indicator called sexFemale (though it doesn’t actually get added to our data frame), where males are treated as the reference category (coded as 0) and females are coded as 1.

Can I change the reference group when using a factor?
It is common to want or need to specify a different level as the reference group to align better with specific research questions or prior analyses. To adjust the reference group in R, you can use the relevel() function. Here’s how you can set “Female” as the reference group for sex, and then refit the regression model.

nhanes2018_sex <-
  nhanes2018_sex |> 
  mutate(sex = relevel(sex, ref = "Female"))

lm_sex_f2 <- lm(SBP ~ sex, data = nhanes2018_sex)

lm_sex_f2 |> 
  tidy(conf.int = TRUE, conf.level = .95) |> 
  select(term, estimate, std.error, conf.low, conf.high)

The output of the regression model includes an intercept term and a coefficient for sexMale. The intercept term represents the average SBP for female participants, which is estimated to be 136.3. The coefficient for sexMale is -3.0, indicating that the average SBP for male participants is 3.0 units lower than the average for female participants. Please note that the overall model is the same even though the reference group has been rotated — and we can recover the exact same predicted values for mean SBP for each group:

  • For the males: \(\hat{y} = 136.3 + (-3.0\times 1) = 133.3\)
  • For the females: \(\hat{y} = 136.3 + (-3.0\times 0) = 136.3\)

Categorical variable with more than two categories

Now, let’s consider a categorical variable with more than two categories. Here we’ll consider whether the SBP of adults 50 and older differs as a function of race/ethnicity. There are five race/ethnic categories in the data frame. Let’s use skim() to compute the mean SBP for each race/ethnic group.

nhanes2018_eth <- 
  nhanes2018 |> 
  select(SBP, ethnic, age) |> 
  filter(age > 50) |> 
  drop_na() 

nhanes2018_eth |> 
  group_by(ethnic) |> 
  skim(SBP)
Data summary
Name group_by(nhanes2018_eth, …
Number of rows 2688
Number of columns 3
_______________________
Column type frequency:
numeric 1
________________________
Group variables ethnic

Variable type: numeric

skim_variable ethnic n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
SBP Mexican American 0 1 134.20 18.98 88.00 122.67 131.33 143.50 208.00 ▁▇▅▁▁
SBP Other Hispanic 0 1 133.14 21.66 89.33 116.67 130.00 147.33 207.33 ▃▇▅▂▁
SBP Non-Hispanic White 0 1 133.09 19.95 88.00 118.67 130.00 145.33 238.00 ▃▇▃▁▁
SBP Non-Hispanic Black 0 1 140.43 22.61 92.67 124.50 137.33 152.75 224.67 ▂▇▅▁▁
SBP Other Race - Including Multi-Racial 0 1 133.88 20.33 72.67 120.67 131.33 144.83 215.33 ▁▇▇▂▁

We can see that the means for all groups except Non-Hispanic Black participants are around 133-134. Non-Hispanic Black participants have a mean SBP that is about 7 points higher than the other groups.

Next, let’s construct a density plot of SBP, one for each group.

# set color palette for ethnic
levels <- c("Mexican American", "Other Hispanic", "Non-Hispanic White", "Non-Hispanic Black", "Other Race - Including Multi-Racial")
colors <- c("#FFB600", "#44A9CC", "#EB563A", "#F4B998", "#50BCB9") 
my_colors_eth <- setNames(colors, levels)

nhanes2018_eth |> 
  ggplot(mapping = aes(x = SBP, fill = ethnic)) +
  geom_density(alpha = .5) +
  scale_fill_manual(values = my_colors_eth) +  
  theme_minimal() +
  theme(legend.position = "right") +
  labs(title = "Does systolic blood pressure differ as a function of race/ethnicity?",
       x = "Systolic Blood Pressure (mmHg)",
       y = "Density",
       fill = "Race/Ethnic Group") 

The same pattern is observed in the density plots: the density for non-Hispanic Black individuals is shifted slightly to the right of the other groups.

Let’s create one last plot, a jittered point plot of SBP, with race/ethnicity as a grouping variable. As we did with the sex example, we’ll overlay the mean of each group with a black dot.

nhanes2018_eth |> 
  ggplot(mapping = aes(x = ethnic, y = SBP, color = ethnic)) +
  geom_jitter(width = .1, show.legend = FALSE, alpha = .4) +  # Random jitter for data points
  stat_summary(fun = mean, size = 3, geom = "point", colour = "black", show.legend = FALSE) + 
  scale_color_manual(values = my_colors_eth) + 
  theme_minimal() +
  labs(title = "Does systolic blood pressure differ as a function of race/ethnicity?",
       subtitle = "Mean of each group is depicted with a black circle",
       x = "",
       y = "Systolic Blood Pressure (mmHg)") |> 
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) 

Create the dummy indicators

Now that we’ve visualized the data, we’re ready to fit a model. Before fitting the linear regression model, we need to create a set of dummy-coded indicators to represent race/ethnic group. We’ll make Mexican Americans the reference group for race/ethnicity, and create 4 dummy-coded variables to denote the other four race/ethnic groups.

nhanes2018_eth <-
  nhanes2018_eth |> 
  mutate(other.hisp = case_when(ethnic == "Other Hispanic" ~ 1, TRUE ~ 0),
         nh.white = case_when(ethnic == "Non-Hispanic White" ~ 1, TRUE ~ 0),
         nh.black = case_when(ethnic == "Non-Hispanic Black" ~ 1, TRUE ~ 0),
         other.race = case_when(ethnic == "Other Race - Including Multi-Racial" ~ 1, TRUE ~ 0))

It’s always smart to double check that your code works as intended. Here’s a summary of how each group is coded for this set of dummy-codes:

nhanes2018_eth |> 
  select(ethnic, other.hisp, nh.white, nh.black, other.race) |> 
  distinct() |> 
  arrange(ethnic)

Everything looks good here — notice that for the selected reference group (Mexican Americans) all dummy indicators are coded as 0. And, for the other dummy indicators, a score of 1 is coded for the corresponding group represented by the dummy indicator, and a 0 is recorded otherwise.

Add the dummy indicators to a regression model

Now, we can fit the linear regression model with the dummy-coded indicators. The following equation represents the model we will fit; notice that four dummy-coded indicators are needed to predict the outcome.

\[ \hat{y} = b_{0} + (b_{1} \times \text{other.hisp}) + (b_{2} \times \text{nh.white}) + (b_{3} \times \text{nh.black}) + (b_{4} \times \text{other.race}) \]

lm_eth <- lm(SBP ~ other.hisp + nh.white + nh.black + other.race, data = nhanes2018_eth)
lm_eth |> 
  tidy(conf.int = TRUE, conf.level = .95) |> 
  select(term, estimate, std.error, conf.low, conf.high)

Let’s focus on the columns labeled term and estimate. The intercept of the regression model is the predicted score when all predictor variables equal 0. In this example, that is the predicted (i.e., average) SBP in the reference group. Mexican American older adults serve as the reference group in this model. Note that the intercept estimate is precisely the mean for Mexican Americans that we computed earlier with skim().

The four slopes compare each of the other groups to Mexican Americans.

  • The estimate for other.hisp indicates that the mean SBP for other Hispanic individuals is 1.1 mmHg lower than the mean for Mexican Americans.

  • The estimate for nh.white indicates that the mean SBP for non-Hispanic White individuals is 1.1 mmHg lower than the mean for Mexican Americans.

  • The estimate for nh.black indicates that the mean SBP for non-Hispanic Black individuals is 6.2 mmHg higher than the mean for Mexican Americans.

  • The estimate for other.race indicates that the mean SBP for individuals who identify as some other race/ethnic group or multi-racial is 0.3 mmHg lower than the mean for Mexican Americans.

The equation from the fitted regression model is written as follows:

\[\hat{y} = 134.2 + (-1.1 \times \text{other.hisp}) + (-1.1 \times \text{nh.white}) + (6.2 \times \text{nh.black}) + (-0.3 \times \text{other.race})\]

We can recover the mean for each group by using this equation:

For Mexican Americans: \(\hat{y} = 134.2 + (-1.1 \times 0) + (-1.1 \times 0) + (6.2 \times 0) + (-0.3 \times 0) = 134.2\)

For other Hispanic: \(\hat{y} = 134.2 + (-1.1 \times 1) + (-1.1 \times 0) + (6.2 \times 0) + (-0.3 \times 0) = 133.1\)

For non-Hispanic White: \(\hat{y} = 134.2 + (-1.1 \times 0) + (-1.1 \times 1) + (6.2 \times 0) + (-0.3 \times 0) = 133.1\)

For non-Hispanic Black: \(\hat{y} = 134.2 + (-1.1 \times 0) + (-1.1 \times 0) + (6.2 \times 1) + (-0.3 \times 0) = 140.4\)

For other groups: \(\hat{y} = 134.2 + (-1.1 \times 0) + (-1.1 \times 0) + (6.2 \times 0) + (-0.3 \times 1) = 133.9\)

Notice that these computed means are identical to the means from skim() that we estimated earlier.
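As with the sex example, we can have R do this arithmetic. Here is a sketch that builds one row per group (the reference group has 0 on every indicator) and passes it to predict() with the fitted lm_eth object:

# one row per race/ethnic group, in the order: Mexican American (reference),
# Other Hispanic, Non-Hispanic White, Non-Hispanic Black, Other Race
grid <- tibble(other.hisp = c(0, 1, 0, 0, 0),
               nh.white   = c(0, 0, 1, 0, 0),
               nh.black   = c(0, 0, 0, 1, 0),
               other.race = c(0, 0, 0, 0, 1))

predict(lm_eth, newdata = grid)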

Let’s map all of this onto our earlier plot. In the enhanced plot below, you can see how the intercept and slopes of the regression model line up with the graph. The intercept is just the mean for the reference group (Mexican American adults), and the slope for the Other Hispanic dummy-code is the difference between Other Hispanic adults and Mexican American adults (Other Hispanic adults have an average SBP that is 1.1 mmHg lower than the average for Mexican American adults). Similarly, the other slopes represent the difference between each respective group and Mexican Americans. That is, with each comparison, we compare back to the reference group — which is Mexican American participants in our setup. Of course, we could choose a different reference group and create the corresponding dummy-coded indicators if we wish.

Take a moment now to examine the 95% CIs for the four slopes (as displayed in the tidy() output). Notice that only one of the CIs, the CI for comparing non-Hispanic Black participants to Mexican American participants, doesn’t include 0. This indicates that, of the four comparisons made in this model, the only reliable difference in SBP is between non-Hispanic Black individuals and Mexican American individuals — with the former having substantially higher SBP than the latter.

High blood pressure (BP) is the leading preventable risk factor for mortality in the United States, with Black older adults experiencing substantially higher rates of hypertension compared to White counterparts, leading to about four times the hypertension-related cardiovascular disease (CVD) mortality rate. Social determinants of health (SDOH), including socioeconomic and environmental conditions such as housing, neighborhood poverty, food security, access to healthcare, segregation, structural racism, and chronic stress, play a crucial role in these disparities. Click here for more information on this topic.

Add ethnicity as a factor to the regression model

Just as we did for the sex example, we can have R create the dummy-codes for us. Because “Mexican American” is the first level of the factor, the results are identical to our manual model:

nhanes2018_eth |> pull(ethnic) |> levels()
[1] "Mexican American"                    "Other Hispanic"                     
[3] "Non-Hispanic White"                  "Non-Hispanic Black"                 
[5] "Other Race - Including Multi-Racial"
lm_eth <- lm(SBP ~ ethnic, data = nhanes2018_eth)

lm_eth |> 
  tidy(conf.int = TRUE, conf.level = .95) |> 
  select(term, estimate, std.error, conf.low, conf.high)

Of course, the relevel() function can be used to choose a different reference group.

Obtain model predicted means

Once the model is fitted, we can use the predictions() function from the marginaleffects package to compute the predicted means and 95% CIs for each group. This package is worth mastering because it offers many useful functions for model interpretation — particularly as you move to more advanced models. We’ll use the marginaleffects package in the next few Modules; here we start with a very simple application to compute the mean and 95% CI of SBP for each group based on our model. Here’s an excerpt from the marginaleffects documentation about its key purpose:

The parameters of a statistical model can sometimes be difficult to interpret substantively, especially when that model includes non-linear components, interactions, or transformations. Analysts who fit such complex models often seek to transform raw parameter estimates into quantities that are easier for domain experts and stakeholders to understand, such as predictions, contrasts, risk differences, ratios, odds, lift, slopes, and so on. Unfortunately, computing these quantities—along with associated standard errors—can be a tedious and error-prone task. This problem is compounded by the fact that modeling packages in R and Python produce objects with varied structures, which hold different information. This means that end-users often have to write customized code to interpret the estimates obtained by fitting Linear, GLM, GAM, Bayesian, Mixed Effects, and other model types. This can lead to wasted effort, confusion, and mistakes, and it can hinder the implementation of best practices.

Let’s take a look at the code needed to accomplish the task at hand, with a more detailed explanation following the code:

unadjusted_means <- 
  lm_eth |> 
  predictions(
    df = insight::get_df(lm_eth),
    conf.level = .95,
    newdata = datagrid(
    ethnic = unique
    )) |> 
  select(ethnic, estimate, std.error, conf.low, conf.high) |> 
  mutate(type = "unadjusted") |> 
  as_tibble()

unadjusted_means

This code chunk computes the predicted means and their confidence intervals (a 95% CI as specified by conf.level = .95) for each level of the ethnic factor using the predictions() function.

  • It pulls out the appropriate degrees of freedom (df) needed for the CI by using the argument insight::get_df(lm_eth), which utilizes a handy function called get_df() from a package called insight. The correct degrees of freedom for this instance is n - 1 - p, where p is the number of predictors. For a categorical variable, this refers to the number of dummy codes necessary to represent the categorical variable. Thus, the degrees of freedom is 2688 - 1 - 4 = 2683.

  • The datagrid() function (also from marginaleffects) is used to create a data frame representing all unique combinations of the factor levels in the variable ethnic. This useful function generates a data grid of user-specified values for use in the newdata argument. In particular, it generates a data grid of each unique race/ethnic group.

  • The predictions are then extracted and formatted, creating a predicted score for SBP for each race/ethnic group in the created data grid.

  • I also added a variable called type, set to “unadjusted”, to denote that these are unadjusted means (we’ll calculate something called adjusted means in the next section).

  • The function as_tibble() is used to define the output as a tibble — which is just a tidy data frame.

Obtain adjusted predicted means

To further enhance our understanding, let’s now refine our existing model by including participant age as a covariate. By doing this, we adjust for the effects of age, allowing us to examine the differences in SBP across various race and ethnic groups while holding age constant. This adjustment is important because age is a significant determinant of blood pressure, potentially confounding the relationships we are investigating. Holding age constant ensures that the observed differences in SBP are more likely attributable to racial and ethnic variations rather than age disparities. Such adjustments enable a more accurate and nuanced interpretation of the data, providing insights into whether certain groups may face higher risks of hypertension independent of age effects.

First, since age in this subsample of adults doesn’t have a meaningful 0 point, we will center age at the mean. Then we will refit our ethnicity model, adding centered age as a control variable.

# what is the mean age?
nhanes2018_eth |> select(age) |> summarize(mean(age))
# add centered age
nhanes2018_eth <- 
  nhanes2018_eth  |>  
  mutate(age.c = age - mean(age))

lm_eth_adj <- lm(SBP ~ ethnic + age.c, data = nhanes2018_eth) 

lm_eth_adj |> 
  tidy(conf.int = TRUE, conf.level = .95) |> 
  select(term, estimate, conf.low, conf.high) 

In this adjusted model, age is controlled by including it as a covariate, centered around the mean age. This adjustment allows for a more accurate estimation of the impact of ethnicity on SBP independent of age effects.

How shall we interpret these effects?

  • Intercept (135.7 mmHg): Now reflects the estimated mean SBP for Mexican Americans at the average age in the subsample.

  • Other Hispanic: The estimate for this slope (-1.2 mmHg), slightly larger in magnitude than in the unadjusted model, compares Other Hispanic adults to Mexican American adults, holding age constant. In other words, if we compared two groups of people of the same age (e.g., 60), where the people in one group identified as Mexican American and the people in the other group identified as Other Hispanic, we would expect the Other Hispanic adults to have an SBP that is 1.2 mmHg lower. However, note that the 95% CI still contains 0 — suggesting that 0 (i.e., no difference) is a plausible value for the difference in SBP between the two groups.

  • Non-Hispanic White: There is a notable change here from our unadjusted model for this slope; the non-Hispanic White group now shows a larger difference as compared to Mexican Americans (-4.2 mmHg), with a confidence interval that does not include zero (-6.8 to -1.6 mmHg).

  • Non-Hispanic Black: This slope comparing non-Hispanic Black adults to Mexican American adults continues to show a higher SBP (5.2 mmHg) after adjusting for age, though the difference is slightly smaller than in the unadjusted model. The CI continues to exclude 0 (2.5 to 8.0 mmHg).

  • Other Race - Including Multi-Racial: Similar to the unadjusted model, this slope shows a very small difference (-0.5 mmHg) with a confidence interval (-3.5 to 2.4 mmHg) that includes zero.

  • Age: Holding constant race/ethnic group (i.e., comparing two individuals who are from the same race/ethnic group), each one-unit increase in age is associated with an increase in SBP of about 0.5 mmHg per year, with a confidence interval (0.46 to 0.64 mmHg) that is well away from zero. For example, if we compared two people from the same race/ethnic group — but one was 60 and one was 70 — we’d expect the older person to have an SBP that was ~5 mmHg higher (\(0.5 \times 10 = 5\)).

Adding age as a control variable modifies the estimated effects of ethnicity on SBP, particularly where age plays a significant confounding role. For non-Hispanic Whites, controlling for age reveals a lower mean SBP compared to Mexican Americans, while for non-Hispanic Blacks, the difference in SBP is slightly attenuated. These results underscore the importance of considering age in analyses of SBP across different racial and ethnic groups.

Recall that earlier in the semester, you used a weighting approach to compute adjusted means by hand based on a standard population. This current approach of directly controlling for age via a statistical model provides an alternative method to obtain adjusted means.

We can use the predictions() function again with the new age-adjusted model to obtain the predicted mean SBP for each group, holding age constant at the mean in the subsample of adults (~65 years). Note that grid_type = "mean_or_mode" specifies that any predictor not mentioned should be held at the mean (for continuous variables) or the mode (for categorical variables) when computing the predicted scores.

adjusted_means <- 
  lm_eth_adj |> 
  predictions(
    df = insight::get_df(lm_eth_adj), 
    newdata = datagrid(
      ethnic = unique, 
      grid_type = "mean_or_mode")) |> 
  select(ethnic, estimate, std.error, conf.low, conf.high) |> 
  mutate(type = "adjusted") |> 
  as_tibble()

adjusted_means

If instead, you desired to hold age constant at a particular value (for example, at the mean, 10 years below the mean, and 10 years above the mean) — you could specify that using the code chunk below. Notice that we now get three means per race/ethnic group, one when age.c is 10 units below the mean (e.g., ~55), one when age.c is at the mean (e.g., ~65), and one when age.c is 10 units above the mean (e.g., ~75).

lm_eth_adj |> 
  predictions(
    df = insight::get_df(lm_eth_adj), 
    newdata = datagrid(ethnic = unique, 
                       age.c = c(-10,0,10))) |> 
  select(ethnic, age.c, estimate, std.error, conf.low, conf.high) |> 
  mutate(type = "adjusted") |> 
  as_tibble()

Now, returning to the adjusted means (holding age constant at the mean in the sample), let’s merge the unadjusted means that we computed earlier with the adjusted means from our model with age as a control variable.

combine <- bind_rows(unadjusted_means, adjusted_means) 
combine

With some formatting¹, we can create a nice table that presents the unadjusted SBP for each race/ethnic group, alongside the adjusted SBP for each group, holding age constant at the mean in the sample.

# pivot data frame to wide for side by side table
combine_wide <- 
  combine |> 
  pivot_wider(names_from = type, values_from = c(estimate, std.error, conf.low, conf.high), names_glue = "{.value}_{type}")

# use the gt package to create a nice-looking table of the results
combine_wide |> 
  gt() |> 
  tab_header(
    title = "Comparison of unadjusted and age-adjusted estimates of systolic blood pressure among adults 50 and older by race/ethnicity"
  ) |> 
  cols_label(
    ethnic = "Ethnicity",
    estimate_unadjusted = "Mean",
    std.error_unadjusted = "SE",
    conf.low_unadjusted = "95% CI",
    estimate_adjusted = "Mean",
    std.error_adjusted = "SE",
    conf.low_adjusted = "95% CI"
  ) |> 
  fmt_number(
    columns = everything(),
    decimals = 2
  ) |> 
  cols_merge(
    columns = c(conf.low_unadjusted, conf.high_unadjusted),
    pattern = "({1}, {2})"
  ) |> 
  cols_merge(
    columns = c(conf.low_adjusted, conf.high_adjusted),
    pattern = "({1}, {2})"
  ) |> 
  cols_hide(c(conf.high_unadjusted, conf.high_adjusted)) |> 
  tab_spanner(
    label = "Unadjusted",
    columns = c(estimate_unadjusted, std.error_unadjusted, conf.low_unadjusted)
  ) |> 
  tab_spanner(
    label = "Age-Adjusted (held constant at 65.5 years)",
    columns = c(estimate_adjusted, std.error_adjusted, conf.low_adjusted)
  ) |> 
  cols_align(
    align = "center",
    columns = everything()
  ) |> 
  tab_footnote(
    footnote = "SE denotes Standard Error.",
    locations = cells_column_labels(columns = c(std.error_unadjusted, std.error_adjusted))
  ) |> 
  tab_footnote(
    footnote = "CI denotes Confidence Interval.",
    locations = cells_column_labels(columns = c(conf.low_unadjusted, conf.low_adjusted))
  ) 
Comparison of unadjusted and age-adjusted estimates of systolic blood pressure among adults 50 and older by race/ethnicity

                                      Unadjusted                       Age-Adjusted (held constant at 65.5 years)
Ethnicity                             Mean    SE¹   95% CI²            Mean    SE¹   95% CI²
Non-Hispanic Black                    140.43  0.82  (138.83, 142.03)   140.90  0.79  (139.34, 142.46)
Other Race - Including Multi-Racial   133.88  0.97  (131.97, 135.78)   135.14  0.95  (133.28, 137.01)
Non-Hispanic White                    133.09  0.65  (131.82, 134.36)   131.48  0.64  (130.21, 132.74)
Mexican American                      134.20  1.20  (131.85, 136.54)   135.67  1.17  (133.37, 137.97)
Other Hispanic                        133.14  1.29  (130.60, 135.68)   134.42  1.26  (131.94, 136.90)

¹ SE denotes Standard Error.
² CI denotes Confidence Interval.

Alternatively, we could present the same results in a plot.

combine |> 
  mutate(type = factor(type, levels = c("unadjusted", "adjusted"))) |> 
  ggplot(mapping = aes(x = ethnic, y = estimate, ymin = conf.low, ymax = conf.high, color = ethnic)) +
  coord_flip() +
  geom_pointrange() +
  geom_errorbar(width = 0.2) +
  facet_grid(~ type, 
             labeller = labeller(type = c(unadjusted = "Unadjusted Estimates", adjusted = "Age-Adjusted Estimates"))) +
  labs(
    title = "Comparison of unadjusted and age-adjusted estimates of systolic blood pressure",
    subtitle = "Among adults 50 and older by race/ethnicity, age is held constant at the mean for adjusted estimates",
    caption = "Whiskers show 95% CI",
    x = "",
    y = "Systolic Blood Pressure (mmHg)",
    color = "Estimate Type"
  ) +
  scale_color_manual(values = my_colors_eth) +  
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 12),
    legend.position = "none"
  ) 

Both the table and the graph depict the differences in unadjusted and age-adjusted SBP across race/ethnic groups, providing an intuitive way to understand race/ethnic disparities and how SBP differences are affected when accounting for age.

Digging into within group and between group variability

To finish up this Module on the inclusion of categorical variables in regression models, we will consider the salience of within and between group variability in some continuous variable of interest (e.g., SBP) as a function of an important grouping variable (e.g., sex or race/ethnicity).

Imagine that you are a scientist interested in physical activity among older adults. You and your research team are preparing to study an intervention to increase physical activity among older adults living in urban affordable housing communities. For your study, you want to collect objective measures of physical activity rather than relying on self-reported physical activity. There are many wearable devices on the market that measure physical activity. You are considering two — we’ll call these Device A and Device B.

  • Device A is considered the gold standard for measuring physical activity — but is very expensive.
  • Device B is new on the market and hasn’t been well studied; it is much less expensive and would be a cost savings to the project if it worked as well as Device A.

You seek to measure the accuracy of these devices in capturing minutes spent in moderate physical activity (MPA). You devise the following study:

All participants will spend 60 minutes engaged in easy to brisk walking on a treadmill under the supervision of research assistants, but the amount of time in moderate physical activity will differ. This will be controlled by the treadmill program — with the speed and the grade varying to push the individual’s heart rate into a moderate physical activity range for the designated minutes.

  1. In condition 1, participants will accumulate 15 minutes of MPA over the 60-minute session
  2. In condition 2, participants will accumulate 30 minutes of MPA over the 60-minute session
  3. In condition 3, participants will accumulate 45 minutes of MPA over the 60-minute session

You recruit 45 people from your target demographic to participate in the measurement study, and randomly assign 15 people to each condition. All people wear both devices simultaneously and complete their designated 60-minute activity bout. The resulting data are stored in the file called activity.Rds. The following variables are included:

Variable    Description
pid         Unique ID for participant
condition   The condition randomly assigned — 15 minutes, 30 minutes, or 45 minutes of MPA
DevA_MPA    The recorded minutes of MPA from Device A
DevB_MPA    The recorded minutes of MPA from Device B

Let’s read in the data frame.

activity <- read_rds(here("data", "activity.Rds")) 
activity 

Here’s a glimpse of the data frame.

activity |> glimpse()
Rows: 45
Columns: 4
$ pid       <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ condition <chr> "15 minutes", "15 minutes", "15 minutes", "15 minutes", "15 …
$ DevA_MPA  <dbl> 15.7, 12.0, 14.7, 14.7, 16.0, 14.1, 14.0, 14.6, 15.4, 15.3, …
$ DevB_MPA  <dbl> 18.0, 27.4, 25.6, 6.7, 17.5, 16.8, 6.9, 19.1, 25.1, 20.1, 10…

Let’s also produce descriptive statistics for DevA_MPA and DevB_MPA by condition.

activity |> 
  select(-pid) |> 
  group_by(condition) |> 
  skim()
Data summary
Name group_by(select(activity,…
Number of rows 45
Number of columns 3
_______________________
Column type frequency:
numeric 2
________________________
Group variables condition

Variable type: numeric

skim_variable condition n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
DevA_MPA 15 minutes 0 1 14.68 1.31 11.6 14.35 15.3 15.55 16.0 ▂▁▂▃▇
DevA_MPA 30 minutes 0 1 29.81 0.72 28.3 29.50 30.0 30.35 30.5 ▂▂▁▅▇
DevA_MPA 45 minutes 0 1 44.96 1.04 43.3 44.20 45.0 45.70 46.9 ▃▇▅▅▃
DevB_MPA 15 minutes 0 1 16.39 7.97 1.4 9.80 17.5 22.60 27.4 ▂▇▃▇▇
DevB_MPA 30 minutes 0 1 31.52 8.12 16.0 25.85 32.0 37.05 45.5 ▃▆▇▇▃
DevB_MPA 45 minutes 0 1 46.68 8.00 30.2 41.00 47.9 51.45 61.4 ▂▅▇▆▃

Notice that the mean minutes of MPA measured by each device are quite close to the values dictated by condition. For example, for Device A the mean measured MPA (DevA_MPA) is 14.7 minutes for the 15-minute condition, 29.8 minutes for the 30-minute condition, and 45.0 minutes for the 45-minute condition. And, a similar pattern with regard to the means is observed for Device B.
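To put a rough number on that impression, here is a sketch that computes each device’s average deviation from the target minutes in every condition (it assumes the condition labels can be parsed to their target values with readr’s parse_number()):

# average measured MPA minus the target minutes for each condition
activity |> 
  mutate(target = parse_number(condition)) |>   # "15 minutes" -> 15, etc.
  group_by(condition) |> 
  summarize(bias_A = mean(DevA_MPA) - first(target),
            bias_B = mean(DevB_MPA) - first(target))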

Let’s visualize the data with a jittered point plot by condition (along the x axis) and facet by measurement device. Notice that in this section I use pivot_longer() to pivot the data from a wide to a long data frame to aid in creating the plot. I also create a factor for the newly created device variable.

activity_long <- activity |> 
  pivot_longer(cols = c("DevA_MPA", "DevB_MPA"), names_to = "device",  values_to = "MPA") |> 
  mutate(device = factor(device, levels = c("DevA_MPA", "DevB_MPA"), labels = c("Device A", "Device B")))

activity_long |> head(n = 20)
activity_long |> 
  ggplot(mapping = aes(x = condition, y = MPA, color = condition)) + 
  geom_jitter(width = .1, size = 3, show.legend = FALSE, alpha = .35) +  # Increase data point size
  geom_hline(yintercept = 45, color = "#00A5E3", linetype = "dashed") +
  geom_hline(yintercept = 30, color = "#8DD7BF", linetype = "dashed") +
  geom_hline(yintercept = 15, color = "#FF96C5", linetype = "dashed") +
  theme_minimal() +
  stat_summary(fun = mean, size = 5, shape = 4, geom = "point", colour = "black", stroke = 1.5, show.legend = FALSE) + 
  labs(title = "Variability in MPA by experimental condition and measurement device",
       subtitle = "Mean within each condition is marked with an X, dashed colored lines correspond to condition specifications",
       x = "Condition",
       y = "Minutes of Moderate Physical Activity") +
  facet_wrap(~device) +
  theme(
    plot.title = element_text(size = 20),  # Increase title size
    plot.subtitle = element_text(size = 15),  # Increase subtitle size
    axis.title.x = element_text(size = 14),  # Increase x-axis title size
    axis.title.y = element_text(size = 14),  # Increase y-axis title size
    axis.text.x = element_text(size = 12),  # Increase x-axis text size
    axis.text.y = element_text(size = 12),  # Increase y-axis text size
    strip.text = element_text(size = 14)  # Increase facet text size
  )

This plot represents the data collected in the study. On the left, we see the results for Device A. On the right, we see the results for Device B. Along the x-axis are the experimental conditions (condition), and on the y-axis is the outcome — the minutes of physical activity that the device measured (MPA).

I added three dashed horizontal lines to this plot to indicate the minutes of physical activity that the device should have picked up given the experimental condition. For example, the blue line represents 45 minutes of moderate physical activity; this is the number of minutes that the device should have recorded in condition 3 (the 45-minute condition) if the device were accurate. Notice that the sample means from the two devices (the X marks) are quite similar and are close to what was dictated by the experimental condition (see dashed lines). Device B seems to consistently overestimate minutes in MPA (the X mark is slightly above what it should be as noted by the dashed line), while Device A produces a mean in each condition that is very close to the minutes dictated by the experimental condition (the X is right on the dashed line).

Within each condition, the variability of the individual data points around the group mean differs substantially between the two devices. The individual measures garnered from Device A are tightly clustered around the mean, while the individual measures garnered from Device B are much more variable. This means that Device A captures close to the correct minutes for every participant, while Device B is not very consistent.

For now, let’s focus only on the results from Device A.

We can partition the total variability in MPA into two components — within group variation and between group variation, where the groups are conditions in our example (i.e., 15 minutes, 30 minutes, or 45 minutes of MPA).

Within group variability

The within group variation captures the extent to which each individual differs from their group’s mean.

Consider the 15-minute condition. There are 15 people in this condition. To quantify the within group variation, we can compute the difference between each person’s MPA score and the average MPA score for their group, square these differences, and then sum these squared differences across all people in the group.
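In symbols, letting \(y_{ig}\) denote the MPA score for person \(i\) in group \(g\), \(\bar{y}_{g}\) the mean for group \(g\), and \(n_g\) the number of people in group \(g\):

\[SS_{\text{within}} = \sum_{g=1}^{G}\sum_{i=1}^{n_g}(y_{ig} - \bar{y}_{g})^2\]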

ex15min <- activity |> 
  filter(condition == "15 minutes") |> 
  select(pid, DevA_MPA) |> 
  mutate(grp_mean_DevA = mean(DevA_MPA),
         wg_diff_DevA = DevA_MPA - grp_mean_DevA,
         sq_wg_diff_DevA = wg_diff_DevA^2) 

ex15min
ex15min |> summarize(SSwithin_grp = sum(sq_wg_diff_DevA)) 

We can do the same thing for the 30-minute condition and the 45-minute condition. Rather than writing the code three times, we can use the group_by() function to perform the calculations by condition. Doing so yields three sums of squares, one for each group. Once the three sums of squares are calculated, we add them together (in our example, that means summing three values — one for each condition).

activity |> 
  select(pid, condition, DevA_MPA) |> 
  group_by(condition) |> 
  mutate(grp_mean_DevA = mean(DevA_MPA),
         wg_diff_DevA = DevA_MPA - grp_mean_DevA,
         sq_wg_diff_DevA = wg_diff_DevA^2) |> 
  summarize(SSwithin_grp = sum(sq_wg_diff_DevA)) |> 
  ungroup() |> 
  summarize(SSwithin = sum(SSwithin_grp))

This value (46.7) represents a quantity officially called the Within Groups Sum of Squares. When analyzing variance across groups in an experiment (where people are randomly assigned to groups), this part of the variance is referred to as the error, as it is variation in the scores that is not due to the experimental condition or treatment. That is, to the extent that people differ within a group, that variability is clearly not due to experimental condition, as everyone in the group received the same manipulation/treatment. For example, within each of our groups, all people accumulated the same minutes of MPA. This quantity is also called the Sum of Squares Error (SSE) or, sometimes, the Sum of Squares Residual.

Between group variability

The between groups variation captures the extent to which groups differ from one another. In this case, that is, the degree to which our three experimental conditions differ. To obtain the between groups variation, we compute the difference between each group’s average MPA score and the overall average MPA (i.e., the average score for all 45 people in the experiment), square these differences, and then sum these squared differences across all groups.
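In symbols, with \(\bar{y}\) denoting the overall mean across all 45 participants:

\[SS_{\text{between}} = \sum_{g=1}^{G} n_g\,(\bar{y}_{g} - \bar{y})^2\]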

For example, the average minutes of MPA across the three conditions is 29.8 minutes.

activity |> 
  select(DevA_MPA) |> 
  summarize(mean = mean(DevA_MPA))

And, the average MPA for the 15 minute condition is 14.7 minutes.

activity |> 
  filter(condition == "15 minutes") |> 
  select(DevA_MPA) |> 
  summarize(mean = mean(DevA_MPA))

We compute the difference between the 15-minute group’s average MPA score and the overall average MPA as 14.7 minutes - 29.8 minutes — which equals about -15.1. Then, squaring this difference (i.e., \((-15.1)^2\)) gives ~229. Since there are 15 people in the group, we sum this value across the 15 people — which equals \(229 \times 15 \approx 3436\) (3436 is the value obtained without rounding). We then do this same work for the 30-minute group and the 45-minute group. Finally, we sum these three Sums of Squares across the three groups. This value (6876.6) represents the Between Groups Sum of Squares. The code chunk below does all of this work.

activity |> 
  select(pid, condition, DevA_MPA) |> 
  mutate(total_mean_DevA = mean(DevA_MPA)) |> 
  group_by(condition) |> 
  mutate(grp_mean_DevA = mean(DevA_MPA),
         bg_diff_DevA = grp_mean_DevA - total_mean_DevA,
         sq_bg_diff_DevA = bg_diff_DevA^2) |> 
  summarize(SSbetween_grp = sum(sq_bg_diff_DevA)) |> 
  ungroup() |> 
  summarize(SSbetween = sum(SSbetween_grp))

When analyzing variance across groups in an experiment (where people are randomly assigned to groups), this part of the variability is due to experimental condition. That is, to the extent that group means differ from one another, then that variability is due to the experimental condition or treatment. This quantity is also called the Model Sum of Squares (SSM) or Sum of Squares Regression (SSR).
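These two components partition the total variability in the outcome exactly:

\[\sum_{g=1}^{G}\sum_{i=1}^{n_g}(y_{ig} - \bar{y})^2 = SS_{\text{between}} + SS_{\text{within}}\]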

Contrasting within group and between groups variability

By contrasting the within groups and between groups variability we gain a sense of which component encompasses more of the variability. For Device A, the Between Groups Sum of Squares is very large (~6877) and the Within Groups Sum of Squares is comparatively much smaller (~47). This tells us that most of the variability in MPA scores for Device A is due to the different conditions people were assigned to (i.e., 15 minutes, 30 minutes, or 45 minutes), with very little variability within condition. As we saw in the jittered point plot of the data, the mean MPA is extremely close to the designated minutes of MPA for the condition (corresponding dashed line), denoting a high degree of accuracy for Device A. Furthermore, there is little person-to-person variability within condition, denoting a high degree of precision.

We don’t need to compute the Within Groups and Between Groups Sums of Squares on our own as I have in this section. Instead, applying the anova() function to a fitted model will do this work for us. Note that the sumsq value for condition is 6877 — matching our calculation of the Between Groups Sum of Squares, and the value for Residuals is 47 — matching our calculation of the Within Groups Sum of Squares.

lm_A <- lm(DevA_MPA ~ condition, data = activity)

aovA <- 
  anova(lm_A) |> 
  tidy() |> 
  select(term, sumsq)

aovA

Now, let’s assess these same values for Device B. Recall from our plot of the data, that Device B had much greater within group variability.

lm_B <- lm(DevB_MPA ~ condition, data = activity)

aovB <- 
  anova(lm_B) |> 
  tidy() |> 
  select(term, sumsq)

aovB

The sumsq value for condition is 6883; this is the Between Groups Sum of Squares for Device B. It is quite close to the Between Groups Sum of Squares for Device A. The sumsq value for Residuals is 2708; this is the Within Groups Sum of Squares for Device B. It is much larger than the same value for Device A.

Thus, the ratio of Between Groups variability to Within Groups variability for Device A is very large, while the ratio of Between Groups variability to Within Groups variability for Device B is substantially smaller. I augmented the graph below with the Between Groups variability divided by the Total Variability to present a measure of the total variability attributed to condition differences. This is the same as \(R^2\) in a regression context. For our example, this indicates that for Device A about 99% of the variability in MPA is due to condition, while for Device B about 72% of the variability in MPA is due to condition. Which device would you choose for your study?
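Here is a sketch of that calculation using the aovA and aovB tibbles created above; the ratio of the condition sum of squares to the total sum of squares is the \(R^2\) for each device:

# proportion of total variability in MPA attributable to condition
aovA |> summarize(r_squared = sumsq[term == "condition"] / sum(sumsq))  # ~0.99, Device A
aovB |> summarize(r_squared = sumsq[term == "condition"] / sum(sumsq))  # ~0.72, Device B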

Application to our NHANES example

With an understanding of within and between group variability, we will now revisit our NHANES example focusing on SBP, specifically the model where SBP is regressed on ethnic. Recall that the model results are stored in an object named lm_eth.

# get the ANOVA table 
aov_eth <- 
  anova(lm_eth) |> 
  tidy() |> 
  select(term, sumsq) 

aov_eth

The table below augments the resulting table with some descriptors, and adds the Sum of Squares Total — which is the sum of the Within Groups and Between Groups variability.

For purposes of comparison, let’s also print out the \(R^2\) from the regression model output.

# request r-square via glance
lm_eth |> glance() |> select(r.squared)

The Between Groups Sum of Squares is 24,472.3, which represents the variability in SBP attributable to differences across racial and ethnic groups. In contrast, the Within Groups Sum of Squares amounts to 1,154,871.3, indicating the variability in SBP that cannot be explained by racial or ethnic differences, as it captures the variability within these groups.

By summing these two values we obtain the Total Sum of Squares: 1,179,343.6. This figure represents the overall variability in systolic blood pressure across all observations. From these values, we can compute the \(R^2\), which is the proportion of variability in systolic blood pressure explained by racial and ethnic differences:

\[ R^2 = \frac{24,472.31}{1,179,343.62} \approx 0.02 \]

This calculation suggests that approximately 2% of the variability in systolic blood pressure can be attributed to differences among racial and ethnic groups. This \(R^2\) value is consistent with what is reported by the glance() function, where it is labeled as r.squared.
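The same one-line ratio used in the device example reproduces this value from the aov_eth table created above:

# R-squared = Between Groups SS / Total SS
aov_eth |> summarize(r_squared = sumsq[term == "ethnic"] / sum(sumsq))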

Wrap up

Understanding how to integrate categorical variables into regression models is essential for analyzing complex real-world data. Unlike continuous variables, categorical variables represent distinct groups and require special handling. The most common method for including these variables in regression models is dummy-coding, which translates categorical data into a series of binary indicators. This allows researchers to compare each category against a reference group, thus broadening the applicability of regression analysis.

Along the way, we also grew our intuition about within-group and between-group variability when considering how some continuous variable (e.g., SBP) differs as a function of group membership (e.g., race/ethnic groups). Understanding these concepts is crucial for interpreting the results of a regression analysis, especially when categorical predictors are involved.

  • Within-group variability refers to the differences in the continuous variable (e.g., SBP) among individuals within the same group. High within-group variability means that individuals in the same group have widely varying values for the continuous variable.

  • Between-group variability refers to the differences in the average values of the continuous variable (e.g., SBP) between different groups. High between-group variability means that the average values for the continuous variable differ significantly from one group to another.

When the between-group variability is large in comparison to the within-group variability, it suggests that the differences in the continuous variable are more strongly associated with group membership. In other words, the group membership (e.g., race/ethnic group) appears to be an important factor in explaining the differences in the continuous variable. This gives us more confidence that the observed differences are meaningful and not just due to random variation within each group.

Conversely, if the within-group variability is large relative to the between-group variability, it indicates that the differences in the continuous variable are more pronounced within each group than between the groups. This suggests that group membership may not be a strong predictor of the continuous variable, and the observed differences between group means might not be as meaningful.

Footnotes

  1. This code is a bit complex, and I don’t expect you to know what’s happening in every section, but it showcases the power of the wonderful gt package. Click here for a great introduction to more advanced capabilities — and here for the winners of a Posit Cloud gt contest for inspiration.