PSY 652: Research Methods in Psychology I

Categorical Predictors

Kimberly L. Henry: kim.henry@colostate.edu

Import the data

To begin, let’s consider benevolence at baseline

After describing a past event in which they were hurt by another person, participants responded to each item by reflecting on the person who had hurt them. Items were rated using a five-point response format (1 = Strongly disagree; 5 = Strongly agree). We focus here on items 13 to 18 of the inventory (which correspond to the benevolence items).

  • t1_trim13: Even though his/her actions hurt me, I still have goodwill for him/her.
  • t1_trim14: I want us to bury the hatchet and move forward with our relationship.
  • t1_trim15: Despite what he/she did, I want us to have a positive relationship again.
  • t1_trim16: I have given up my hurt and resentment.
  • t1_trim17: Although he/she hurt me, I put the hurt aside so we could resume our relationship.
  • t1_trim18: I have released my anger so I could work on restoring our relationship to health.

Form the scale

The scale score is a new variable called t1_trim_benevolence, computed as the average of six existing variables in the data frame: the columns t1_trim13 through t1_trim18. The computation combines two pieces:

  • rowMeans() computes the average (mean) of values across columns, that is, within each row of data.
  • pick(num_range("t1_trim", 13:18)) selects the columns whose names consist of the prefix “t1_trim” followed by a number from 13 through 18.

You could accomplish the same task without the selection helper as follows:

df <-
  df |>
  mutate(
    t1_trim_benevolence = rowMeans(
      pick(t1_trim13, t1_trim14, t1_trim15, t1_trim16, t1_trim17, t1_trim18),
      na.rm = TRUE
    )
  )

The argument na.rm = TRUE tells R to ignore missing values (NAs) when calculating the mean for each row. So if a participant skipped one or more of the six items, R will still compute the average using the available (non-missing) responses.
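To see how na.rm = TRUE behaves, here is a minimal sketch on a made-up data frame. The object toy and its values are hypothetical and are not the study data; the sketch assumes dplyr 1.1.0 or later, which provides pick().

```r
library(dplyr)

# Hypothetical toy data: three respondents, six benevolence items
toy <- tibble(
  t1_trim13 = c(4, 2, NA),
  t1_trim14 = c(5, 2, NA),
  t1_trim15 = c(4, NA, NA),
  t1_trim16 = c(3, 1, NA),
  t1_trim17 = c(4, 2, 5),
  t1_trim18 = c(4, 3, NA)
)

toy <- toy |>
  mutate(
    # Average the available items in each row, skipping NAs
    t1_trim_benevolence = rowMeans(
      pick(num_range("t1_trim", 13:18)),
      na.rm = TRUE
    )
  )

toy$t1_trim_benevolence
```

The second respondent is scored from five items and the third from a single item, which is why a minimum-items rule can be worth considering.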

If you instead wanted to form the scale only when at least 4 of the 6 items were observed, you could use:

df |>
  mutate(
    t1_trim_benevolence = case_when(
      # If at least 4 of the 6 benevolence items are non-missing,
      # compute the mean across those items (ignoring any missing values)
      rowSums(!is.na(pick(num_range("t1_trim", 13:18)))) >= 4 ~
        rowMeans(pick(num_range("t1_trim", 13:18)), na.rm = TRUE),
      # Otherwise (fewer than 4 items answered), set the score to missing
      TRUE ~ NA_real_
    )
  )

NA_real_ is R’s way of specifying a missing value that is explicitly a numeric (real number) type — it ensures that when we assign NA (missing) to our new variable, R knows it should be treated as a missing numeric value rather than a missing character string or logical value. See this brief tutorial for more information.
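The rule and the NA types can be illustrated on a made-up data frame. This is only a sketch: toy and its values are hypothetical, and dplyr 1.1.0 or later is assumed.

```r
library(dplyr)

# Each kind of NA has its own type
class(NA)             # "logical"
class(NA_real_)       # "numeric"
class(NA_character_)  # "character"

# Hypothetical toy data: respondents answered 6, 5, and 1 of the items
toy <- tibble(
  t1_trim13 = c(4, 2, NA),
  t1_trim14 = c(5, 2, NA),
  t1_trim15 = c(4, NA, NA),
  t1_trim16 = c(3, 1, NA),
  t1_trim17 = c(4, 2, 5),
  t1_trim18 = c(4, 3, NA)
)

toy <- toy |>
  mutate(
    # Count the non-missing items in each row
    n_answered = rowSums(!is.na(pick(num_range("t1_trim", 13:18)))),
    # Score the scale only when at least 4 items were answered
    t1_trim_benevolence = case_when(
      n_answered >= 4 ~
        rowMeans(pick(num_range("t1_trim", 13:18)), na.rm = TRUE),
      TRUE ~ NA_real_
    )
  )

toy$t1_trim_benevolence
```

Only the third respondent, who answered a single item, receives a missing scale score, and the new column remains numeric.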

Dummy Coding (2 categories)

Representing treatment condition

The treat variable has 2 levels.

For a categorical variable with \(k\) categories, we need \(k-1\) dummy variables. Therefore, we need a single dummy-coded indicator.

Notice that in the data frame we have two versions of the treatment variable:

  • treat: a factor
  • treat_dummy: a numeric variable coded 0 for Control and 1 for Treatment

Research Question

Does baseline benevolence differ by treatment condition?

This is important because:

  • We expect there to be no differences between treatment conditions on any of the pre-treatment variables since treatment was randomly assigned (and therefore should be unrelated to any pre-treatment variables).

Compute mean of baseline benevolence across conditions for Colombia

Graph baseline benevolence by condition for Colombia

Recover the means for Colombia with a simple linear regression (SLR)

In equation form

The fitted regression equation is:

\[\hat{y}_i = b_0 + b_1 \times \text{treat\_dummy}_i\]

Plugging in the estimated coefficients:

\[\hat{y}_i = 3.363 + 0.054 \times \text{treat\_dummy}_i\]

Interpretation

  • Intercept (3.363): The predicted baseline benevolence for the Control group
  • Slope (0.054): The Treatment group’s benevolence is 0.05 points higher than Control, on average
  • 95% CI (-0.066, 0.174): Includes zero, suggesting the conditions are not reliably different at baseline

This is consistent with successful randomization: the groups are similar on benevolence before the intervention begins.

Use equation to recreate means

For the Control group:

\[\hat{y}_i = 3.363 + 0.054 \times 0 = 3.363\]

For the Treatment group:

\[\hat{y}_i = 3.363 + 0.054 \times 1 = 3.417\]
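The link between the coefficients and the group means can be checked on simulated data. The sketch below uses base R only; the data frame sim, the seed, and the generating values are assumptions for illustration and will not reproduce the estimates reported above.

```r
set.seed(652)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in for one site's data (not the study data)
n <- 200
sim <- data.frame(treat_dummy = rep(c(0, 1), each = n / 2))
sim$t1_trim_benevolence <- 3.4 + 0.05 * sim$treat_dummy + rnorm(n, sd = 1)

# Group means computed directly (names are "0" and "1")
group_means <- tapply(sim$t1_trim_benevolence, sim$treat_dummy, mean)

# Simple linear regression with the dummy-coded predictor
fit <- lm(t1_trim_benevolence ~ treat_dummy, data = sim)
b <- coef(fit)

# Intercept = Control mean; intercept + slope = Treatment mean
control_hat <- unname(b["(Intercept)"])
treatment_hat <- unname(b["(Intercept)"] + b["treat_dummy"])

group_means
c(control = control_hat, treatment = treatment_hat)
```

With a single dummy-coded predictor, the two printed vectors agree up to floating-point error, whatever the simulated values happen to be.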

Verify reference group

By default, R orders the levels of a factor alphabetically and treats the first level (“Control”) as the reference group. All comparisons in the regression model are therefore made relative to the Control group.

Same result with factor version

Change the reference group and refit

Notice what changed

  • Intercept (3.42): Now represents the Treatment group mean
  • Slope (-0.05): Control group is 0.05 points lower than Treatment
  • Same information, different reference point

The overall model is identical; we have only switched which group serves as the comparison baseline.
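A minimal sketch of changing the reference group, using base R's relevel() on simulated data. The data frame sim and the column treat_rev are hypothetical names, and the simulated values are not the study data.

```r
set.seed(652)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in data with a two-level factor
n <- 200
sim <- data.frame(
  treat = factor(rep(c("Control", "Treatment"), each = n / 2))
)
sim$t1_trim_benevolence <-
  3.4 + 0.05 * (sim$treat == "Treatment") + rnorm(n)

# The first level listed is the reference group
levels(sim$treat)

# Fit with Control as the reference group (the default here)
fit_control <- lm(t1_trim_benevolence ~ treat, data = sim)

# Make Treatment the reference group and refit
sim$treat_rev <- relevel(sim$treat, ref = "Treatment")
fit_treatment <- lm(t1_trim_benevolence ~ treat_rev, data = sim)

coef(fit_control)
coef(fit_treatment)
```

The slope changes sign, the intercept moves to the other group's mean, and the fitted values are unchanged.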

Your turn: check equivalence for the other sites

We expect to see no baseline differences in any site (due to randomization).

Let’s verify this holds across all study locations.

  • Hong Kong
  • South Africa
  • Colombia
  • Indonesia
  • Ukraine (UISA)
  • Ukraine (Realis)

Dummy Coding (>2 categories)

Representing study site

The site variable has 6 levels.

For a categorical variable with \(k\) categories, we need \(k-1\) dummy variables. Therefore, we need five dummy-coded indicators.
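To see the five indicators R builds for a six-level factor, model.matrix() can be applied to a small example. This sketch uses one made-up row per site, and the levels are set explicitly so that Colombia comes first.

```r
# One hypothetical row per site, with Colombia as the first (reference) level
site_levels <- c(
  "Colombia", "Hong Kong", "Indonesia", "South Africa",
  "Ukraine (Realis)", "Ukraine (UISA)"
)
site <- factor(site_levels, levels = site_levels)

# Design matrix: an intercept column plus k - 1 = 5 dummy columns
X <- model.matrix(~ site)

colnames(X)
X
```

The Colombia row has zeros on all five dummy columns, and every other row has a single 1 in the column for its own site.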

Research Question

Does baseline benevolence differ across study sites?

This is important because:

  • Cultural differences may influence forgiveness attitudes
  • Site differences could affect intervention effectiveness
  • We want to understand baseline heterogeneity before examining treatment effects

Compute mean baseline benevolence by site

Visualize baseline benevolence across sites

Fit the regression model with site

Understanding the output

With Colombia as the reference group:

  • Intercept (3.39): Mean benevolence for Colombia
  • siteHong Kong (-0.37): Hong Kong is 0.37 points lower than Colombia
  • siteIndonesia (-0.32): Indonesia is 0.32 points lower than Colombia
  • siteSouth Africa (-0.36): South Africa is 0.36 points lower than Colombia
  • siteUkraine (Realis) (-0.22): Ukraine (Realis) is 0.22 points lower than Colombia
  • siteUkraine (UISA) (-0.43): Ukraine (UISA) is 0.43 points lower than Colombia

All comparisons are made relative to the reference group (Colombia).

In equation form

The fitted regression equation is:

\[\begin{aligned} \hat{y}_i = b_0 &+ b_1 \times \text{HK}_i + b_2 \times \text{Indo}_i \\ &+ b_3 \times \text{SA}_i + b_4 \times \text{UkrR}_i \\ &+ b_5 \times \text{UkrU}_i \end{aligned}\]

where each dummy variable equals 1 for that site and 0 otherwise.

Plugging in the estimated coefficients:

\[\begin{aligned} \hat{y}_i = 3.39 &+ (-0.37) \times \text{HK}_i + (-0.32) \times \text{Indo}_i \\ &+ (-0.36) \times \text{SA}_i + (-0.22) \times \text{UkrR}_i \\ &+ (-0.43) \times \text{UkrU}_i \end{aligned}\]
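As in the two-category case, the equation can be used to recreate each site's predicted mean by setting that site's dummy to 1 and all others to 0. The values below use the rounded coefficients shown above, so they may differ slightly from means computed directly from the data:

\[\begin{aligned} \text{Colombia: } \hat{y}_i &= 3.39 \\ \text{Hong Kong: } \hat{y}_i &= 3.39 - 0.37 = 3.02 \\ \text{Indonesia: } \hat{y}_i &= 3.39 - 0.32 = 3.07 \\ \text{South Africa: } \hat{y}_i &= 3.39 - 0.36 = 3.03 \\ \text{Ukraine (Realis): } \hat{y}_i &= 3.39 - 0.22 = 3.17 \\ \text{Ukraine (UISA): } \hat{y}_i &= 3.39 - 0.43 = 2.96 \end{aligned}\]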

Which sites differ significantly from Colombia?

Look at the 95% confidence intervals:

Key observation: None of the confidence intervals include zero — all sites show reliably lower baseline benevolence compared to Colombia.

Interesting finding: Colombia has the highest baseline benevolence among all study sites!

Within vs. Between Group Variability

Between-Group SS (Site): Variation in benevolence explained by differences between site means
Within-Group SS (Residuals): Variation among individuals within sites that site does not explain

\[R^2 = \frac{\text{SS}_{\text{Between Groups}}}{\text{SS}_{\text{Total}}} = \frac{\text{SS}_{\text{Site}}}{\text{SS}_{\text{Site}} + \text{SS}_{\text{Residuals}}}\]

\[R^2 = \frac{117.81}{117.81 + 4811.04} = \frac{117.81}{4928.85} = 0.0239\]

Site explains approximately 2.4% of the variance in baseline benevolence.

Verify with glance()
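A sketch of this check on simulated data: R² is computed from the sums of squares in anova() and compared with the value stored in the model summary. The data frame sim, the site labels, and the generating values are hypothetical. The glance() call assumes the broom package is installed and is skipped otherwise.

```r
set.seed(652)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in data with three hypothetical sites
sim <- data.frame(
  site = factor(rep(c("Site A", "Site B", "Site C"), each = 100))
)
sim$y <- c(3.4, 3.0, 3.1)[as.integer(sim$site)] + rnorm(300)

fit <- lm(y ~ site, data = sim)

# Sums of squares: first entry is site (between), second is residuals (within)
ss <- anova(fit)[["Sum Sq"]]
r2_manual <- ss[1] / sum(ss)

# R-squared as stored in the model summary
r2_model <- summary(fit)$r.squared

c(manual = r2_manual, model = r2_model)

# The same quantity via broom, if it is available
if (requireNamespace("broom", quietly = TRUE)) {
  print(broom::glance(fit)$r.squared)
}

# Arithmetic check of the sums of squares reported above
r2_reported <- 117.81 / (117.81 + 4811.04)
round(r2_reported, 4)
```

The manual and model-based values agree because, with site as the only predictor, the between-group sum of squares is the model sum of squares.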