Categorical Predictors
After describing a past event in which they were hurt by another person, participants responded to each item by reflecting on the person who had hurt them. Items were rated using a five-point response format (1 = Strongly disagree; 5 = Strongly agree). We focus here on items 13 to 18 of the inventory (which correspond to the benevolence items).
We create a new variable called t1_trim_benevolence by taking the average of six existing variables in the data frame: the columns named t1_trim13 through t1_trim18.
pick(num_range("t1_trim", 13:18)) selects all columns whose names consist of the prefix “t1_trim” followed by a number from 13 through 18. You could accomplish the same task without helpers by listing the six columns by name.
The argument na.rm = TRUE tells R to ignore missing values (NAs) when calculating the mean for each row. So if a participant skipped one or more of the six items, R will still compute the average using the available (non-missing) responses. If a participant skipped all six items, the result is NaN.
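The code being described is not shown here, so the following is only a minimal sketch of what it might look like. It uses a small made-up data frame, toy, in place of the study data. The names toy, benevolence_helper, and benevolence_explicit are hypothetical and chosen for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for the study data frame
toy <- tibble(
  t1_trim13 = c(4, 2, NA),
  t1_trim14 = c(5, 2, NA),
  t1_trim15 = c(4, 3, 3),
  t1_trim16 = c(3, 1, NA),
  t1_trim17 = c(4, 2, NA),
  t1_trim18 = c(4, 2, 5)
)

toy <- toy |>
  mutate(
    # With the num_range() helper
    benevolence_helper = rowMeans(
      pick(num_range("t1_trim", 13:18)),
      na.rm = TRUE
    ),
    # Without helpers: name each of the six columns explicitly
    benevolence_explicit = rowMeans(
      pick(t1_trim13, t1_trim14, t1_trim15, t1_trim16, t1_trim17, t1_trim18),
      na.rm = TRUE
    )
  )

toy
```

In this toy data the third participant answered only two items but still receives a score, because na.rm = TRUE averages whatever responses are available.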
If, instead, you wanted to form the scale only when at least 4 of the 6 items were observed, you could use:
library(dplyr)  # provides mutate(), case_when(), pick(), and num_range()

df |>
  mutate(
    t1_trim_benevolence = case_when(
      # If at least 4 of the 6 benevolence items are non-missing,
      # compute the mean across those items (ignoring any missing values)
      rowSums(!is.na(pick(num_range("t1_trim", 13:18)))) >= 4 ~
        rowMeans(pick(num_range("t1_trim", 13:18)), na.rm = TRUE),
      # Otherwise (fewer than 4 items answered), set the score to missing
      TRUE ~ NA_real_
    )
  )

NA_real_ is R’s way of specifying a missing value that is explicitly of numeric (real number) type. It ensures that when we assign NA (missing) to our new variable, R treats it as a missing numeric value rather than a missing character string or logical value. See this brief tutorial for more information.
The treat variable has 2 levels.
For a categorical variable with \(k\) categories, we need \(k-1\) dummy variables. Therefore, we need a single dummy-coded indicator.
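As a rough illustration of the \(k-1\) rule for a two-level variable, the sketch below builds a single 0/1 indicator by hand and compares it with the design matrix R constructs on its own. The data are made up, and the level label "Treatment" is an assumption for illustration.

```r
# Hypothetical treatment factor with two levels (made-up values)
treat <- factor(c("Control", "Treatment", "Treatment", "Control"))

# Manual dummy coding: 1 = Treatment, 0 = Control
treat_dummy <- as.integer(treat == "Treatment")

# R builds the same indicator automatically when a factor enters a model formula
design <- model.matrix(~ treat)

colnames(design)  # an intercept column plus one dummy column
cbind(manual = treat_dummy, automatic = design[, "treatTreatment"])
```

With two levels there is one dummy column next to the intercept, and it matches the hand-coded indicator row by row.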
Notice that in the data frame we have two versions of the treatment variable:
Does baseline benevolence differ by treatment condition?
This is important because:
The fitted regression equation is:
\[\hat{y}_i = b_0 + b_1 \times \text{treat\_dummy}_i\]
Plugging in the estimated coefficients:
\[\hat{y}_i = 3.363 + 0.054 \times \text{treat\_dummy}_i\]
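The estimated difference between the groups is small (0.054 points on the five-point scale), which is consistent with successful randomization: the groups are similar before the intervention begins.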
For Control Group: \[\hat{y}_i = 3.363 + 0.054 \times 0 = 3.363\]
For Treatment Group: \[\hat{y}_i = 3.363 + 0.054 \times 1 = 3.417\]
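The link between the coefficients and the group means can be checked on any data set. The sketch below uses made-up scores (not the study data) and the hypothetical names toy and benevolence to show that the intercept reproduces the control mean and that the intercept plus the slope reproduces the treatment mean.

```r
# Made-up scores for illustration only, not the study data
toy <- data.frame(
  treat_dummy = c(0, 0, 0, 1, 1, 1),
  benevolence = c(3.0, 3.5, 3.4, 3.2, 3.6, 3.7)
)

fit <- lm(benevolence ~ treat_dummy, data = toy)
b <- coef(fit)

control_mean   <- mean(toy$benevolence[toy$treat_dummy == 0])
treatment_mean <- mean(toy$benevolence[toy$treat_dummy == 1])

# Intercept matches the control mean
all.equal(unname(b[1]), control_mean)
# Intercept + slope matches the treatment mean
all.equal(unname(b[1] + b[2]), treatment_mean)
```

So with a single dummy-coded predictor, the slope is simply the difference between the two group means.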
By default, R orders the levels of a factor alphabetically and uses the first level as the reference group. Here the first level is “Control”, so all comparisons in the regression model are made relative to the Control group.
When the reference group is switched to the Treatment group, the overall model is identical; we’ve just rotated which group serves as the comparison baseline.
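One way to change the reference group is base R’s relevel(). The sketch below uses made-up data and hypothetical names (toy, treat_rev) and assumes the second level is labelled "Treatment". It fits the model under both codings and compares the results.

```r
# Made-up data for illustration only
toy <- data.frame(
  treat = factor(rep(c("Control", "Treatment"), each = 3)),
  benevolence = c(3.0, 3.5, 3.4, 3.2, 3.6, 3.7)
)

levels(toy$treat)  # the first level listed is the reference group
fit_control <- lm(benevolence ~ treat, data = toy)

# Make "Treatment" the reference group instead
toy$treat_rev <- relevel(toy$treat, ref = "Treatment")
fit_treatment <- lm(benevolence ~ treat_rev, data = toy)

# The slope changes sign and the intercept becomes the other group's mean
coef(fit_control)
coef(fit_treatment)

# The overall fit is unchanged
all.equal(summary(fit_control)$r.squared, summary(fit_treatment)$r.squared)
```

Only the interpretation of the coefficients changes: the intercept becomes the mean of the new reference group, and the slope is the same difference with the opposite sign.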
We expect to see no baseline differences in any site (due to randomization).
Let’s verify this holds across all study locations.
The site variable has 6 levels.
For a categorical variable with \(k\) categories, we need \(k-1\) dummy variables. Therefore, we need five dummy-coded indicators.
Does baseline benevolence differ across study sites?
This is important because:
With Colombia as the reference group:
All comparisons are made relative to the reference group (Colombia)
The fitted regression equation is:
\[\begin{aligned} \hat{y}_i = b_0 &+ b_1 \times \text{HK}_i + b_2 \times \text{Indo}_i \\ &+ b_3 \times \text{SA}_i + b_4 \times \text{UkrR}_i \\ &+ b_5 \times \text{UkrU}_i \end{aligned}\]
where each dummy variable equals 1 for that site and 0 otherwise.
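To see what these five indicators look like, the sketch below builds a design matrix for a six-level factor with one made-up row per site. The level labels reuse the abbreviations from the equation; whether the real data use these exact labels is an assumption.

```r
# One made-up row per site; labels follow the abbreviations in the equation
site_levels <- c("Colombia", "HK", "Indo", "SA", "UkrR", "UkrU")
site <- factor(site_levels, levels = site_levels)

# Intercept plus k - 1 = 5 dummy columns; Colombia (first level) is the reference
design <- model.matrix(~ site)
design
```

The Colombia row has 0 in every dummy column, while each other site has a single 1 in its own column.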
Plugging in the estimated coefficients:
\[\begin{aligned} \hat{y}_i = 3.39 &+ (-0.37) \times \text{HK}_i + (-0.32) \times \text{Indo}_i \\ &+ (-0.36) \times \text{SA}_i + (-0.22) \times \text{UkrR}_i \\ &+ (-0.43) \times \text{UkrU}_i \end{aligned}\]
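Because each participant belongs to exactly one site, at most one dummy variable equals 1 for any observation. The predicted mean for a site is therefore the intercept plus that site’s coefficient. Using the rounded estimates above, the site means work out approximately as follows:

\[\begin{aligned}
\text{Colombia:}\quad \hat{y} &= 3.39 \\
\text{HK:}\quad \hat{y} &= 3.39 - 0.37 = 3.02 \\
\text{Indo:}\quad \hat{y} &= 3.39 - 0.32 = 3.07 \\
\text{SA:}\quad \hat{y} &= 3.39 - 0.36 = 3.03 \\
\text{UkrR:}\quad \hat{y} &= 3.39 - 0.22 = 3.17 \\
\text{UkrU:}\quad \hat{y} &= 3.39 - 0.43 = 2.96
\end{aligned}\]

These values inherit the rounding of the reported coefficients, so they may differ slightly from means computed directly from the data.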
Look at the 95% confidence intervals:
Key observation: None of the confidence intervals include zero — all sites show reliably lower baseline benevolence compared to Colombia.
Interesting finding: Colombia has the highest baseline benevolence among all study sites!
Between-Group SS (Site): Variance explained by site differences
Within-Group SS (Residuals): Unexplained individual differences
\[R^2 = \frac{\text{SS}_{\text{Between Groups}}}{\text{SS}_{\text{Total}}} = \frac{\text{SS}_{\text{Site}}}{\text{SS}_{\text{Site}} + \text{SS}_{\text{Residuals}}}\]
\[R^2 = \frac{117.81}{117.81 + 4811.04} = \frac{117.81}{4928.85} = 0.0239\]
Site explains approximately 2.4% of the variance in baseline benevolence.
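The arithmetic above can be reproduced in R from the two sums of squares reported in the text. The second half of the sketch uses made-up data and hypothetical names (toy, y) to show that the same ratio, taken from a fitted model’s ANOVA table, agrees with the \(R^2\) that summary() reports.

```r
# Sums of squares as reported in the text
ss_site      <- 117.81   # between-group (site)
ss_residuals <- 4811.04  # within-group (residuals)

r_squared <- ss_site / (ss_site + ss_residuals)
round(r_squared, 4)

# Same idea on made-up data: SS ratio from anova() vs. R-squared from summary()
toy <- data.frame(
  site = factor(rep(c("A", "B", "C"), each = 3)),
  y    = c(3.1, 3.4, 3.6, 2.9, 3.0, 3.3, 2.7, 3.1, 3.2)
)
fit <- lm(y ~ site, data = toy)
ss  <- anova(fit)[["Sum Sq"]]  # first entry: site, second entry: residuals

all.equal(ss[1] / sum(ss), summary(fit)$r.squared)
```

With a single categorical predictor, the between-group share of the total sum of squares and the regression \(R^2\) are the same quantity.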