| variable_name | label | notes |
|---|---|---|
| id | inmate ID | NA |
| years | sentence length in years | NA |
| black | race (1 = black, 0 = white) | NA |
| afro | rating for afrocentric features | inmate photo rated by ~35 CU undergraduate students; raters assigned a single, global assessment of the degree to which each face had features that are typical of African Americans, using a scale from 1 (not at all) to 9 (very much) |
| primlev | seriousness of primary offense | based on Florida’s rating system, higher numbers indicate more serious felonies |
| seclev | seriousness of secondary offense | based on Florida’s rating system, higher numbers indicate more serious felonies |
| nsecond | number of secondary offenses | NA |
| anysec | indicator for any secondary offenses (1 = yes, 0 = no) | NA |
| priorlev | seriousness of prior offenses | based on Florida’s rating system, higher numbers indicate more serious felonies |
| nprior | number of prior offenses | NA |
| anyprior | indicator for any prior offenses (1 = yes, 0 = no) | NA |
| attract | rating for attractiveness | inmate photo rated by ~35 CU undergraduate students |
| babyface | rating for babyface features | inmate photo rated by ~35 CU undergraduate students |
Apply and Practice Activity
Blair Replication: Explore Logarithms
Introduction
Building on our replication of the Blair et al. study, in this Apply and Practice Activity you will examine the distribution of the outcome for the analysis: sentence length in years. You’ll find that it is a highly skewed variable that must be transformed in order to produce a linear relationship between the key predictor of interest (Afrocentric facial features) and the outcome (sentence length). You’ll perform the transformation, examine whether it solves the problem, and learn how to interpret the effects of the fitted model.
As a reminder, the variables in the data frame are listed in the table at the top of this document.
Background
When we build a linear regression model, a fundamental assumption is that the relationship between the predictor(s) and the outcome is linear. Real-world data, however, are often more complex, and non-linear patterns and trends are common. When faced with such data, linear regression models may not perform well, leading to violations of regression assumptions and inaccurate predictions. To address this issue, we can apply non-linear transformations to variables. By strategically transforming certain variables using mathematical functions, we can uncover hidden patterns, model non-linear relationships, and bring the data closer to meeting the linearity assumption of regression. These transformations can reveal underlying trends that were obscured in the original data, improving the model’s performance and the reliability of its predictions. In this activity, we will explore one of the most common non-linear transformations: the logarithm. We’ll study its applications and how it can linearize relationships, making our regression models more robust and more accurate in capturing the complexities of our data.
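To build intuition for why a log transformation can linearize a relationship, consider a small simulated sketch (hypothetical data, not the Blair data; all variable names here are made up for illustration):

```r
library(ggplot2)

# simulate a multiplicative (non-linear) relationship between x and y
set.seed(123)
x <- runif(200, min = 1, max = 9)
y <- exp(0.5 + 0.1 * x + rnorm(200, sd = 0.2))
d <- data.frame(x = x, y = y, log_y = log(y))

# on the raw scale, the trend curves upward away from a straight line
ggplot(d, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm")

# after taking the natural log of y, the same relationship is
# approximately linear
ggplot(d, aes(x = x, y = log_y)) +
  geom_point() +
  geom_smooth(method = "lm")
```

Because y was generated multiplicatively, log(y) is a linear function of x plus noise, which is exactly the situation where a log transformation of the outcome pays off.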
Please follow the steps below to complete this activity.
Step by step directions
Step 1
Open up the blair_replication project in Posit Cloud.
Step 2
In the Files tab of the lower right quadrant of RStudio, open up the blair_replication_setup.qmd document that you created earlier. Save a copy of this — call it blair_replication_logs.qmd (click File -> Save As…, then provide the new name). In the YAML header, change the title to: “Blair Replication: Explore Logarithms”. We’ll do all of the work for this activity in this new version — blair_replication_logs.qmd.
Step 3
Click the down arrow beside Run in the top menu of the RStudio session, and then choose Restart R and Run All Chunks.
Step 4
Use the output from the code chunk under the header # Describe variables (which uses the skim() function to provide descriptive statistics for the variables).
Match the variables up with the descriptions in the Introduction. Write a few sentences describing the descriptive statistics, focusing on sentence length in years (years) and Afrocentric features (afro). Describe what the variables represent and summarize the descriptive statistics in sentence form. Put these sentences in the white space of your analysis notebook underneath the skim() output.
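If you want to re-run the descriptive statistics yourself, the chunk might look something like the sketch below. Note that the data frame name blair is an assumption; use whatever name your setup document actually assigned.

```r
library(skimr)

# descriptive statistics for every variable (data frame name is assumed)
blair |>
  skim()

# or restrict the output to the two focal variables
blair |>
  skim(years, afro)
```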
Step 5
The outcome variable of interest in the Blair and colleagues paper is sentence length in years.
At the bottom of the analysis notebook, create a first level header called
# Examine the distribution of years
Then, a second level header called
## The raw distribution
Then insert a code chunk and create a histogram of years. Give the graph an informative title and label the x-axis with an informative axis title.
Take a look at the resultant histogram and write a few sentences in your notebook to describe what you see (i.e., put these sentences in the white space of your analysis notebook underneath the graph).
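A sketch of what this chunk might look like (again assuming the data frame is named blair):

```r
library(ggplot2)

ggplot(blair, aes(x = years)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of sentence length",
    x = "Sentence length (years)",
    y = "Count"
  )
```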
Step 6
This histogram illustrates that sentence length has extreme positive skew (i.e., right skew). Most of the inmates have relatively short sentence lengths, with a small number of inmates given very long sentences.
Although linear regression does not require that the variables (either the predictors or the outcomes) be normally distributed, very highly skewed variables often require some sort of transformation. When you encounter a variable with extreme skew of this type, a transformation that pulls in the tail of the distribution is often needed to avoid violating the assumptions of your modeling procedure. For a linear regression model (the type of model used in the paper), extremely skewed variables can cause violations of several assumptions (we’ll study these closely in Module 17), including normality of the residuals, homogeneity of variance of the residuals, and, perhaps most importantly, linearity.
In the paper, Blair and colleagues indicate:
Because sentence length was positively skewed, a log-transformation was performed on this variable prior to analysis.
Let’s examine whether or not it was a good idea for Blair and colleagues to apply a log transformation.
Step 7
Create a second level header
## Transform years and create a new histogram
Insert a new code chunk. In the code chunk, first use mutate() to create a new variable called lnyears which is the natural log of years (ln is the common abbreviation for natural log). The function to take the natural log of a variable in R is log(). Once the new variable is created, recreate your histogram with the logged version of years. Be sure to give the histogram an informative title, and label the x-axis appropriately.
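One way this chunk might look (the data frame name blair is an assumption):

```r
library(dplyr)
library(ggplot2)

# create the natural log version of years
blair <- blair |>
  mutate(lnyears = log(years))

# recreate the histogram with the transformed variable
ggplot(blair, aes(x = lnyears)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of the natural log of sentence length",
    x = "Natural log of sentence length (lnyears)",
    y = "Count"
  )
```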
Step 8
Compared to the initial histogram, lnyears displays a distribution that is closer to normal. That’s good news. However, we also hope to see that the transformation linearizes the relationship between the predictor(s) and the outcome. For now, let’s consider the primary predictor — Afrocentric Features (called afro in the data frame).
Step 9
Create a second level header
## A scatterplot of years, lnyears and afro
Insert a code chunk. Then, create two scatterplots, one for years and afro and one for lnyears and afro. In both plots, put afro on the x-axis. Overlay the best fit line. Be sure to provide an informative title, as well as x- and y-axis labels.
Take a look at the resulting graphs and write a few sentences to provide your insight underneath your graphs.
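A possible sketch of the two plots (data frame name blair assumed); geom_smooth() with method = "lm" overlays the best fit line:

```r
library(ggplot2)

# raw outcome
ggplot(blair, aes(x = afro, y = years)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Sentence length by Afrocentric features",
    x = "Afrocentric features rating",
    y = "Sentence length (years)"
  )

# log-transformed outcome
ggplot(blair, aes(x = afro, y = lnyears)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Log sentence length by Afrocentric features",
    x = "Afrocentric features rating",
    y = "Natural log of sentence length"
  )
```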
Step 10
The second plot, which includes lnyears, is preferable. In the first plot, the linear model does a poor job of modeling the data. In the second, however, the best fit line does a reasonable job of explaining the relationship between these two variables. Note that in later activities you will see that the relationship between Afrocentric features and the natural log of sentence length becomes even more linear once we adjust for the other control variables in the model. For now, we will call this scatterplot linear enough to proceed with our activity.
Step 11
Create a first level header
# Regress lnyears on afro
Insert a code chunk, then fit a simple linear regression model — regress lnyears on afro. Call the model slr1. Use the tidy() function to request the output, including the 95% CI. Also, request the glance() output.
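A sketch of the model-fitting chunk (tidy() and glance() come from the broom package; the data frame name blair is an assumption):

```r
library(broom)

# regress lnyears on afro
slr1 <- lm(lnyears ~ afro, data = blair)

# coefficient table with 95% confidence intervals
slr1 |>
  tidy(conf.int = TRUE, conf.level = 0.95)

# model-level summaries, including r.squared
slr1 |>
  glance()
```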
Write a few sentences to interpret the intercept, slope, and r.squared from the model output. Include an interpretation of the 95% CIs for the slope in your response.
Step 12
Now, let’s convert the estimate for the intercept back to the original metric of years. Recall from Module 15 that we can back transform a logged value (called taking the antilog or inverse log), by exponentiating the log transformed value to the base that was used. For the natural log transformation, that means raising e to the power of the transformed value. You can accomplish this in R with the exp() function. The argument to the function is the value that you want to antilog (i.e., our intercept).
Create a second level header
## Further interpretations
Insert a code chunk and compute the antilog of the intercept.
Write a sentence or two to interpret the antilog of the intercept.
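As a sketch, using the intercept estimate that appears in the output later in this activity (0.71351909):

```r
# antilog (inverse natural log) of the intercept
exp(0.71351909)  # approximately 2.04

# or pull the intercept directly from the fitted model
exp(coef(slr1)[["(Intercept)"]])
```

The antilogged intercept is the predicted sentence length, in years, for an inmate with an Afrocentric features score of 0.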
Step 13
Now let’s further consider the slope. In the usual way, this slope is interpreted as follows: A one-unit increase in Afrocentric features is associated with a .09 unit increase in the natural log of sentence length.
For most of us, an increase of .09 in the natural log metric is difficult to comprehend. It would be preferable to know how much the sentence length in actual years increases as a function of Afrocentric features. We can interpret the effect with regard to years (rather than lnyears), but we must interpret in terms of a percent change in years, not a unit change in years. To do so, we need to apply an additional transformation, which I have coded up into a function called ln.y() below. The function ln.y() helps you interpret a regression coefficient when a natural log transformation has been applied to the y variable but the x variable is in its natural metric.
Copy and paste the code below into the code chunk below your antilog of the intercept.
In the second part of code below (labeled “input your slope…”), input the slope estimate (i.e., \(b_1\) for lnyears) from your fitted regression model, as well as a value of 1 for x_chg.
# function --- don't change anything here
ln.y <- function(slope, x_chg) {
  new_slope <- 100 * (exp(slope * x_chg) - 1)
  return(new_slope)
}

# input your slope from the regression output, and the desired change in x
ln.y(slope = XX, x_chg = XX)
Write a few sentences to interpret the transformed slope.
Step 14
Let’s build further intuition about predicted scores when the outcome variable is transformed with a natural log transformation.
Let’s calculate the predicted sentence length in years for two scores for the Afrocentric features variable: a score of 2 and a score of 3. We will use the results of the linear regression model fit earlier to compute these predicted values.
slr1 |> tidy()
# afro = 2
lnyears_at_afro_2 <- .71351909 + (0.09156341 * 2)
lnyears_at_afro_2

[1] 0.8966459

# afro = 3
lnyears_at_afro_3 <- .71351909 + (0.09156341 * 3)
lnyears_at_afro_3

[1] 0.9882093
Now, these predicted values are in the metric of lnyears — but we can antilog them to return them to the metric of years. The code below performs this.
# afro = 2
exp(lnyears_at_afro_2)
[1] 2.451367
# afro = 3
exp(lnyears_at_afro_3)
[1] 2.68642
The results tell us that, based only on this one predictor, an inmate with an Afrocentric features score of 2 is expected to receive a sentence length of 2.451367 years, and an inmate with an Afrocentric features score of 3 is expected to receive a sentence length of 2.68642 years. What is the percent difference between these two predicted values?
# percent difference in years
((2.68642 - 2.451367)/2.451367)*100
[1] 9.58865
An inmate with an Afrocentric features score of 3 is expected to have a sentence length in years that is about 9.6% longer than an inmate with an Afrocentric features score of 2.
Notice that this is the same estimate for the slope that we derived using the ln.y() function — that is, the expected increase in sentence length in years for a 1-unit increase in Afrocentric features.
ln.y(slope = 0.09156341, x_chg = 1)
[1] 9.588626
Importantly, the same percent difference is garnered regardless of what values for Afrocentric features that we compare. For example, take a look at the code below where the expected difference in sentence length in years between someone who has an Afrocentric features score of 6 and someone who has an Afrocentric features score of 7 is calculated. The log transformation ensures that the percent difference between predicted values remains consistent, regardless of the actual values being compared.
# afro = 6
lnyears_at_afro_6 <- .71351909 + (0.09156341 * 6)
exp(lnyears_at_afro_6)
[1] 3.535658
# afro = 7
lnyears_at_afro_7 <- .71351909 + (0.09156341 * 7)
exp(lnyears_at_afro_7)
[1] 3.87468
# percent difference in years
((3.87468 - 3.535658)/3.535658)*100
[1] 9.588654
Hopefully this demonstration helps you to understand what it means to interpret the effect of a one-unit increase in \(x_i\) on a certain percent change in \(y_i\) (when the natural log of \(y_i\) has been taken).
Step 15
Now, it’s your turn to practice using the ln.y() function. Copy the code in the code chunk below into your notebook (at the bottom of your code chunk under the header ## Further interpretations) and then calculate the expected percent change in sentence length in years for a 1 standard deviation increase in Afrocentric features.
# function --- don't change anything here
ln.y <- function(slope, x_chg) {
  new_slope <- 100 * (exp(slope * x_chg) - 1)
  return(new_slope)
}

# input your slope from the regression output, and the desired change in x
ln.y(slope = XX, x_chg = XX)
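As a sketch of the calculation (the slope 0.09156341 comes from the regression output shown earlier in this activity; the data frame name blair is an assumption):

```r
# standard deviation of the Afrocentric features ratings
sd_afro <- sd(blair$afro)

# expected percent change in sentence length (in years)
# for a 1 standard deviation increase in afro
ln.y(slope = 0.09156341, x_chg = sd_afro)
```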
Below the calculation of the slope for a one standard deviation increase in Afrocentric features, write a few sentences to describe this effect.
Step 16
Was the choice to apply a log transformation to years a good one?
The decision by Blair and colleagues to apply a log transformation was well-grounded. The transformation enhanced the data’s characteristics to better meet the assumptions of linear regression and rendered the model coefficients more interpretable.
What should you take away?
This activity delved into some nuanced concepts of linear regression. It’s important to understand the core principles, which include:
The Objective of Linear Regression: At its heart, linear regression aims to establish a linear relationship between a predictor (or predictors) and an outcome variable. This allows us to make predictions and understand relationships among variables.
Addressing Non-linear Relationships: Real-world data often have complexities, including non-linear relationships. However, linear regression is designed to fit linear relationships. So, what do we do when the relationship seems non-linear?
Transformations as a Tool: By applying non-linear transformations (like taking the natural logarithm) to variables, we can sometimes turn a non-linear relationship into a linear one—at least in the context of the transformed variable(s). This lets us harness the power of linear regression even for relationships that, at first glance, appear non-linear.
Remember, the goal of these transformations is to help meet the assumptions of linear regression and provide a better model fit. Over time, with practice, these concepts will become more intuitive. For now, it’s crucial to grasp these foundational ideas and understand why they matter.
Step 17
Now that you’ve completed all tasks, to help ensure reproducibility, click the down arrow beside the Run button toward the top of your screen then click Restart R and Clear Output. Scroll through your notebook and see that all of the output is now gone. Now, click the down arrow beside the Run button again, then click Restart R and Run All Chunks. Scroll through the file and make sure that everything ran as you would expect. You will find a red bar on the side of a code chunk if an error has occurred. Taking this step ensures that all code chunks are running from top to bottom, in the intended sequence, and producing output that will be reproduced the next time you work on this project.
Now that all code chunks are working as you’d like, click Render. This will create an .html output of your report. Scroll through to make sure everything is correct. The .html output file will be saved alongside the corresponding .qmd notebook file.