Causal Inference

Module 18

Fuzzy monsters by @allison_horst

Learning objectives

  • Define a causal effect
  • Contrast an individual causal effect with an average causal effect
  • Describe the potential outcomes framework
  • Contrast the factual and counterfactual outcomes
  • Describe why a randomized experiment is well-suited to test a causal effect
  • Describe the advantages and disadvantages of using randomized experiments to estimate causal effects
  • Define confounders in observational studies
  • Control for confounders in observational studies using statistical methods
  • Understand the difference between internal validity and external validity
  • Identify the limitations and strengths of observational studies to estimate causal effects

What is a causal effect?

A causal effect refers to the relationship between two variables where one variable (the cause) directly influences the other variable (the effect). In other words, if there is a causal effect between variable X and variable Y, a change in variable X will cause a corresponding change in variable Y. You are already quite familiar with causal effects because you use them frequently to guide your decisions in everyday life. For example:

  • You eat a nutritious lunch and the nutrients in the food cause the cells in your body to function properly. The nutrients cause cells to function.

  • When you have a headache, you take a pain reliever and the active drug in the pill causes your headache to subside. A pain reliever causes headache relief.

As behavioral scientists, we are often interested in identifying the causes of human behavior. For example, we may seek to understand if social media use among teenage girls causes depression, if exercise causes a reduction in the risk for dementia among older adults, or if a 4-day work week causes better employee well-being.

In this Module, we will focus on causal relationships between two variables for which there is clear directional influence. That is, where a change in one variable causes a change in the other variable. We will call the former variable (i.e., where the change originates) the treatment variable and the latter variable (i.e., the variable that may change as a result of the treatment) the outcome variable. Often, analysts will refer to the treatment variable as X and the outcome variable as Y; in this way, we can represent the causal effect as:

\[X \rightarrow Y\]

A parallel universe for the ultimate experiment

To explore the key principles needed to uncover a causal effect, let’s imagine two people who have recently experienced a traumatic event — Billy and Erica.

In this imaginary world, a Psychologist has developed a new treatment for people experiencing stress following a substantial trauma. The purpose of the treatment is to prevent post-traumatic stress disorder (PTSD). She delivers the new treatment to Billy, and following the treatment Billy does not have PTSD. The Psychologist also delivers the new treatment to Erica, and following the treatment Erica does not have PTSD.

Based on these findings, is there evidence for a causal effect of the new treatment?

Now, imagine a parallel universe where Billy does not receive the new treatment, and in this parallel universe Billy has PTSD. In this case, could we assert that the new treatment caused him to recover from the trauma (i.e., the treatment prevented PTSD)?

Imagine a parallel universe where Erica does not receive the new treatment, and in this parallel universe Erica does not have PTSD. In this case, could we assert that the new treatment caused her to recover from the trauma?

Let’s summarize these findings:

For Billy:

  • When the new treatment is given, Billy does not have PTSD

  • When the new treatment is withheld, Billy does have PTSD

  • The outcome if treatment is given IS NOT EQUAL TO the outcome if treatment is withheld

  • Therefore the new treatment caused the outcome (i.e., prevented PTSD) for Billy

For Erica:

  • When the new treatment is given, Erica does not have PTSD

  • When the new treatment is withheld, Erica does not have PTSD

  • The outcome if the new treatment is given IS EQUAL TO the outcome if the new treatment is withheld

  • Therefore the new treatment did not cause the outcome for Erica

Let’s return to the real world — and think about how this example of two people’s experience in two parallel universes could extend to a scientific study. To begin, let’s define the variables of the study:

We will call the new treatment variable \(x_i\). It’s a random variable (different people can have different values), and it equals 1 if the individual received the treatment and 0 if the treatment was withheld from the individual (i.e., the control condition).

We will call the PTSD outcome variable \(y_i\). It’s a random variable and it equals 1 if the individual has PTSD or 0 if the individual does not have PTSD.

Recall that we depict the causal effect: \(X \rightarrow Y\).

Definition of an individual causal effect

When we considered whether the treatment prevented PTSD for Billy and Erica in the imaginary world, we were interested in the individual causal effect — that is, we sought to understand the causal effect of the treatment on the outcome for each person. To do this, we considered two potential outcomes for each individual. That is, presence/absence of PTSD if the treatment is given (potential outcome one) and the presence/absence of PTSD if the treatment is withheld (potential outcome two). Formally, we can write this as:

  • \(y_i( x_i = 1)\) represents the potential outcome under the treatment condition for individual i. In other words, the value of \(y_i\) if \(x_i\)= 1. This is potential outcome one.

  • \(y_i( x_i = 0)\) represents the potential outcome under the control condition (treatment withheld) for individual i. In other words, the value of \(y_i\) if \(x_i\)= 0. This is potential outcome two.

If, for each individual (i), we could observe both of these potential outcomes (like we did for Billy and Erica in the imaginary world), then we could simply compute the individual causal effect for each individual by calculating the difference in the two potential outcomes:

Individual causal effect = \(y_i(x_i = 1) - y_i(x_i = 0)\)

For Billy, potential outcome one was equal to 0 (i.e., \(y_i = 0\), he didn’t have PTSD when given the treatment), and potential outcome two was equal to 1 (i.e., \(y_i = 1\), he had PTSD when the treatment was withheld). Therefore his individual causal effect is 0 - 1 = -1.

For Erica, potential outcome one was equal to 0 (i.e., \(y_i = 0\), she didn’t have PTSD when given the treatment), and potential outcome two was equal to 0 (i.e., \(y_i = 0\), she didn’t have PTSD when the treatment was withheld). Therefore her individual causal effect is 0 - 0 = 0.

If the individual causal effect is not equal to zero, then the treatment caused the outcome for the individual. If the individual causal effect is equal to zero, then the treatment did not cause the outcome for the individual.

In this way, the new treatment had a causal effect for Billy, but not Erica. In other words, the treatment made Billy recover from the trauma. But the treatment had no effect on Erica; she didn’t experience PTSD whether given the treatment or not.

Let’s build on this example a bit, and imagine that we could observe both potential outcomes of the new treatment for all 63 characters in the Stranger Things series.

That is, continue to imagine that there are two parallel universes that are exactly the same. In one universe (Universe 1), the Psychologist delivers her new treatment to all of the characters and she observes potential outcome one (\(y_i( x_i = 1)\)). In the parallel universe (Universe 2), she doesn’t give any of the characters the new treatment and she observes potential outcome two (\(y_i( x_i = 0)\)). The table below describes the variables we will consider in this toy example.

Variable         Description
character        Name of character
ptsd_universe1   PTSD when given the treatment (Universe 1), 1 = has PTSD, 0 = does not have PTSD
ptsd_universe2   PTSD when treatment is withheld (Universe 2), 1 = has PTSD, 0 = does not have PTSD

The data frame is presented below.
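Because the interactive table can’t be shown here, below is a hypothetical reconstruction of the parallel data frame that reproduces the counts in the cross tabulation that follows (the actual character names are omitted; this sketch assumes the tidyverse and gtsummary packages are loaded):

library(tidyverse)
library(gtsummary)

parallel <- tibble(
  character = paste("character", 1:63),                              # placeholder names
  ptsd_universe1 = c(rep(0, 48), rep(1, 15)),                        # outcome if treated (Universe 1)
  ptsd_universe2 = c(rep(0, 22), rep(1, 26), rep(0, 4), rep(1, 11))  # outcome if untreated (Universe 2)
)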

Let’s take a look at the cross tabulation of the PTSD potential outcomes.

parallel |> 
  tbl_cross(row = ptsd_universe1, col = ptsd_universe2, percent = "cell")
                  ptsd_universe2
ptsd_universe1    0           1           Total
    0             22 (35%)    26 (41%)    48 (76%)
    1             4 (6.3%)    11 (17%)    15 (24%)
Total             26 (41%)    37 (59%)    63 (100%)

Let’s break down this table (a short code sketch after the list recreates these counts):

  • Of the 63 characters, 11 (i.e., 17% of the total) had PTSD whether or not they received the treatment and 22 (i.e., 35%) didn’t have PTSD whether or not they received the treatment. For these people, there is no individual causal effect of the treatment because their outcome, \(y_i\), under treatment is the same as their outcome under control.

  • 26 people (41% of the characters) didn’t have PTSD when given the treatment (Universe 1), but did have PTSD when the treatment was withheld (Universe 2). The outcome under treatment is not equal to the outcome under control — thus there is an individual causal effect for these characters. This represents exactly what an interventionist would want to see — a beneficial treatment effect.

  • Last, there are 4 people (6.3% of the characters) who had PTSD if they received the treatment, but they didn’t have PTSD if the treatment was withheld. Because the outcome under treatment is different than the outcome under control, there is an individual causal effect of the treatment — but it’s an iatrogenic causal effect. That is, for these 4 people, receiving the treatment caused PTSD.
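Because in this imaginary world we observe both potential outcomes, each character’s individual causal effect can be computed directly. A minimal sketch, assuming the parallel data frame above:

parallel |> 
  mutate(ice = ptsd_universe1 - ptsd_universe2) |>  # individual causal effect
  count(ice)
# ice = -1 for the 26 characters helped by the treatment,
# ice =  0 for the 33 characters with no individual causal effect,
# ice =  1 for the 4 iatrogenic cases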

Definition of an average causal effect

Now, let’s do something a bit different. Rather than examining the individual causal effect for each character, let’s take a look at what’s happening at the aggregate level. We’ll compute the proportion of characters who have PTSD in each universe — Universe 1 where everyone receives the treatment and Universe 2 where no one receives the treatment.

parallel |> 
  select(-character) |> 
  tbl_summary()
Characteristic    N = 63¹
ptsd_universe1    15 (24%)
ptsd_universe2    37 (59%)
¹ n (%)

In Universe 1 where all individuals received the treatment (\(x_i=1\)), 15 of the 63 individuals had PTSD. Therefore, the probability of having PTSD if given the new treatment is 15/63 = 0.24. Another way of saying this is that 24% of the characters had PTSD in Universe 1 — where everyone received the treatment.

In Universe 2 where the treatment was withheld from all individuals (\(x_i=0\)), 37 of the 63 individuals had PTSD. Therefore, the probability of having PTSD if the treatment is withheld is 37/63 = 0.59. Another way of saying this is that 59% of the characters had PTSD in Universe 2 — where no one received the treatment.

Tip

Note that in the summary table above, for each row, the number of people with a score of 1 (i.e., they have PTSD) is presented and the corresponding percentage of 1s is presented. For example, in parallel Universe 1, where all 63 people received the treatment, 15 of the 63 people, or 15/63 = 24% of the people, had PTSD. If you like, you can make the two variables being summarized into factor variables instead of numeric variables, and then the tbl_summary() function will print out the number of people with PTSD and the number of people without PTSD. See the example below:

parallel <-
  parallel |> 
  mutate(ptsd_universe1.f = factor(ptsd_universe1, levels = c(0,1), labels = c("does not have PTSD", "has PTSD"))) |> 
  mutate(ptsd_universe2.f = factor(ptsd_universe2, levels = c(0,1), labels = c("does not have PTSD", "has PTSD"))) 
  

parallel |> 
  select(ptsd_universe1.f, ptsd_universe2.f) |> 
  tbl_summary()
Characteristic            N = 63¹
ptsd_universe1.f
    does not have PTSD    48 (76%)
    has PTSD              15 (24%)
ptsd_universe2.f
    does not have PTSD    26 (41%)
    has PTSD              37 (59%)
¹ n (%)

When we tabulate the occurrence of PTSD in the two Universes, we find that with treatment 24% (i.e., 15 of the 63) of people had PTSD, but without the treatment 59% (i.e., 37 of the 63) of people had PTSD. We can contrast the probability across the two Universes as a Risk Ratio.

A Risk Ratio (RR), also known as a Relative Risk, is a measure used in epidemiology and statistics to quantify the risk of an outcome (such as developing a disease) in a certain group in comparison to another group. Specifically, the Risk Ratio is calculated by dividing the probability of the outcome in the exposed or treatment group by the probability of the outcome in the control or non-exposed group. For instance, if you’re studying the effect of smoking (an exposure) on lung cancer, the risk ratio is the risk of lung cancer in smokers (exposed group) divided by the risk of lung cancer in non-smokers (unexposed group). If the RR is 1, that means there’s no difference in risk between the two groups. If the RR is greater than 1, that suggests an increased risk in the exposed group. If it’s less than 1, that suggests a decreased risk in the exposed group.

To calculate the RR in our Stranger Things example, we take: 0.24/0.59 = 0.41. The risk of PTSD if the treatment is received is about 0.41 times the risk if the treatment is withheld. In other words, the risk of PTSD in those who have undergone the therapy is 41% of the risk in those who have not. This summary of the treatment effect across the 63 characters represents the average causal effect.

Recall that we learned earlier that an individual causal effect exists if the outcome under treatment is different than the outcome under control. Similarly an average causal effect exists if the average outcome under treatment is different than the average outcome under control. If the risk of PTSD was equal in the two Universes, then the risk ratio would be 1 (e.g., if the probability of PTSD was 0.5 in both Universes, then 0.5/0.5 = 1) — therefore, since a risk ratio of 0.41 IS NOT EQUAL TO a risk ratio of 1, we can conclude that there is an average causal effect of the treatment among the Stranger Things characters.
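As a quick check, these probabilities and the risk ratio can be computed directly from the parallel data frame (a sketch, not code from the Module):

p_treated <- mean(parallel$ptsd_universe1)  # risk of PTSD in Universe 1: 15/63 = 0.24
p_control <- mean(parallel$ptsd_universe2)  # risk of PTSD in Universe 2: 37/63 = 0.59
p_treated / p_control                       # risk ratio: about 0.41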

At this point, if you’re thinking that this all seems impossible — that we cannot observe what happens to people in two parallel Universes where treatment is delivered in one and withheld in the other — you’re absolutely right! In real life studies, we cannot observe an individual’s outcome under both treatment conditions.

Factual and counterfactual outcomes

In the real world, for each individual, only one of the potential outcomes can be observed — either they receive the treatment or they don’t, and then we observe their PTSD. That is, we observe the factual outcome (i.e., the potential outcome under the condition the individual received).

For example, if in the real world we give Billy the treatment, then we see his PTSD score after receiving the treatment. Since a person can only experience one treatment condition — we don’t observe the other potential outcome — that is, we don’t see Billy’s PTSD score after having the treatment withheld.

This potential outcome that we do not see is referred to as the counterfactual outcome (i.e., the potential outcome under the condition that was not received). Because we cannot observe both potential outcomes for each individual — that is, we cannot observe both an individual’s factual outcome and their counterfactual outcome, we cannot compute causal effects at the individual level in the real world.

Importantly, note that in the potential outcomes framework — all people have both potential outcomes. When we either assign or withhold treatment — one of the potential outcomes is revealed — the factual outcome, and the other potential outcome remains concealed — the counterfactual outcome.

In the study described, let’s assume that Universe 1 is the real world; then we only have the ability to see what happens to the characters when they receive the treatment. We have a missing data problem because we can’t travel to the parallel universe! Take a look at the data sheet below for the first 20 characters — the sheet signifies that, in the real world, we would observe the character’s factual outcome (i.e., the outcome when they receive treatment), but we don’t see their counterfactual outcome.

The “fundamental problem of causal inference” is that we DO NOT directly observe causal effects for individuals because we can never observe all potential outcomes for a given individual. Holland, 1986.

What can we do absent a parallel universe?

Can we obtain average causal effects absent a parallel universe?

To compensate for this fundamental problem of causal inference, we need to identify a way of approximating the counterfactual outcomes.

What if instead of giving everyone the treatment, we randomly assigned half of the participants to receive the treatment and withheld the treatment from the other half? For example, we could flip a coin for each individual, and assign them to the treatment condition if it comes up heads, and to the control condition if it comes up tails.
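In code, this coin-flipping mechanism is just one random draw per person. A minimal sketch (the seed and object names are illustrative):

set.seed(2014)                        # for reproducibility
n <- 63
x <- rbinom(n, size = 1, prob = 0.5)  # 1 = treatment (heads), 0 = control (tails)
table(x)                              # roughly half of the sample in each condition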

This process describes a randomized experiment or a randomized controlled trial (RCT). Here, the researcher randomly assigns people to receive the treatment condition or the control condition. This is crucial because when treatment assignment is random, the only difference between the treatment and control groups is the treatment itself; all other differences are attributable to random variation. That is, in a randomized experiment the two groups are comparable to one another in every way except the treatment condition they received. And, this includes all pre-treatment characteristics of the participants, whether observed or unobserved. For example, if random assignment to condition is employed, then, on average, there will be no differences in any pre-treatment characteristics between the treatment and control groups (e.g., sex, age, prior trauma, attitudes toward therapy, etc.). In this setting, where all background variables are held equal — comparing the two groups randomly assigned to condition is similar to comparing people in two parallel universes. Everything is the same — it’s just that in one setting treatment is received and in another setting treatment is withheld. Thus, the two groups can be compared on post-treatment variables (i.e., PTSD) and the researcher can be confident that any differences in the outcome are due to the treatment.

In this way, we move away from estimating a causal effect for the individuals, and move toward estimating an average causal effect across groups of individuals (i.e., a comparison of those randomly assigned to the treatment condition, and those randomly assigned to the control condition). Let’s imagine that we carried out this process of randomization to condition for the Stranger Things characters. The data below shows the treatment the individual was randomly assigned to receive, condition_rct, and whether or not they subsequently had PTSD, ptsd_rct.

Here’s a summary of the variables.

Variable        Description
character       Name of character
condition_rct   Treatment condition, 1 = treatment, 0 = control (treatment is withheld)
ptsd_rct        PTSD, 1 = has PTSD, 0 = does not have PTSD


Here is the data frame:
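The interactive data frame isn’t reproduced here, but a hypothetical version matching the counts summarized below would look like this (character names are placeholders):

rct <- tibble(
  character = paste("character", 1:63),
  condition_rct = c(rep(0, 31), rep(1, 32)),  # 31 control, 32 treatment
  ptsd_rct = c(rep(1, 19), rep(0, 12),        # control arm: 19 with PTSD, 12 without
               rep(1, 7), rep(0, 25))         # treatment arm: 7 with PTSD, 25 without
)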


Let’s create factor versions of the \(x_i\) and \(y_i\) variables (i.e., treatment condition and PTSD indicators) to make our summary of results easier to read. Then, we’ll compute the number of people with PTSD by treatment condition.

rct <- rct |> 
  mutate(condition_rct.f = factor(condition_rct, levels = c(0,1), labels = c("control", "treatment"))) |>
  mutate(ptsd_rct.f = factor(ptsd_rct, levels = c(0,1), labels = c("does not have PTSD", "has PTSD")))

rct |> 
  select(condition_rct.f, ptsd_rct.f) |> 
  tbl_summary(by = condition_rct.f,
              label = list(ptsd_rct.f = "PTSD status in the RCT"))
Characteristic            control, N = 31¹    treatment, N = 32¹
PTSD status in the RCT
    does not have PTSD    12 (39%)            25 (78%)
    has PTSD              19 (61%)            7 (22%)
¹ n (%)

By randomly assigning people to treatment condition (i.e., treatment group, control group), we are able to recover the effect. Comparing the bottom row (has PTSD) across condition, we find that 22% of characters randomly assigned to the treatment condition have PTSD post-intervention, compared to 61% of characters randomly assigned to the control condition. Thus, a treatment effect is apparent — i.e., the treatment appears to work to mitigate PTSD.
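The arm-specific risks and their ratio can also be recovered directly. A sketch, assuming the rct data frame above:

rct |> 
  group_by(condition_rct.f) |> 
  summarize(risk_ptsd = mean(ptsd_rct))  # about 0.61 in control, 0.22 in treatment

(7 / 32) / (19 / 31)  # risk ratio of about 0.36, close to the parallel-universe value of 0.41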

It is of interest to note that in this setting, we observe the outcome from parallel Universe 1 for people who were randomly assigned to the treatment condition and we observe the outcome from parallel Universe 2 for people who were randomly assigned to the control condition. Notice this in the table below:

When people are randomly assigned to receive the treatment condition (condition_rct.f == “treatment”), then the PTSD score that we observe (ptsd_rct) is equal to the PTSD score in Universe 1 (where everyone received the treatment) and the PTSD score in Universe 2 (where no one received the treatment) is missing. For these people, the factual outcome is the PTSD score when treatment is received (the Universe 1 outcome) and the counterfactual outcome is the PTSD score when treatment is withheld (the Universe 2 outcome). The counterfactual outcome is missing because in the real world we don’t observe what happens to this group of people when treatment is withheld.

The table below presents the data for people who were randomly assigned to the treatment condition in the RCT.

On the other hand, notice that when people are randomly assigned to receive the control condition (condition_rct.f == “control”), then the PTSD score that we observe (ptsd_rct) is equal to the PTSD score in Universe 2 (where no one received the treatment) and the PTSD score in Universe 1 (where everyone received the treatment) is missing. For these people, the factual outcome is the PTSD score when treatment is withheld (the Universe 2 outcome) and the counterfactual outcome is the PTSD score when treatment is received (the Universe 1 outcome). The counterfactual outcome is missing because in the real world we don’t observe what happens to this group of people when treatment is given.

The table below presents the data for people who were randomly assigned to the control condition in the RCT.

In this way, a randomized experiment presents a solution to the fundamental problem of causal inference. By randomly assigning participants to either a treatment or control group, any differences between these groups besides the treatment are attributed to chance. This includes all pre-treatment characteristics such as age, sex, prior trauma, and more. Any differences in post-treatment variables (e.g., PTSD) are thus confidently attributed to the treatment. This method allows for the estimation of an average causal effect across groups, instead of individual causal effects. The analogy of two parallel universes is used to explain the concept of factual and counterfactual outcomes. In this approach, we observe the factual outcome (the outcome in the real world where treatment is either received or withheld) and estimate the counterfactual outcome (the outcome that could have occurred in the parallel universe). This way, even without a parallel universe, we can approximate counterfactual outcomes and infer causal effects.

A real life example of a RCT

Our Module 16 Example

In Module 16 we studied a real life example of a Randomized Controlled Trial conducted by Hofman, Goldstein & Hullman.

The researchers aimed to investigate how the type of uncertainty interval presented in visualizations affects individuals’ willingness to pay (WTP) for a special boulder. Specifically, participants were randomly assigned to one of two conditions: they either viewed a Confidence Interval (CI) or a Prediction Interval (PI).

Participants in the CI condition were shown a visualization with a CI, which provides a range where the true population parameter is expected to lie with a certain level of confidence (e.g., 95%). In contrast, participants in the PI condition saw a visualization with a PI, which indicates the range where future individual observations are expected to fall. The primary outcome of interest was WTP, measured in “ice dollars,” which quantified how much participants were willing to pay for the boulder based on the visualization they saw.

Thus, the treatment indicator (i.e., X) in this example was the variable called interval_CI and the outcome (i.e., Y, willingness to pay) was called wtp_final. In analyzing the data, we found a statistically significant average causal effect of interval type on willingness to pay — where participants who viewed the CI paid, on average, about 29 ice dollars more for the special boulder than participants who viewed the PI. We interpreted this as an indication that the way information is presented (using CIs versus PIs) has a measurable impact on people’s willingness to pay. Specifically, the CI visualization appeared to make the special boulder seem more appealing, leading to a higher willingness to pay. This could be because the narrower range of the CI gave a more precise estimate of the expected improvement, which may have increased confidence in the effectiveness of the special boulder in the eyes of the participant.

Causal Inference from Randomized Controlled Trials

One of the fundamental strengths of the Hofman et al. study lies in the random assignment of participants to the CI or PI conditions. Because the type of interval observed was randomly assigned, any differences in WTP between the two groups can be attributed to the type of interval presented, rather than to other confounding variables. This randomization ensures that, on average, the two groups are comparable in all respects except for the intervention they received (the type of interval). Consequently, any observed difference in WTP can be causally attributed to the type of interval.

In randomized controlled trials (RCTs) like this one, the process of randomization mitigates the influence of confounding variables by evenly distributing them across treatment groups. This allows researchers to isolate the effect of the treatment (here, the type of interval) and make strong causal inferences. The design of the Hofman et al. experiment, therefore, provides robust evidence that the type of interval observed causally affects participants’ WTP, free from the biases that might plague observational studies.

What if participants choose the treatment or condition?

Reflecting back on the Stranger Things RCT, let’s imagine that instead of randomly assigning the characters to the treatment or control condition, we allowed each character to choose whether to receive the therapy. Here, there is no random assignment to treatment or control groups. Therefore, there could be differences between the two groups, other than condition, that cause PTSD status at the end of the study. For instance, people who choose to receive the therapy might be more motivated to improve, have greater resources, be healthier to begin with, or have more supportive networks than those who do not. All these variables (and potentially many others) are potential confounding variables and could influence whether an individual has PTSD at the end of the study, making it difficult to establish whether the therapy itself is responsible for a reduced likelihood of PTSD. A confounding variable, also known as a confounder, is an external factor in a study that may cause both the predictor (i.e., treatment received) and the outcome (i.e., PTSD).

This problem is known as selection bias, which refers to the bias introduced by the selection of individuals, groups, or data for analysis in such a way that proper randomization is not achieved. When selection bias occurs, the sample obtained is not representative of the population intended to be analyzed. In the potential outcomes framework — you can think of selection bias occurring when pretreatment variables make an individual more likely to select a certain treatment condition, and therefore, we’re more likely to see one potential outcome rather than the other. For example, if Stranger Things characters who were more motivated to improve their mental health prior to the therapy were more likely to choose the therapy — then we will be more likely to see the potential outcome under treatment when motivation is high, and we will be more likely to see the potential outcome under control when motivation is low. Thus, when treatment condition is chosen by the participant, then treatment condition is related to which potential outcome we observe — and thus selection bias threatens the validity of our study.

When analyzing the results of a study in which people chose the condition themselves, if you simply compare the outcomes of those who chose the treatment/exposure with those who rejected the treatment/exposure, the difference you see might be due to selection bias, not the treatment condition itself. This makes it challenging to draw conclusions about the causal effect of the treatment.

For this reason, RCTs are often considered the gold standard for determining causal relationships, because the random assignment of participants to treatment or control groups helps to control for both known and unknown confounding variables, minimizing the impact of selection bias. And, when treatment is randomly assigned, it is, by design, unrelated to which potential outcome we observe.

Benefits and barriers to RCTs

Benefits of randomization of participants to condition

  • Eliminates selection bias: By randomly assigning participants, researchers can eliminate selection bias, which occurs when the assignment of participants to groups is related to the outcome of interest. Randomization ensures that the groups are comparable at the beginning of the study, allowing for a fair comparison of the treatment’s effect.

  • Balances both observed and unobserved variables: Randomization helps balance not only observed variables, such as age, gender, or income, but also unobserved variables that researchers may not even be aware of. This is particularly important because unobserved variables can also confound the relationship between treatment and outcome.

  • Facilitates statistical analysis: Since randomization balances background variables across groups, it simplifies the statistical analysis of the results. This allows researchers to estimate the causal effect of the treatment using relatively straightforward statistical methods — for example, comparing two means or two proportions (see the sketch after this list).

  • Increases internal validity: Randomization helps to ensure that any observed differences in outcomes between the treatment and control groups are due to the treatment itself, increasing the internal validity of the study. Internal validity in this context refers to the accuracy of conclusions drawn about the causal effect of the treatment within the study population.
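For instance, the two PTSD proportions from the toy RCT above can be compared with one call to base R’s prop.test(), which tests whether two proportions are equal (a sketch using the counts reported earlier):

prop.test(x = c(7, 19), n = c(32, 31))  # 7/32 with PTSD under treatment vs. 19/31 under control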

Challenges to randomization of participants to condition

For all of the reasons just listed, randomized experiments are considered the gold standard for establishing causality in scientific research. However, there are several reasons why randomized experiments may not always be feasible or ethical, and researchers may need to rely on observational data to estimate causal effects.

  • Ethical concerns: In some cases, it may be unethical to randomly assign individuals to different treatments or control groups. For example, it would be unethical to randomly expose individuals to harmful substances (i.e., when studying exposure to a chemical) or withhold potentially life-saving treatments from patients in need.

  • Practical constraints: Randomized experiments can be expensive, time-consuming, and difficult to implement, especially in large populations or over long periods. In some cases, the necessary resources and infrastructure may not be available to conduct such studies.

  • Logistical issues: Randomly assigning participants to treatment and control groups can be challenging when studying populations that are difficult to access, such as isolated communities or certain demographic groups.

  • External validity: Randomized experiments may have limited generalizability if the study population or setting is not representative of the broader population or context in which the intervention would be implemented. This can be particularly true for studies conducted in controlled settings, such as laboratories.

  • Rare events or long-term outcomes: Randomized experiments may not be suitable for studying rare events or long-term outcomes, as they often require large sample sizes and extended periods of observation.

How can we estimate causal effects when randomized experiments aren’t possible?

When randomized experiments are not feasible or appropriate, researchers can consider whether observational data can be used to estimate a causal effect of interest. Observational studies examine the relationship between an exposure or intervention and an outcome without the researcher actively manipulating who receives the exposure or treatment. That is, the researcher does not randomly assign participants to groups. This presents a major challenge that must be carefully dealt with. Specifically, if we seek to estimate a causal effect of an exposure or treatment when random assignment to the exposure or treatment isn’t performed, we must use alternative techniques to ensure that the groups (i.e., treatment condition versus control condition, or group exposed to an event versus those not exposed) are comparable. This is typically accomplished through statistical adjustment.

Confounders

A confounder, in the context of research and statistics, is a variable that influences both the independent variable (cause) and the dependent variable (effect) in a study, leading to an inaccurate or misleading conclusion about the relationship between the two. Confounders can create the illusion of a causal effect when none exists, or they can mask an actual relationship between the variables being studied. An extreme example of confounding where no real relationship exists between two variables can be found in the classic case of ice cream sales and shark attacks.

Suppose a study observes that when ice cream sales increase (i.e., the \(x_i\) variable), the number of shark attacks also increases (i.e., the \(y_i\) variable). At first glance, it may seem that there’s a direct relationship between ice cream sales and shark attacks. However, in reality, there’s no causal relationship between these two variables. Instead, a confounding variable — the weather, specifically warm temperatures — is responsible for the apparent connection.

During warmer months people seek out cold treats and ice cream sales increase. At the same time, warmer weather attracts more people to swim in the ocean, resulting in a higher likelihood of shark encounters. The confounding variable, in this case, is the weather, which affects both ice cream sales and shark attacks, creating the illusion of a relationship between the two variables when none actually exists.
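A tiny simulation makes this concrete: when warm weather drives both variables, the two outcomes end up correlated even though neither causes the other. All numbers below are made up for illustration:

set.seed(7)
temperature <- rnorm(365, mean = 20, sd = 8)                 # the confounder
ice_cream   <- 100 + 10 * temperature + rnorm(365, sd = 30)  # sales depend on temperature only
sharks      <- 1 + 0.2 * temperature + rnorm(365, sd = 2)    # attacks depend on temperature only
cor(ice_cream, sharks)                                       # strongly positive, with no causal link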

We can draw a diagram of this confounded relationship as follows. Notice that there is no arrow connecting ice cream sales and shark attacks because ice cream sales does not actually cause shark attacks.
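One way to draw this diagram in R is with the dagitty package (an assumption on my part; the Module’s figure may have been produced differently):

library(dagitty)

dag <- dagitty("dag { weather -> ice_cream_sales ; weather -> shark_attacks }")
plot(graphLayout(dag))  # weather points to both variables; no arrow connects them to each other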

For a variable to be considered a confounder, it must influence both X (e.g., ice cream sales) and Y (e.g., shark attacks). If it only affects one of them, it is not a confounder and doesn’t complicate the estimation of the causal effect of X on Y. There are three scenarios in which a variable (we’ll refer to it as variable A) does not confound the relationship between X and Y:

  1. Variable A causes X, but not Y: In this case, variable A is not a confounder because it does not affect the outcome variable Y. For example, in a study examining the effect of social media use (X) on mental health (Y), variable A might be the availability of high-speed internet, which influences social media use but has no direct effect on mental health.

  2. Variable A causes Y, but not X: Here, variable A is not a confounder since it does not influence the exposure variable X. For instance, in a study investigating the relationship between parenting style (X) and child academic performance (Y), variable A could be the child’s innate intelligence, which affects their academic performance but not the parenting style employed.

  3. Variable X causes variable A, and in turn, A causes Y (i.e., an intermediate variable): In this scenario, variable A is not a confounder but rather a mediator or an intermediate variable. It is part of the causal pathway between X and Y. For example, in a study assessing the impact of cognitive-behavioral therapy (CBT — the X variable) on depressive symptoms (Y), variable A might be improved problem-solving skills. The CBT intervention might enhance a participant’s problem-solving skills (A), which in turn reduces their depressive symptoms (Y). However, A does not confound the relationship between CBT and depressive symptoms because it is an outcome of the treatment and not an external factor influencing both X and Y.

Turning back to situations of confounding, it is important to note that in most research scenarios, confounding isn’t as extreme as the ice cream sales and shark attacks example. More commonly, a causal relationship may indeed exist between the variables of interest, but the presence of confounding variables makes it challenging to accurately estimate the causal effect.

Let’s consider three examples from behavioral science:

  1. The relationship between stress and academic performance. Suppose a study is conducted to investigate the impact of stress on academic performance. The researchers find that students with higher stress levels perform worse academically. However, a confounding variable in this study might be the students’ sleep quality. Poor sleep quality can independently affect both stress levels and academic performance, making it difficult to accurately determine the direct causal effect of stress on academic performance.

  2. The relationship between physical exercise and cognitive function. A study might examine the association between regular physical exercise and cognitive function in older adults. The results show that those who exercise regularly have better cognitive function. However, a confounder in this study could be socioeconomic status. Higher socioeconomic status might allow individuals to have better access to exercise facilities and a healthier lifestyle, as well as greater access to resources that promote cognitive function. In this case, socioeconomic status is a confounding variable that can influence the observed relationship between physical exercise and cognitive function.

  3. The relationship between social media use and adolescent depression. Imagine a study is conducted to assess the impact of social media use on depression among adolescents. The researchers find that adolescents who use social media more frequently exhibit higher rates of depression. However, a potential confounding variable in this study could be “social isolation.” Social isolation might lead adolescents to use social media more frequently as a substitute for face-to-face interaction. At the same time, social isolation itself can be a contributing factor to depression. In this scenario, social isolation is a confounding variable because it is related to both the exposure — X (social media use) and the outcome — Y (adolescent depression), making it difficult to discern whether the relationship between social media use and depression is causal or whether it’s influenced by the levels of social isolation among the adolescents.

In summary, confounders can distort or even mask the true relationship between the independent and dependent variables, leading to misleading conclusions. To ensure valid and reliable findings when randomization is not possible, researchers must carefully design their studies to control for potential confounders. In the remainder of this Module, we will examine how an analyst can control (i.e., adjust) for confounders by including them as additional predictors of the outcome in a multiple regression model in which the outcome is regressed on the treatment/exposure. By accounting for and addressing confounding variables, researchers can enhance the validity of their findings and better understand the true nature of the relationship between the variables under investigation. Through careful consideration and measurement of confounding variables, and valid inclusion of the confounders into the analysis, we will see in this Module how estimation of causal effects with observational data can be possible.

A case study

The history between Ukraine and Russia is complex, characterized by centuries of political, cultural, and social intertwinement. The two nations share a common Slavic heritage, with Kievan Rus’ — a medieval state existing from the 9th to the 13th centuries — often considered the cradle of both Ukrainian and Russian cultures. However, tension has long simmered between the two nations, particularly as Ukraine sought to establish its own identity separate from Russia. Historically, Ukraine has repeatedly faced subjugation by Russia, including incorporation into the Russian Empire in the 18th century and the Soviet Union in the 20th century. The dissolution of the Soviet Union in 1991 granted Ukraine independence, but tensions resurged as Russia continued to exert influence over the country. In 2014, when Ukraine’s pro-Russia president, Viktor Yanukovych, lost power, Russia took control of Crimea and refused to support the new Ukrainian government, which leaned towards the European Union and Western values. To mitigate their loss of power in Ukraine, Russia aided separatist groups in eastern Ukraine while simultaneously employing disinformation campaigns to sway public opinion in their favor — using methods that often relied on false or twisted information.

Television was one method that Russia used to disseminate their propaganda. In 2014, television served as the primary news source for over 90% of Ukrainians, and Russian propaganda exploited this by disseminating messages through Russian TV channels, attempting to convince Ukrainians that Russia and pro-Russian politicians in Ukraine had the country’s best interests at heart. To counter the influence of Russian media on their domestic affairs, the new Ukrainian government banned broadcasts of Russian state-controlled television. Despite the ban, approximately 21% of Ukrainians still received their news from Russian TV stations.

During the 2014 presidential and parliamentary elections in Ukraine, propaganda intensified. Russia aimed to use misinformation to persuade Ukrainians to vote for pro-Russian candidates. To understand how Russian television covered Ukraine, transcripts of daily news reports from 2010 to 2015 on Channel One, Russia’s most widely viewed TV station, were analyzed. The graph below displays the frequency of Ukraine being mentioned on Channel One news during this period. Before 2014, Ukraine received minimal attention, even during election periods. However, throughout the 2014 presidential and parliamentary elections, Ukraine became the primary topic of discussion. For instance, during the week before Ukraine’s parliamentary election, Russia’s most popular evening news show, Vremia, devoted 31-46% of its weekday broadcast time and 78% of its Sunday news broadcasts to Ukraine. The prevailing storyline on all major Russian channels was noticeably critical of Ukraine’s government and the political parties advocating for stronger ties with the West. News anchors asserted that “extreme nationalists” and “neo-Nazis,” backed by Western nations, were preparing to participate in the parliamentary election to establish a “new order.” They also claimed that pro-Russian opposition faced violent suppression, and that the existing government was an unauthorized “junta.”

Taking advantage of the significant increase in Russian propaganda on state-owned television leading up to the 2014 elections, two New York University Political Science professors, Leonid Peisakhin and Arturas Rozenas, designed a study to investigate the impact of Russian propaganda on Ukraine election outcomes. In their 2018 paper titled Electoral Effects of Biased Media: Russian Television in Ukraine, the authors analyzed the influence of state-owned Russian television on Ukraine’s 2014 election results.

Ukraine presented an ideal setting for this research, as Russian TV broadcasts were available near the Russia-Ukraine border. However, within Ukraine, some areas had good reception and could access Russian TV, while others could not. These different regions were otherwise very similar in terms of population, political inclinations, and economic conditions. Although Russian television exposure was not randomly assigned to participants, there was sufficient natural variation in exposure across potential confounding factors. This allowed the authors to investigate whether exposure to Russian propaganda via television could sway individuals to favor a pro-Russia candidate over a pro-Western candidate. After publishing their study, a follow-up article in the Washington Post was published — if interested, you can access it here.

In the remainder of this Module, we will examine the data from Peisakhin and Rozenas’s paper to study the causal effect of exposure to state-owned Russian television (the treatment/independent variable) on percent of votes for pro-Russian candidates in the 2014 Ukrainian elections (the outcome/dependent variable). The data frame includes data on 3,567 voting precincts in Ukraine. The models fit in Peisakhin and Rozenas’s paper are more advanced than what we’re ready for at this stage — so we’ll consider a simplified version of their fitted models here.

Introduction to the data

Peisakhin and Rozenas’s paper focuses on two national elections held in 2014 when Ukrainian domestic affairs were prominent on Russia’s news agenda. The data file includes election data and other relevant variables pertaining to 3,567 voting precincts in Ukraine. That is, each row of data represents a precinct.

First, we need to load the necessary packages:

library(here)
library(gtsummary)
library(gt)
library(broom)
library(tidyverse)

The data frame we’ll consider includes the following variables:

Variable       Description
precinct       Precinct code
raion.f        Administrative county
r14pres        Percent pro-Russian votes in the 2014 presidential election
r14parl        Percent pro-Russian votes in the 2014 parliamentary election
distrussia     Distance to the Russian border (km) — log scale
ukrainian      Percent Ukrainian speakers from census
r12.f          Deciles of percent pro-Russian votes in the 2012 parliamentary election
turnout12      Percent of voters who voted in the 2012 parliamentary election
registered12   Registered voters in the 2012 parliamentary election — log scale
roads          Road density within 1-km of the polling station — log scale
village        Binary indicator to compare villages to towns and cities
qualityq       Probability of Russian TV reception

Let’s import the data frame.

ukraine <- read_rds(here("data", "ukraine.Rds"))
ukraine |> head()

The variables for this study are more complex than the other studies that we have examined so far. Let’s dig in a little more to learn about the exposure, the outcome, the potential confounders, and other relevant control variables.

The exposure variable: The authors used the Irregular Terrain Model to measure the quality of reception of Russian analog television in Ukraine. All Russian transmitters broadcasting channels that carry news programming and located within 100 kilometers of the area under study were included in the authors’ analyses. Using these data, the authors calculated the probability that a precinct receives Russian analog television — where 0 indicates no chance of reception and 1 indicates certain reception.

The outcome variables: Ukraine has a multiparty system with numerous candidates and political parties, and to formulate the two outcomes from the 2014 elections, r14pres and r14parl, the authors classified all candidates and parties into pro-Russian and pro-Western blocs. That is, the authors coded candidates and parties as pro-Western if they publicly advocated for Ukraine’s membership in the European Union or NATO or promoted the strengthening of economic, social, or military ties with Western and Central Europe. In contrast, those candidates and parties calling for closer relations with Russia were coded as pro-Russian. For presidential contenders, they labeled all those who served exclusively in the Viktor Yushchenko or Yulia Tymoshenko administrations or who were active on the side of the anti-Yanukovych protesters during the Euromaidan protests as pro-Western. Those who served exclusively in the Yanukovych government were labeled as pro-Russian.

Confounders: The strength of Russian television signal tends to improve in the immediate vicinity of the Russian border, thus, this geographic variation might in some way correlate with political behavior and as a result confound the effect of Russian propaganda on election outcomes. In other words, geography is a potential confounder. Two variables to assess geography are included in the data frame: raion.f, which is the county of each precinct, and distrussia, which captures the distance of the precinct to the Russian border. These variables may be considered confounders because they may cause both the exposure variable (reception to state-owned Russian TV) and the outcomes (support of pro-Russian candidates in the 2014 presidential and parliamentary election results). That is, the locale of the precinct and the precinct’s proximity to the Russian border may both affect the degree of Russian TV reception (e.g., closer proximity to Russia increases the likelihood that a household can pick up Russian TV) and voter’s support for pro-Russian candidates (e.g., closer proximity to Russia could mean closer ties to Russia and/or could mean greater awareness of the Russian-Ukraine conflict).

Other control variables: When estimating a causal effect, it’s essential to control or adjust for confounders to obtain an accurate representation of the relationship between the exposure and the outcome. Additionally, an analyst might want to control for other variables that may be associated with just the treatment or just the outcome, even if they don’t directly confound the relationship between the exposure and the outcome. There are several reasons why this can be beneficial:

  • Improved precision: Controlling for additional variables can help increase the precision of the estimates by reducing the residual variance, leading to smaller standard errors and narrower confidence intervals.

  • Enhanced understanding of effect modification: Controlling for additional variables related to the outcome can help uncover potential effect modifiers or interactions. These variables may modify the effect of the exposure on the outcome, revealing subgroups for which the treatment effect is stronger or weaker. For example, in the paper, the authors go on to consider support for pro-Russian candidates in the 2012 parliamentary elections as a variable that might modify how much Russian propaganda affects election outcomes in 2014. They find that in precincts where there was a lot of prior support for pro-Russian candidates, the propaganda had the effect of increasing support for pro-Russian candidates in 2014, while in precincts with little support for pro-Russian candidates, the propaganda had the effect of decreasing support for pro-Russian candidates in 2014.

  • Generalizability: By controlling for variables that are associated with the outcome, the analyst can better understand how the results generalize to different populations or contexts, which is essential for applying the research findings in practice. When additional control variables are included, the effect of interest (i.e., the effect of Russian propaganda) represents the treatment/exposure effect holding constant all other variables included in the model. For example, we can estimate the effect of the exposure when comparing two precincts that are the same on important characteristics — e.g., population size, level of infrastructure. Different populations or subgroups may vary significantly in terms of demographics, socioeconomic status, cultural norms, and other characteristics. By controlling for these variables, researchers can ensure that the findings are not specific to a particular subgroup but are relevant to a broader population. This helps in understanding how the treatment effect might vary across different settings or populations.

In thinking about what variables to include as additional predictors in the model, it is crucial to be cautious. Of utmost importance, one should avoid adjusting (i.e., controlling) for variables that are caused by the exposure or treatment being studied, as this can introduce bias into the estimated causal effect. Adjusting for a variable that is caused by the exposure or treatment and is also related to the outcome will result in explaining away part of the effect that the analyst wants to examine. For example, in our case study, we would NOT want to control for number of hours that voters in the precinct watched state-owned Russian television, as hours watched would be an intermediate variable in the causal pathway (i.e., a Ukrainian citizen receives good reception to Russian television, as a result they watch many hours of television and are exposed to Russian propaganda, more exposure to Russian propaganda leads them to vote in a certain way). Moreover, if a variable is caused by both the treatment and the outcome, and an analyst includes it as a control variable, it can in fact create spurious associations, leading to biased estimates of the causal effect. For example, we would NOT want to control for the number of casualties in the precinct during the Russian invasion of Ukraine that began in 2022. Last, we should avoid controlling for too many variables as this can lead to over-adjustment, which can reduce the efficiency of the estimates (make the standard errors and confidence intervals larger than they should be) or introduce multicollinearity issues. This can make it more challenging to interpret the results and draw valid conclusions about the treatment effect. In summary, an analyst should be careful, thoughtful, and judicious in selecting control variables for a multiple linear regression designed to examine a causal effect of some exposure or treatment on an outcome of interest. If you’re interested in this topic of covariate selection, this paper by Dr. Tyler VanderWeele is helpful for identifying which variables to control for in an observational study designed to examine a causal effect.

In our case study, the authors identified a number of other variables they felt necessary to control for in the analysis. These include pre-existing political preferences (pro-Russian vote in the 2012 parliamentary election and voter turnout), the level of economic development (density of road networks within a one kilometer radius of the polling station), population size (number of registered voters), and whether the precinct is rural (i.e., village) or urban (town or city). They also adjusted for the percent of Ukrainian speakers as reported in the most recent (2001) census.
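Putting the pieces together, the kind of adjusted model this Module builds toward regresses the outcome on the exposure plus the confounders and control variables. A simplified sketch using the codebook names from above (the county variable raion.f is omitted for brevity, and the exact specification fit later in the Module may differ):

model_adjusted <- lm(r14pres ~ qualityq + distrussia + ukrainian + r12.f +
                       turnout12 + registered12 + roads + village,
                     data = ukraine)

summary(model_adjusted)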

Descriptive statistics

Let’s first create a table of all variables that will be considered in this examination. There are already variable labels in the data frame — so we can benefit from that in our gtsummary tables. I’ll print those labels here using look_for() from the labelled package.

ukraine |> 
  labelled::look_for() |> 
  select(variable, label) 

I’m going to exclude precinct code and the county from the descriptive table since the listings of these are very long.

ukraine |> 
  select(-precinct, -raion.f) |> 
  tbl_summary(
    statistic = list(
      all_continuous() ~ "{mean} ({sd}) [{min}, {max}]")) |> 
  as_gt() |>
  tab_header(title = md("**Table 1. Descriptive statistics for study variables**"))
Table 1. Descriptive statistics for study variables
Characteristic                                                             N = 3,567¹
Percent pro-Russian votes in the 2014 presidential election                22 (18) [0, 76]
Percent pro-Russian votes in the 2014 parliamentary election               27 (19) [0, 79]
Probability of Russian TV reception                                        0.11 (0.15) [0.00, 0.85]
Distance to the Russian border (km) -- log scale                           3.88 (0.83) [0.12, 5.20]
Percent Ukrainian speakers from census                                     79 (27) [2, 100]
Deciles of percent pro-Russian votes in the 2012 parliamentary election
    1                                                                      358 (10%)
    2                                                                      357 (10%)
    3                                                                      356 (10.0%)
    4                                                                      358 (10%)
    5                                                                      358 (10%)
    6                                                                      357 (10%)
    7                                                                      355 (10.0%)
    8                                                                      356 (10.0%)
    9                                                                      359 (10%)
    10                                                                     353 (9.9%)
Registered voters in the 2012 parliamentary election -- log scale          6.66 (0.91) [3.85, 7.83]
Percent of voters who voted in the 2012 parliamentary election             59 (10) [32, 98]
Road density within 1-km of the polling station -- log scale               3.35 (0.77) [0.00, 5.11]
Binary indicator to compare villages to towns and cities                   1,977 (55%)
¹ Mean (SD) [Min, Max]; n (%)

Scatterplot between key predictor and key outcome

Let’s take a look at the relationship between our exposure and the key outcome we will consider here (results of the 2014 presidential election). On this graph, I request the linear best fit line using geom_smooth() with method = "lm" (colored blue), and a loess smooth using geom_smooth() with method = "loess" (colored orange). The loess smooth is a non-parametric smoothing technique that captures local patterns in the data without imposing a strict linear assumption. It can help identify non-linear trends, such as curves or bends, that might not be apparent from the linear best fit line alone.

ukraine |> 
  ggplot(mapping = aes(y = r14pres, x = qualityq)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "#2F9599") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "#F26B38") +
  theme_minimal() +
  labs(title = "Is there a linear relationship between Russian TV reception and \npro-Russian votes in the 2014 Ukrainian Presidential election?",
       subtitle = "Blue line represents linear best fit, orange line represents loess smooth",
       x = "Quality of Russian television reception in the precinct",
       y = "Percent of pro-Russian votes in the precinct")

The graph visualizes the relationship between Russian TV reception and pro-Russian votes in the 2014 Ukrainian Presidential election. By including both the linear best fit line and the loess smooth line, we can assess the adequacy of the linear assumption and capture any non-linear patterns that might exist in the data.

The linear best fit line (blue) is obtained through linear regression, which assumes a linear relationship between the variables. It summarizes the overall trend and direction of the relationship. If this line closely aligns with the data points, it suggests that a linear model may be appropriate for describing the relationship between the variables.

However, with large data frames, it can be challenging to discern whether the linear best fit line adequately captures the underlying relationship. This is where the loess smooth line (orange line) becomes valuable.

By including both the best fit line and the loess smooth line in the graph, we gain a more comprehensive understanding of the relationship between Russian TV reception and pro-Russian votes. We can assess whether the linear assumption holds reasonably well, or if a non-linear model might be more appropriate to capture the underlying patterns in the data. In this case, the linear assumption seems reasonable.

Center predictors for analysis

Before we fit the regression models, I am going to center the numeric confounders and control variables at the mean. When predictor variables are centered at their mean, the intercept term in the linear regression model represents the predicted response when all the predictor variables are at their mean values. This makes the intercept more meaningful and easier to interpret in practical terms. I will leave the TV reception variable, qualityq, uncentered, since in its original form a value of 0 signifies a precinct without any reception, which is a meaningful quantity.

ukraine_centered <- 
  ukraine |> 
  mutate(across(c(distrussia, ukrainian, turnout12, registered12, roads, village), ~ . - mean(., na.rm = TRUE)))

There are two factor variables that will be used as predictors. The default in R is to treat these as dummy-coded indicators when included as predictors in a regression model. Instead, I’ve changed these to effect-coded indicators, which is accomplished with the contrasts() assignments that follow. When using the contr.sum (i.e., effect coding) method for contrast coding in R, the intercept term represents the grand mean, that is, the overall average outcome across all levels of the factor variable, rather than the mean of a selected reference group (as is the case for dummy coding). If interested, you can learn more about this here.
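
As a quick illustration of the difference between the two coding schemes (a side demonstration, separate from our analysis), we can print the contrast matrices that base R constructs for a four-level factor:

# Dummy coding (R's default): each column contrasts one level with the reference
contr.treatment(4)

# Effect coding: each column sums to zero, so the intercept becomes the grand mean
contr.sum(4)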

contrasts(ukraine_centered$raion.f) <- "contr.sum"
contrasts(ukraine_centered$r12.f) <- "contr.sum"
ukraine_centered |> 
  skimr::skim()
Data summary
Name ukraine_centered
Number of rows 3567
Number of columns 12
_______________________
Column type frequency:
character 1
factor 2
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
precinct 0 1 6 6 0 3567 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
raion.f 0 1 FALSE 66 Kha: 641, Che: 189, Sum: 167, Rom: 89
r12.f 0 1 FALSE 10 9: 359, 1: 358, 4: 358, 5: 358

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
r14pres 0 1 22.48 17.69 0.00 6.02 16.92 37.89 75.71 ▇▂▅▂▁
r14parl 0 1 26.73 19.02 0.00 8.84 21.96 45.16 78.90 ▇▃▃▃▁
qualityq 0 1 0.11 0.15 0.00 0.02 0.04 0.15 0.85 ▇▂▁▁▁
distrussia 0 1 0.00 0.83 -3.75 -0.42 0.03 0.60 1.32 ▁▁▂▇▇
ukrainian 0 1 0.00 27.05 -76.60 -5.75 14.65 19.26 21.48 ▁▂▁▂▇
registered12 0 1 0.00 0.91 -2.81 -0.70 0.09 0.89 1.17 ▁▂▅▅▇
turnout12 0 1 0.00 9.74 -27.31 -7.26 -1.04 6.58 38.64 ▁▇▆▂▁
roads 0 1 0.00 0.77 -3.35 -0.51 0.05 0.59 1.77 ▁▁▅▇▃
village 0 1 0.00 0.50 -0.55 -0.55 0.45 0.45 0.45 ▆▁▁▁▇

Fit regression models to predict 2014 presidential election results

Model 1: Without confounders

First, let’s fit a naive model that ignores potential confounders. Here, we regress r14pres (i.e., the percent of pro-Russian votes in the 2014 presidential election in the precinct) on qualityq (i.e., the strength of Russian television reception in the precinct).

fit1 <- lm(r14pres ~ qualityq, data = ukraine_centered)
fit1 |> tidy(conf.int = TRUE, conf.level = .95) |> select(term, estimate, std.error, conf.low, conf.high)

The intercept in this model is the predicted percent of pro-Russian votes in a precinct with 0 (no) Russian television reception. Therefore, if a precinct has no reception, we predict that 17.1% of the votes in the precinct for the 2014 Presidential election will be for a pro-Russian candidate. The regression coefficient for qualityq quantifies the predicted change in the percent of pro-Russian votes for a one unit increase in the quality of reception. Because qualityq ranges from 0 to 1, where 0 indicates a zero probability of reception and 1 indicates certain reception, a one unit increase in qualityq essentially contrasts precincts with zero probability of reception to those with certain reception. Therefore, we predict that the vote share for pro-Russian candidates will increase by nearly 48 percentage points if the precinct surely has Russian television reception. Using the equation, the model predicts that about 65% of the votes in the precinct will be for a pro-Russian candidate if the precinct has a qualityq score of 1 (that is, \(17.1 + 47.7 = 64.8\)).
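
We can confirm this arithmetic directly from the fitted model. This quick check uses predict() with fit1 from above:

# Predicted pro-Russian vote share at no reception (0) vs. certain reception (1)
predict(fit1, newdata = data.frame(qualityq = c(0, 1)))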

But, of course, this effect does not consider the confounders and other important control variables. Therefore, we fit a second model that adjusts for the confounders, and a third model that adjusts for the confounders and the additional relevant control variables discussed by the authors.

Model 2: With confounders

In the second model, we add the identified confounders — namely the variables that control for geography.

fit2 <- lm(r14pres ~ qualityq + distrussia + raion.f, data = ukraine_centered)
fit2 |> tidy(conf.int = TRUE, conf.level = .95) |> select(term, estimate, std.error, conf.low, conf.high)

In the second model, the intercept represents the predicted pro-Russian vote for precincts with no Russian television reception, an average score on distance to Russia, and averaged across raions. Notice that in the second model, the effect of interest (the estimate for qualityq) has been substantially reduced, from an effect of nearly 47.7 percentage points to an effect of about 8.8 percentage points. This large reduction in the effect estimate of exposure to Russian television indicates that much of the initial observed effect was confounded by geography. By controlling for geographic factors, we now estimate that the difference in pro-Russian votes between precincts with and without Russian TV reception is 8.8 percentage points.
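
A quick way to see this attenuation is to extract the qualityq coefficient from each fitted model side by side:

# Estimated effect of TV reception before and after adjusting for geography
c(model_1 = unname(coef(fit1)["qualityq"]),
  model_2 = unname(coef(fit2)["qualityq"]))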

Model 3: With confounders and other control variables

In the third model, we add the relevant control variables.

fit3 <- lm(r14pres ~ qualityq + raion.f + distrussia + ukrainian + registered12 + r12.f + turnout12 + village + roads, data = ukraine_centered)
fit3 |> tidy(conf.int = TRUE, conf.level = .95) |> select(term, estimate, std.error, conf.low, conf.high)

In the third model, the intercept represents the predicted pro-Russian vote share for precincts with no Russian television reception and average scores on all numeric confounders and control variables, and averaged across the effect-coded variables. Notice that the effect of exposure to Russian television decreases slightly once these additional control variables are included.

For example, using the results from the third model, if we compare two precincts that are identical in all aspects except for their Russian television reception, we can observe the following: one precinct receives no Russian television reception (qualityq = 0), while the other has very strong Russian television reception (qualityq = 1). We expect the precinct with very strong reception (and presumably greater exposure to Russian propaganda) to have a pro-Russian vote share approximately 7.2 percentage points higher than the precinct with no reception. The standard error is relatively small, leading to a 95% confidence interval that ranges from 4.9 to 9.5 percentage points, which does not include 0.
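
The estimate and interval quoted here can be pulled straight from the fitted model object:

# Adjusted effect of Russian TV reception with its 95% confidence interval
coef(fit3)["qualityq"]
confint(fit3, parm = "qualityq", level = 0.95)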

The authors describe the effect as follows:

The goal of this article was to evaluate how conspicuously biased media impacts mass electoral behavior in a highly polarized political environment. We find consistent evidence that Russian television had a major impact on electoral outcomes in Ukraine by increasing electoral support for pro-Russian political candidates and parties.

Check the linearity assumption

We interpreted the effect of interest in this model — i.e., the effect of qualityq holding constant the confounders and other control variables — as the expected change in percent vote for pro-Russian candidates for a one-unit increase in the quality of Russian TV reception in the precinct. This assertion assumes that the relationship between the predictor and the outcome, holding constant the other variables, is indeed linear. Earlier, we examined a scatter plot of the raw data — and the linear assumption seemed reasonable. Now, let’s examine whether the assumption holds after controlling for the confounders and the other control variables.

Let’s begin with an Added Variable Plot.

y_resid <- 
  lm(r14pres ~ raion.f + distrussia + ukrainian + registered12 + r12.f + turnout12 + village + roads, data = ukraine_centered) |> 
  augment(data = ukraine_centered) |>   # use the same (centered) data the model was fit to
  select(precinct, r14pres, qualityq, .resid) |> 
  rename(y_resid = .resid)              # residual of the outcome given the covariates

x_resid <- 
  lm(qualityq ~ raion.f + distrussia + ukrainian + registered12 + r12.f + turnout12 + village + roads, data = ukraine_centered) |> 
  augment(data = ukraine_centered) |>   # again, augment with the centered data
  select(precinct, .resid) |> 
  rename(x_resid = .resid)              # residual of the exposure given the covariates

check <- 
  y_resid |> 
  left_join(x_resid, by = "precinct") 

check |> 
  ggplot(mapping = aes(y = y_resid, x = x_resid)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "#2F9599") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "#F26B38") +
  theme_minimal() +
  labs(title = "Does the linear relationship between Russian TV reception and \npro-Russian votes in the 2014 Ukrainian Presidential election hold after \naccounting for covariates?",
       subtitle = "Blue line represents linear best fit, orange line represents loess smooth",
       x = "Residual for quality of Russian television reception in the precinct",
       y = "Residual for percent of pro-Russian votes in the precinct")

The linear assumption appears to hold here as well. Therefore, we are on solid ground for using a linear model to relate our exposure (qualityq) to our outcome (r14pres) after accounting for the confounders and control variables.
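
If you’d rather not build the added variable plot by hand, the car package offers a convenience function that produces the same kind of display (assuming you have car installed):

# Added variable plot for the exposure of interest, built automatically
car::avPlots(fit3, terms = ~ qualityq)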

Recall that another useful plot for examining the assumption of linearity is to examine a scatterplot of the .fitted values and the .resids from the fitted model. Let’s create this plot for our third model — which we named fit3.

fit3 |> 
  augment(data = ukraine_centered) |> 
  ggplot(mapping = aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 1, color = "#2F9599") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "#F26B38") +
  xlab("Fitted Values") +
  ylab("Residuals") +
  ggtitle("Fitted Values vs Residuals") +
  theme_bw()

If the relationship between the fitted values and residuals appears random and shows no discernible pattern, it suggests that the linearity assumption for the overall model is reasonably met. If the residuals exhibit a clear pattern or systematic trend as the fitted values change, it indicates a violation of the linearity assumption; in particular, a curved pattern suggests a potential nonlinear relationship between the predictors and the outcome variable. The addition of the loess smooth on this graph helps us discern such patterns. The assumption of linearity appears to be met here as well: the orange loess smooth departs only minimally from the dashed horizontal line at zero.

Create a publication-ready table of the results

We can create a publication-ready table of these findings using tbl_regression() from the gtsummary package. Note that I left out the indicators for county (raion.f) to create a smaller table.

The code below takes the three fitted regression models (fit1, fit2, and fit3), each with a different set of predictor variables, and creates a table of regression results for each using the tbl_regression() function from the gtsummary package. The include argument specifies which of the model's variables to display in the table, while intercept = TRUE includes the intercept term.

After obtaining the tables for each model (out1, out2, and out3), the code merges them into a single table using tbl_merge() from the gtsummary package. The tab_spanner argument specifies the labels for each model in the merged table.

Finally, the as_gt() function is used to convert the table to the gt format. The gt package provides additional styling and customization options for the table — including titles and source notes as demonstrated here.

out1 <- fit1 |> 
  tbl_regression(include = c("qualityq"), intercept = TRUE) |> 
  modify_header(label ~ "**Term**", estimate ~ "**Estimate**")

out2 <- fit2 |> 
  tbl_regression(include = c("qualityq", "distrussia"), intercept = TRUE) |> 
  modify_header(label ~ "**Term**", estimate ~ "**Estimate**")

out3 <- fit3 |> 
  tbl_regression(include = c("qualityq", "distrussia", "ukrainian", "r12.f", "registered12", "turnout12", "village", "roads"), intercept = TRUE) |> 
  modify_header(label ~ "**Term**", estimate ~ "**Estimate**")

tbl_merge(
  tbls = list(out1, out2, out3),
  tab_spanner = c("Model 1", "Model 2", "Model 3")) |> 
  as_gt() |> 
  tab_header(title = md("**Table 2. Fitted regression model to estimate the causal effect of Russian propaganda on the presidential election for Ukraine in 2014**")) |> 
  tab_source_note("Ukrainian counties are included as confounders in Models 2 and 3, but excluded from table")
Table 2. Fitted regression model to estimate the causal effect of Russian propaganda on the presidential election for Ukraine in 2014

Cell entries are Estimate (95% CI), p-value. CI = Confidence Interval.

| Term | Model 1 | Model 2 | Model 3 |
|------|---------|---------|---------|
| (Intercept) | 17 (16, 18), p<0.001 | 20 (20, 21), p<0.001 | 20 (20, 20), p<0.001 |
| Probability of Russian TV reception | 48 (44, 51), p<0.001 | 8.8 (5.9, 12), p<0.001 | 7.2 (4.9, 9.5), p<0.001 |
| Distance to the Russian border (km) -- log scale | | -4.3 (-5.1, -3.6), p<0.001 | -1.6 (-2.3, -1.0), p<0.001 |
| Percent Ukrainian speakers from census | | | -0.03 (-0.04, -0.02), p<0.001 |
| Deciles of percent pro-Russian votes in the 2012 parliamentary election | | | |
|     1 | | | -8.5 (-9.3, -7.7), p<0.001 |
|     2 | | | -7.5 (-8.2, -6.9), p<0.001 |
|     3 | | | -6.7 (-7.3, -6.0), p<0.001 |
|     4 | | | -5.6 (-6.2, -5.1), p<0.001 |
|     5 | | | -3.7 (-4.2, -3.1), p<0.001 |
|     6 | | | -1.4 (-2.0, -0.84), p<0.001 |
|     7 | | | 1.7 (1.1, 2.2), p<0.001 |
|     8 | | | 5.3 (4.7, 5.9), p<0.001 |
|     9 | | | 9.4 (8.7, 10), p<0.001 |
|     10 | | | -- |
| Registered voters in the 2012 parliamentary election -- log scale | | | 0.16 (-0.21, 0.53), p=0.4 |
| Percent of voters who voted in the 2012 parliamentary election | | | -0.02 (-0.04, 0.01), p=0.2 |
| Binary indicator to compare villages to towns and cities | | | -0.94 (-1.5, -0.37), p=0.001 |
| Road density within 1-km of the polling station -- log scale | | | -0.43 (-0.74, -0.12), p=0.006 |

Ukrainian counties are included as confounders in Models 2 and 3, but excluded from the table. With effect coding, no separate estimate is reported for the tenth decile; its coefficient equals the negative sum of the other nine decile coefficients.

Enhanced interpretation of effect of exposure

Rather than interpreting the effect of the exposure as the expected difference in the vote percentage when comparing a precinct with no reception (qualityq equals 0) to a precinct with certain reception (qualityq equals 1), we could instead compute the expected change in the vote percentage for a 1 standard deviation increase in reception strength.

To calculate the expected change in the pro-Russian vote (the outcome) for a 1 standard deviation increase in qualityq, we can use the estimated slope for qualityq from our regression results. Here are the steps for Model 3:

  1. Identify the standard deviation of qualityq: The standard deviation of qualityq is 0.1480122.

  2. Identify the estimated slope for qualityq from our regression results: The estimated slope for qualityq is 7.17268669.

  3. Calculate the expected change in the outcome for a 1 standard deviation increase in qualityq: Multiply the coefficient for qualityq by the standard deviation of qualityq: 7.17268669 × 0.1480122 ≈ 1.06. Therefore, the expected change in the pro-Russian vote (the outcome) for a 1 standard deviation increase in qualityq is approximately 1.1 percentage points. The snippet that follows carries out this calculation in R.
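
Here is that calculation, using sd() on the reception variable (which was left uncentered) and the stored coefficient from fit3:

# Expected change in pro-Russian vote share per 1-SD increase in reception
sd_quality <- sd(ukraine_centered$qualityq, na.rm = TRUE)
coef(fit3)["qualityq"] * sd_quality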

Acknowledgement of study limitations

Importantly, the authors carefully consider potential issues with their analyses that could threaten the validity of their findings.

Two additional concerns regarding identification are worth noting. First, Russia might be building its television transmitters strategically in order to influence Ukrainian voters. According to the data by the International Telecommunication Union, Russia issued 108 new analog television transmitter licenses from 2013 to 2015. None of these new transmitters were placed in the vicinity of the Russian–Ukrainian border. In fact, in June 2015, Russia reduced the power of television transmitters along its border with Ukraine. This is the opposite of what one would expect had Russia been strategically placing its transmitters along the Ukrainian border.

Another potential source of concern is residential self-sorting: Individuals might relocate to places with better (worse) Russian analog television reception if they already have pro-Russian (pro-Western) sympathies and values. This concern is exacerbated by the fact that millions of internally displaced persons (IDPs) moved from the conflict zone in the east to other parts of Ukraine. While this type of self-sorting is possible in theory, there is little empirical support for this notion. The IDPs typically move to settlements where there are jobs and government services geared toward them (primarily cities and large towns), and it is highly unlikely that the IDPs would prioritize the availability of Russian analog television when deciding where to relocate. In addition, the movement of the IDPs began in earnest in the summer of 2014, whereas we identify electoral effects of Russian television as of May 2014. In summary, the overall evidence indicates that the main assumptions behind our research design are well justified. At the same time, as in any observational study, the problem of confounders can never be ruled out conclusively, so the results should be interpreted with caution.

Overall, this is an excellent example of how to cleverly construct a data frame to answer an important research question, and carefully carry out an analysis to estimate a causal effect.

Summary and closing remarks

Estimating causal effects from observational data is a complex endeavor that demands meticulous planning and execution to ensure the validity and reliability of the findings. In contrast to controlled experiments, observational studies rely on naturally occurring data, making it challenging to establish a cause-and-effect relationship.

As renowned statistician Dr. John Tukey once stated,

The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.

This quote reminds us of the importance of approaching causal estimation with caution and rigor.

To estimate causal effects, researchers must carefully design their studies, taking into account various factors such as study population selection, data collection methods, and controlling for confounding variables. Proper study design involves identifying potential confounders and implementing strategies to minimize their influence on the observed relationships.

Addressing confounding variables often involves employing statistical techniques such as well-executed regression models (as utilized in this Module). Advanced techniques like propensity score matching, instrumental variables, regression discontinuity designs, and difference-in-differences models may also be utilized, and I provide some resources at the end of this Module for those readers interested in exploring these methods.

Additionally, meticulous execution of data collection procedures is crucial. Accurate and reliable measurements, appropriate sample sizes, and proper statistical analysis techniques are essential to ensure the robustness of the estimated causal effects. Researchers should also be aware of potential biases that may arise from issues such as selection bias, measurement errors, or unmeasured confounders. These biases can distort the estimated causal effects and undermine the validity of the findings. Mitigating these biases requires careful attention to study design, data collection protocols, and analytical approaches.

In summary, estimating causal effects from observational data is indeed feasible, but it requires a meticulous and thoughtful approach. Through careful planning and execution, researchers can overcome challenges related to confounding variables, biases, and other limitations inherent in observational studies. By doing so, they can derive meaningful and reliable causal insights from the available data.

Wrap up

In this Module, we have delved into the intricate world of causal inference, a cornerstone of empirical research that seeks to unravel the cause-and-effect relationships underlying observed data. Understanding and estimating causal effects is pivotal for advancing knowledge across various fields, particularly in the social and behavioral sciences.

The Essence of Causal Inference

Causal inference distinguishes itself from mere correlation by establishing a directional influence between variables. It enables researchers to identify whether a change in one variable (the cause) directly leads to a change in another variable (the effect). This capability is fundamental for developing interventions, shaping policies, and advancing theoretical understanding.

The Power of Randomized Controlled Trials (RCTs)

RCTs are the gold standard in establishing causality due to their ability to eliminate confounding variables through random assignment. This randomization ensures that any observed differences in outcomes between the treatment and control groups can be attributed to the treatment itself, thus enhancing internal validity. However, ethical, practical, and logistical constraints can limit the feasibility of RCTs, necessitating alternative approaches.

Addressing Confounding in Observational Studies

When RCTs are not feasible, observational studies become essential for causal inference. These studies must meticulously account for confounding variables — factors that influence both the exposure and the outcome. By controlling for confounders, researchers can approximate the counterfactual scenario and estimate causal effects with greater accuracy. The potential outcomes framework, which differentiates between factual and counterfactual outcomes, is instrumental in this process.

Balancing Internal and External Validity

While internal validity ensures the accuracy of causal claims within the study population, external validity pertains to the generalizability of these findings to broader contexts. Achieving a balance between these two forms of validity is crucial for applying research insights to real-world scenarios.

Concluding Remarks

Estimating causal effects is a nuanced and meticulous endeavor, especially when relying on observational data. The quest for answers must be tempered with methodological rigor and a critical eye towards potential biases and confounding factors. Through thoughtful planning, rigorous execution, and the judicious application of statistical techniques, researchers can unlock meaningful insights into causal relationships, thereby contributing to the advancement of knowledge and the betterment of society.

As we conclude this Module, we emphasize the significance of causal inference in research. It empowers us to make informed decisions, develop effective interventions, and build robust theoretical frameworks. By mastering the principles and techniques of causal inference, we enhance our ability to uncover the underlying mechanisms driving observed phenomena, ultimately leading to a deeper and more accurate understanding of the world around us.

Resources

If you’d like to learn more about estimating causal effects with observational data, I highly recommend the following books:

Footnotes

  1. If we reverse the numerator and denominator, we would calculate the RR as: RR = 0.59/0.24 = 2.46. In this case, the risk of PTSD if the treatment is withheld is about 2.46 times the risk if the treatment is received. In other words, the risk of PTSD in those who have not undergone the therapy is 2.46 times the risk in those who have. This also indicates a strong effect of the treatment in reducing the risk of PTSD. You can convert between the two forms of the RR by taking the reciprocal.↩︎

  2. Bloc refers to a group or coalition of political parties that share similar ideologies, goals, or interests.↩︎