library(gtsummary)
library(gt)
library(tidyverse)
Basic Rules of Probability
Module 5
Learning objectives
- Define probability and its measurement scale
- Discuss the meaning of the “data generating process”
- Explain the role of probability in statistics, including its use in quantifying uncertainty
- Differentiate between theoretical and empirical probability and discuss their applications
- Understand the basis of inferential statistics and how it uses probability
- Apply the basic rules of probability, including the addition and multiplication rules
- Describe the role of Bayes’ theorem in updating probabilities
Let’s set off our exploration of probability with this video from Crash Course Statistics.
What is probability and why does it matter to statistics?
Probability is a mathematical concept that measures the likelihood of an event occurring. It is expressed as a number between 0 and 1, where 0 indicates impossibility and 1 indicates certainty. The concept of probability is foundational to statistics because it allows for the quantification of uncertainty and the making of informed predictions about events based on observed or assumed frequencies of occurrence.
Probability matters to statistics for several reasons:
Modeling Uncertainty: Probability provides a framework for modeling uncertainty in real-world phenomena. This is crucial in many areas, such as weather forecasting, risk assessment in finance and insurance, and decision-making under uncertainty in various fields.
Inferential Statistics: Probability is the basis of inferential statistics, which involves making predictions or inferences about a population based on a sample. Probability theory allows statisticians to quantify the uncertainty of these inferences through concepts such as confidence intervals and significance tests.
Hypothesis Testing: Probability is used in hypothesis testing, where a statistical hypothesis is evaluated. The concepts of p-values and significance levels, which are grounded in probability, help determine whether observed data are consistent with a given hypothesis.
Bayesian Statistics: In Bayesian statistics, probability is used to update the probability of a hypothesis as more evidence or data becomes available. This approach relies on Bayes’ theorem and provides a flexible way to incorporate prior knowledge into statistical analysis.
Predictive Modeling: Probability is key to predictive modeling, including machine learning, where the goal is to predict future outcomes based on past data. Probability helps in assessing the likelihood of various possible outcomes, thereby informing decisions and strategies.
Design of Experiments: Probability theory guides the design of experiments and surveys. It helps in determining sample sizes necessary to achieve desired levels of accuracy and confidence in the results.
In summary, probability is essential to statistics as it provides the tools to model, analyze, and make decisions about data under conditions of uncertainty. It enables statisticians to quantify the likelihood of events and to draw conclusions that are informed by data.
In this module, we’ll study several relevant facets of probability — with an eye toward giving you the foundation needed to understand the role of uncertainty in statistics as well as inferential statistics.
What is a data generating process?
Throughout the semester, we’ll often circle back to the concept of the data generating process. The data generating process (DGP) refers to the underlying mechanism or system that produces observed data. It encompasses all factors, conditions, and models that determine how data comes into existence. Understanding the DGP is crucial for both theoretical and applied statistical analyses because it provides insights into the relationships between variables, the distributional properties of the data, and potential causal mechanisms.
How is probability different from statistics?
Probability theory starts with known models or theoretical frameworks to predict the likelihood of various outcomes. It effectively focuses on the Data Generating Process (DGP) — a forward process that begins with the premises or conditions of how data is produced and then predicts the outcomes. For example, if we know a coin is fair, this is part of the DGP. Probability theory allows us to calculate the probability of landing heads in a series of flips — a predictive approach based on predefined rules and models. In this case, each flip of the fair coin has a predefined probability of 0.50 (i.e., a 50% chance) of landing heads and 0.50 (i.e., a 50% chance) of landing tails.
Statistics, on the other hand, operates in reverse. It begins with outcomes or observed data without a priori knowledge of the underlying model or the conditions that generated the data, working backwards from the DGP. The task of statistics is to analyze the data to infer the models, conditions, or processes that could have produced it. This process is inherently inductive, moving from specific instances (data) to general conclusions (models or theories) about the underlying reality. For instance, if a coin is flipped 100 times and lands heads 75 times, we can use statistics to analyze these outcomes to infer whether the coin might be biased towards heads, rather than assuming it was fair as probability theory does. In essence, while probability theory asks, “Given a known model, what outcomes can we expect?” statistics asks, “Given these outcomes, what can we infer about the model or the DGP?”
This inverse operation offered by statistics is crucial for several reasons:
Empirical Inquiry: In real-world scenarios, especially in research and scientific inquiry, we often start with empirical data — observations or experimental results — without a clear understanding of the underlying phenomena. Statistics allows us to make sense of this data, constructing models that explain it or predicting future data points.
Model Testing and Validation: Statistics enables researchers to test the validity of theoretical models against observed data. By comparing the predictions of a model (derived from probability theory) with actual outcomes, statisticians can assess the model’s accuracy and applicability to real-world scenarios.
Inference Under Uncertainty: Unlike probability, which deals with theoretical certainty, statistics grapples with uncertainty and variability in data. It provides methods to estimate the probability of hypotheses, taking into account the randomness and imperfection in data collection and measurement.
Adaptive Learning: Statistics supports the iterative process of learning from data. As new data becomes available, statistical methods allow for the refinement and adjustment of models, enhancing our understanding of the phenomena under study.
Why does a student of statistics need to understand probability?
Given this interplay between statistics and probability theory, having a good working understanding of basic probability is crucial for anyone delving into statistics. This foundational knowledge is not merely academic; it is the bedrock upon which statistical reasoning and analysis are built. Probability concepts such as random variables, probability distributions, expectations, and variance form the core language and toolkit of statistics. Without a firm grasp of these concepts, one will struggle to understand how statistical models are constructed, how data is analyzed, and how inferences about the real world are drawn.
Defining Events
In probability theory, an event represents a specific outcome (or a set of outcomes) from a probability experiment. Events are the fundamental building blocks in the study of probability. For example, in the context of tossing a coin, we might define event A as the coin landing on heads. As another example, in the scenario of drawing cards from a deck, an event could involve drawing a heart (event B) or a face card (event C).
To formalize the study of these events, we often assign them labels, such as event A, event B, and so on, and we seek to determine the likelihood of these events. The probability of an event A, denoted as \(P(A)\), quantifies the chance of event A occurring.
Theoretical vs. Empirical Probability
Theoretical Probability, also known as classical probability, refers to the probability of an event happening based on a theoretical model, without direct observation. This kind of probability is calculated under idealized conditions and is used to determine the likelihood of events in situations where outcomes are well-defined. An example includes the probability (P) of drawing a heart from a standard deck of cards:
\[ P(\text{B}) = \frac{\text{Number of Hearts}}{\text{Total Number of Cards}} = \frac{13}{52} = \frac{1}{4} = .25 = 25\% \]
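Because theoretical probability is just a ratio of counts, we can compute it directly in R. Here’s a minimal sketch of that arithmetic (the object names are illustrative):

n_hearts <- 13
n_cards  <- 52
p_heart  <- n_hearts / n_cards
p_heart        # 0.25
p_heart * 100  # expressed as a percentage: 25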
Empirical Probability (also referred to as experimental probability or observed probability) is calculated based on observed data. The probability of event D, \(P(D)\), is given by the ratio of the number of times event D occurs to the total number of observations.
\[ P(D) = \frac{\text{Number of times event D occurs}}{\text{Total number of observations}} \]
For instance, if a random sample of teenagers was drawn and it was found that 225 out of 300 surveyed teenagers used at least one social media platform, the empirical probability of a teenager using social media is:
\[ \small P(\text{Using social media}) = \frac{\text{Number of sampled teenagers who use social media}}{\text{Total number of teenagers sampled}} = \frac{225}{300} = 0.75 \]
Multiplying this probability by 100 expresses the result as a percentage of the sample: 75% of the adolescents in the sample use social media.
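To see how empirical probability behaves, here’s a small simulation sketch in R; it assumes a hypothetical “true” proportion of 0.75 and an arbitrary seed, and surveys 300 simulated teenagers:

set.seed(11)  # arbitrary seed for reproducibility
survey <- sample(
  x = c("Uses social media", "Does not"),
  size = 300,
  replace = TRUE,
  prob = c(0.75, 0.25)
)
mean(survey == "Uses social media")  # empirical probability; near 0.75, but rarely exactly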
Delineating between theoretical and empirical probability is important for several reasons, each contributing to a deeper understanding and application of probability in different contexts:
Foundation vs. Application:
- Theoretical Probability provides a foundational understanding based on mathematical models, principles, and assumptions. It deals with what should happen in an ideal or perfectly modeled scenario. For example, the theoretical probability of rolling a six on a fair die is 1/6, derived from the assumption that all six outcomes are equally likely.
- Empirical Probability (or experimental probability), on the other hand, is based on actual observations or experiments. It deals with what does happen when an experiment is conducted. For instance, if you roll a die 600 times and get a six 100 times, the empirical probability of rolling a six is 100/600 = 1/6, which may or may not match the theoretical probability, depending on the fairness of the die and the adequacy and rigor of the experiment.
Ideal Conditions vs. Real-World Data:
- Theoretical probabilities are calculated under the assumption of ideal conditions. This can include assumptions of perfect symmetry, fairness, or randomness, which may not always hold true in real-world situations.
- Empirical probabilities are derived from real-world data and therefore incorporate the randomness, bias, or imperfections present in practical scenarios. This makes empirical probability crucial for validating theoretical models and for applications in fields where actual data is paramount.
Predictive Modeling vs. Descriptive Analysis:
- Theoretical probability is often used in predictive modeling, where we want to understand the likelihood of future events based on a set of assumptions. It is essential for designing experiments, simulations, and models that predict outcomes before they are observed.
- Empirical probability is used for descriptive analysis, to describe patterns observed in past data. It is vital for statistical inference, where conclusions about a population are drawn based on sample data.
Understanding Uncertainty:
- Both approaches are crucial for understanding uncertainty in different contexts. Theoretical probability helps us understand the nature of randomness and make predictions in controlled conditions or determine the sample size for a proposed study, while empirical probability allows us to measure and analyze the uncertainty observed in real-world phenomena.
Validation and Refinement of Models:
- A critical aspect of scientific research and statistical analysis is the validation and refinement of theoretical models based on empirical evidence. Discrepancies between theoretical and empirical probabilities can lead to deeper investigations into the assumptions of a model, potential biases in data collection, or the discovery of new phenomena.
In summary, distinguishing between theoretical and empirical probability enhances our ability to interpret outcomes, make predictions, and apply probability concepts accurately across various disciplines. Each approach provides unique insights, and together, they form a comprehensive framework for understanding and applying probability in both theoretical and practical contexts.
Simulation of a Probability Model: Blood Type Example
Blood type classification is an excellent application of probability theory. The ABO system is used to determine an individual’s blood type. It categorizes blood types based on the presence or absence of two antigens, A and B, on the surface of red blood cells. This results in four primary ABO blood types: A, B, AB, and O. Each person possesses exactly one of these blood types, making the types mutually exclusive outcomes, or elementary events, in the context of probability theory.
An elementary event represents a unique outcome within the entirety of a probability experiment’s sample space, and it cannot be decomposed into simpler outcomes. Within the framework of the ABO blood type system, the types A, B, AB, and O epitomize these elementary events in the sample space encompassing all possible ABO blood types. Thus, the sample space here is defined by the set of all ABO blood types: A, B, AB, and O. From this, we might then ask — “What is the probability of a certain blood type (i.e., an event), for example, type O blood?”
The prevalence of these blood types has been determined through large-scale studies, revealing the following population probabilities:
- Probability of having Type A blood: 0.42 (42% of the population)
- Probability of having Type B blood: 0.10 (10% of the population)
- Probability of having Type AB blood: 0.04 (4% of the population)
- Probability of having Type O blood: 0.44 (44% of the population)
For an event X, such as having blood type O, the probability of that event, denoted \(P(X)\), is a value between 0 and 1. The higher the probability, the more likely the event is to occur. Probabilities are never negative. A probability of 0 means the event will never happen, while a probability of 1 means the event will always happen.
Since each individual has exactly one blood type and these types cover all possibilities for anyone’s blood type, the probabilities of the elementary events sum to 1. This summation demonstrates the law of total probability, which states that the sum of the probabilities of all mutually exclusive and exhaustive events (i.e., the elementary events) in a sample space is 1.
So, for example, the sum of the probabilities of blood types A, B, AB and O equals:
\(0.42 + 0.10 + 0.04 + 0.44 = 1.0\).
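We can confirm this with a quick check in R:

blood_type_probs <- c(A = 0.42, B = 0.10, AB = 0.04, O = 0.44)
sum(blood_type_probs)  # 1: the probabilities of the elementary events sum to 1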
Simulation
Simulations serve as powerful tools in applied statistics, enabling us to explore statistical properties, conduct preparatory analyses for forthcoming studies (e.g., power analysis), and gain insights into the behaviors of models under study. A fundamental aspect of simulations is their reliance on data generating processes (DGPs) — the theoretical or conceptual mechanisms that describe how data points are produced. By simulating data, we essentially replicate a simplified version of these underlying processes, allowing us to study their outcomes under controlled conditions.
For our blood type example, the DGP can be thought of as the biological and genetic rules that determine an individual’s blood type. These rules, while complex in reality, can be abstracted into probabilities for simulation purposes, reflecting the relative frequencies of each blood type in a population.
For our initial foray into simulations, we’ll engage with a straightforward example that applies these concepts. We aim to create a data frame representing a random sample of 1000 individuals from the population, with their blood types distributed according to the previously mentioned probabilities. This simulated dataset not only embodies the practical application of probability theory but also illustrates how DGPs are conceptualized and utilized in statistical simulations. By constructing and analyzing this data frame, we’ll gain valuable experience in how statistical tools can be used to model and understand the data generating processes that underpin observable phenomena.
As we progress through this course, we’ll delve deeper into various types of simulations, each designed to mimic different aspects of DGPs, from simple random sampling to more complex models that account for interactions between multiple variables. Through these exercises, we will enhance our understanding of both the power and limitations of simulations in statistical analysis, preparing us to apply these techniques to real-world data challenges.
Let’s load the libraries that we’ll need for this module. You’ve already encountered the gt and tidyverse packages. We’ll also utilize a new package called gtsummary, which creates flexible and highly useful summary tables.
Here’s a code chunk that creates the desired simulated data frame.
set.seed(1234)

# Simulate a random sample of size 1000
blood_types_df <-
  tibble(
    blood_type = sample(
      size = 1000,
      replace = TRUE,
      x = c("A", "B", "AB", "O"),
      prob = c(0.42, 0.10, 0.04, 0.44)
    )
  )

blood_types_df |>
  head(n = 100)
Let’s break down the parts of this code:
- Creating a Data Frame (`tibble()`): The `tibble()` function is used to create a data frame. Here, a data frame called blood_types_df is created with one column/variable named blood_type.
- Sampling (`sample()`): The `sample()` function is used to generate random samples. In this context, it’s used to simulate the blood types of 1000 people. Let’s look at the arguments passed to this function:
  - `size = 1000`: This specifies the number of samples to draw, which is 1000 in this case. It means we’re simulating the blood types for 1000 individuals.
  - `replace = TRUE`: This indicates that sampling is done with replacement. In other words, once a blood type is selected for an individual, it’s not removed from the pool of options for the next individual. This makes sense here since an individual’s blood type doesn’t affect another’s.
  - `x = c("A", "B", "AB", "O")`: This is the vector of possible outcomes or categories from which samples are drawn. Here, it represents the four blood types.
  - `prob = c(0.42, 0.10, 0.04, 0.44)`: This vector provides the probabilities associated with each category in `x`. These values must sum to 1, representing a complete probability space for the blood types. The probabilities correspond to the likelihood of each blood type in the population.
In summary, this code simulates a random sample of 1000 people’s blood types based on the given probabilities for each blood type in the population. This is a practical example of how probabilistic models can be used to simulate real-world scenarios in statistics.
Using this simulated data, we can calculate the proportion of people with each blood type. The count() function calculates the number of cases in the simulated data that have each blood type, then the mutate() function is used to turn the counts into proportions. Here, the proportion of people with each blood type should be very close to the probabilities that we supplied for the simulation. And, in fact, that is precisely what we find.
# Calculate the probability of each blood type in the simulated sample
blood_types_df |>
  count(blood_type) |>
  mutate(proportion = n / sum(n))
With this fundamental understanding of probability, let’s now build upon this foundation to learn the basic rules of probability.
Basic rules of probability
The basic rules of probability help us to understand and calculate the likelihood of various events in a consistent and reliable manner.
Addition Rule of Probability
The addition rule is utilized to determine the probability of any one of several events occurring. For example, for events Y and Z — the addition rule helps us determine the probability of either event Y occurring or event Z occurring. This is often referred to as the “union” (denoted \(\cup\)) of two events.
When events are mutually exclusive
For mutually exclusive events, meaning no two events can occur at the same time, the addition rule is applied by summing the probabilities of each individual event. Therefore, for two mutually exclusive events (Y and Z), the formula for the addition rule is as follows:
\[ P(Y \text{ or } Z) = P(Y \cup Z) = P(Y) + P(Z) \]
The example that we just considered — ABO blood type — is a prime example of this. As mentioned earlier, these four ABO types are mutually exclusive, since each person is classified as having just one type.
We might ask the question: “What is the probability of having Type A or Type AB blood?” We can use the addition rule of probability for mutually exclusive events to answer this question; here, we plug in the probabilities of these two blood types from the known population data:
- Probability of having Type A blood: 0.42 (42% of the population)
- Probability of having Type AB blood: 0.04 (4% of the population)
\[ P(\text{Type A or Type AB}) = P(\text{Type A}) + P(\text{Type AB}) = 0.42 + 0.04 = 0.46 \]
When events are not mutually exclusive
For non-mutually exclusive events, that is, events that can happen at the same time, the addition rule states that the probability of either event Y or event Z occurring is the sum of their individual probabilities minus the probability of them occurring together. This adjustment is necessary to avoid double-counting the probability of both events happening together.
\[ P(Y \text{ or } Z) = P(Y \cup Z) = P(Y) + P(Z) - P(Y \text{ and } Z) \]
Let’s consider an applied example that builds on blood type. You might know that your blood type isn’t just about the ABO status (i.e., whether you’re A, B, AB, or O) — it also involves your Rh status, making you either Rh Positive (Rh+) or Rh negative (Rh-). In the population, the probability of Rh Positive (Rh+) blood is 0.85, while the probability of Rh negative (Rh-) blood is 0.15. When the ABO status is crossed with the Rh status, there are 8 different combinations that are possible — and the bullets below present the probability of each type, along with the corresponding percentage of the population (i.e., \(probability \times 100\)).
- \(P(O+)\) = 0.374 or 37.4% of the population
- \(P(O-)\) = 0.066 or 6.6% of the population
- \(P(A+)\) = 0.357 or 35.7% of the population
- \(P(A-)\) = 0.063 or 6.3% of the population
- \(P(B+)\) = 0.085 or 8.5% of the population
- \(P(B-)\) = 0.015 or 1.5% of the population
- \(P(AB+)\) = 0.034 or 3.4% of the population
- \(P(AB-)\) = 0.006 or 0.6% of the population
We might ask the question: “What is the probability of having Type O blood or Rh negative blood?” We can use the addition rule of probability for non-mutually exclusive events to answer this question.
To calculate the probability of having Type O blood or Rh negative blood using the addition rule for non-mutually exclusive events, we use the following formula:
\[ P(\text{Type O or Rh-}) = P(\text{Type O}) + P(\text{Rh-}) - P(\text{Type O and Rh-}) \]
From the given information above, we know the following:
- The probability of having Type O blood (either O+ or O-) is the sum of the probabilities of O+ and O-, which is (0.374 + 0.066), or 0.440.
- The probability of being Rh negative is the sum of the probabilities of all Rh negative blood types (O-, A-, B-, AB-), which is (0.066 + 0.063 + 0.015 + 0.006), or 0.150.
- The probability of having Type O blood and being Rh negative is 0.066.
Thus, the probability of having Type O blood or being Rh negative, accounting for the overlap between these two conditions, is calculated as follows:
\[ P(\text{Type O or Rh-}) = 0.440 + 0.150 - 0.066 = 0.524 \]
This demonstrates that the probability of having Type O blood or being Rh- is about .52; in other words, there is a 52% chance of having Type O blood or being Rh negative.
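Here’s the same calculation carried out in R, as a quick check (object names are illustrative):

p_type_o <- 0.374 + 0.066                  # P(O+) + P(O-)
p_rh_neg <- 0.066 + 0.063 + 0.015 + 0.006  # P(O-) + P(A-) + P(B-) + P(AB-)
p_both   <- 0.066                          # P(O-), i.e., both Type O and Rh-
p_type_o + p_rh_neg - p_both               # 0.524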
Multiplication Rule of Probability
The multiplication rule applies to finding the probability of both event Y and event Z occurring, denoted as \(P(Y \text{ and } Z)\). The term “joint probability” (denoted \(\cap\)) refers to the probability of both events occurring together.
When events are independent
For independent events, that is, two events that are unrelated, the rule is:
\[ P(Y \text{ and } Z) = P(Y \cap Z) = P(Y) \times P(Z) \]
In the context of probability theory, when we describe two events, Y and Z, as independent, we mean that the occurrence of one event has no effect on the probability of the occurrence of the other event. Their outcomes are not influenced by each other.
The ABO blood group system classifies blood into A, B, AB, or O types based on the presence of antigens on the surface of red blood cells. The Rh system classifies blood as either Rh-positive or Rh-negative based on the presence of the Rh D antigen. These systems are determined by different genes located on different chromosomes and are independent of one another.
Let’s calculate the probability of an individual having Type AB for their ABO status and also having a negative Rh status (i.e., Rh-).
Given the probabilities:
- The probability of having Type AB blood is 0.04.
- The probability of being Rh- is 0.15.
Applying the multiplication rule for independent events:
\[ P(\text{Type AB and Rh-}) = P(\text{Type AB}) \times P(\text{Rh-}) = 0.04 \times 0.15 = 0.006 \]
This calculation shows that the probability of an individual being Type AB and Rh- is 0.006 (in other words, there is a 0.6% chance).
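We can also see the multiplication rule for independent events at work in a small simulation sketch. Because the ABO and Rh values below are drawn independently of one another, the joint proportion should land very close to the product of the marginal probabilities (the seed and sample size are arbitrary choices):

set.seed(314)
n <- 100000
abo <- sample(c("A", "B", "AB", "O"), size = n, replace = TRUE,
              prob = c(0.42, 0.10, 0.04, 0.44))
rh <- sample(c("Positive", "Negative"), size = n, replace = TRUE,
             prob = c(0.85, 0.15))
mean(abo == "AB" & rh == "Negative")  # approximately 0.04 * 0.15 = 0.006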
When events are dependent
In situations where events are dependent, meaning the occurrence of one event influences the likelihood of another event occurring, the formula for calculating the probability of both events happening together needs to be adjusted to reflect this dependency:
\[ P(Y \text{ and } Z) = P(Y \cap Z) = P(Y) \times P(Z | Y) \]
Here, \(P(Z∣Y)\) represents the conditional probability of event Z occurring given that event Y has already occurred. The notation \(Z∣Y\) denotes “Z given Y”, underscoring the dependency of Z’s occurrence on the prior occurrence of Y. This adjustment is essential in scenarios where events are not independent; the probability of the second event (Z) is recalculated based on the outcome of the first event (Y).
The modified multiplication rule to account for dependent events is used to compute conditional probabilities. To compute the joint probability of two dependent events occurring together, one must account for the initial probability of the first event (Y) and then multiply it by the adjusted probability of the second event (Z), given the first event’s occurrence. This approach is critical for accurately assessing the likelihood of sequential events where the outcome of one has a direct impact on the probability of the next.
Let’s consider the probability of an individual having Type A blood and being more susceptible to COVID-19. There is some evidence that people with Type A blood are more likely to contract COVID-19. This example illustrates dependent events because the risk of COVID-19 is influenced by the individual’s blood type (i.e., blood type and risk of COVID-19 are dependent).
Let’s make some assumptions to carry out the computation:
- We know that the probability of having Type A blood \(P(\text{Type A})\) is 0.42.
- Based on research findings, let’s assume that individuals with Type A blood have a 1.2 times higher relative risk of contracting COVID-19 compared to the average risk in the population. If the average population risk (i.e., the probability) of contracting COVID-19, \(P(\text{COVID-19})\) in a certain situation is 0.05, then the probability of developing COVID-19 given the individual has Type A blood \(P(\text{COVID-19} \mid \text{Type A})\) could be adjusted proportionally.
We can calculate the \(P(\text{COVID-19} \mid \text{Type A})\) using the stated relative risk:
\[ \small P(\text{COVID-19} \mid \text{Type A}) = P(\text{COVID-19}) \times \text{Relative Risk} = 0.05 \times 1.2 = 0.06 \]
This means that the probability of contracting COVID-19 given Type A blood is 0.06.
Thus, the probability of an individual having Type A blood and contracting COVID-19 is:
\[ \small P(\text{Type A and COVID-19}) = P(\text{Type A}) \times P(\text{COVID-19} \mid \text{Type A}) = 0.42 \times 0.06 = 0.0252 \]
That is, the probability of having Type A blood and contracting COVID-19 is approximately 0.0252, or a 2.52% chance.
Importantly, we should note that it is equivalent to express the joint probability of events Y and Z occurring together as the product of the probability of event Y occurring and the conditional probability of event Z given Y. In mathematical terms, this can be written as:
\[ P(Y \cap Z) = P(Y) \times P(Z \mid Y) = P(Z) \times P(Y \mid Z) \]
This equation illustrates the principle of conditional probability, ensuring that regardless of the order in which the probabilities are considered, the joint probability remains consistent.
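Here’s a quick numerical check of this equivalence in R, using the Type A and COVID-19 quantities from above; \(P(\text{Type A} \mid \text{COVID-19})\) is obtained by rearranging the joint-probability formula:

p_a         <- 0.42                     # P(Type A)
p_c         <- 0.05                     # P(COVID-19)
p_c_given_a <- 0.06                     # P(COVID-19 | Type A)
p_a * p_c_given_a                       # joint probability: 0.0252
p_a_given_c <- p_a * p_c_given_a / p_c  # P(Type A | COVID-19) = 0.504
p_c * p_a_given_c                       # same joint probability: 0.0252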
Contingency tables
Building on our foundational understanding of probability, let’s delve deeper into the application of these principles using contingency tables. A contingency table, often referred to as a cross-tabulation or crosstab, displays the frequency distribution of variables in matrix form, facilitating a clear view of the relationship between two categorical variables. In the next section we will simulate data based on specified probabilities and processes — thereby incorporating the idea of the DGP into the construction and analysis of contingency tables. The DGP, which describes the underlying mechanism producing the observed data, plays a critical role in determining the expected frequencies within the table’s cells. Understanding the DGP allows us to not only populate the contingency table with observed frequencies but also to anticipate patterns and relationships that might emerge based on the theoretical foundations of the variables involved.
Example 1: ABO status and Rh status (independent events)
For a contingency table that cross-tabulates blood type (ABO system) against Rh status (positive or negative), let’s construct a table where rows represent the ABO blood types and columns represent the Rh status:
We’ll simulate data for 10,000 people based on the known prevalence of ABO status and Rh status. The results are presented in the table below.
set.seed(8675309)

# Simulate a random sample of size 10000
abo_rh <-
  tibble(
    blood_type = sample(
      size = 10000,
      replace = TRUE,
      x = c("A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"),
      prob = c(.357, .063, .085, .015, .034, .006, .374, .066)
    )
  ) |>
  separate(blood_type, into = c("ABO_type", "Rh_status"), sep = "(?=[+-])", remove = FALSE) |>
  mutate(Rh_status = case_when(Rh_status == "+" ~ "Positive",
                               Rh_status == "-" ~ "Negative"))

abo_rh |> head(n = 10)
We can use the `tbl_cross()` function from the gtsummary package to create a cross table. I’m also making use of `set_variable_labels()` from the labelled package to set labels for the variables so the table is easier to read. For each variable we define a label (e.g., for variable `ABO_type` we assign the label “ABO blood type”). Once defined, these labels, instead of the variable names, will be printed in the table. The `tbl_cross()` function syntax is straightforward: just list the variable that you want to appear on the rows (e.g., `row = ABO_type`) and the variable you want to appear on the columns (e.g., `col = Rh_status`).
abo_rh |>
  select(ABO_type, Rh_status) |>
  labelled::set_variable_labels(ABO_type = "ABO blood type",
                                Rh_status = "Rh status") |>
  tbl_cross(row = ABO_type, col = Rh_status)
| ABO blood type | Rh status: Negative | Rh status: Positive | Total |
|---|---|---|---|
| A | 615 | 3,544 | 4,159 |
| AB | 58 | 342 | 400 |
| B | 143 | 806 | 949 |
| O | 665 | 3,827 | 4,492 |
| Total | 1,481 | 8,519 | 10,000 |
The intersection cells of the table contain the frequency counts of individuals that correspond to each combination of ABO blood type and Rh status.
The contingency table gives us quite a lot of information. First, take a look at the rows, which represent blood type:
- A Blood Type: 615 have Rh Negative, 3,544 have Rh Positive, making a total of 4,159 individuals with type A blood.
- AB Blood Type: 58 have Rh Negative, 342 have Rh Positive, resulting in 400 individuals with AB blood.
- B Blood Type: 143 have Rh Negative, 806 have Rh Positive, totaling 949 individuals with B blood.
- O Blood Type: 665 have Rh Negative, 3,827 have Rh Positive, for a total of 4,492 individuals with O blood.
Now, take a look at the columns, which represent Rh status. New information (i.e., not mentioned in the summary of rows) appears in the Total row. 1,481 individuals have Rh Negative blood, 8,519 have Rh Positive blood, with a grand total of 10,000 individuals sampled. Notice also that the Total column entries sum to the same grand total (i.e., 4,159 + 400 + 949 + 4,492 = 10,000).
Terminology
There is some important terminology used to describe contingency tables that is useful:
Values in the Cells
- Cell Frequencies (or Counts): These are the values found in the individual cells of the table, representing the number of observations that fall into each category defined by the intersection of the rows and columns. In our first example table, cell frequencies are the numbers indicating how many individuals have each combination of ABO blood type and Rh status (e.g., 615 for A-, 3,544 for A+, etc.).
Values in the Margins
The term “margin” refers to the totals that appear along the outer edge (either the bottom or the right side) of the table. These margins summarize the data across rows or columns, providing totals that are used for further analysis or to understand the distribution of the data. Margins are essentially the sums of the cell frequencies in a particular direction (row-wise or column-wise).
Row Totals: These are the sums of the cell frequencies across each row, giving the total number of observations for each level of the row variable without regard to the column categories. In our table, the row totals indicate the total number of individuals for each ABO blood type (e.g., 4,159 for blood type A).
Column Totals: These are the sums of the cell frequencies down each column, providing the total number of observations for each level of the column variable without regard to the row categories. In our table, the column totals show the total number of individuals with Negative and Positive Rh status (e.g., 1,481 for Rh-, 8,519 for Rh+).
Grand Total: This is the overall sum of all cell frequencies in the table, giving the total number of observations in the data frame. The grand total is also the sum of all row totals or all column totals (e.g., 10,000 in our table).
In the next section, we will apply this cross table of ABO status and Rh status to the basic rules of probability. Recall that in describing these rules earlier, we had different methods for applying the rules depending on whether the independence assumption was met. We can easily use the table to study the tenability of the independence assumption in our example. The criterion for demonstrating independence between the ABO blood type and Rh status is that the proportion (or percentage) of Rh Positive individuals should be approximately the same across all ABO blood types.
To check for independence, we can calculate the expected frequencies for each cell in the contingency table under the assumption that ABO blood type and Rh status are independent. The expected frequency (E) for each cell is calculated as:
\[ E_{ij} = \frac{(R_i \times C_j)}{N} \]
Where \(E_{ij}\) is calculated by multiplying the row total \(R_i\) for the blood type by the column total \(C_j\) for the Rh status and dividing by the grand total \(N\), which is the total number of observations (10,000 in this example).
For example, to find the expected value for the A+ cell, we take:
\[ E_{\text{A+}} = \frac{(4159 \times 8519)}{10000} = 3543.05 \]
Thus, if ABO status and Rh status are independent, we expect about 3543 cases to be A+.
Using this formula, let’s calculate the expected cell count for all cells in the table:
- For ABO type A: Negative expected = 615.95; Positive expected = 3543.05
- For ABO type AB: Negative expected = 59.24; Positive expected = 340.76
- For ABO type B: Negative expected = 140.55; Positive expected = 808.45
- For ABO type O: Negative expected = 665.27; Positive expected = 3826.73
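Rather than computing each expected count by hand, we can let R do the arithmetic. One way, sketched below, is to take the outer product of the row and column totals and divide by the grand total:

row_totals <- c(A = 4159, AB = 400, B = 949, O = 4492)
col_totals <- c(Negative = 1481, Positive = 8519)
expected <- outer(row_totals, col_totals) / 10000  # E = (row total x column total) / N
round(expected, 2)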
The table below shows the observed cell counts (from the simulated data) alongside the expected counts we just calculated.

| ABO type, Rh status | Observed | Expected |
|---|---|---|
| A, Negative | 615 | 615.95 |
| A, Positive | 3,544 | 3,543.05 |
| AB, Negative | 58 | 59.24 |
| AB, Positive | 342 | 340.76 |
| B, Negative | 143 | 140.55 |
| B, Positive | 806 | 808.45 |
| O, Negative | 665 | 665.27 |
| O, Positive | 3,827 | 3,826.73 |
A comparison of the observed cells and our calculated expected cells demonstrates a very close match. Indeed, for each blood type, about 85% of cases are Rh+. This provides evidence that, as expected, ABO type and Rh status are independent of one another.
Apply the contingency table to find corresponding probabilities
Given the contingency table, we can calculate the probabilities for various conditions related to the elements of probability that we’ve studied so far in this Module.
1. What is the probability of Type AB blood?
To find the probability of having Type AB blood, we divide the number of individuals with Type AB blood by the total number of individuals.
\[ P(\text{Type AB}) = \frac{\text{Total with Type AB}}{\text{Total observations}} = \frac{400}{10,000} = 0.0400 \]
This calculation shows that the probability that an individual has Type AB blood is 0.04. By multiplying this probability by 100, we can state that about 4% of the population has Type AB blood.
2. What is the probability of Rh Negative blood?
The probability of being Rh negative is calculated by dividing the total number of individuals with Rh negative blood by the total number of individuals.
\[ P(\text{Rh-}) = \frac{\text{Total Rh Negative}}{\text{Total observations}} = \frac{1,481}{10,000} = 0.1481 \]
This indicates that approximately 15% of the population is Rh negative.
3. What is the probability of having Type O blood or being Rh Positive?
This question involves the union of two events, that is, the probability that a person either has Type O blood or is Rh+. To calculate \(P(\text{Type O} \cup \text{Rh+})\) with the contingency table, you add up the total number of cases of Type O blood and the total number of cases with Rh+ blood, and then from this sum subtract the number of cases with both Type O blood and Rh+ blood (i.e., O+) to avoid double counting. This quantity is then divided by the total number of cases (10,000).
In terms of counts from the cross table:
- total Type O cases = 4,492
- total Rh+ cases = 8,519
- both Type O and Rh+ cases = 3,827
Thus, the probability of Type O or Rh+ is calculated as follows:
\[ P(\text{Type O or Rh+}) = \frac{(4,492 + 8,519) - 3,827}{10,000} = 0.9184 \]
4. What is the probability of having Type A blood and Rh Negative blood?
This question involves the intersection of two events – or the joint probability. It can be written as: \(P(\text{Type A} \cap \text{Rh-})\). Answering this question with the contingency table is quite easy — we just need to find the cell that provides this joint count (N = 615), and divide by the total people observed:
\[ \small P(\text{Type A and Rh-}) = \frac{\text{Total with Type A and Rh Negative}}{\text{Total observations}} = \frac{615}{10,000} = 0.0615 \]
Therefore, based on the simulated sample, the probability of having Type A blood and being Rh negative is 0.0615; expressed as a percentage, about 6% of the sample has Type A blood and is Rh negative. (Note that this is close to the product 0.42 × 0.15 = 0.063 that the multiplication rule predicts under independence.)
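All four of these probabilities can also be computed directly from the simulated abo_rh data frame rather than read off the table. Here’s a sketch using a single summarize() call, where the mean() of each logical condition yields the proportion of cases satisfying it:

abo_rh |>
  summarize(
    p_ab        = mean(ABO_type == "AB"),                           # question 1: 0.0400
    p_rh_neg    = mean(Rh_status == "Negative"),                    # question 2: 0.1481
    p_o_or_pos  = mean(ABO_type == "O" | Rh_status == "Positive"),  # question 3: 0.9184
    p_a_and_neg = mean(ABO_type == "A" & Rh_status == "Negative")   # question 4: 0.0615
  )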
Example 2: ABO status and COVID-19 (dependent events)
In this example, we explore the concept of dependent events in probability using the context of blood types and the risk of contracting COVID-19. Unlike the independent events of ABO and Rh blood status, here we consider the dependency between having Type A blood and the likelihood of contracting COVID-19. We simplify the blood types into two categories: Type A and not Type A (which includes Types B, AB, and O).
Let’s consider simulated data for 10,000 people based on the following parameters:
- Probability of having Type A blood, \(P(\text{Type A})\) = 0.42.
- Average population risk of contracting COVID-19 in a certain situation, \(P(\text{COVID-19})\) = 0.05.
- Individuals with Type A blood have a 1.2 times higher relative risk of contracting COVID-19 compared to the average population risk.
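As noted in the footnote, the actual simulation code for these data is a bit more involved than what we’ve covered so far. Still, here’s a hedged sketch of one possible data generating process consistent with these parameters; the seed, the derived risk for non-Type A individuals, and the use of runif() are illustrative choices, not necessarily the code that produced the table below:

set.seed(1234)  # arbitrary seed; not necessarily the one used for the table
n <- 10000
p_covid_a    <- 0.05 * 1.2                        # risk for Type A: 0.06
p_covid_nota <- (0.05 - 0.42 * p_covid_a) / 0.58  # risk for others (~0.043), keeping the overall risk at 0.05

a_covid <- tibble(
  blood_type = sample(c("Type A", "Not Type A"), size = n, replace = TRUE,
                      prob = c(0.42, 0.58)),
  covid_status = if_else(
    runif(n) < if_else(blood_type == "Type A", p_covid_a, p_covid_nota),
    "Yes", "No"
  )
)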
The simulated data are presented in the table below.¹
a_covid |>
  select(blood_type, covid_status) |>
  tbl_cross(row = blood_type, col = covid_status)
| Blood Type | COVID-19: No | COVID-19: Yes | Total |
|---|---|---|---|
| Not Type A | 5,588 | 258 | 5,846 |
| Type A | 3,883 | 271 | 4,154 |
| Total | 9,471 | 529 | 10,000 |
Apply the contingency table to find corresponding probabilities
Given the contingency table, we can calculate the probabilities for various conditions related to the elements of probability that we’ve studied so far in this module.
1. What is the probability of Type A blood?
To find the probability of having Type A blood, we divide the number of individuals with Type A blood by the total number of individuals.
\[ P(\text{Type A}) = \frac{\text{Total with Type A}}{\text{Total observations}} = \frac{4154}{10,000} = 0.4154 \]
This calculation shows that there’s about a 42% chance of an individual having Type A blood.
2. What is the probability of contracting COVID-19 among people with Type A blood?
To find the probability of contracting COVID-19 among people with Type A blood, we use the conditional probability formula.
From the provided contingency table, we have:
- Number of Type A individuals with COVID-19 = 271
- Total number of Type A individuals = 4,154
Let’s calculate the probability:
\[ \small P(\text{COVID-19} | \text{Type A}) = \frac{\text{Number of Type A individuals with COVID-19}}{\text{Total number of Type A individuals}} = \frac{271}{4154} = .065 \]
The probability of contracting COVID-19 among people with Type A blood, based on the simulated data, is approximately 0.065. This calculation indicates that out of all individuals with Type A blood in the simulated population, about 6.5% are expected to contract COVID-19.
3. What is the probability of contracting COVID-19 among people without Type A blood?
To find the probability of contracting COVID-19 among people with a blood type other than Type A, we use the conditional probability formula.
From the provided contingency table, we have:
- Number of Type B/AB/O individuals with COVID-19 = 258
- Total number of Type B/AB/O individuals = 5,846
Let’s calculate the probability:
\[ \small P(\text{COVID-19} | \text{not Type A}) = \frac{\text{Number of not Type A individuals with COVID-19}}{\text{Total number of not Type A individuals}} = \frac{258}{5846} = .044 \]
The chances of contracting COVID-19 among people with Type B, AB or O blood, based on the simulated data, is approximately 4.4%. This calculation indicates that out of all individuals with a blood type other than Type A blood (i.e., Type O, B, or AB) in the simulated population, about 4.4% are expected to contract COVID-19.
Notice that the probability of contracting COVID-19 for people with Type A blood and for people with all other blood types is different — people with Type A blood are more likely to contract COVID-19. This demonstrates the dependence of these two events.
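Assuming the a_covid data frame has the blood_type and covid_status columns used in the tbl_cross() call above, both conditional probabilities can be computed in one short pipeline:

a_covid |>
  group_by(blood_type) |>
  summarize(p_covid = mean(covid_status == "Yes"))
# Not Type A: 258/5846, about 0.044; Type A: 271/4154, about 0.065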
4. What is the probability of having Type A blood or contracting COVID?
To calculate the probability of having Type A blood or contracting COVID-19, we use the formula for the union of two events, denoted as \(P(\text{Type A} \cup \text{COVID-19})\). This represents the probability that a person either has Type A blood, contracts COVID-19, or experiences both.
Given from the contingency table and initial parameters:
- 4154 individuals have Type A blood.
- 529 individuals contracted COVID-19.
- 271 individuals have Type A blood and contracted COVID-19.
\[ P(\text{Type A} \cup \text{COVID}) = \frac{(4,154 + 529) - 271}{10,000} = .4412 \]
The probability of either having Type A blood or contracting COVID-19, denoted as \(P(\text{Type A} \cup \text{COVID-19})\), is calculated by adding the number of people in each individual event and then subtracting the number of people who experienced both events, as they are not mutually exclusive. Subtracting the overlap is necessary to avoid double counting those who fall into both categories. The resulting calculation shows that approximately 44% of the population either have Type A blood or have contracted COVID-19.
A final example
To cap off our journey through the basics of probability, let’s explore a final practical example. Let’s imagine that you work for a large corporation, and you have been notified that there has been an outbreak of COVID-19 at your workplace. Now, you seek to determine if you have the virus too. You have a vacation planned in 3 days — and you are very hopeful that you do not have the virus. You take a rapid antigen test. Based on this scenario, imagine two events:
- Event A: Testing positive for COVID-19
- Event B: Actually having COVID-19
Consider the contingency table below that summarizes the relationship between these two events across 1,000 people in your workplace. The table categorizes the individuals based on their actual COVID-19 status (which makes up the columns of the table — “No” signifies the individual doesn’t have COVID-19, “Yes” signifies the individual has COVID-19) against their COVID-19 test results (which makes up the rows of the table — “Negative” signifies the individual tested negative for COVID-19 or “Positive” signifies they tested positive for COVID-19). We’ll use this information to consider your scenario as stated above.
| Result of test | COVID-19: No | COVID-19: Yes | Total |
|---|---|---|---|
| Negative | 721 | 19 | 740 |
| Positive | 76 | 184 | 260 |
| Total | 797 | 203 | 1,000 |
Defining correct and incorrect test results
From this contingency table, we can categorize the cells as follows:
- True Positive (TP): Individuals who actually have COVID-19 and the test correctly identifies them as positive. From the table, there are 184 true positives.
- True Negative (TN): Individuals who do not have COVID-19 and the test correctly identifies them as negative. The table shows 721 true negatives.
- False Positive (FP): Individuals who do not have COVID-19, but the test incorrectly identifies them as positive. According to the table, there are 76 false positives.
- False Negative (FN): Individuals who have COVID-19, but the test incorrectly identifies them as negative. We see 19 false negatives in the table.
As such, two of the cells represent “correct” results of the test (TP and TN cells), but two of the cells represent “incorrect” results of the test (FP and FN).
We can also note that the probability of getting a positive test result in this example is \(P(A) = 260/1000 = .26\), and the probability of having COVID-19 given the workplace outbreak is \(P(B) = 203/1000 = .20\).
Sensitivity
The primary purpose of a screening test, including a COVID-19 rapid antigen test, is to detect the presence of the virus. This is commonly referred to as the test’s sensitivity — that is, its ability to correctly identify those with the disease — and answers the question, “What is the probability of testing positive for COVID-19 given that I actually have the virus?” This is mathematically represented as a conditional probability, specifically \(P(A|B)\), and can be calculated using the contingency table as follows:
\[\text{Sensitivity} = \frac{TP}{TP + FN}\]
Using the provided data:
\[\text{Sensitivity} = \frac{184}{184 + 19} = .91\]
This indicates that the test correctly identifies 91% of individuals who actually have COVID-19. It’s important to note that sensitivity is a critical metric for test manufacturers because it reflects the test’s accuracy in identifying positive cases of the disease.
Specificity
While sensitivity measures a test’s ability to correctly identify those with the disease, specificity is another critical metric in evaluating a screening test. Specificity measures a test’s ability to correctly identify those without the disease, essentially answering the question, “What is the probability of testing negative for COVID-19 given that I actually do not have the virus?” This metric is crucial for ensuring that individuals without the disease are not incorrectly identified as having it, which can lead to unnecessary anxiety, further testing, and potentially unwarranted treatment.

Given these categories, specificity is mathematically represented as the conditional probability of testing negative given no presence of the disease. In probability theory, conditional probabilities can also be expressed for complementary events. If \(P(A|B)\) denotes the probability of event A occurring given that event B has occurred (i.e., the sensitivity of the test in our example), then the probability of the complement of A given the complement of B is denoted \(P(A^c | B^c)\), where the c superscript denotes the complement of the defined event. That is, it represents the probability of testing negative (the complement of A) given that one is truly free of the disease (the complement of B).
The specificity measures the proportion of true negatives out of all those who do not have the disease. It’s defined as:
\[ \text{Specificity} = \frac{TN}{TN + FP} \]
Plugging in the numbers:
\[ \text{Specificity} = \frac{721}{721 + 76} = .90 \]
This calculation yields the specificity of the test, indicating the probability that the test gives a negative result given that the individual does not have COVID-19. High specificity is particularly important in reducing the number of false positives, ensuring that individuals who are disease-free are not mistakenly identified as having the condition.
Specificity, along with sensitivity, provides a comprehensive view of a test’s accuracy. While sensitivity focuses on correctly identifying disease cases, specificity is concerned with the accurate identification of those without the disease, making both metrics fundamental to understanding and evaluating the performance of diagnostic tests.
Positive Predictive Value (PPV)
While both sensitivity and specificity are important, we might argue that neither of these specifically answer the question that you may have after testing positive for COVID-19. In this scenario — you most likely want to have the following question answered: “Given that I’ve tested positive for COVID-19, what is the probability that I actually have the virus?” This question seeks to understand the likelihood of having the disease based on the test result, mathematically represented as \(P(B|A)\), the probability of Event B occurring given that Event A is true.
In this example, it’s crucial to recognize that the \(P(A|B)\) (the sensitivity of the test) is not inherently equal to \(P(B|A)\). This latter probability is called the positive predictive value (PPV), and it measures the probability that individuals who test positive for COVID-19 actually have the virus. It’s defined mathematically as:
\[ PPV = \frac{TP}{TP + FP} \]
Using the provided data:
\[ PPV = \frac{184}{184 + 76} = .71 \]
If you tested positive for COVID-19, then the probability that you actually have COVID-19 is .71. Despite the high accuracy of the test, roughly 3 in 10 people who test positive (76/260 ≈ 29%) won’t actually have the virus; this is sometimes referred to as the False Positive Paradox.
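For reference, here’s a minimal sketch of how these screening-test metrics can be computed in R from the four cell counts:

tp <- 184  # true positives
tn <- 721  # true negatives
fp <- 76   # false positives
fn <- 19   # false negatives
c(sensitivity = tp / (tp + fn),  # 0.91
  specificity = tn / (tn + fp),  # 0.90
  ppv         = tp / (tp + fp))  # 0.71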
Bayes’ Theorem
By defining PPV as the \(P(B|A)\) and calculating it using the given data, we employ the principles of Bayes’ theorem. Bayes’ theorem is a fundamental concept in probability theory that describes how we can update the probabilities of hypotheses when given evidence.
Bayes’ theorem is written as:
\[ P(B|A) = \frac{P(A|B) \times P(B)}{P(A)} \]
Notice that we need three pieces of information to solve for \(P(B|A)\) here — which we defined earlier as the positive predictive value (PPV) of the test:
- \(P(A|B)\), which, in the context of our example is the sensitivity of the COVID-19 rapid antigen test.
- \(P(B)\), which is the probability of having COVID-19.
- \(P(A)\), which is the probability of testing positive for COVID-19.
The power of Bayes’ theorem lies in its ability to incorporate prior knowledge about the prevalence of the disease, \(P(B)\), and specific test information, \(P(A∣B)\) and \(P(A)\), to yield a practical and highly relevant probability \(P(B∣A)\) — the PPV. This means that if we know the sensitivity of a test, the prevalence of the disease, and the overall rate of positive tests, we can accurately determine the likelihood that a positive test result truly indicates the presence of the disease, even if we don’t have raw data like we did in our example.
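As a quick check, here’s Bayes’ theorem applied to our example in R, using the three quantities read from the contingency table:

p_a_given_b <- 184 / 203   # sensitivity, P(A|B)
p_b         <- 203 / 1000  # probability of having COVID-19, P(B)
p_a         <- 260 / 1000  # probability of testing positive, P(A)
(p_a_given_b * p_b) / p_a  # about 0.71, matching the PPV computed from the table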
Key Takeaway
Understanding the distinction between the probability of having COVID-19 given a positive test result (\(P(\text{COVID+} | \text{Test+}) = 0.71\)) and the probability of testing positive given the presence of COVID-19 (\(P(\text{Test+} | \text{COVID+}) = 0.91\)) is crucial. The former, known as the Positive Predictive Value (PPV), indicates the likelihood that one actually has the disease after testing positive and is what most people are interested in when they receive a test result. The latter, often referred to as the sensitivity of the test, is commonly reported by test manufacturers as a measure of a test’s accuracy.
However, sensitivity alone does not account for the prevalence of the disease or the possibility of false positives — people who test positive without actually having the disease. This is where PPV provides a more comprehensive picture by considering both the accuracy of the test and the context in which it is used, including the prevalence of the disease. The PPV can vary significantly depending on the disease’s prevalence in the population, illustrating the importance of contextual factors in interpreting test results.
Bayes’ Theorem plays a crucial role in this interpretation by enabling the calculation of PPV through the incorporation of prior knowledge about the disease’s prevalence and the test’s characteristics.
To solidify these ideas, please watch the following Crash Course Statistics video on Bayesian updating.
Wrap-up
In this Module, we explored the foundations of probability, emphasizing its central role in statistics and various scientific fields. We defined probability and distinguished between theoretical and empirical probabilities, and along the way, we illustrated their applications through examples such as coin tosses and social media usage among teenagers.
Key aspects of the module include:
Theoretical Foundations: Probability as a measure of event likelihood, ranging from 0 (impossible) to 1 (certain). Theoretical probability is contrasted with empirical probability, which is derived from actual data.
Inferential Statistics: Probability forms the backbone of inferential statistics, allowing statisticians to make predictions about a population based on sample data.
Basic Rules of Probability: We covered the basic rules of probability — including the Addition Rule and the Multiplication Rule. We also learned how these basic rules coincide with cross tabulations of observed data.
Bayesian Statistics: We explored Bayes’ theorem in the context of updating probabilities based on new information, enhancing decision-making processes.
Practical Applications: Through examples like blood type distribution and the impact of ABO and Rh blood types on disease susceptibility, we applied probability concepts to real-world scenarios.
By the end of this Module, you should have a good understanding of both the theoretical underpinnings and practical applications of basic probability, equipped to apply these concepts in various statistical and real-life contexts.
Credits
- In writing about the basic rules of probability, I drew from the excellent book entitled Advanced Statistics from an Elementary Point of View as well as The Cartoon Guide to Statistics.
- I was inspired by the Quantitude Podcast Season 3 Episode 16 to create the Bayes’ Theorem example of testing positive for COVID-19 and actually having COVID-19.
Footnotes
¹ The simulation of these data is a little more involved and requires some code that you haven’t yet learned about, so for now, we’ll just take a look at a contingency table generated from the simulated data.