Probability distributions

Module 6

Learning objectives

  • Define and apply the concepts of probability distributions
  • Distinguish between discrete and continuous probability distributions
  • Interpret and calculate probabilities using the normal distribution
  • Utilize the qnorm() and pnorm() functions in R for probabilistic inquiries
  • Translate theoretical probabilities into practical applications
  • Employ the Empirical Rule to describe data spread in normal distributions
  • Convert raw scores to z-scores for standardization and interpretation

So far in this course, we’ve progressively built a foundation in data science, starting with the essentials of data handling in Modules 2 through 4. These initial modules introduced us to the art of summarizing observations, organizing them into informative tables, and visually interpreting data through various graphical representations. Module 5 marked a shift towards understanding the basic rules of probability, where we learned to simulate data based on known probabilities, compute and analyze contingency tables, and apply probability rules within the framework of these tables, culminating in an introduction to Bayes’ theorem.

Building upon this groundwork, Module 6 advances into the more specialized realm of probability distributions. Here, we aim to explore both discrete and continuous probability distributions. Through a comprehensive study of probability distributions, we will enhance our ability to predict outcomes, assess risks, and make informed decisions grounded in the statistical likelihood of various events.

Probability distributions are at the heart of both probability theory and statistics. Essentially, a probability distribution is a function or rule that assigns probabilities to each possible outcome of a random variable. This random variable could represent anything from individuals’ ABO blood types to the height of individuals in a population to the results of a COVID-19 test.

Probability distributions are fundamental to statistics and data analysis for several reasons, serving multiple key purposes:

  1. Describing Data Patterns: Probability distributions provide a systematic way to describe the patterns and behaviors observed in data. By understanding the distribution that fits a dataset, analysts can succinctly summarize its characteristics, such as its center, variability, and the shape of its distribution.

  2. Modeling Uncertainty: At the heart of probability theory is the modeling of uncertainty. Probability distributions help in quantifying the likelihood of different outcomes, enabling us to make informed predictions and decisions under conditions of uncertainty.

  3. Inferring Population Properties: Through the use of probability distributions, data scientists can infer the properties of a larger population from a sample. This is the basis of inferential statistics, which allows us to estimate population parameters (like means and variances), test hypotheses, and make predictions about future observations.

  4. Guiding Scientific Experiments and Research: Understanding probability distributions aids in the design of experiments and studies by allowing researchers to anticipate the distribution of possible outcomes. This is crucial for determining sample sizes, assessing the power of a study, and interpreting the results of experiments and surveys.

  5. Evaluating Statistical Models: Many statistical models assume that the data follows a specific probability distribution. Knowing how to apply and interpret these distributions is key to selecting appropriate models, diagnosing model fit, and making accurate predictions. For example, the normal distribution underlies many statistical tests and procedures, such as t-tests, that you may already have studied.

  6. Predicting Future Events: Probability distributions are used to predict the likelihood of future events based on historical data. This is fundamental in fields like meteorology for weather forecasting, in finance for forecasting stock prices, in operations research for planning and optimization, and for prognostics in social sciences and medicine.

  7. Improving Decision Making: By quantifying uncertainty and risk, probability distributions support better decision-making, policy-making, and personal choices. They provide a framework for evaluating different options and outcomes, helping decision-makers choose actions with the best expected results.

In summary, probability distributions are essential for understanding and managing the inherent uncertainty in data. They serve as foundational tools in statistics, enabling us to summarize data, make predictions, and ultimately, make informed decisions based on empirical evidence.

Introduction to Probability Distributions

In Module 5, we introduced basic probability concepts using the distribution of ABO blood types as a key example. The ABO system divides individuals into four categories, each occurring with different frequencies in the population:

  • Type A: 42%

  • Type B: 10%

  • Type AB: 4%

  • Type O: 44%

These percentages represent theoretical probabilities, providing a mathematical framework to gauge the likelihood of encountering each blood type in the general population. For instance, if you were to select someone at random, there is a 42% chance they would have Type A blood, a 10% chance for Type B, a 4% chance for Type AB, and a 44% chance for Type O.

Theoretical probabilities are based on established knowledge or assumptions about a group’s characteristics. In the blood type example, they are derived from meta-analytic studies or historical health records that summarize the blood type distribution in a broader population. These figures give us a model for predicting outcomes before any actual data collection, allowing us to estimate the likelihood of various blood types among individuals in a hypothetical random sampling.

Contrast this with empirical probabilities, which are derived from actual data gathered through observation or experimentation. For example, if a researcher sampled 1,000 people and found that 440 had Type O blood, the empirical probability of selecting someone with Type O blood from this group would be 0.44. This kind of probability is grounded in observed outcomes and provides real-world verification of theoretical probabilities.

To illustrate, imagine we conduct a study to determine the distribution of blood types among participants randomly selected from a defined population. By documenting the frequency of each type, we generate empirical probabilities that can then be compared to our theoretical expectations. This approach not only tests our initial assumptions but also refines our understanding of how likely we are to encounter each blood type in a given population.
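To make this concrete, here is a minimal R sketch (the seed and the sample size of 1,000 are arbitrary choices) that simulates drawing people from a population with the theoretical blood type probabilities and then computes the empirical proportions for comparison:

set.seed(123)  # arbitrary seed so the simulation is reproducible

# Simulate the blood types of 1,000 randomly sampled people using the
# theoretical probabilities given above
blood_types <- sample(c("A", "B", "AB", "O"),
                      size = 1000,
                      replace = TRUE,
                      prob = c(0.42, 0.10, 0.04, 0.44))

# Empirical probabilities from the simulated sample
# (compare to the theoretical values 0.42, 0.10, 0.04, and 0.44)
prop.table(table(blood_types))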

In essence, the distinction between theoretical and empirical probabilities lies in their foundation: theoretical probabilities are predicted based on prior knowledge and assumptions, while empirical probabilities are determined through actual observation and measurement. Both forms of probability help us make informed predictions and decisions about likely outcomes in real-life scenarios.

The barplot below serves as a visual guide to our theoretical probability distribution, with the height of each bar representing the probability of encountering a person in the population with a certain blood type. For instance, a taller bar for blood type O compared to blood type AB visually communicates that we’re more likely to encounter a person with O than one with AB.
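One way such a barplot could be produced in R is sketched below (assuming, as in the module’s other code examples, that the tidyverse is loaded; the labels and theme are illustrative choices):

# Theoretical probabilities for the four ABO blood types
blood_type_df <- tibble(
  blood_type = factor(c("A", "B", "AB", "O"), levels = c("A", "B", "AB", "O")),
  probability = c(0.42, 0.10, 0.04, 0.44)
)

# Barplot of the theoretical probability distribution
blood_type_df |>
  ggplot(mapping = aes(x = blood_type, y = probability)) +
  geom_col() +
  labs(title = "Theoretical probability distribution of ABO blood types",
       x = "Blood type",
       y = "Probability") +
  theme_minimal()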

Picture a scenario where we’ve recorded the letters A, B, AB or O on 1,000 ping pong balls — matching the prevalence in the population — so 420 are marked with an A, 100 with a B, 40 with an AB, and 440 with an O. Now, imagine placing all 1,000 of these uniquely marked ping pong balls into a large bag. This bag is thoroughly shaken, ensuring that every ping pong ball has an equal chance of being drawn. The act of drawing one ball from this well-mixed collection represents a random selection process, mirroring the essence of probability in action.

The probabilities displayed in the barplot provide us with a quantifiable insight into what to expect from a single draw. The bar representing blood type O reaches higher than the rest, signifying a greater likelihood of pulling an O-marked ball from the bag on a random draw.

In this way, the bag of ping pong balls becomes a physical manifestation of the discrete probability distribution we’ve been discussing. Each draw from the bag is an independent event, yet the collective probabilities — the composition of our bag — tell us the story of our data’s distribution. It’s an illustration of how probability underpins our expectations and predictions about random events. As we move through the next few modules, we’ll replace the bag with a population, and the balls with people that we randomly select from the population.

Probability Mass Function (PMF)

The barplot of ABO blood types presented above illustrates a fundamental statistical concept known as the Probability Mass Function (PMF), which is applicable to discrete random variables. A PMF assigns exact probabilities to each possible outcome of a discrete variable. For instance, our barplot visually communicates the probability associated with each blood type based on its frequency relative to the overall population.

Cumulative Distribution Function (CDF)

Building on the PMF, the Cumulative Distribution Function (CDF) represents the probability that a discrete random variable is less than or equal to a certain value. Unlike the PMF, which gives the probability for each specific value, the CDF accumulates probabilities up to a certain threshold. This provides a way to see the total probability of achieving a result up to a certain category, ordered typically by prevalence or some other logical sequence.

Let’s apply the CDF concept using our ABO blood type example:

  • Type O: The most common blood type, standing at a probability of 0.44.

  • Adding the probability of Type A (0.42) to Type O’s gives us 0.44 + 0.42 = 0.86. This implies that the probability of an individual having either Type O or Type A blood is 0.86.

  • Adding Type B (0.10) to the sum of Types O and A yields a cumulative probability of 0.44 + 0.42 + 0.10 = 0.96, suggesting that 96% of the population is expected to have either Type O, A, or B blood.

  • Incorporating Type AB (0.04) results in a total cumulative probability for Types O, A, B, and AB of 0.44 + 0.42 + 0.10 + 0.04 = 1.0, which logically covers the entire population since these are the only blood types available.

Below is the CDF plot for blood type as just described:
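A sketch of how these cumulative probabilities could be computed and plotted in R (the ordering O, A, B, AB follows the text above; styling choices are arbitrary, and the tidyverse is assumed to be loaded):

# Cumulative probabilities for the blood types, ordered as in the text (O, A, B, AB)
cdf_df <- tibble(
  blood_type = factor(c("O", "A", "B", "AB"), levels = c("O", "A", "B", "AB")),
  probability = c(0.44, 0.42, 0.10, 0.04)
) |>
  mutate(cumulative_probability = cumsum(probability))

cdf_df

# Plot the cumulative probabilities as a bar chart
cdf_df |>
  ggplot(mapping = aes(x = blood_type, y = cumulative_probability)) +
  geom_col() +
  labs(title = "Cumulative distribution of ABO blood types",
       x = "Blood type (ordered by prevalence)",
       y = "Cumulative probability") +
  theme_minimal()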

The CDF provides insights into the distribution of a variable, showing not just the likelihood of each specific value (as the PMF does) but how probabilities accumulate over the range of possible values given a particular order.

Before diving into the next components of the Module, please watch the following Crash Course Statistics video on Randomness.



Types of Probability Distributions

There are two main types of probability distributions: discrete and continuous.

Discrete probability distributions apply to scenarios where the outcomes can be counted and are finite or countably infinite, such as the number of heads in a series of coin tosses or the number of times the Old Faithful geyser in Yellowstone National Park erupts in a 24-hour period. Each outcome has a specific probability (i.e., a probability mass function or PMF) associated with it, and the sum of all these probabilities equals one. Our ABO blood type example is a discrete probability distribution.

Continuous probability distributions, on the other hand, deal with outcomes that can take on any value within an interval or range. These distributions are described using probability density functions (PDFs) rather than probabilities for each individual outcome (defined as masses — thus the term probability mass function). The area under the PDF curve between any two points gives the probability of the random variable falling within that interval. This concept is crucial for analyzing measurements and data that vary continuously, like temperature readings or income.

Discrete Probability Distributions

Let’s begin with discrete probability distributions; in particular, we’ll focus on distributions that apply to binary variables, which encapsulate outcomes in their simplest, most elemental form. Binary variables, also known as dichotomous variables, play a significant role in psychological research, offering a way to categorize individuals, behaviors, or outcomes into two distinct groups. These variables are essential for investigating the presence or absence of certain characteristics, conditions, or responses — for example, diagnosis with a psychological disorder (i.e., positive diagnosis vs. negative diagnosis), completion of a treatment protocol (i.e., completed vs. dropped out), or a correct answer to a quiz question (i.e., correct vs. incorrect).

Bernoulli Trials

At the heart of discrete probability distributions lies the concept of Bernoulli trials, named after the Swiss mathematician Jacob Bernoulli. A Bernoulli trial is defined as a random experiment that yields one of two mutually exclusive outcomes: success or failure.

A quintessential example is the act of flipping a fair coin. In this scenario, the coin’s landing on heads might be deemed a “success,” while landing on tails could be considered a “failure.” Despite the apparent simplicity of these trials, becoming familiar with Bernoulli trials can provide important insights that serve as a stepping stone to more complex probabilistic models.

Key Properties of Bernoulli Trials

  1. Binary Outcomes: The defining feature of a Bernoulli trial is that it has only two outcomes. These outcomes are typically labeled as “success” and “failure,” but they can represent any two mutually exclusive outcomes, such as “heads” or “tails,” “yes” or “no,” “positive” or “negative,” etc.

  2. Fixed Probability: The probability of success, denoted by \(p\), is the same every time the trial is conducted. The probability of failure is then \(1 - p\). Additionally, the values of \(p\) must lie between 0 and 1. For example, if we flip a fair coin, the probability of heads is 0.5, and the probability of tails is 0.5.

  3. Independence: Each Bernoulli trial is independent of the other trials. This means the outcome of one trial does not influence or change the outcomes of subsequent trials. The probability of success remains constant across trials. For example, each time you flip a coin, there is an equal probability of it coming up heads vs. tails, and the coin coming up heads on one flip does not impact the likelihood that it will come up heads on a subsequent flip.

  4. Randomness: The outcome of each trial is determined by chance. The process is random, meaning that while the probability of each outcome is known, the actual outcome of any single trial cannot be predicted with certainty.

Example: COVID-19 Rapid Antigen Testing

To illustrate the concept of Bernoulli trials in a real-world context relevant to public health, consider the following scenario which builds on the COVID-19 exposure at your workplace example from Module 5.

Imagine you’ve been notified of a COVID-19 outbreak at your workplace. Concerned about the risk of infection, and anxious to know if you have been infected in the days leading up to your much anticipated vacation, you decide to assess the situation using rapid antigen tests. You purchase a handful of identical saliva-based tests. You administer two of the tests simultaneously, following the test protocol precisely.

In this experiment:

  • Each test constitutes a Bernoulli trial with two possible outcomes: “Positive” (indicating the presence of the virus) or “Negative” (indicating the absence of the virus).

  • The “success” in this context is defined as obtaining a positive result, reflecting the experiment’s objective to detect the virus, if present.

  • The assumption here is that by using the same sample and conducting both tests simultaneously, the probability of success (\(p\)) — a positive test result — remains constant across trials, satisfying the fixed probability condition.

  • The independence of each trial is maintained by the nature of the tests themselves, assuming the result of one test doesn’t directly influence the other test.

The Binomial Distribution

Before we dive into the details of the Binomial Distribution, please watch the following Crash Course Statistics video on this very topic:



Binomial random variables

The binomial random variable (\(X\)) represents the number of successes in \(n\) repeated Bernoulli trials, where each trial has a fixed probability \(p\) of success.

Let’s make the following assumptions as we progress through this example:

  • Let’s imagine that the outbreak at your workplace was extreme, and the probability of a positive result on your test (\(p\)) is 0.5.

  • The number of tests conducted (\(n\)) is 2 (i.e., you will take two rapid tests).

  • The results of both tests were conclusive (i.e., neither test kit produced an unusable result).

With these two tests — there are three possible outcomes:

  1. Both tests come up negative (i.e., \(k=0\) positive results)

  2. One test comes up positive, while the other is negative (i.e., \(k=1\) positive result)

  3. Both tests are positive (i.e., \(k=2\) positive results).

We are interested in calculating \(P(X = k)\), the probability of observing k successes (positive tests) out of \(n\) trials, for three possible values of \(k\) — that is: \(k = 0,1,2\). In other words, we are interested in knowing the probability that neither test is positive (\(k = 0\)), the probability that one of the two tests is positive (\(k = 1\)), and the probability that both tests are positive (\(k = 2\)).

The probability of observing exactly \(k\) successes in \(n\) trials is given by the binomial probability formula. (A note on notation: \(P\) is used as a function or operator to denote the probability of events, while \(p\) represents a specific probability value, often associated with the success of a Bernoulli trial or a similar parameter in statistical distributions.)

Though it’s not critical to understand this, for those interested, the binomial probability formula is:

\[P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}\]

Where:

  • \(\binom{n}{k}\) is the binomial coefficient, representing the number of ways to choose \(k\) successes out of \(n\) trials.
  • \(p\) is the probability of success on a single trial.
  • \(1-p\) is the probability of failure on a single trial.

The notation \(\binom{n}{k}\), read as “n choose k,” represents the binomial coefficient in mathematics. It calculates the number of ways to choose \(k\) successes out of \(n\) trials, regardless of the order in which those successes occur. This coefficient is a fundamental part of combinatorics and probability theory, especially in the context of binomial distributions.

The binomial coefficient presented in the equation above is calculated using the formula:

\[\binom{n}{k} = \frac{n!}{k!(n-k)!}\]

where:

  • \(n!\) (n factorial): This represents the total number of ways to arrange \(n\) items in a specific order. For example, if \(n\) is 2, then \(n! = 2 \times 1 = 2\). As another example, if \(n\) is 6, then \(n! = 6 \times 5 \times 4 \times 3 \times 2 \times 1 = 720\).
  • \(k!\) (k factorial): Similar to \(n!\), this is the number of ways to arrange \(k\) successful outcomes. For example, if \(k = 1\), then \(1! = 1\). If \(k = 2\), then \(2! = 2 \times 1 = 2\).
  • \((n-k)!\): This term calculates the number of ways to arrange the remaining \(n-k\) items after \(k\) items have been chosen. If \(n = 2\) and \(k = 1\), then \((n-k) = 1\) and \(1! = 1\). If \(k = 2\), \((n-k) = 0\) and \(0! = 1\) (since \(0!\) is defined as 1).

Example 1: As an example, let’s imagine that we wanted to calculate \(\binom{2}{2}\), representing the number of ways to achieve 2 positive test results out of 2 attempts, assuming each test is a Bernoulli trial.

Using the formula for the binomial coefficient:

\[\binom{n}{k} = \frac{n!}{k!(n-k)!}\]

we set \(n=2\) and \(k=2\), which gives us:

\[\binom{2}{2} = \frac{2!}{2!(2-2)!} = \frac{2 \times 1}{(2 \times 1) \times 0!}\]

Remembering that \(0! = 1\) by definition, this simplifies to:

\[\binom{2}{2} = \frac{2}{2 \times 1} = 1\]

\(\binom{2}{2} = 1\) means there is exactly 1 way to achieve 2 successes (positive tests) out of 2 trials. This result is intuitive — when you have two tests and you’re looking for the number of ways to get positive results on both, there’s only one scenario where this happens: both tests must be positive.

Example 2: As a slightly more complex example, let’s consider a scenario where 6 tests are conducted and you want to know the number of ways to achieve 4 positive tests out of those attempts. We can apply the same binomial coefficient formula used earlier.

Using the formula for the binomial coefficient:

\[\binom{n}{k} = \frac{n!}{k!(n-k)!}\]

we set \(n=6\) and \(k=4\), which gives us:

\[\binom{6}{4} = \frac{6!}{4!(6-4)!} = \frac{6 \times 5 \times 4 \times 3 \times 2 \times 1}{(4 \times 3 \times 2 \times 1) \times (2 \times 1)}\]

Simplifying the factorials, we notice that \(4! = 4 \times 3 \times 2 \times 1 = 24\) and \((6-4)! = 2! = 2 \times 1 = 2\).

Thus, the calculation becomes:

\[\binom{6}{4} = \frac{720}{24 \times 2} = \frac{720}{48} = 15\]

\(\binom{6}{4} = 15\) means there are exactly 15 ways to achieve 4 successes (positive tests) out of 6 trials. This result provides us with the number of different combinations where 4 tests can be positive.
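If you’d rather not grind through the factorials by hand, R’s built-in choose() function computes binomial coefficients directly; here is a quick check of the two examples above:

choose(2, 2)  # 1 way to get 2 positives out of 2 tests
choose(6, 4)  # 15 ways to get 4 positives out of 6 tests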

Putting it all together

Now that we understand the binomial coefficient formula, let’s use the formula \(P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}\) to calculate the probabilities that \(k = 0\), \(k = 1\), and \(k = 2\) for our original example of 2 COVID-19 rapid antigen tests:

  • For \(k=0\) (No Positive Tests)

    • Here, we’re calculating the probability that both tests are negative.

    • Using the formula: \(\binom{2}{0}(0.5)^0(1-0.5)^{2-0} = 1 \times 1 \times (0.5)^2 = 0.25\)

    • Interpretation: There’s a 25% chance that none of the tests will return a positive result.

  • For \(k=1\) (One Positive Test)

    • This calculates the probability of exactly one test being positive, either the first or the second.

    • Using the formula: \(\binom{2}{1}(0.5)^1(1-0.5)^{2-1} = 2 \times 0.5 \times 0.5 = 0.5\).

    • Interpretation: There’s a 50% chance of getting exactly one positive result out of the two tests. This is because there are two scenarios that result in one positive and one negative test, and each has a probability of 0.25; summed together, they yield 0.5.

  • For \(k=2\) (Two Positive Tests)

    • This calculates the probability of both tests being positive.

    • Using the formula: \(\binom{2}{2}(0.5)^2(1-0.5)^{2-2} = 1 \times (0.5)^2 \times 1 = 0.25\).

    • Interpretation: There’s a 25% chance that both tests will return a positive result.
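As a quick check of these hand calculations, the same three probabilities can be computed with R’s dbinom() function (introduced more fully in the next example):

# P(X = 0), P(X = 1), and P(X = 2) for n = 2 trials with p = 0.5
dbinom(0:2, size = 2, prob = 0.5)  # returns 0.25 0.50 0.25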

That was easy enough to calculate by hand, but in scenarios where \(n\) and \(k\) are larger, automation is helpful (and, let’s be honest, you’re probably never going to calculate these by hand). Let’s revisit the example where you conducted 6 rapid COVID-19 tests (i.e., 6 Bernoulli trials), and calculate the probability that \(k = 0, k = 1, k = 2, k = 3, k = 4, k = 5, k = 6\). We’ll use R to automate this process. This R code snippet demonstrates how to use the binomial distribution via the dbinom() function to calculate and visualize the probabilities of observing a range of successful outcomes (in this case, positive test results) from a series of Bernoulli trials (COVID-19 tests). We’ll stick with the probability of success of .5 (i.e., for any one test, the probability of a positive result is .5).

# Load the tidyverse for tibble() and ggplot2 (it may already be loaded from earlier in the course)
library(tidyverse)

# Define parameters for the binomial distribution
n <- 6  # Number of trials
p <- 0.5  # Probability of success

# Calculate probabilities
probabilities <- dbinom(0:n, size = n, prob = p)

# Create a data frame for plotting
results_df <- tibble(
  positive_results = 0:n,
  probability = probabilities
)

results_df
Tip

Detailed code explanation for the curious

  1. Define Parameters for the Binomial Distribution:

    • n <- 6: This sets the number of trials in the binomial distribution to 6. A trial is an individual instance of an experiment (e.g., flipping a coin) – for our example, this represents the 6 rapid tests. This value is read in later when calculating the probability.

    • p <- 0.5: This sets the probability of success on each trial to 0.5. In a binomial context, “success” could mean any outcome of interest, such as flipping heads in a coin toss. In our example, success means the COVID-19 test is “positive”. This value is read in later when calculating the probability.

  2. Calculate Probabilities:

    • probabilities <- dbinom(0:n, size = n, prob = p): This line uses the dbinom() function to calculate the probability of each possible number of successes (from 0 to n) in n trials of a binomial experiment.

      • 0:n generates a sequence from 0 to n (here, 0 to 6), representing all possible counts of successes.

      • size = n specifies the total number of trials.

      • prob = p sets the probability of success in each trial.

    • The result is a vector of probabilities where each element corresponds to the probability of obtaining 0, 1, 2, …, up to n successes out of n trials.

  3. Create a Data Frame for Plotting:

    • results_df <- tibble(positive_results = 0:n, probability = probabilities): This line creates a data frame. The data frame is intended to organize and later facilitate displaying the data.

      • positive_results = 0:n creates a column named positive_results that lists the number of successes from 0 to n (inclusive).

      • probability = probabilities creates a column named probability that stores the computed probabilities corresponding to each count of successes from the probabilities vector.

    • This data frame, results_df, is structured to clearly display the number of successes alongside their associated probabilities, making it suitable for visualization or presentation in a table format.

We can plot the tabled results using a bar chart as follows:

results_df |> 
  ggplot(mapping = aes(x = factor(positive_results), y = probability)) +
  geom_col(fill = "#2F9599") +
  labs(title = "Binomial probabilities for 6 trials",
       x = "Number of positive test results",
       y = "Probability") +
  theme_minimal()

These results demonstrate the binomial distribution’s characteristic symmetry, especially evident when the probability (\(p\)) of success in each trial is 0.5. Notice that the distribution is bell-shaped, with the likelihood of outcomes decreasing as they move away from the median number of positive results (3 in this case). This point will become important later in the Module.

The Mean and Variance of a Binomial Distribution

Now that we’ve taken a look at a binomial distribution, let’s dig in a little further to describe the distribution using familiar descriptive statistics — namely the mean and variance.

Example 1
Imagine you’re playing a game of basketball, where each shot you take is an independent event with a chance of scoring (success) or missing (failure). If you know your success rate (which we can define as the probability of scoring on each shot) and how many shots you’re going to take, you can predict how many shots you’re likely to make. This prediction is essentially the mean or expected value of your basketball shots, representing the average outcome or central tendency over many repetitions of the game. Mathematically, it’s calculated as:

\[ \mu = n \cdot p \]

where:

  • \(\mu\) (mu) is the mean number of successes you’d expect. We use the Greek letter to denote this is a parameter of the entire population (i.e., a theoretical probability). When we talk about a “population” here, we mean the concept applies broadly, like to every series of shots you could possibly take under these conditions, not just one series of shots on one particular day.

  • \(n\) represents the total number of shots (trials) you plan to take.

  • \(p\) is the probability of scoring (success) each time you take a shot.

The mean gives us an idea of where the distribution is centered (remember our discussion of measures of Central Tendency from Module 2).

Variance tells us about the spread or variability of outcomes around this mean. In our basketball example, variance helps us to understand how consistent our scoring is: “Will most games see a score close to our expected mean, or could the number of successful shots vary widely from game to game?” In the context of a binomial distribution, variance is calculated as:

\[ \sigma^2 = n \cdot p \cdot (1 - p) \]

where:

  • \(\sigma^2\) is the variance of the binomial distribution.

  • \(n\) is the number of trials.

  • \(p\) is the probability of success on each trial.

  • \(1 - p\) is the probability of failure on each trial.

The variance gives us an idea of how spread out the distribution of successes is.

While variance is a key measure of dispersion, the standard deviation (\(\sigma\)), which is the square root of the variance, is often more intuitive as it is in the same units as the mean. For the binomial distribution, the formula for the standard deviation is:

\[ \sigma = \sqrt{n \cdot p \cdot (1 - p)} \]

The mean and variance (or standard deviation) of a binomial distribution provide valuable insights into the expected outcomes of a series of Bernoulli trials and the variability around these expected outcomes.

Building on the basketball example, imagine you’re playing in a basketball game, and based on past performance, you know that you have a 40% chance of making any given shot. In today’s game, you plan to take 30 shots.

Calculate the Mean: The mean (\(\mu\)) of a binomial distribution tells us the expected number of successes (in this case, successful shots) and is calculated as:

\(\mu = n \cdot p\)

Where:

  • \(n = 30\) shots

  • \(p = 0.40\) (the probability of making each shot)

\(\mu = 30 \cdot 0.40 = 12\)

This means you’d expect, on average, to make 12 shots in the game.

Calculate the Variance: The variance (\(\sigma^2\)) of a binomial distribution provides a measure of the variability in the number of successful outcomes around the mean. It is calculated as:

\(\sigma^2 = n \cdot p \cdot (1 - p)\)

For our basketball example:

\(\sigma^2 = 30 \cdot 0.40 \cdot (1 - 0.40) = 30 \cdot 0.40 \cdot 0.60 = 7.2\)

This variance indicates the extent to which your actual number of successful shots might differ from the expected 12 across different experiments.

Calculate the Standard Deviation: The standard deviation (\(\sigma\)) is the square root of the variance, giving us a measure of spread in the same units as the mean:

\(\sigma = \sqrt{\sigma^2} = \sqrt{7.2} \approx 2.68\)

This suggests that the number of successful shots you make in most games will typically vary by about 2.7 shots around the mean of 12 shots.

Conclusion: Based on these calculations, you can expect to make around 12 shots in a game, give or take about 2 to 3 shots.
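These calculations are easy to reproduce in R; a minimal sketch (the object names are arbitrary):

# Mean, variance, and standard deviation of a binomial distribution
# with n = 30 shots and p = 0.40 probability of scoring on each shot
n <- 30
p <- 0.40

mean_shots <- n * p            # 12
var_shots  <- n * p * (1 - p)  # 7.2
sd_shots   <- sqrt(var_shots)  # approximately 2.68

c(mean = mean_shots, variance = var_shots, sd = sd_shots)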

Example 2
Now, let’s apply these formulas for the mean and variance to our COVID-19 rapid antigen test example (with 2 tests). Given that \(n = 2\) in the example (2 repeated tests) and \(p\) = .5 (which we set earlier), the mean is:

\[ \mu = n \cdot p = 2 \cdot 0.5 = 1 \]

The variance is:

\[ \sigma^2 = n \cdot p \cdot (1 - p) = 2 \cdot 0.5 \cdot (1 - 0.5) = 0.5 \]

There is an equivalent method for obtaining the mean and variance of a binomial distribution, which is only really tenable if you have a small number of trials. However, I think it may be helpful in further understanding these concepts, so I’ll share it here. Let’s stick with the rapid antigen test for COVID-19 scenario. For 2 trials — earlier we calculated the probability of 0, 1 and 2 successes (i.e., positive tests) as 0.25, 0.5, and 0.25 respectively. The mean (or expected value) of a binomially distributed random variable can also be calculated by taking the sum of each possible outcome \(x\) multiplied by the probability of \(x\), which is represented by the formula below:

\[ \text{Mean} = \sum (x \times P(x)) \]

Where:

  • \(x\) represents the possible outcomes (in this case, 0, 1, and 2 successes).

  • \(P(x)\) is the probability of achieving \(x\) successes.

Thus, we can calculate the mean as follows:

\[ \text{Mean} = (0 \times 0.25) + (1 \times 0.5) + (2 \times 0.25) \]

By taking the sum of each possible outcome \(x\) multiplied by the probability of \(x\), we also find that the mean of the random variable is 1.0. This confirms that both methods — using the formula \(n \times p\) and summing over all \(x \times P(x)\) — yield the same result for the mean of a binomial distribution with \(p = 0.5\) and \(n = 2\).

Likewise, the variance of a binomially distributed random variable can be calculated using the formula:

\[ \text{Variance} = \sum ((x - \mu)^2 \times P(x)) \]

where:

  • \(x\) represents each possible outcome.

  • \(\mu\) is the mean (or expected value) of the distribution.

  • \(P(x)\) is the probability of \(x\) occurring.

Given that the mean (\(\mu\)) we calculated earlier is 1.0 and the probabilities for 0, 1, and 2 successes are 0.25, 0.5, and 0.25 respectively, the variance can be calculated as follows:

\[\text{Variance} = (0 - 1)^2 \times 0.25 + (1 - 1)^2 \times 0.5 + (2 - 1)^2 \times 0.25\]

This calculation considers the squared difference between each outcome and the mean, weighted by the probability of each outcome, yielding a variance of 0.5. This indicates the spread of the distribution around the mean.
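This weighted-sum approach is also straightforward to express in R; a minimal sketch using dbinom() to supply the probabilities:

# Possible outcomes and their probabilities for n = 2 tests with p = 0.5
x    <- 0:2
prob <- dbinom(x, size = 2, prob = 0.5)  # 0.25, 0.50, 0.25

mean_x <- sum(x * prob)               # 1.0
var_x  <- sum((x - mean_x)^2 * prob)  # 0.5

c(mean = mean_x, variance = var_x)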

How do probabilities vary as a function of \(n\) and \(p\)?

We just learned that the binomial distribution is defined by two parameters: \(n\) and \(p\). Let’s explore how these parameters influence the distribution, including its mean and variance, and walk through how to generate a cross table in R that shows the relationship between \(k\) (the number of successes) and \(p\) (the probability of success) for a fixed \(n\).

Let’s consider the scenario of conducting 6 COVID-19 tests, where \(n = 6\) represents the total number of tests conducted, and \(p\) varies to reflect different levels of test accuracy or the prevalence of the disease in the population being tested.

Generate a Cross Table in R

To explore how the number of positive results (\(k\)) varies with different probabilities of success (\(p\)), we can use R to generate a cross table for \(k\) ranging from 0 to 6 and \(p\) ranging from 0.1 to 0.9, still sticking with our fixed \(n\) (i.e., \(n = 6\) for 6 COVID-19 tests).

# Define parameters
n <- 6
p_values <- seq(0.1, 0.9, by = 0.1)

# Generate cross table of probabilities
results_df <- expand.grid(k = 0:n, p = p_values) |>
  mutate(probability = dbinom(k, size = n, prob = p))

# Pivot wider to create a matrix form where rows are p and columns are k
results_wide <- results_df |>
  pivot_wider(names_from = k, values_from = probability, names_prefix = "k=") |>
  arrange(p)
Tip

Detailed code explanation for the curious

This R code snippet is used to calculate and organize the probabilities of different outcomes in a binomial distribution across a range of probabilities of success. Here’s a breakdown of what each part of the code is doing:

  1. Define Parameters:

    • n <- 6: This line sets n as 6, which represents the number of trials in the binomial distribution.

    • p_values <- seq(0.1, 0.9, by = 0.1): This line creates a sequence of probability values from 0.1 to 0.9, increasing in increments of 0.1. These values represent different probabilities of success on each trial.

  2. Generate Cross Table of Probabilities:

    • results_df <- expand.grid(k = 0:n, p = p_values): The expand.grid() function is used to create a data frame, results_df, that contains all combinations of \(k\) (number of successes, ranging from 0 to 6) and \(p\) (the probabilities defined earlier).

    • mutate(probability = dbinom(k, size = n, prob = p)): This line adds a new column called probability — which is calculated using dbinom().

  3. Pivot Wider to Create a Matrix Form:

    • results_wide <- results_df |>: This line begins the process of transforming the data frame to a wider format, where each row represents a different probability of success, and each column represents the number of successes.

    • pivot_wider(names_from = k, values_from = probability, names_prefix = "k="): The pivot_wider() function is used here to reformat the data. It takes the names for the new columns from the k values (prefixed by “k=”), and the values in these columns come from the probability column.

    • arrange(p): Finally, the data is arranged in ascending order based on the probability p.

The end result, results_wide, is a data frame where each row corresponds to a different \(p\) value, and the columns (labeled as k=0, k=1, …, k=6) represent the probabilities of observing 0, 1, …, 6 successes, respectively, for that value of \(p\).

The results are presented below.

Values of P(X = k) for various levels of p
(rows: probability of success p; columns: number of successes k)

  p     k=0    k=1    k=2    k=3    k=4    k=5    k=6
 0.1   0.531  0.354  0.098  0.015  0.001  0.000  0.000
 0.2   0.262  0.393  0.246  0.082  0.015  0.002  0.000
 0.3   0.118  0.303  0.324  0.185  0.060  0.010  0.001
 0.4   0.047  0.187  0.311  0.276  0.138  0.037  0.004
 0.5   0.016  0.094  0.234  0.312  0.234  0.094  0.016
 0.6   0.004  0.037  0.138  0.276  0.311  0.187  0.047
 0.7   0.001  0.010  0.060  0.185  0.324  0.303  0.118
 0.8   0.000  0.002  0.015  0.082  0.246  0.393  0.262
 0.9   0.000  0.000  0.001  0.015  0.098  0.354  0.531

This approach demonstrates the dependency of the binomial distribution, including its mean and variance, solely on the parameters \(n\) and \(p\). By examining different values of \(p\) for a fixed \(n\), we can understand how the likelihood of various outcomes changes with the probability of success, offering insights into the distribution’s behavior under different conditions. Note that when \(p\) = .5 in the table above — the probabilities for each level of k match what we computed earlier.

Just as we did before, we can present this same information in a bar chart.

# Generate the plot using ggplot2
results_df |> 
  mutate(p = paste0("probability of success (p) = ", p)) |> 
  ggplot(mapping = aes(x = factor(k), y = probability, fill = factor(p))) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~p, scales = "free_y", ncol = 1) +
  labs(title = "Distribution of P(X = k) for Different Levels of p",
       x = "Number of Successes (k)",
       y = "Probability",
       fill = "Probability of Success (p)") +
  theme_minimal() +
  theme(legend.position = "none") 

Each facet shows the distribution of probabilities for obtaining \(k\) successes (from 0 to 6) given the specific probability of success (\(p\)) for the set of 6 trials. This visualization allows for an easy comparison across different values of \(p\), highlighting how the likelihood of various outcomes shifts as the probability of success changes.

The binomial distribution is symmetric only when \(p = 0.5\). This symmetry occurs because, at \(p = 0.5\), the probability of success and failure for each trial is the same, making outcomes equally likely on both sides of the distribution’s mean. As a result, the distribution of the number of successes in a fixed number of trials is mirrored around the mean value.

For values of \(p\) other than 0.5, the distribution becomes skewed:

  • If \(p < 0.5\), the distribution is skewed to the right, meaning there’s a longer tail on the right side of the distribution. This skewness reflects the greater likelihood of fewer successes (since successes are less likely than failures).

  • If \(p > 0.5\), the distribution is skewed to the left, with a longer tail on the left side. This indicates a higher likelihood of more successes (since successes are more likely than failures).

The Binomial Distribution & the Law of Large Numbers

Let’s consider one last example. Let’s imagine that your workplace is deeply concerned about the COVID-19 outbreak and your Human Resources department springs into action, compiling a list of employees and initiating a random testing protocol using rapid antigen tests for COVID-19. For the purpose of our exploration, we’ll assume a relatively low probability of a positive test result, setting our success probability at \(p\) = 0.1. Note that in this context, the trials (i.e., \(n\)) are not the repeated COVID-19 tests that you administered to yourself, but rather the testing of various randomly selected employees, making the trials (\(n\)) in this case the number of people sampled.

Initially, let’s focus on a small group of 10 randomly selected employees. With such a limited sample, the distribution of positive tests is anticipated to be highly skewed, primarily due to the low probability of success (\(p\) = 0.1). This skewness is natural; with few chances for success, outcomes will likely cluster at lower values (few or no positives), though there remains a slight possibility for more positive results. This scenario showcases the characteristics of a binomial distribution when \(n\) is small and \(p\) is low.

We can use R to produce the probability distribution when \(n\) = 10 and \(p\) = 0.1. Here, the bulk of the probability is concentrated at the lower end of the distribution.

# Parameters
n <- 10
p <- 0.1

# Generate data
k <- 0:n
probability <- dbinom(k, n, p)

# Create a data frame
data_df <- data.frame(k, probability)

# Plot the data
ggplot(data_df, aes(x = factor(k), y = probability)) +
  geom_bar(stat = "identity", fill = "#A7226E") +
  labs(title = "Binomial Distribution for n = 10, p = 0.1",
       x = "Number of Successes (k)",
       y = "Probability") +
  theme_minimal()

What unfolds as we increase \(n\)? Rather than sampling 10 employees, let’s see what happens if we sample 100 employees.

# Parameters
n <- 100
p <- 0.1

# Generate data
k <- 0:n
probability <- dbinom(k, n, p)

# Create a data frame
data_df <- data.frame(k, probability)

# Plot the data
ggplot(data_df, aes(x = factor(k), y = probability)) +
  geom_bar(stat = "identity", fill ="#A7226E") +
  scale_x_discrete(breaks = seq(0, n, by = 5)) + # Adjust by parameter as needed
  labs(title = "Binomial Distribution for n = 100, p = 0.1",
       x = "Number of Successes (k)",
       y = "Probability") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for better visibility

As we increase the number of sampled employees to 100, the narrative begins to shift. The distribution of positive tests starts to smooth out, edging closer to a more symmetrical shape. This phenomenon is described by the Law of Large Numbers (LLN). With a larger sample size, the proportion of positive results becomes more consistent and closer to the expected 10% (\(p = 0.1\)). The randomness and variability seen in our initial small sample decrease.
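To see the Law of Large Numbers in action directly, we can simulate actual test results with rbinom() rather than computing theoretical probabilities with dbinom(). The sketch below (the seed and the particular sample sizes are arbitrary choices) shows the observed proportion of positive tests settling closer to the true value of 0.1 as more employees are tested:

set.seed(42)  # arbitrary seed for reproducibility
p <- 0.1
sample_sizes <- c(10, 100, 1000, 10000)

# For each sample size, simulate one batch of tests and record the
# observed proportion of positive results
positives <- rbinom(length(sample_sizes), size = sample_sizes, prob = p)
observed_proportion <- positives / sample_sizes

tibble(n = sample_sizes, observed_proportion = observed_proportion)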

Taking another leap, let’s see what happens when the sample size reaches 1,000.

# Parameters
n <- 1000
p <- 0.1

# Generate data
k <- 0:n
probability <- dbinom(k, n, p)

# Create a data frame
data_df <- data.frame(k, probability)

# Plot the data
ggplot(data_df, aes(x = factor(k), y = probability)) +
  geom_bar(stat = "identity", fill = "#A7226E") +
  scale_x_discrete(breaks = seq(0, n, by = 50)) + # Adjust by parameter as needed
  labs(title = "Binomial Distribution for n = 1000, p = 0.1",
       x = "Number of Successes (k)",
       y = "Probability") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for better visibility

As demonstrated in the graph above, with this very large sample size the distribution not only smooths out further but also begins to resemble the familiar bell shape of a normal distribution.

Putting it all together
The detailed explanation of the Binomial distribution and its properties might seem overwhelming, and it’s okay if you don’t grasp every detail right away. However, there are a few key takeaways that will be crucial for our understanding of statistical modeling and inference as we move forward.

  1. Central Limit Theorem (CLT): One of the most important principles in statistics is the Central Limit Theorem. It tells us that as we increase the number of trials \(n\), while keeping the probability of success \(p\) constant, the distribution of the sample mean (or the proportion of successes) will tend towards a normal (bell-shaped) distribution. This happens regardless of the original shape of the population distribution. So, even if we’re starting with a binary outcome (like a coin flip), the overall distribution of results will become more symmetric and normal as we conduct more trials.

  2. Application of Normality: The phenomenon we observe with the Binomial distribution becoming more normal as n increases is an illustration of the CLT. This tendency towards normality allows us to use a wide range of statistical tools that assume normality. These tools will be important when we make inferences about data later in the course, whether we’re working within a frequentist or Bayesian framework.

  3. Implications for Statistical Inference: The key point to understand is that the CLT explains why the distribution of the sum (or average) of many independent random variables tends toward a normal distribution. This concept is fundamental because it means that, in practice, the combined effect of many random variables will be approximately normally distributed, even if the original variables themselves are not. This property makes it possible to apply normal distribution-based techniques to a wide variety of data types.

Understanding these points is crucial for the inferential statistics that we’ll examine later in the course. If you take away nothing else from this module except for the importance of the Central Limit Theorem and its role in making data analysis manageable, you’ll have grasped one of the most important concepts in statistics.

Continuous Probability Distributions

Probability Density Function (PDF)

The Probability Density Function (PDF) illustrates the distribution of probabilities across a continuum of potential outcomes for a continuous random variable. Unlike discrete outcomes, which are countable, the outcomes for continuous variables are infinitely divisible, meaning the variable can assume any value within a specific range. Individuals’ height in inches is an example of a continuous random variable. The PDF is valuable for describing these variables’ behaviors by showing the likelihood of the variable falling within various intervals. Using intervals is crucial because it’s more practical to know the probability of a variable falling within a range of heights (e.g., between 74.0 and 74.9 inches) rather than at a single, precise height (e.g., exactly 74.4598 inches). This approach helps in understanding and interpreting the spread and concentration of continuous data more effectively.

Definition

I’m going to provide some equations in this section to provide a formal definition — but it’s not necessary to understand and study them. The concepts are the important element here.

Formally, the PDF, also written as \(f(x)\), of a continuous random variable \(X\) (e.g., height) is a function that satisfies the following properties:

  1. Non-negativity: \(f(x) \geq 0\) for all \(x\) in the domain of \(X\). This just means the density function can never take on a negative value, aligning with the concept that probabilities are always positive or zero. For example, for every height \(f(x) \geq 0\). This should be intuitive because it doesn’t make sense to have a negative probability of being a certain height. A negative value would imply a negative likelihood, which is nonsensical (e.g., there can’t be less than a 0 probability that someone will be between 50 and 55 inches tall).

  2. Area Equals 1: The total area under the PDF curve and above the x-axis equals 1. Mathematically, this is expressed as \(\int_{-\infty}^{\infty} f(x) dx = 1\). Though complicated looking, it’s just important to know that the total probability of all possible outcomes for a random variable adds up to 1. So, if you consider all possible values a continuous variable could take, the likelihood that the variable will take on some value within that range is a complete certainty (or 100%).

  3. Probabilities for Intervals: The probability that \(X\) falls within an interval \(a \leq X \leq b\) is given by the integral of \(f(x)\) over that interval: \(P(a \leq X \leq b) = \int_{a}^{b} f(x) dx\), for example, the probability that someone is between 74.0 (a) and 74.9 (b) inches tall. This differs from discrete random variables, where probabilities are assigned to specific outcomes rather than intervals.

  4. Density on the Y-Axis: In the PDF, the y-axis represents the probability density of the continuous random variable at each value of \(x\). This density is not a probability itself but indicates how densely the probability mass is distributed around \(x\). The area under the PDF curve between two points on the x-axis corresponds to the probability of the random variable falling within that interval. Because of this, the units of the y-axis in a PDF plot are “probability per unit of \(x\)” (for example, probability per inch in the height example).

An example

Imagine you’re analyzing the duration of visits to an online newspaper (e.g., the New York Times), measured in minutes. This duration is a continuous random variable because a visit can last any length of time within a range. The PDF for this variable might show a peak at around 15 minutes, indicating a high density (and thus a higher probability) of visits lasting close to 15 minutes. As the curve falls off towards longer durations, the density decreases, indicating that longer visits become progressively less likely.

The value of the PDF at any specific point does not give the probability of the variable taking that exact value (in continuous distributions, the probability of any single, precise outcome is 0) but rather indicates the density of probability around that value. When considering a range of values, the PDF allows us to calculate the likelihood of the variable falling within that range by integrating the PDF over that interval. For example, the probability of a visitor spending between 4 and 6 minutes on the website is approximately 0.057 — or about 5.7% of visitors spend 4 to 6 minutes on the news website (you’ll see how to calculate the probability of falling within a certain interval in just a bit). This is displayed in the graph below.

Key Takeaways

  • The PDF is crucial for understanding and modeling the behavior of continuous random variables.
  • It provides a graphical representation of where probabilities are “concentrated” across the range of possible values.
  • While the PDF itself doesn’t give probabilities for exact values, it enables the calculation of probabilities over intervals.

Cumulative Distribution Function (CDF)

The Cumulative Distribution Function (CDF) for a continuous random variable builds directly on the concept of the Probability Density Function (PDF) by providing a cumulative perspective on the distribution of a continuous random variable. While the PDF shows the density of probabilities at different points, the CDF aggregates these probabilities to give us a comprehensive view of the likelihood of a variable falling below a certain threshold.

Definition of the CDF

The CDF, denoted as \(F(x)\), for a continuous random variable \(X\), is defined as the probability that \(X\) will take a value less than or equal to \(x\), that is, \(F(x) = P(X \leq x)\). This function starts at 0 and increases to 1 as \(x\) moves across the range of possible values, reflecting the accumulation of probabilities.

Key Characteristics of the CDF

  1. Monotonically Increasing: The CDF is always non-decreasing, meaning it never decreases as \(x\) increases. This reflects the accumulating nature of probabilities as we consider larger and larger values of the random variable.
  2. Bounds: \(F(x)\) ranges from 0 to 1. At the lower end of the range, \(F(x) = 0\) indicates that no values are less than the minimum value. At the upper end, \(F(x) = 1\) signifies that all possible values of the random variable have been accounted for.
  3. Slope Indicates Density: The slope of the CDF at any point is related to the density at that point in the PDF. Where the PDF is high (indicating a high probability density), the CDF rises steeply, reflecting a rapid accumulation of probability.
  4. Y-Axis as Cumulative Probability: For the CDF, the y-axis denotes the probability that the random variable \(X\) takes a value less than or equal to \(x\). As you move from left to right along the x-axis, the y-value at any point \(x\) gives the total probability of \(X\) being in the range up to and including \(x\). The y-axis in a CDF plot ranges from 0 to 1, where 0 represents the probability that the random variable takes on values less than the minimum of its range (which, in practical terms, is zero probability), and 1 represents the certainty that the random variable takes on a value within its entire possible range.

Applying CDF to the News Website Example

In the context of analyzing time spent on a website, the CDF can answer questions such as, “What proportion of visitors spend less than 10 minutes on the site?” To find this, you would look at the value of \(F(10)\), which gives the cumulative probability of a visit lasting 10 minutes or less.

Graphically, the CDF starts at 0 (indicating that 0% of visitors spend no time on the site) and progresses toward 1 (indicating that 100% of visitors are accounted for) as the duration increases. By plotting the CDF, we can easily visualize how the distribution of visit durations accumulates. For example, if \(F(10) = 0.25\), it means that 25% of the visitors spend 10 minutes or less on the website (see the black shaded area in the graph below).
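As a sketch of how this could look in R (the log-normal distribution and its parameters below are assumptions chosen purely for illustration, not the distribution used to generate this module’s figures), we can simulate visit durations and estimate \(F(10)\) with the empirical CDF function ecdf():

set.seed(7)  # arbitrary seed

# Simulate 10,000 visit durations (in minutes) from an assumed log-normal distribution
visit_minutes <- rlnorm(10000, meanlog = log(15), sdlog = 0.6)

# ecdf() returns a function: F_hat(x) is the proportion of simulated visits <= x minutes
F_hat <- ecdf(visit_minutes)
F_hat(10)  # roughly 0.25 with these assumed parameters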

Key Insights from the CDF for our Example

  • The CDF (Cumulative Distribution Function) shows the probability that a visitor spends a certain amount of time or less on the website.
  • It provides a cumulative perspective, allowing us to determine the total proportion of visitors who spend up to a certain duration on the website.
  • By observing the CDF curve, we can understand how visit durations accumulate: a flatter section indicates a slow accumulation of visitors with those visit durations, while a steeper section indicates a rapid accumulation of visitors with those visit durations.

Together, the PDF and CDF offer complementary views of the distribution of continuous random variables like time spent on a website. The PDF shows the likelihood of specific durations, highlighting where data points are most and least concentrated. The CDF, on the other hand, provides the cumulative probability, helping us understand the total proportion of observations that fall within or below specific ranges. By using both, we gain a comprehensive understanding of the variable’s distribution and behavior.

The Mean and Variance of a Continuous Random Variable

In many practical scenarios, you will have a dataset and compute the mean and variance of a continuous random variable (like heights) directly from the raw data. This is common in empirical data analysis, where you’re dealing with a finite number of observations (like the heights of 1,000 people). In such cases, we calculate the mean and variance using simple arithmetic formulas based on the data we have.

However, just as we have methods for calculating the mean and variance using a binomial distribution, there are also methods for finding the mean and variance of a continuous distribution. In these cases, the integral approach is used for theoretical and analytical purposes. It allows us to understand the underlying distribution of a continuous random variable when we don’t have specific data points but instead have knowledge about the overall behavior of the variable. This method helps us derive the distribution’s properties directly from its probability density function.

For example, if we know that a certain variable follows a normal distribution with a specific mean and standard deviation, the probability density function (PDF) helps us understand its behavior without needing to gather a large sample. We can use the PDF to describe the distribution’s properties, such as its mean and variance, through mathematical expressions.

The mean, or expected value, of a continuous random variable \(X\) with PDF (also denoted as \(f(x)\)) over its range is given by the formula:

\[ \mu = E(X) = \int_{-\infty}^{\infty} x f(x) dx \]

This looks frightful (and it’s certainly not necessary to commit to memory), but let’s break it down, since understanding each element will make the formula much less intimidating:

  1. \(x\): This represents the variable of the function, which is the value of the random variable \(X\). For example, some scores for \(x\) for the website example are 10.2 minutes, 11.5 minutes, 2.1 minutes.

  2. \(f(x)\): This is the probability density function (PDF) of \(X\). It describes the relative likelihood for this random variable to take on a given value. The function \(f(x)\) essentially tells us how “dense” the probability is at each point \(x\).

  3. \(\int_{-\infty}^{\infty}\): This integral sign, along with the limits of \(-\infty\) to \(\infty\), indicates that we’re considering all possible values of \(X\) from negative infinity to positive infinity, which is typical when the random variable can take any value along the real number line.

  4. \(dx\): This is a small slice along the x-axis. When we integrate a function, we’re essentially summing up infinitely many infinitesimally small values of the function \(x f(x)\) across the range of \(X\). The \(dx\) denotes that this summing (or integration) is happening over the variable \(x\).

So, the whole expression \(\int_{-\infty}^{\infty} x f(x) \, dx\) is calculating the expected value (mean) of the random variable \(X\), weighted by its probability density. It’s finding the “balance point” of the distribution defined by \(f(x)\).
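
If you’d like to see this integral in action, here is a small numerical sketch. The website example doesn’t specify a particular PDF, so for concreteness we assume a normal PDF with mean 63.8 and standard deviation 2.9 (the female-height parameters used later in this Module); R’s integrate() function carries out the integration numerically.

# Illustrative only: assume f(x) is a normal PDF with mean 63.8 and sd 2.9
f <- function(x) dnorm(x, mean = 63.8, sd = 2.9)

# Integrate x * f(x) over a range wide enough to contain essentially all of
# the probability for this particular PDF; the result recovers the mean
integrate(function(x) x * f(x), lower = 40, upper = 90)   # approximately 63.8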

The variance of a continuous random variable measures the spread of its possible values around the mean, and is calculated as:

\[ \sigma^2 = Var(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x) dx \]

In simple terms, the formula states:

  • Take every possible value of \(X\),

  • Look at how far it is from the mean \((x - \mu)\),

  • Square this distance \((x - \mu)^2\) to emphasize larger deviations and treat positive and negative deviations equally,

  • Weight these squared distances by how likely each value is \(f(x)\),

  • Then average these weighted squared distances by integrating \(\int_{-\infty}^{\infty} ... dx\) across all possible values to get a single number that represents the overall dispersion of the distribution.
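
Here is the same kind of numerical sketch for the variance, again assuming (purely for illustration) a normal PDF with mean 63.8 and standard deviation 2.9. The integral returns the variance, and its square root returns the standard deviation.

# Illustrative only: the same assumed normal PDF as in the sketch above
f  <- function(x) dnorm(x, mean = 63.8, sd = 2.9)
mu <- 63.8
v  <- integrate(function(x) (x - mu)^2 * f(x), lower = 40, upper = 90)$value
v          # approximately 8.41, which is 2.9^2 (the variance)
sqrt(v)    # approximately 2.9 (the standard deviation)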

The formulas for the mean and variance seem complex, and it’s not necessary to memorize them; the important point is that the formulas shift from the discrete sums (which we used in the discrete example earlier) to integrals, acknowledging the continuum of potential values. The PDF delineates the relative likelihood of \(X\) assuming specific values, with the integral calculations extending over all conceivable values \(X\) might take. While we can’t pinpoint exact outcomes as with discrete variables (e.g., it doesn’t make sense to find the probability of spending exactly 7 minutes and 5 seconds on the website), we can ascertain the probabilities of landing within specific intervals. You’ll get to practice this later in the Module under the context of a normally distributed variable.

The Normal Distribution

Imagine standing at the peak of a smooth, perfectly symmetrical hill, with the ground rolling gently down on either side. This hill is not just any hill — it’s a graphical representation of the normal distribution, also fondly known as the “bell curve” due to its distinctive shape.

Why does the normal distribution matter so much? Everyday phenomena, from the heights of people in a city to test scores on an exam, tend to cluster around a central value, with fewer occurrences the farther you move away from this center. The normal distribution captures this universal pattern, making it a cornerstone of probability and statistics.

But what sets the normal distribution apart from the distributions we’ve explored so far, like the binomial distribution? First off, while the binomial distribution deals with discrete outcomes (like flipping a coin), the normal distribution smoothly handles continuous data—measurements that can take on any value within a range. This makes it incredibly versatile.

Furthermore, the normal distribution is defined by two key parameters: the mean (μ) and the standard deviation (σ). The mean guides us to the center of the hill, indicating where the average value lies, while the standard deviation tells us how spread out the data is, shaping the width of the bell curve. A smaller standard deviation means a steeper hill, indicating that most data points are close to the mean. Conversely, a larger standard deviation results in a wider hill, reflecting greater variability in the data.

What truly makes the normal distribution fascinating is its role in the Central Limit Theorem — a concept that we will dig into later in the course. For now, gaining a thorough understanding of the properties of the normal distribution will set us up for a smooth transition to this more complex application of the normal distribution when the time comes.

To set the stage, please watch the following Crash Course Statistics videos on Z-scores/Percentiles and the Normal Distribution.




An Example

Let’s explore the properties of a normal distribution. Recall that earlier in the course we used the National Health and Nutrition Examination Study (NHANES) data frame. This national study is conducted by the Centers for Disease Control and Prevention, and is one of their surveillance initiatives to monitor the health and morbidity of people living in the US. In Module 2 we used these data to examine systolic blood pressures. Now, we’ll use the same data frame to explore heights of U.S. adults.

We’ll use the following variables:

Variable Description
sex.f Biological sex of respondent
age Age of respondent
height_cm Height of respondent, in centimeters


The code below imports the data frame that we’ll explore in this section:

nhanes <- read_rds(here("data", "nhanes_heights.Rds")) 

We’re going to use a subset of the data. Let’s construct a data frame that includes just the cases and variables that we need for this module. We have a variety of tasks that we need to accomplish:

  1. First, we’re going to focus on people 20 years or older; we can accomplish this with filter() to choose just the rows of data (i.e., people) where age is greater than or equal to 20.

  2. Then we’ll transform height_cm, which is expressed in cm, to a new variable called ht_inches that is expressed in inches using mutate().

  3. We will subset columns with select() to keep just the variables needed.

  4. Finally, we’ll use drop_na() to exclude cases missing on ht_inches or sex.f (the two variables we will use in this Module).

height <- 
  nhanes |> 
  filter(age >= 20) |> 
  mutate(ht_inches = height_cm/2.54) |> 
  select(ht_inches, sex.f) |>
  drop_na()

Visualize the data

Let’s begin our descriptive analyses by examining the distribution of heights by sex.

height |> 
  ggplot(mapping = aes(x = ht_inches, group = sex.f, fill = sex.f)) +
  geom_density() +
  theme_bw() +
  labs(title = "What is the distribution of heights for males and females?",
       fill = "Sex",
       y = "Density",
       x = "Height in inches")

The highest density of data points for males is around 69 inches and for females around 64 inches. But, we have a reasonably large range in both groups, representing the large variation in heights of people in our population. As we would expect, males are, on average, taller than females.

Let’s use skim() to calculate the descriptive statistics for height by sex.

height |> 
  group_by(sex.f) |> 
  skim()
Data summary
Name group_by(height, sex.f)
Number of rows 3561
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables sex.f

Variable type: numeric

skim_variable sex.f n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ht_inches female 0 1 63.76 2.91 52.95 61.80 63.82 65.63 72.64 ▁▂▇▅▁
ht_inches male 0 1 69.19 3.01 59.88 67.32 69.02 71.06 78.90 ▁▃▇▃▁


Let’s consider males first. The mean height is 69.2 inches and the standard deviation is 3.0 inches. The median is listed under p50, referring to the 50th percentile score — 69.0 inches in our example. The range is ascertained through the p0 and p100 scores, referring to the 0th (lowest) and 100th (highest) percentile score — 59.9 inches to 78.9 inches in our example.

To make these percentile scores a little more concrete, let’s randomly select 11 males from the data frame. Once the 11 are selected, I’ll sort by height and create a variable denoting their rank order from shortest to tallest.

set.seed(1234567)

pick11 <- height |> 
  filter(sex.f == "male") |> 
  select(ht_inches) |> 
  sample_n(11) |> 
  arrange(ht_inches) |> 
  mutate(ranking = as.integer(rank(ht_inches)))

Here’s a table of these 11 selected males.

ht_inches ranking
66.25984 1
66.29921 2
67.48031 3
68.18898 4
68.34646 5
69.44882 6
69.76378 7
70.23622 8
70.94488 9
72.48031 10
72.67717 11


We see that the shortest male in this random sample is 66.3 inches, and the tallest male is 72.7 inches. Let’s request the skim output for this subset of males.

pick11 |> skim()
Data summary
Name pick11
Number of rows 11
Number of columns 2
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ht_inches 0 1 69.28 2.21 66.26 67.83 69.45 70.59 72.68 ▇▅▅▅▅
ranking 0 1 6.00 3.32 1.00 3.50 6.00 8.50 11.00 ▇▅▅▅▅


Take a look at the row for ht_inches. The p0 score is the 0th percentile of the data points — which represents the shortest male in the subsample. The p100 score is the 100th percentile of the data points — which represents the tallest male in the subsample. Notice that the 50th percentile is the middle score (a ranking of 6 — where 5 scores fall below and 5 scores fall above) — that is, 69.4 inches (see that this height corresponds to ranking = 6 in the table of the 11 male heights printed above). Note that 50% of the scores in our subsample are below the 50th percentile and 50% of the scores in our subsample are above the 50th percentile.
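
As a quick check in code, the familiar min(), median(), and max() functions return the same three heights that skim() labels p0, p50, and p100 for this subsample.

pick11 |> 
  summarize(shortest = min(ht_inches),    # matches p0 in the skim() output
            middle   = median(ht_inches), # matches p50
            tallest  = max(ht_inches))    # matches p100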

Going back now to the full sample of males, let’s explore the full distribution of heights for males. Inside the summarize() function, we can use the quantile() function to identify any quantile that we are interested in. For example, we might have an interest in finding the height associated with the 2.5th percentile and the 97.5th percentile.

height |> 
  filter(sex.f == "male") |> 
  select(ht_inches) |> 
  summarize(q2.5 = quantile(ht_inches, probs = .025),
            q97.5 = quantile(ht_inches, probs = .975))

This indicates that for the observed heights of males in the NHANES study — a male who is about 63 inches is at the 2.5th percentile of the distribution, while a male who is about 75 inches is at the 97.5th percentile. This means that about 2.5% of males are shorter than 63 inches, and 2.5% of males are taller than 75 inches. This also tells us that about 95% of males are between 63 and 75 inches tall.

We can mark these percentiles on our density graph of male heights:

height |> 
  filter(sex.f == "male") |> 
  ggplot(mapping = aes(x = ht_inches)) +
  geom_density(fill = "#00BFC4") +
  geom_vline(xintercept = 63.07087, linetype = "dashed") +
  geom_vline(xintercept = 75.07874, linetype = "dashed") +
  theme_bw() +
  labs(title = "What is the distribution of heights for males?",
       subtitle = "Dashed lines denote the 2.5th and 97.5th percentiles of the observed distribution",
       x = "Height in inches")

This density graph is akin to a Probability Density Function (PDF) for the heights of males. Here’s why:

  • PDF Definition: Recall that a PDF describes the relative likelihood for a continuous random variable to take on a given value. It shows how probability is distributed over the values of the variable.

  • Shape and Interpretation: The density graph depicts the distribution of male heights, where the area under the curve represents the total probability, which is always equal to 1. The shape of the curve shows where the values are most concentrated. Peaks in the graph indicate the heights where the probability density is highest.

  • Probabilities: While the exact probability of a specific height (e.g., exactly 70 inches) is technically zero in a continuous distribution, the PDF helps us understand the probability of height falling within an interval. For example, the area under the curve between 63 inches and 75 inches represents the probability of a male’s height falling within this range.

  • Percentiles: The vertical dashed lines at the 2.5th and 97.5th percentiles illustrate specific points in the distribution. These lines help visualize that 95% of the male heights lie between 63 and 75 inches.

By using this density graph, we get a visual representation of the distribution of heights, akin to a PDF, which helps us understand the likelihood and distribution of different height values in the sample.

Now, let’s take our analysis a step further by exploring the Cumulative Distribution Function (CDF) of male heights. Unlike the density graph, which shows the likelihood of specific height values, the CDF provides a cumulative perspective, illustrating the probability that a male’s height will fall at or below a particular value. By plotting the CDF using stat_ecdf(), we can easily determine the proportion of males whose heights are less than or equal to any given value, offering a comprehensive view of how heights accumulate across the population. This visualization helps us understand not just the distribution of heights but also the overall trend and distribution shape, making it easier to interpret percentiles and cumulative probabilities.

height |> 
  filter(sex.f == "male") |> 
  ggplot(mapping = aes(x = ht_inches)) +
  stat_ecdf(geom = "step", color = "#00BFC4") +
  stat_ecdf(geom = "area", fill = "#00BFC4", alpha = 0.5) +
  geom_vline(xintercept = 63.07087, linetype = "dashed") +
  geom_vline(xintercept = 75.07874, linetype = "dashed") +
  theme_bw() +
  labs(title = "Cumulative distribution of heights for males",
       subtitle = "Dashed lines denote the 2.5th and 97.5th percentiles of the observed distribution",
       x = "Height in inches",
       y = "Cumulative Probability")

In the CDF graph for male heights, we observe the cumulative distribution of heights in the NHANES study. The x-axis represents the height in inches, while the y-axis represents the cumulative probability, ranging from 0 to 1. The CDF curve, shown in blue, steps upwards as we move from left to right, indicating the increasing cumulative probability.

The filled area under the curve provides a visual representation of the accumulation of probabilities. For instance, the steepness of the curve shows how rapidly the cumulative probability increases for certain height ranges. Flatter sections of the curve indicate slower accumulation, meaning fewer males have heights in those ranges.

The dashed vertical lines mark the 2.5th and 97.5th percentiles. The line at approximately 63 inches shows that about 2.5% of males are shorter than this height. The line at around 75 inches indicates that 97.5% of males are shorter than this height, meaning only 2.5% are taller. Therefore, about 95% of male heights fall between 63 and 75 inches. By 80 inches, we’ve accounted for 100% of the sample.
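
For a numerical counterpart to the graph, base R’s ecdf() function builds the empirical CDF as a function that we can evaluate at any height. Below is a minimal sketch that evaluates it at the two percentile cut-points found earlier; the returned proportions are approximate.

# Build the empirical CDF from the observed male heights
F_male <- height |> 
  filter(sex.f == "male") |> 
  pull(ht_inches) |> 
  ecdf()

# Evaluate the empirical CDF at the 2.5th and 97.5th percentile cut-points
F_male(c(63.07087, 75.07874))   # approximately 0.025 and 0.975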

Are heights normally distributed among US adults?

A variable with a Gaussian3 (i.e., bell-shaped) distribution is said to be normally distributed. A normal distribution is a symmetrical distribution. The mean, median and mode are in the same location and at the center of the distribution.

Plotted below are two histograms of height in inches for participants in NHANES — the first for females and the second for males — and overlaid on each is a perfect normal curve. The resulting graphs, shown below, demonstrate that the distribution of heights for females and males is quite close to normal.

Recall from Module 2 that when a variable is normally distributed, we can use the Empirical Rule to describe its spread given the mean and standard deviation. In the graph below, the mean of a variable called \(z\) is 0 and the standard deviation is 1 (denoted on the x-axis). The Empirical Rule states that when a variable is normally distributed, we can expect that:

  • 68% of the data will fall within about one standard deviation of the mean
  • 95% of the data will fall within about two standard deviations of the mean
  • Almost all (99.7%) of the data will fall within about three standard deviations of the mean

Since we verified that female heights are normally distributed, let’s use the Empirical Rule to describe the distribution of female heights. The average height of females is 63.8 inches, with a standard deviation of 2.9 inches. The Empirical Rule states that 95% of the data will fall within about two standard deviations of the mean. Therefore, we expect 95% of all females to be between 58.0 inches and 69.6 inches tall (i.e., 63.8 \(\pm\) 2 \(\times\) 2.9). Likewise, only about 2.5% of females are expected to be shorter than 58.0 inches, and only about 2.5% of females are expected to be taller than 69.6 inches.
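
In code, the Empirical Rule bounds quoted above are just the mean plus and minus two standard deviations:

63.8 + c(-2, 2) * 2.9
[1] 58.0 69.6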

Take a moment to calculate the same quantities for the males in NHANES.

The PDF for a normal distribution

Understanding the distribution of a normally distributed variable, and estimating the probability of observing a value within a specific range, can be achieved by knowing just two parameters: the mean and the standard deviation. Remarkably, this process doesn’t require access to the raw data. By knowing the mean and standard deviation of a continuous random variable and confirming its normal distribution, we can utilize normal distribution tools to analyze aspects of its distribution. This approach allows for the study of the variable’s behavior and the likelihood of specific outcomes without needing the individual data points.

A continuous random variable, like height, can have an infinite number of possible values — think about a very precise height of an individual rather than a value rounded to the nearest inch (e.g. 74.25197 inches tall). Thus, the probability that a continuous random variable takes on any particular value is essentially 0. Therefore, in describing heights, it’s not useful to calculate the probability of an individual having some very precise height, but rather it’s more useful to find the probability that a continuous random variable falls in some interval (e.g., greater than 70 inches, or between 60 and 65 inches, or even between 70 and 71 inches). We can use a probability density function to calculate the probability of a score falling within any desired interval.

A probability density function (PDF) is a function associated with a continuous random variable (for example, heights of females in the US). It helps us to understand the probability of a continuous random variable falling in some interval of the density curve. In a PDF, the area under the curve is equal to 1, akin to the density plots that we studied in Module 3. The graph below approximates a PDF for female heights.


By utilizing the PDF of a normal distribution, we can gauge the probability of a case (e.g., a female) having a score within a certain interval, given the variable is normally distributed. This is what we just did when we applied the Empirical Rule to discern that 95% of all female heights fall within the interval of 58.0 inches and 69.6 inches. In practical terms, this means only about 2.5% of females will be shorter than 58 inches, and a mere 2.5% will be taller than 69.6 inches. Moreover, 95% of females will be between 58.0 and 69.6 inches tall. These properties are visually represented in the graph below.



Using the Empirical Rule, we managed to determine the middle 95% of female heights. However, there’s also a convenient function in R, named qnorm(), that helps us calculate this range. The qnorm() function, standing for the quantile of the normal distribution, allows us to find the quantile value for which the probability, denoted as \(p\), of observing a value at or below that quantile equals a specified amount, given a normal distribution with a specific mean and standard deviation.

To better understand this concept, examine the graph below that showcases a normal distribution. The intervals linked to the Empirical Rule are marked, but there’s also additional information in the form of percentiles. These percentiles denote the percentile rank of a given score on the x-axis of the density plot, i.e., the percentage of scores in the frequency distribution that are lower than the score, which is a representation of the CDF. Percentile ranks are ubiquitous in everyday life. For example, your pediatrician likely calculated your percentile rank for height, or you might have received a percentile score when taking the SAT. Now take a look at the figure and note that at a standard deviation of -2, the cumulative percentages row indicates that 2.3% of the distribution will fall below -2 standard deviations, and at a standard deviation of +2, the cumulative percentages row indicates that 97.7% of the distribution will fall below +2 standard deviations. Building on the latter, if 97.7% of the distribution is below +2 standard deviations, then 2.3% of the distribution is above +2 standard deviations.
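
These cumulative percentages can be confirmed with pnorm(), the normal CDF function in R that we will work with more formally later in this Module:

pnorm(c(-2, 2), mean = 0, sd = 1)   # approximately 0.023 and 0.977, i.e., 2.3% and 97.7%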

The qnorm() function leverages these percentile ranks (which are a type of quantile), using the CDF to find quantiles based on percentile ranks. For instance, if we want to find the height marking the 2.5th percentile for female heights, we could use the following code. Please note that 2.5/100 = .025, therefore, p = .025 in the code signifies that we are seeking the height below which 2.5% of heights fall (and 97.5% exceed). Also, the mean and sd in the code represent the mean and standard deviation of our normally distributed variable — female heights in the population. Hence, the code qnorm(p = .025, mean = 63.8, sd = 2.9) means we’re looking for the height below which the lowest 2.5% of all heights fall, given the distribution of heights is normal with a mean of 63.8 and standard deviation of 2.9. This application of qnorm() directly leverages the properties of the CDF to compute the desired quantile.

qnorm(p = .025, mean = 63.8, sd = 2.9)
[1] 58.1161

Executing this function yields 58.1, so we expect 2.5% of females to be shorter than 58.1 inches, and 97.5% to be taller than 58.1 inches. This quantity is depicted in the graph below.

Let’s try another. Here, we’ll find the score for height in which 97.5% of scores will fall below. In this example, we set p to .975 (again 97.5/100 = .975). Using the code qnorm(p = .975, mean = 63.8, sd = 2.9) means we’re looking for the height below which 97.5% of all heights fall, assuming that the distribution of heights is normal with mean 63.8 and standard deviation 2.9. Note that this also means that 2.5% of the population would be expected to be taller than the desired value, under the same assumptions.

qnorm(p = .975, mean = 63.8, sd = 2.9)
[1] 69.4839

Execution of the function yields 69.5, indicating that we expect 97.5% of females to be shorter than 69.5 inches tall (and 2.5% will be taller). This quantity is depicted in the graph below.

We can compare these to the values we calculated using the Empirical Rule (that is, by computing 63.8 \(\pm\) 2 \(\times\) 2.9). The hand calculation gave 58.0 for the lower bound and 69.6 for the upper bound, while qnorm() gave 58.1 and 69.5. They match up quite closely; the small difference arises because qnorm() is more precise. The Empirical Rule is a useful guideline for understanding the spread of data in a normal distribution, but its claim that 95% of the data falls within 2 standard deviations of the mean is an approximation. In actuality, for a normal distribution, capturing the middle 95% of the data requires going about 1.96 standard deviations from the mean.

Where does this 1.96 value come from?

  1. Standard Normal Distribution: First, understand that the standard normal distribution is a special case of the normal distribution where the mean is 0 and the standard deviation is 1. Values in this distribution are referred to as z-scores.

  2. Percentiles and Probability: When we talk about capturing the middle 95% of the data in a normal distribution, we’re really saying that we want the boundaries where 2.5% of the data lies below (left tail) and 2.5% lies above (right tail). This leaves 95% of the data in between these two tails.

  3. Introduction to Alpha: The term “alpha” (\(\alpha\)) refers to the proportion of the distribution that lies in the tails. For a standard normal distribution, if we want to capture the middle 95%, the remaining 5% is split equally between the two tails, with 2.5% in each tail. So, \(\alpha\) is 0.05 in this case, with 0.025 (2.5%) in each tail.

  4. Finding \(z_{\alpha/2}\): The notation \(z_{\alpha/2}\) is used to denote the z-score that corresponds to the cumulative probability of \(\alpha/2\) in the left tail of the distribution. For \(\alpha = 0.05\), \(\alpha/2\) is 0.025. Using statistical tables (like the z-table for the standard normal distribution) or software functions (like qnorm() in R), we find that the z-score corresponding to the cumulative probability of 0.025 is -1.96. Similarly, the z-score corresponding to the cumulative probability of 0.975 (which is \(1 - 0.025\)) is +1.96. Therefore, \(z_{\alpha/2}\) is 1.96 in this context.

For example, here’s how you can use R to find the appropriate z-scores that denote the middle 95% of a standardized normal distribution (i.e., mean = 0, standard deviation = 1).

qnorm(p = c(.025, .975), mean = 0, sd = 1)
[1] -1.959964  1.959964

Now, with this in hand, if we use the more precise value for the middle 95% (~1.959964) in our hand calculations, we obtain the same values produced by qnorm().

63.8 \(\pm\) 1.96 \(\times\) 2.9 = (58.1, 69.5).

It’s also important to note that we aren’t limited to finding the middle 95%. For example, we might be interested in finding the middle 99%. For the middle 99%, we substitute 1.96 with 2.58 as 99% of values of a normal distribution fall within ~2.58 standard deviations from the mean.

Here’s a helpful tip: for any desired coverage, the qnorm()4 function in R can find the critical value by which you multiply the standard deviation.

# for 95% coverage - take 1 - .95, and then divide by 2 to get middle 95%
qnorm(p = (1-.95)/2, lower.tail = FALSE)
[1] 1.959964
# for 99% coverage - take 1 - .99, and then divide by 2 to get middle 99%
qnorm(p = (1-.99)/2, lower.tail = FALSE)
[1] 2.575829
# for 80% coverage - take 1 - .80, and then divide by 2 to get middle 80%
qnorm(p = (1-.80)/2, lower.tail = FALSE)
[1] 1.281552

We can use qnorm() in this way to calculate any interval that we wish. For example, to get the middle 80% we’d multiply the standard deviation by 1.28 (the value calculated using qnorm() above for 80% coverage) and then add and subtract that value from the mean: 63.8 \(\pm\) 1.28 \(\times\) 2.9 gives about 60.1 for the lower bound and 67.5 for the upper bound.
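
Equivalently, as a check on the hand calculation, qnorm() can return both cut-points at once by requesting the 10th and 90th percentiles directly:

qnorm(p = c(.10, .90), mean = 63.8, sd = 2.9)   # approximately 60.1 and 67.5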


Let’s Practice

Let’s practice these techniques with a few examples.

Example 1

How tall does a female need to be in order to be taller than 75% of females in the population?

qnorm(p = .75, mean = 63.8, sd = 2.9)
[1] 65.75602

A female needs to be about 65.7 inches tall in order to be taller than 75% of the female population.

Example 2

What range of heights encompasses the middle 50% of females?

We need two values here, the lower and upper bound of the specified interval.

qnorm(p = .25, mean = 63.8, sd = 2.9)
[1] 61.84398
qnorm(p = .75, mean = 63.8, sd = 2.9)
[1] 65.75602

25% of females are shorter than ~61.8 inches, 25% are taller than ~65.7 inches. Thus, the interval for the middle 50% is about 61.8 inches to 65.7 inches tall.

Example 3

We might also be interested in framing a question like this: “What is the probability that a randomly selected female from the population will be shorter than 65 inches tall?” This is a little different than the previous examples. We need a different function to solve this problem — the pnorm() function in R can help us to calculate this. While qnorm() returns the quantile associated with a given probability, pnorm() returns the probability associated with a given quantile. To employ the function, supply the value of the variable of interest (65), as well as the mean and the standard deviation (sd) of the variable under study.

pnorm(q = 65, mean = 63.8, sd = 2.9)
[1] 0.6604872

The probability is .66 that a randomly selected female would be less than 65 inches tall. Or, we can expect that about 66% of females are shorter than 65 inches. You could equivalently say that a female who is 65 inches tall is at the 66th percentile for the distribution of female heights.

Example 4

Let’s try one more. Let’s ask the question: “What is the probability that a randomly selected female is taller than 70 inches?” Here, we want the proportion of the distribution above 70, so the upper tail. The default of qnorm() and pnorm() is to calculate the lower tail of the distribution. However, if you desire the upper tail, you simply need to include the argument lower.tail = FALSE, which then provides the upper tail.

pnorm(q = 70, mean = 63.8, sd = 2.9, lower.tail = FALSE)
[1] 0.01626117

The probability is about .016 that a randomly selected female would be taller than 70 inches. Or, we can expect that about 1.6% of females are taller than 70 inches.

In contrasting qnorm() and pnorm(), we find that qnorm() will return a score associated with the normally distributed variable that is likely to encompass some specified area of the distribution (e.g., minimum height to be in the top 10%). On the flip side, pnorm() will return the area of the distribution that is encompassed by some interval of scores (e.g., percent of the distribution that is above 65 inches).

The Standard Normal Distribution

Recall from Module 2 that z-scores are a standardized version of a variable that has been transformed by first subtracting the mean, and then dividing by the standard deviation. Thus, a z-score has a mean of 0 and a standard deviation of 1. A z-score essentially tells us how many standard deviations away a data point is from the mean. For instance, a z-score of +1.0 indicates a data point that is one standard deviation above the mean, and a z-score of -1.5 denotes a data point that is 1.5 standard deviations below the mean.

A standard normal distribution, also known as the z-distribution, is a special case of the normal distribution. It is a type of normal distribution that has a mean of 0 and a standard deviation of 1. In other words, a standard normal distribution is a distribution of z-scores.

We can apply the concept of z-scores to the NHANES example. Among females, the mean height is 63.8 inches, and the standard deviation is 2.9 inches. Let’s consider a female height of 70 inches:

\[ \text{z-score} = \frac{70 - 63.8}{2.9} = 2.1 \]

This means that a female who is 70 inches tall is 2.1 standard deviations above the mean height for females.
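
Here is the same calculation in R; it rounds to the 2.1 reported above.

(70 - 63.8)/2.9
[1] 2.137931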

What is your z-score for height?

If we’d like to calculate each person’s z-score for height in the NHANES data frame, we can use the code below to convert the raw scores to z-scores. Because we use group_by() first, the sex-specific mean and sd are used to form the z-scores for males and females, respectively.

zheight <- height |>
  group_by(sex.f) |>
  mutate(zht_inches = (ht_inches - mean(ht_inches))/sd(ht_inches)) |>
  ungroup() |> 
  select(sex.f, ht_inches, zht_inches)

zheight |> head()
zheight |>  group_by(sex.f) |> skim()
Data summary
Name group_by(zheight, sex.f)
Number of rows 3561
Number of columns 3
_______________________
Column type frequency:
numeric 2
________________________
Group variables sex.f

Variable type: numeric

skim_variable sex.f n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ht_inches female 0 1 63.76 2.91 52.95 61.80 63.82 65.63 72.64 ▁▂▇▅▁
ht_inches male 0 1 69.19 3.01 59.88 67.32 69.02 71.06 78.90 ▁▃▇▃▁
zht_inches female 0 1 0.00 1.00 -3.72 -0.67 0.02 0.64 3.05 ▁▂▇▅▁
zht_inches male 0 1 0.00 1.00 -3.09 -0.62 -0.06 0.62 3.22 ▁▃▇▃▁

Notice in the skim() output above that the z-score versions of height have a mean of 0 and a standard deviation of 1.

We can use either the unstandardized (raw heights) or standardized (z-scores of heights) scores with qnorm() and pnorm(). For example: What is the probability that a randomly selected female will be shorter than 70 inches? In the code below, I am going to be more precise with the mean and standard deviation so that you can see the close match.

pnorm(q = 70, mean = 63.76, sd = 2.91)
[1] 0.9839968

You can obtain the same answer using z-scores (notice the change to mean and sd — the mean and sd of a z-score are 0 and 1, respectively — and I replaced q = 70 with q = 2.14, the corresponding z-score for females).

pnorm(q = 2.14, mean = 0, sd = 1)
[1] 0.9838226

The values are just slightly different because of rounding — if I used even more decimal places to record the precise scores for the raw mean and sd, then the two would match perfectly.
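
To see this, compute the z-score without rounding and pass it to pnorm(); the result matches the raw-score version exactly.

z <- (70 - 63.76)/2.91
pnorm(q = z, mean = 0, sd = 1)
[1] 0.9839968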

Let’s use the z-scores for a few more examples. How many standard deviations above the mean does a person need to be so that they are in the top 2.5% of the distribution?

qnorm(p = .975, mean = 0, sd = 1)
[1] 1.959964

How many standard deviations below the mean does a person need to be so that they are in the bottom 2.5% of the distribution?

qnorm(p = .025, mean = 0, sd = 1)
[1] -1.959964

We can map these last two values (-1.96 and +1.96) onto a graph. Notice the similarities with the Empirical Rule. The Empirical Rule is derived from the CDF for the standard (z-scores) normal distribution.
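
In fact, the Empirical Rule percentages come straight from the standard normal CDF: the area between \(-k\) and \(+k\) standard deviations is pnorm(k) - pnorm(-k).

pnorm(1) - pnorm(-1)   # about 0.68
pnorm(2) - pnorm(-2)   # about 0.95
pnorm(3) - pnorm(-3)   # about 0.997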

Wrap-up

Module 6 has deepened our understanding of probability distributions, an essential concept in both theoretical and applied statistics. We systematically explored the distinctions between discrete and continuous distributions.

We began by differentiating between discrete and continuous probability distributions, emphasizing that discrete distributions deal with countable outcomes, whereas continuous distributions pertain to outcomes that can take on any value within a range, such as measurements of height. This distinction is crucial because it influences the type of probability functions we use: probability mass functions (PMF) for discrete variables and probability density functions (PDF) for continuous variables.

We also delved into the Cumulative Distribution Function, or CDF, which is fundamental in understanding the behavior of both discrete and continuous variables. The CDF represents the probability that a random variable takes on a value less than or equal to a specific value. It is a powerful tool because it accumulates the probabilities of a variable up to that point, providing a running total of probabilities as one moves along the range of possible values. For discrete distributions, this involves summing the probabilities of all outcomes up to a certain point, while for continuous distributions, it involves integrating the probability density function up to that point. The CDF is especially useful because it gives a clear and immediate understanding of the probability of observing values within certain ranges, making it invaluable for statistical inference and decision-making.

A significant portion of our discussion was dedicated to the normal distribution, a fundamental concept in statistics due to its properties and the central limit theorem (CLT). The normal distribution, with its bell-shaped curve, describes how data points are distributed around the mean in many natural phenomena. Understanding the normal distribution is key to performing many types of statistical analyses because it underpins various statistical tests and confidence interval calculations, which you will study in later Modules.

Through practical examples, we applied functions like qnorm() and pnorm() in R to calculate critical values and probabilities associated with the normal distribution. These functions allow us to transition from theoretical distributions to practical applications, such as determining the probability of a particular outcome or understanding the variability of data in terms of standard deviations from the mean.

In summary, Module 6 has equipped us with the tools to not only understand but also apply probability distributions in real-world scenarios. By grasping the nuances of different types of distributions and mastering the normal distribution, we are better prepared to analyze and interpret the data that pervade our lives and work. This foundation will be invaluable as we continue to explore more complex statistical methods and their applications in subsequent modules, particularly with regard to statistical inference.

Credits

Footnotes

  1. In the context of probability and statistics, \(p\) and \(P\) represent different concepts, and their distinction is important:

    \(P\): The uppercase \(P\) is typically used to denote a probability function or probability measure. It represents the likelihood of a particular event or set of outcomes occurring. For example, \(P(A)\) denotes the probability of event \(A\) happening. This notation is used for general probabilities and can apply to various events or conditions.

    \(p\): The lowercase \(p\), on the other hand, is often used to represent a specific probability value, especially the probability of success in a single trial of a Bernoulli process. In the context of binomial distribution, \(p\) is the parameter that defines the probability of getting a “success” in each of the Bernoulli trials. It’s a fixed value that describes a characteristic of the process being studied, such as the probability of flipping a coin and it landing on heads.↩︎

  2. The notation \(f(X)\) for the Probability Density Function (PDF) and \(F(X)\) for the Cumulative Distribution Function (CDF) reflects a convention in statistics and probability theory to differentiate between these two fundamental concepts associated with continuous random variables. In the PDF the lowercase \(f\) denotes the density function. In the CDF the uppercase \(F\) signifies the cumulative aspect of the distribution function. That is, the uppercase \(F\) highlights the aggregation or summation nature of the function, distinguishing it from the density-focused \(f(X)\).↩︎

  3. The normal distribution is often called the Gaussian distribution in honor of the German mathematician and scientist Johann Carl Friedrich Gauss. Gauss made significant contributions to many fields, including mathematics, statistics, astronomy, and physics. One of his notable achievements in the realm of statistics was his development of the method of least squares for data analysis in 1809, which led to the formal description of the normal distribution.

    The term “Gaussian distribution” acknowledges Gauss’s work in establishing the mathematical foundations of the distribution. While the normal distribution had been previously noted and used by other mathematicians like Abraham de Moivre, Gauss’s work in applying it to astronomical data and error analysis was pivotal. He showed how observational errors in astronomical measurements are normally distributed, meaning that most observations are near the mean value, with fewer and fewer observations appearing as one moves farther away from the mean.

    This discovery was revolutionary because it provided a solid mathematical basis for dealing with uncertainties in measurements and for making inferences about populations from sample data. The name “Gaussian distribution” not only commemorates Gauss’s contributions but also highlights the distribution’s central role in the field of statistics and its widespread application across various scientific disciplines.↩︎

  4. The argument lower.tail = FALSE in the qnorm() function in R is used to specify which tail of the standard normal distribution we are interested in. The qnorm() function returns the quantile (or z-score) associated with a given cumulative probability for the standard normal distribution.

    Here’s a breakdown of how lower.tail works in qnorm():

    1. lower.tail = TRUE (default)

      • When you provide a probability p with lower.tail = TRUE, the function gives you the z-score such that a proportion p of the data is to the left of that z-score (i.e., the lower tail).

      • For instance, qnorm(0.025) would give the z-score for which 2.5% of the data is to the left (which is approximately -1.96).

    2. lower.tail = FALSE

      • When you set lower.tail = FALSE, the function gives you the z-score such that a proportion (p) of the data is to the right of that z-score (i.e., the upper tail).

      • For example, qnorm(0.025, lower.tail = FALSE) would give the z-score for which 2.5% of the data is to the right (which is also approximately 1.96).
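
    The same two calls in code:

    qnorm(0.025)                       # default lower.tail = TRUE: returns -1.959964
    qnorm(0.025, lower.tail = FALSE)   # upper tail: returns 1.959964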

    ↩︎