We’ll consider data that covers the modern Olympic Games, from Athens 1896 to Rio 2016, and was originally scraped from www.sports-reference.com. It was then formatted for data analysis by Kaggle, a popular platform for data science competitions.
Variables

| Variable | Class     | Description                          |
|----------|-----------|--------------------------------------|
| id       | double    | Athlete ID                           |
| name     | character | Athlete name                         |
| sex      | character | Athlete sex                          |
| age      | double    | Athlete age                          |
| height   | double    | Athlete height in cm                 |
| weight   | double    | Athlete weight in kg                 |
| team     | character | Country/Team competing for           |
| noc      | character | NOC region                           |
| games    | character | Olympic Games name                   |
| year     | double    | Year of Olympics                     |
| season   | character | Season (either winter or summer)     |
| city     | character | City of Olympic host                 |
| sport    | character | Sport                                |
| event    | character | Specific event                       |
| medal    | character | Medal (Gold, Silver, Bronze, or NA)  |
Import the data
Let’s import the data:
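A minimal sketch of what that import might look like. The file name athlete_events.csv and the data frame name olympics are assumptions here, not something fixed by the handout:

```r
# Load the tidyverse (readr, dplyr, ggplot2) for import, wrangling, and plotting
library(tidyverse)

# Read the Kaggle file; the file name and data frame name are assumptions
olympics <- read_csv("athlete_events.csv")

# If the raw column names are capitalized (ID, Name, ...), you may need to
# rename them to match the lowercase names in the table above, e.g. with
# janitor::clean_names()
```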
A small subset of the data
Here’s a subset of the data for you to peruse.
Probability Distributions for Discrete Variables
What is a Probability Mass Function (PMF)?
A probability mass function (PMF) provides the probability that a discrete random variable takes on a specific value. Recall the blood type example from the Module 6 handout.
What is the probability that a randomly selected individual from the population will have type AB blood?
An example from the Olympics data
A very small percentage of the global population gets the opportunity to participate in the Olympic Games, and an even smaller number go on to compete in multiple Games (i.e., in more than one Olympic year). We can create an empirical PMF for the number of appearances at the Olympic Games.
For example, Dara Torres appeared in 5 Games (1984, 1988, 1992, 2000, 2008).
Carl Lewis appeared in 4 Games (1984, 1988, 1992, 1996).
Compute the number of appearances
Let’s compute the number of times each Olympic athlete has participated in the Olympic Games. This is a discrete variable. The code chunk below accomplishes this task.
We group by athlete id and use n_distinct() inside summarize() to count the number of distinct Olympic Games each athlete participated in. If an athlete competed in multiple events during the same Olympics, those rows are consolidated so the count reflects the number of distinct Games, not the total number of events. In the resulting data frame, each unique athlete has one row showing the number of Olympic Games they participated in. The .groups = "drop" argument in summarize() automatically ungroups the data after the summarization is complete. It isn't strictly necessary here; however, if you group by more than one variable, dplyr prints a message saying it has retained one of the grouping structures, and this argument simply suppresses that message.
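A sketch of that chunk, assuming the imported data frame is called olympics (the names how_many_games and num_games follow the text below):

```r
# One row per athlete, counting the distinct Games they appeared in
how_many_games <- olympics %>%
  group_by(id) %>%
  summarize(num_games = n_distinct(games), .groups = "drop")
```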
Compute proportion of each level of num_games
Next, we summarize the data to calculate the proportion of athletes associated with each count of appearances. This allows us to answer questions like: What proportion of athletes participated in just one Olympic Games? What proportion participated in two Games, and so on?
First we group the data by the number of Olympic Games each athlete participated in, and then the code summarize(count = n()) counts the number of athletes in that group. Finally, we compute a variable called probability by dividing the variable count by the number of rows in the data frame (i.e., nrow(how_many_games) — where nrow is short for number of rows and in this instance captures the number of unique athletes). For example, there is only one person who appeared 10 times — so here, probability will be 1/135571 = 0.000007, where 135571 is the number of unique athletes.
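A sketch of that summarization, under the same assumptions as above:

```r
# Proportion of athletes at each count of Olympic appearances
summary_how_many_games <- how_many_games %>%
  group_by(num_games) %>%
  summarize(count = n(), .groups = "drop") %>%
  mutate(probability = count / nrow(how_many_games))
```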
Create an Empirical PMF for number of Games
Now, we can create a chart.
Please create a bar chart using the geom_col() geometry. You'll start with the summarized data frame that we just created (i.e., summary_how_many_games).
There are a couple of new elements here that you might find useful. First, notice that I define num_games as a factor on the fly here — this just forces the x-axis to be displayed in discrete units. Second, there’s some fancy stuff going on with the geom_text() function call in order to include labels above the bars. aes(label = sprintf("%.5f", probability)) creates a label for each bar using the probability column. The sprintf("%.5f", probability) formats the probability values to display 5 decimal places, ensuring that the values are consistent in precision. vjust = -0.3 vertically adjusts the position of the text above the bars. A value of -0.3 moves the labels slightly above the tops of the bars to avoid overlap. size = 3 sets the font size for the text labels and color = "black" sets the text color to black.
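Putting those pieces together, a sketch of the chart code might look like this:

```r
ggplot(summary_how_many_games,
       aes(x = factor(num_games), y = probability)) +
  geom_col() +
  # Label each bar with its probability, formatted to 5 decimal places
  geom_text(aes(label = sprintf("%.5f", probability)),
            vjust = -0.3, size = 3, color = "black") +
  labs(x = "Number of Olympic Games", y = "Probability")
```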
What is a Cumulative Distribution Function (CDF)?
While the probability mass function (PMF) shows the probability of each specific outcome (e.g., how many athletes participated in 1, 2, or 3 Games), the cumulative distribution function (CDF) for a discrete variable takes it a step further. The CDF gives us the probability that the variable takes a value less than or equal to a given number.
An Empirical CDF for number of Games
In order to produce an empirical CDF for number of Olympic Games, we first need to compute the cumulative probabilities:
In order to compute the cumulative probabilities, we use the cumsum() function, which calculates the cumulative sum of the probability column. The cumulative sum at each row is the sum of all probabilities up to that row. In this context, it’s creating a cumulative distribution based on the probability values.
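A sketch of that computation; the name cumulative_probability for the new column is an assumption:

```r
summary_how_many_games <- summary_how_many_games %>%
  arrange(num_games) %>%
  # Running total of the probabilities: P(num_games <= x)
  mutate(cumulative_probability = cumsum(probability))
```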
Create the graph
In the code chunk below, replace XXX with the cumulative probability that we just created.
Probability Distributions for Continuous Variables
What is a Probability Density Function?
A probability density function (PDF) is used to describe the distribution of a continuous random variable.
Unlike a discrete variable, where we can directly calculate the probability of specific outcomes using a PMF (e.g., the probability of two appearances at Olympic Games), a PDF represents the probability that a continuous variable falls within a particular interval.
Why do we need a different function?
For a continuous variable, the probability of the variable taking any exact value is technically zero, because there are infinitely many possible values it could take.
Instead, the PDF helps us calculate the probability that the variable falls within an interval.
The area under the curve of the PDF over a given range represents the probability of the variable being in that interval.
An example empirical PDF
The Olympics data provides the height and weight for each athlete, and from this, we can calculate the body mass index (BMI) — which is a continuous random variable.
Let’s compute the BMI for the male athletes.
We will create a subsetted data frame called male_athletes — in this data frame we will keep only the athlete’s first appearance in the Olympic Games (if they were in multiple).
There are a couple of new functions in this code. First, distinct(id, year, .keep_all = TRUE) removes duplicates for each athlete (id) within the same year (year), ensuring that only one row per athlete per Olympic year is retained, even if the athlete participated in multiple events. The .keep_all = TRUE argument ensures all columns are retained in the result, not just the columns specified in distinct(). Second, slice_min(order_by = year, n = 1) selects, for each athlete, the earliest (minimum) year, i.e., the first Olympic Games they participated in.
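A sketch of that chunk, assuming the data frame is called olympics, that males are coded "M" in sex, and that the new column is named bmi. Dropping rows with missing height or weight is also an assumption, made so that BMI can be computed:

```r
male_athletes <- olympics %>%
  filter(sex == "M", !is.na(height), !is.na(weight)) %>%
  # Keep one row per athlete per Olympic year, even if they entered several events
  distinct(id, year, .keep_all = TRUE) %>%
  # Keep only each athlete's first (earliest) Games
  group_by(id) %>%
  slice_min(order_by = year, n = 1) %>%
  ungroup() %>%
  # BMI = weight (kg) / height (m)^2
  mutate(bmi = weight / (height / 100)^2)
```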
Create an empirical PDF of BMI for males
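A sketch of how that density plot could be drawn, assuming the bmi column created above:

```r
ggplot(male_athletes, aes(x = bmi)) +
  geom_density(fill = "lightblue", alpha = 0.5) +
  labs(x = "BMI", y = "Density")
```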
The corresponding CDF
The Cumulative Distribution Function (CDF) for a continuous random variable describes the probability that the continuous random variable takes on a value less than or equal to a certain value.
In other words, the CDF gives the cumulative probability up to a specific point.
The value of the CDF at a specific point represents the area under the PDF curve up to that point.
Calculate and plot the empirical CDF
The stat_ecdf() function computes the empirical CDF for a given variable, in this case, the BMI of male athletes. It shows the cumulative probability of the variable (BMI) being less than or equal to a given value. For each BMI value on the x-axis, the y-axis shows the proportion of observations that have a BMI less than or equal to that value.
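A sketch of the ECDF plot, under the same assumptions as the density plot:

```r
ggplot(male_athletes, aes(x = bmi)) +
  stat_ecdf() +
  labs(x = "BMI", y = "Cumulative probability")
```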
Interpreting the CDF
The value of the CDF function at any point represents the cumulative probability up to that point. The ECDF ranges from 0 to 1, and the y-axis gives the probability that a randomly selected observation is less than or equal to the corresponding x-value.
Comparison of PDF and CDF
The graphs below both display \(P(BMI \leq 20)\).
For the PDF, the total area under the curve represents 100% of the distribution of BMI values for male Olympic athletes. The pink shaded area corresponds to the probability that a randomly selected male athlete has a BMI of 20 or less, which is approximately 7% of the total distribution.
Using the distribution to compute probabilities
The ecdf() function in R computes the empirical CDF (ECDF) of a variable, in this case BMI. This allows us to use the distribution to answer questions about the probability of randomly selecting a case within a certain interval. Press Run Code on the code chunk below to set up the ecdf_bmi() function for this example. Then, we'll use the function in the next slides.
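A sketch of that setup, assuming the bmi column created earlier:

```r
# Build an ECDF function from the observed BMI values
ecdf_bmi <- ecdf(male_athletes$bmi)
```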
What is the probability of a BMI less than 20?
By calling the ecdf_bmi() function just created, the code below will return the probability that the BMI of a randomly selected male athlete is less than or equal to 20.
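A sketch of that call:

```r
# P(BMI <= 20) for male athletes, estimated from the data
ecdf_bmi(20)
```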
This is the BMI score that we studied on the prior empirical PDF and CDF graphs.
What is the probability of a BMI greater than 30?
To calculate the probability of a BMI greater than 30, we can leverage the complement of the ECDF for the value 30. The ECDF gives the probability of a BMI being less than or equal to a specified value, so the probability of a BMI being greater than 30 is:
\[
P(BMI \geq 30) = 1 - P(BMI \leq 30)
\]
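In code, a sketch of that complement:

```r
# P(BMI > 30) = 1 - P(BMI <= 30)
1 - ecdf_bmi(30)
```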
Comparison of PDF and CDF
The graphs below both display \(P(BMI \geq 30)\).
What is the probability of a BMI between 20 and 30?
To calculate the probability of a BMI between 20 and 30, we can subtract the ECDF value at 20 from the ECDF value at 30. The ECDF gives the probability of a BMI being less than or equal to a certain value, so:
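\[
P(20 \leq BMI \leq 30) = P(BMI \leq 30) - P(BMI \leq 20)
\]

A sketch of that difference using the ecdf_bmi() function created earlier:

```r
# P(20 <= BMI <= 30) as the difference of two ECDF values
ecdf_bmi(30) - ecdf_bmi(20)
```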
Now that we’ve explored the Empirical PDF and Empirical CDF for BMI, which are based on our observed data, let’s move on to understanding the PDF and CDF for a normal distribution.
A normal distribution (also known as a Gaussian distribution) is symmetric, bell-shaped, and is fully described by its mean (center of the distribution) and standard deviation (which controls the spread of the data).
Once we know a variable is normally distributed, we no longer need the raw data to calculate probabilities or understand its distribution.
In short, knowing a variable is normally distributed allows us to leverage the mathematical properties of the distribution to calculate probabilities, rather than needing the original dataset.
Is BMI of male Olympic athletes normally distributed?
Is BMI of male Olympic swimmers normally distributed?
Calculate Mean and Standard Deviation (SD)
Please subset the male_athletes data frame to include just swimmers (call the data frame swimmers), then compute the mean and SD of BMI for this group.
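A sketch of that subsetting and summary. The sport label "Swimming" and the object names mean_bmi and sd_bmi are assumptions:

```r
swimmers <- male_athletes %>%
  filter(sport == "Swimming")

# Mean and SD of BMI for male Olympic swimmers
mean_bmi <- mean(swimmers$bmi, na.rm = TRUE)
sd_bmi <- sd(swimmers$bmi, na.rm = TRUE)
```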
The probability is about .01 that a male Olympic swimmer has a BMI of 19 or less. Or, we can say — there is about a 1% chance that a male Olympic swimmer has a BMI of 19 or less.
What is the probability that a male Olympic swimmer will have a BMI greater than or equal to 22?
Answer the question using pnorm()
The lower.tail = FALSE argument in the pnorm() function is necessary when you want to find the upper tail probability (i.e., the probability that a normally distributed variable is greater than a certain value). By default, pnorm() calculates the lower tail probability, meaning it gives you the probability that the random variable X is less than or equal to a specified value q. However, if you are interested in the upper tail, you need to calculate the complement of the lower tail probability. Setting lower.tail = FALSE does this automatically.
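A sketch of that call, assuming the mean_bmi and sd_bmi values computed above:

```r
# P(BMI >= 22) for male swimmers, under a normal model
pnorm(22, mean = mean_bmi, sd = sd_bmi, lower.tail = FALSE)
```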
Using the same approach, the probability is 0.39 that a male Olympic swimmer has a BMI between 21 and 23.
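A sketch of how that between-two-values probability could be computed, again assuming the mean_bmi and sd_bmi values from above:

```r
# P(21 <= BMI <= 23) = P(BMI <= 23) - P(BMI <= 21)
pnorm(23, mean = mean_bmi, sd = sd_bmi) - pnorm(21, mean = mean_bmi, sd = sd_bmi)
```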
Calculate a quantile from a probability
We might ask a different type of question in this context:
What BMI represents the 90th percentile of the distribution?
For this, we use the qnorm() function, which finds a quantile (q) based on a probability (p).
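A sketch of that call, assuming the mean_bmi and sd_bmi values from the swimmers data:

```r
# BMI at the 90th percentile of the fitted normal distribution
qnorm(0.90, mean = mean_bmi, sd = sd_bmi)
```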
A BMI of 25.1 represents the 90th percentile of the distribution. At this point, 90% of the scores fall below and 10% fall above.
A graph to depict the 90th percentile
The Standard Normal Distribution
What is the Standard Normal Distribution?
A standard normal distribution is a distribution of z-scores. This is where the Empirical Rule (i.e., the 68-95-99.7 Rule) comes from. Recall that a z-score distribution has a mean of 0 and a standard deviation (sd) of 1. A z-score tells you how many standard deviations a value is from the mean.
What scores (i.e., quantiles) of the standard normal distribution mark the middle 95% of the distribution? Here we want to solve for a quantile, so we use qnorm().
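A sketch of that calculation; the qnorm() defaults of mean = 0 and sd = 1 give the standard normal:

```r
# Quantiles that cut off the lowest 2.5% and highest 2.5% of the standard normal
qnorm(c(0.025, 0.975))
#> [1] -1.959964  1.959964
```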
Graph of a Standard Normal Distribution
95% of a normal distribution falls within 1.96 standard deviations of the mean. A z-score of −1.96 means the value is 1.96 standard deviations below the mean. A z-score of +1.96 means the value is 1.96 standard deviations above the mean.
What is the probability that a z-score is \(\leq -1.96\)?
Here, we want to solve for a probability, so we use pnorm():
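A sketch of that call:

```r
# P(Z <= -1.96) for a standard normal variable
pnorm(-1.96)
#> [1] 0.0249979
```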
About 2.5% of scores are less than or equal to -1.96.
What is the probability that a z-score is \(\geq +1.96\)?
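Again we want a probability, so we use pnorm(), this time asking for the upper tail:

```r
# P(Z >= 1.96) for a standard normal variable
pnorm(1.96, lower.tail = FALSE)
#> [1] 0.0249979
```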
About 2.5% of scores are greater than or equal to +1.96.
Summary for the Standard Normal Distribution
This is a key concept in statistics that we’ll rely on a lot in the coming Modules, as it highlights the range within which most values fall in a standard normal distribution — approximately 95% of data points.