Confidence Intervals for Descriptive Statistics

Module 9

This whimsical image illustrates the concept of a sampling distribution using colorful cartoon cats. On the left, a large group of multicolored cats is arranged in a bell-shaped formation resembling a normal distribution, with the x-axis labeled “Average Stare Intensity” ranging from 0 to 100. A black dot with lines underneath the peak of the distribution (around 50) suggests the mean and 95% confidence interval. On the right side, a cartoon cat scientist named “Dr. Bayesian Butterball III” is shown in a white lab coat, holding a test tube and standing in front of colorful science-themed decorations. A speech bubble from the cat says, “I am 95% confident that the average stare intensity among US house cats is between 48 and 52,” referencing a confidence interval for the population mean. The overall tone is playful while introducing a fundamental statistical concept.

Learning Objectives

By the end of this Module, you should be able to:

  • Explain what a confidence interval (CI) is and what it represents in the context of estimating population parameters
  • Distinguish between bootstrap, parametric (theory-based), and Bayesian approaches to constructing CIs
  • Implement bootstrap resampling to estimate uncertainty and construct percentile confidence intervals
  • Calculate theory-based CIs for sample proportions and means using standard error formulas and the Central Limit Theorem
  • Use the z-distribution and t-distribution appropriately when constructing theory-based CIs
  • Define the key assumptions underlying theory-based confidence intervals
  • Interpret a bootstrap confidence interval and a frequentist CI correctly, avoiding common misstatements
  • Describe how Bayesian inference uses prior beliefs and posterior distributions to construct credible intervals
  • Contrast the interpretation of Bayesian credible intervals and frequentist confidence intervals
  • Visualize and interpret CIs and posterior distributions using R tools

Overview

In Module 08, we explored the concept of a sampling distribution — a key idea in inferential statistics. By simulating the process of drawing many random samples from a population, we saw how estimates like the proportion of students with a past-year Major Depressive Episode (MDE) or their average depression severity naturally vary from sample to sample. That variation, which we measured using the standard error, gave us our first real understanding of uncertainty in statistical estimates.

But, we can’t actually draw thousands of random samples in real-world research. We typically only get one sample to work with. So how can we quantify uncertainty using just the data we have?

That’s the focus of this Module.

Moving Toward Confidence

In this Module, we’ll take the next step in learning how to reason from data: quantifying uncertainty. When we compute an estimate — like the proportion of adolescents who experienced a past-year Major Depressive Episode (MDE) — we want to understand how precise that estimate is, and how much uncertainty surrounds it.

To do this, we’ll introduce a core idea in inferential statistics: the confidence interval (CI). A confidence interval gives us a range of values that are plausible for the true population parameter, based on the data in our sample. It reflects the idea that if we had drawn a different random sample, we might have gotten a slightly different estimate — and our interval helps account for that.

There are two major frameworks for expressing this kind of statistical uncertainty:

  • The frequentist approach, which interprets probability in terms of long-run frequencies across many hypothetical repetitions of the same study

  • The Bayesian approach, which interprets probability as a degree of belief, and uses prior knowledge alongside the data to update our understanding of the parameter

In this Module, you’ll learn both approaches. We’ll begin with the frequentist framework, using methods like bootstrap resampling and theory-based formulas to construct and interpret confidence intervals. Then, we’ll explore the Bayesian perspective, which offers a different — and often more intuitive — way to express uncertainty through credible intervals.

Along the way, we’ll revisit the ideas of precision and sample-to-sample variability, and see how factors like sample size and confidence level (e.g., 90%, 95%, 99%) influence the width and interpretation of the interval, and how these tools can help you move from a single estimate to a broader understanding of what’s likely true in the population.

Introduction to the Data

In Module 07, we explored data from the National Survey on Drug Use and Health (NSDUH) — a large national study that helps us understand trends in substance use and mental health across the U.S. In this Module, we’ll revisit the same data frame — called NSDUH_adol_depression.Rds.

The NSDUH targets individuals aged 12 and older in the civilian, noninstitutionalized population of the United States, including all 50 states and the District of Columbia. Each year, a new random sample is drawn from this population to create the annual NSDUH sample.

The data frame includes the following variables:

  • record_id: Unique ID for the participant
  • year: Survey year
  • sex: Biological sex of the respondent
  • age: Age of the respondent
  • raceeth: Race/ethnicity of the respondent
  • mde_pastyear: Did the respondent meet criteria for a major depressive episode (MDE) in the past year? negative = did not meet criteria, positive = did meet criteria.
  • mde_lifetime: Did the respondent meet criteria for an MDE in their lifetime? negative = did not meet criteria, positive = did meet criteria.
  • mh_sawprof: Did the respondent see a mental health care professional in the past year? Response options include did not receive care, received care.
  • severity_chores: If mde_pastyear is positive, how severely did depression interfere with doing home chores?
  • severity_work: If mde_pastyear is positive, how severely did depression interfere with school/work?
  • severity_family: If mde_lifetime is positive, how severely did depression interfere with family relationships?
  • severity_social: If mde_pastyear is positive, how severely did depression interfere with social life?
  • mde_pastyear_severe: If mde_pastyear is positive, was the MDE classified as severe?
  • substance_disorder: Did the respondent meet criteria for a past-year substance use disorder (alcohol or drug)? Available for 2019 data only. negative = did not meet criteria, positive = did meet criteria.
  • adol_weight: Sampling weight, from the sampling protocol
  • vestr: Stratum, from the sampling protocol
  • verep: Primary sampling unit, from the sampling protocol

Let’s load the packages that we will need for this Module.

library(here)
library(infer)
library(tidyverse)

Let’s import the data and do a bit of wrangling:

  • We’ll subset the data frame to select just year 2019 — the focus of this Module.

  • Later in the Module, we will calculate the proportion of adolescents with a past-year MDE. Since the current indicator of past-year MDE (mde_pastyear) is a character variable (i.e., “positive” for MDE, versus “negative” for MDE), we won’t be able to take the mean of this variable to quantify the proportion of positive cases. Therefore we’ll create a new version of the variable called mde_status where an individual is coded 1 if they experienced a past-year MDE and a 0 otherwise.

  • We’ll drop cases missing on the MDE indicator. Once done, this leaves us with 12,950 adolescents in the nsduh_2019 data frame.

  • Finally, we will form the depression severity scale (in the same way we formed it in Module 07).

nsduh_20092019 <- read_rds(here("data", "NSDUH_adol_depression.Rds"))

nsduh_2019 <-
  nsduh_20092019 |> 
  filter(year == 2019 & !is.na(mde_pastyear)) |> 
  mutate(mde_status = case_when(mde_pastyear == "positive" ~ 1,
                                mde_pastyear == "negative" ~ 0)) |> 
  drop_na(mde_status) |> 
  mutate(severity_scale = rowMeans(pick(starts_with("severity_")))) 

nsduh_2019 |> glimpse()
Rows: 12,950
Columns: 19
$ record_id           <chr> "2019_58786143", "2019_76127143", "2019_19108143",…
$ year                <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 20…
$ sex                 <chr> "male", "male", "female", "male", "male", "female"…
$ age                 <dbl> 16, 16, 16, 14, 17, 17, 17, 12, 15, 16, 17, 14, 13…
$ raceeth             <chr> "Non-hispanic White", "Non-hispanic White", "Black…
$ mde_lifetime        <chr> "negative", "negative", "negative", "negative", "n…
$ mde_pastyear        <chr> "negative", "negative", "negative", "negative", "n…
$ mde_pastyear_severe <chr> "negative", "negative", "negative", "negative", "n…
$ mh_sawprof          <chr> NA, NA, NA, NA, NA, NA, "did not receive care", NA…
$ severity_chores     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 5,…
$ severity_work       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3, NA, 5, …
$ severity_family     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2, NA, 4, …
$ severity_social     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3, NA, 7, …
$ substance_disorder  <chr> "negative", "negative", "negative", "negative", "n…
$ adol_weight         <dbl> 1453.7924, 1734.7987, 1107.9873, 839.4021, 711.593…
$ vestr               <dbl> 40002, 40002, 40002, 40047, 40044, 40007, 40041, 4…
$ verep               <dbl> 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 2,…
$ mde_status          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0,…
$ severity_scale      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 5.…

Introduction to Bootstrap Resampling

To begin learning about quantifying uncertainty in statistical estimates, we’ll learn a technique called bootstrap resampling, which builds directly on the simulation-based logic you developed in Module 08. Rather than imagining repeated samples from the population, bootstrapping involves repeatedly drawing resamples (with replacement) from the one sample we do have. Each time, we re-estimate our parameter of interest — such as the proportion of adolescents reporting a past-year MDE or their average depression severity.

This process allows us to approximate the sampling distribution using only our observed data — and to build a bootstrap confidence interval around our estimate.

Draw one bootstrap resample

Let’s initiate our analysis by creating a single bootstrap resample. In doing so, we will match the size of our observed sample — 12,950 participants. This process involves randomly selecting participants with replacement.

What does drawing with replacement entail? Each time we select a participant, we note their inclusion in the resample and then return them to the original pool, making them eligible for reselection. Consequently, within any given resample, it’s possible for some participants to be chosen multiple times, while others from the original sample may not be selected at all. This method ensures every draw is independent, preserving the integrity of the bootstrap approach and enabling us to effectively simulate the sampling variability.
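
To build intuition, here is a tiny base-R illustration (separate from our analysis): drawing five values from the numbers 1 through 5 with replacement will often select some numbers more than once and skip others entirely.

set.seed(12345)

# Draw 5 values from 1:5 with replacement; repeats and omissions are both possible
sample(x = 1:5, size = 5, replace = TRUE)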

Choosing a bootstrap resample of the same size as our original data frame — 12,950 participants — is crucial for maintaining the statistical properties of the original sample. By matching the resample size to the original, we ensure that the variability and distribution in the resample closely mirror those in the full data frame. This fidelity is essential for accurately estimating the sampling distribution of statistics calculated from the resamples.

The code below draws one bootstrap resample, and calls the resampled data frame resample1.

set.seed(12345)

resample1 <- 
  nsduh_2019 |> 
  select(record_id, mde_pastyear, mde_status, severity_scale) |> 
  slice_sample(n = 12950, replace = TRUE) |> 
  arrange(record_id)

resample1 |> glimpse()
Rows: 12,950
Columns: 4
$ record_id      <chr> "2019_10011546", "2019_10011546", "2019_10016379", "201…
$ mde_pastyear   <chr> "negative", "negative", "positive", "positive", "negati…
$ mde_status     <dbl> 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ severity_scale <dbl> NA, NA, 3.50, 3.50, NA, NA, 6.50, NA, NA, NA, NA, NA, N…

The table below presents the first 100 rows of data. Please scan down the list and notice that many people (indexed by record_id) were drawn several times in this resample.
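
One quick way to verify this yourself is to count how many times each record_id appears in resample1, for example with a short sketch like the following:

# Count how many times each participant was drawn into the resample;
# participants drawn more than once have n greater than 1
resample1 |> 
  count(record_id, sort = TRUE) |> 
  slice_head(n = 10)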

Estimate the statistic

Let’s compare the distribution of mde_status (incidence of past-year MDE) in the full observed sample and our drawn resample. For both the original data frame and the bootstrap resample, we compute the proportion of adolescents who met the criteria for a past-year MDE (named prop_positive in the code below). Based on the results below, we find that about 16% of adolescents in the full sample had a past-year MDE, and the estimate from the bootstrap resample is very close to this value.

Proportion of students with a past-year MDE in full sample

nsduh_2019 |> select(mde_status) |> summarize(prop_positive = mean(mde_status))

Proportion of students with a past-year MDE in bootstrap resample

resample1 |> select(mde_status) |> summarize(prop_positive = mean(mde_status))

Draw and analyze 30 bootstrap resamples

Now, instead of drawing one resample, let’s draw 30 resamples — and then compute the proportion of adolescents with a past-year MDE in each of the 30 resamples. The table below presents the proportions (indexed by the variable replicate). Please scan through and notice that, although each resample produces a slightly different proportion, they are all quite similar.

resamples_30 <- 
  nsduh_2019 |> 
  select(record_id, mde_pastyear, mde_status) |> 
  rep_slice_sample(n = 12950, replace = TRUE, reps = 30) |> 
  arrange(record_id)

resamples_30 |> 
  group_by(replicate) |> 
  summarize(prop_positive = mean(mde_status))

Draw and analyze 5000 bootstrap resamples

Now, let’s go really big and draw 5000 resamples, and again estimate the proportion of adolescents with a past-year MDE in each one. Given the large number of resamples, we won’t print out the proportion of students who experienced a past-year MDE in each resample, but rather, we’ll create a histogram of these resampled proportions.

set.seed(12345)

resamples_5000proportions <- 
  nsduh_2019 |> 
  select(record_id, mde_pastyear, mde_status) |> 
  rep_slice_sample(n = 12950, replace = TRUE, reps = 5000) |> 
  group_by(replicate) |> 
  summarize(prop_positive = mean(mde_status))
  
resamples_5000proportions |> 
  ggplot(mapping = aes(x = prop_positive)) +
  geom_histogram(binwidth = 0.001, fill = "azure2") +
  theme_minimal() +
  labs(title = "Proportion of students with a past year MDE across 5000 bootstrap resamples", 
       x = "Proportion of students with a past year MDE in the resample",
       y = "Number of resamples")

This distribution of the estimated proportion of adolescents with a past-year MDE across our 5000 bootstrap resamples mimics the sampling distribution that we studied in Module 08. But, we created it without having to pretend that we have access to the entire population. We created it simply with our single observed sample via the bootstrap procedure.

Introduction to a Confidence Interval (CI)

Now that we’ve built a bootstrap sampling distribution, we can use it to understand the sampling variability of our estimate — in this case, the proportion of adolescents who experienced a past-year Major Depressive Episode (MDE).

Looking at the histogram of the 5,000 bootstrap resamples, we can see that most estimates fall between about 0.150 and 0.175 (15% to 17.5%), with the highest concentration near 0.162 — the same as the proportion we observed in the full 2019 NSDUH sample. This spread gives us a visual sense of the range of plausible values for the true population proportion.

More specifically, we can see that about 95% of the bootstrap estimates fall between approximately 0.155 and 0.170. This middle 95% gives us a range of plausible values for the true population proportion based on our sample. When we use the bootstrap method to define this range, it’s called a bootstrap percentile confidence interval (CI).

A CI is a statistical tool used to express uncertainty about a population parameter. For example, a 95% CI means that if we repeated the study process many times — drawing new random samples from the population, calculating an estimate each time, and constructing a 95% CI — then about 95% of those intervals would contain the true population value, while about 5% would not.

This 5% represents the probability, denoted by \(\alpha\), that a confidence interval does not contain the true population parameter — in other words, the proportion of times we’d expect our interval to miss the true value if the study were repeated many times.

Correct and incorrect interpretations of a frequentist CI

It’s important to be precise about what a confidence interval means. A 95% CI does not mean there’s a 95% chance that the true population parameter lies within the specific interval we just calculated.

In frequentist statistics (the type we are considering now), the population parameter is fixed — it doesn’t change from sample to sample. What varies is the confidence interval, because it is calculated from a sample, and samples vary due to random chance.

So once we compute a 95% confidence interval from a sample, that interval is either covering the true parameter or it isn’t. There’s no probability involved anymore — the parameter is a fixed (but unknown) quantity, and our specific interval is a fixed range derived from our data.

Instead, the 95% refers to the long-run success rate of the method.

The correct interpretation is:

If we repeated the sampling process an infinite number of times, and calculated a 95% CI from each sample using the same method, then 95% of those intervals would contain the true population parameter.

So, the 95% reflects the reliability of the method, not the probability of the parameter being inside a single interval.

The two ends of the CI are called the bounds — the lower bound and the upper bound — and they define the range of plausible values for the population parameter based on our resampling.

If we want more confidence, we can construct a 99% CI instead. This interval will be wider, which reflects the tradeoff between precision and confidence: we’re more likely to capture the true population value across many repeated samples, but at the cost of a less precise estimate.

At this point, we’ve been “eyeballing” the histogram to get a rough sense of the interval. While that’s a helpful starting point, we can do better. We’ll now learn two principled methods for calculating CIs from our bootstrap distribution:

  1. The percentile method

  2. The standard error method

Let’s walk through each of these approaches and learn how to compute them in R.

Construct a bootstrap CI

Percentile bootstrap method

The most common method to define the CI in this setting is to use the middle 95% of values from the distribution of parameter estimates across our bootstrap resamples. To do this, we simply find the 2.5th and 97.5th percentiles of our distribution — that will allow 2.5% of parameter estimates to fall below our 95% CI and 2.5% to fall above our 95% CI. This is quite similar to the activity we did in Module 06 to determine a range of plausible values of height that would encompass the middle 95% of female adults based on the normal distribution.

Calculate a 95% confidence interval

We can use the quantile() function to calculate this range across the bootstrap resamples for us — it takes the corresponding proportion/probability rather than percentile, so we need to divide our percentiles by 100 — for example, 2.5/100 = 0.025 for the lower bound and 97.5/100 = 0.975 for the upper bound. We then plug these values into the code below, where the argument probs = provides the desired percentiles.

resamples_5000proportions |> 
  summarize(lower = quantile(prop_positive, probs = 0.025),
            upper = quantile(prop_positive, probs = 0.975))

The two values in the table above give us the lower and upper bound of our CI. That is, our 95% CI for the proportion of adolescents with a past-year MDE is 0.156 to 0.169. In other words, the estimated 95% CI for the proportion of adolescents with a past-year MDE is 15.6% to 16.9%.

Let’s plot our bootstrap sampling distribution, and overlay this calculated 95% CI.

resamples_5000proportions |> 
  ggplot(mapping = aes(x = prop_positive)) +
  geom_histogram(binwidth = 0.001, fill = "azure2") +
  geom_vline(xintercept = 0.1557811, lwd = 1, color = "pink2", linetype = "dashed") +
  geom_vline(xintercept = 0.1685452, lwd = 1, color = "pink2", linetype = "dashed") +
  theme_minimal() +
  labs(title = "Proportion of students with a past year MDE across 5000 bootstrap resamples", 
       subtitle = "Pink lines represent lower and upper bounds of the 95% CI",
       x = "Proportion of students with a past year MDE in the resample",
       y = "Number of resamples")

Calculate a 90% confidence interval

Finding the middle 95% of scores as we just did corresponds to a 95% CI. This is a very common interval, but we certainly aren’t tied to this. We could choose the middle 90% of scores, for example. For the middle 90%, we’d use the same method, but plug in 0.05 (for the 5th percentile) and 0.95 (for the 95th percentile) — allowing 90% of the distribution to be in the middle and 10% in the tails (with 5% in the lower tail and 5% in the upper tail of the distribution).

resamples_5000proportions |> 
  summarize(lower = quantile(prop_positive, probs = 0.05),
            upper = quantile(prop_positive, probs = 0.95))

Calculate a 99% confidence interval

Or, we can find a 99% CI, which will be the widest of all that we’ve constructed so far since it is set to contain all but 1% of the full distribution of parameter estimates.

In this example, we want 1% in the tails — 1% divided by two equals 0.5% in each tail. This equates to the 0.5th and 99.5th percentiles.

resamples_5000proportions |> 
  summarize(lower = quantile(prop_positive, probs = 0.005),
            upper = quantile(prop_positive, probs = 0.995))

Confidence Level and Interval Width

The confidence level you choose affects the width of the confidence interval. Higher confidence requires a wider interval:

  • A 90% CI is narrower (more precise), but captures the true value less often — about 9 out of 10 times.

  • A 95% CI is the most common choice and strikes a good balance between precision and confidence.

  • A 99% CI gives the most confidence that the interval includes the true value, but at the cost of being wider (less precise).

For example, based on our data:

  • 90% CI: 0.157 to 0.166

  • 95% CI: 0.156 to 0.169

  • 99% CI: 0.154 to 0.171

These ranges all give plausible values for the population proportion, but with varying levels of certainty and precision. The choice of confidence level depends on the context of your research, how cautious you want to be, and the consequences of being wrong.

Standard error bootstrap method

When a sampling distribution is approximately normal (bell-shaped), we can use the Empirical Rule to construct the 95% CI — which tells us that about 95% of values lie within two standard deviations (or 1.96 to be more precise) of the mean. In the context of bootstrapping, we estimate the standard error (SE) of our statistic by computing the standard deviation of the bootstrap distribution.

If our bootstrap distribution of the sample proportion is roughly normal, we can use the standard error method to construct a 95% CI using this formula:

\(CI = \hat{p} \pm 1.96 \times SE\)

Here, \(\hat{p}\) is the original estimate from the sample (e.g., 0.162), and SE is the standard deviation of the 5,000 bootstrap estimates. This approach assumes the bootstrap distribution is reasonably symmetric and bell-shaped, which is often — but not always — the case.

A density plot that shows the Empirical Rule.

To use this method, we simply calculate the mean and standard deviation of our proportions across the 5000 bootstrap resamples.

resamples_5000proportions |> 
  select(prop_positive) |> 
  summarize(mean_prop_positive = mean(prop_positive),
            sd_prop_positive = sd(prop_positive))

Here, we find the average proportion is 0.162 and the standard deviation is 0.003. Recall from Module 08 that the standard deviation of a sampling distribution (i.e., our distribution of bootstrap resamples) is called a standard error. Therefore, we can expect 95% of scores to fall within ~1.96 standard errors of the mean. That is \(0.162 \pm1.96\times0.003\). This corresponds to a range of the proportion of students with a past-year MDE between 0.156 (lower bound) and 0.168 (upper bound). This estimate is very close to our 95% CI that we calculated using the percentile method.
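
If you prefer to let R do the arithmetic, one way to carry out the standard error method directly is shown below; the resulting bounds should match the hand calculation above (roughly 0.156 to 0.168).

# Standard error method: mean of the bootstrap estimates plus/minus 1.96 times
# the standard deviation of the bootstrap estimates (the bootstrap SE)
resamples_5000proportions |> 
  summarize(mean_prop = mean(prop_positive),
            se = sd(prop_positive),
            lower = mean(prop_positive) - 1.96 * sd(prop_positive),
            upper = mean(prop_positive) + 1.96 * sd(prop_positive))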

Important

For a 99% CI, we substitute 1.96 with 2.58 as 99% of values of a normal distribution fall within ~2.58 standard errors from the mean. Please take a moment to calculate the 99% CI yourself and compare it to the 99% CI calculated using the percentile method.

Summary of bootstrap methods

Bootstrap resampling offers a powerful and versatile method for estimating standard errors and constructing CIs across a wide range of contexts, particularly when traditional parametric assumptions are not met. One of the most significant benefits of bootstrap methods is their ability to apply to virtually any statistic, whether it’s a mean, median, variance, or more complex measures like a regression coefficient (which we’ll study in later Modules).

For example, here’s a modification of our prior code to consider the severity of depression (among students with a past-year MDE). It creates a 95% CI for the mean using the percentile method. Note that there are 1,821 students in the data frame who reported a past-year MDE and who have observed data for the depression severity scale.

resamples_5000means <- 
  nsduh_2019 |> 
  filter(mde_status == 1) |> 
  select(severity_scale) |> 
  drop_na() |> 
  rep_slice_sample(n = 1821, replace = TRUE, reps = 5000) |> 
  group_by(replicate) |> 
  summarize(mean_severity_scale = mean(severity_scale))
  
resamples_5000means |> 
  ggplot(mapping = aes(x = mean_severity_scale)) +
  geom_histogram(binwidth = 0.001, fill = "azure3") +
  theme_minimal() +
  labs(title = "Distribution of the mean depression severity scale across 5000 \nbootstrap resamples",
       x = "Mean depression severity in the resample",
       y = "Number of resamples")

resamples_5000means |> 
  summarize(lower = quantile(mean_severity_scale, probs = 0.025),
            upper = quantile(mean_severity_scale, probs = 0.975))

To interpret this CI — we can say:

“95% of the 95% CIs that we could construct across an infinite number of random samples will include the true population mean.”

You can roughly think of this as giving us a range of values that are plausible for the true mean depression severity score in the population (i.e., 6.02, 6.19), based on the sample data.

This flexibility in bootstrap resampling to apply to any statistic of interest arises because the method does not rely on specific distributional assumptions about the data. Instead, it generates numerous resamples by randomly sampling with replacement from the original data frame, allowing for the estimation of the sampling distribution of almost any statistic directly from the data. As a result, it can provide more accurate and robust CIs, especially in situations where the underlying distribution of the data is unknown or complicated. Furthermore, the bootstrap approach is relatively straightforward to implement with modern computing resources, making it accessible for a wide range of applications.

Introduction to Theory-Based Quantification of Uncertainty

Now, let’s explore a different approach directly rooted in frequentist statistics. Unlike bootstrapping, which generates many resamples to simulate variability, the frequentist methods we will study here use theoretical formulas to calculate uncertainty based on the original sample. This approach relies on established theories about how data behaves over many repetitions of the same experiment.

In frequentist statistics, probability is defined as the long-run frequency of events. This means we look at what happens over many trials or samples to understand likely outcomes. A key principle in frequentist statistics is the Central Limit Theorem (CLT). This theorem explains that if we take a large enough sample (commonly, more than 30 data points are adequate), the estimates we calculate from these samples will form a normal distribution (shaped like a bell curve), regardless of the original distribution of data in the population. This predictable pattern allows us to use specific formulas to calculate standard errors and CIs.

Because the CLT suggests that our sample estimates will approximate a normal distribution, we can use it to directly calculate the standard error and confidence intervals without needing to simulate multiple samples, as we did with bootstrapping. This makes the frequentist approach efficient, especially when assumptions about the data are met.

Whether we’re estimating the proportion of students experiencing a MDE in the past year, or the average severity of depression among those students, the CLT provides two crucial insights:

  1. Convergence of Sample Estimates: As our sample size gets larger, our sample averages (or proportions) get closer to the true average (or proportion) in the whole population.

  2. Calculating Uncertainty: The standard error (SE), which measures how much we expect our estimate to vary from sample to sample, can be derived directly from our sample data without resampling by applying a parametric formula.

The SE formula for means/averages is as follows:

\[ SE = \frac{s}{\sqrt{n}} \]

where \(s\) is the sample standard deviation for the variable of interest (e.g., depression severity) and \(n\) is the sample size.

For proportions, the SE formula adjusts to account for the binary nature of the data:

\[ SE = \sqrt{\frac{p(1-p)}{n}} \]

where \(p\) is the sample proportion for the variable of interest (e.g., proportion with a past-year MDE), and \(n\) is the sample size.

These formulas are considered parametric, meaning they rely on certain assumptions about the underlying distribution of the data. For example, they often assume that the population follows a normal distribution, or that the sampling distribution will approximate normality as sample size increases. These assumptions allow us to use mathematical equations to estimate key quantities — like means, proportions, standard errors, and confidence intervals — without simulation.

When parametric assumptions hold, this approach provides a powerful and practical framework for making inferences about a population from a single sample. However, the validity of our conclusions depends on how well the data meet those assumptions.

Key assumptions of theory-based (parametric) CIs

To confidently use the Central Limit Theorem (CLT) and parametric formulas to calculate standard errors and confidence intervals, several key conditions must be met:

  • Large Sample Size: Theory-based formulas work best with sufficiently large samples — commonly at least 30 observations. Even if the population distribution isn’t normal, the CLT ensures that the sampling distribution of the sample mean or proportion will be approximately normal as the sample size increases.

  • Random Sampling: The sample should be randomly drawn from the population, so that every member has an equal chance of being selected. This helps ensure that the sample is representative and reduces bias.

  • Distribution Shape (Normality or Symmetry): For smaller samples, theory-based methods assume that the population is normally distributed. For larger samples, this assumption becomes less critical, but extreme skewness or outliers can still distort results. The more symmetric and well-behaved the distribution, the more accurate the estimates.

  • Independence of Observations: Each observation should be independent of the others. If data points are related — such as measurements from the same person over time, students within the same classroom, or sibling pairs — then this assumption is violated. In such cases, more advanced techniques (like multilevel modeling) are needed, but these are beyond the scope of this course.

By ensuring that these assumptions are reasonably satisfied, we can confidently apply theory-based (parametric) formulas to calculate standard errors and confidence intervals. The Central Limit Theorem gives us the theoretical foundation we need to justify this approach and make accurate, reliable inferences from our sample to the larger population.

Before we take a look at the procedure for calculating the CI using a theory-based approach, please watch the video below from Crash Course Statistics on Confidence Intervals.

The procedure for a proportion

Let’s walk through how to use the theory-based (parametric) method to calculate a CI for the proportion of students with a past-year MDE.

Step 1: Compute the Standard Error

As described earlier, for a sample proportion, the standard error (SE) is calculated using the formula:

\[ SE = \sqrt{\frac{p(1-p)}{n}} \]

For example, based on the 2019 NSDUH data:

\[ SE = \sqrt{\frac{0.162(1-0.162)}{12950}}=0.003 \]

This tells us how much we would expect our estimate to vary across repeated random samples.
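
As a quick sanity check, you can evaluate this formula directly in R; it should return approximately 0.003 (about 0.0032 before rounding).

# Standard error of the sample proportion, computed from the formula above
sqrt(0.162 * (1 - 0.162) / 12950)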

Step 2: Choose a Confidence Level

Next, we choose a confidence level — commonly 95% in psychological research. This means we want the middle 95% of the sampling distribution, which we can locate using the standard normal (z) distribution.

To find the corresponding z-scores that mark the 2.5th and 97.5th percentiles of the standard normal distribution, we can use the qnorm() function:

qnorm(p = c(0.025, 0.975))
[1] -1.959964  1.959964

Here’s a visualization of the standard normal distribution with the 2.5th and 97.5th percentiles marked:

A probability density function for the normal distribution, with the 2.5th and 97.5th  percentiles marked with vertical lines.

Step 3: Compute the Confidence Interval

Now we can calculate the 95% confidence interval using the formula:

\[ CI = \hat{p} \pm (z_{\alpha/2} \times SE_{\hat{p}}) \]

Where:

  • \(\hat{p}\) is the estimated sample proportion

  • \(n\) is the sample size

  • \(z_{\alpha/2}\) is the z-score from the z-distribution (standard normal distribution). The z-score corresponds to the desired confidence level. For a 95% confidence interval, \(\alpha\) is 0.05, and \(\alpha/2\) is 0.025, indicating the area in each tail.

Using R, we can compute this directly using the summarize() function. Notice the line that starts with se — here we compute the standard error using the formula stated above. We then multiply the SE by the z-scores that mark the bounds of our desired confidence level and add this product to (or subtract it from) the estimated proportion (i.e., lower_boundCI marks the 2.5th percentile and upper_boundCI marks the 97.5th percentile).

nsduh_2019 |>   
  select(mde_status) |> 
  drop_na() |> 
  summarize(
    proportion = mean(mde_status),  
    sample_size = n(),
    se = sqrt(proportion * (1 - proportion) / sample_size),
    lower_boundCI = proportion - 1.959964 * se,
    upper_boundCI = proportion + 1.959964 * se
  )

Putting the pieces together:

\[ 95\% \text{ CI } = 0.162 \pm (1.96 \times 0.003) = 0.156, 0.168 \]

This theory-based (parametric) CI is nearly identical to the one we obtained using the bootstrap method earlier. Both approaches aim to capture the range of plausible values for the true population proportion, though they rely on different logic and assumptions.

It’s essential to remember: this 95% confidence interval does not mean there’s a 95% chance that the true population value falls within this specific interval. Instead, it means that if we repeated this process many times, 95% of the intervals constructed this way would contain the true value.

The procedure for a mean

Let’s now use the theory-based/parametric method for calculating the CI of the average depression severity score among students with a past-year MDE.

Step 1: Compute the Standard Error

As described earlier, for a sample mean, the standard error (SE) is calculated using the formula:

\[ SE = \frac{s}{\sqrt{n}} \]

For example, based on the 2019 NSDUH data:

\[ SE = \frac{1.908}{\sqrt{1821}} = 0.045 \]

This tells us how much we would expect our estimate to vary across repeated random samples.
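
Again, a quick check in R should return approximately 0.045.

# Standard error of the sample mean, computed from the formula above
1.908 / sqrt(1821)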

Step 2: Determine the Degrees of Freedom

Next, we need to calculate the degrees of freedom for our estimate. Rather than the standard normal (i.e., z) distribution that we’ve worked with so far, we need to use an alternative distribution — namely the t-distribution1. The t-distribution, also called the Student’s t-distribution, is a probability distribution that is similar to the standard normal distribution but has heavier tails. It was introduced by William Sealy Gosset under the pseudonym “Student” in 1908. Like the standard normal distribution, the t-distribution is symmetrical and bell-shaped, but its tails are thicker, meaning it’s more prone to producing values that fall far from its mean. That is, as compared to the z-distribution, we expect more cases in the tails of the distribution.

The Student’s t-distribution

The Student's t-distribution is displayed for several values for degrees of freedom.

The exact shape of the t-distribution depends on a parameter called the degrees of freedom (df). Here’s an intuitive way to think about df. Imagine we have a variable measured for five people randomly selected from the population — for example, the depression severity score for five people in our study. Before estimating anything, each person’s depression score is free to take on any value. Now, imagine we want to use these five people to estimate the mean depression severity score. You now have a constraint — the estimation of the mean. That is, once the mean is estimated, four of the five values are free to vary, but one is not. That is, if we know the mean depression severity score for the five people, and the depression severity score for four of the individual people, then the value of the depression severity score for the fifth person is also known and, therefore, not free to vary. The df provide a heuristic for quantifying how much unique information we have available.

With each CI that we construct, there will be a formula for calculating the df. When estimating a population mean, the formula for the df is simply \(n\) - 1, where \(n\) is the sample size. In our example, \(n\) = 1,821 (the number of people where mde_status equals 1 and the severity_scale is not missing), so if we desire to use our sample to estimate the mean in the population, then we have 1,820 df left.

Notice in the figure above (i.e., The Student’s t-distribution) that as the df increase, the t-distribution gets closer to a standard normal distribution (labeled z in the figure). With infinite df, the t-distribution is equivalent to the standard normal distribution.

Why do we need a different distribution? When estimating a CI for the population mean, we typically do not know the population standard deviation (which of course is needed to estimate the standard error used for the CI). Instead, we must estimate it from the sample data. This introduces additional uncertainty because the sample standard deviation also varies from sample to sample. Due to this additional uncertainty, the sampling distribution of the sample mean does not perfectly follow a normal distribution. Instead, it follows a t-distribution, which accounts for the extra variability introduced by estimating the population standard deviation from the sample.
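
You can see both the heavier tails and the convergence toward the z-distribution by comparing critical values for a 95% interval at a few different degrees of freedom (the qt() function is introduced formally in the next step). For 5, 30, and 1,820 degrees of freedom, the 97.5th-percentile t-scores are roughly 2.57, 2.04, and 1.96, while the corresponding z-score is about 1.96.

# t-scores marking the 97.5th percentile for small, moderate, and large df
qt(p = 0.975, df = c(5, 30, 1820))

# the corresponding z-score from the standard normal distribution
qnorm(p = 0.975)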

Step 3: Choose the Confidence Level

Next, we need to decide on a confidence level — that is, the proportion of the t-distribution (which approximates our sampling distribution) that we want our confidence interval to capture.

We can use the qt() function in R to find the t-score that corresponds to our desired confidence level and df. The qt() function in R is used to obtain the quantiles of the Student’s t-distribution. It stands for “quantile of t”, and is similar to the qnorm() function that we used earlier for a standard normal distribution (i.e., for finding z-scores of the standard normal distribution).

For example, for our problem, we will calculate a 95% CI, and we have 1,820 df, therefore, we input the following information into the qt() function to find the t-scores associated with the middle 95% of the distribution.

qt(p = c(0.025, 0.975), df = 1820)
[1] -1.961268  1.961268

Here’s a depiction of the Student’s t-distribution with 1,820 \(df\) with the selected t-scores for the 2.5th and 97.5th percentiles overlaid (since our sample is quite large, the Student’s t-distribution is nearly identical to the standard normal distribution).

A probability density function for the Student's t-distribution with 1,820 degrees of freedom, with the 2.5th and 97.5th percentiles marked with vertical lines.

Step 4: Compute the Confidence Interval

Now we’re ready to construct the CI. Once we have the SE and the t-scores, we can construct the CI for the population mean as follows:

\[ CI = \bar{y} \pm (t_{n-1, \alpha/2} \times SE_{\bar{y}}) \]

Where:

  • \(\bar{y}\) is the sample mean for variable \(y\) — which is severity_scale in this example
  • \(t_{n-1, \alpha/2}\) is the t-score from the t-distribution. The t-score corresponds to the desired confidence level. For a 95% confidence interval, \(\alpha\) is 0.05, and \(\alpha/2\) is 0.025, indicating the area in each tail. The t-score also depends on the degrees of freedom (df), which is calculated as the sample size minus one (n−1) when estimating a mean.
  • \(SE_{\bar{y}}\) is the standard error of \(\bar{y}\)

We can calculate the CI using the summarize() function as follows. Notice the line that starts with se — here, we take the standard deviation of severity_scale in our sample and divide it by the square root of n. We then multiply this value by the t-scores marking the bounds of our desired confidence level (i.e., a 95% CI) — these t-scores correspond to the values we just calculated using the qt() function.

nsduh_2019 |>   
  filter(mde_status == 1) |> 
  select(severity_scale) |> 
  drop_na() |> 
  summarize(mean = mean(severity_scale),  
            std = sd(severity_scale),
            sample_size = n(),
            se = std/sqrt(sample_size),
            lower_boundCI = mean - 1.961268*se,
            upper_boundCI = mean + 1.961268*se)

\[ 95\% \text{ CI } = 6.11 \pm (1.96 \times 0.045) = 6.02, 6.20 \]

Here, we see that the 95% CI calculated using a theory-based approach is 6.02 to 6.20. This theory-based (or parametric) method yields an interval that is quite close to the 95% CI we calculated using the bootstrap method for the mean severity score. Like the bootstrap CI, this interval provides a range of plausible values for the true population mean, based on our sample data.

In more precise terms: if we were to repeat this study process many times — drawing new random samples and constructing a 95% CI each time — we would expect 95% of those intervals to contain the true population parameter (i.e., the mean depression severity score among adolescents with a past-year MDE).

Whether this particular interval contains the true mean is unknown — but we know we used a method that works 95% of the time in the long run.

Interpretation of frequentist-based confidence intervals

In our analysis, we observed that both bootstrap CIs and parametric (theory-based) CIs provided similar estimates for the proportion of adolescents with an MDE and the average severity of depression among these individuals. These CIs are interpreted in the same way, but they differ in their calculation methods and assumptions.

Parametric (Theory-Based) CIs

  • Calculation: Easier to calculate using formulas based on statistical theories.

  • Assumptions: Rely on key assumptions about the data, such as normality of the sampling distribution, which can be difficult to verify.

  • Example: To calculate the CI for a mean, we use the sample mean, the standard error (SE) calculated using the standard formula, and the t-score corresponding to our desired confidence level.

Bootstrap CIs

  • Calculation: Involves a resampling method, where we repeatedly draw samples from the data and recalculate the statistic of interest.

  • Assumptions: Do not rely on strict assumptions about the data’s distribution, making them more flexible and robust.

  • Example: We create many resamples of our data, calculate the statistic for each resample to mimic the sampling distribution, and then use this distribution of the statistic across the resamples to determine the CI.

Intuition for CIs

A crucial aspect of confidence intervals is understanding what the confidence level means:

  • Confidence Level (e.g., 95%): This does not mean that there is a 95% probability that the true population parameter lies within the calculated interval. It does mean that if we were to take many random samples from the population and calculate a CI for each, about 95% of those intervals would contain the true population parameter.

Visualization example

Consider a hypothetical simulation where we construct a 95% CI for the average depression severity of adolescents across 100 random samples of size 50 from a population. In the visualization presented below, the population mean is marked by an orange vertical line. The majority of the CIs (96 out of 100, shaded in gray) include the population mean, aligning with our expectations. However, a small fraction (4 out of 100, shown in black) do not contain the population mean. This outcome illustrates the inherent uncertainty: without knowing which sample we have in reality, we cannot be certain whether our specific CI includes the true parameter.

100 95% CIs are displayed, with an orange vertical line marking the true population mean.
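
If you’d like to experiment with this idea yourself, here is a hedged sketch of the kind of simulation behind such a figure. The population values used here (a mean of 6.1 and a standard deviation of 1.9) are illustrative assumptions, not the exact settings used to make the plot above.

set.seed(12345)

# Illustrative "true" population values (assumptions for this sketch)
pop_mean <- 6.1
pop_sd   <- 1.9

# Draw 100 samples of size 50 and compute a 95% CI for the mean from each
cis_100 <- 
  map_dfr(1:100, \(i) {
    samp <- rnorm(n = 50, mean = pop_mean, sd = pop_sd)
    se   <- sd(samp) / sqrt(50)
    tibble(replicate = i,
           lower = mean(samp) - qt(0.975, df = 49) * se,
           upper = mean(samp) + qt(0.975, df = 49) * se)
  })

# Proportion of intervals that contain the true population mean;
# across many runs we expect this to be close to 0.95
cis_100 |> 
  summarize(prop_containing = mean(lower <= pop_mean & upper >= pop_mean))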

Introduction to a Bayesian Approach to Uncertainty

For much of its history, the field of statistics has been dominated by the frequentist approach. This perspective is based on the idea of long-run frequencies. Imagine repeating the same study — like estimating the average depression severity score among adolescents with a past-year Major Depressive Episode — over and over under the same conditions. In the frequentist world, we treat each sample as one of many possible random samples from a larger population, and we assume the true population value is fixed but unknown.

From this view, probability refers to how often an outcome would occur in the long run. So when a frequentist calculates a confidence interval, they’re saying:

“If we repeated this experiment many times, 95% of the intervals we compute would contain the true population parameter.”

This idea is very different from how we usually talk about probability in everyday life. Frequentist probabilities are not about the chance of a single event, but about patterns across many repetitions of the same experiment.

Now let’s turn to a different framework: Bayesian statistics.

The Bayesian approach offers a fundamentally different way of thinking about probability and uncertainty. Rather than imagining an infinite number of repeated samples, Bayesian statistics focuses on the data we have right now and asks:

“Given this data, how confident are we that the true value lies in a certain range?”

In Bayesian thinking, probability represents a degree of belief — a way to express how confident we are in different possible outcomes. For example, if the depression severity scale ranges from 1 to 10, and our current data suggests a mean score around 6.1, we might be quite confident the true mean is near 6, less confident that it’s 4 or 8, and very doubtful it’s as extreme as 1.5 or 9.5. This range of belief can be expressed mathematically using a Bayesian credible interval.

At the heart of Bayesian analysis is a process called Bayesian updating. We start with a prior belief — what we already know or assume about a parameter (from past studies, theory, or expert opinion). Then we collect new data and use Bayes’ theorem to update that belief. The result is a posterior distribution, which reflects our updated understanding after combining the prior with the observed data.

You can think of it like this:
Imagine you’re a clinician with years of experience treating depression. Before you even see the latest study, you have an informed idea about what typical severity scores might be. That’s your prior. Once new data is collected, you don’t discard what you knew — you update your beliefs, arriving at a more informed conclusion. That’s the posterior.

Why it matters

What makes Bayesian methods especially compelling is their interpretability. Unlike frequentist confidence intervals, which describe what would happen over many hypothetical repetitions, a Bayesian credible interval answers the question most people actually care about:

“Given what we know, what is the probability that the true value lies within this range?”

This direct interpretation of probability — as a degree of belief — can make statistical inference more intuitive and more aligned with real-world decision-making.

Contrasting the frequentist and Bayesian approach

  • Frequentist inference is based on the idea of repeated sampling. The population parameter is fixed, and probability describes long-run frequencies of sample-based outcomes.

  • Bayesian inference incorporates prior knowledge, updates it with new data, and provides a posterior distribution that reflects our updated beliefs.

  • A Bayesian credible interval shows the range within which the parameter lies with a certain probability — based on both prior information and the observed data.

In the next section, we’ll explore how to calculate a Bayesian credible interval and see how this approach complements the methods you’ve already learned.

Understanding the mechanics

In the next section of this Module, we’ll explore how to use a Bayesian approach to estimate the average depression severity score among adolescents who experienced an MDE in the past year. We’ll work with data from the 2021 National Survey on Drug Use and Health (NSDUH), using what we’ve learned from the 2019 data as our prior information.

Establishing prior beliefs

In a Bayesian analysis, the first step is to define our prior beliefs about the parameter we’re interested in — in this case, the average depression severity score. These priors reflect what we already know before seeing the current data. They might come from previous research, expert opinion, or other reliable sources.

For this example, our prior is informed by findings from the 2019 NSDUH, which suggested a typical depression severity score around 6.1 for adolescents with a past-year MDE. We will express this prior belief quantitatively by assuming a normal distribution centered around that 2019 value.

This distribution does two things:

  • The mean reflects our best guess based on past data.

  • The spread (standard deviation) reflects how confident we are in that guess — how tightly or loosely we believe the true value might vary around it.

By setting a prior in this way, we lay the foundation for Bayesian inference, allowing us to combine this prior knowledge with new data to arrive at an updated estimate in the next step of the analysis.

Fitting the model

We will use the brm() function from the brms package to fit our Bayesian model. brms stands for Bayesian Regression Models using Stan, and the package is designed to make the power of Bayesian analysis accessible without requiring users to write complex Stan code (the premier programming language used for statistical modeling and high-performance statistical computation in a Bayesian framework). Let’s start with a high-level overview of the process — and then carry out the analyses.

First, we need to load some additional packages to conduct our Bayesian analysis: (brms, bayesplot, tidybayes, and broom.mixed).

library(brms)
library(bayesplot)
library(tidybayes)
library(broom.mixed)
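
To preview where we’re headed, a brm() call for estimating a single mean with an informative normal prior might look roughly like the sketch below. The data frame name (df_2021_mde) and the prior’s standard deviation are placeholders, not the actual settings; the model for our example is set up later in the worked example.

# A rough sketch, not the final model: an intercept-only model for depression
# severity with a normal prior centered on the 2019 estimate.
# `df_2021_mde` and the prior's spread (sd = 1) are placeholder assumptions.
fit_sketch <- brm(
  formula = severity_scale ~ 1,
  data    = df_2021_mde,
  family  = gaussian(),
  prior   = set_prior("normal(6.1, 1)", class = "Intercept"),
  seed    = 12345
)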

Combining prior knowledge with data

In Bayesian analysis, we combine prior beliefs with new data using Bayes’ Theorem. In this case, our prior beliefs are based on the 2019 NSDUH data, and the new data come from a 2021 sample. Bayes’ Theorem provides a formal way to update our understanding by blending what we previously believed with what we’ve just observed. The result is the posterior distribution, which represents our updated beliefs about the true population mean for depression severity in 2021.

This posterior distribution reflects a range of plausible values for the 2021 mean, with some values more probable than others based on both the prior knowledge and the new data. Importantly, while our 2021 data provide an estimate from one sample, we are ultimately trying to learn about the true population parameter — the average depression severity score for all adolescents with a past-year MDE in 2021.

For example, the mean severity score in 2019 was 6.1. Based on that, we might reasonably expect the 2021 mean to be similar. However, we must also account for potential changes over time — like sampling variability, year-to-year fluctuations, or major events such as the COVID-19 pandemic. These sources of uncertainty make it important to avoid assuming the 2021 value is exactly the same as in 2019.

To illustrate this, consider the two graphs below. Both display probability density functions for possible values of the 2021 mean depression severity score, with the x-axis ranging from 1 to 10 (the full scale of the measure):

  • The left-hand graph reflects a high degree of certainty: it assigns very high probability to values close to 6.1 and very low probability to values more than 1 point away. This would represent a strong prior belief that the mean hasn’t changed much from 2019.

  • The right-hand graph reflects greater uncertainty: while it still suggests the mean is likely somewhere near 6.1, it allows for a wider range of plausible values — from as low as 2.5 to as high as 9.5 — though values further from 6.1 are considered less probable.

This example shows how Bayesian methods allow us to express and visualize our prior uncertainty about a parameter. Rather than assuming a single value, the prior distribution assigns varying degrees of plausibility to a range of potential values, reflecting what we believe before seeing the new data.

Two probability distributions are shown. On the left is a distribution with little  uncertainty, on the right is a distribution with great uncertainty.
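
As a sketch of how two such pictures could be generated, the code below plots a pair of candidate normal priors centered at 6.1. The standard deviations (0.3 and 1.5) are illustrative choices, not the values used to draw the figure above.

# Two candidate priors for the 2021 mean severity score: one confident, one uncertain
tibble(x = seq(1, 10, by = 0.01)) |> 
  mutate(`Confident prior (sd = 0.3)` = dnorm(x, mean = 6.1, sd = 0.3),
         `Uncertain prior (sd = 1.5)` = dnorm(x, mean = 6.1, sd = 1.5)) |> 
  pivot_longer(cols = -x, names_to = "prior", values_to = "density") |> 
  ggplot(mapping = aes(x = x, y = density)) +
  geom_line() +
  facet_wrap(~ prior, scales = "free_y") +
  theme_minimal() +
  labs(x = "Possible values of the 2021 mean depression severity score",
       y = "Density")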

Sampling from the posterior

In Bayesian analysis, we often work with posterior distributions that do not have a simple, closed-form expression. This means we can’t easily write down or compute the exact shape of the distribution using standard mathematical formulas. Instead, we turn to computational techniques, such as Markov Chain Monte Carlo (MCMC), which allow us to approximate the posterior by generating samples from it.

A helpful analogy

Imagine you’re hiking through a mountainous landscape at night with only a flashlight. You can’t see the whole terrain, but you can decide where to step next based on your current location and what your flashlight shows. Your goal isn’t to find the single highest peak, but to explore the entire terrain, spending more time in the higher (more probable) areas.

In MCMC:

  • The landscape = the posterior distribution

  • The height of the terrain = how probable a value is

  • Your steps = proposed samples based only on your current position (this is the Markov property)

  • Over time, you take enough steps to build up a good picture of the terrain — the posterior

How Bayesian estimation works

In Bayesian estimation, we’re not imagining repeated experiments or collecting new samples of cases (i.e., adolescents) — we’re working with the single dataset we already have. Our goal is to better understand the parameter we’re estimating (like the average depression severity score in 2021) by combining that data with our prior beliefs.

Here’s what happens behind the scenes:

  1. We start with an initial guess for the parameter’s value — this is just the starting point for our algorithm.

  2. We take many small steps through the space of possible values for the parameter. Each new step is chosen based on how likely the new value is, given the posterior distribution (which combines our prior and the data).

  3. Each step produces a simulated value for the parameter. These simulated values are called samples from the posterior distribution. They are not people or observations — they are possible values the parameter could plausibly take.

  4. After thousands of steps, we build up a large collection of these simulated values. This collection gives us an approximation of the full posterior distribution — a picture of all the plausible values for the parameter, and how confident we are in each one. A toy version of this sampling process is sketched below.
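To make these steps concrete, here is a toy Metropolis sampler written from scratch. It is purely illustrative: the data are simulated, the residual standard deviation is treated as known to keep the code short, and brms/Stan use a far more sophisticated sampler in practice.

# Toy Metropolis sampler, for intuition only (brms does this for us, and better)
set.seed(1)

y <- rnorm(200, mean = 6.3, sd = 2)   # simulated "observed" severity scores

# log posterior (up to a constant) for the mean, with the prior centered at 6.1
log_post <- function(mu) {
  sum(dnorm(y, mean = mu, sd = 2, log = TRUE)) +   # likelihood (sd treated as known)
    dnorm(mu, mean = 6.1, sd = 1.3, log = TRUE)    # prior
}

n_steps <- 5000
draws <- numeric(n_steps)
current <- 6.1                        # step 1: an initial guess

for (i in seq_len(n_steps)) {
  proposal <- rnorm(1, mean = current, sd = 0.2)   # step 2: a small step from here
  # step 3: moves toward higher-posterior regions are accepted more often
  if (log(runif(1)) < log_post(proposal) - log_post(current)) {
    current <- proposal
  }
  draws[i] <- current                 # each iteration yields one simulated value
}

# step 4: discard warm-up draws and summarize the rest
keep <- draws[-(1:1000)]
mean(keep)
quantile(keep, probs = c(0.025, 0.975))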

The role of the prior

Before we even start sampling, we specify our prior beliefs about the parameter (e.g., the average depression score). This prior helps shape the terrain we’re navigating — it determines which regions are initially more likely and where the algorithm begins its search.

Interpreting the results

MCMC doesn’t give us a single answer — it gives us a collection of possible answers that reflect the uncertainty in our estimate. From these samples, we can calculate summary statistics (like the mean or median) and construct credible intervals that directly answer questions like:

“Given the 2021 sample data and our prior information, what’s the probability that the mean depression severity score among adolescents in the population in 2021 is between 6.2 and 6.4?”
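With a set of posterior draws in hand, answering such a question amounts to counting draws. The sketch below uses a hypothetical vector of draws simply to show the calculation; in the worked example that follows, the draws come from the fitted model.

# Hypothetical posterior draws for the 2021 mean (for illustration only)
posterior_means <- rnorm(12000, mean = 6.3, sd = 0.05)

# P(6.2 < mean < 6.4 | data, prior): the share of draws in that range
mean(posterior_means > 6.2 & posterior_means < 6.4)

# 95% credible interval: the middle 95% of the draws
quantile(posterior_means, probs = c(0.025, 0.975))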

For example, consider the graph below showing the posterior distribution for the mean depression severity score among adolescents in 2021.

The posterior distribution for the mean depression severity score among adolescents in 2021 is displayed alongside the 95% credible interval.

Here are the important elements to observe in this density graph:

  • The x-axis represents possible values for the mean depression severity score, while the y-axis represents the relative likelihood (or density) of each value, given the model.

  • The shape of the distribution reflects our updated beliefs after combining the prior information (based on 2019 data) with the new data from 2021. The curve peaks just below 6.3, indicating that this is the most likely value for the mean severity score based on the data and prior knowledge. The dot near 6.3 along the x-axis marks this most likely value as well.

  • The horizontal line at the bottom marks the 95% credible interval — the range within which we believe, with 95% certainty, the true mean lies. Unlike a frequentist confidence interval, this Bayesian credible interval allows for a direct and intuitive interpretation:

“Given the data and our prior, we are 95% confident that the true mean depression severity score in 2021 is between 6.2 and 6.4.”

The width of the distribution communicates uncertainty: values closer to the peak are more plausible, while values farther from the peak are less likely, though still possible. This visual summary allows us to see both our best estimate and the uncertainty surrounding it in one view.

A worked example

With that high-level overview in our pocket, let’s take a look at how a Bayesian analysis is carried out using the NSDUH example.

First, let’s import and prepare the NSDUH 2021 data (the data file is called nsduh_2021.Rds) and create the variables that we will need. The variables are identical to those from 2019.

# Import the 2021 NSDUH data
df_2021 <- readRDS(here::here("data", "nsduh_2021.Rds"))

# Prepare the analysis variables (same steps as for the 2019 data)
df_2021 <-
  df_2021 |> 
  select(sex, mde_pastyear, starts_with("severity_")) |>
  # indicator for a past-year MDE (1 = positive, 0 = negative)
  mutate(mde_status = case_when(mde_pastyear == "positive" ~ 1,
                                mde_pastyear == "negative" ~ 0)) |> 
  # average the severity items into a single scale score
  mutate(severity_scale = rowMeans(pick(starts_with("severity_")))) |> 
  # keep only the severity scale and drop rows with missing scores
  select(severity_scale) |> 
  drop_na()
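Before moving on, it can be helpful to glance at the sample itself (output not shown here); for example:

# Quick descriptive look at the 2021 severity scores
df_2021 |> 
  summarize(n = n(),
            sample_mean = mean(severity_scale),
            sample_sd = sd(severity_scale))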

Estimate the mean using Bayesian priors

Now, we’re ready to estimate the mean depression severity score among the 2021 adolescents with a past-year MDE. There are several steps that we must take.

Selection of priors

Before fitting the Bayesian model, we must select our priors. In this simple example, there are three pieces of prior information needed:

  1. The mean that we expect (i.e., the average depression severity score among adolescents with a past-year MDE).

  2. The standard deviation for this expected mean (i.e., how uncertain we are about where the mean might be).

  3. The expected distribution of the estimated parameter.

For the first piece, we will specify that the average (or mean) depression severity score should be around 6.1, a value derived from our analysis of the 2019 NSDUH data. For the third, we specify a normal distribution for this mean, reflecting our expectation that, while a 2021 mean close to the 2019 mean is most probable, there is surely variability in either direction.

The second piece, the standard deviation of the prior, is usually the most difficult to choose. Because the depression severity scale ranges from 1 to 10, scores outside this range are not feasible. We therefore want a prior distribution that stays largely within these bounds but remains broad enough to acknowledge the inherent uncertainty and year-to-year variability in depression severity scores. When choosing the standard deviation of the prior, it is useful to graph several candidate distributions to help make a decision. For example, take a look at the graph below, which depicts a prior distribution specified as normal with a mean of 6.1 and a standard deviation of 1.3.

# Parameters for the Gaussian (normal) prior distribution
prior_mean <- 6.1
prior_sd <- 1.3

# Generate points on the x axis
x <- seq(prior_mean - 3 * prior_sd, prior_mean + 3 * prior_sd, length.out = 1000)

# Calculate the y values for the Gaussian probability density function (PDF)
y <- dnorm(x, mean = prior_mean, sd = prior_sd)

# Create a data frame for plotting
prior_df <- data.frame(x, y)

# Plotting
ggplot(prior_df, aes(x = x, y = y)) +
  geom_line() +
  labs(title = "Gaussian probability distribution",
       subtitle = paste("Mean =", prior_mean, ", SD =", prior_sd),
       x = "Depression severity scale mean", 
       y = "Density") +
  theme_minimal() +
  theme(plot.subtitle = element_text(size = 10))

This graph visualizes our specified prior probability distribution for average depression severity in 2021. By choosing a normal distribution centered at 6.1 with a standard deviation of 1.3, we express the belief that the population mean is most likely near 6.1, with most of the prior probability (about 95%) falling between roughly 3.5 and 8.7. This prior reflects our expectation before analyzing the new data: means near 6.1 are deemed more probable, with plausibility gradually decreasing for values farther from this central value. Consequently, extreme means, such as those below 2.5 or above 9.5, are considered very unlikely under this prior distribution.

In summary, after visualizing several candidate distributions and assessing how they align with expectations given the 2019 data, we settle on a mean of 6.1 and a standard deviation of 1.3. This creates a distribution that is wide enough to capture the range of plausible average scores without being overly restrictive. That width is crucial, especially considering the unique circumstances surrounding the 2021 study year, such as the potential impacts of COVID-19; it is reasonable to anticipate increased uncertainty in these scores due to such external factors. It is probably quite unlikely that the mean would be much less, or much greater, than the 2019 mean of 6.1, but this prior distribution allows for that small possibility.
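A quick numeric sanity check on this choice is to ask how much prior probability falls in different ranges, which base R's pnorm() gives us directly (the values in the comments are approximate):

# Prior probability that the 2021 mean lies between 2.5 and 9.5 (about 0.99)
pnorm(9.5, mean = 6.1, sd = 1.3) - pnorm(2.5, mean = 6.1, sd = 1.3)

# Prior probability that it lies between 5 and 7 (a little over 0.5)
pnorm(7, mean = 6.1, sd = 1.3) - pnorm(5, mean = 6.1, sd = 1.3)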

Fit the Bayesian model

To specify the Bayesian model, we use the following code:

model <- brm(
  formula = severity_scale ~ 1, 
  data = df_2021, 
  prior = prior(normal(6.1, 1.3), class = "Intercept"), 
  family = gaussian(), 
  warmup = 1000, 
  iter = 4000, 
  chains = 4
  )

Let’s break down the code:

  • brm(): This function is the workhorse of the brms package for fitting Bayesian regression models. Even though we’re not fitting a traditional regression model (i.e., with predictors) in this scenario, the setup and rationale behind using brm() follow similar principles, and we’ll be able to build on this example later in the course when we’re ready to start adding predictors to our model, rather than just estimating the mean.

  • formula = severity_scale ~ 1: This argument specifies the model within the brm() function. Here, severity_scale represents the continuous outcome variable we’re interested in analyzing — in this case, the depression severity score among adolescents. The ~ 1 part of the formula indicates that we are modeling severity_scale as a function of an intercept only, without including any predictor variables. It’s a simple model that aims to capture the central tendency (mean) of the depression severity scores. This specification will become clearer as we begin learning about linear regression models later in the course.

  • data = df_2021: This argument specifies the data frame to be analyzed.

  • prior = prior(normal(6.1, 1.3), class = "Intercept"): In our Bayesian model, we specify our initial beliefs or knowledge about the average depression severity score using a normal prior distribution. This argument sets that prior. Within the normal() function, the first argument (6.1) sets the mean of the proposed prior distribution and the second argument (1.3) sets its standard deviation.

  • family = gaussian(): This argument specifies a normal distribution for modeling the outcome in the 2021 data.

  • iter = 4000, warmup = 1000: These arguments specify the total number of iterations and the warm-up period for the Markov Chain Monte Carlo (MCMC) algorithm2.

  • chains = 4: This argument sets the number of chains to run in parallel, helping to assess convergence3.

Check model convergence

Before we interpret the results of our Bayesian model, we need to check whether the Markov Chain Monte Carlo (MCMC) sampling process has converged. Recall that MCMC is the algorithm we use to explore the posterior distribution — generating many simulated values (samples) for our parameter based on both our prior beliefs and the observed data.

But in order for those samples to give us a reliable picture of the posterior, the MCMC process must have thoroughly explored the full range of plausible values. This is what is meant by convergence.

Why is convergence important?

Imagine again that we’re exploring a mountainous landscape at night with a flashlight — our earlier analogy for sampling from a complex distribution. If we only wander through one small area of the terrain, we’ll get a skewed understanding of the overall shape of the landscape. But if we’ve moved widely enough and allowed time to explore all the important areas, we’ll have a much more accurate mental map.

In the context of Bayesian modeling, convergence means that the MCMC chains have moved through the posterior distribution thoroughly and representatively. Without convergence, our inferences might be misleading, because our samples could reflect only a narrow slice of the distribution.

How do we check for convergence?

Trace plots

One simple and intuitive method is to use trace plots, which show how each MCMC chain moves across iterations. For a model like ours, we’re interested in two main parameters:

  • the intercept (i.e., the estimated mean depression severity score, labeled b_Intercept in brms), and

  • sigma (the estimated standard deviation of the outcome, representing variability across individuals).

We can use the mcmc_trace() function to generate trace plots and visually assess convergence.

Here’s how to request these:

# trace plot to check convergence
model |> mcmc_trace(pars = "b_Intercept")

model |> mcmc_trace(pars = "sigma")

These plots display the sampled values at each iteration for every chain. For a model that has converged, the trace plot of each chain should resemble a hairy caterpillar sitting on the ground, filling the plot space evenly without clear patterns, drifts, or separations between chains. This indicates that the chains are exploring the posterior distribution effectively and consistently. If you can conceptually fit a rectangle around the bulk of the traces without including large excursions, your model is likely well-converged. Conversely, a trace plot showing chains that do not overlap well or exhibit trends (snaking or undulating patterns) may indicate a lack of convergence. Indeed, our trace plots look like fuzzy caterpillars, and no trending is apparent. This indicates model convergence.
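Trace plots can be complemented with numeric diagnostics. The standard brms model summary reports an R-hat value and effective sample sizes for each parameter; as a common rule of thumb, R-hat values very close to 1 (roughly 1.01 or below) are consistent with convergence.

# Numeric convergence diagnostics (R-hat, effective sample sizes) appear in the summary
summary(model)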

Posterior predictive checks

Additionally, we can request a posterior predictive check. The pp_check() function generates posterior predictive checks for a fitted model.

# posterior plot check
model |> pp_check(ndraws = 100, type = "dens_overlay")

Posterior predictive checks (PPCs) are a valuable diagnostic tool in Bayesian analysis used to evaluate how well a model’s predictions match the observed data. The idea is simple: we use the fitted model to simulate new data based on the posterior distribution of the estimated parameters. In other words, we ask:

“If this model were true, what kind of data would we expect to see?”

PPCs work by generating predicted outcomes for each observation in the original data frame — that is, each person or row in the data. These predictions are drawn from the posterior distribution of the model parameters. We can then use these simulated outcomes to evaluate whether the model reproduces key features of the observed data.

The PPC above compares the distribution of predicted values to the distribution of observed values — helping us assess whether the model captures overall characteristics like center, spread, and shape.

Importantly, a good fit in a PPC doesn’t prove your model is correct, but a bad fit is often a red flag. Systematic mismatches between the predicted and observed data can indicate problems with model specification, such as inappropriate priors, missing covariates, or an overly simple model structure (we’ll explore these issues more deeply later in the semester).

In our example, the output looks quite good. The navy blue distribution represents the observed depression severity scores, and the lighter blue curves show the predicted distributions from 100 posterior draws. Because the predicted distributions closely match the observed data, we have no immediate reason for concern. If, instead, the model consistently predicted values that were too high or too low — or missed important features of the observed distribution — that would signal the need to revise or improve the model.
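Posterior predictive checks can also target a specific summary statistic rather than the whole distribution. For example, one of the other check types available through pp_check() compares the observed sample mean to the means of datasets simulated from the posterior:

# Compare the observed mean to means from posterior-predictive datasets
model |> pp_check(type = "stat", stat = "mean")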

Extract the estimated mean and credible interval

After fitting the Bayesian model and ensuring convergence, we can extract the estimates of interest. This can be done by applying the tidy() function from the broom.mixed package to the output of our Bayesian model.

model |> tidy() |> filter(effect == "fixed")

The value under estimate corresponds to the posterior mean of the depression severity scale — in this case, 6.3.

The value under std.error represents the standard deviation of the posterior distribution for that parameter. This quantity reflects our uncertainty about the parameter estimate, given the data and the prior. While it plays a similar role to the standard error in a frequentist model, in a Bayesian model it is not derived from a sampling distribution but from the posterior distribution itself.

Our 95% credible interval, represented by conf.low (lower bound) and conf.high (upper bound), is roughly 6.2 to 6.4. These bounds are the 2.5th and 97.5th percentiles of the posterior distribution, akin to how we calculated the middle 95% of the simulated sampling distribution using bootstrapping. This interval reflects our degree of belief, based on the observed data and prior assumptions, that the true population mean falls within these limits. In other words, given our model and prior assumptions, we are 95% confident that the true population mean for depression severity among adolescents in 2021 lies between roughly 6.2 and 6.4.

Unlike a frequentist confidence interval, which would be interpreted in terms of long-run frequencies of containing the true parameter if the experiment were repeated many times, the Bayesian credible interval directly reflects the probability of the parameter lying within the specified range, given the current data and prior knowledge. This makes the Bayesian interval a more intuitive measure of uncertainty for many people, as it aligns more closely with how we naturally think about probabilities and uncertainty in our everyday lives.
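If you prefer to work with the posterior draws directly rather than through tidy(), one option (assuming the as_draws_df() helper that brms re-exports from the posterior package) is to compute the same quantities yourself:

# Extract the posterior draws; the intercept (the mean) is stored as b_Intercept
post <- as_draws_df(model)

# 95% credible interval for the mean
quantile(post$b_Intercept, probs = c(0.025, 0.975))

# Probability that the mean lies between 6.2 and 6.4, given the data and prior
mean(post$b_Intercept > 6.2 & post$b_Intercept < 6.4)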

Visualize the results

Last, we can create a visualization of the posterior draws and the 95% credible interval.

# Create a dummy dataframe to designate the mean/intercept
newdata <- data.frame(Intercept = 1) 

# Draw predicted scores from the posterior distribution
draws <-
  model |> 
  epred_draws(newdata = newdata) 

# Plot the results
draws |> 
  ggplot(mapping = aes(x = .epred)) +
  stat_halfeye(
    point_interval = "mean_qi",
    .width = 0.95,
    fill = "#A7226E") +
  theme_minimal() +
  labs(
    title = "What we believe about mean depression severity after seeing the data",
    subtitle = "Data from adolescents participating in the 2021 NSDUH",
    y = "Density", 
    x = "Mean depression severity")

Let’s break down this code:

  • newdata <- data.frame(Intercept = 1): This line creates a dummy data frame named newdata with a single column named Intercept.

  • epred_draws(newdata = newdata): The epred_draws() function from the tidybayes package is used to generate draws of predicted scores from the posterior distribution of the model. Here, we’re using it to obtain predictions for the dummy data frame newdata, which allows us to focus on the aggregate predictions, i.e., the overall mean severity score, rather than individual-level predictions.

  • ggplot(mapping = aes(x = .epred)): This starts the plot creation process. It sets up a plot object, specifying that the x-axis should represent the predicted scores extracted by epred_draws(), which are named .epred. The function stat_halfeye() adds a half-eye plot to the ggplot object, which is a way of visualizing the distribution of predicted scores. The half-eye plot includes both a density plot and an interval plot, giving a clear visual representation of the distribution’s shape and the credible interval around the predictions.

  • .width = 0.95: This argument in stat_halfeye() sets the width of the credible interval to 95%, indicating that the interval covers 95% of the predicted values, centered around the mean. This can be changed if you’d prefer some other credible interval — for example 90%.

This visualization step is crucial as it not only provides a clear picture of the model’s predictions but also illustrates the uncertainty surrounding these predictions through the distribution and credible intervals. This approach gives a more nuanced understanding of the model’s outputs, highlighting both the central tendency (mean or median) and the variability (uncertainty) of the predictions.
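As one small variation, stat_halfeye() accepts more than one interval width, so you could display both a 66% and a 95% credible interval in the same plot (the specific widths here are just an example):

# Same plot, but showing 66% and 95% credible intervals together
draws |> 
  ggplot(mapping = aes(x = .epred)) +
  stat_halfeye(
    point_interval = "mean_qi",
    .width = c(0.66, 0.95),
    fill = "#A7226E") +
  theme_minimal() +
  labs(y = "Density", 
       x = "Mean depression severity")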

Learning Check

After completing this Module, you should be able to answer the following questions:

  1. What does a confidence interval tell us about a population parameter?
  2. How do bootstrap resampling methods approximate a sampling distribution?
  3. What is a bootstrap percentile CI, and how is it constructed?
  4. What are the advantages of bootstrap methods, especially when traditional assumptions may not hold?
  5. How is a standard error calculated for a proportion and for a mean in a theory-based framework?
  6. When should we use the z-distribution vs. the t-distribution when constructing a CI?
  7. What are the key assumptions for theory-based CIs to be valid?
  8. How do we correctly interpret a 95% confidence interval in the frequentist framework?
  9. What is the difference between a confidence interval and a credible interval?
  10. How does a Bayesian analysis combine prior knowledge with new data to form a posterior distribution?
  11. What does a Bayesian credible interval represent, and how is it interpreted differently from a frequentist CI?

Credits

I drew from the excellent Modern Dive and Bayes Rules e-books to develop this Module.

Footnotes

  1. You might wonder why we didn’t use a t-distribution for calculation of the CI for the proportion using the parametric/theory-based method. When calculating a CI for a proportion using a parametric approach, the z-distribution is typically used. This is because the calculation of a CI for a proportion relies on the normal approximation to the binomial distribution.↩︎

  2. You don’t have to worry too much about this now; you’ll get more practice as we progress through the semester. Deciding on the appropriate number of iterations and the warm-up period for Markov Chain Monte Carlo (MCMC) simulations in Bayesian analysis involves a balance between computational efficiency and the accuracy of the estimated posterior distribution. Here are some guidelines to help choose these parameters:

    Number of Iterations

    • Start with a Default: Starting with the defaults can be a reasonable first step — here, simply removing the iter argument would cause the default to be used.

    • Consider Model Complexity: More complex models, with more parameters or more complicated likelihoods, may require more iterations to explore the parameter space thoroughly.

    • Assess Convergence: Use diagnostic tools (e.g., trace plots, R-hat statistics) to check if the chains have converged to the stationary distribution. If diagnostics indicate that the chains have not converged, increasing the number of iterations can help.

    • Evaluate Precision of Estimates: More iterations can lead to more precise estimates of posterior summaries (mean, median, credible intervals). If the posterior summaries change significantly with additional iterations, it may indicate that more iterations are needed.

    Warm-Up Period

    • Purpose: The warm-up (or “burn-in”) period is used to allow the MCMC algorithm to “settle” into the stationary distribution, especially if starting from values far from the posterior mode. Samples from this period are typically discarded.

    • Proportion of Iterations: A common practice is to set the warm-up period to be about half or at least a quarter of the total iterations, but this can vary based on the specific model and data.

    • Adjust Based on Diagnostics: Use MCMC diagnostics to assess whether the chosen warm-up period is sufficient. Look at trace plots to see if the initial part of each chain shows signs of convergence. If the chains appear to have not stabilized after the warm-up period, consider increasing it.

    • Iterative Approach: It may be necessary to run the MCMC algorithm multiple times with different warm-up lengths to find an appropriate value. The goal is to discard the initial samples where the algorithm has not yet reached the region of the posterior distribution of interest. An illustrative refit is sketched below.
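    As an illustrative example (assuming brms’s update() method for fitted models), one way to refit with more iterations and a longer warm-up without retyping the whole call is:

    # Refit the same model with more iterations and a longer warm-up period
    model_longer <- update(model, iter = 8000, warmup = 2000)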

    ↩︎
  3. Deciding on the number of chains to run in parallel for Markov Chain Monte Carlo (MCMC) simulations involves considerations of convergence diagnostics and computational resources. Running multiple chains is crucial for assessing the convergence of the MCMC algorithm to the target posterior distribution. Relying on the defaults as a beginner is perfectly fine; here, you could remove the chains argument and the default number of chains would be used for your model.↩︎