Quantifying Uncertainty in Linear Regression Models

Module 12

Artwork by @allison_horst

Learning objectives

  • Explain the concept of uncertainty in the context of linear regression coefficients
  • Describe how sample-to-sample variability affects regression coefficients
  • Analyze the role of standard errors in quantifying the uncertainty of regression coefficients
  • Interpret the meaning of confidence intervals for the slope and intercept in linear regression
  • Illustrate how bootstrapping can be used to estimate the variability of regression coefficients
  • Understand theory-based methods for estimating uncertainty in regression coefficients
  • Compare frequentist and Bayesian approaches to estimating uncertainty in regression coefficients
  • Apply methods for visualizing the uncertainty in regression coefficients

Introduction to the data

In this Module, we will study the impact of weather conditions on bicycle usage in New York City. We will begin with data from 2023. The data were compiled from two sources:

Bicycle Count Data: The bicycle count data is sourced from the New York City Open Data portal, specifically from the dataset entitled “Bicycle Counts” available here. This dataset encompasses daily records of bicycles counted by 32 strategically placed counters throughout New York City. The purpose of these counters is to provide a quantitative basis for understanding cycling trends across the city, which can inform infrastructure planning, safety measures, and promote cycling as a mode of transportation. The map below marks the placement of the 32 bike counters in New York City.


Weather Data: Weather data for the same period was procured using the openmeteo R package, which pulls information from Open-Meteo, an open-source weather data provider. Open-Meteo offers historical, current, and forecasted weather data, making it an invaluable resource for correlating weather conditions with cycling activity. The weather variables include metrics like daily maximum and minimum air temperatures (measured at 2 meters above ground), apparent temperature, precipitation sums and duration, and maximum wind speed — all of which may be important in understanding how weather influences cycling frequency and patterns. We can pull in the weather data using the following code:

library(openmeteo)

nyc_weather_2023 <-
    weather_history("NYC",
    start = "2023-01-01",
    end = "2023-12-31",
    daily = c("temperature_2m_max", 
              "temperature_2m_min",
              "apparent_temperature_max",
              "apparent_temperature_min",
              "precipitation_sum",
              "precipitation_hours",
              "wind_speed_10m_max")
    ) 
`geocode()` has matched "NYC" to:
New York in New York, United States
Population: 8175133
Co-ordinates: c(40.71427, -74.00597)

I used these two data sources to compile a data frame for this Module, called nyc_bikes_2023.Rds. Each row of data represents one day in 2023. The following variables are included:

Variable Description
date Date for observation
total_counts Total number of bikes recorded by 32 bike counters dispersed throughout New York City
daily_temperature_2m_max Maximum daily air temperature (Celsius) at 2 meters above ground
daily_temperature_2m_min Minimum daily air temperature (Celsius) at 2 meters above ground
daily_apparent_temperature_max Maximum daily apparent temperature (Celsius)
daily_apparent_temperature_min Minimum daily apparent temperature (Celsius)
daily_precipitation_sum Sum (millimeters) of daily precipitation (including rain, showers and snowfall)
daily_precipitation_hours The number of hours with rain
daily_wind_speed_10m_max Maximum wind speed (in km/hr) on a day


Let’s load the packages that we will need in this Module.

library(broom)
library(broom.mixed)
library(gtsummary)
library(gt)
library(brms)
library(tidybayes)
library(bayesrules)
library(skimr)
library(here)
library(tidyverse)

Now, we can import the data frame, called nyc_bikes_2023.Rds.

nyc_bikes_2023 <- read_rds(here("data", "nyc_bikes_2023.Rds")) 
nyc_bikes_2023 |> head()

A simple linear regression model

Before fitting a linear regression model, let’s take a look at the sample data relating total_counts to daily_temperature_2m_max:
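The scatter plot referenced here can be recreated with code along these lines (a sketch; the exact styling used in the rendered Module may differ):

nyc_bikes_2023 |> 
  ggplot(mapping = aes(x = daily_temperature_2m_max, y = total_counts)) +
  geom_point(alpha = .5) +
  theme_minimal() +
  labs(title = "The relationship between temperature and number of bicyclists \non a given day in NYC",
       x = "Max temperature in Celsius", 
       y = "Total number of bicycle riders recorded")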

As we learned in Module 10, a simple linear regression (SLR) model attempts to explain the relationship between two variables through a linear equation of the form:

\[ {y_i} = b_0 + b_1x_i + e_i \]

Where:

  • \(y_i\) is the dependent variable (in this case, total_counts), which we are trying to predict or explain. This is also commonly referred to as the outcome or the response variable.

  • \(x_i\) is the independent variable (here, daily_temperature_2m_max) that we believe might predict \(y_i\). This is also commonly referred to as the predictor.

  • \(b_0\) is the intercept, representing the expected value of \(y_i\) when \(x_i\) is 0.

  • \(b_1\) is the slope coefficient, indicating the expected change in \(y_i\) for a one-unit change in \(x_i\).

  • \(e_i\) represents the residual, accounting for the variation in \(y_i\) not explained by \(x_i\).


We can fit the model using the lm() function:

slr <- lm(total_counts ~ daily_temperature_2m_max, data = nyc_bikes_2023)

Now, let’s take a look at the parameter estimates using tidy().

slr |> tidy() |> select(term, estimate)

The estimate for the intercept (23712) represents the expected number of bikes recorded (total_counts) when the maximum daily temperature (daily_temperature_2m_max) is 0 degrees Celsius. In practical terms, this can be thought of as the baseline number of bicycles counted on a cold day with 0°C (which is 32°F) as the maximum temperature.

The estimate for the slope (1310) indicates that, on average, for each 1°C increase in the maximum daily temperature, the total number of bikes counted increases by approximately 1310 bikes. This positive coefficient suggests that warmer temperatures are associated with higher bicycle usage, possibly due to more people choosing to bike under comfortable weather conditions.

The results of glance() provide some additional information about the model fit.

slr |> glance() |> select(r.squared, sigma)


The \(R^2\), labeled r.squared in the output, is 0.556 and measures the proportion of the variance in the outcome/dependent variable (total_counts) that is predictable from the predictor/independent variable (daily_temperature_2m_max). In this example, approximately 56% of the variation in daily bike counts can be explained by the variation in maximum daily temperature. This is a substantial proportion, indicating that the maximum daily temperature is a good predictor of bike usage in New York City, though there is still a considerable amount of variation that is not predicted by this model alone (i.e., ~44%).

The residual standard error (labeled sigma) gives an estimate of the standard deviation of the residuals, which are the differences between observed values and the values predicted by the model. In this context, a sigma of ~10260 means that the typical prediction for the total bike counts deviates from the actual bike counts by about 10,260 bikes, on average. This provides a measure of the accuracy of the predictions made by the model; a lower sigma would indicate more precise predictions.
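To see where these two numbers come from, we can compute them directly from the residuals. This is just a quick check (the object names resid_slr, ss_res, and ss_tot are my own):

resid_slr <- residuals(slr)                 # e_i = y_i - y_hat_i
ss_res <- sum(resid_slr^2)                  # residual sum of squares
ss_tot <- sum((nyc_bikes_2023$total_counts - mean(nyc_bikes_2023$total_counts))^2)

1 - ss_res / ss_tot                         # R-squared, ~ 0.556
sqrt(ss_res / (nrow(nyc_bikes_2023) - 2))   # sigma, ~ 10260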

Making predictions from the model

When analyzing data, it’s common to want to predict outcomes based on specific values of predictors. For instance, in our analysis of bicycle ridership, we might be interested in forecasting the number of riders on a day when the maximum temperature is 21°C (approximately 70°F). To facilitate this, we employ the predict() function in R:

# Creating a data frame with the desired predictor values
new_data <- data.frame(daily_temperature_2m_max = 21) 

# Utilizing the linear model to predict outcomes based on new data
slr |> predict(newdata = new_data) 
       1 
51229.46 

Of course, we can easily do this manually (using the tidy() output) as follows:

23711.546 + (1310.377 * 21)
[1] 51229.46

Our computation suggests that, on a day where the maximum temperature is 21°C, we anticipate around 51,229 bicycle riders to be counted in NYC. This value is depicted by the orange circle in the graph below.


Understanding sample-to-sample variability

When we conduct a study using sample data, we are essentially capturing a snapshot of the real world under specific conditions and at a specific time. In the case of our bike counts and temperature study, the data for 2023 represents one possible realization of many that could occur. Various factors contribute to sample-to-sample variability in this context. For example:

  1. Weather Variability: Weather conditions can vary significantly from year to year, including temperature patterns, precipitation, and extreme weather events. Since our study focuses on the relationship between bike counts and maximum daily temperature, different weather conditions in other samples (e.g., different years) could lead to different observations and, consequently, different estimates of the effect of temperature on bike usage.

  2. Changes in City Infrastructure and Policies: Over time, New York City might implement new bike lanes, bike-sharing programs, or policies affecting transportation. Such changes can influence people’s willingness to bike and could affect the relationship between temperature and biking behavior observed in different samples.

  3. Population Changes and Cultural Shifts: The population of a city is not static; people move in and out, and cultural attitudes towards biking and outdoor activities can shift. These changes can influence the number of people who choose to bike under various weather conditions, adding to the variability between different samples.

  4. Variability in Data Collection Locations: The placement of bike counters throughout the city is a critical factor in measuring bike usage. If these counters were positioned at slightly different locations, perhaps capturing areas with different traffic patterns or proximities to popular destinations, the recorded bike counts could vary. For instance, counters placed closer to parks or bike paths might record more activity on warm days compared to those placed in more industrial areas with less recreational cycling.

  5. Precision in Temperature Measurement: The accuracy and precision with which temperature and other weather conditions are measured can also influence the observed relationship between temperature and bike usage. Variations in the placement of temperature sensors, their calibration, or the model of the sensors used can lead to slight differences in recorded temperatures, potentially affecting the estimated impact of temperature on biking behavior.

  6. Influence of Unmeasured Environmental and Social Factors: Beyond the placement of counters and the precision of temperature measurements, other environmental and social factors that aren’t captured in the dataset can contribute to variability. These might include temporary road closures, construction work, special events that either encourage or deter biking, and even shifts in public transportation availability.

The implications of variability

Acknowledging sample-to-sample variability is essential for statistical thinking. It reminds us that our estimates from a single sample are subject to uncertainty. This variability underscores the importance of considering the potential range of estimates we might obtain if we were to repeat our study under slightly different conditions or with different samples. It also motivates the use of statistical methods that account for this uncertainty, allowing us to make more informed and robust conclusions about the relationships we observe in our data.

This underscores the fact that relying solely on point estimates—such as the intercept, the slope, or the \(R^2\) value — is not sufficient. To gain a more comprehensive understanding of our model’s reliability and the robustness of our findings, we need to incorporate and interpret statistical measures that represent the uncertainty associated with these estimates. Such measures are critical for understanding not just the specific effects observed in our sample but how these effects might generalize to the population at large, reflecting our true objective in statistical modeling and inference.

Intuition for uncertainty in a SLR

When fitting a linear regression model, we are interested in estimating the population parameters of the linear regression model (i.e., the intercept and slope(s)) using our sample data.

The population equation represents the idealized, true relationship between predictor variables and the response variable across the entire population of interest. This equation is often what we aim to estimate through statistical modeling, but it remains unknown and must be inferred from sample data.

The population equation that relates a predictor variable (X) to an outcome variable (Y) may be written as follows:

\[ Y_i = \beta_0 + \beta_1X_i + \epsilon_i \]

Here:

  • \(Y_i\) is the value of the response variable for the \(i^{th}\) case in the population.
  • \(\beta_0\) is the true intercept of the population regression line, representing the expected value of \(Y\) when \(X = 0\).
  • \(\beta_1\) is the true slope of the population regression line, indicating how much \(Y\) is expected to change with a one-unit change in \(X\).
  • \(X_i\) is the value of the predictor variable for the \(i^{th}\) case in the population.
  • \(\epsilon_i\) represents the error term for the \(i^{th}\) case, capturing all other factors affecting \(Y\) besides \(X\) and inherently includes the variability in \(Y\) not explained by \(X\). The error term, \(\epsilon_i\), follows a normal distribution with a mean of 0 and a standard deviation of \(\sigma\), denoted as \(\epsilon_i \sim N(0, \sigma^2)\). This represents the randomness or noise in the relationship between \(X\) and \(Y\).

Consider a hypothetical scenario where we magically know the true relationship between the daily maximum temperature (\(X_i\)) and the number of bike riders (\(Y_i\)) in New York City.

\[Y_i = 21000 + 1450X_i + \epsilon_i\]

This equation asserts that if the daily maximum temperature is 0°C, the predicted number of bicycle riders recorded by the counters is 21,000 for a particular day. It also asserts that for each 1°C increase in maximum daily temperature, we expect the counters to record 1,450 additional riders for the day.
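To build intuition, here is a small simulation sketch of what data drawn from this hypothetical population equation might look like. The temperature range and the error standard deviation (sd = 10000) are assumptions chosen purely for illustration; they are not values given in the Module.

set.seed(2023)

# Simulate a large "population" from Y = 21000 + 1450*X + epsilon
# The temperature range and sd below are illustrative assumptions only
sim_population <- tibble(
  daily_temperature_2m_max = runif(10000, min = -5, max = 35),
  epsilon = rnorm(10000, mean = 0, sd = 10000),
  total_counts = 21000 + 1450 * daily_temperature_2m_max + epsilon
)

sim_population |> head()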

In practice, we don’t know the true values of \(\beta_0\), \(\beta_1\), or \(\epsilon_i\), so we estimate them using sample data. The sample equation, derived from our data, is formulated as:

\[ y_i = b_0 + b_1x_i + e_i \]

  • \(\hat{y_i} = b_0 + b_1x_i\) is the predicted value of the response variable for the \(i^{th}\) case in the sample.
  • \(b_0\) and \(b_1\) are the estimates of the intercept and slope, respectively, obtained from the sample.
  • \(x_i\) is the observed value of the predictor variable for the \(i^{th}\) case in the sample.
  • \(e_i\) is the residual for the \(i^{th}\) case, representing the difference between the observed value and the predicted value of the response variable: \(e_i = y_i - \hat{y_i}\).

Contrasting population and sample equations

  • Parameters vs. Estimates: In the population equation, \(\beta_0\), \(\beta_1\), and \(\epsilon_i\) are fixed but unknown parameters that define the true model. In the sample equation, \(b_0\) and \(b_1\) are our best estimates of these parameters based on the sample data, and \(e_i\) represents the residuals of these estimates.
  • Understanding Error Terms: The error term, \(\epsilon_i\), in the population model reflects the true, unobservable variability around the population regression line due to factors not included in the model. The residuals, \(e_i\), in the sample equation are observable and measure the discrepancy between the observed and predicted values based on our sample.
  • Purpose and Application: The population equation is conceptual, guiding our understanding of how variables are related across the entire population. The sample equation is practical, allowing us to make predictions and infer relationships based on the data we have collected.

In summary, while the population equation represents the idealized, true model we aim to uncover, the sample equation provides our best approximation of this model using the data at hand. Through statistical analysis, we use the sample equation to infer the properties of the population equation, navigating from the specific insights provided by our sample back to the broader truths of the population.

Let’s imagine that we had the true population data. The graph below depicts this simulated population.

Let’s enhance this graph to include both the simulated population data, and the sample data collected in 2023. The population data is recorded in pink, while the sample data is recorded in black.

When interpreting the graph above, we can identify two primary sources of error in the relationship between the observed sample data (represented by the black line) and the hypothetical population data (represented by the pink line).

1. Sampling Error: The first source of error stems from the difference between the sample regression line (black) and the population regression line (pink). This discrepancy highlights the concept of sampling error, which is the variation between the true population parameter (e.g., the true relationship between temperature and bike counts) and the estimate obtained from a sample. No sample is a perfect representation of the population due to the randomness inherent in which observations end up in the sample. This kind of error underscores the uncertainty in using sample data to infer population parameters. Even with a perfectly conducted study, the sample estimates will likely differ from the true population values simply due to the randomness of which individuals or observations are included in the sample.

2. Prediction Error (Residual Error): The second source of error relates to the prediction error or residual error, which is present in both the population and sample regression models. This error encompasses the deviation of the observed values from the values predicted by the regression line (i.e., the actual values for bike counts, and the predicted bike counts based on the model). Several factors contribute to prediction error, including:

  • Measurement error: Inaccuracies in how the dependent or independent variables are measured (e.g., imprecision in counting bikes or recording temperatures).

  • Individual variation: Since the observations in this data frame are days, the individual variation may be related to specific daily conditions such as special events, weather anomalies, holidays, or other factors that impact the number of bike riders but are not captured by temperature alone.

  • Model specification error: The possibility that the model does not fully capture the true relationship because it is too simple, omits relevant variables, or assumes linearity when the relationship is not linear.

These errors highlight important considerations in statistical modeling and inference. Sampling error reminds us of the limits of generalizing from a sample to a population, emphasizing the need for statistical methods that quantify uncertainty (like standard errors and confidence intervals). Prediction error, on the other hand, points to the inherent limitations in predicting outcomes with models, motivating efforts to improve model accuracy through better data, more appropriate modeling techniques, and consideration of a broader range of explanatory variables.

In the graph below, I’ve made an important modification that further sets the stage for the ideas that we will consider in this Module. Here, I’ve added a 95% confidence band for the regression line that relates temperature to number of bicyclists counted in both the true population and in the observed sample.

Visualizing Uncertainty with Confidence Bands

When we add confidence bands around the regression line on our graph, these shaded areas represent the range within which we expect the true regression line to lie, given a certain level of confidence (often 95%). These bands illustrate the uncertainty in our estimates, showing not just our best guess (the regression line) but also the variability associated with that guess. Specifically, a 95% confidence band for the regression line indicates that if we were to repeat the sampling process many times, constructing a confidence band around the regression line for each sample, approximately 95% of these bands would contain the true regression line.
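As a concrete illustration, geom_smooth() with se = TRUE draws a 95% confidence band around the fitted regression line for the observed sample. This is a sketch of how such a plot might be produced:

nyc_bikes_2023 |> 
  ggplot(mapping = aes(x = daily_temperature_2m_max, y = total_counts)) +
  geom_point(alpha = .3, color = "lightgrey") +
  geom_smooth(method = "lm", se = TRUE, formula = y ~ x, color = "black") +
  theme_minimal() +
  labs(x = "Max temperature in Celsius", 
       y = "Total number of bicycle riders recorded")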

Two Types of Uncertainty Illustrated

  1. Population Uncertainty: When we talk about the population, we’re referring to the entire set of possible observations. The “true” population error variability is a theoretical concept that reflects the inherent variability in the outcome variable that is not explained by the predictor variables. In an idealized, perfectly measured population, this variability would capture the pure “noise” or unexplained variability after accounting for all possible predictors, not just those included in our model. Similarly, the confidence band around the population regression line (pink) acknowledges that, even if we knew the entire population, there would still be uncertainty in our estimates due to other sources of error (e.g., measurement error, individual differences not accounted for by the model). While we conceptualize the population line as the “true” (i.e., population) relationship, in reality, we can never know this with absolute certainty.

  2. Sampling Uncertainty: The sample, on the other hand, is a subset of the population and includes additional sources of variability. This might include sampling error (the error from estimating population parameters based on a sample) and any measurement errors or inaccuracies in the data collection process. As a result, the observed error variability in the sample can be higher than that of the population because it encompasses both the inherent population variability and the additional variability introduced by the sampling process and measurement issues. The confidence band around the sample regression line (black) visually communicates the uncertainty due to sampling variability. It shows that, given the data we have from our sample, the true relationship between temperature and bike counts could plausibly lie anywhere within this shaded region. This variability is a natural consequence of the fact that our sample might not perfectly represent the entire population.

Uncertainty in regression models

Just as we gauged the uncertainty around a sample mean or sample proportion in Modules 8 and 9, we may also gauge the uncertainty around our regression estimates. We’ll begin studying methods for quantifying uncertainty for the intercept and slope — that is, we’ll estimate the standard error for the intercept and slope in a fitted regression model. In the regression model context, these standard errors help us understand the precision of our regression estimates.

The standard errors for the intercept and slope(s) in a linear regression model provide measures of the uncertainty or variability around these estimated parameters. Essentially, they tell us how much we might expect these estimates to “wiggle” if we were to fit the model to different samples from the same population. A smaller standard error suggests greater precision in our estimate, often arising when our data points closely follow a linear pattern (leading to smaller residuals). In contrast, a larger standard error indicates more uncertainty, which can result from data that scatter widely around the regression line (larger residuals). In essence, standard errors give insight into the reliability of our regression estimates, with smaller values indicating more reliable estimates and larger values signaling caution due to greater variability.

In Module 9, we learned how to estimate standard errors (and confidence intervals) using bootstrapping (a non-parametric approach that mimics the sampling distribution of the parameters) and theory-based formulas for means and proportions. We also learned about a Bayesian approach to quantifying uncertainty in our estimates of means and proportions. In this Module, we’ll extend all of these techniques to learn how they work in a regression model context.

Frequentist approach to quantification of uncertainty

Recall from Module 9 that we learned about a bootstrapping approach and a theory-based (i.e., parametric) approach to estimate a standard error (SE), and corresponding confidence interval (CI), for an estimated proportion (e.g., the proportion of adolescents with a past-year Major Depressive Episode (MDE)) and a mean (e.g., the average depression severity among adolescents with a past-year MDE). In this section, we will learn how to apply these frequentist methods to estimate SEs and CIs for regression parameter estimates.

Bootstrapping

The bootstrap approach for estimating the standard error of a regression intercept and regression slope(s) represents a powerful, non-parametric method that relies on resampling techniques to approximate the sampling distribution of a statistic. Unlike theory-based or parametric methods, which depend on assumptions about the underlying distribution of the data (e.g., normality), bootstrapping generates empirical sampling distributions by repeatedly resampling the observed data with replacement. This approach allows for the direct estimation of standard errors, confidence intervals, and other statistical measures without relying heavily on theoretical distribution assumptions, making it particularly useful in situations where those assumptions may not hold or are unknown.

To gain some intuition about how bootstrapping works in a regression context, let’s create one bootstrap resample based on our nyc_bikes_2023 data. We will generate a single bootstrap resample, equal in size to our original sample of 365 days, through random selection with replacement. Recall that “drawing with replacement” means that after selecting a day for the resample, it is “returned” to the original pool, making it eligible for selection again. As a result, some days will appear multiple times in a resample, while other days might not be included at all. This technique ensures each selection is independent and allows us to accurately simulate sampling variability, adhering to the principles of bootstrap methodology.

Let’s take a look at the code to create one bootstrap resample of our data, which builds on the techniques that you learned about in Module 9:

set.seed(12345)

resample1 <- 
  nyc_bikes_2023 |> 
  select(date, total_counts, daily_temperature_2m_max) |> 
  slice_sample(n = 365, replace = TRUE) |> 
  arrange(date)

resample1 |> head(n = 14)


We can fit the regression model using this single bootstrap resample, and obtain the parameter estimates and the overall model fit using tidy() and glance().

lm1_resample <- lm(total_counts ~ daily_temperature_2m_max, data = resample1)
lm1_resample |> tidy() |> select(term, estimate)
lm1_resample |> glance() |> select(r.squared, sigma)


We can also create a scatter plot of the resampled data, which looks quite similar to the scatter plot based on the full sample that we created earlier.

resample1 |> 
  ggplot(mapping = aes(x = daily_temperature_2m_max, y = total_counts)) +
  geom_point(color = "#8DD3C7", alpha = .5) +
  theme_minimal() +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x, color = "#8DD3C7") +
  labs(title = "The relationship between temperature and number of bicyclists \non a given day in NYC",
       subtitle = "Bootstrap #1", 
       x = "Max temperature in Celsius", 
       y = "Total number of bicycle riders recorded")


Instead of creating one bootstrap resample — we can repeat the process multiple times. As an example, let’s look at the results for 9 bootstrap resamples. The table below presents the intercept, slope, \(R^2\), and sigma produced when the linear model is fit to each resample.

Regression estimates for 9 bootstrap resamples
resample intercept slope r.squared sigma
1 23949.45 1307.152 0.5666796 9561.591
2 23403.70 1339.634 0.5543797 10333.302
3 23375.82 1352.952 0.5924992 9811.788
4 23653.76 1325.031 0.5802832 9957.217
5 26075.18 1189.872 0.5136135 10411.938
6 23273.36 1327.877 0.5412949 10533.370
7 23426.31 1315.231 0.5794856 9986.892
8 23205.02 1330.360 0.5945514 9449.718
9 25125.02 1237.323 0.5445161 10270.019
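The code used to build this table is not shown in the Module, but a sketch along the following lines would produce a comparable table (the exact numbers depend on the random seed):

# One way to generate estimates from 9 bootstrap resamples
set.seed(54321) # illustrative seed; the table above was produced with a different one

nine_resamples <- map_dfr(1:9, function(i) {
  resample <- nyc_bikes_2023 |> 
    slice_sample(n = nrow(nyc_bikes_2023), replace = TRUE)
  fit <- lm(total_counts ~ daily_temperature_2m_max, data = resample)
  tibble(resample  = i,
         intercept = coef(fit)[[1]],
         slope     = coef(fit)[[2]],
         r.squared = glance(fit)$r.squared,
         sigma     = glance(fit)$sigma)
})

nine_resamples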

It is also informative to look at a scatter plot produced from each of the 9 bootstrap resamples.

In looking at the table of the 9 bootstrap resamples, we can see that the estimates of all parameters vary slightly, but are relatively similar. Likewise, the 9 scatter plots are each slightly different, as expected, but all show the same pattern — as temperature increases, more riders are recorded.

Rather than faceting by resample, we can put the best fit line for the 9 resamples on a single graph:

With just these 9 resamples, we can already get a sense of the variability from resample to resample — which helps us to visualize the uncertainty in our regression parameter estimates. For example, on the graph above we can clearly see the variability in the intercept of each line (the predicted number of riders when temperature = 0) as well as the effect of temperature on riders as illustrated by the varying slopes of the best fit lines.

Recall from Module 9 that reliable bootstrap estimates require many more than just a few resamples; several thousand resamples are typically used. Let’s take a look at a function that will repeat the process of creating bootstrap resamples, so that we can create 5000 resamples, fit the regression model in each resample, and then summarize across the resamples to quantify uncertainty.

# Create a function to carry out the bootstrap resampling and fit model
bootstrap_lm <- function(data) {
  size <- nrow(data)
  resample <- data |>
    slice_sample(n = size, replace = TRUE) |>
    arrange(date)
  lm_model <- lm(total_counts ~ daily_temperature_2m_max, data = resample)
  tidy_model <- tidy(lm_model)
  glance_model <- glance(lm_model)
  
  return(list(tidy = tidy_model, glance = glance_model))
}

# Perform the bootstrap resampling and model fitting using the bootstrap_lm() function
set.seed(12345) # For reproducibility
bootstrap_results <- map(1:5000, ~bootstrap_lm(nyc_bikes_2023))

# Extracting and preparing the data for graphing and summary
model_summaries <- 
  map2(bootstrap_results, 1:5000, ~mutate(.x$glance, resample = .y)) |>
  bind_rows() |> # takes 5000 outputs and combines them into one larger data frame. 
  select(resample, r.squared, sigma) |>
  pivot_longer(-resample, names_to = "term", values_to = "estimate") 

# Work now with the regression parameter estimates from tidy()
coefficients <- 
  map2(bootstrap_results, 1:5000, ~mutate(.x$tidy, resample = .y)) |>
  bind_rows() |>
  select(resample, term, estimate)  

# Create one data frame that combines all estimates
bootstrap_distributions <- 
  bind_rows(coefficients, model_summaries) |> 
  arrange(resample)

# Calculate the summary statistics for intercept, slope, r-squared, and sigma
bootstrap_summary <- bootstrap_distributions |>
  filter(term %in% c("(Intercept)", "daily_temperature_2m_max", "r.squared", "sigma")) |> 
  group_by(term) |>
  summarise(mean_estimate = mean(estimate),
            sd_estimate = sd(estimate))
Tip

Detailed code explanation for the curious

I don’t expect you to understand every bit of this now — but, if you’re interested in a detailed code explanation, please read the summary below.

Libraries

  • purrr: This package is used for its functional programming tools, particularly the map() functions, which apply operations across elements of a list or vector. (purrr is loaded as part of the tidyverse.)

Bootstrap Function

  • A custom function bootstrap_lm() is defined to perform bootstrap resampling and fit a linear regression model to each resample. This function:

    • Calculates the sample size of the provided data frame.

    • Performs resampling with replacement, maintaining the data frame size but allowing for repeated observations, to mimic drawing from the population.

    • Fits a linear regression model (called lm_model) to the resampled data, predicting total_counts from daily_temperature_2m_max.

    • Extracts and returns detailed model summaries (tidy_model for coefficients and glance_model for overall model metrics) in a list.

Bootstrap Resampling Execution

  • The bootstrap_lm() function is executed 5000 times using map(), each time on the nyc_bikes_2023 data frame, to generate a wide array of model fits. This simulates drawing from the underlying population distribution to assess the variability in model estimates.

  • set.seed(12345) ensures reproducibility of the results by setting a fixed starting point for random number generation used in resampling. This isn’t necessary unless you need to reproduce the exact same estimates from the random process.

Data Extraction and Preparation

  • The script extracts model summaries and coefficients from each bootstrap resample.

  • Model Summaries: Extracts the model summary values and does some reformatting to facilitate merging with coefficients later.

  • Coefficients: Similar extraction process for regression coefficients (intercept and slope), including the resample identifier.

Combining Estimates

  • Merges the coefficients and model summaries into a single data frame, bootstrap_distributions, arranging them by resample identifier. This unified structure facilitates subsequent analysis.

Summary Statistics

  • Calculates mean and standard deviation for each parameter estimate across all bootstrap resamples, summarizing the estimated distribution for relevant statistics and stores this information in a data frame called bootstrap_summary.

The output from this process is a table that presents the mean and standard deviation of each parameter estimate (i.e., intercept and slopes), the \(R^2\), and sigma (i.e., \(\sigma\)). The values under mean_estimate in the table below are simply the average estimate across the 5000 bootstrap resamples. The values under sd_estimate are the standard deviations of each estimate across the 5000 bootstrap resamples. The mutate() function in the code below rounds the numbers to the nearest thousandth for easier viewing.

bootstrap_summary |> mutate(across(where(is.numeric), ~format(round(.x, digits = 3), scientific = FALSE)))

The value under mean_estimate in the table provides the average estimate for each parameter across the 5000 bootstrap resamples. The values under sd_estimate provide an estimate of the standard error for each parameter across the 5000 bootstrap resamples. Because the bootstrap resampling mimics the sampling distribution, and the standard deviation of the sampling distribution is the standard error, this summary provides the quantification of uncertainty that we seek. These sd_estimate values quantify the variability of the intercept and slope estimates, as well as the \(R^2\) and sigma, across the bootstrap resamples — providing an intuitive and direct measure of the precision of the regression parameters estimated from the original sample data. Larger standard deviations indicate more sample-to-sample variability, while smaller standard deviations indicate less sample-to-sample variability. In this way, these standard deviations tell us about the uncertainty associated with these estimates.

Now, let’s take a closer look at some of the elements produced via the bootstrapping process to further grow our intuition. First, recall that bootstrap resampling’s purpose is to mimic the sampling distribution of each parameter. That is, if we were to repeat our study on temperature and bicycle counts in NYC many, many times, we would expect that the relationship between temperature and bicycle counts would vary from study to study (for all of the reasons mentioned earlier). For example, the predicted number of bicycles when the daily max temperature equals 0°C (i.e., the intercept) and the predicted change in the number of bicycles for each 1°C increase in max temperature (i.e., the slope) would vary to some extent from study to study. The bootstrap resamples are attempting to quantify this variability.

We can create a histogram of the estimates of the intercept, slope, sigma, and r.squared across our bootstrap resamples to visualize this variability. These estimates are stored in the data frame we created in the last code chunk — called bootstrap_distributions. In this data frame, the estimates of the intercept, slope (for daily_temperature_2m_max), r.squared, and sigma are recorded for all 5000 bootstrap resamples, as illustrated in the table below.

bootstrap_distributions |> 
  select(resample, term, estimate) |> 
  mutate(across(where(is.numeric), ~format(round(.x, digits = 3), scientific = FALSE))) |> 
  head(n = 30) 

To visualize the variability in these estimates across the bootstrap resamples, we can create a set of histograms.

bootstrap_distributions |> 
  select(term, estimate) |> 
  filter(term %in% c("(Intercept)", "daily_temperature_2m_max", "r.squared", "sigma")) |>  
  ggplot(mapping = aes(x = estimate, fill = term)) +
  geom_histogram(bins = 50, alpha = .5) +  
  facet_wrap(~term, ncol = 4, scales = "free_x") +  
  theme_minimal() +
  labs(x = "Estimate in bootstrap resample", 
       y = "Number of bootstrap resamples with the estimate",
       title = "Estimates across 5000 bootstrap resamples") +
  theme(legend.position = "none")  

Please take a moment to study the histograms to note the approximate minimum and maximum values for the estimates.

Using estimates from each bootstrap resample, we may also plot the corresponding fitted regression lines to visually assess the variability of our model’s predictions. This visualization is achieved with geom_abline(), a ggplot2 function that draws straight lines specified by an intercept and slope.

# Calculate means for each term
means <- bootstrap_distributions |> 
  group_by(term) |> 
  summarise(mean_estimate = mean(estimate))

mean_int <- 
  means |> 
  filter(term == "(Intercept)") |> 
  pull(mean_estimate) 

mean_slope <- 
  means |> 
  filter(term == "daily_temperature_2m_max") |> 
  pull(mean_estimate) 

# Reshape the data frame to have intercepts and slopes on the same row for each bootstrap sample
bootstrap_line <- 
  bootstrap_distributions |>
  filter(term %in% c("(Intercept)", "daily_temperature_2m_max")) |> 
  select(term, estimate) |> 
  mutate(sample_id = rep(1:(n()/2), each = 2)) |> # Create an ID for each pair of intercept and slope
  pivot_wider(id_cols = sample_id, names_from = term, values_from = estimate) 

# Now plot using geom_abline
nyc_bikes_2023 |> 
  ggplot(mapping = aes(x = daily_temperature_2m_max, y = total_counts)) +
  geom_point(alpha = .1, color = "lightgrey") +
  geom_abline(data = bootstrap_line, aes(intercept = `(Intercept)`, slope = daily_temperature_2m_max), color = "darkgrey") +
  geom_abline(intercept = mean_int, slope = mean_slope, color = "black") +
  theme_minimal() +
  labs(title = "Relationship between temperature and bicyclists in NYC across \n5000 bootstrap resamples",
       x = "Max temperature in Celsius",
       y = "Total number of bicycle riders recorded") +
  guides(color = "none")  

Due to the high number of resamples (5000), these lines collectively resemble a thick confidence band, effectively illustrating the range of plausible regression lines given the data variability. Additionally, a prominent black line represents the regression model defined by the mean intercept and slope calculated across all bootstrap resamples. This average line serves as a central tendency marker amidst the bootstrap lines, highlighting the overall direction and strength of the relationship modeled in the resampling process.

It’s also interesting to use the graph above to examine the span of the predicted scores (predicted bike count) at any given temperature. For example, for cold days (e.g., -5°C) and very hot days (35°C), the vertical distance between the largest and smallest predicted score is large — but for average days (e.g., around 15°C), the vertical distance between the largest and smallest predicted score is smaller. This indicates that we have more precision to predict the outcome when the predictor is toward the average score in the sample, and less precision as we move toward the tails of the distribution of the predictor. This is an important element of all fitted regression models.
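To put rough numbers on that visual impression, we can compute the spread of the bootstrap fitted lines at a cold, an average, and a hot temperature (the three temperatures below are just the examples mentioned in the text):

# Spread of the 5000 bootstrap fitted lines at selected temperatures
map_dfr(c(-5, 15, 35), function(temp) {
  preds <- bootstrap_line$`(Intercept)` + bootstrap_line$daily_temperature_2m_max * temp
  tibble(temperature = temp,
         min_prediction = min(preds),
         max_prediction = max(preds),
         spread = max(preds) - min(preds))
})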

Confidence intervals for the parameter estimates

In addition to quantifying uncertainty in the estimates via the standard error, we may calculate confidence intervals based on the bootstrap resamples. Recall from Module 9 that we used the quantile() function to find the lower and upper bound of our desired confidence interval (CI) based on the distribution of the bootstrap estimates. Let’s calculate a 95% CI for the estimates using this percentile method applied to our bootstrap resamples.

95% CI for the Intercept

bootstrap_distributions |> 
  filter(term == "(Intercept)") |> 
  summarize(lower = quantile(estimate, probs = .025),
            upper = quantile(estimate, probs = .975))

95% CI for the Slope

bootstrap_distributions |> 
  filter(term == "daily_temperature_2m_max") |> 
  summarize(lower = quantile(estimate, probs = .025),
            upper = quantile(estimate, probs = .975))

For \(R^2\)

bootstrap_distributions |> 
  filter(term == "r.squared") |> 
  summarize(lower = quantile(estimate, probs = .025),
            upper = quantile(estimate, probs = .975))

For sigma

bootstrap_distributions |> 
  filter(term == "sigma") |> 
  summarize(lower = quantile(estimate, probs = .025),
            upper = quantile(estimate, probs = .975))

The calculated 95% CIs — based on 5000 bootstrap resamples — give us insight into the uncertainty of these estimates under a frequentist framework. The CIs imply that if we were to repeat our study under identical conditions numerous times, we would expect 95% of those intervals to capture the true parameter in the population. Thus, we can think of these intervals as providing a range of plausible values.

Using the same information, we can add the CIs to the histograms of the sampling distributions plotted earlier. In the updated plot below, the mean across the bootstrap resamples is recorded in a solid line in the middle of the histogram, and the middle 95% of each distribution is marked with the dashed vertical lines.

# Calculate 95% CI for each term
cis <- 
  bootstrap_distributions |>
  group_by(term) |>
  summarise(lower = quantile(estimate, probs = .025),
            upper = quantile(estimate, probs = .975))

# Join means and CIs for easier plotting
means_cis <- left_join(means, cis, by = "term")

# Plotting
bootstrap_distributions |> 
  select(term, estimate) |> 
  ggplot(mapping = aes(x = estimate, fill = term)) +
  geom_histogram(bins = 50, alpha = .5) +
  geom_vline(data = means_cis, aes(xintercept = mean_estimate, color = term), linetype = "solid") +
  geom_vline(data = means_cis, aes(xintercept = lower, color = term), linetype = "dashed") +
  geom_vline(data = means_cis, aes(xintercept = upper, color = term), linetype = "dashed") +
  facet_wrap(~term, ncol = 4, scales = "free_x") +  # One facet per term, with free x-axis scales
  theme_minimal() +
  labs(x = "Estimate in bootstrap resample", 
       y = "Number of bootstrap resamples with the estimate",
       title = "Estimates across 5000 bootstrap resamples",
       subtitle = "Solid line represents the mean, dashed lines represent the middle 95% CI for each distribution") +
  theme(legend.position = "none")  

It’s important to note that these interpretations rely on the frequentist concept of CIs, which quantifies the level of uncertainty or variability in parameter estimates without making direct probabilistic statements about the parameters themselves. Instead, it speaks to the reliability of the estimation process — highlighting a range of plausible values for each estimate.

Confidence Intervals for Model Predictions

When we use our simple linear regression (SLR) model to predict the number of bicyclists on a day with a maximum temperature of 21°C, the model predicts 51,229 recorded cyclists. While this point estimate is useful, it’s important to understand the uncertainty around this prediction. Predictions are rarely exact, and quantifying their uncertainty helps us understand the range within which the true values might lie.

This is where CIs for predictions come into play. They offer a way to quantify uncertainty. Specifically, there are two types of uncertainty relevant to our prediction:

  1. Uncertainty Around the Mean Outcome:

    • What it is: This type of uncertainty assesses the range of plausible average outcomes for a given predictor value.

    • Example question: “What is the range of plausible average numbers of riders we can expect on days with a maximum temperature of 21°C?”

    • Term to know: This is known as a confidence interval for the predicted mean.

  2. Uncertainty Around Individual Predictions:

    • What it is: This type of uncertainty pertains to predicting individual outcomes based on certain predictor values.

    • Example question: “Given a 21°C day, what is the range of plausible rider counts we might observe?”

    • Term to know: This is known as a prediction interval for an individual predicted score.

By distinguishing between these uncertainties, we can better understand both the average predictions and the variability of individual predictions. This enhances our interpretation and communication of the model’s findings.

Let’s take a look at how we can calculate these two types of intervals for a predicted outcome.

First, we need to calculate the predicted scores for each bootstrap resample where daily_temperature_2m_max equals 21.

# Compute the predicted scores for each bootstrap resample
predicted_scores <- 
  bootstrap_distributions |>
  filter(term %in% c("(Intercept)", "daily_temperature_2m_max", "sigma")) |>
  pivot_wider(names_from = term, values_from = estimate) |>
  mutate(yhat = `(Intercept)` + daily_temperature_2m_max * 21)

predicted_scores |> head()

Next, we are ready to calculate the CI and PI for the predicted score.

95% CI for mean of y when x = 21: Calculating the CI is very simple, and mimics the CI for the intercept in our fitted model. In fact, if you centered the daily_temperature_2m_max at 21 and refit the bootstraps using the centered version of temperature, the 95% CI for the new intercept would be equal to the 95% CI that we’ll calculate below.

# Calculate the 95% CI for the predicted scores
predicted_scores |> 
  select(yhat) |> 
  summarize(lower = quantile(yhat, probs = .025),
            upper = quantile(yhat, probs = .975))

The 95% CI for the mean number of bicycle riders at 21°C, ranging from 50,043 to 52,395, provides a range of plausible values for the average outcome based on our linear regression model. From a frequentist perspective, this interval means that if we were to repeat our study many times, collecting new samples and recalculating the model and confidence interval each time, we would expect the interval to contain the true average number of bicycle riders at 21°C in 95% of these studies. It’s important to note that this interpretation does not imply a 95% probability that the true mean falls within this specific interval; rather, it reflects the reliability of the interval estimation process itself under repeated sampling. This distinction underscores the frequentist concept of CIs as a measure of the precision and stability of our estimates, rather than a probabilistic statement about the parameters themselves.
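As a quick illustration of the centering remark above, the same idea can be shown with the theory-based fit: when temperature is centered at 21, the intercept becomes the predicted mean count at 21°C, and its confidence interval is a CI for that mean. (The remark in the text concerns the bootstrapped version; this parametric sketch simply makes the logic concrete.)

# Centering temperature at 21 turns the intercept into the predicted mean at 21 degrees C
slr_centered <- lm(total_counts ~ I(daily_temperature_2m_max - 21), data = nyc_bikes_2023)
slr_centered |> confint(level = 0.95) # the "(Intercept)" row is a theory-based 95% CI for that mean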

Now that we’ve calculated the CI for the mean number of bicycle riders at 21°C, we turn our attention to the prediction interval (PI). While the CI provides a range of plausible values for the average outcome, the PI addresses the variability we might expect in individual predictions. This is particularly useful when predicting specific outcomes rather than average trends.

95% PI for a new y when x = 21: Calculating the PI involves accounting for both the uncertainty in the estimated mean and the individual variability around that mean. Essentially, the PI provides a range of plausible values for individual predicted outcomes, considering both model uncertainty and the inherent variability in the data.

The process is a bit more involved, as demonstrated in the code below. To each predicted score, we add a unique random value based on the standard deviation (sigma) from that specific bootstrap resample (thus the use of rowwise()), reflecting the individual variability. That is, the syntax mutate(adjusted_yhat = yhat + rnorm(1, mean = 0, sd = sigma)) adds a random value (drawn from a normal distribution with a mean of 0 and a standard deviation of sigma) to each predicted score (yhat).

set.seed(123) # For reproducibility

# Add a residual to yhat to account for individual variability
adjusted_predictions <- 
  predicted_scores |> 
  rowwise() |> 
  mutate(adjusted_yhat = yhat + rnorm(1, mean = 0, sd = sigma)) |> 
  ungroup()

# Calculate the 95% PI for the predicted scores
adjusted_predictions |> 
  select(adjusted_yhat) |> 
  summarize(lower = quantile(adjusted_yhat, probs = .025),
            upper = quantile(adjusted_yhat, probs = .975))

The 95% PI for the number of bicycle riders at 21°C, ranging from 31,041 to 71,240, provides a range of plausible values for individual outcomes based on our linear regression model. This interval means that for a given day with a maximum temperature of 21°C, we would expect the actual number of riders to fall within this range 95% of the time.

Comparing the CI to the PI

While the CI gives us an estimate of where the true mean number of riders might lie, the PI broadens this to account for individual variability. This distinction is crucial:

  • Confidence Interval (CI): Reflects the uncertainty around the mean outcome, giving us a range for the average number of riders we might expect.

  • Prediction Interval (PI): Incorporates both the uncertainty around the mean and the variability of individual data points, providing a broader range that accounts for the fact that individual observations can vary widely.

Practical Example

Confidence Interval (CI)

Imagine you are a city planner for NYC. You are planning to allocate resources for bike lane maintenance in different areas of the city. You want to understand the average number of riders on a particular route to ensure that the budget and resources are optimally distributed. You would use the CI for the prediction to determine the average number of bike riders per day on that route. For instance, if the CI for the mean number of riders is between 18,000 and 22,000, you can confidently state that the average usage falls within this range. This helps in planning long-term infrastructure and resource allocation for bike lane maintenance, as it reflects the stability and precision of your estimate.

Prediction Interval (PI)

Now imagine you are planning a large city event, such as a marathon or festival, and need to predict the number of bike riders on the event day to manage crowd control and safety measures effectively. You would use the PI to predict the number of bike riders on the event day. Suppose the PI for the number of riders on that day is between 15,000 and 25,000. This wide range accounts for the variability in daily bike usage, including potential spikes due to the event. Understanding this variability is crucial for preparing adequate security, emergency services, and ensuring the infrastructure can handle the higher traffic, thus informing operational decisions for that specific day.

By understanding both CIs and PIs, you can better interpret your model’s predictions. The CI tells you about the stability and precision of your estimated average, which is useful for long-term planning and resource allocation. The PI, on the other hand, informs you about the expected variability in individual predictions, which is critical for managing short-term events or unexpected fluctuations in rider numbers. This comprehensive view enhances your ability to make informed decisions based on your model, whether you’re interested in average trends or specific outcomes.

Take a look at the graph below which contrasts the CI and the PI.

We can clearly see here that when estimating the number of bicycles recorded on a 21°C day in NYC, our estimates exhibit different levels of precision depending on whether we are predicting the average number of riders (for CI) or the number of riders on a specific day (for PI). The CI for the mean prediction is much narrower compared to the PI for an individual prediction. This distinction highlights a key concept in statistical modeling.

When we calculate a CI, we are estimating the range within which the average number of riders is expected to fall, given a maximum temperature of 21°C. This interval is narrower because it only accounts for the uncertainty in estimating the mean, assuming that if we took many samples, the mean of those samples would converge to the true mean with increasing precision.

In contrast, the PI takes into account not only the uncertainty in estimating the mean but also the variability of individual rider counts around that mean. Therefore, it is wider, reflecting the additional variability that we expect in the number of riders on any given day.

This should be intuitive: estimating the average number of riders across many days is inherently more precise than predicting the number of riders on any single day. The average smooths out the day-to-day fluctuations and anomalies, leading to a more stable estimate, whereas individual days can vary significantly due to numerous unpredictable factors such as weather fluctuations, special events, or random variations in daily commuting patterns.

By understanding these intervals, we gain a more nuanced perspective on our predictions. The confidence interval gives us insight into the expected average behavior, which is useful for planning and long-term decision-making. Meanwhile, the prediction interval provides a realistic range for individual days, which is crucial for understanding the potential variability and making day-to-day operational decisions.

In summary, while both intervals are essential, they serve different purposes and offer different insights. The CI reassures us about the precision of our average predictions, while the PI prepares us for the natural variability in daily rider counts.
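For comparison with the bootstrap-based intervals above, both kinds of interval can also be requested directly from the fitted model with predict(). This theory-based shortcut anticipates the parametric approach introduced in the next section:

# Theory-based counterparts to the bootstrap CI and PI at 21 degrees C
new_data <- data.frame(daily_temperature_2m_max = 21)

slr |> predict(newdata = new_data, interval = "confidence", level = 0.95) # CI for the mean count
slr |> predict(newdata = new_data, interval = "prediction", level = 0.95) # PI for a single day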

Theory-based (i.e., parametric) estimation of standard errors and CIs

There also exists a theory-based approach to estimating the SE and corresponding CI for regression parameter estimates — namely, the intercept and slope(s). These methods rely on well-established statistical principles and assumptions about the underlying data distribution — namely that the residuals are normally distributed so that the Central Limit Theorem (CLT) holds. This section explores the theoretical foundations and computational steps involved in this estimation process.

For a SLR, the precision of the estimated intercept (\(b_0\)) and slope (\(b_1\)) is quantified by their standard errors (SE).

The SE of the slope, \(SE_{b_1}\), is calculated using the formula:

\[ SE_{b_1} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 / (n-2)}{\sum_{i=1}^{n}(x_i - \bar{x})^2}} \]

Where:

  • \(y_i\) and \(\hat{y}_i\) are the observed and predicted values of the outcome variable, respectively.

  • \(x_i\) are the values of the predictor variable, \(\bar{x}\) is the mean of these values, and \(n\) is the sample size.

The SE for the intercept, \(SE_{b_0}\), is given by:

\[ SE_{b_0} = \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}} \cdot \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2}} \]

This formula incorporates both the variability of the predictor variable and the model’s overall error. The additional term \(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\) in the formula for \(SE_{b_0}\) accounts for the distance of the mean of the predictor variable from zero, affecting the precision of the intercept estimate. When the predictor is centered so that zero falls near the middle of its distribution, the SE of the intercept will be smaller; as zero moves toward the tails of the predictor’s distribution (i.e., very low or very high values of the predictor), the SE of the intercept will be larger.
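To connect these formulas to the data, the sketch below computes both standard errors by hand from the fitted model and the observed predictor; the results should match the std.error column produced by tidy() further below.

# Compute the theory-based standard errors directly from the formulas above
x <- nyc_bikes_2023$daily_temperature_2m_max
y <- nyc_bikes_2023$total_counts
n <- length(y)

ss_res <- sum(residuals(slr)^2)      # sum of squared residuals
ss_x   <- sum((x - mean(x))^2)       # sum of squares of the predictor

se_slope     <- sqrt((ss_res / (n - 2)) / ss_x)
se_intercept <- sqrt(1 / n + mean(x)^2 / ss_x) * sqrt(ss_res / (n - 2))

c(se_intercept = se_intercept, se_slope = se_slope)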

The standard errors provide a measure of the variability or uncertainty around our estimated regression parameters. Recall that these SEs are estimates of the standard deviation of the sampling distribution for each respective parameter. For example:

  • Standard Error of the Slope (\(SE_{b_1}\)): This measures the variability of the estimated slope (which represents the change in the outcome variable for a one-unit change in the predictor variable) across many random samples drawn from the population. A smaller \(SE_{b_1}\) indicates a more precise slope estimate, suggesting that the predictor variable has a consistent relationship with the outcome variable across different samples. Essentially, it tells us how much the estimated slope is expected to fluctuate due to random sampling variability, with a smaller value implying greater reliability of the slope estimate.

  • Standard Error of the Intercept (\(SE_{b_0}\)): This measures the variability of the estimated intercept (which represents the expected value of the outcome variable when the predictor variable is zero) across many random samples drawn from the population. A smaller \(SE_{b_0}\) indicates a more precise intercept estimate, suggesting that the intercept has a consistent value across different samples. Essentially, it tells us how much the estimated intercept is expected to fluctuate due to random sampling variability, with a smaller value implying greater reliability of the intercept estimate. Note that the standard error of the intercept will vary depending on where the predictor variable(s) are centered. The closer they are centered to the mean of the predictor, the smaller the standard error will be.

In practice, statistical software like R simplifies these calculations for us. For instance, running the following code provides the estimated standard errors for the intercept and slope in our regression model:

# Fit the SLR; tidy() (from the broom package) returns the estimates and their standard errors
slr <- lm(total_counts ~ daily_temperature_2m_max, data = nyc_bikes_2023)
slr |> tidy() |> select(term, estimate, std.error)

The values listed under std.error provide the estimated standard errors for \(SE_{b_0}\) and \(SE_{b_1}\), denoting the precision with which we can estimate the intercept and slope of the regression model. These standard errors are crucial for constructing confidence intervals and conducting hypothesis tests about the regression parameters (which we will tackle in Module 16), helping us understand and quantify the uncertainty in our estimates.

Confidence intervals for the parameter estimates

To calculate CIs for the intercept and slope(s) using a theory-based (i.e., parametric) approach, we take the estimate plus or minus the critical value of t times the standard error (std.error). Recall from Module 9 that to calculate the appropriate critical value of t, we need the degrees of freedom (df) for the model. For a regression model, the appropriate df for computing the critical value of t for the intercept and the slope(s) is calculated as n - 1 - p, where n is the sample size, and p is the number of predictors (we lose one df for estimating the intercept and an additional df for estimating the effect of each predictor). Since we only have 1 predictor in this regression model, the correct df is 365 - 1 - 1 = 363.

Thus, if we desire to calculate a 95% CI, the appropriate critical values of t are:

qt(c(.025,  .975), df = 363)
[1] -1.966521  1.966521

CI for the intercept:

\[ CI = b_0 \pm t_{n-2, \alpha/2} \times SE_{b_0} \]

\[ 95\% \text{ CI } = 23711.546 \pm 1.967\times1212.994 \]

CI for the slope:

\[ CI = b_1 \pm t_{n-2, \alpha/2} \times SE_{b_1} \]

\[ 95\% \text{ CI } = 1310.377 \pm 1.967\times61.511 \]

We don’t need to calculate these by hand. The tidy() output contains these intervals.

slr |> 
  tidy(conf.int = TRUE, conf.level = .95) |> 
  select(term, estimate, std.error, conf.low, conf.high)
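
If you want to verify the arithmetic, a quick sketch using the estimates and standard errors shown above reproduces the same intervals:

t_crit <- qt(.975, df = 363)
23711.546 + c(-1, 1) * t_crit * 1212.994   # 95% CI for the intercept
1310.377  + c(-1, 1) * t_crit * 61.511     # 95% CI for the slope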

What do the CIs capture in this context? In the long run, if we were to repeatedly collect samples and calculate the 95% CI for the effect of temperature on number of bicycle riders, we would expect about 95% of those intervals to contain the true population slope. Our calculated interval for this study is approximately 1189 to 1431; that is, for each 1°C increase in temperature, we plausibly expect between 1189 and 1431 additional riders. On the other hand, the 95% CI for the intercept spans from about 21326 to 26097. On days when the temperature is 0°C, we can plausibly expect between 21326 and 26097 riders, on average.
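
For reference, base R's confint() function returns the same theory-based intervals as the tidy() output above:

slr |> confint(level = .95)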

Now, let’s recreate the scatter plot with the best fit line overlaid, but now we will request standard errors (se = TRUE within the geom_smooth() function call). The shaded area in the figure below provides an assessment of the precision of our estimated best fit line. Though our single sample provides an intercept and slope for the best fit line in the sample (the thick black line), the shaded grey area provides a range of plausible areas where the best fit line in the population may fall. In this way, our regression model provides us with the best fit line in our sample, but also provides us with an understanding of the precision of our estimates and how much we can expect them to vary from sample-to-sample.

Notice that when max temperature is at the mean across days (17.7°C), the vertical length of the confidence band is at its smallest. The confidence band is quite a bit thicker at the tails of the temperature distribution (e.g., at 0°C or at 35°C). If we centered temperature at its mean (or at another value near the middle of the distribution), the standard error for the intercept would be substantially smaller.

nyc_bikes_2023 |> 
  ggplot(mapping = aes(x = daily_temperature_2m_max, y = total_counts)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, formula = y ~ x, color = "black") +
  theme_minimal() +
  labs(title = "Scatterplot of max daily temperature and bicycle riders with best fit line",
       subtitle = "Confidence interval for regression line in gray shaded area", 
       x = "Max temperature in Celsius",
       y = "Total number of bicycle riders recorded") 

Confidence intervals for model predictions

Earlier we used the predict() function to calculate the predicted score for the outcome at a certain value of \(x_i\) by using the following code:

# Creating a data frame with the desired predictor values
new_data <- data.frame(daily_temperature_2m_max = 21) 
# Utilizing the linear model to predict outcomes based on new data
predicted_riders <- slr |> predict(newdata = new_data) 

95% CI for the mean of y at a certain x: We can augment the predict() function code to calculate a CI for the predicted count of bicycles when the temperature equals 21°C by specifying interval = "confidence":

slr |> predict(newdata = new_data, interval = "confidence", level = .95)
       fit      lwr      upr
1 51229.46 50099.72 52359.21

The CI for the predicted number of bicycle riders at a daily maximum temperature of 21°C is given as 50,100 to 52,359, with a point estimate of 51,229 riders. This interval provides a range of plausible values for the true average number of riders on days with a maximum temperature of 21°C. The CI reflects the precision of our estimate of the mean response: the narrower the interval, the more precise our estimate. It’s important to note that this confidence interval pertains to the average number of bicycle riders we expect to see under these conditions, not the variability of individual observations around this mean.

The predict() function in R, when used with linear regression models, calculates both confidence intervals (CIs) and prediction intervals (PIs) for the predicted values. These intervals are based on the standard error (SE) of the prediction, which differs depending on whether we desire a CI or a PI.

For a CI, the formula for the SE is as follows:

\[ SE_{\hat{y}} = \sqrt{MSE \cdot \left(\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right)} \]

Where:

  • \(MSE\) is the mean squared error of the regression model, calculated as: \(MSE = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\)

  • \(n\) is the sample size

  • \(x\) is the value of the predictor for which the prediction is made

  • \(\bar{x}\) is the mean of the predictor variable

  • \(x_i\) are the individual observed values of the predictor

CI Calculation:

Given \(SE_{\hat{y}}\), the CI is calculated as:

\[ CI = \hat{y} \pm t_{n-2, \alpha/2} \times SE_{\hat{y}} \]

where \(t_{n-2, \alpha/2}\) is the critical value from the t-distribution with \(n-1-p\) degrees of freedom and a significance level of \(\alpha\) (i.e., the same df used for the fitted model's parameter estimates).
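
As a minimal sketch (reusing the slr model and new_data objects defined earlier in the Module), we can reproduce the confidence interval reported by predict() by hand:

x      <- nyc_bikes_2023$daily_temperature_2m_max
n      <- length(x)
mse    <- sum(residuals(slr)^2) / (n - 2)
se_fit <- sqrt(mse * (1 / n + (21 - mean(x))^2 / sum((x - mean(x))^2)))

y_hat <- slr |> predict(newdata = new_data)
y_hat + c(-1, 1) * qt(.975, df = n - 2) * se_fit   # matches interval = "confidence"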

95% PI for a new y at a certain x: We just need a simple change to the prior code for the CI to obtain a PI instead, namely, we need to change the interval type from "confidence" to "prediction":

slr |> predict(newdata = new_data, interval = "prediction", level = .95)
       fit      lwr      upr
1 51229.46 31021.89 71437.04

The prediction interval (PI) for a new observation of the number of bicycle riders at a daily maximum temperature of 21°C is given as 31,022 to 71,437, with a point estimate of 51,229 riders. This provides a range of plausible values for the number of riders on any given day with a maximum temperature of 21°C. Compared to the CI, the PI is substantially wider, reflecting not just the uncertainty in estimating the mean number of riders but also the natural variability of individual rider counts around that mean. The PI accounts for both the error in the prediction of the mean and the spread of individual observations around the predicted mean, which is why it’s much broader than the CI. This broader interval provides a realistic range for planning or decision-making purposes, acknowledging that actual observations can significantly vary around the predicted average due to factors not captured by the model.

The SE that the predict() function uses to calculate the PI includes an additional term for the residual variance, reflecting the scatter of individual points around the regression line:

\[ SE_{\text{new y}} = \sqrt{MSE + SE_{\hat{y}}^2} \]

Given this SE for a new observation, the PI is calculated as:

\[ PI = \hat{y} \pm t_{n-2, \alpha/2} \cdot SE_{\text{new y}} \]

The key difference between the CI and PI calculations lies in the standard error component. For CIs, the SE reflects the uncertainty in estimating the mean response for given \(x_i\) values. For PIs, the SE also incorporates the variability (MSE) of individual observations around the regression line, leading to wider intervals. These calculations enable the predict() function to provide estimates that account for the precision of model predictions (CI) and the expected range of variability in new data points (PI).
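
Continuing the sketch above, the prediction interval only requires adding the residual variance (MSE) under the square root before applying the same critical value:

se_new <- sqrt(mse + se_fit^2)
y_hat + c(-1, 1) * qt(.975, df = n - 2) * se_new   # matches interval = "prediction"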

Let’s take a look at one last visualization to further study the difference between a CI and a PI for a predicted score. Error bar plots are a powerful visual tool designed to express this uncertainty, providing a clear representation of the precision and reliability of statistical estimates. Error bar plots are graphical representations that utilize points to signify central estimates, like means or predicted values, and bars, known as “error bars,” to span a range that reflects variability or uncertainty. These bars can depict different statistical measures of dispersion, such as standard deviations, standard errors, or more specifically for our discussion, CIs and PIs.

# Create a data frame with the point estimates and the CI and PI when x = 21
results <- data.frame(
  type = c("CI", "PI"),
  estimate = c(51229.46, 51229.46),  # Assuming the same point estimate for both CI and PI
  lower = c(50099.72, 31021.89),  # CI lower bound, PI lower bound
  upper = c(52359.21, 71437.04)   # CI upper bound, PI upper bound
)

# Create the ggplot
results |> 
  ggplot(mapping = aes(x = type, y = estimate, ymin = lower, ymax = upper)) +
  geom_errorbar(width = 0.05, color = "skyblue3") +
  geom_point(size = 2, color = "skyblue4") +
  scale_x_discrete(labels = c("CI" = "Confidence Interval", "PI" = "Prediction Interval")) +
  labs(title = "Error bar plot for predicted riders at 21°C",
       x = "", y = "Number of riders") +
  theme_minimal() 

The plot distinctly illustrates that estimating a 95% Confidence Interval (CI) for the average number of riders on days with a maximum temperature of 21°C is relatively precise. This precision is evident in the narrow range of the CI, indicating a high level of confidence in where the true average likely falls. In contrast, the 95% Prediction Interval (PI), which forecasts the expected range of riders for any single day at 21°C, exhibits significantly less precision. The broader span of the PI underscores the inherent variability in daily rider counts, reflecting both the uncertainty in estimating the mean and the natural fluctuation of individual observations around that mean.

It is instructive to consider the questions that can be answered based on each type of interval. We might use a CI to answer a question like: “What is a range of plausible values for the average rider count on a 21°C day?” On the other hand, we might use a PI to answer a question like: “What is a range of plausible values for the expected number of riders if next Thursday is 21°C?” Intuitively, if trying to get a CI for an average, we can be quite precise because taking an average of multiple observations tends to reduce the impact of individual variability.

This is because the Central Limit Theorem tells us that the sampling distribution of the sample mean will tend to be normally distributed and have a smaller standard deviation compared to the original data distribution, especially as the sample size increases. Thus, the CI for the average number of riders is narrower, reflecting this reduced variability.

However, predicting a single observation (using a PI) must account for the full variability of individual data points, resulting in a wider interval.

This visualization thus effectively communicates the contrast in precision between predicting average outcomes (CI) and individual events (PI), emphasizing the greater uncertainty associated with the latter.

Bayesian approach to quantification of uncertainty

In Module 9 we studied a Bayesian approach to estimating and visualizing uncertainty for descriptive statistics (i.e., a proportion and a mean). In this section, we’ll use the same machinery to estimate and visualize uncertainty for linear regression parameter estimates.

In Bayesian regression, the approach to understanding population parameters — including the intercept (\(\beta_0\)), the slope (\(\beta_1\)), and sigma (\(\sigma\)) — fundamentally diverges from the frequentist perspective. Rather than treating these parameters as fixed yet unknown quantities that we aim to estimate directly from the sample data, the Bayesian method views them as random variables endowed with their own probability distributions. This pivotal shift in perspective enables the incorporation of existing knowledge or beliefs into the analysis through the use of prior distributions.

Distinguishing Bayesian and Frequentist Approaches

  • Frequentist View: When taking a frequentist approach to uncertainty in linear regression parameters, \(\beta_0\), \(\beta_1\), and \(\sigma\) are considered as fixed values that exist in reality, but are unknown. The goal is to use sample data to make inferences about these fixed values, typically resulting in single-point estimates (\(b_0\), \(b_1\)) and an assessment of estimation uncertainty (e.g., through standard errors and CIs) based purely on the observed data.

  • Bayesian View: When taking a Bayesian approach to uncertainty in linear regression parameters, \(\beta_0\), \(\beta_1\), and \(\sigma\) are considered uncertain quantities expressed through probability distributions. This approach acknowledges our uncertainty about their values before seeing the data (prior beliefs) and updates this uncertainty in light of the data (posterior beliefs). The result is not a single estimate but a full probability distribution for each parameter, reflecting all possible values they could take and the likelihood of each, given the data and our prior beliefs.

Incorporating Prior Knowledge

  • Prior Distributions: Before analyzing data, Bayesian analysis requires specifying prior distributions for each parameter. These priors can be based on historical data, expert opinion, or could be non-informative (though this is not ideal) — serving as a baseline from which the data can speak more loudly. This aspect is a key distinguishing feature of the Bayesian approach, allowing for a more nuanced and comprehensive analysis that integrates both new data and existing knowledge.

  • Updating Beliefs with Data: Through Bayes’ theorem, the prior distributions are updated with the observed data to produce posterior distributions. These posteriors offer a nuanced view of our updated beliefs about the parameters’ likely values, balancing prior knowledge with the information provided by the data.

Understanding Parameters as Distributions

  • Random Variables with Distributions: Viewing parameters as random variables means recognizing that our knowledge about them is uncertain and best described probabilistically. The Bayesian framework captures this uncertainty and refines it using data, moving from broader prior distributions to more focused posterior distributions that reflect our updated understanding.

In summary, the Bayesian approach transforms the task of parameter estimation from calculating fixed point estimates (i.e., as in a frequentist approach) to understanding the parameters as inherently uncertain quantities whose possible values and associated probabilities are continuously updated with new evidence. This probabilistic perspective not only accommodates existing knowledge but also quantifies uncertainty in a comprehensive and flexible manner, distinguishing it fundamentally from the frequentist approach to population parameters and sample parameter estimates.

A new data frame to apply a Bayesian framework

In this section, we will continue to explore the relationship between temperature (daily_temperature_2m_max) and the number of bicycle riders (total_counts) in NYC on a particular day. Instead of refitting the model using the 2023 data, we will leverage our knowledge from the 2023 data as prior information. This prior information will be used to examine the relationship between temperature (daily_temperature_2m_max) and the number of bicycle riders (total_counts) for the 2020 data.

The data frame, which includes all of the same variables as in the 2023 data frame, is hosted in a file called nyc_bikes_2020.Rds. Let’s begin by importing the data.

nyc_bikes_2020 <- read_rds(here("data", "nyc_bikes_2020.Rds")) 
nyc_bikes_2020 |> head()

When conducting Bayesian model building, it’s best to embrace a step-by-step approach, starting with the fundamentals before advancing to more complex considerations. This methodology allows us to systematically explore the relationship between an outcome (\(Y\)) and one or more predictor variables (\(X\)), rooted in the following principles:

  1. Identify the Nature of \(Y\): The first step involves discerning whether the outcome variable is discrete or continuous. This distinction guides us in selecting the most appropriate model structure for the data (e.g., a Normal/Gaussian distribution for continuous data, which is our focus in this course).

  2. Modeling the Mean of the Outcome: We express the mean of the \(y_i\) scores for variable \(Y\) as a function of the predictor(s) (e.g., \(\mu_i = \beta_0 + \beta_1 x_i\) for a SLR). This expression lays the foundation for understanding how changes in the predictor are expected to relate to changes in the outcome.

  3. Identifying Unknown Parameters: Every model comes with parameters that need to be estimated — such as the intercept (\(\beta_0\)), slope (\(\beta_1\)), and the standard deviation of the residuals (sigma — that is, \(\sigma\)). Identifying these parameters is crucial for model formulation.

  4. Choosing Prior Distributions: The essence of Bayesian modeling lies in selecting prior distributions for these unknown parameters. For the intercept (\(\beta_0\)) and slope (\(\beta_1\)), which can span the entire real line, Normal distributions serve as intuitive choices given their flexibility. Choosing a prior distribution for \(\sigma\) can be more challenging. We must recognize that \(\sigma\) represents a standard deviation and cannot be negative, so we want to exclude the possibility of negative values in our prior information. The parameters of these distributions — mean (\(m\)) and standard deviation (\(s\)) — are hyperparameters (i.e., a hyperparameter is a parameter of a prior distribution) that can be adjusted based on prior knowledge or beliefs about the values of \(\beta_0\), \(\beta_1\), and \(\sigma\). A compact sketch of the full model specification appears just after this list.
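
Putting these four steps together, the full SLR model specification can be sketched compactly as follows (writing each Normal distribution in terms of its mean and standard deviation, the convention used throughout this Module, with generic hyperparameters \(m\) and \(s\) to be chosen below):

\[
\begin{aligned}
y_i &\sim \text{Normal}(\mu_i, \sigma), \quad \text{where } \mu_i = \beta_0 + \beta_1 x_i \\
\beta_0 &\sim \text{Normal}(m_0, s_0) \\
\beta_1 &\sim \text{Normal}(m_1, s_1) \\
\sigma &\sim \text{a prior restricted to positive values}
\end{aligned}
\]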

Defining the priors

We have thoroughly studied the linear model that relates temperature to the count of bike riders in NYC based on the 2023 data. We will make use of the bootstrap distributions of the model parameters derived from the 2023 data to specify the priors for the 2020 data. As described in Module 9, Bayesian models use Markov Chain Monte Carlo (MCMC) algorithms to estimate the parameters of the model, and for this type of estimation, it's recommended to center the predictor(s) toward the middle of the data so the intercept of the regression model has a good deal of data available to estimate the parameter. This can help the models converge. Therefore, instead of using the temperature variable expressed in its natural metric (where 0 represents 0°C), we'll create a new version of the temperature variable — called temp.21 — which is centered at 21°C. This temperature, which equates to about 70°F, is more firmly in the middle of the distribution. I also picked this value since we explored this particular temperature earlier in the Module when calculating predicted bicycle counts based on the fitted model.

Since we'll use the 2023 data as prior information, I refit the bootstrap models that we studied above using this centered version of temperature. The table below presents the mean and standard deviation of the three parameters that we need to specify priors for — i.e., the intercept, the slope for temperature, and sigma — based on the centered version of temperature (temp.21) for the 2023 data. Because of the recentering of the predictor, the estimate and standard error for the intercept have shifted (as compared to our prior bootstrap results).

It’s also useful to visualize what the distribution of these parameters looks like across our bootstrap resamples using the 2023 data. The plots below show the distribution of each of the parameters across the 5000 bootstrap resamples. Please take a moment to look at these — noting the middle of the distribution, as well as the minimum and maximum scores.

The mean, standard deviation, minimum, and maximum estimates derived from the bootstrap distributions of the 2023 data provide a foundation for specifying priors in a Bayesian linear regression model for the 2020 data. These estimates not only capture the central tendency and variability of each parameter but also the range of plausible values, offering a detailed view into the empirical knowledge and uncertainties surrounding these parameters. This nuanced approach allows us to directly incorporate our empirical insights and uncertainties into the Bayesian analysis for the 2020 data, ensuring a more informed and stable model.

Using Bootstrap Estimates as Priors

  • Intercept (\(\beta_0\)) Prior: Given the nature of the intercept, a Normal distribution is an appropriate choice for the prior, set as \(\beta_0 \sim N(51224, 612)\) (i.e., the intercept is assumed to follow a Normal distribution with a mean of 51224 and a standard deviation of 612). This specification uses the bootstrap mean as the distribution’s center and the standard deviation to capture our uncertainty about the estimate.

  • Slope (\(\beta_1\)) Prior: For \(\beta_1\), reflecting how much the outcome variable changes with a one-unit increase in the predictor, a Normal prior based on the bootstrap estimates is utilized: \(\beta_1 \sim N(1311, 55)\). This choice acknowledges the observed relationship between temperature and bike ridership, factoring in the magnitude’s uncertainty.

  • Residual Standard Deviation (\(\sigma\)) Prior: Specifying a prior for \(\sigma\) can be challenging because it represents a standard deviation (which is a bit less intuitive than the intercept and slope) and cannot be negative1. The bootstrap distributions for \(\sigma\) based on the 2023 data appeared relatively normal in our histogram. Thus, a Normal prior for \(\sigma\) based on the 2023 data is as follows: \(\sigma \sim N(10221, 500)\).

Incorporating Priors into the 2020 Model

Bayesian modeling provides a method for a scientist to incorporate expert knowledge directly. For example, incorporating the priors derived from the 2023 data into our Bayesian linear regression model for the 2020 data serves as a methodical bridge, linking empirical insights from a recent period to the analysis of an exceptional year marked by the global COVID-19 pandemic. This period likely experienced deviations in patterns due to unprecedented circumstances, such as lock downs and changes in public behavior, which could influence bike ridership differently compared to subsequent years. As a result, while our priors based on 2023 data provide a strong empirical foundation, we could consider modifications that reflect potential differences expected in 2020. For instance, overall ridership levels might have been different (perhaps lower or higher) or exhibited different sensitivities to temperature due to reduced commuting or recreational activities during various phases of pandemic-related restrictions. Thus, in setting our priors, we could:

  1. Widen the Prior Distributions: Adjusting the standard deviations of our Normal priors can offer greater flexibility, acknowledging the increased uncertainty and potential for variation in ridership patterns during 2020. This approach helps ensure our priors are informative yet not overly restrictive, allowing the 2020 data to significantly inform the posterior distributions.

  2. Consider Prior Predictive Checks: Conducting prior predictive checks can be invaluable. This involves simulating data from our model using only the priors and examining whether these simulated data outcomes seem reasonable within the context of 2020. If the simulated outcomes are excessively unrealistic (e.g., predicting higher ridership than plausible during lock down periods), this signals a need to adjust the priors to better reflect the expected conditions of that year.

  3. Embrace the Bayesian Philosophy: The essence of Bayesian analysis lies in its iterative process of updating beliefs with new evidence. By starting with priors informed by 2023 data but adjusted for the anticipated differences of 2020, we embrace a balanced approach. This allows the data from 2020 to “speak” more emphatically, particularly when it presents new patterns or trends not observed in the 2023 dataset. Our analysis, therefore, remains rooted in empirical knowledge while being receptive to the unique insights the 2020 data may reveal.

In summary, by thoughtfully calibrating our priors to account for the peculiarities of 2020 without making them overly constrictive, we enable a nuanced analysis that respects both our prior knowledge and the distinctiveness of the data at hand. This balanced approach underscores the Bayesian commitment to learning from data, ensuring that our model is both informed by past insights and adaptable to uncover new understandings, especially in the face of unprecedented events like those experienced during the COVID-19 pandemic.

Following the second recommendation above (Consider Prior Predictive Checks), we can check that the prior distributions we've selected fit what we imagine for the 2020 data by using the following code, which allows for the visualization of the priors. The plot_normal() function comes from the bayesrules package, a companion to the highly recommended Bayes Rules! e-book.

library(bayesrules)   # provides plot_normal()

see_prior_intercept <- 
  plot_normal(mean = 51224, sd = 612) + 
  theme_minimal() +
  labs(x = "intercept", y = "probability density function")

see_prior_slope <- 
  plot_normal(mean = 1311, sd = 55) + 
  theme_minimal() +
  labs(x = "slope", y = "probability density function")

see_prior_sigma <- 
  plot_normal(mean = 10221, sd = 500) + 
  theme_minimal() +
  labs(x = "sigma", y = "probability density function")

library(patchwork)
p <- see_prior_intercept | see_prior_slope | see_prior_sigma 
p + plot_annotation(title = "Implied density plots for specified priors")

These density plots show the implied probability distributions for each parameter estimate based on our 2023 data. We can check these to see if the priors appropriately capture the variability and range we expect for the 2020 data. For example, noting the center of the distribution and the spread of each distribution (e.g., the minimum and maximum scores for each parameter). This visualization helps us verify that our priors are reasonable and aligned with our empirical knowledge.

Practical Steps

  1. Interpret the Plots: Assess whether the peaks (modes) of the distributions are centered around the mean estimates. Check the spread of the distributions to ensure they encompass the variability seen in the 2023 data.

  2. Adjust Priors if Necessary: If the density plots indicate that the priors are too narrow or too broad, adjust the standard deviations accordingly. Re-evaluate the choice of distribution (e.g., considering heavier-tailed distributions for sigma if needed).

  3. Validate Assumptions: Cross-verify the empirical knowledge and assumptions underlying the priors. Ensure that any domain-specific insights are reflected in the chosen distributions.

By thoroughly examining these density plots based on our 2023 priors, we can ensure that our priors are not only informed by the empirical data from 2023 but also appropriately flexible to capture the variability and uncertainties expected in the 2020 data. This careful consideration enhances the robustness and credibility of our Bayesian model.

Additionally, we can simulate data based on our prior distribution specifications. This step allows us to verify that the priors capture the variability and range we expect for the 2020 data. Visualizing these distributions helps ensure that our Bayesian model is well-informed by the previous year’s data and is capable of providing realistic predictions.

We can accomplish this task via the brms package. We start by setting up what looks like a typical Bayesian model (similar to our Module 9 example — but here we have a predictor of our outcome: formula = total_counts ~ temp.21). Additionally, for predictive checking we add the argument sample_prior = "only", which instructs the brm function to sample exclusively from the prior distribution, ignoring the observed data. This is useful for visualizing and understanding the prior predictive distribution, which shows the range of outcomes that the priors alone would predict.

Purpose of sample_prior = "only"

  • Prior Predictive Checks: By sampling only from the prior, we can perform prior predictive checks to ensure that our priors are reasonable and cover a plausible range of values for the outcome variable.

  • Model Validation: It helps in validating the priors to see if they reflect our domain knowledge and assumptions before fitting the model to the actual data.

  • Understanding Priors: It allows us to visualize the implications of our priors independently of the data, which can help in refining the priors if they seem too restrictive or too vague.

The code to carry out this task is below:

library(brms)   # provides prior() and brm()

# Create centered version of temperature in the 2020 data frame
nyc_bikes_2020 <-
  nyc_bikes_2020 |> 
  mutate(temp.21 = daily_temperature_2m_max - 21)


# Define our priors 
priors <- c(
  prior(normal(51224, 612), class = Intercept),
  prior(normal(1311, 55), class = b, coef = 'temp.21'),
  prior(normal(10221, 500), class = sigma)
)

# Visualize implied distributions
sim_based_on_priors <- brm(
    formula = total_counts ~ temp.21,
    data = nyc_bikes_2020,
    family = gaussian(),
    prior = priors,
    sample_prior = "only",
    chains = 8, 
    warmup = 5000, 
    iter = 20000, 
    thin = 50  
    )

Once the data are simulated based on the priors, we can create plots to determine if the output seems reasonable. To do so we use the add_epred_draws() function from the tidybayes package. Let's take a look at the code to obtain 100 estimates of the best fit line based on our priors, then break down how the code works.

library(tidybayes)   # provides add_epred_draws()

# Visualize the prior predictive distribution (100 best fit lines)
draws_priors <- 
  tibble(temp.21 = seq(-21, 15)) |> 
  add_epred_draws(sim_based_on_priors, ndraws = 100) 

draws_priors |> 
  ggplot(mapping = aes(x = temp.21, y = .epred)) +
  geom_line(mapping = aes(group = .draw), alpha = 0.2) +
  theme_minimal() +
  labs(title = "100 best fit lines simulated from the prior distribution",
       x = "Max daily temperature in Celsius (centered at 21°C)", 
       y = "Number of bicyle riders") 

Code breakdown:

  1. Temperature Sequence Creation: The tibble draws_priors is created with a sequence of temperature values from -21 to 15. These temperatures are the input values for which predictions will be generated, ranging from a very cold temperature to a quite hot temperature. Recall that we centered the temperature variable at 21°C, so this range must be expressed in the centered metric of the variable to work properly.

  2. Generate Predicted Values: The add_epred_draws() function is used to generate 100 sets of predicted values for the number of bicycle riders (total_counts) based on the model sim_based_on_priors specified earlier.

    • The sim_based_on_priors model was fitted using only the prior distributions, so the predicted values reflect the variability and relationships implied by the priors alone.

    • ndraws = 100 ensures that 100 different sets of predicted values are generated, giving a sense of the variability and uncertainty in the prior predictive distribution.

  3. Combine Predictions with Input Data: The resulting tibble, draws_priors, now contains the original temperature values along with the corresponding predicted values for each of the 100 draws. Each draw implies a possible “best fit line” for the relationship between temperature and the number of bicycle riders, as informed by the prior distributions that we specified.

  4. Plot the best fit lines: Using the temperature scores and the corresponding predictions for number of bicycle riders (total_counts), we can create a line plot using ggplot() — which effectively shows the 100 fitted lines from the 100 draws of the data.

Purpose: The purpose of this process is to visualize the prior predictive distribution. By plotting these 100 best fit lines, we can assess whether the prior distributions are reasonable and whether they introduce sufficient variability to account for the uncertainty likely present in the actual data. If the lines appear too similar, it may indicate that the priors are too restrictive and need to be adjusted to better capture the expected range of outcomes.

Here’s the output from this process:

Evaluating the Prior Distributions

Upon careful review of the prior predictive checks performed with our initial priors — derived from the 2023 data — it is evident that these priors might be too conservative for applying to new data, particularly for 2020, a year markedly impacted by the COVID-19 pandemic. Specifically, the prior standard deviations for the intercept and slope, along with the prior for sigma (\(\sigma\)), yield a predictive distribution with less variability than is likely prudent. For example, there seems to be too little variability in the intercept and slope estimates (as illustrated by the fitted lines across 100 simulated samples). This limited variability suggests that our priors might unduly influence the model’s estimation when applied to the new data from 2020.

Rationale for Adjustments

Given the unprecedented circumstances of 2020, including lock downs, shifts in commuting patterns, and changes in recreational activities, it’s reasonable to anticipate greater uncertainty in the parameter estimates. To account for this increased uncertainty and ensure our model can adequately capture the potential range of outcomes, it’s likely prudent to widen the variance of the standard deviations for both the intercept and slope, and similarly, to widen the expected variance for \(\sigma\). This adjustment aims to:

  1. Broaden the Predictive Distribution: Allowing for a wider range of possible outcomes that better reflect the heightened uncertainty and variability in ridership patterns during the pandemic.

  2. Mitigate Overconstraining Effects: Reducing the risk that our priors excessively limit the model’s capacity to learn from the 2020 data, thus maintaining a balance between prior knowledge and empirical evidence.

  3. Enhance Model Flexibility: Ensuring the model can adapt more freely to the data, which is crucial in accurately capturing the effects of the unique conditions of 2020 on bike ridership.

Updated Code for Setting Priors

Based on this rationale, we will adjust our priors by tripling the standard deviation estimate for the intercept, slope, and \(\sigma\). Then we can take another look at the implied density graphs and simulated data based on the updated, wider priors:

# Density plots with wider priors
see_prior_intercept <- 
  plot_normal(mean = 51224, sd = (612*3)) + 
  theme_minimal() +
  labs(x = "intercept", y = "probability density function")

see_prior_slope <- 
  plot_normal(mean = 1311, sd = (55*3)) + 
  theme_minimal() +
  labs(x = "slope", y = "probability density function")

see_prior_sigma <- 
  plot_normal(mean = (10221), sd = (500*3)) + 
  theme_minimal() +
  labs(x = "sigma", y = "probability density function")

library(patchwork)
p <- see_prior_intercept | see_prior_slope | see_prior_sigma 
p + plot_annotation(title = "Implied density plots for specified wider priors")

wider_priors <- c(
  prior(normal(51224, 1836), class = Intercept),
  prior(normal(1311, 165), class = b, coef = 'temp.21'),
  prior(normal(10221, 1500), class = sigma)
)

sim_based_on_wider_priors <- 
  brm(
    formula = total_counts ~ temp.21,
    data = nyc_bikes_2020,
    family = gaussian(),
    prior = wider_priors,
    sample_prior = "only",
    chains = 8, 
    warmup = 5000, 
    iter = 20000, 
    thin = 50 
    )
  
# Visualize the prior predictive distribution (100 best fit lines)
draws_wider_priors <- tibble(
  temp.21 = seq(-23,15)) |> 
  add_epred_draws(sim_based_on_wider_priors, ndraws = 100) 

draws_wider_priors |> 
  ggplot(mapping = aes(x = temp.21, y = .epred)) +
  geom_line(mapping = aes(group = .draw), alpha = 0.2) +
  theme_minimal() +
  labs(title = "100 best fit lines simulated from the prior distribution",
       subtitle = "Updated wider priors",
       x = "Max daily temperature in Celsius (centered at 21°C)", 
       y = "Number of bicycle riders") 

Comparing these visualizations to the earlier ones based on the narrower priors, we can see that the priors now admit considerably more uncertainty. To see this, compare the range of scores on the x-axis in the density plots for the initial priors and the wider priors. Additionally, take note of the increased variability in the 100 fitted regression lines based on the wider priors. The increased variability across simulated samples indicates that our model is now better equipped to account for the heightened variability and unpredictability associated with the 2020 data. This broader range of predicted values allows the model to capture more extreme outcomes, which are plausible given the unique circumstances of the COVID-19 pandemic.

Key Observations from the Updated Visualizations

  1. Increased Variability:
    • The prior predictive distribution now displays a wider range of possible outcomes for the number of bicycle riders given the maximum daily temperature. This suggests that our model can accommodate a greater degree of uncertainty, which is essential for accurately reflecting the impact of the pandemic on bike ridership patterns.
  2. Better Fit for 2020 Data:
    • By visualizing the simulated data frames from the updated priors, we observe that the predicted values span a broader range, which aligns more closely with the expected variability in the 2020 data. This ensures that our model does not over-constrain the predictions, allowing for a more flexible and realistic representation of the data.
  3. Enhanced Model Flexibility:
    • The adjustments to the priors have helped mitigate the risk of overfitting to the prior information from 2023. The model is now more adaptable, capable of learning from the new data without being overly influenced by the priors.

Fit the Bayesian model to the 2020 data

With our priors selected, we're ready to fit the Bayesian simple linear regression model. The code to carry out this task is below, and builds directly on our models from Module 9. The primary difference is that we are now fitting a regression model (formula = total_counts ~ temp.21), and our priors require inputs for the intercept, slope, and sigma, as we discussed earlier.

wider_priors <- c(
  prior(normal(51224, 1836), class = Intercept),
  prior(normal(1311, 165), class = b, coef = 'temp.21'),
  prior(normal(10221, 1500), class = sigma)
)

# Specify the final Bayesian linear regression model
bayesian2020 <- brm(
  formula = total_counts ~ temp.21,
  data = nyc_bikes_2020,
  family = gaussian(), 
  prior = wider_priors,
  warmup = 5000, 
  iter = 20000,
  thin = 50,
  chains = 8
)

Explanation of Arguments to brm():

  • formula: Specifies the model formula.

  • data: The data frame nyc_bikes_2020 contains the data to be used in the model.

  • family: Indicates the distribution of the outcome variable. gaussian() specifies a Normal distribution, suitable for continuous outcomes.

  • prior: Specifies the prior distributions for the model parameters. wider_priors is a user-defined set of priors that reflect prior beliefs about the parameters before observing the data.

  • warmup: The number of warm-up iterations per chain. These iterations are used to allow the Markov chains to reach a stable state and are not included in the final analysis.

  • iter: The total number of iterations per chain, including warm-up. This ensures a sufficient number of samples are collected for accurate posterior estimation.

  • thin: The thinning interval, which reduces autocorrelation by retaining only every 50th sample. This helps manage storage and computational efficiency.

  • chains: The number of Markov chains to run. Running multiple chains helps assess convergence and ensures robust estimation of the posterior distribution.

The fitted model bayesian2020 will contain the posterior samples and other relevant information needed for inference and prediction based on the specified Bayesian linear regression model.
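
Before turning to formal diagnostics, a quick way to inspect the fit is the summary() method from brms, which prints the posterior means, 95% credible intervals, and the convergence diagnostics (R-hat and effective sample sizes) discussed next:

summary(bayesian2020)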

Diagnostics

Before interpreting the parameter estimates from a Bayesian model, it is critical to perform thorough model checking. Model checking ensures that the model accurately represents the underlying data and that the assumptions made during the modeling process are valid. This involves assessing the goodness-of-fit, checking for convergence of the MCMC samples, and evaluating the posterior predictive distributions. Without proper model checking, there is a risk of drawing incorrect or misleading conclusions from the parameter estimates, as the model might be overfitting, underfitting, or failing to capture essential patterns in the data. By rigorously validating the model, we can have greater confidence in the reliability and robustness of the results, leading to more accurate and meaningful inferences.

Effective sample size (Neff) and R-hat

Effective sample size (Neff) and R-hat are critical diagnostics used in Bayesian analysis to assess the quality and reliability of MCMC simulations. After fitting a Bayesian model using brms, we can use the neff_ratio() and rhat() functions to obtain these diagnostics.

  • Neff quantifies the number of independent samples from the MCMC simulation. Even though our simulation might generate thousands of samples, these samples are often correlated. Neff adjusts the count to reflect the true amount of independent information we have. A higher Neff means more reliable estimates. Ideally, we want the Neff-ratio (Neff divided by the total number of samples) to be between 0.8 and 1. This range indicates efficient sampling and a good approximation of the posterior distribution.

  • R-hat assesses whether the MCMC chains have converged. It does this by comparing the variation within each chain to the variation across all chains. An R-hat value close to 1 indicates that the chains have mixed well and are likely sampling from the same posterior distribution. If R-hat is greater than 1.05, it suggests that the chains haven't converged, meaning our results might not be reliable.

We can request the Neff-ratio and R-hat for our fitted model as follows:

bayesian2020 |> neff_ratio() 
b_Intercept   b_temp.21       sigma   Intercept      lprior        lp__ 
  1.0059922   0.9451993   0.9383633   0.9957716   0.9897666   0.9644696 
bayesian2020 |> rhat() 
b_Intercept   b_temp.21       sigma   Intercept      lprior        lp__ 
  1.0002755   1.0010501   1.0004262   1.0003910   1.0006091   0.9998684 

The Neff-ratios for all estimated parameters (b_Intercept, b_temp.21, and sigma, the first three estimates in the output) are in the desirable range. This means the chains are well-behaved, with little autocorrelation reducing the sampling efficiency.

The R-hat values for all estimated parameters (b_Intercept, b_temp.21, and sigma) are very close to 1, which means the chains have adequately explored the posterior distribution and converged to a stable solution.

Trace plots

Trace plots are essential diagnostic tools in Bayesian statistics, used to evaluate the behavior of MCMC simulations. Each trace plot visualizes the sampled values of a parameter across iterations within each chain, providing insights into the convergence and mixing of the MCMC chains. Here’s how to interpret trace plots from a Bayesian model:

  1. Identifying Convergence: Converged chains fluctuate around the same value, or within a consistent range, across the plot. This indicates that regardless of the starting values, the chains are exploring the same posterior landscape. If the chains appear to explore different parts of the parameter space without overlapping, this suggests a lack of convergence. It could mean that the chains have not yet explored the full posterior distribution, and more iterations may be necessary.

  2. Assessing Mixing: A well-mixed chain displays a "hairy caterpillar sitting on the floor" appearance, where the samples densely cover a consistent range and resemble random noise around a value. This indicates that the chain is efficiently exploring the posterior distribution. A poorly mixed chain may exhibit visible patterns, such as trends, cyclic behavior, or long periods where the chain gets "stuck" at certain values. This suggests autocorrelation within the chain, where successive samples are too similar to each other, reducing the chain's effective sample size.

  3. Evaluating Stationarity: A stationary chain oscillates within a stable range throughout the simulation without trending upwards or downwards. Non-stationarity may be indicated by a visible trend in the trace plot, where the parameter values systematically increase or decrease over iterations. This can suggest that the chain has not yet reached its equilibrium distribution.

  4. Spotting Autocorrelation: While trace plots primarily show convergence and mixing, patterns observed can also hint at autocorrelation. For instance, a slowly meandering trace plot (showing gradual shifts from high to low values or vice versa) suggests that consecutive samples are highly correlated, indicating that the sampler is not efficiently exploring the parameter space.

Let’s take a look at the trace plots for our example:

bayesian2020 |> 
  gather_draws(b_Intercept, b_temp.21, sigma) |> 
  ggplot(mapping = aes(x = .iteration, y = .value, color = as.factor(.chain))) +
  geom_line(linewidth = 0.05) +
  labs(color = "Chain") +
  facet_wrap(vars(.variable), scales = "free_y") +
  theme(legend.position = "bottom")

Here’s an evaluation of these trace plots:

  1. Randomness and Mixing: The trace plots for all three parameters appear to exhibit randomness and good mixing. There are no obvious patterns, trends, or drifts over the iterations, indicating that the chains are well-mixed.

  2. Convergence: The fact that the trace plots for each parameter oscillate around a stable mean suggests that the model has likely converged. The chains are not showing any systematic drift which is a good sign.

  3. Scale and Variability: The variability within each chain seems consistent across iterations, which is another positive sign indicating that the sampling process is stable.

Overall, the trace plots look good. They suggest that the chains have converged and are well-mixed. This implies that the posterior estimates for the parameters are reliable.

Practical Steps Based on Neff-ratio, R-hat, and Trace Plot Interpretation

What should you do if these diagnostics don't conform to the desired descriptions provided above? A brief code sketch follows the list below.

  • Run Longer Chains: If convergence or stationarity is not achieved, consider running your MCMC simulations for more iterations.

  • Increase Burn-in: Discard more initial samples at burn-in to ensure the analysis is conducted on samples after the chains have reached stationarity.

  • Adjust Sampler Settings: Increase the thinning parameter to allow more iterations between samples. Thinning is intended to produce a more representative sample from the posterior distribution by reducing the influence of closely correlated successive samples.
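
As a sketch of how one might act on the first two suggestions (the specific values below are illustrative only, not recommendations), a brms model can be refit with a longer warmup and more total iterations via update():

# Illustrative values only; adjust based on the severity of the diagnostic issues
bayesian2020_longer <- update(
  bayesian2020,
  warmup = 10000,
  iter   = 40000
)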

Density plots

Density plots generated from the posterior distributions of parameter estimates in a Bayesian model offer rich insights into the model’s outcomes. When we generate these plots for each parameter (such as b_Intercept, b_temp.21, and sigma) and color-code by chain, we can interpret several key aspects of our Bayesian analysis:

  1. Distribution Shape and Central Tendency: The peak of the density plot represents the most probable value(s) of the parameter, akin to the mode of the distribution. It provides an intuitive sense of where the parameter is most likely to lie. A symmetric density plot suggests that the parameter values are equally likely to lie above or below the central tendency, indicating no bias in one direction.

  2. Variability and Uncertainty: The spread of the density plot reflects the uncertainty associated with the parameter estimate. A wider plot indicates greater uncertainty (or higher variance) in the parameter’s estimated value, while a narrower plot suggests more precise estimates. The length and weight of the tails of the density plot indicate the probability of extreme values. Longer tails mean that there’s a significant probability of the parameter taking on values far from the central tendency.

  3. Comparison Across Chains: The density plots for each parameter should ideally show good overlap among the different chains, suggesting that all chains are sampling from the same posterior distribution. This overlap is a visual indicator of convergence across chains. Variations in the shapes or peaks of the density plots across chains might suggest issues with convergence or that the chains are exploring different parts of the parameter space. This could warrant further investigation or more iterations to ensure convergence.

Let’s take a look at the density plots from our fitted Bayesian model:

bayesian2020 |> 
  gather_draws(b_Intercept, b_temp.21, sigma) |>  
  ggplot(mapping = aes(x = .value, color = factor(.chain))) +
  geom_density() +
  labs(color = "Chain") +
  facet_wrap(vars(.variable), scales = "free") +
  theme_minimal() +
  theme(legend.position = "bottom")

The density plots for all three parameters look good and adhere to the desired results described above. The shape of each distribution, the placement of its center, and the minimum and maximum values are in line with our expectations based on the 2023 data.

In summary, density plots of parameter estimates from a Bayesian model offer a nuanced view of the posterior distributions, encapsulating information about the likely values of parameters, their uncertainty, and the convergence of the MCMC simulation, all of which are crucial for making informed inferences from the model.

Interpret the results

Once we feel confident that the Bayesian model reached convergence and produced stable, reliable estimates, we can interpret the model.

To begin, let’s take a look at the parameter estimates — including the estimate of the intercept and slope, as well as \(R^2\).

# tidy() for brms model objects comes from the broom.mixed package
bayesian2020 |> 
  tidy(conf.int = TRUE, conf.level = 0.95) |> 
  select(-c(effect, component, group))
bayesian2020 |> 
  bayes_R2(probs = c(0.025, 0.975))
    Estimate  Est.Error      Q2.5     Q97.5
R2 0.5188799 0.02330326 0.4695579 0.5613561

Here, let’s focus on the point estimates (labeled estimate). The intercept represents the predicted number of riders when the temperature is 21°C (i.e., temp.21 equals 0), and the slope represents the expected change in the number of riders for each 1°C increase in maximum temperature. The \(R^2\) is .520 — suggesting that the model explains about 52% of the variability in the outcome.

The tables also provide the 95% credible intervals for each estimate (see conf.low and conf.high). Credible intervals provide a range within which the parameter values are likely to lie, given the data and the specified priors. In a Bayesian framework, a 95% credible interval means that there is a 95% probability that the parameter value lies within this range, given the model and the data.

The 95% credible interval for the slope ranges from 1151 to 1385. In the Bayesian framework, this means there is a 95% probability that the true effect of temperature lies within this interval, given the data and the specified priors. This interval provides a range of plausible values for the effect of each one degree increase in temperature (in the metric of Celsius) on the number of bicycle riders in 2020. The effect of temperature in this Bayesian model indicates that higher temperatures are associated with an increase in the number of bicycle riders. The point estimate, credible interval, and standard error together provide a comprehensive picture of the magnitude and certainty of this effect. This information can be used to inform decisions related to transportation planning, public health, and other areas where understanding the impact of temperature on bicycle ridership is relevant.
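
As a sketch (using the same tidybayes functions employed elsewhere in this Module), the credible interval for the slope can also be computed directly from the posterior draws:

bayesian2020 |> 
  gather_draws(b_temp.21) |> 
  summarise(
    estimate = median(.value),
    lower95  = quantile(.value, 0.025),
    upper95  = quantile(.value, 0.975)
  )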

While the numerical estimates are useful, it is common to utilize a type of graph called a Half-eye plot to visualize the interval. Half-eye plots are an effective way to visualize the posterior distributions and credible intervals of model parameters. These plots combine the features of density plots and interval plots, providing a comprehensive view of the distribution of parameter estimates. Here, the plots are produced via the stat_halfeye() function from the ggdist package.

In a Half-eye plot, the shaded area represents the kernel density estimation of the posterior distribution for each parameter. This visually conveys the shape of the distribution, highlighting the central tendency and variability for each parameter estimate. Superimposed on each density plot is the 95% credible interval. These intervals convey the range within which the parameter values likely lie with a certain probability (e.g., 95% credible intervals), given the model and the data. With these plots, we can clearly see which values are most probable, providing a more intuitive understanding of the parameter estimates in the context of our Bayesian model.

bayesian2020 |> 
  gather_draws(b_Intercept, b_temp.21, sigma) |>  
  ggplot(mapping = aes(x = .value, fill = .variable)) +
  stat_halfeye(normalize = "xy", .width = c(.95)) +
  facet_wrap(vars(.variable), scales = "free_x") +
  labs(title = "Half-eye plots for the posterior distributions of each parameter",
       y = "Density",
       x = "Parameter posterior distribution", 
       fill = "Parameter") +
  theme_minimal() +
  theme(legend.position = "bottom")

Fitted draws

It is useful to plot the observed data alongside 2000 draws of the predicted number of rides (the linear predictor) from the posterior distribution, evaluated at each temperature value in the data. The add_linpred_draws() function from tidybayes generates these linear-predictor draws from the fitted Bayesian model (bayesian2020).

This type of plot, sometimes called a spaghetti plot of posterior regression lines, provides a visual representation of the fitted model along with the observed data. It uses the posterior draws from the Bayesian analysis to show the range and variability of the fitted regression line implied by the model.

nyc_bikes_2020 |> 
  add_linpred_draws(bayesian2020, ndraws = 2000) |> 
  ggplot(mapping = aes(x = temp.21, y = total_counts)) +
  geom_point(data = nyc_bikes_2020, size = 0.5, alpha = .1) +
  geom_line(mapping = aes(y = .linpred, group = .draw), alpha = 0.2, linewidth = 0.5) +
  labs(title = "Relationship between temperature and bicyclists in NYC across 2000 \ndraws from the posterior distribution",
       subtitle = "Bayesian model fitted to 2020 data",
       x = "Max daily temperature in Celsius (centered at 21°C)",
       y = "Total number of bicycle riders recorded") +
  guides(color = "none") +
  theme_minimal()

The multiple lines in the plot represent 2000 samples from the posterior distribution of the linear predictor for the Bayesian model. These lines show the range of possible regression relationships between temperature and bicycle ridership, considering the uncertainty in the model parameters:

  • Density and Credible Intervals: The density of lines at any given temperature value conveys where the regression line most plausibly falls. Densely packed lines represent regions of higher posterior probability, indicating where the predicted mean values are most likely to fall; sparser areas represent regions where the predicted values are less certain. (The sketch following this list summarizes the same draws as explicit credible bands.)

  • Uncertainty and Confidence: Areas with a higher density of lines indicate greater confidence in the model’s predictions, as these regions have higher probability for the predicted values. Conversely, regions with sparser lines indicate higher uncertainty and lower confidence in the predictions.

  • Visualizing Model Predictions: This visualization helps to understand the model’s predictions and their associated uncertainty, making it easier to assess the reliability of the model’s predictions across different temperature values.

Overall, this plot provides a comprehensive visualization of the posterior distribution of the regression line, illustrating both the central tendency and the uncertainty in the model’s predictions.
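
Rather than reading the credible region off the density of individual lines, the same 2000 draws can be summarized into explicit credible bands. The sketch below uses stat_lineribbon() from the ggdist package (already used above for the Half-eye plots) to show the posterior median line together with 50%, 80%, and 95% bands; everything else mirrors the plot above.

nyc_bikes_2020 |> 
  add_linpred_draws(bayesian2020, ndraws = 2000) |> 
  ggplot(mapping = aes(x = temp.21, y = total_counts)) +
  geom_point(data = nyc_bikes_2020, size = 0.5, alpha = .1) +
  # Summarize the draws into a posterior median line and 50/80/95% credible bands
  stat_lineribbon(mapping = aes(y = .linpred), .width = c(.50, .80, .95), alpha = 0.75) +
  scale_fill_brewer(palette = "Blues") +
  labs(title = "Posterior median and credible bands for the regression line",
       x = "Max daily temperature in Celsius (centered at 21°C)",
       y = "Total number of bicycle riders recorded",
       fill = "Credible band") +
  theme_minimal()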

Intervals for model predictions

Next, let’s compute credible intervals for the mean of y (counted riders) at 21°C (which corresponds to the intercept given our centering) as well as prediction intervals for a new observation at 21°C. Note that temp.21 == 0 is equivalent to a 21°C day.
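
As a reminder, temp.21 is simply the daily maximum temperature shifted so that 21°C becomes zero. If it had not already been created earlier in this Module, a hypothetical construction (assuming the 2020 data use the same variable names as the 2023 data described at the start of the Module) would look like:

# Hypothetical: center the maximum daily temperature at 21°C
nyc_bikes_2020 <- nyc_bikes_2020 |> 
  mutate(temp.21 = daily_temperature_2m_max - 21)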

Calculating a Credible Interval for a 21°C day

First, we calculate the credible interval for the mean of y at 21°C. The credible interval reflects the range within which the true mean number of riders is likely to fall with 95% probability.

# For the mean of y
ci_data <- bayesian2020 |>
  linpred_draws(newdata = tibble(temp.21 = 0)) |>
  summarise(lower = quantile(.linpred, probs = 0.025),
            upper = quantile(.linpred, probs = 0.975),
            .groups = "drop")

ci_data |> knitr::kable()
temp.21 .row lower upper
0 1 42220.52 44577.18

This interval indicates that there is a 95% probability that the true mean number of riders at 21°C lies between the values presented under lower and upper respectively in the table above. In Bayesian terms, given our prior beliefs and the observed data, the posterior distribution suggests that the mean number of riders is most likely within this range.

Calculating a Prediction Interval for a 21°C day

We can also calculate a prediction interval, which provides a range within which a new observation is likely to fall with a certain probability (95% in our example). In the context of our model, this interval reflects the uncertainty in predicting the number of riders on a particular day where the maximum temperature is 21°C. It will be wider than the credible interval because it accounts for both the uncertainty in the mean prediction (the variability in the intercept and slope estimates) and the variability of individual observations around that mean (the residual error). In other words, the interval represents the range within which we expect the actual number of riders to fall 95% of the time on a new day with a maximum temperature of 21°C.

# For a new observation of y
pi_data <- bayesian2020 |>
  predicted_draws(newdata = tibble(temp.21 = 0)) |>
  summarise(lower = quantile(.prediction, probs = 0.025),
            upper = quantile(.prediction, probs = 0.975),
            .groups = "drop")

pi_data |> knitr::kable()
temp.21 .row lower upper
0 1 21575.31 64458.37

This interval indicates that there is a 95% probability that the total number of bicycle riders on a future day with a maximum temperature of 21°C will fall between the values presented under lower and upper in the table above. In Bayesian terms, given our prior beliefs and the observed data, the posterior distribution suggests that individual observations (new data points) are most likely to lie within this range. The prediction interval is wider than the credible interval because it accounts for both the uncertainty in the mean estimate and the variability in individual observations.
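
To see where the extra width comes from, the prediction interval can be rebuilt by hand from the posterior draws: a predicted observation at 21°C is a draw of the mean (the intercept, since temp.21 = 0) plus Gaussian residual noise with standard deviation sigma. The sketch below is a rough check under that Gaussian-likelihood assumption, using spread_draws() from tidybayes and the fitted model bayesian2020.

# Compare the interval width for the mean at 21°C with the width for a new observation
bayesian2020 |> 
  spread_draws(b_Intercept, sigma) |>                                  # one row per posterior draw
  mutate(mean_at_21 = b_Intercept,                                     # posterior draw of the mean when temp.21 = 0
         new_at_21  = rnorm(n(), mean = b_Intercept, sd = sigma)) |>   # add residual variability
  summarise(ci_width = diff(quantile(mean_at_21, probs = c(0.025, 0.975))),
            pi_width = diff(quantile(new_at_21,  probs = c(0.025, 0.975))))

The pi_width value should come out noticeably larger than ci_width, mirroring the difference between the two tables above.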

Visualize the intervals

The code below presents these two intervals as Half-eye plots:

p1 <- bayesian2020 |> 
  linpred_draws(newdata = tibble(temp.21 = 0)) |> 
  ggplot(mapping = aes(x = .linpred)) +
  stat_halfeye(.width = c(.95)) +
  theme_minimal() +
  labs(title = "Predicted riders at 21˚C",
       subtitle = "Confidence interval for mean of y",
       x = "Predicted riders", y = NULL)

p2 <- bayesian2020 |> 
  predicted_draws(newdata = tibble(temp.21 = 0)) |> 
  ggplot(mapping = aes(x = .prediction)) +
  stat_halfeye(.width = c(.95)) +
  theme_minimal() +
  labs(title = "Predicted riders at 21˚C",
       subtitle = "Prediction interval for a new y",
       x = "Predicted riders", y = NULL)

library(patchwork)
p <- p1 | p2
p + plot_annotation(title = "Comparison of the credible interval and prediction interval for a 21˚C day")

Note that the shapes of the two distributions look similar, but the ranges of their x-axes are very different.
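
If we want the difference in spread to be visually obvious, one option is to put both panels on a shared x-axis. A small sketch using the & operator from patchwork, which applies an element to every panel, reusing the pi_data object from above so the shared range covers the wider prediction interval:

# Give both panels the same x-axis range so the interval widths are directly comparable
p_shared <- (p1 | p2) & coord_cartesian(xlim = c(pi_data$lower, pi_data$upper))
p_shared

On this common scale, the credible interval should appear as a narrow sliver while the prediction interval spans most of the axis.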

As a final graphic, let’s map the credible and prediction interval for a 21˚C day onto a graph to put the quantities in context.

ci_lower <- ci_data$lower
ci_upper <- ci_data$upper

pi_lower <- pi_data$lower
pi_upper <- pi_data$upper

# Scatter plot with CI represented as a point and vertical whisker
p1 <- ggplot(nyc_bikes_2020, aes(x = temp.21, y = total_counts)) +
  geom_point(alpha = 0.1) +
  geom_errorbar(mapping = aes(x = 0, ymin = ci_lower, ymax = ci_upper), color = "#2F9599", width = 1) +
  labs(title = "CI for Predicted Riders at 21°C", x = "Temperature", y = "Rides") +
  theme_minimal()

# Scatter plot with PI represented as a point and vertical whisker
p2 <- ggplot(nyc_bikes_2020, aes(x = temp.21, y = total_counts)) +
  geom_point(alpha = 0.1) +
  geom_errorbar(mapping = aes(x = 0, ymin = pi_lower, ymax = pi_upper), color = "#A7226E", width = 1) +
  labs(title = "PI for Predicted Riders at 21°C", x = "Temperature", y = "Rides") +
  theme_minimal()

# Combine plots side by side
p <- p1 | p2
p + plot_annotation(title = "The credible interval and prediction interval for a 21˚C day in context")

These plots provide a visual representation of the credible and prediction intervals, helping to contextualize the range of expected values for the number of riders at 21°C. The credible interval provides insight into the mean number of riders, while the prediction interval offers a range for where new observations are likely to fall.

Wrap up

Understanding sample-to-sample variability is essential for grasping the nuances of statistical analysis, particularly in the context of linear regression. When we analyze sample data, we must account for various factors that contribute to variability — in the example we considered in this Module, these factors include climate changes, infrastructure developments, population dynamics, data collection methods, and measurement precision (to name a few). These factors highlight the inherent uncertainty in our estimates and underscore the importance of robust statistical methods that appropriately model uncertainty.

In the realm of linear regression, quantifying uncertainty in the coefficients — such as the intercept and slope — is pivotal. Standard errors provide a measure of the variability of these estimates, and confidence intervals offer a range within which we expect the true parameter values to lie.

Bootstrapping serves as a powerful non-parametric method for assessing the variability of regression coefficients: by resampling the data and refitting the model, we can estimate the sampling distribution of the coefficients directly. This approach helps to visualize and quantify uncertainty, providing a more comprehensive understanding of our estimates. When the assumptions underlying the Central Limit Theorem are met, theory-based (i.e., parametric) methods can simplify the process of obtaining estimates of uncertainty.
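
As a concrete reminder of what that resampling looks like, here is a minimal percentile-bootstrap sketch for the slope using the 2020 data; it relies only on dplyr and purrr rather than any particular bootstrap package, and the variable names match those used throughout this Module.

# A minimal percentile-bootstrap sketch for the slope (assumes dplyr and purrr are loaded)
set.seed(123)
boot_slopes <- map_dbl(1:2000, \(i) {
  boot_sample <- slice_sample(nyc_bikes_2020, prop = 1, replace = TRUE)  # resample days with replacement
  coef(lm(total_counts ~ temp.21, data = boot_sample))[["temp.21"]]      # refit and keep the slope
})
quantile(boot_slopes, probs = c(0.025, 0.975))  # 95% percentile bootstrap interval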

The frequentist and Bayesian frameworks offer different perspectives on uncertainty. Frequentist methods focus on long-run properties of estimators, while Bayesian methods incorporate prior knowledge and update beliefs based on observed data. Bayesian approaches provide a probabilistic interpretation of parameter estimates through credible intervals, which can be particularly insightful for decision-making.

In practice, visualizing uncertainty using confidence bands or credible intervals can greatly enhance the interpretation of regression models. These visual tools help communicate the precision and reliability of our estimates, making statistical analysis more transparent and informative.

In summary, acknowledging and quantifying uncertainty in linear regression is fundamental to sound statistical practice. By leveraging various methods and perspectives, we can make more informed and robust conclusions, ultimately leading to better decision-making based on our data analyses.

Credits

Resources

For an excellent repository of materials to help you learn more about Bayesian modeling, check out this website.

Footnotes

  1. If this introduction to Bayesian modeling piques your interest, and you want to learn more, there are many excellent books and resources — some of which I’ll put in the Resources section at the end of this Module. For the challenging topic of selecting priors — this Stan blog post is helpful.