Null Hypothesis Significance Testing

Module 16

Artwork by @allison_horst

Learning objectives

  • Describe why hypothesis tests are conducted
  • Contrast the null hypothesis and the alternative hypothesis
  • Explain the gist of null hypothesis testing
  • Describe Type I and Type II errors
  • Contrast the concepts of alpha and p-value
  • Conduct a hypothesis test using R
  • Interpret the results of a hypothesis test

Overview

In this Module, we’ll build on the knowledge we’ve developed in Modules 8, 9 and 12 to explore a crucial aspect of inferential statistics: hypothesis testing. Hypothesis testing offers a structured approach to evaluating theories about human behavior, societal trends, and other phenomena. Through hypothesis testing, we can systematically assess assumptions, compare theories, and determine the impact of interventions or treatments.

We’ll cover the core components of hypothesis testing, from understanding its fundamental principles to performing statistical tests and interpreting the results. By the end of this Module, you’ll have the skills to apply hypothesis testing correctly in practical scenarios, enabling you to rigorously evaluate and draw meaningful conclusions from data.

Introduction to the data

We will work with data from a study by Hofman, Goldstein & Hullman.


The study addresses the challenge scientists face in conveying the uncertainty of their research findings to their audiences. This uncertainty can be broadly classified into two types: inferential uncertainty and outcome uncertainty.

  • Inferential uncertainty refers to our confidence in the accuracy of a summary statistic, such as a mean. This type of uncertainty is captured using a confidence interval (CI). As we’ve studied throughout this course, a CI provides a range of values within which we expect the true population parameter to lie, based on our sample data. For example, a 95% CI around a sample mean indicates that, if we were to repeat our sampling process many times, approximately 95% of the calculated intervals would contain the true population mean. This interval helps us understand the precision of our estimate and is crucial for comparing means between groups.

  • Outcome uncertainty, on the other hand, describes the variability of individual outcomes around a summary statistic, such as the mean, regardless of how accurately the mean itself has been estimated. This type of uncertainty is illustrated using a prediction interval (PI). A PI provides a range of values within which we expect a future individual observation to fall, given the current data. For example, a 95% PI indicates that we expect 95% of individual outcomes will fall within this range. This interval accounts for the inherent variability in individual data points and is useful for understanding the spread and potential extreme values of individual outcomes.

For example, in Module 12, we calculated a 95% CI and a 95% PI for the number of bicycle riders given the temperature in New York City — for instance, the predicted number of riders on a 21°C day. The 95% CI for the predicted average number of bicycle riders on days when the temperature was 21°C was approximately 50,100 to 52,359. Meanwhile, the 95% PI for a new observation of the number of bicycle riders on a 21°C day was approximately 31,022 to 71,437. This indicates that we expect 95% of individual observations of bicycle riders on a 21°C day to fall within this range. The key difference between the CI and PI calculations lies in the standard error component. For CIs, the standard error reflects the uncertainty in estimating the mean response for a given temperature. For PIs, the standard error also incorporates the variability (mean squared error, MSE) of individual observations around the regression line, resulting in wider intervals. These intervals provide critical insights into the precision of model predictions (CI) and the expected range of variability in new data points (PI).
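In code, the only thing that changes between the two intervals is the interval argument passed to predict(). The sketch below is a refresher only; bike_model, temperature, and the new data frame are placeholders standing in for the Module 12 regression, not the actual code from that Module.

# Hypothetical refresher: `bike_model` stands in for the Module 12 regression of
# riders on temperature; only the `interval` argument differs between CI and PI.
new_day <- data.frame(temperature = 21)

predict(bike_model, newdata = new_day, interval = "confidence", level = 0.95)  # 95% CI for the mean response
predict(bike_model, newdata = new_day, interval = "prediction", level = 0.95)  # 95% PI for a new observation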

Recall this graph from Module 12 that depicted a 95% CI and a 95% PI for the number of riders on a 21°C day. We can clearly see that there is much more error associated with the PI.

To summarize, CIs and PIs serve different but complementary purposes in statistical analysis. CIs help us gauge the precision of our estimated parameters, reflecting inferential uncertainty, while PIs help us understand the range of possible individual outcomes, reflecting outcome uncertainty. Both types of intervals are essential for accurately interpreting and communicating the uncertainty in scientific research.

When presenting data, scientists must decide whether to emphasize inferential or outcome uncertainty, a choice that significantly impacts the appearance of visualizations. Visualizations of inferential uncertainty, such as CIs, illustrate the precision of estimated means. For instance, a common graph might display group means with error bars that extend 1.96 standard errors above and below the mean, creating a 95% CI. This type of visualization helps compare the sampling distributions of means between groups. Conversely, visualizations of outcome uncertainty, such as PIs, emphasize the spread of individual outcomes.

There is a fascinating area of research dedicated to visualizing uncertainty in a way that viewers can most accurately understand. Most prior research in uncertainty visualization has focused on determining the best techniques for enhancing comprehension, assuming either inferential or outcome uncertainty is being visualized. In contrast, Hofman and colleagues investigated how the type of uncertainty visualized affects people’s beliefs about the effectiveness of a treatment. To explore this, the authors conducted two large pre-registered randomized experiments where participants viewed different visualizations of the same data — some focusing on inferential uncertainty (i.e., CIs) and others on outcome uncertainty (i.e., PIs) — and were then asked about their beliefs regarding the size of treatment effects.

To convey these concepts, the authors utilized diagrams with error bars: one set depicted 95% CIs, emphasizing the precision of the mean estimate of a treatment effect, while the other set showed 95% PIs, illustrating the expected range of individual outcomes from a treatment. Though both formats can theoretically offer the same insight into the underlying data, assuming the sample size is known, they accentuate different aspects of data interpretation. The CI-focused presentation zeroes in on the uncertainty of the mean estimate, whereas the PI-focused approach sheds light on the expected variability in individual outcomes.

The study also probed whether the differences in perception induced by these visualization formats could be mitigated by augmenting the diagrams with explanatory text. Specifically, the researchers posited that including additional details about 95% PIs in the caption of a figure primarily showing 95% CIs might alter readers’ inferences regarding the distribution of outcomes under treatment, while still communicating the inferential uncertainty. This aspect of the experiment sought to determine if supplementary textual information could help reconcile the different interpretations elicited by the two graphical formats.

The study we’ll consider in this Module recruited 2,400 participants via Amazon’s Mechanical Turk. After excluding 49 participants for taking part in both the pilot and the experiment for the authors’ study, the final sample consisted of 2,351 U.S.-based participants with approval ratings of 97% or higher on Mechanical Turk. Payment for each participant was set at $0.75.

The participants were randomly assigned to one of four experimental conditions related to confidence intervals (CI) and prediction intervals (PI), both with and without extra text to explain both types of intervals:

  • Condition 1: Confidence interval with descriptive text that matched the visualization
  • Condition 2: Confidence interval with extra supplemental text
  • Condition 3: Prediction interval with descriptive text that matched the visualization
  • Condition 4: Prediction interval with extra supplemental text

Study protocol

First, all participants were presented with the following two screens to introduce the study:


Then, depending on the assigned condition, participants viewed their designated condition visualization:


The outcome variable that we will consider in this module is willingness to pay (WTP) — that is, the amount of money each participant was willing to pay to rent the special boulder after viewing their designated graph and descriptive text.

An attention check was built into the study to ensure that participants were paying attention. Some 608 participants failed the attention check and were removed from the analysis — leaving a total of 1,743 participants for our analysis.

Import the data
Now that we have an understanding of the study, let’s take a look at the data from the experiment. The data frame is called hulman_exp1.Rds and includes the following variables:

  • worker_id: Participant ID
  • condition.f: Experimental condition
  • interval_CI: Binary indicator comparing the CI visualization (coded 1) to the PI visualization (coded 0)
  • text_extra: Binary indicator comparing visualizations with extra text (coded 1) to visualization text only (coded 0)
  • wtp_final: Amount the participant was willing to pay for the special boulder

Let’s load the packages needed for this Module.

library(broom)
library(marginaleffects)
library(skimr)
library(here)
library(tidyverse)

Now, we’re ready to import the data. I’ll display the first few rows of the data frame so you can get a feel for it.

df <- read_rds(here("data", "hulman_exp1.Rds"))
df |> select(-worker_id) |>  head(n = 25) 

Introduction to hypothesis testing

In this module we will use the principles of the sampling distribution that we first learned about in Module 7 to conduct hypothesis tests. A hypothesis is a scientist’s assertion about the value of an unknown population parameter. Recall that the purpose of a scientific study isn’t to learn just about the sample, but rather to try and learn something about the population of interest — that is, the population from which the sample was drawn.

A hypothesis test consists of a test between two competing hypotheses about what might be happening in the population. The first is referred to as the null hypothesis and the second is referred to as the alternative hypothesis.

In the first part of the Module we will focus on the comparison of participants who only saw text related to the observed visualization (i.e., Conditions 1 and 3 as described above). In comparing Condition 1 to Condition 3, we will examine the effect of seeing a confidence interval versus a prediction interval. Here, the experimental condition is the type of uncertainty interval (i.e., prediction interval vs. confidence interval).

Stating the null and alternative hypotheses

The null hypothesis is often an assertion that there is no difference in an outcome variable (i.e., Y) between groups, or there is no effect of X (a predictor) on Y (an outcome). That is, with the null hypothesis we assume that any effect observed in our sample is simply due to random chance. In our current focus, the null hypothesis is that there is no difference in WTP based on type of interval observed. In other words the null hypothesis asserts that the mean WTP across the two conditions is equal.

The alternative hypothesis is typically the assertion that there is a difference between groups or there is an effect of one variable on another. Building on our null hypothesis example, the alternative hypothesis for our current study is that there is a difference in WTP based on type of interval observed. That is, the alternative hypothesis asserts that the mean WTP across the two conditions is not equal.

To begin, let’s calculate the average WTP in each condition.

df |> 
  filter(condition.f %in% c("1: CI with viz stats only", "3: PI with viz stats only")) |> 
  group_by(condition.f) |> 
  select(condition.f, wtp_final) |> 
  skim()
Data summary
Name select(…)
Number of rows 905
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables condition.f

Variable type: numeric

skim_variable condition.f n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
wtp_final 1: CI with viz stats only 0 1 79.01 53.09 0 50 75.0 100 249 ▅▇▇▂▁
wtp_final 3: PI with viz stats only 0 1 49.64 49.38 0 15 37.5 59 249 ▇▅▂▁▁


Here, we see that the mean WTP for people who saw the CI is 79.0 ice dollars, while the mean WTP for people who saw the PI is 49.6 ice dollars. We can create a graph to display this information; let’s take a look at a box plot with the average WTP for each group overlaid.
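A minimal ggplot2 sketch of such a figure is shown below; the styling is illustrative and may differ from the rendered plot in the Module.

# Illustrative sketch of the box plot described above; the published figure may
# be styled differently. The orange points mark the mean WTP in each condition.
df |>
  filter(condition.f %in% c("1: CI with viz stats only", "3: PI with viz stats only")) |>
  ggplot(aes(x = condition.f, y = wtp_final)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", color = "orange", size = 3) +
  labs(x = "Condition", y = "Willingness to pay (ice dollars)")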


From the two means, we can also calculate the difference in the average WTP across the two conditions. Whether or not there is a difference in WTP between the two conditions considered here is the crux of our hypothesis test.

observed_diff <- 
  df |> 
  filter(condition.f %in% c("1: CI with viz stats only", "3: PI with viz stats only")) |> 
  group_by(condition.f) |> 
  select(condition.f, wtp_final) |> 
  summarise(mean_value = mean(wtp_final)) |>
  pivot_wider(names_from = condition.f, values_from = mean_value) |>
  mutate(diff = `1: CI with viz stats only` - `3: PI with viz stats only`) |> 
  pull(diff)

observed_diff
[1] 29.36551

The average WTP was about 29.4 ice dollars higher when individuals saw the visualization with the CI compared to the visualization with the PI. This difference is illustrated by the gap between the two orange dots on the prior graph along the y-axis. Participants who viewed the smaller range of the CI, compared to the PI, likely perceived the expected improvement with the special boulder as more certain, making them believe it was more worthwhile to rent it.

Given this sizable difference, you might wonder why a statistical test is necessary to determine if this mean difference is significantly different from zero. The hypothesis test is crucial for several reasons. One key concept we’ve studied is “sample-to-sample variability.” This means that results from one random sample might not predict the outcomes of another random sample or the broader population’s trends. In inferential statistics, our goal is to determine if the observed effects in our sample are representative of the entire population or if they could have arisen by random chance.

In our study, we are asking whether the observed difference in WTP (an average of ~29.4 ice dollars more with the CI visualization) would likely be seen if the entire population were subjected to the experiment. Before we conclude that this is a real effect likely to be observed in the population, we need to verify it statistically.

Please watch the following video from Crash Course Statistics on p-values, which builds on these ideas.

In comparing the two conditions, our goal is to determine if the population mean (\(\mu\)) WTP when shown the CI visualization is significantly different from the population mean (\(\mu\)) WTP when shown the PI visualization. Specifically, our parameter of interest is the difference between the two population means — the difference in WTP for people who view the CI visualization compared to those who view the PI visualization. By focusing on \(\mu\), we are considering the comprehensive, population-level effect, rather than just the observed difference in our sample (\(\overline{y}\)).

We can formally write the null and alternative hypothesis as follows:

  • Null hypothesis: \(H_0:\mu_{\text{CI}} = \mu_{\text{PI}}\) (the two means are equal).

  • Alternative hypothesis: \(H_a:\mu_{\text{CI}} \neq \mu_{\text{PI}}\) (the two means are not equal).

The sampling distribution under the null and alternative hypotheses

Let’s delve deeper into the theoretical sampling distribution under the given null hypothesis to enhance our understanding of null hypothesis significance testing (NHST). A sampling distribution for a parameter of interest in the population (e.g., the difference in WTP between the two conditions) provides a foundational concept in inferential statistics. Here’s a refresher if Modules 7 and 8 seem like eons ago:

Sampling Distribution for a Parameter:

  1. Start with a Population Parameter: In statistics, we’re often interested in estimating some characteristic of a population. This characteristic, such as a mean, proportion, difference in means, or a regression coefficient, is referred to as a population parameter.

  2. Take Many, Many Random Samples: Imagine we could repeatedly take samples of the same size from this population. For each sample, we would conduct the experiment and compute the statistic of interest (e.g., the difference in WTP between the two conditions).

  3. Distribution of Sampled Statistics: If we were to plot all these sample statistics, we’d get a distribution known as the sampling distribution of that statistic. The true population parameter is located at the center of the sampling distribution. This means that if we computed the sample statistic in each random sample drawn, the average of all those sample statistics would converge to the true population parameter.

  4. Key Idea: The sampling distribution tells us how much the statistic (like the difference in WTP between the two conditions) would vary from one sample to another.

  5. Central Limit Theorem: One of the most fundamental concepts in statistics is the Central Limit Theorem (CLT). The CLT asserts that, given a sufficiently large sample size, the sampling distribution of a sample statistic will approximate a normal distribution, regardless of the shape of the population distribution. This property greatly facilitates statistical inferences, as it allows us to capitalize on the well-known properties of the normal distribution. The tendency toward normality, as described by the CLT, enables the use of parametric methods of estimation. These are statistical techniques grounded in specific distributional assumptions, primarily normality. Because of this, we can confidently apply powerful statistical tests like the standard error estimation techniques we’ve studied throughout this course and specific hypothesis tests, such as t-tests, which we will delve into in this Module.

  6. Standard Error: The standard error quantifies the variability or spread of the sampling distribution of a sample statistic. Specifically, the standard error is the standard deviation of the sampling distribution. The standard error provides a measure of how much the sample statistic (e.g., the difference in mean WTP between the two conditions) is expected to vary across repeated sampling from the population.

Why is the Sampling Distribution Important?
Understanding the sampling distribution is crucial for inferential statistics because it helps quantify the uncertainty associated with our sample estimates. When we use a single sample to make inferences about a population parameter, we recognize there’s some inherent error because another sample will give a somewhat different result. The sampling distribution provides a framework to understand and quantify this variability, enabling us to make more informed decisions and interpretations about our data. In essence, the sampling distribution gives us a lens into the “universe of possible samples.” It shows us how our sample estimate might fare if we were to repeat our study many times, offering insight into the reliability and variability of our estimates.

Application of the Sampling Distribution to Test our Null Hypothesis
Let’s imagine that we know that the difference in WTP across the two conditions in the population is 30 ice dollars (with people who view a CI having a mean WTP that is 30 ice dollars higher than people who view a PI), and that the standard deviation of the sampling distribution is 3. Using this information, I’ve simulated two sampling distributions: one under the supposed “known parameters” (a mean difference between conditions of 30), and a second under the asserted null hypothesis (a mean difference between conditions of 0).
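The simulation can be sketched as follows; this is an illustrative reconstruction under the stated assumptions (a true difference of 30 and a standard error of 3), not necessarily the exact code used to build the figure.

# Simulate many hypothetical sample mean differences under two "worlds": the
# supposed known parameter (mean difference of 30) and the null hypothesis
# (mean difference of 0), both with a standard error of 3.
set.seed(123)

sims <- tibble(
  known_param = rnorm(10000, mean = 30, sd = 3),
  null_hyp    = rnorm(10000, mean = 0,  sd = 3)
)

sims |>
  pivot_longer(everything(), names_to = "world", values_to = "mean_diff") |>
  ggplot(aes(x = mean_diff)) +
  geom_density() +
  facet_wrap(~ world)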

The figure below presents these two sampling distributions. The plot on the left illustrates the sampling distribution for our “known population parameter”, which is a difference of 30 ice dollars between conditions (observe the point where the dashed line intersects the x-axis). In contrast, the plot on the right visualizes the sampling distribution under the null hypothesis, which posits that there will be no difference (i.e., a difference of 0) in ice dollars spent to rent the special boulder between the two conditions (marked by the intersection of the dashed line on the x-axis). With the foundations of the Central Limit Theorem and the Empirical Rule in mind, we can assess the plausibility of this null hypothesis given what we observed in our sample.

Thus, with a null hypothesis significance test (NHST), we are asking a fundamental question: Given the null hypothesis is true — which posits no difference in WTP between the two conditions — how likely is it that we would observe the sample data we have (or something more extreme)? This pivotal question invites us to delve into the realm of conditional probability, emphasizing how NHST utilizes this concept.

Recall from Module 5 that conditional probability is concerned with assessing the probability of an event, given that another event has already occurred. Formally, we express this as:

\(P(\text{Observing our Sample Data} | H_0 \text{ is true})\).

In the context of NHST for our Hofman and colleagues sample data, this refers to the probability that we’d observe a mean difference in WTP between the two conditions of 29.4 (the actual difference between conditions in the Hofman study sample) or larger given the null hypothesis is true.

A NHST essentially tests the conditional probability of observing our specific sample data (or data more extreme) under the assumption that the null hypothesis is true. It’s crucial to understand that NHST does not provide the probability that the null hypothesis is true given the observed data. Instead, it evaluates how probable our observed data (or more extreme outcomes) would be if the null hypothesis were correct. This distinction is critical in the interpretation and understanding of NHST results.

To visualize this, take a look at the graph below — which extends the sampling distribution under the null hypothesis that we just studied, by adding our observed difference between conditions. Recall that earlier in the Module, we found the difference in mean WTP between the two conditions to be 29.4 ice dollars. Notice that the pink line represents this mean difference observed in our sample with respect to the x-axis.

If the null hypothesis is true, leveraging the Central Limit Theorem and the Empirical Rule, we can make predictive assertions about the behavior of sample means. Specifically, the CLT assures us that the sampling distribution of the mean difference in WTP between individuals exposed to the CI visualization and those exposed to the PI visualization will approximate a normal distribution, given a sufficiently large sample size. This normal distribution has a mean of 0 under the null hypothesis, which posits no difference between the population means.

Visualizing this through the lens of the probability density function (PDF) of a normal distribution helps contextualize our expectations. The PDF graphically represents how the probabilities of possible outcomes are distributed along the range of a variable. The peak of the curve centers around the mean (0 under the null hypothesis), with probabilities tapering off symmetrically as we move away from the center. The PDF of a normal distribution visually illustrates the likelihood of various outcomes around this mean, showing that outcomes closer to the mean are more probable, while those further away (towards the tails) become increasingly less likely.

The Empirical Rule dovetails with this by elucidating that in any normal distribution approximately 95% of observations fall within 1.96 standard deviations from the mean. This standard deviation of the sampling distribution, termed the “standard error,” thus becomes a critical yardstick for assessing the variability of our sample mean differences (between the two conditions) from the hypothesized population mean difference.
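The 1.96 cutoff itself comes directly from the standard normal distribution, which we can confirm with qnorm():

# The middle 95% of a standard normal distribution lies between these two values.
qnorm(p = c(.025, .975))
[1] -1.959964  1.959964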

In hypothesis testing, when we assume the null hypothesis is true — specifically, that there is no difference in the population means of WTP between two conditions — we set the stage for evaluating the likelihood of our observed sample results within this framework. We ask: If the null hypothesis is true, how likely is it that we’d observe our sample statistic (or something more extreme)?

In this context, the pnorm() function becomes an indispensable tool. We can use it to calculate the cumulative probability of observing a sample mean difference as extreme as, or more extreme than, the one we’ve noted, assuming the null hypothesis holds. By evaluating how many standard errors our observed mean difference deviates from 0, pnorm() helps quantify the probability of encountering such an extreme outcome.

For instance, if our observed sample mean difference is two standard errors from the hypothesized mean of 0, pnorm() can compute the cumulative probability of observing such a difference, or one more extreme. This calculation reveals how much of the probability distribution lies beyond our observed mean difference, providing a clear picture of how unusual our findings are under the null hypothesis. An observed difference that falls outside the region containing the vast majority of the distribution (e.g., beyond the central 95%) is rare under the null hypothesis, casting doubt on its applicability.
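To make that concrete, here is the upper-tail probability for a result two standard errors above 0:

# About 2.3% of the sampling distribution lies 2 or more standard errors above 0.
pnorm(q = 2, lower.tail = FALSE)
[1] 0.02275013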

Let’s refine our graph of the sampling distribution under the null hypothesis by standardizing the mean differences. Instead of showing the actual mean difference in WTP between the two conditions for each sample, we’ll convert these raw difference scores to z-scores.

Recall from Module 2 that a z-score is calculated by subtracting the mean from an observation and then dividing by the standard deviation. In our context of comparing sample means across many, many random samples, we apply a similar concept with a specific twist. For each mean difference, we calculate a z-score by subtracting the hypothesized mean difference under the null hypothesis (which is 0, assuming there’s no real difference between the groups) and dividing by the standard error of the mean difference. This standardization process converts the mean differences into units of standard errors away from the hypothesized mean difference.

By doing this, we create a standardized sampling distribution that allows us to easily compare the observed mean difference to what we would expect under the null hypothesis. This adjustment mirrors the process of calculating z-scores but is specifically tailored to the analysis of differences between sample means rather than individual observations.

In the Hofman et al. data, the observed sample mean difference in WTP is 29.4 ice dollars (which we estimated earlier in the Module). For illustrative purposes, in our current exploration, we’re operating under the assumption that the standard deviation of the sampling distribution of the mean difference — known as the standard error — is 3. Consequently, our observed sample mean difference between the two groups stands at \(\frac{29.4 - 0}{3} = 9.8\) standard errors away from the null hypothesis value (which is denoted as the pink line in the graph above).

This finding places our observed mean difference squarely in the realm of statistical improbability if the null hypothesis were indeed true. To put it in perspective, the standard normal distribution — which underpins the calculation of probabilities for z-scores — suggests that values falling more than a few standard deviations away from the mean are exceedingly rare. Specifically, the probability of observing a value more than 3 standard deviations (or standard errors when referring to the sampling distribution) from the mean is only about 0.13% in each tail, or roughly 0.3% across both tails of the distribution. Our observed mean difference, which is 9.8 standard errors away from the hypothesized mean difference of 0, is far beyond this range, indicating an extremely low probability under the null hypothesis.

What does this imply for our hypothesis test? If the null hypothesis were true, and there truly were no difference in ice dollars spent between the two conditions, then drawing a random sample that results in a mean difference as extreme as 9.8 standard errors from this proposed null value would be extraordinarily unlikely. The rarity of such an observation under the null hypothesis framework points to a significant discrepancy between what the null hypothesis predicts and what we have observed in our sample.

In terms of conditional probability, the chance of stumbling upon our observed sample mean difference — or one even more extreme — given that the null hypothesis is accurate, can be quantified using the cumulative distribution function (CDF) of the normal distribution. However, without yet delving into specific probability calculations, the qualitative assessment alone suggests a striking deviation from the null hypothesis expectations.

This deviation prompts a critical reevaluation of the null hypothesis. In the context of hypothesis testing, such an extreme observation suggests strong evidence against the null hypothesis, nudging us towards considering the alternative hypothesis that there is indeed a meaningful difference in ice dollars spent between the two conditions.

By contextualizing our observed sample mean difference within the framework of standard errors and the standard normal distribution, we harness the power of statistical reasoning to assess the plausibility of a hypothesis. This approach not only solidifies our understanding of hypothesis testing but also exemplifies the rigorous analytical methods statisticians employ to draw inferences from sample data about the broader population.

Therefore, the gist of hypothesis testing is to determine the probability that we would draw a random sample that produces a certain mean difference between conditions IF the null hypothesis is true. Using the tools and concepts from Modules 7 and 8, we can compute the probability of obtaining a sample mean difference as extreme as the one we observed (or more extreme), assuming the null hypothesis is true. We’ll delve deeper into the mechanics of this later in the Module. For now, given our null hypothesis that the mean difference between conditions is 0 and our sample estimate is 9.8 standard errors away from this null value, we can calculate:

pnorm(q = 9.8, lower.tail = FALSE)
[1] 5.629282e-23

This calculation provides the probability of observing a mean difference between conditions that is at least 9.8 standard errors above the null hypothesis value. The result is essentially zero (i.e., .000000000000000000000056), meaning it’s highly improbable to get such an extreme value if the true population mean difference in WTP between conditions were 0.

When considering the extremity of our observed sample mean difference, ultimately we will also want to take into account the possibility of observing equally extreme values in the opposite direction since we stated a two-tailed hypothesis (i.e., our alternative hypothesis is that the condition means are different — not that one is higher than the other). In other words, in our example, while we observed a mean difference that is 9.8 standard errors above the null hypothesis value, we’re equally interested in how unlikely a mean difference that is 9.8 standard errors below the null hypothesis value would be, assuming our null hypothesis is true. In this case, the mean for the PI condition would exceed the mean for the CI condition.

To compute this two-tailed probability, we can sum the probabilities of a z-score above 9.8 or a z-score below -9.8 as follows:

pnorm(q = 9.8, lower.tail = FALSE) + pnorm(q = -9.8) 
[1] 1.125856e-22

This calculation indicates that there’s a near zero probability of observing a sample mean difference in WTP between the two conditions that is 9.8 standard errors or more (in either direction) away from the null hypothesis mean difference. In other words, such extreme results are highly unlikely to occur by random chance alone if the true population difference in WTP between the two conditions is 0.

In the graph below, the population mean difference under the null hypothesis is denoted by a solid green line, and the green dashed lines denote the upper and lower bounds of the middle 95% of the expected distribution if the null hypothesis holds true. Recall that in a standard normal distribution, 95% of the distribution falls within 1.96 standard deviations of the mean. The area between these dashed lines can be seen as the “Do not reject the null hypothesis” zone, since any sample mean difference falling within this range aligns with the null hypothesis. Conversely, areas beyond these bounds, either to the left or right, represent the “Reject the null hypothesis” zone; obtaining a sample mean difference in these regions would be uncommon if the null hypothesis were accurate.

If our sample mean difference falls within the rejection region (the areas labeled “Reject the null hypothesis”), it suggests an outcome so extreme that it casts doubt on the null hypothesis. In such cases, we would reject the null hypothesis, concluding that the mean difference in WTP between the two conditions is significantly different from 0. Conversely, if the sample mean difference lies within the “Do not reject the null hypothesis” zone, it indicates that a 0 difference in WTP between the two conditions in the population is reasonable, and thus, we cannot dismiss the null hypothesis.

Let’s overlay our observed sample statistic onto the graph of the sampling distribution for the null hypothesis, with the rejection regions marked. Our observed statistic (i.e., the difference in WTP between the two conditions) is 9.8 standard errors away from the null hypothesis and is represented by the pink line on the graph. This observed sample statistic is well outside the threshold for what we’d expect to observe if the null hypothesis is true. Therefore, it is highly unlikely that the true difference in WTP between the two conditions is 0. That is, the null hypothesis should be rejected.

Type I and Type II errors

After stating our null and alternative hypotheses, and then evaluating whether the null hypothesis is probable, there are four possible outcomes, as defined by the matrix below.

The columns delineate two potential realities:

  • Left Column (tagged “True”): Signifies a world where the null hypothesis is valid, suggesting that there is no difference in WTP between the two conditions.

  • Right Column (tagged “False”): Depicts a world where the null hypothesis is invalid, implying that there is a difference in WTP between the two conditions.

The rows of this matrix represent possible study outcomes:

  • Top Row (tagged “Rejected”): Illustrates when our research dismisses the null hypothesis, indicating it found a significant difference in WTP between the conditions.

  • Bottom Row (tagged “Not rejected”): Shows when our research does not find evidence to reject the null hypothesis, indicating no difference in WTP between the two conditions.

Combining the larger population’s reality with our sample findings, we identify four potential outcomes:

  1. Correct Conclusions (Aqua Boxes):
  • Top Aqua Box: Our research detects a difference in WTP, aligning with real-world results.

  • Bottom Aqua Box: Our research does not detect a notable difference in WTP, consistent with actual non-difference in the population.

  2. Incorrect Conclusions (Pink Boxes):
  • Type I Error (Top Pink Box): Our study rejects the null hypothesis, but in reality, it’s valid. This is like a false alarm; we’ve identified a non-existent difference/effect.

  • Type II Error (Bottom Pink Box): Our study does not reject the null hypothesis, yet it’s false in the real world. This resembles a missed detection, where we overlook an actual difference/effect.

Type I Error Probability (\(\alpha\) - “alpha”): This represents the likelihood of mistakenly rejecting a true null hypothesis. Similar to setting the confidence level for confidence intervals (e.g., 95%, 99%), we determine our risk threshold for a Type I error beforehand. Frequently in Psychology, the threshold, \(\alpha\), is set at 0.05. While this is conventional, researchers should judiciously decide on \(\alpha\). If set at 0.05, it implies a 5% risk of wrongly rejecting the true null hypothesis — e.g., claiming there is a significant difference in WTP between conditions when really there is not.

Type II Error Probability (\(\beta\) = “beta”): This is the chance of incorrectly failing to reject a false null hypothesis. Its complement, \(1-\beta\), known as the test’s “power,” reflects our capability to discover a genuine effect. Enhancing sample size or refining measurements can diminish \(\beta\), subsequently bolstering our test’s power.
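As a rough illustration of power, base R’s power.t.test() can approximate the power of a two-group comparison. The per-group n (~452) and pooled SD (~51) below are approximations pulled from the descriptive statistics earlier in this Module, so treat this as a sketch rather than a formal power analysis of the Hofman et al. study.

# Approximate power for detecting a 29.4 ice-dollar difference; n per group and
# sd are rough values based on the descriptives shown earlier in the Module.
power.t.test(n = 452, delta = 29.4, sd = 51, sig.level = 0.05)
# With a sample this large and an effect this big, power is essentially 1.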

To further solidify your understanding of these topics, please take a moment to watch this Crash Course Statistics video on Type I and Type II errors.

The general framework for conducting a hypothesis test

Working through the Hofman and colleagues’ study, we imagined that we knew the standard error (we claimed it was 3). But, typically, we do not know the standard error of the sampling distribution — it must be estimated from the data. Recall from Module 13 that we can capture the difference in a continuous outcome across a grouping variable by fitting a linear regression model where the outcome (\(y_i\), in this case wtp_final) is regressed on a binary indicator of condition (\(x_i\), in this case interval_CI). Recall that the variable interval_CI is coded 1 if the individual saw a CI and 0 if they saw a PI.

Let’s fit the needed model. Since we’re considering just the cases where text_extra equals 0 (i.e., no extra/supplementary text), that is, Conditions 1 and 3, we first need to subset the data before fitting the model:

viz_text <- 
  df |> 
  filter(condition.f %in% c("1: CI with viz stats only", "3: PI with viz stats only"))

interval_diff <- lm(wtp_final ~ interval_CI, data = viz_text)
interval_diff |> tidy(conf.int = TRUE, conf.level = .95) |> select(term, estimate, std.error, conf.low, conf.high)

Notice in the output that the estimate for the intercept represents the mean WTP for the PI condition (i.e. the reference group), and the estimate for the slope (interval_CI) represents the difference in the mean WTP for the CI condition compared to the PI condition — that is, our parameter estimate of interest as defined earlier.
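A quick way to see this correspondence is to compare the fitted coefficients to the group means directly. This sketch uses the viz_text data frame and interval_diff model defined just above.

# The intercept should match the PI-condition mean (~49.6 ice dollars), and
# intercept + slope should match the CI-condition mean (~79.0 ice dollars).
viz_text |>
  group_by(interval_CI) |>
  summarise(mean_wtp = mean(wtp_final))

coef(interval_diff)[["(Intercept)"]]                                         # PI mean
coef(interval_diff)[["(Intercept)"]] + coef(interval_diff)[["interval_CI"]]  # CI mean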

Now that we have the gist of the process for conducting a NHST, let’s walk through it step by step to test our hypothesis using the observed data from the Hofman and colleagues’ experiment.

Step 1: Define the hypothesis

Null Hypothesis
We can state the null hypothesis in any of the equivalent ways:

  • \(\mu_{CI} - \mu_{PI} = 0\) (the population mean difference between the two conditions is equal to 0)
  • \(\mu_{CI} = \mu_{PI}\) (the population means are equal)
  • \(\beta_1 = 0\) (the regression coefficient for the binary treatment indicator is equal to 0, indicating no effect of viewing the CI (versus the PI) on WTP)

Alternative Hypothesis
We can state the alternative hypothesis in any of the equivalent ways:

  • \(\mu_{CI} - \mu_{PI} \neq 0\) (the population mean difference between the two conditions is not equal to 0)
  • \(\mu_{CI} \neq \mu_{PI}\) (the population means are not equal)
  • \(\beta_1 \neq 0\) (the regression coefficient for the binary treatment indicator is not equal to 0, indicating an effect of viewing the CI (versus the PI) on WTP)

Step 2: Determine the appropriate statistical test

In this scenario, where we regressed wtp_final on interval_CI, the appropriate statistical test is a t-test for the regression slope. This test evaluates whether the observed relationship between the binary indicator (CI condition vs. PI condition) and the outcome variable (WTP) is statistically significant.

The formula for the t-statistic for a regression coefficient is:

\[ t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} \]

Where:

  • \(\hat{\beta}_1\) is the estimated regression coefficient for the binary treatment indicator (interval_CI)

  • \(SE(\hat{\beta}_1)\) is the standard error of the estimated regression coefficient

This t-statistic measures how many standard errors the estimated coefficient is away from 0. If the null hypothesis is true (no effect of the CI condition on WTP or, stated differently, no difference in the mean WTP between the two conditions), we would expect the t-statistic to be close to 0. A large absolute value of the t-statistic suggests that the observed coefficient is unlikely to be due to chance, and thus, the null hypothesis can be rejected in favor of the alternative.

By now, you’re familiar with the concept of z-scores or standardized scores, which provide a way to understand how many standard deviations a value lies from the mean in a standard normal distribution. When we conduct hypothesis testing, the t-statistic serves a similar purpose. Rather than being expressed in the raw units of the outcome variable (e.g., WTP as expressed in ice dollars), the t-statistic is expressed in standardized units, which gives us a relative measure of the distance of our sample statistic from the null hypothesis value. In our context, a t-statistic provides a standardized measure of deviation from the null hypothesis value (i.e., a mean difference in WTP of 0 between conditions in this instance). The larger the absolute value of the t-statistic, the greater the deviation from the null hypothesis. This makes it less likely that our observed sample statistic is merely a result of random chance, assuming the null hypothesis is true.

If the null hypothesis is true and the real population parameter is 0, then we’d expect the t-statistic to be around 0 (because there’d be little to no difference between the sample parameter and the null hypothesis value). If it’s significantly different from 0, it suggests that our sample provides enough evidence to challenge the validity of the null hypothesis.

We’ll consider a few common test statistics in this Module. However, the most important task is to understand the gist of hypothesis testing. Once that is well understood, then finding and conducting the right type of test is straightforward.

Step 3: Define the rejection criteria

Before computing our test statistic, it’s crucial to specify the conditions under which we will reject the null hypothesis. For a two-tailed test, we evaluate both the lower (left) and upper (right) ends of the sampling distribution. We identify regions within each tail where, if our t-statistic lies, we would have sufficient evidence to reject the null hypothesis. These boundaries, which delineate where the t-statistic becomes too extreme under the null hypothesis, are termed critical t-values or critical values of t.

The concepts of alpha (\(\alpha\)) and degrees of freedom (df) are pivotal when determining these critical values to define the rejection region.

  • The alpha level is the probability of rejecting the null hypothesis when it is true. In other words, it represents the risk of making a Type I error. It’s a pre-defined threshold that researchers set before conducting a hypothesis test. The choice of \(\alpha\) is subjective, but in many fields, an \(\alpha\) of 0.05 is conventional. This means that there’s a 5% chance of rejecting the null hypothesis when it’s true. Other common values are 0.01 and 0.10, but the specific value chosen should be appropriate for the context of the study.

  • In the context of testing a linear regression coefficient, the df are defined as \(df = n - 1 - p\). Here, \(n\) represents the sample size, 1 represents the one degree of freedom that we lose for estimating the intercept of the regression model, and p represents the number of predictors (i.e., estimated slopes) in our regression model. We reduce the sample size by one df for each parameter that we estimate. In our current example, the n is 905, and p is 1, thus our df is 905 - 1 - 1 = 903.

  • Given an alpha level of 0.05 and df of 903, we can determine the critical t-values (lower and upper bound) that define our rejection region for a two-tailed test.

Choosing an appropriate alpha level and understanding df are crucial steps in hypothesis testing. They inform the critical values that dictate whether or not we reject the null hypothesis. By setting these parameters before analyzing the data, researchers maintain the integrity and validity of their tests. For this example, we’ll set alpha to .05.

Once alpha and df are determined, to find the critical values (the boundaries of the rejection regions), we use the qt() function. Since it’s a two-tailed test, we split our alpha value, 0.05, into two equal parts, placing one half (0.025) in the lower tail and the other half in the upper tail — thus we need to find the 2.5th and 97.5th percentiles of the sampling distribution.

The following R code calculates the critical t-values for our example:

qt(p = c(.025, .975), df = 903)
[1] -1.962595  1.962595

Here’s a visual representation of the rejection criteria using our computed critical t-values.

If our calculated t-statistic is between -1.963 and 1.963, then we will NOT REJECT the null hypothesis. However, if the calculated t-statistic is outside of this range (either greater than 1.963 or less than -1.963), then WE WILL REJECT the null hypothesis.

Before we delve into the next step, which is to calculate the test statistic, please watch the following two Crash Course Statistics videos on the two topics just discussed — Test Statistics and Degrees of Freedom.

Step 4: Calculate the standard error

Recall from Module 12, that we considered two frequentist methods for estimating the standard error (SE) of a regression slope — bootstrapping and theory-based (i.e., parametric) formulas.

Bootstrap estimation of the standard error
The bootstrap method is a robust, non-parametric technique used to estimate the standard error of a regression slope by resampling data. This approach approximates the sampling distribution of a statistic without relying on strict parametric assumptions. The following code demonstrates how to use the bootstrap method to estimate the standard error for our example, specifically calculating the mean difference in WTP between the two conditions. This involves generating 5000 bootstrap resamples to simulate the sampling distribution, resulting in an estimated mean difference and its associated standard deviation.

# Define the number of bootstrap samples
n_bootstrap <- 5000

# Generate bootstrap samples and compute the coefficient for interval_CI
bootstrap_distributions <- map_dfr(1:n_bootstrap, ~ {
  # Resample the data with replacement
  resampled_data <- viz_text |> sample_n(n(), replace = TRUE)
  
  # Fit the model to the resampled data
  bootstrap_model <- lm(wtp_final ~ interval_CI, data = resampled_data)
  
  # Extract the coefficient for interval_CI
  tidy(bootstrap_model) |> filter(term == "interval_CI")
}) |> rename(bootstrap_estimate = estimate) # Rename the estimate column for clarity

# Create a data frame to hold the estimated slopes across the bootstrap resamples
bootstrap_estimates_for_slope <- bootstrap_distributions |> as_tibble()

# Calculate the standard deviation of the bootstrapped estimates as an estimate of the standard error
bootstrap_results <- 
  bootstrap_distributions |> 
  summarise(mean_estimate = mean(bootstrap_estimate), sd_estimate = sd(bootstrap_estimate))

bootstrap_results

Using the bootstrap approach, we find that the estimate of the mean difference in WTP between the two conditions is 29.3, and the standard deviation is 3.4. Recall that the standard deviation of the bootstrap distribution of our statistic provides an estimate of the statistic’s standard error (SE).

Parametric (i.e., theory-based) estimation of the standard error
The parametric approach to estimating the standard error of a regression slope relies on assumptions about the distribution of the error terms in the linear regression model. Provided these assumptions are met, we can use this simpler, formula-based approach to estimate the mean difference in WTP and its associated standard error. The code below applies the parametric/theory-based method to obtain the SE for the slope (which is labeled std.error in the output):

interval_diff <- lm(wtp_final ~ interval_CI, data = viz_text)
interval_diff |> tidy() |> filter(term == "interval_CI") |> select(term, estimate, std.error)

Using the parametric approach, we find that the estimate of the mean difference in WTP between the two conditions is 29.4, and the standard error is 3.4.
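For reference, the theory-based SE of a simple regression slope can also be reproduced by hand from the textbook formula \(\sqrt{MSE / \sum{(x_i - \bar{x})^2}}\). This sketch uses the viz_text and interval_diff objects defined above.

# Hand computation of the parametric SE for the slope; it should reproduce the
# std.error of ~3.41 reported by tidy() above.
n   <- nrow(viz_text)
x   <- viz_text$interval_CI
mse <- sum(residuals(interval_diff)^2) / (n - 2)  # mean squared error
sqrt(mse / sum((x - mean(x))^2))                  # SE of the slope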

Summary of standard error estimation methods
The SE values obtained from both the bootstrap and parametric approaches are quite similar, and each represents an estimate of the uncertainty or variability in the regression slope that quantifies the effect of the independent variable (condition) on the WTP, measured in ice dollars. These SE values provide a quantitative measure of how much the estimated slope, or the predicted difference in WTP contrasting the CI condition to the PI condition, might vary if we were to repeat the study with new samples from the same population.

This means, in a practical sense, that the true slope value is expected to fall within ±3.4 ice dollars of our calculated slope estimate in about 68% of random samples (assuming a normal distribution of slope estimates, as per the Empirical Rule). This variation underscores the precision of our estimate. The closer this SE is to zero, the less variability we expect in slope estimates across different samples, indicating a more precise estimate of the true slope. Conversely, a larger SE would signify greater variability and less precision, suggesting that our estimate might fluctuate more significantly with different samples.

Step 5: Calculate the test statistic

Now, we’re ready to calculate the test statistic. Recall that the test statistic is simply the estimate divided by the standard error.

For the bootstrap estimation

bootstrap_results |> mutate(statistic = mean_estimate/sd_estimate)

\[ t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} = \frac{29.32459 - 0}{3.394314} = 8.639 \]

Thus, our estimate of the regression slope (which captures the mean difference in WTP between the CI condition and the PI condition) is 29.32459, and the standard error for this estimate is 3.394314. Dividing the estimate by the standard error yields the test statistic, 8.639327, which tells us how many standard errors our sample estimate is away from the null hypothesis value (0).

For the theory-based estimation

interval_diff <- lm(wtp_final ~ interval_CI, data = viz_text)
interval_diff |> tidy() |> select(term, estimate, std.error, statistic)

\[ t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} = \frac{29.36551 - 0}{3.408571} = 8.615 \]

Thus, our estimate of the regression slope (which captures the mean difference in WTP between the CI condition and the PI condition) is 29.36551, and the standard error for this estimate is 3.408571. Dividing the estimate by the standard error yields the test statistic, 8.615199, which tells us how many standard errors our sample estimate is away from the null hypothesis value (0). Note that the tidy() function computes this for us; it’s labeled statistic in the output.

Summary of test-statistic calculation

In this instance, we find that the test statistics from the non-parametric (bootstrap) and parametric approaches are quite similar. In both cases the sample estimate is about 8.6 standard errors away from the null hypothesis value of 0.

Step 6: Determine if the calculated test-statistic is in the null hypothesis rejection region

Our test statistic of ~8.6 conveys that our sample produces a mean difference in WTP between conditions that is 8.6 standard errors away from the null hypothesis value of 0. Observing the critical values of our rejection criteria, it is evident that our t-statistic falls within the rejection range (i.e., it’s in the right tail and greater than 1.963).

The corresponding p-value (see p.value in the output) encapsulates the probability of encountering a t-statistic as extreme as ours (or even more so, i.e., beyond ±1.963) under the presumption that the null hypothesis holds. Effectively, it gauges the likelihood of witnessing a mean as far detached from 0 (our null hypothesis value) by mere random chance. Let’s reveal the p-value in the output of our regression model fitted using the parametric approach:

interval_diff |> tidy() |> select(term, estimate, std.error, statistic, p.value)

For our test, this p-value is minuscule, approximately 3.084582e-17 (i.e., 0.00000000000000003084582). If the p-value is less than our chosen alpha level, we reject the null hypothesis. Consistent with our rejection criteria, a t-statistic residing within the rejection region invariably yields a p-value below alpha. That is, the rejection criterion and p-value are two ways of performing the same evaluation.

If we like, we can use the pt() function to calculate the p-value ourselves. In a two-tailed t-test, we are testing for the possibility of the observed sample parameter being significantly different from the hypothesized parameter, either greater or smaller. In our example, we need to assess both tails of the distribution: the right tail provides the probability of observing a sample mean as extreme as, or more extreme than, what we observed given the null hypothesis is true, while the left tail provides the equivalent probability for values as extreme in the opposite direction. By summing the probabilities from both tails, we acquire the complete p-value for the two-tailed test, which represents the total probability of observing our sample test statistic, or something more extreme, under the assumption that the null hypothesis is true.

pt(q = 8.615, df = 903, lower.tail = FALSE) +
  pt(q = -8.615, df = 903)
[1] 3.089531e-17

Please note that this is the same method used by lm(); the calculated p-value differs just slightly due to rounding. You can then use this same approach to calculate a p-value using the test statistic garnered from the bootstrap approach.

Step 7: Determine if the null hypothesis should be rejected

Because the t-statistic falls in the rejection region, we reject the null hypothesis. This means that the observed difference in willingness to pay between those who viewed the CI visualization and those who viewed the PI visualization is statistically significant. In simpler terms, the likelihood of seeing such a large difference (or an even larger one) by random chance is very low if there is no real difference in the population.

Substantively, this result suggests that the way information is presented (using confidence intervals versus prediction intervals) has a measurable impact on people’s willingness to pay. Specifically, the CI visualization appears to make the special boulder seem more appealing, leading to a higher willingness to pay. This could be because the narrower range of the CI gives a more precise estimate of the expected improvement, which might increase confidence in the effectiveness of the special boulder.

In practical terms, this finding highlights the importance of how statistical information is communicated to consumers. Marketers and researchers should consider the implications of using different types of visualizations to convey uncertainty, as these choices can significantly influence decision-making and perceived value.

The plot below illustrates the results of our test. Only 5% of samples should produce a mean difference between conditions that is more than 1.963 standard errors from the null hypothesis value (0), if the null is true. Since our sample produced a mean difference that is more than 1.963 standard errors from the null value, we reject the null hypothesis. The pink line is placed at 8.6 standard errors from the null value — the t-statistic calculated from our t-test (based on the parametric approach).

Step 8: Calculate confidence intervals

We’ve studied the calculation of CIs for our parameter estimates in many examples — so now you should feel comfortable with these. Here’s how we can calculate these for our current example.

For the bootstrap estimation

Recall from Modules 9 and 12 that we can calculate a CI based on the estimated parameters across the bootstrap resamples by taking the 2.5th and 97.5th percentiles of the distribution.

bootstrap_estimates_for_slope |> 
  summarize(lower = quantile(bootstrap_estimate, probs = .025),
            upper = quantile(bootstrap_estimate, probs = .975))

Here, we see that the 95% CI for the difference in WTP between the two conditions, based on the bootstrap resamples, is 22.7 to 36.0.

For the theory-based estimation

As you already know, the tidy() output from a lm() object also provides the 95% CI.

interval_diff |> tidy(conf.int = TRUE, conf.level = .95)

Here, we see that the 95% CI for the difference in WTP between the two conditions is 22.7 to 36.1. This is calculated as the estimate for the slope plus or minus the critical t-value times the standard error; we use the same critical t-values used to define the rejection region for the hypothesis test. For example, the 95% confidence interval is calculated as follows: \(29.36551 \pm 1.962595\times3.408571\).
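
As a quick check, we can reproduce this interval by hand from the estimate, the critical t-value, and the standard error reported above:

29.36551 + c(-1, 1) * qt(.975, df = 903) * 3.408571   # approximately 22.7 and 36.1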

Summary of CIs

Confidence intervals (CIs) offer a range of plausible values for an estimated parameter and are crucial for understanding the precision of an estimate. In the context of studying the mean difference in willingness to pay (WTP) between two groups, the 95% CI provides a range of plausible values for the estimate. It’s important to recognize that across many random samples, 5% of these CIs would not include the true population parameter. This interval is calculated from the sample data and reflects the variability of the estimate: wider intervals indicate more variability, while narrower intervals suggest more precision.

The CI provides more information than a simple test of statistical significance because it offers a range of values that are consistent with the observed data, giving insight into the estimate’s precision and the effect size. This is particularly valuable as it contextualizes the point estimate within a range of plausible values, enhancing our understanding of the estimate’s reliability.

Relating confidence intervals to hypothesis testing, a key point is that if the 95% CI for the difference between two group means does not include zero, this suggests a statistically significant difference between the groups at the 5% significance level (i.e., alpha). The exclusion of zero from the interval indicates that the observed difference is unlikely to be due to sampling variability alone. In other words, 0 is not a plausible estimate for the difference in the outcome (e.g., WTP) between groups, and we have evidence against the null hypothesis.

Using a parametric approach, these three things will always coincide:

  1. A t-statistic that is in the rejection region,
  2. A confidence interval (CI) that does not include the null value (zero for no difference),
  3. A p-value that is smaller than the alpha level (commonly 0.05).

This is because all three metrics — t-statistic, confidence interval, and p-value — are derived from the same underlying statistical theory and data. When the t-statistic is in the rejection region, it corresponds to a p-value smaller than the alpha level, and the CI will not include the null value, indicating a statistically significant result.

Step 9: Quantify the effect size

Effect size is a crucial component of hypothesis testing, providing a quantitative measure of the magnitude of a phenomenon. While statistical significance can tell us if an effect exists, the effect size tells us how large that effect is. In the context of comparing two groups, Cohen’s d is a widely used measure of effect size. It quantifies the difference between two group means in terms of standard deviations, offering a standardized metric of difference that is not influenced by the scale of the measurements.

The formula for Cohen’s d is:

\[ d = \frac{M_1 - M_2}{SD_{pooled}} \]

Where:

  • \(M_1\) and \(M_2\) are the means of \(y_i\) for the two groups being compared.
  • \(SD_{pooled}\) is the pooled standard deviation of \(y_i\) for the two groups, calculated as:

\[ SD_{pooled} = \sqrt{\frac{(n_1 - 1) \times SD_1^2 + (n_2 - 1) \times SD_2^2}{n_1 + n_2 - 2}} \]

In this equation:

  • \(n_1\) and \(n_2\) are the sample sizes of the two groups.
  • \(SD_1\) and \(SD_2\) are the standard deviations of the two groups.
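
Before turning to a package, it can help to apply the formula directly. Here is a minimal sketch that computes Cohen's d by hand from the group means, standard deviations, and sample sizes, assuming (as in the code further below) that viz_text contains wtp_final and the 0/1 indicator interval_CI:

library(dplyr)

group_stats <- viz_text |> 
  group_by(interval_CI) |> 
  summarize(m = mean(wtp_final), s = sd(wtp_final), n = n())

g1 <- group_stats |> filter(interval_CI == 1)   # CI condition (group 1)
g2 <- group_stats |> filter(interval_CI == 0)   # PI condition (group 2)

sd_pooled <- sqrt(((g1$n - 1) * g1$s^2 + (g2$n - 1) * g2$s^2) / (g1$n + g2$n - 2))
(g1$m - g2$m) / sd_pooled   # Cohen's d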

The effectsize package in R simplifies the process of calculating various effect sizes, including Cohen’s d. The cohens_d() function calculates Cohen’s d.

The code snippet below demonstrates the application of the cohens_d() function to calculate the effect size for our specific case. Prior to leveraging the cohens_d() function, a crucial step of recoding is necessary. This preparatory step stems from how the cohens_d() function differentiates between groups based on the predictor variable, in this case, interval_CI. Originally, interval_CI employs a binary coding system where 0 signifies the PI condition and 1 signifies the CI condition. Given this numerical ordering, cohens_d() inherently identifies the PI condition as the first group and the CI condition as the second group. However, to align the sign of the calculated effect size with the earlier linear model estimations — and to facilitate a more intuitive interpretation — it is desirable for the CI condition to be recognized as the first group and the PI condition as the second.

To achieve this, we can redefine interval_CI into a factor variable named interval_CI.f, wherein we explicitly dictate the sequence of the groups according to our preference. This redefinition not only corrects the order in which the conditions are compared but also ensures that the resulting sign of Cohen’s d mirrors the directional effect observed in our linear model analysis.

library(effectsize)

viz_text <-
  viz_text |> 
  mutate(interval_CI.f = factor(interval_CI, 
                                levels = c(1,0), labels = c("CI viz", "PI viz")))
         
cohens_d(wtp_final ~ interval_CI.f, ci = .95, data = viz_text)

The output from the cohens_d() function provides the value of Cohen's d along with its 95% CI. In this case, the Cohen's d value is 0.57, with a 95% CI ranging from 0.44 to 0.71. This indicates that the mean WTP in the CI condition is 0.57 standard deviations higher than in the PI condition.

The CI provides additional context by indicating the range of plausible values for Cohen’s d based on the sample data. That the entire interval lies above zero further supports the conclusion that the CI condition is associated with a higher mean WTP compared to the PI condition.

Cohen’s conventions for interpreting the size of d suggest that a value of around 0.2 represents a small effect, around 0.5 a medium effect, and 0.8 or greater a large effect. Therefore, a Cohen’s d of 0.57 would be considered a medium effect size, indicating a meaningful difference between the two conditions that is likely to have practical significance.
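
If your installed version of the effectsize package includes it, the interpret_cohens_d() helper applies such rules of thumb for you; this is an optional check rather than a required step:

interpret_cohens_d(0.57)   # expected to label 0.57 as a "medium" effect under Cohen's (1988) rules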

Step 10: Check model assumptions

An essential aspect of conducting reliable statistical analyses, particularly when employing models such as linear regression, involves verifying that the assumptions underlying these statistical tests are met. These assumptions include linearity of relationships, independence of observations, homoscedasticity (equal variances) of errors, and normality of error terms, among others. Each of these assumptions plays a critical role in ensuring the validity of the test results and the accuracy of the conclusions drawn from them. Failing to meet these assumptions can lead to biased estimates, incorrect inferences, and ultimately misleading conclusions.

Therefore, it’s crucial for researchers to perform diagnostic checks and consider appropriate remedial measures when assumptions are violated. The process of checking these assumptions, understanding their importance, and addressing potential violations is a nuanced area that requires careful attention. Recognizing the significance of this step, the next Module (i.e., Module 17) for this course will be dedicated to exploring these assumptions in detail. We will delve into methods for assessing the validity of each assumption, strategies for diagnosing potential issues, and techniques for making corrections when necessary.

A word of caution

Null Hypothesis Significance Testing (NHST) has long been a cornerstone of statistical analysis in research, offering a framework to evaluate the probability that observed data would occur under a specific null hypothesis. However, NHST is not without its criticisms and limitations, which have sparked ongoing debate among statisticians and researchers. One of the primary criticisms of NHST is its binary nature, classifying results strictly as “significant” or “not significant” based on a pre-defined alpha level, typically set at 0.05. This arbitrary cutoff does not account for the magnitude of the effect, the precision of the estimate, or the practical significance of the findings. As a result, important but smaller effects may be dismissed, while statistically significant findings may be emphasized without consideration of their real-world relevance.

Moreover, the reliance on p-values as the sole metric for statistical significance can be misleading, and great caution is in order when using p-values to conduct hypothesis tests. While p-values are a fundamental part of statistical hypothesis testing, they should be used and interpreted with care. A p-value is not a measure of the magnitude or importance of an effect, nor does it provide direct information about whether the null hypothesis is true. Rather, it tells us the probability of observing an effect as large or larger than the one we observed in our sample, given that the null hypothesis is true.

Furthermore, a p-value is sensitive to sample size: with large samples, small and perhaps unimportant differences can be detected as statistically significant, while with small samples, even large and potentially important effects may not be detected as significant. Therefore, relying solely on p-values for decision-making can lead to misinterpretations. It’s also important to remember that a p-value doesn’t consider the possibility of data errors or biases in the study design. For a more complete understanding of the data, p-values should be used in conjunction with other statistical measures and tools, such as confidence intervals, and with careful consideration of the research context and design. Alternatively, researchers can take a Bayesian approach, which offers a probabilistic framework that incorporates prior knowledge and provides a more comprehensive interpretation of the results.

Please watch the following video on the potential problems that we can encounter with p-values.

Hypothesis tests for the overall model

When we analyze data using linear regression, our primary goal is often to understand how well our set of explanatory variables can predict or explain the variation in the outcome variable. One intuitive measure for assessing this is the \(R^2\) value, which you’ve already studied. \(R^2\) represents the proportion of the variance in the outcome variable that’s predictable from the predictors. In simple terms, it tells us how much of the change in our outcome variable can be explained by changes in our predictor variables. However, while \(R^2\) gives us a good indication of the model’s explanatory power, it doesn’t tell us whether the observed relationship between our predictors and the outcome is statistically significant.

It turns out that there is an additional statistical test, called an F-test, that we can use to determine whether the predictors in the model collectively explain a statistically significant proportion of the variance in the outcome. The F-test provides a way to formally assess the overall significance of a regression model. It helps us answer the question: “Are the relationships that we observe between the predictors and the outcome variable just due to chance, or are these predictors truly able to predict the outcome?” Essentially, the F-test compares a model with no predictors (just the intercept, which predicts the mean outcome for all observations) against our actual model with predictors, to see if our predictors significantly improve the model’s ability to explain the variability in the outcome variable.

Why is this important? Imagine you’ve calculated an \(R^2\) value from a multiple linear regression model that suggests your model explains a good portion of the variance in the outcome. That sounds promising, but without the F-test, we don’t know if this result is statistically robust or if it could simply be due to random variation in the data. The F-test addresses this by providing a p-value, which tells us whether we can reject the null hypothesis that our model with the specified predictors does not offer a better fit than a model without them.

The F-distribution, also known as the Fisher-Snedecor distribution, is a probability distribution that, like the normal and t-distributions, serves as a reference distribution for test statistics. It is derived from the ratio of two chi-squared distributions, each divided by their respective degrees of freedom (df). This results in a distribution that is right-skewed (positively skewed). The shape of the F-distribution is determined by two parameters: the df for the numerator (df1) and the df for the denominator (df2).

  • df1 (numerator df): This is related to the variance explained by the model.
  • df2 (denominator df): This is related to the residual or unexplained variance.

The degrees of freedom for both the numerator and the denominator come from the variance estimates used to calculate the F-statistic. The F-test is one-tailed because it is used to test hypotheses about variances and always involves the ratio of two variances, which are always positive. Unlike the t-distribution, which can take both positive and negative values (as it tests differences in means), the F-distribution only takes positive values because it measures the extent to which observed variances deviate from expected variances. In the regression setting, the F-statistic is the ratio of the mean square for the model to the mean square for the residuals:

\[ F = \frac{MS_{model}}{MS_{residual}} \]

where:

  • \(MS_{model}\) is the Mean Square for the Model, calculated as the Sum of Squares Regression (SSR) divided by the degrees of freedom for the model (\(df_{model}\)).
  • \(MS_{residual}\) is the Mean Square for the Residuals, calculated as the Sum of Squares of Error (SSE) divided by the degrees of freedom for the residuals (\(df_{residual}\)).

These components can be expressed more fully as:

\[ MS_{model} = \frac{SSR}{df_{model}} = \frac{\sum(\hat{y}_i - \bar{y})^2}{p} \]

\[ MS_{residual} = \frac{SSE}{df_{residual}} = \frac{\sum(y_i - \hat{y}_i)^2}{n - p - 1} \]

Where:

  • \(SSR\) (Sum of Squares Regression, also called the Model Sum of Squares) measures how much of the total variability in the dependent variable (\(y_i\)) is explained by the model.

  • \(SSE\) (Sum of Squares of Error) measures the variability in \(y_i\) that is not explained by the model.

  • \(\hat{y}_i\) is the predicted value of \(y\) for the \(i^{th}\) observation.

  • \(\bar{y}\) is the mean of observed values of \(y\).

  • \(y_i\) is the observed value of \(y\) for the \(i^{th}\) observation.

  • \(n\) is the total number of observations.

  • \(p\) is the number of predictors in the model.

  • \(df_{model}\) represents the degrees of freedom for the model (i.e., the number of predictors).

  • \(df_{residual}\) represents the degrees of freedom for the residuals (calculated as \(n - p - 1\), where \(n\) is the sample size and \(p\) is the number of predictors).

The F-statistic thus calculated is then compared against a critical value from the F-distribution with \((df_{model}, df_{residual})\) degrees of freedom to determine statistical significance. A significant F-statistic indicates that the model explains a significant portion of the variance in the outcome (\(y_i\)).

Let’s consider an example. Recall that in the Hofman and colleagues experiment that we’ve been considering (experiment 1 from the paper), there are actually two experimental factors.

  1. The type of interval seen in the visualization (CI versus PI)
  2. Whether the visualization included text that only discussed the type of interval in the assigned visualization or if the text described both types of intervals.

The variable interval_CI is coded 1 if the visualization presented a CI and 0 if the visualization presented a PI, and text_extra is coded 1 if the visualization included extra/supplemental information about both types of intervals and 0 if it only included information about the type of interval presented in the visualization.

Model 1

To assess the individual and combined influence of these factors on WTP, we may construct a linear regression model where wtp_final (the amount participants were willing to pay) is regressed on both of the condition indicators: interval_CI and text_extra.

model1 <- lm(wtp_final ~ interval_CI + text_extra, data = df) 
model1 |> tidy(conf.int = TRUE, conf.level = .95)
model1 |> glance() 

Let’s begin with the tidy() output:

  • The intercept, estimated at 52.1 ice dollars, represents the average WTP for participants in the reference group—those who viewed a prediction interval (PI) without additional explanatory text (since both interval_CI and text_extra are coded 0 for this group). This value serves as a baseline against which the effects of the experimental conditions are compared.

  • The slope for interval_CI, estimated at 24.1, indicates a significant positive effect of viewing a confidence interval (CI) versus a prediction interval (PI) on WTP, holding the presence of extra text constant. Specifically, on average, participants exposed to a CI visualization were willing to pay approximately 24.1 ice dollars more for the special boulder than those who viewed a PI visualization. This substantial increase, with a very small p-value, provides evidence that the type of interval presented significantly affects participants’ valuation.

  • Conversely, the slope for text_extra, estimated at -0.7, suggests a negligible and non-significant effect of including additional explanatory text on WTP, holding the type of interval constant. The negative sign indicates a slight decrease in WTP associated with extra text, but the effect is not statistically significant (p-value ≈ 0.79), suggesting that the presence of additional textual information about intervals does not meaningfully influence participants’ WTP.

Turning to the glance() output, which offers a comprehensive summary of the model’s performance and overall significance:

  • The \(R^2\) value of 0.0519627 indicates that approximately 5.2% of the variance in WTP is explained by the model. While this suggests the additive effects of these two predictors can explain a portion of the variability in WTP, a substantial amount of variance remains unexplained.

  • The F-statistic of 47.68541 (labeled statistic in the glance() output) and its associated p-value of approximately \(6.89 \times 10^{-21}\) test the null hypothesis that none of the predictors in the model have an effect on WTP. The extremely low p-value leads us to reject this null hypothesis, concluding that the model, as a whole, significantly explains variance in WTP beyond what would be expected by chance. This finding underscores the combined contribution of interval_CI and text_extra to predict WTP, despite the latter’s individual non-significance.

In discussing the F-statistic of 47.68541 for the model, which includes two predictors (interval_CI and text_extra) with a sample size of n = 1743, it’s insightful to understand how the critical value of F is determined for conducting the F-test (akin to the critical value of t and the t-statistic that we used to determine statistical significance for the regression slopes). The critical value is a threshold used to decide whether the observed F-statistic is large enough to reject the null hypothesis at a given significance level, typically denoted by alpha. This value is obtained from the F-distribution, which depends on the df of the model (number of predictors) and the degrees of freedom of the error/residuals (n - p - 1). For our model (called model1), with two predictors and n = 1,743, the df for the numerator of the F-test is 2 and the df for the denominator of the F-test is 1,740. These df values are listed in the glance() output as df and df.residual respectively. To find the critical value for the F-test, we can use the qf() function. For alpha = 0.05, we compute:

qf(0.95, df1 = 2, df2 = 1740)
[1] 3.000896

This function returns the value from the F-distribution for which 95% of the distribution lies below it. If the calculated F-statistic from our model exceeds this critical value, we may reject the null hypothesis, indicating that at least one of the predictors is significantly related to the outcome variable.

For our model (model1), with an F-statistic of 47.68541, comparing this value to the critical value obtained from qf() determines the statistical significance of the model predictors collectively. Our calculated F-statistic of 47.68541 far exceeds the critical value of 3.000896, leading us to reject the null hypothesis. This confirms that the predictors, as a whole, significantly explain variance in WTP beyond what would be expected by chance.

To calculate the p-value for an F-test manually using R, we can utilize the pf() function, which gives the cumulative distribution function (CDF) for the F-distribution. This function is useful for determining the probability of observing an F-statistic as extreme as, or more extreme than, the one calculated from our data, under the null hypothesis.

pf(q = 47.68541, df1 = 2, df2 = 1740, lower.tail = FALSE)
[1] 6.88843e-21

This is the same p-value printed for the F-test in the glance() output (with a slight difference due to rounding).
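
We can also recover the F-statistic itself from the sums of squares, following the formula above. A brief sketch (anova() applied to the fitted model returns the sequential sums of squares for each predictor plus a residual row):

ss  <- anova(model1)
ssr <- sum(ss$`Sum Sq`[1:2])   # model sum of squares (interval_CI + text_extra rows)
sse <- ss$`Sum Sq`[3]          # residual sum of squares
(ssr / 2) / (sse / 1740)       # MS_model / MS_residual, approximately 47.69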

We can visualize the test as follows:
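
The figure itself is not reproduced in this text, but a minimal sketch of such a visualization, assuming ggplot2, might look like the following (the F(2, 1740) density, a dashed line at the critical value, and a line at the observed F-statistic):

library(ggplot2)

ggplot(data.frame(fstat = c(0, 50)), aes(x = fstat)) +
  stat_function(fun = stats::df, args = list(df1 = 2, df2 = 1740)) +            # F(2, 1740) density (stats::df avoids the clash with the data frame named df)
  geom_vline(xintercept = qf(.95, df1 = 2, df2 = 1740), linetype = "dashed") +  # critical value (about 3.0)
  geom_vline(xintercept = 47.68541, color = "pink") +                           # observed F-statistic
  labs(x = "F-statistic", y = "Density") +
  theme_minimal()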

Model 2

Next, we explore an alternative specification of our regression model to deepen our understanding of the experimental effects. Rather than treating the two experimental condition indicators (interval_CI and text_extra) as having independent effects on the outcome variable (wtp_final), we introduce an interaction term between these predictors. This approach allows us to investigate whether the impact of one experimental condition on participants’ willingness to pay for the special boulder depends on the level of the other condition. This is a common type of model to fit for a 2 × 2 factorial design.

Incorporating an interaction term into the regression model acknowledges the possibility of a synergistic or conditional relationship between viewing confidence intervals (CI) versus prediction intervals (PI) and the presence of additional explanatory text. It provides a nuanced analysis that can uncover patterns not evident when considering these factors in isolation.

To fit the model with the interaction, we modify our Model 1 code to fit a second model as follows:

model2 <- lm(wtp_final ~ interval_CI*text_extra, data = df) 
model2 |> tidy(conf.int = TRUE, conf.level = .95)
model2 |> glance() 

Let’s begin with the tidy() output:

  • The estimated intercept is 49.6 ice dollars. This represents the baseline average WTP for participants who were exposed to the Prediction Interval (PI) visualization without any additional explanatory text, as both interval_CI and text_extra are coded as 0 for this group.

  • The coefficient for interval_CI is significantly positive at 29.4 ice dollars. This indicates that, among participants who saw just the standard text (i.e., text_extra == 0), participants viewing a Confidence Interval (CI) visualization, as opposed to a Prediction Interval (PI) visualization, were willing to pay an additional 29.4 ice dollars for the special boulder. The significance of this coefficient (p-value ≈ \(2.10 \times 10^{-17}\)) strongly suggests that the type of statistical interval presented has a meaningful impact on participants’ valuation when no extra/supplementary text is provided.

  • The coefficient for text_extra is 4.8 ice dollars, suggesting a positive but non-significant effect of including additional explanatory text alongside the visualization on WTP among people who saw the PI visualization (i.e., interval_CI == 0). However, given its p-value (≈ 0.17), this increase does not reach statistical significance.

  • The interaction term has an estimated coefficient of -11.0 ice dollars, significant with a p-value of ≈ 0.026. This finding indicates that the effect of presenting a CI visualization is moderated by the addition of extra text. Specifically, this coefficient represents the difference in the effect of seeing a CI versus a PI for people who saw extra text as compared to just the standard text. Since the estimate of 29.4 for interval_CI represents the effect of seeing a CI as opposed to a PI if the text included only information about the visualization (i.e., text_extra = 0), then to calculate the estimated slope for the effect of CI if the text included information about both types of intervals (i.e., text_extra = 1), we compute 29.365514 - 10.963395 ≈ 18.4. Thus the boost in WTP obtained by presenting a CI vs. a PI is attenuated if the text included with the visualization contains information about both CIs and PIs. That is, the optimism that participants gain by seeing a CI can be attenuated if the text provides information about CIs and PIs to put the CI in context.

We can have the marginaleffects package do these simple slope calculations for us via the slopes() function that we learned about in Module 14. The table below provides us with the effect of type of interval on WTP when the participant received only the visualization text and when the participant received the extra text. Notice the estimate matches what we calculated by hand above, with the added benefit here that we also get a standard error for the simple slopes as well as a 95% CI.

slopes(model2, 
       variables = "interval_CI", by = "text_extra", conf_level = .95) |> 
  as_tibble() |> 
  select(term, text_extra, estimate, std.error, conf.low, conf.high)

Overall model summary

Turning to the glance() output, which offers a comprehensive summary of the model’s performance and overall significance, we find that:

  • With an \(R^2\) value of 0.05464827, the model explains approximately 5.5% of the variance in WTP. This modest value indicates that while our model captures a certain amount of variability in WTP, there remains a significant portion of variance unaccounted for. Recall that the \(R^2\) for the model without the interaction was 0.05196; thus, the increase in \(R^2\) is quite small.

  • The F-statistic of 33.50899, with an accompanying p-value of approximately \(4.72 \times 10^{-21}\), provides strong evidence against the null hypothesis that the model with predictors (including the interaction term) does not improve the explanation of variance in WTP over a model with no predictors. This result underscores the statistical significance of the model as a whole, indicating that the predictors, collectively, have a meaningful effect on WTP.

Compare the two fitted models

We can compare two nested linear models via a partial F-test. To do so, we use the anova() function. Two models are considered nested when one model (the simpler or reduced model) is a special case of the other model (the more complex or full model) because it contains a subset of the predictors used in the more complex model. In other words, the simpler model can be obtained by constraining some of the parameters (e.g., coefficients of predictors) in the more complex model to be zero.

The anova() function performs a hypothesis test to determine if the more complex model (with more predictors, including interaction terms) provides a significantly better fit to the data than the simpler model. The null hypothesis in this setting is that the two models are equivalent, and the alternative hypothesis is that the more complex model fits significantly better than the simpler model. Because Model 1 is nested within Model 2 in our example, we can compare them using a partial F-test. The partial F-test is particularly useful for testing the significance of one or more variables added to a model.

The df for the partial F-test are calculated a bit differently than for a traditional F-test. The df for the numerator (\(df_{1}\)) of the F-statistic correspond to the difference in the number of parameters estimated between the more complex model and the simpler model. Essentially, it reflects the number of additional predictors included in the more complex model that are not in the simpler model.

  • Formula: \(df_{1} = p_{complex} - p_{simple}\)
  • Where:
    • \(p_{complex}\) is the number of parameters (including the intercept) in the more complex model.
    • \(p_{simple}\) is the number of parameters (including the intercept) in the simpler model.

For example, Model 1 has 3 parameters (intercept and two slopes) and the more complex model has 4 parameters (intercept and three slopes), therefore, \(df_{1} = 4 - 3 = 1\).

The degrees of freedom for the denominator (\(df_{2}\)) relate to the residual error in the more complex model. It reflects the amount of data available after accounting for the estimated parameters.

  • Formula: \(df_{2} = n - p_{complex}\)
  • Where:
    • \(n\) is the total number of observations in the data.
    • \(p_{complex}\) is the number of parameters in the more complex model.

For example, we have 1,743 observations and the more complex model has 4 parameters, therefore, \(df_{2} = 1743 - 4 = 1739\).

The calculated \(df_{1}\) and \(df_{2}\) are used to determine the critical value of the F-distribution for the significance level of the test (e.g., \(\alpha = 0.05\)). For our partial F-test, the numerator df is 1, and the denominator df is 1739, thus our critical value of F is:

qf(0.95, df1 = 1, df2 = 1739)
[1] 3.846812

If our partial F-test exceeds this value, then we will reject the null hypothesis that the two models are equivalent.

Given our two models, model1 (simpler model without interaction terms) and model2 (more complex model with interaction terms), the partial F-test can be conducted as follows:

anova(model1, model2)

This command compares the two models, model1 and model2, where model1 is nested within model2 because model2 includes all the predictors in model1 plus additional terms (e.g., the interaction between interval_CI and text_extra). The anova() function will output a table that includes the F-statistic and the associated p-value for the comparison.

Interpreting the Results:

  • F-statistic: This value measures the ratio of the reduction in residual sum of squares per additional parameter to the mean squared error (MSE) of the more complex model. A higher F-statistic indicates that the additional parameters in the more complex model provide a significant improvement in explaining the variability in the response variable.
  • p-value: This value indicates the probability of observing the F-statistic, or one more extreme, under the null hypothesis that the two models are equivalent, and the additional parameters in the more complex model do not significantly improve the model fit. A small p-value (typically < 0.05) suggests that the more complex model significantly improves the fit compared to the simpler model, justifying the inclusion of the additional parameters. If the p-value is significant, we can conclude that the interaction term in model2 contributes meaningful information to the model that was not captured by model1, indicating that the effect of one predictor on the response variable depends on the level of the other predictor. If the p-value is not significant, this suggests that the simpler model without the interaction term is sufficient to explain the variability in the response variable.

The Analysis of Variance (ANOVA) table provided for our example shows the results of a partial F-test comparing two nested models to determine the significance of adding an interaction term between the two conditions on WTP. Let’s break down how to interpret the key components of the output and the implications of the test results:

ANOVA Table Interpretation

  • Model Comparison: Model 1, the simpler model, includes wtp_final ~ interval_CI + text_extra, representing the effects of interval_CI and text_extra without considering their interaction. Model 2, the more complex model, includes wtp_final ~ interval_CI*text_extra, adding the interaction term between the two predictors.
  • Residual Degrees of Freedom (Res.Df): The residual degrees of freedom decrease from 1740 in Model 1 to 1739 in Model 2. This reduction by 1 reflects the addition of one parameter (the interaction term) in going from Model 1 to Model 2.
  • Residual Sum of Squares (RSS): The RSS, also referred to as sum of squares error (SSE), which measures the total variance in the outcome not explained by the model, decreases from 4598847 in Model 1 to 4585820 in Model 2. This reduction indicates that Model 2, by including the interaction term, explains more variance in the outcome than Model 1.
  • Difference in Degrees of Freedom (Df): The difference in df between the two models is 1, corresponding to the addition of the interaction term in Model 2.
  • Sum of Squares (Sum of Sq): This value, 13027.48, represents the decrease in RSS attributable to adding the interaction term to the model (i.e., the difference in RSS between Models 1 and 2). It quantifies the additional variance in the outcome explained by considering the interaction between the two condition variables.
  • F-statistic (F): The F-statistic of 4.940184 tests whether the additional variance explained by the interaction term is significantly greater than what would be expected by chance. It compares the model fit improvement per additional parameter (in this case, the interaction term) to the unexplained variance, per residual degree of freedom, in the more complex model, as shown in the quick check after this list.
  • p-value (Pr(>F)): The p-value of 0.02636741 indicates that the improvement in model fit due to adding the interaction term is statistically significant at the 5% significance level (i.e., alpha = .05). This suggests that the interaction between the two variables has a meaningful impact on the outcome.
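
To see where this F value comes from, we can plug the table quantities into the partial F formula directly; this is only a quick check, and small differences arise from rounding of the displayed values:

13027.48 / (4585820 / 1739)   # (drop in RSS / 1) / (RSS of Model 2 / 1739), approximately 4.94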

It is of interest to note that the partial F-test in this example essentially tests the same hypothesis as the t-test for the interaction term coefficient in Model 2. The equivalence of these tests is evident in the matching p-value of 0.02637 for both the interaction term’s significance in Model 2 and the partial F-test comparing Model 1 and Model 2. Furthermore, squaring the t-statistic for the interaction term in Model 2 yields the F-statistic observed in the partial F-test (i.e., \((-2.222653)^2 = 4.940184\)), reinforcing their conceptual similarity and the direct relationship between these statistics in testing the importance of the interaction term.

This analysis underscores the value of including interaction terms in regression models when hypothesizing that the effect of one predictor on the outcome variable may depend on the level of another predictor. The significant p-value associated with the interaction term, mirrored by the partial F-test, indicates that the effect of viewing a CI, rather than a PI, on WTP is indeed moderated by whether or not the figure has additional text to describe both types of intervals, offering nuanced insights into how these factors jointly influence participants’ WTP.

We can use the plot_predictions() function from the marginaleffects package to plot the results.

plot_predictions(model2, by = c("interval_CI", "text_extra")) + 
  theme_minimal() + 
  labs(title = "Differential effect of viewing a CI (versus a PI) when \nextra/supplemental text is offered",
       y = "Willingness to pay (in ice dollars)",
       x = "Saw a CI (0 = No, 1 = Yes)",
       color = "Saw extra text (0 = No, 1 = Yes)") +
  theme(legend.position = "bottom")

Recap of these ideas

To recap these ideas, please take a moment to watch the following two Crash Course Statistics videos — one that focuses on ANOVA and the other that considers group interactions.

Wrap up

Hypothesis testing is crucial because it provides a structured framework for making inferences about populations based on sample data. It allows researchers to test theories, validate assumptions, and make data-driven decisions. The core idea is to determine whether the observed data provide sufficient evidence to reject a null hypothesis, which typically represents a default or no-effect scenario.

Core Components

  1. Null and Alternative Hypotheses: Establishing these hypotheses is foundational. The null hypothesis assumes no effect or no difference, serving as a baseline. The alternative hypothesis suggests the presence of an effect or difference. Understanding these concepts is vital for framing research questions and interpreting results.

  2. Type I and Type II Errors: These errors highlight the risks in hypothesis testing. A Type I error (false positive) occurs when the null hypothesis is incorrectly rejected, while a Type II error (false negative) happens when a false null hypothesis is not rejected. Recognizing these errors is crucial for understanding the limitations and reliability of test results.

  3. Alpha and p-Value: The alpha level (often set at 0.05) represents the threshold for rejecting the null hypothesis. The p-value indicates the probability of obtaining the observed results, or more extreme, if the null hypothesis is true. These metrics are central to determining statistical significance and making informed decisions.

Practical Application and Interpretation

Conducting hypothesis tests using statistical software (like R) involves calculating test statistics and p-values to evaluate hypotheses. By interpreting these results, researchers can draw meaningful conclusions about their data. For instance, in the context of the Hofman, Goldstein, and Hullman study, understanding how different visualizations affect participants’ willingness to pay provides insights into effective communication of statistical information.

Broader Implications

Hypothesis testing extends beyond academic research. It is widely used in various fields such as medicine, economics, and social sciences to inform policy decisions, clinical trials, and business strategies. By providing a rigorous method for evaluating evidence, hypothesis testing helps ensure that conclusions are based on data rather than assumptions or biases.

Conclusion

The principles and procedures of NHST covered in this Module are fundamental for conducting rigorous statistical analyses. By mastering these concepts, we can enhance our ability to critically evaluate data, draw valid conclusions, and contribute to evidence-based practices in our respective fields. This Module’s insights into hypothesis testing not only build technical skills but also foster a deeper appreciation for the scientific method and its role in advancing knowledge.

Footnotes

  1. There are several great templates for choosing a test statistic. This one from UCLA’s Statistical Methods and Data Analytics group is helpful.