library(skimr)
library(gtsummary)
library(gt)
library(here)
library(tidyverse)
Understanding Populations through Sampling and Descriptive Statistics
Module 7
Learning objectives
- Contrast the difference between a population and a sample
- Describe the importance of random sampling
- Identify the benefits of calculating descriptive statistics from a representative sample
- Provide a general overview of survey research
- Efficiently compute descriptive statistics in R
- Create publication-ready tables to depict descriptive statistics
- Create and interpret cross-tabulations of two qualitative variables
- Practice replicating the findings of a published paper
- Provide an example of how survey research can identify an unfolding crisis
Overview
Scientific research typically aims to achieve one or more of the following objectives: description, prediction, or causal inference. Each goal necessitates distinct methodologies concerning study design, data collection, and analysis. These points are well articulated by Ellen Hamaker and colleagues in this paper.
Description involves observing and characterizing phenomena as they naturally occur, without attempting to influence the outcomes. This approach often answers “what” is happening within a given context and relies heavily on data collection techniques like surveys and observational studies. For example, a Cognitive or Industrial Organizational Psychologist might explore the prevalence of Alzheimer’s disease among older adults in the U.S., and examine potential differences as a function of occupation. This descriptive analysis helps establish patterns or relationships between variables but does not imply causation. That is, if differences in the prevalence of Alzheimer’s disease across different occupations is observed in the descriptive statistics — this does not provide evidence in itself that certain occupations cause (or prevent) Alzheimer’s disease.
Prediction focuses on forecasting future outcomes based on existing data. This requires models that utilize historical data to predict future events or states. Such models are particularly valuable in scenarios where predicting an outcome can lead to early intervention or improved decision-making. For instance, a Clinical Psychologist might be interested in finding the key predictors of therapeutic alliance between a client and a practitioner. It’s important to note however that establishing a certain variable as a reliable predictor of a certain outcome does not provide evidence in itself that the predictor causes the outcome.
Causal Inference seeks to determine the causal effect of one variable on another, establishing a cause-and-effect relationship. This goal surpasses mere description (as in descriptive studies) or correlation (as in predictive studies) by aiming to demonstrate that changes in one variable directly result in changes in another. Experimental designs, especially randomized controlled trials (RCTs), are considered the gold standard for causal inference because they eliminate confounding variables—factors that affect both the independent variable and the outcome. In an RCT, random assignment to treatment or control groups ensures no background variables are associated with the treatment assignment, eliminating the risk of confounding under ideal circumstances. For instance, a Health Psychologist may use an RCT to evaluate a new behavioral intervention for quitting smoking, ensuring any observed effects are due to the intervention itself. In cases where variables cannot be ethically or practically manipulated, such as gender or naturally occurring exposures, researchers must rely on observational studies. These studies are more susceptible to confounding variables, which can bias results. To address this, researchers use techniques like stratification, matching, and statistical models to adjust for potential confounders. For example, an epidemiologist might use longitudinal data to study the impact of air pollution on respiratory health, adjusting for confounders like age, smoking habits, and socio-economic status to more accurately estimate the causal effect. In summary, while RCTs are ideal for establishing causality by controlling for confounding variables, observational studies require rigorous methods to mitigate confounding and achieve valid causal inferences.
Throughout the course, we will explore each of these research goals in depth, utilizing various data types and research designs. By understanding the strengths and limitations of each approach, you will be better equipped to choose appropriate methods for your research questions and critically assess the research of others. This foundational knowledge not only enhances your analytical skills but also prepares you for more advanced topics in modeling and statistical inference, which we will cover in upcoming Modules.
In this Module, we delve into descriptive statistics. This involves summarizing datasets to uncover patterns and insights that inform our understanding of a subject.
Introduction to descriptive statistics
In data analysis, descriptive statistics is a critical tool for understanding and interpreting data. It provides methods for gathering, summarizing, and reporting essential information about a population, helping to identify underlying patterns and inform effective solutions. This module focuses on the role of survey research in understanding population characteristics, highlighting how descriptive statistics facilitates the transition from raw data to meaningful insights.
Descriptive statistics is fundamental to research in the social and behavioral sciences, serving as the initial step in data interpretation across various domains, including mental health prevalence, cognitive development, and social behavior. For example, a scientist studying disease prevention may use descriptive statistics to identify the prevalence of a disease, monitor its trends over time, or identify demographic groups at higher risk. These straightforward summaries offer a clear overview of the data, uncovering valuable insights about the population under study. Well-crafted data descriptions ultimately enable informed decision-making and the development of strategies to address relevant issues.
Survey research
In this Module, we’ll explore descriptive statistics in the context of survey research, a commonly employed method for gathering data from a sample to infer attributes of the entire population. Survey research — a systematic approach to collect data from a random subset of a population — serves as a valuable tool to comprehend the attributes of that population as a whole. Scientists and practitioners alike widely use this method for its ability to efficiently amass information on a large scale. Surveys, conducted through questionnaires, interviews, or digital platforms, can capture both quantitative and qualitative data.
The United States regularly conducts several national surveys to understand the nation’s physical and mental health. Here are a few noteworthy examples:
National Health Interview Survey (NHIS)
The National Health Interview Survey (NHIS) is one of the major data collection programs of the National Center for Health Statistics (NCHS), which is part of the Centers for Disease Control and Prevention (CDC). The NHIS has been conducted annually since 1957 and provides critical data on various health topics, including illness, chronic conditions, access to healthcare services, and health behaviors. The NHIS collects data through in-person household interviews, and its findings are used for health policy-making, research, and monitoring trends in health conditions and behaviors.
Behavioral Risk Factor Surveillance System (BRFSS)
The Behavioral Risk Factor Surveillance System (BRFSS) is an ongoing telephone-based survey conducted by the CDC. The BRFSS collects data from adults in the United States regarding their health-related behaviors, chronic health conditions, and preventive services. Topics covered in the BRFSS include alcohol consumption, tobacco use, diet, physical activity, and management of chronic diseases. It is one of the largest health surveys in the world and provides vital information for developing and evaluating health promotion programs.
National Survey on Drug Use and Health (NSDUH)
Conducted by the Substance Abuse and Mental Health Services Administration (SAMHSA), the National Survey on Drug Use and Health (NSDUH) provides data on the use of tobacco, alcohol, illicit drugs, and mental health in the United States. It is conducted annually and involves interviews with individuals aged 12 and older. NSDUH data is critical for monitoring trends in substance use and mental health, and for developing effective policies and programs aimed at prevention and treatment.
National Health and Nutrition Examination Survey (NHANES)
The National Health and Nutrition Examination Survey (NHANES) is conducted by the National Center for Health Statistics (NCHS) and is unique in combining interviews with physical examinations. NHANES examines a nationally representative sample of about 5,000 persons each year, and the participants are located in counties across the country. The data collected provides a broad picture of the health and nutritional status of the U.S. population, and is used to assess the prevalence and risk factors for major diseases.
Monitoring the Future (MTF)
The Monitoring the Future Study is another important survey in the United States that focuses on the behaviors, attitudes, and values of American secondary school students, college students, and young adults. The survey is conducted annually and is sponsored by the National Institute on Drug Abuse (NIDA), a part of the National Institutes of Health (NIH). The University of Michigan’s Institute for Social Research has been the primary investigator since the inception of the survey in 1975. The main emphasis of Monitoring the Future is on tracking and understanding trends in drug use among adolescents. However, it goes beyond just drug use and also examines behaviors and attitudes related to alcohol and tobacco use, mental health, and other factors affecting youth health and well-being. A companion study called the Our Youth Our Future (OYOF) Study employs a similar methodology but focuses specifically on U.S. schools on or near American Indian researvations. The OYOF Study is conducted by scientists at Colorado State University’s Tri-Ethnic Center for Prevention Research.
All of these surveys are instrumental in providing policymakers, researchers, and healthcare professionals with the information needed to monitor the health of the population and develop strategies for improving health outcomes.
Sampling
One of the initial steps in survey research is to define the population of interest. The population refers to the entire group about which the researcher wants to gain insights. For instance, the National Survey on Drug Use and Health (NSDUH) Study described above defines their population as all U.S. residents 12 years and older. The Monitoring the Future (MTF) study defines their population as all adolescents in 8th, 10th or 12th grade in the U.S. As a final example, a researcher studying the effects of a new educational curriculum may define the population as all the students in a particular school district.
It is crucial to define the population accurately to ensure that the results are relevant and can be generalized to the group of interest.
As surveying the entire population is usually not feasible due to constraints of time, money and resources, researchers use a process called sampling to select a subset of the population. One of the most effective methods of sampling is random sampling. In random sampling, each member of the population has an equal chance of being selected. This ensures that the sample is representative of the population.
Representativeness is key in survey research. A representative sample reflects the characteristics of the population closely. For instance, if the population of interest has a certain percentage of men and women, the sample should ideally have a similar percentage. The more representative the sample is, the more confidently the results can be generalized to the entire population.
Please take a moment to watch this short video from the Pew Resesarch Center on random sampling.
Indeed, the benefits of survey research can be immense, but it’s important to recognize that drawing a random sample from a population and conducting survey research is a complex process with many challenges. These can include:
Obtaining the Sampling Frame: A sampling frame is a list of all the elements within the population from which the sample is drawn. Obtaining an accurate and complete sampling frame is often difficult. For example, a national survey might aim to use a list of residents, but such lists may be outdated or incomplete. Inadequate sampling frames can lead to selection bias if certain segments of the population are systematically excluded.
Unit Non-response: This occurs when selected respondents do not participate in the survey at all. It may be due to refusal, inability to contact them, or other reasons. This type of non-response can introduce bias if the non-respondents differ systematically from those who do participate. For example, if a health survey is conducted via the internet, individuals without internet access (who may also be more likely to have health issues) are less likely to participate, leading to biased results.
Item Non-response: Even when respondents participate in a survey, they might not respond to every question. This can occur for various reasons such as not understanding the question, finding it too personal, or simply overlooking it. Item non-response can affect the quality of the data and lead to biased estimates if the missing data is not handled properly.
Misreporting: Respondents may provide inaccurate information, either unintentionally or deliberately. For example, in health surveys, individuals might under-report unhealthy behaviors due to social desirability bias or simply forget the details. Misreporting can affect the validity of the survey data and lead to incorrect conclusions.
Sample Size and Representativeness: A random sample should be large enough and representative of the population to yield reliable results. However, ensuring representativeness can be challenging. If certain groups are under-represented in the sample, the results might not accurately reflect the characteristics of the whole population.
Cost and Logistical Challenges: Survey research can be costly and time-consuming, especially if it involves large samples and extensive data collection methods. Logistics, such as contacting individuals, administering the survey, and processing responses, require careful planning and resources.
Changing Technology and Communication: With the advent of new communication technologies, traditional survey methods like telephone and mail surveys may no longer be as effective. People’s preferences for communication are changing, and survey methods must adapt accordingly.
In conclusion, drawing a random sample and conducting survey research requires meticulous planning and execution to overcome the challenges associated with sampling frames, non-responses, misreporting, and ensuring representativeness. Researchers need to be vigilant about these challenges and employ strategies to mitigate potential biases and errors in the data. However, when done correctly survey research allows scientists to examine and understand the characteristics of a particular population, and the results of this work help scientists, policy makers and practitioners to make informed conclusions and decisions about the survey topics that have broader implications for the population of interest.
In this Module, we will study how survey research helped to identify a concerning trend of increasing teen depression over the past two decades.
Introduction to the data
We will explore data from the aforementioned National Survey on Drug Use and Health (NSDUH). It is an annual survey conducted in the United States, focusing on the prevalence, patterns, and consequences of alcohol, tobacco, and illegal drug use and misuse, as well as mental health disorders among individuals aged 12 and older. The Substance Abuse and Mental Health Services Administration (SAMHSA), a branch of the U.S. Department of Health and Human Services, sponsors the NSDUH.
Here are some key aspects of the NSDUH:
Objective and Scope: The primary objective of NSDUH is to provide comprehensive data on substance use and mental health issues at the national, state, and sub-state levels. The information gathered is critical for monitoring trends, estimating the need for treatment and informing policy and programs.
Sampling and Methodology: NSDUH employs a multistage area probability sample to select participants. The goal is to create a representative sample of the U.S. civilian, non-institutionalized population aged 12 and older. The survey involves face-to-face interviews conducted in the respondents’ homes. The NSDUH also incorporates Audio Computer-Assisted Self-Interviewing (ACASI) technology, which allows respondents to read or listen to questions via headphones and enter responses directly into a computer, ensuring privacy and increasing the likelihood of accurate reporting on sensitive topics.
Content: The NSDUH includes questions about the use of various substances, including but not limited to: Alcohol, Tobacco products, Marijuana, Cocaine, Heroin, Prescription drugs (used non-medically), and Other illicit drugs. In addition, it covers mental health issues, including depression, anxiety, and other mental health disorders. The survey also assesses the co-occurrence of substance use and mental disorders.
Reporting and Use of Data: After data collection and analysis, SAMHSA releases a series of detailed reports on the findings of the NSDUH. These reports provide essential insights for policymakers, healthcare providers, researchers, and public health officials. The data is used to identify substance use and mental health issues, inform policy, allocate resources, and develop programs to address these challenges. For example, click here to peruse reports written using the data or here for the a recent overview of findings from 2021. The data files are also made publicly available for scientists and practitioners to use.
In summary, the National Survey on Drug Use and Health is an invaluable resource for understanding the patterns and consequences of substance use and mental health issues in the United States. Its data plays a pivotal role in the development and evaluation of public health policies and programs.
In this Module, we’ll focus on NSDUH data that pertain to adolescent depression, focusing on survey respondents between the ages of 12 and 17. We will replicate some of the findings presented in a recent paper published in the Journal of Adolescent Health by Michael Daly, Ph.D.
As described in the paper:
Depression is a leading cause of impairment and disability globally and major contributor to suicidal behavior. The prevalence of fatal suicide among U.S. adolescents and young adults increased by 57.4% between 2007 and 2018. This trend has been attributed to a range of potential causes including increased prevalence of anxiety and depressive disorders over this period. Of particular concern is a rise in the prevalence of major depression reported by U.S. adolescents since 2010. Although national trends in depression have been documented, reliable up-to-date estimates of trends among key subpopulations are needed. To address this gap, this study drew on nationally representative data to estimate temporal trends in the prevalence of past-year major depressive episode (MDE) across sex, race/ethnicity, and household income groups between 2009 and 2019.
In this Module, we will take a look at the prevalence of Major Depressive Episodes (MDE) across sex from 2009 to 2019.
Before we begin, let’s load the libraries that we will need for the next tasks:
The data frame that we will use includes the following variables:
Variable | Description |
---|---|
record_id | Unique ID for participant |
year | Survey year |
sex | Biological sex of respondent |
age | Age of respondent |
raceeth | Race/ethnicity of respondent |
mde_past_year | Did the respondent meet criteria for a major depressive episode in the past-year? negative = did not meet criteria, positive = did meet criteria. |
mde_lifetime | Did the respondent meet criteria for a major depressive episode in their lifetime? negative = did not meet criteria, positive = did meet criteria. |
mh_sawprof | Did the respondent see a mental health care professional in the past-year? |
severity_chores | If mde_pastyear is positive, how severely did depression interfere with doing home chores? |
severity_work | If mde_pastyear is positive, how severely did depression interfere with school/work? |
severity_family | If mde_lifetime is positive, how severely did depression interfere with family relationships? |
severity_chores | If mde_pastyear is positive, how severely did depression interfere with social life? |
mde_lifetime_severe | If mde_pastyear is positive, was the MDE classified as severe? |
substance_disorder | Did the respondent meet criteria for a past-year substance use disorder (alcohol or drug)? — available for 2019 data only |
adol_weight | Sampling weight - from sampling protocol |
vestr | Stratum — from sampling protocol |
verep | Primary sampling unit — from sampling protocol |
This is a really large data frame, let’s import the whole data frame — and then print out the first 10 rows.
<- read_rds(here("data", "NSDUH_adol_depression.Rds"))
nsduh_20092019 |> head(n = 10) nsduh_20092019
To gain a little more initial insight and familiarize ourselves with the data frame, here is a glimpse of the data frame.
|> glimpse() nsduh_20092019
Rows: 172,183
Columns: 17
$ record_id <chr> "2009_60065143", "2009_12006143", "2009_97316143",…
$ year <dbl> 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 20…
$ sex <chr> "male", "female", "male", "male", "male", "female"…
$ age <dbl> 14, 17, 16, 17, 12, 17, 15, 13, 15, 16, 12, 15, 15…
$ raceeth <chr> "Hispanic", "Black", "Non-hispanic White", "Non-hi…
$ mde_lifetime <chr> NA, "negative", "negative", "negative", "negative"…
$ mde_pastyear <chr> NA, "negative", "negative", "negative", "negative"…
$ mde_pastyear_severe <chr> NA, "negative", "negative", "negative", "negative"…
$ mh_sawprof <chr> NA, NA, NA, NA, NA, NA, NA, "did not receive care"…
$ severity_chores <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ severity_work <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ severity_family <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ severity_social <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ substance_disorder <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ adol_weight <dbl> 3524.60777, 1951.52937, 881.28399, 284.11492, 632.…
$ vestr <dbl> 30017, 30012, 30019, 30017, 30027, 30058, 30031, 3…
$ verep <dbl> 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 2,…
A skim of the data frame is also useful for getting familiar with the structure.
|> skim() nsduh_20092019
Name | nsduh_20092019 |
Number of rows | 172183 |
Number of columns | 17 |
_______________________ | |
Column type frequency: | |
character | 8 |
numeric | 9 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
record_id | 0 | 1.00 | 13 | 13 | 0 | 172183 | 0 |
sex | 0 | 1.00 | 4 | 6 | 0 | 2 | 0 |
raceeth | 0 | 1.00 | 5 | 31 | 0 | 7 | 0 |
mde_lifetime | 3731 | 0.98 | 8 | 8 | 0 | 2 | 0 |
mde_pastyear | 4400 | 0.97 | 8 | 8 | 0 | 2 | 0 |
mde_pastyear_severe | 4465 | 0.97 | 8 | 8 | 0 | 2 | 0 |
mh_sawprof | 144086 | 0.16 | 13 | 20 | 0 | 2 | 0 |
substance_disorder | 158786 | 0.08 | 8 | 8 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1.0 | 2013.60 | 3.15 | 2009.00 | 2011.00 | 2013.00 | 2016.00 | 2019.00 | ▇▅▃▅▃ |
age | 0 | 1.0 | 14.56 | 1.69 | 12.00 | 13.00 | 15.00 | 16.00 | 17.00 | ▇▅▅▅▅ |
severity_chores | 154761 | 0.1 | 5.54 | 2.39 | 1.00 | 4.00 | 6.00 | 7.00 | 10.00 | ▃▅▇▆▃ |
severity_work | 154400 | 0.1 | 5.82 | 2.51 | 1.00 | 4.00 | 6.00 | 8.00 | 10.00 | ▃▅▇▇▅ |
severity_family | 154264 | 0.1 | 6.29 | 2.53 | 1.00 | 5.00 | 6.00 | 8.00 | 10.00 | ▂▅▇▇▆ |
severity_social | 154177 | 0.1 | 6.34 | 2.52 | 1.00 | 5.00 | 7.00 | 8.00 | 10.00 | ▂▅▇▇▇ |
adol_weight | 0 | 1.0 | 1586.47 | 1425.59 | 1.04 | 571.59 | 1222.21 | 2150.64 | 22959.61 | ▇▁▁▁▁ |
vestr | 0 | 1.0 | 34782.51 | 4991.52 | 30001.00 | 30029.00 | 30058.00 | 40024.00 | 40050.00 | ▇▁▁▁▇ |
verep | 0 | 1.0 | 1.50 | 0.50 | 1.00 | 1.00 | 2.00 | 2.00 | 2.00 | ▇▁▁▁▇ |
Please take a moment to read through these basic descriptions just presented to familiarize yourself with the data frame and variables within it.
With some basic understanding of the data frame — let’s now dig in and describe the data more thoroughly. Notice from our summaries above that the data frame includes 172,183 adolescents. These adolescents were surveyed during one year sometime between 2009 and 2019. We can use the group_by() and tally() functions from the tidyverse to determine how many adolescents participated each year.
Calculate the number of participants at each wave
The functions group_by() and tally() serve to count and display the number of adolescents who took part in the survey each year.
|>
nsduh_20092019 group_by(year) |>
tally()
For example, we see that in 2009, 17,527 adolescents participated in the interview.
Calculate the number of male and female participants at each wave
What if we wanted to determine how many males and females participated each year? Simple — just add sex to the group_by() function code:
|>
nsduh_20092019 group_by(year, sex) |>
tally()
For example, here we see that in 2009, 8,650 participants reported being female and 8,877 reported being male.
Create a publication-ready table of participation
The group_by() and tally() functions are great for obtaining descriptive summaries as you comb through and conduct your analysis. But, you may want a table that is more publication-ready. The gtsummary package is an excellent resource for this endeavor. Let’s see how it can work for our present application. We’ll use the tbl_summary() function here.
|>
nsduh_20092019 select(year, sex) |>
tbl_summary(by = sex)
Characteristic | female N = 84,3771 |
male N = 87,8061 |
---|---|---|
year | 2,013.0 (2,011.0, 2,016.0) | 2,013.0 (2,011.0, 2,016.0) |
1 Median (Q1, Q3) |
In the code chunk, we selected our two variables of interest (year and sex), then we requested a summary table of this subsetted data frame, stratified by sex. The table relies just on the defaults of tbl_summary() — and the default for a numeric, continuous variable1 is to print the median and interquartile range (IQR). This isn’t terribly useful in this context. So, we need to add customization to get a more useful table. There are many options available to create a table that is right for your application. In this Module we’ll study a few, and you can explore on your own if you are so inclined. Here’s a new version of the graph that adds customization.
|>
nsduh_20092019 select(year, sex) |>
tbl_summary(by = sex,
type = list(year ~ "categorical"),
label = year ~ "Year of Survey") |>
as_gt() |>
tab_header(title = md("**Table 1. Number of participants by year and sex**"))
Table 1. Number of participants by year and sex | ||
---|---|---|
Characteristic | female N = 84,3771 |
male N = 87,8061 |
Year of Survey | ||
2009 | 8,650 (10%) | 8,877 (10%) |
2010 | 9,049 (11%) | 9,345 (11%) |
2011 | 9,383 (11%) | 9,881 (11%) |
2012 | 8,613 (10%) | 8,786 (10%) |
2013 | 8,617 (10%) | 9,119 (10%) |
2014 | 6,690 (7.9%) | 6,910 (7.9%) |
2015 | 6,677 (7.9%) | 6,908 (7.9%) |
2016 | 6,984 (8.3%) | 7,288 (8.3%) |
2017 | 6,672 (7.9%) | 7,050 (8.0%) |
2018 | 6,501 (7.7%) | 6,786 (7.7%) |
2019 | 6,541 (7.8%) | 6,856 (7.8%) |
1 n (%) |
Here we see that for each year, tbl_summary() provides the number of adolescents who completed the survey as well as the percentage of the column. For example, in 2009, 8,650 females completed the survey — which is 10% of the females who completed the survey across the 11 years (2009 to 2019) - i.e., 8,650/84,377 = 10%).
Let’s break down the code to create this table step by step:
nsduh_20092019
: This is the data frame and it is piped,|>
, into the subsequent function.select(year, sex)
: The select() function is used to select two variables from the data frame, year and sex. This indicates that the subsequent analysis will only focus on these two variables — this is important here because relying on the defaults, tbl_summary() will create a table that includes all of the variables being fed into the function.tbl_summary(by = sex, type = list(year ~ "categorical"), label = year ~ "Year of Survey")
: The tbl_summary() function comes from the gtsummary package. The function takes several arguments:by = sex
: This argument indicates that the summary table should be stratified by the sex variable, meaning there will be separate summaries for each sex.type = list(year ~ "categorical")
: This argument specifies that the year variable should be treated as a categorical variable in the summary table. When excluded (as you can see in the first version of the table), tbl_summary() treats this variable as a continuous variable and provides summary statistics (e.g., the median and interquartile range is the default for continuous, numeric variables).label = year ~ "Year of Survey"
: This argument renames the label of the year variable to “Year of Survey” in the summary table.
as_gt()
: This converts the summarized table to a gt object. The gt package offers countless methods for further customization. Here, we use the tab_header() function from the package to add a title. The md() function, which is part of gt renders text as Markdown2. In this context,md("**Table 1. Number of participants by year and sex**")
converts the string into Markdown format, allowing for the use of Markdown syntax to style the text. Here, the double asterisks around the text make it bold. If interested, you can learn a lot more about Markdown here.
Explore the 2019 data
Before we dive into replicating the Daly paper mentioned above, and estimating the proportion of male and female adolescents who experienced a MDE in the past-year — let’s explore just the 2019 data a bit. We will begin by creating a version of the data frame that includes just year 2019.
<-
nsduh_2019 |>
nsduh_20092019 filter(year == 2019)
A table of all relevant variables
Using the 2019 data frame, let’s request a quick summary of all of the relevant variables in the data frame using the tbl_summary() function again. Notice that now, we select several variables to describe, but we do not stratify by sex. This will just give us basic information about the variables collected in 2019 as a whole. Study the output below.
|>
nsduh_2019 select(sex, age, raceeth, starts_with("MDE"), starts_with("severity"), mh_sawprof, substance_disorder) |>
tbl_summary()
Characteristic | N = 13,3971 |
---|---|
sex | |
female | 6,541 (49%) |
male | 6,856 (51%) |
age | |
12 | 2,086 (16%) |
13 | 2,247 (17%) |
14 | 2,222 (17%) |
15 | 2,323 (17%) |
16 | 2,293 (17%) |
17 | 2,226 (17%) |
raceeth | |
Asian | 550 (4.1%) |
Black | 1,781 (13%) |
Hispanic | 3,186 (24%) |
Multiracial | 747 (5.6%) |
Native American | 206 (1.5%) |
Native Hawaiin/Pacific Islander | 64 (0.5%) |
Non-hispanic White | 6,863 (51%) |
mde_lifetime | |
negative | 10,146 (78%) |
positive | 2,869 (22%) |
Unknown | 382 |
mde_pastyear | |
negative | 10,852 (84%) |
positive | 2,098 (16%) |
Unknown | 447 |
mde_pastyear_severe | |
negative | 11,439 (88%) |
positive | 1,497 (12%) |
Unknown | 461 |
severity_chores | 6.00 (4.00, 7.00) |
Unknown | 11,432 |
severity_work | 6.00 (4.00, 8.00) |
Unknown | 11,400 |
severity_family | 6.00 (5.00, 8.00) |
Unknown | 11,415 |
severity_social | 7.00 (5.00, 8.00) |
Unknown | 11,396 |
mh_sawprof | |
did not receive care | 1,777 (61%) |
received care | 1,114 (39%) |
Unknown | 10,506 |
substance_disorder | |
negative | 12,741 (95%) |
positive | 656 (4.9%) |
1 n (%); Median (Q1, Q3) |
Add variable labels to the table
This is a fine table, however, the variable names are recorded. Instead, a label that describes the variable would be preferred. We can do this in one of two ways. First, we can add labels directly to our tbl_summary() function. Here’s an example of this approach:
|>
nsduh_2019 select(sex, age, raceeth, starts_with("MDE"), starts_with("severity"), mh_sawprof, substance_disorder) |>
tbl_summary(label = list(
~ "Sex",
sex ~ "Age",
age ~ "Race/ethnicity",
raceeth ~ "Lifetime prevalence of a Major Depressive Episode",
mde_lifetime ~ "Past-year prevalence of a Major Depressive Episode",
mde_pastyear ~ "Past-year prevalence of a severe Major Depressive Episode",
mde_pastyear_severe ~ "MDE-caused interference with doing home chores",
severity_chores ~ "MDE-caused interference with ability to function at school/work",
severity_work ~ "MDE-caused interference with family relationships",
severity_family ~ "MDE-caused interference with social life",
severity_social ~ "Saw a professional for mental health care in past-year",
mh_sawprof ~ "Past-year prevalence of a substance use disorder"
substance_disorder ))
Characteristic | N = 13,3971 |
---|---|
Sex | |
female | 6,541 (49%) |
male | 6,856 (51%) |
Age | |
12 | 2,086 (16%) |
13 | 2,247 (17%) |
14 | 2,222 (17%) |
15 | 2,323 (17%) |
16 | 2,293 (17%) |
17 | 2,226 (17%) |
Race/ethnicity | |
Asian | 550 (4.1%) |
Black | 1,781 (13%) |
Hispanic | 3,186 (24%) |
Multiracial | 747 (5.6%) |
Native American | 206 (1.5%) |
Native Hawaiin/Pacific Islander | 64 (0.5%) |
Non-hispanic White | 6,863 (51%) |
Lifetime prevalence of a Major Depressive Episode | |
negative | 10,146 (78%) |
positive | 2,869 (22%) |
Unknown | 382 |
Past-year prevalence of a Major Depressive Episode | |
negative | 10,852 (84%) |
positive | 2,098 (16%) |
Unknown | 447 |
Past-year prevalence of a severe Major Depressive Episode | |
negative | 11,439 (88%) |
positive | 1,497 (12%) |
Unknown | 461 |
MDE-caused interference with doing home chores | 6.00 (4.00, 7.00) |
Unknown | 11,432 |
MDE-caused interference with ability to function at school/work | 6.00 (4.00, 8.00) |
Unknown | 11,400 |
MDE-caused interference with family relationships | 6.00 (5.00, 8.00) |
Unknown | 11,415 |
MDE-caused interference with social life | 7.00 (5.00, 8.00) |
Unknown | 11,396 |
Saw a professional for mental health care in past-year | |
did not receive care | 1,777 (61%) |
received care | 1,114 (39%) |
Unknown | 10,506 |
Past-year prevalence of a substance use disorder | |
negative | 12,741 (95%) |
positive | 656 (4.9%) |
1 n (%); Median (Q1, Q3) |
Alternatively, we can add variable labels directly to the data frame — and then request the table. For this, we need the labelled package, specifically it’s set_variable_labels() function. Labelling all of your variables up front can save you time, because, once added, any time you use the data frame — the labels will be printed out.
To use the set_variable_labels() function, simply list the variable name, followed by =
and the label you want to provide in quotations. Before applying labels to the full data frame — take a look at the simple example below to provide a label for year. The str() function gives details on the structure of an object — when I request this — you can see that year now has a label associated with it.
|>
nsduh_2019 ::set_variable_labels(
labelledyear = "Data collection year") |>
select(year) |>
str()
tibble [13,397 × 1] (S3: tbl_df/tbl/data.frame)
$ year: num [1:13397] 2019 2019 2019 2019 2019 ...
..- attr(*, "label")= chr "Data collection year"
Now, let’s apply labels to all of the variables.
<-
nsduh_2019 |>
nsduh_2019 ::set_variable_labels(
labelledrecord_id = "Participant ID",
year = "Data collection year",
sex = "Sex",
age = "Age",
raceeth = "Race/ethnicity",
mde_lifetime = "Lifetime Major Depressive Episode",
mde_pastyear = "Past-year Major Depressive Episode",
mde_pastyear_severe = "Past-year severe Major Depressive Episode",
severity_chores = "MDE-caused interference with doing home chores",
severity_work = "MDE-caused interference with ability to function at school/work",
severity_family = "MDE-caused interference with family relationships",
severity_social = "MDE-caused interference with social life",
mh_sawprof = "Saw a professional for mental health care in past-year",
substance_disorder = "Past-year substance use disorder",
adol_weight = "Sampling weight - from sampling protocol",
vestr = "Stratum --- from sampling protocol",
verep = "Cluster replicates --- from sampling protocol")
Once the labels are formatted, we can use a nice function from the labelled package called look_for() — this function will create a handy data dictionary.
|>
nsduh_2019 ::look_for() |>
labelledselect(variable, label)
I’ll show you one last trick. I like the datatable() function from the DT package for printing the look_for() table because it’s searchable. Check out the output below.
|>
nsduh_2019 ::look_for() |>
labelledselect(variable, label) |>
::datatable() DT
The searchable table above is nice to have in your analysis notebook because it provides a handy way for you or a reader of your report to search for variable names and their labels. This is particularly useful when you are reporting on a very large number of variables.
Moreover, because we added the variable labels directly to the data frame, when we use tbl_summary() to create a table, the variable labels from our data frame will be used. Let’s create the table one last time using the updated data frame with the labels. I will also add a title.
|>
nsduh_2019 select(sex, age, raceeth, starts_with("MDE"), starts_with("severity"), mh_sawprof, substance_disorder) |>
tbl_summary() |>
as_gt() |>
tab_header(
title = md("**Table 2. Descriptive statistics for key variables**"),
subtitle = "Data from the National Survey on Drug Use and Health, 2019"
)
Table 2. Descriptive statistics for key variables | |
---|---|
Data from the National Survey on Drug Use and Health, 2019 | |
Characteristic | N = 13,3971 |
Sex | |
female | 6,541 (49%) |
male | 6,856 (51%) |
Age | |
12 | 2,086 (16%) |
13 | 2,247 (17%) |
14 | 2,222 (17%) |
15 | 2,323 (17%) |
16 | 2,293 (17%) |
17 | 2,226 (17%) |
Race/ethnicity | |
Asian | 550 (4.1%) |
Black | 1,781 (13%) |
Hispanic | 3,186 (24%) |
Multiracial | 747 (5.6%) |
Native American | 206 (1.5%) |
Native Hawaiin/Pacific Islander | 64 (0.5%) |
Non-hispanic White | 6,863 (51%) |
Lifetime Major Depressive Episode | |
negative | 10,146 (78%) |
positive | 2,869 (22%) |
Unknown | 382 |
Past-year Major Depressive Episode | |
negative | 10,852 (84%) |
positive | 2,098 (16%) |
Unknown | 447 |
Past-year severe Major Depressive Episode | |
negative | 11,439 (88%) |
positive | 1,497 (12%) |
Unknown | 461 |
MDE-caused interference with doing home chores | 6.00 (4.00, 7.00) |
Unknown | 11,432 |
MDE-caused interference with ability to function at school/work | 6.00 (4.00, 8.00) |
Unknown | 11,400 |
MDE-caused interference with family relationships | 6.00 (5.00, 8.00) |
Unknown | 11,415 |
MDE-caused interference with social life | 7.00 (5.00, 8.00) |
Unknown | 11,396 |
Saw a professional for mental health care in past-year | |
did not receive care | 1,777 (61%) |
received care | 1,114 (39%) |
Unknown | 10,506 |
Past-year substance use disorder | |
negative | 12,741 (95%) |
positive | 656 (4.9%) |
1 n (%); Median (Q1, Q3) |
Calculate the proportion of participants with MDE by sex
Using the 2019 data frame, let’s calculate the proportion of male and female participants who met the criteria for Major Depressive Episode (MDE) — we’ll obtain these proportions for both lifetime and past-year MDE.
|>
nsduh_2019 select(mde_lifetime, mde_pastyear, sex) |>
tbl_summary(by = sex) |>
as_gt() |>
tab_header(
title = md("**Table 3. Prevalence of depression among U.S. adolescents by sex**"),
subtitle = "Data from the National Survey on Drug Use and Health, 2019"
)
Table 3. Prevalence of depression among U.S. adolescents by sex | ||
---|---|---|
Data from the National Survey on Drug Use and Health, 2019 | ||
Characteristic | female N = 6,5411 |
male N = 6,8561 |
Lifetime Major Depressive Episode | ||
negative | 4,352 (69%) | 5,794 (87%) |
positive | 1,994 (31%) | 875 (13%) |
Unknown | 195 | 187 |
Past-year Major Depressive Episode | ||
negative | 4,808 (76%) | 6,044 (91%) |
positive | 1,487 (24%) | 611 (9.2%) |
Unknown | 246 | 201 |
1 n (%) |
Study the table to determine what proportion of males and females experienced a past-year MDE.
Calculate the proportion who received mental health care by sex
Perhaps we’d like to examine the prevalence of mental health care by sex among those with a past-year MDE. We can do that with the following code:
|>
nsduh_2019 filter(mde_pastyear == "positive") |>
select(mh_sawprof, sex) |>
tbl_summary(by = sex) |>
as_gt() |>
tab_header(
title = md("**Table 4. Prevalence of mental health care among adolescents with a past-year MDE**"),
subtitle = "Data from the National Survey on Drug Use and Health, 2019"
)
Table 4. Prevalence of mental health care among adolescents with a past-year MDE | ||
---|---|---|
Data from the National Survey on Drug Use and Health, 2019 | ||
Characteristic | female N = 1,4871 |
male N = 6111 |
Saw a professional for mental health care in past-year | ||
did not receive care | 812 (55%) | 378 (62%) |
received care | 667 (45%) | 229 (38%) |
Unknown | 8 | 4 |
1 n (%) |
Examine severity of depression items
Building on the last example, let’s take a look at information about the severity of depression. To measure this construct (i.e., severity of depression) the level of functional impairment caused by poor mental health over the course of the past 12 months was assessed using the Sheehan Disability Scale (SDS). For adolescents, the SDS measures mental health-related impairment in four major life activities or role domains: chores at home; ability to do well at school or work; ability to get along with family; and ability to have a social life. For adolescents with a past-year MDE, respondents were instructed to think about the time in the past 12 months when problems with their mood were at their worst and use the 0 to 10 scale to describe how much their depression symptoms caused problems in these four areas.
|>
nsduh_2019 filter(mde_pastyear == "positive") |>
select(starts_with("severity_"), sex) |>
tbl_summary(by = sex,
statistic = list(starts_with("severity_") ~ "{mean} ({sd})"),
missing = "no") |>
as_gt() |>
tab_header(
title = md("**Table 5. Impacts of depression among adolescents with a past-year MDE**"),
subtitle = "Data from the National Survey on Drug Use and Health, 2019"
)
Table 5. Impacts of depression among adolescents with a past-year MDE | ||
---|---|---|
Data from the National Survey on Drug Use and Health, 2019 | ||
Characteristic | female N = 1,4871 |
male N = 6111 |
MDE-caused interference with doing home chores | 5.62 (2.36) | 5.59 (2.29) |
MDE-caused interference with ability to function at school/work | 5.88 (2.57) | 5.90 (2.54) |
MDE-caused interference with family relationships | 6.36 (2.45) | 5.84 (2.52) |
MDE-caused interference with social life | 6.48 (2.51) | 6.10 (2.56) |
1 Mean (SD) |
There are a couple of new options for tbl_summary() that I used in this chunk of code. First, notice that I used the statistic =
argument to specify that, for the severity variables, the mean and the standard deviation (in parentheses) should be printed. Recall that the default for a numeric continuous variable is the median and IQR. I also used the missing = "no"
argument to suppress printing the missing/unknown cases.
Use graphs to display the individual severity items
The table we just created is informative, but we can further explore these severity variables with a plot. Let’s create both a histogram and a density plot to display the distribution of each severity score for males and females. First though, because we want to facet the plots by the severity item (e.g., chores, school, family, social) we need to pivot the data to longer using the pivot_longer() function (which we learned about in Module 4). In this pivoted data frame — each adolescent will have up to 4 rows of data — one for each of the severity items that they reported on.
# put data into the right format
<-
severity_long |>
nsduh_2019 filter(mde_pastyear == "positive") |>
select(record_id, sex, starts_with("severity_")) |>
pivot_longer(cols = starts_with("severity_"), names_to = "variable", values_to = "score") |>
drop_na()
|>
severity_long head(n = 20)
# a histogram
|>
severity_long ggplot(mapping = aes(x = score, fill = sex)) +
geom_histogram(binwidth = 1) +
facet_wrap(~ variable) +
theme_minimal() +
labs(title = "Histogram of MDE-caused severity items for males and females",
x = "Severity Score")
# a density plot
|>
severity_long ggplot(mapping = aes(x = score, fill = sex)) +
geom_density(alpha = .5) +
facet_wrap(~ variable) +
theme_minimal() +
labs(title = "Density plot of MDE-caused severity items for males and females",
x = "Severity Score")
In the histogram, you’ll notice that the bars for females are considerably higher than those for males. This is because the histogram reflects not only the distribution but also the number of cases — in this instance, 24% of females compared to about 9% of males met the criteria for MDE and reported on these symptom severity items. That is, in the histogram, the height of the bars represents the number of cases (or frequency) within each score range. The higher frequency for females is visually evident in the taller bars. Thus, because histograms show raw counts, they help us understand not only the distribution but also the number of observations in each category.
On the other hand, the density plots appear similar for males and females. Density plots focus on the shape of the distribution rather than the actual counts. They provide a smoothed version of the histogram and are normalized so that the area under the curve sums to 1. This normalization means that the density plot for males and females can appear similar, even if the actual number of cases is different. The density plot allows us to compare the distribution shapes of severity scores between males and females without being influenced by the different sample sizes.
Both the histograms and density plots have benefits and drawbacks, and these are apparent in the examples we created. Histograms are excellent for visually capturing the difference in the number of cases between two groups, making the disparity in prevalence immediately apparent. However, the variation in counts can overshadow the underlying distribution of the data. Density plots, on the other hand, are useful for comparing the shapes of distributions without getting distracted by the difference in sample sizes. However, they do not portray the actual prevalence in the sample. In sum, histograms are great for understanding the magnitude, while density plots help in understanding the nature of distribution without the influence of sample size.
Let’s assume we prefer the density plot and want to improve its readability. Notice that the facets in the graph list the variable names, which might not be informative to the reader. Instead, we can apply more descriptive labels to the values in the data frame. An ideal way to assign these labels in R is to convert a variable to a factor.
In R, factors are a data type used to represent categorical data, useful for handling data with a fixed number of unique values (levels), such as gender or treatment groups. Factors are essential for statistical modeling and data visualization, as they allow R to understand and correctly handle categorical variables. Another major benefit is that you can use factors to assign specific labels to each level.
The factor() function in R converts a variable into a factor, which is a special type of vector. It takes a vector of values and assigns a set of unique levels to it. The factor() function can also label these levels with descriptive names. In the example below, the factor() function is used to create meaningful labels for the severity variables, making the data more understandable and ready for visualization.
Please study the code below. The factor() function converts the variable column into a factor. The levels
argument specifies the original column names, while the labels
argument provides descriptive labels for these levels. The ordering of the levels and labels should be consistent, so, for example, severity_chores (the first level listed) corresponds to MDE-caused interference with doing home chores (the first label listed).
Notice that in the newly created graph, an informative label, rather than the variable name, denotes each facet.
<-
severity_long |>
nsduh_2019 filter(mde_pastyear == "positive") |>
select(record_id, sex, starts_with("severity_")) |>
pivot_longer(cols = starts_with("severity_"), names_to = "variable", values_to = "score") |>
drop_na() |>
mutate(variable = factor(variable,
levels = c("severity_chores", "severity_work", "severity_family", "severity_social"),
labels = c("MDE-caused interference with doing home chores",
"MDE-caused interference with ability to function at school/work",
"MDE-caused interference with family relationships",
"MDE-caused interference with social life")))
|>
severity_long ggplot(mapping = aes(x = score, fill = sex)) +
geom_density(alpha = .5) +
facet_wrap(~ variable) +
theme_minimal() +
labs(title = "Density plot of MDE-caused severity items for males and females",
x = "Severity Score",
y = "Density",
fill = "Sex",
caption = "MDE = Major Depressive Episode")
Create a variable that combines the severity items into one scale
Instead of considering each severity item separately, we might have an interest in combining them to form a composite score. One common method for this is to take the average of the items. Let’s form a severity scale that takes the average of the 4 severity items. The article here shows a variety of ways to perform row-wise operations (i.e., averaging or summing across rows of your data frame) — I’ll demonstrate a straightforward way to accomplish this below using rowMeans() and pick().
<-
nsduh_2019_scale |>
nsduh_2019 filter(mde_pastyear == "positive") |>
select(sex, starts_with("severity_")) |>
mutate(severity_scale = rowMeans(pick(severity_chores, severity_work, severity_family, severity_social)))
|> select(starts_with("severity_")) |> head() nsduh_2019_scale
Here’s a breakdown of the code to create the scale score:
pick(severity_chores, severity_work, severity_family, severity_social)
: This expression uses the pick() function from the tidyverse to select certain variables from the existing data — i.e., the severity items.rowMeans(...)
: This function calculates the mean value of each row in the selected variables. This will return a single value for each row in the data frame, representing the mean of the selected variables for that row.mutate(severity_scale = ...)
: This adds a new column to the data. The new column/variable will be named severity_scale, and its values will be the row-wise means calculated by rowMeans().
In summary, this line of code is adding a new column to a data frame, where the value in each row of the new column is the mean of the values in the severity_chores, severity_work, severity_family, and severity_social columns for that row. It’s essentially calculating an average severity score across the four selected variables for each observation in the data frame.
If you’ve been careful and deliberate with naming the variables that will go into your scale, you can use a short cut that saves you from typing each variable name — check out the use of starts_with()
in the code below:
<-
nsduh_2019_scale |>
nsduh_2019 filter(mde_pastyear == "positive") |>
select(sex, starts_with("severity_")) |>
mutate(severity_scale = rowMeans(pick(starts_with("severity_"))))
|> select(starts_with("severity_")) |> head() nsduh_2019_scale
Notice that, by default, anyone who is missing one or more of the severity items will be assigned NA for the score. This is because the default for rowMeans() is to drop cases with missing data. If instead you desired to form the mean with whatever variables were present, you could add na.rm = TRUE
to the rowMeans() call as follows:
<-
nsduh_2019_scale |>
nsduh_2019 filter(mde_pastyear == "positive") |>
select(sex, starts_with("severity_")) |>
mutate(severity_scale = rowMeans(pick(starts_with("severity_")), na.rm = TRUE))
|> select(starts_with("severity_")) |> head() nsduh_2019_scale
Alternatively, with a little more coding you could specify that you’d like the scale to be formed only if a certain number of the items are observed for the individual — for example, the code below requires that an individual answered at least 3 of the 4 severity items in order for the scale to be formed. This is a slightly more advanced specification that utilizes case_when() from Module 4.
<-
nsduh_2019_scale |>
nsduh_2019 filter(mde_pastyear == "positive") |>
select(sex, starts_with("severity_")) |>
mutate(severity_scale = case_when(rowSums(!is.na(pick(starts_with("severity_")))) >= 3 ~ rowMeans(pick(starts_with("severity_")), na.rm = TRUE)))
|> select(starts_with("severity_")) |> head() nsduh_2019_scale
Use a graph to display the severity scale scores
We did all the work to form this new severity scale score — let’s take a moment to create a graph of the distribution of it across males and females. I’ll add a few extra options below to make it more attractive, including choosing different colors for sex and using patchwork to create side by side graphs of the histogram and density plot (click here for more information about these patchwork options).
<- c("male" = "#6ce1ff", "female" = "#e0ce75")
group.colors
<-
hist |>
nsduh_2019_scale drop_na() |>
ggplot(mapping = aes(x = severity_scale, fill = sex)) +
geom_histogram(binwidth = .5) +
scale_fill_manual(values = group.colors) +
theme_minimal() +
theme(legend.position = "none") +
labs(title = "A histogram of the data",
x = "Severity of MDE",
y = "Count",
fill = "Biological Sex")
<-
dens |>
nsduh_2019_scale drop_na() |>
ggplot(mapping = aes(x = severity_scale, fill = sex)) +
geom_density(alpha = .75) +
scale_fill_manual(values = group.colors) +
theme_minimal() +
labs(title = "A density plot of the data",
x = "Severity of MDE",
y = "Density",
fill = "Sex")
library(patchwork)
<- hist + dens
my_plots + plot_annotation(
my_plots title = "Distribution of the severity of MDE-caused interference with life functioning among adolescents with a past-year MDE",
subtitle = "Data from the National Survey on Drug Use and Health (NSDUH), 2019",
caption = "MDE = Major Depressive Episode; 24% of adoelscent females and 9.2% of adolescent males in the 2019 NSDUH survey met criteria for a past-year MDE."
)
Looks great! Which graph do you prefer — the histogram or the density plot?
Cross tabulations of categorical/qualitative data
Cross tabulation of MDE and substance use disorder
Before returning to the full data frame (with all years between 2009 and 2019), let’s explore methods for looking at a cross tabulation of variables. We explored cross tabulations a bit in Module 5, where we learned about and practiced using the tbl_cross() from the gtsummary package to obtain a cross tabulation of two categorical variables. Recall that a cross tabulation is a statistical tool used for analyzing the relationship between two or more categorical variables. It’s a type of table that displays the frequency distribution of the variables in a matrix format, making it easier to understand the relationship between them.
In a basic cross-tabulation with two variables, rows usually represent categories of one variable and columns represent categories of another variable. The cells in the table contain the counts or frequencies of the combined categories.
Let’s examine the relationship between past-year MDE (mde_pastyear) and past-year substance use disorder (substance_disorder) using a cross tabulation.
|>
nsduh_2019 select(mde_pastyear, substance_disorder) |>
tbl_cross(row = mde_pastyear, col = substance_disorder)
Past-year substance use disorder
|
Total | ||
---|---|---|---|
negative | positive | ||
Past-year Major Depressive Episode | |||
negative | 10,460 | 392 | 10,852 |
positive | 1,856 | 242 | 2,098 |
Unknown | 425 | 22 | 447 |
Total | 12,741 | 656 | 13,397 |
In the table above, past-year MDE is displayed on the rows, and past-year substance disorder is displayed on the columns. We can switch these if we like (i.e, switch the variables listed in the row =
and col =
arguments).
Let’s explore some additional useful options to enhance our table. We can request percentages in addition to counts. First, let’s request percents to be calculated by row (i.e., past-year MDE in our example). Notice the argument percent = "row"
in the code below. Notice also in the code below that after selecting the desired variables, I include drop_na(). This drops all missing data in a list-wise fashion — so any individuals with missing data on either variable in the cross-tabulation will be removed before computing and displaying the cross table.
|>
nsduh_2019 select(mde_pastyear, substance_disorder) |>
drop_na() |>
tbl_cross(row = mde_pastyear, col = substance_disorder, percent = "row")
Past-year substance use disorder
|
Total | ||
---|---|---|---|
negative | positive | ||
Past-year Major Depressive Episode | |||
negative | 10,460 (96%) | 392 (3.6%) | 10,852 (100%) |
positive | 1,856 (88%) | 242 (12%) | 2,098 (100%) |
Total | 12,316 (95%) | 634 (4.9%) | 12,950 (100%) |
The table shows us that there are a total of 2,098 adolescents who reported a past-year MDE, 242 of whom had a past-year substance disorder. This means that 12% (242/2098) of the adolescents who reported a past-year MDE, also reported a past-year substance use disorder. On the other hand, there are a total of 10,852 adolescents who did not report a past-year MDE in the sample, and 392 (representing 3.6% of the sample of adolescents without a past-year MDE, that is 392/10852) had a substance use disorder. Summarizing this information, the prevalence of a past-year substance use disorder is much more common (about three times more common) among adolescents with a past-year MDE than adolescents without a past-year MDE.
How does the table change if we specify percent = "column"
instead of percent = "row"
?
|>
nsduh_2019 select(mde_pastyear, substance_disorder) |>
drop_na() |>
tbl_cross(row = mde_pastyear, col = substance_disorder, percent = "column")
Past-year substance use disorder
|
Total | ||
---|---|---|---|
negative | positive | ||
Past-year Major Depressive Episode | |||
negative | 10,460 (85%) | 392 (62%) | 10,852 (84%) |
positive | 1,856 (15%) | 242 (38%) | 2,098 (16%) |
Total | 12,316 (100%) | 634 (100%) | 12,950 (100%) |
With the percents displayed relative to the columns, we find that there are a total of 634 adolescents who had a substance use disorder in the past-year, 38% (242/634) of those adolescents had a past-year MDE, and 62% (392/634) did not have a past-year MDE. Likewise, a total of 12,316 adolescents did not have a substance use disorder, 15% of these adolescents reported a past-year MDE, and 85% did not report a past-year MDE.
There is one other option for displaying percentages that we can choose. The argument percent = "cell"
provides the percentage of people in each cell relative to the overall total.
|>
nsduh_2019 select(mde_pastyear, substance_disorder) |>
drop_na() |>
tbl_cross(row = mde_pastyear, col = substance_disorder, percent = "cell")
Past-year substance use disorder
|
Total | ||
---|---|---|---|
negative | positive | ||
Past-year Major Depressive Episode | |||
negative | 10,460 (81%) | 392 (3.0%) | 10,852 (84%) |
positive | 1,856 (14%) | 242 (1.9%) | 2,098 (16%) |
Total | 12,316 (95%) | 634 (4.9%) | 12,950 (100%) |
In this third example, the percent in each cell with respect to the grand total is displayed. For example, there are 242 adolescents who reported both a MDE and a substance use disorder in the past year, which equals 1.9% or the total (i.e., 242/12950). These adolescents could be classified as having co-morbidity with respect to depression and substance use issues — comorbidity is defined as having two or more illnesses or disorders occurring at the same time.
Replication of Daly paper
Past-year MDE by sex from 2009 - 2019
Now, let’s return to the full data frame that includes survey years between 2009 and 2019. This full data frame is what was used in the Daly (2021) paper described above. In this final section of the Module, we’ll reproduce panel B from this figure in the paper:
Let’s begin by subsetting the data frame to exclude cases missing on past-year MDE or sex.
<-
for_trend_graph |>
nsduh_20092019 select(year, sex, mde_pastyear) |>
drop_na()
Now, let’s compute the proportion of males and females who met criteria for a past-year MDE in each survey year. Because mde_pastyear is a character variable — and we can’t calculate a proportion using a character variable, we need to first create a numeric variable so that we can perform our calculation. Here, the new variable is called mde_pastyear_positive and this new variable is coded 1 if the individual met the criteria for past-year MDE (i.e., mde_pastyear is equal to “positive”) and 0 if they did not meet criteria for a past-year MDE (i.e., mde_pastyear is equal to “negative”).
Once this variable is created we can group the data frame by year and sex, then compute the proportion. By calculating the mean of a binary variable coded 1 and 0 — the result is the proportion of cases coded 1 — which in our example indicates the proportion of individuals with a past-year MDE.
<-
for_trend_graph_summary |>
for_trend_graphmutate(mde_pastyear_positive = case_when(mde_pastyear == "positive" ~ 1,
== "negative" ~ 0)) |>
mde_pastyear group_by(year, sex) |>
summarize(proportion_mde_pastyear_positive = mean(mde_pastyear_positive, na.rm = TRUE), .groups = "drop")
for_trend_graph_summary
With this summarized data frame in hand, we can create our line graph. We’ll put year on the x-axis, and proportion_mde_pastyear_positive on the y-axis. We’ll also include the individual data points for each year to match the JAH graph we are replicating (via the geom_point() geometry). We’ll color the points and lines with the same colors we set earlier.
<- c("male" = "#6ce1ff", "female" = "#e0ce75")
group.colors
|>
for_trend_graph_summary ggplot(mapping = aes(x = year, y = proportion_mde_pastyear_positive, group = sex, color = sex)) +
geom_line(linewidth = 2) +
geom_point(size = 3) +
scale_x_continuous(breaks = 2009:2019) +
scale_y_continuous(limits = c(0,.25), labels = scales::percent_format(accuracy = 1)) +
theme_minimal() +
scale_color_manual(values = group.colors) +
labs(title = "Proportion of adolescents who met criteria for a past-year Major Depressive Episode (MDE)",
subtitle = "Data from the National Survey on Drug Use and Health (NSDUH), 2009 - 2019",
x = "Year",
y = "Percent of adolescents meeting criteria for past-year MDE",
color = "Sex")
Our graph replicates the corresponding graph from the Daly (2021) paper. Note that the figure presented in the paper includes confidence intervals around the points. The confidence intervals denote the degree of uncertainty in the estimates. We’ll learn about confidence intervals and how to compute them in Module 8.
The Daly (2021) paper explains these findings as follows:
Depression levels among female participants increased by 12 percentage points (95% CI 10.4-13.5) between 2009 and 2019, from 11.4% to 23.4%. This increase was 8.3 percentage points (95% CI 6.2-10.4) larger than the increase experienced by males over the same period (3.7%, 95% CI 2.5-4.8) (Figure 1B). The gender difference in adolescent depression levels increased from 6.4 percentage points (95% CI 5.4-7.5) in 2009 to 14.8 percentage points (95% CI 12.9-16.6) in 2019.
And, includes the following discussion of the findings:
This study of over 160,000 adolescents aged 12-17 years drew on nationally representative data to show that the prevalence of MDE approximately doubled between 2009 and 2019 (from 8.1% to 15.8%). This prolonged rise in MDE is concerning because adolescent depression tends to persist into adulthood and forecasts adverse health and socioeconomic consequences throughout life.
In the current study, the gender disparity in depression more than doubled (from 6.4 to 14.8 percentage points) between 2009 and 2019 driven by a substantial rise in the prevalence of MDE among females over this period. This finding extends existing evidence that has found significant though less pronounced increases in internalizing problems and depressive symptoms among adolescent girls since the beginning of the 21st century. Potential reasons for this increase are manifold and include increases in bullying and victimization and use of social media and technology which may have been more impactful for girls than boys. In addition, reduced sleep quality and quantityy, the long-term impact of the Great Recession, and rising educational expectations may have contributed to the rise in depression.
Importantly, Daly (2021) also recognizes limitations of his study and findings:
The current study is limited in its reliance on self-reports of depressive symptoms that may differ from clinical evaluations and could be subject to recall bias. This study utilized repeated cross-sectional data and it remains possible that the surveyed populations differed over time, though the NSDUH’s high response rate, consistent sampling design, and use of survey weights safeguards against this possibility.
Last, the Daly (2021) paper offers these concluding remarks:
The results highlight the need for further investment in adolescent mental health promotion, mental health services, and intervention programs. School group-based interventions, exercise, and psychological therapy have demonstrated effectiveness in reducing adolescent depressive symptoms and could help tackle the marked rise in depression among adolescent girls identified in this study.
This brief documentary put together by the New York Times offers interesting insights into this ongoing crisis of adolescent depression.
Bonus material — getting better population estimates
The NSDUH uses a stratified, multistage area probability sampling strategy to select participants. This approach allows for nationally representative estimates for non-institutionalized individuals aged 12 and older. This complex sampling is beyond the scope of this Module. For those interested, this resource gives an overview of the methodology.
For those readers who might be interested, I will provide the code below that can be used to obtain estimates of past-year and lifetime MDE for 2019 (i.e., Table 3 that we created earlier) that take into account the sampling design strategy. This uses the svydesign() function from the survey package, and the sampling variables described in the list of variables in the data frame (at the beginning of the Module).
First, here is a bit more information about these sampling variables:
vestr: This variable is used to identify the sampling stratum. A stratum in survey sampling is a subgroup or layer of the population. Stratification is done to achieve a more representative sample and to improve the accuracy of estimates. Each stratum comprises units of the population that have similar characteristics.
verep: This variable stands for Primary Sampling Unit (PSU). In complex survey design, a PSU is a unit or cluster of the population used in the first stage of multi-stage sampling. The PSUs are often geographic areas like towns, counties, or regions, and all individuals within the chosen PSU are included in the sample.
adol_weight: This is the final analysis weight for each respondent. It’s a measure that adjusts the responses to account for the complex sampling design used in the NSDUH, as well as for non-response and post-stratification. In other words, it ensures that the responses are representative of the entire U.S. population, despite variations in selection probability and response rates among different groups. When analyzing the survey results, this weight is applied to each respondent’s data. For example, if a respondent’s weight is 2, it means that person represents two people in the larger population. By using these weights, researchers can make valid inferences about the larger U.S. population based on the sample data.
So in summary, vestr, verep, and adol_weight are all variables in the NSDUH that play key roles in ensuring that the survey data is accurately representative of the larger population, allowing researchers to draw meaningful and valid conclusions about substance use and mental health in the United States.
And, here is the code that can be applied to obtain accurate population-level estimates.
library(survey)
<-
df_for_weighted |>
nsduh_2019 select(starts_with("mde_"), sex, adol_weight, vestr, verep)
<-
weighted svydesign(
id = ~ verep,
strata = ~ vestr,
data = df_for_weighted,
weights = ~ adol_weight,
nest = TRUE
)
|>
weighted tbl_svysummary(by = sex,
include = c("mde_lifetime", "mde_pastyear"),
label = list(
~ "Lifetime prevalence of a Major Depressive Episode",
mde_lifetime ~ "Past-year prevalence of a Major Depressive Episode")) |>
mde_pastyear as_gt() |>
tab_header(
title = md("**Table 3 (revised). Prevalence of depression among U.S. adolescents by sex**"),
subtitle = "Data from the National Survey on Drug Use and Health, 2019") |>
tab_footnote(
footnote = "Table prevalences are adjusted for the NSDUH complex sample design",
locations = cells_title(groups = "title")
)
Table 3 (revised). Prevalence of depression among U.S. adolescents by sex1 | ||
---|---|---|
Data from the National Survey on Drug Use and Health, 2019 | ||
Characteristic | female N = 12,222,9892 |
male N = 12,682,0502 |
Lifetime prevalence of a Major Depressive Episode | ||
negative | 8,130,293 (69%) | 10,776,797 (87%) |
positive | 3,728,867 (31%) | 1,556,756 (13%) |
Unknown | 363,829 | 348,497 |
Past-year prevalence of a Major Depressive Episode | ||
negative | 9,019,046 (77%) | 11,252,466 (91%) |
positive | 2,752,836 (23%) | 1,062,285 (8.6%) |
Unknown | 451,107 | 367,298 |
1 Table prevalences are adjusted for the NSDUH complex sample design | ||
2 n (%) |
As you can see, the percentages are quite similar to our original Table 3 (which used the raw data) — but now we take into account the sampling design and the survey weights. The counts of people overall and in each cell are weighted to match the US population in 2019.
As another example, we can compute better estimates for the replication of the Daly paper (i.e., estimates for the proportion of adolescents with a past-year MDE by year and sex) by taking into account the survey design. Here’s the code to carry out that process:
# Create a survey design object
<-
df_for_weighted_20092019 |>
nsduh_20092019 select(year, starts_with("mde_"), sex, adol_weight, vestr, verep) |>
mutate(mde_pastyear_positive = case_when(mde_pastyear == "positive" ~ 1,
== "negative" ~ 0))
mde_pastyear
<-
weighted_20092019 svydesign(
id = ~ verep,
strata = ~ vestr,
data = df_for_weighted_20092019,
weights = ~ adol_weight,
nest = TRUE
)
# Calculate the weighted summary by sex and year
<-
weighted_summary_20092019 svyby(
formula = ~mde_pastyear_positive,
by = ~year + sex,
design = weighted_20092019,
FUN = svymean,
na.rm = TRUE
|>
) as.data.frame()
# View the result
|>
weighted_summary_20092019 select(-se)
These proportions are similar to the proportions that ignored the sample design — but clearly different. For example, initially when ignoring sample design, we estimated the proportion of females with a past-year MDE in 2009 to be 0.1234, while here it is .1140. Notice that the values in this new table that takes into account the survey design match precisely with those reported by Daly:
Depression levels among female participants increased by 12 percentage points (95% CI 10.4–13.5) between 2009 and 2019, from 11.4% to 23.4%.
That is, notice that the 2009 estimate for females from the table above which accounts for the survey design is .1140 * 100 = 11.4% and the 2019 estimate for females from the table above is .2338 * 100 = 23.4% — matching the values provided in the exerpt from the Daly paper above.
As described by the NSDUH methodologists: The person-level analysis weights (adol_weight) are the product of 16 weight components. Each weight component accounts for either a selection probability at a selection stage or an adjustment factor adjusting for non-response, coverage, or extreme weights. The sum of the weight over all of the respondents on the data file represents an estimate of the total number of individuals in the target population. In view of the use of weights as expansion factors in forming estimates, the weights can be interpreted as representing the total number of target population individuals each record in the file represents.
By applying these weights, our estimates most accurately reflect the population we aim to describe.
Wrap up
Throughout this Module, you’ve built a solid foundation in key statistical concepts that are crucial for any aspiring data scientist. Your new skills in computing and interpreting descriptive statistics enable you to summarize complex data frames effectively. This skill is essential as it transforms raw data into comprehensible information, providing clear insights into trends and patterns.
The Module has also introduced you to the intricacies of survey research, a common method for gathering data in many fields. This preparation is vital for the next stages of your journey in data science.
Footnotes
In the gtsummary package, numeric variables with fewer than 10 unique levels default to type categorical. To change a numeric variable to continuous that defaulted to categorical, use
type = list(varname ~ "continuous")
.↩︎Markdown is a straightforward markup language that enhances plain text with formatting elements like headings, lists, and links without needing a formal text editor or HTML tags. It works consistently across different devices, maintaining the format of the text regardless of the platform used.↩︎