Apply and Practice Activity

Explore cross-tabulations with NSDUH

Introduction

For this activity you will work with the adolescent data from the National Survey on Drug Use and Health (NSDUH), the same data we worked with in Module 7. The data are stored in a file called NSDUH_adol_depression.Rds.

Here are the variables in the data frame:

Variable	Description
year	Survey year
sex	Biological sex of respondent
age	Age of respondent
raceeth	Race/ethnicity of respondent
mde_pastyear	Did the respondent meet criteria for a major depressive episode in the past year? negative = did not meet criteria, positive = did meet criteria.
mde_lifetime	Did the respondent meet criteria for a major depressive episode in their lifetime? negative = did not meet criteria, positive = did meet criteria.
mh_sawprof	Did the respondent see a mental health care professional in the past-year?
severity_chores	If mde_pastyear is positive, how severely did depression interfere with doing home chores?
severity_work	If mde_pastyear is positive, how severely did depression interfere with school/work?
severity_family	If mde_pastyear is positive, how severely did depression interfere with family relationships?
severity_social	If mde_pastyear is positive, how severely did depression interfere with social life?
mde_pastyear_severe	If mde_pastyear is positive, was the MDE classified as severe?
substance_disorder	Did the respondent meet criteria for a past-year substance use disorder (alcohol or drug)? — available for 2019 data only
adol_weight	Sampling weight - from sampling protocol
vestr	Stratum — from sampling protocol
verep	Primary sampling unit — from sampling protocol

As you learned in Module 5, understanding the basic rules of probability is a fundamental skill in statistics and data analysis. As we did in both Modules 5 and 7, one practical way to apply these rules is by examining cross-tabulated data, which allows us to explore relationships between two categorical variables. In this activity, we’ll use a cross-tabulation (or contingency table) of past-year Major Depressive Episode (MDE) and past-year Substance Use Disorder (SUD) to practice calculating various probabilities.

Please follow the steps below to complete this activity.

Step by step directions

Step 1

In the Posit Cloud foundations project, start a new Quarto document. Click File -> New File -> Quarto Document. In the dialog box that pops up, give your new document a title (e.g., Explore Cross-Tabulations), then type your name beside Author. Uncheck the box beside Use visual markdown editor. Then click Create.

Once the file is created, click File -> Save As, then save your file in the apply_and_practice_programs folder inside the programs folder of the foundations course project. Name it cross_tabs.qmd.

Quarto auto-populates some text and code chunk examples to help you get started. You can delete all of this. Just highlight everything BELOW the YAML header (i.e., the part that starts and ends with three dashes, ---) and then press Delete.

To ensure you are working in a fresh session, close any other open tabs (save them if needed). Click the down arrow beside the Run button toward the top of your screen then click Restart R and Clear Output.

Step 2

First we need to load the packages that are needed for this activity. Create a first level header called

# Load packages

Then insert a code chunk (click on the Green C at the top of the RStudio Session) and load the following packages.

library(gtsummary)
library(here)
library(tidyverse)

Once entered, click run (the green play button in the upper left of the code chunk) on the Load packages code chunk. Now, the packages are ready for you to use.

Step 3

Create a first level header

# Import data

Insert a code chunk, then import the NSDUH data frame.

nsduh_20092019 <- read_rds(here("data", "NSDUH_adol_depression.Rds"))

Step 4

Create a first level header

# Subset the data to 2019

Insert a code chunk and create a new data frame called nsduh_2019 that includes just the 2019 data.

Step 5

Create a first level header

# A cross-tabulation of past-year MDE and substance use for females

Then insert a code chunk. Write a pipe that does the following:

Take the nsduh_2019 data frame
Filter the data frame to include only females
Select the variables mde_pastyear and substance_disorder
Drop any missing cases using drop_na()
Create a cross tabulation using tbl_cross() — put mde_pastyear on the rows and substance_disorder on the columns (don’t request percentages).

Your table should look like this:

	substance_disorder		Total
	negative	positive	Total
mde_pastyear
negative	4,637	171	4,808
positive	1,317	170	1,487
Total	5,954	341	6,295

Step 6

Use the cross tabulation from Step 5 to answer the following questions about female co-morbidity of MDE and SUD in 2019.

What is the probability of a past-year major depressive episode (MDE)?
What is the probability of a past-year substance use disorder (SUD)?

Record your answers as text in the .qmd analysis notebook — underneath the code chunk and outputted cross tabulation (i.e., NOT inside the code chunk — rather, in the white part of the notebook).

Step 6 answers

Probability of MDE (Major Depressive Episode):

The probability (denoted P) of having MDE is the number of individuals with MDE (either with or without SUD) divided by the total number of individuals. From the table:

P(MDE) = Total number of individuals with MDE / Total number of individuals = 1,487 / 6,295 = 0.236

Probability of SUD (Substance Use Disorder):

The probability (denoted P) of having a SUD is the number of individuals with a SUD (either with or without MDE) divided by the total number of individuals. From the table:

P(SUD) = Total number of individuals with a SUD / Total number of individuals = 341 / 6,295 = 0.054

In summary, among females in the 2019 data, the probability of having a past-year MDE is approximately 0.24 and the probability of having a past-year SUD is approximately 0.05.
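The two marginal probabilities above can be checked with a few lines of base R. This is only a minimal sketch: it types in the counts from the Step 5 table by hand rather than reading the NSDUH file, and the object names (counts, n_total, p_mde, p_sud) are illustrative choices, not part of the course materials.

```r
# Counts copied by hand from the Step 5 cross tabulation (females, 2019).
# Rows are MDE status, columns are SUD status.
counts <- matrix(
  c(4637, 171,
    1317, 170),
  nrow = 2, byrow = TRUE,
  dimnames = list(mde = c("negative", "positive"),
                  sud = c("negative", "positive"))
)

n_total <- sum(counts)                        # all respondents in the table
p_mde <- sum(counts["positive", ]) / n_total  # MDE-positive row total / grand total
p_sud <- sum(counts[, "positive"]) / n_total  # SUD-positive column total / grand total

round(c(p_mde = p_mde, p_sud = p_sud), 3)
```

Each marginal probability uses a whole row or a whole column, which is why SUD status is ignored when computing P(MDE) and MDE status is ignored when computing P(SUD).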

Step 7

Use the cross tabulation from Step 5 to answer the following question.

What is the probability of having both MDE and SUD?

Record your answers as text in the .qmd analysis notebook.

Step 7 answers

The value you calculated in Step 7 is called the joint probability — that is, the probability of having both MDE and a SUD. It is calculated as the number of individuals with both disorders divided by the total number of individuals.

P(MDE ∩ SUD) = 170 / 6,295 = 0.027

Note that ∩ is the mathematical symbol for the intersection of sets. In probability theory, A ∩ B represents the event that both A and B occur; here, that a female in the population had both an MDE and a SUD in the past year.
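As a rough check on the joint probability, the sketch below divides every cell of the Step 5 table by the grand total using base R's prop.table(). The counts are typed in by hand and the object names are illustrative assumptions.

```r
# Counts copied by hand from the Step 5 cross tabulation (females, 2019)
counts <- matrix(
  c(4637, 171,
    1317, 170),
  nrow = 2, byrow = TRUE,
  dimnames = list(mde = c("negative", "positive"),
                  sud = c("negative", "positive"))
)

# With no margin argument, prop.table() divides each cell by the grand total,
# so every entry is a joint probability
joint <- prop.table(counts)
round(joint, 3)

# The positive/positive cell is P(MDE and SUD)
p_joint <- joint["positive", "positive"]
round(p_joint, 3)

# The four joint probabilities cover every respondent, so they sum to 1
sum(joint)
```

The other three cells are joint probabilities too; for example, the negative/negative cell is the probability of having neither disorder.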

Step 8

Use the cross tabulation from Step 5 to answer the following question.

What is the probability of having an MDE, a SUD, or both?

Record your answers as text in the .qmd analysis notebook.

Step 8 answers

The value you calculated in Step 8 is the probability of the union of two events. The union, written as A ∪ B, is the event that a person is positive for MDE, positive for a SUD, or positive for both.

To calculate P(A ∪ B), add the probabilities of A and B and then subtract the joint probability of A and B, because the cases where both A and B occur are counted twice in the first two terms: P(A ∪ B) = P(A) + P(B) - P(A ∩ B).

In terms of counts from the cross table:

total MDE positive cases = 1,487,
total SUD positive cases = 341,
both MDE and SUD positive cases = 170.

Therefore,

P(A ∪ B) = (total MDE positive + total SUD positive - both positive) / total population = (1,487 + 341 - 170) / 6,295 = 0.263. That is, the probability of having an MDE, a SUD, or both is about 0.26 in the population.
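The addition rule can be verified numerically. The sketch below is one way to do it in base R, using counts typed in by hand from the Step 5 table (the variable names are illustrative). It also cross-checks the result against the complement: everyone who is not in the "neither" cell must have an MDE, a SUD, or both.

```r
# Counts taken by hand from the Step 5 cross tabulation (females, 2019)
n_total   <- 6295  # all respondents in the table
n_mde     <- 1487  # MDE positive (row total)
n_sud     <- 341   # SUD positive (column total)
n_both    <- 170   # positive for both
n_neither <- 4637  # negative for both

# Addition rule: subtract the overlap so it is counted only once
p_union <- (n_mde + n_sud - n_both) / n_total

# Cross-check: the union is the complement of "neither"
p_union_check <- 1 - n_neither / n_total

round(c(p_union = p_union, p_union_check = p_union_check), 3)
```

Both routes give the same value, which is a quick way to catch arithmetic slips when working from a table.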

Step 9

Use the cross tabulation from Step 5 to answer the following questions.

What is the probability of having a SUD given that the individual has an MDE?
What is the probability of having an MDE given that the individual has a SUD?

Step 9 answers

The values you calculated in Step 9 are called conditional probabilities.

The conditional probability of having a SUD given MDE is the number of individuals with both MDE and a SUD divided by the total number of individuals with an MDE.

P(SUD | MDE) = 170 / 1,487 = 0.114

The conditional probability of having MDE given a SUD is the number of individuals with both MDE and a SUD divided by the total number of individuals with a SUD.

P(MDE | SUD) = 170 / 341 = 0.499
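The two conditional probabilities differ only in their denominators, which a short base R sketch can make concrete. The counts are typed in by hand from the Step 5 table and the object names are illustrative. The last lines also confirm the general relationship P(SUD | MDE) = P(MDE ∩ SUD) / P(MDE).

```r
# Counts copied by hand from the Step 5 cross tabulation (females, 2019)
counts <- matrix(
  c(4637, 171,
    1317, 170),
  nrow = 2, byrow = TRUE,
  dimnames = list(mde = c("negative", "positive"),
                  sud = c("negative", "positive"))
)

n_both <- counts["positive", "positive"]

# Condition on MDE: restrict attention to the MDE-positive row
p_sud_given_mde <- n_both / sum(counts["positive", ])

# Condition on SUD: restrict attention to the SUD-positive column
p_mde_given_sud <- n_both / sum(counts[, "positive"])

round(c(p_sud_given_mde = p_sud_given_mde,
        p_mde_given_sud = p_mde_given_sud), 3)

# Same answer from probabilities instead of counts: joint / marginal
p_joint <- n_both / sum(counts)
p_mde   <- sum(counts["positive", ]) / sum(counts)
all.equal(p_sud_given_mde, p_joint / p_mde)
```

The numerator is the same in both cases; what changes is the group being conditioned on.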

Step 10

All of the probabilities we calculated in this activity so far can be recovered by requesting the percentages in the cross-tabulations.

To get the simple (i.e., marginal) probability of MDE and the simple probability of SUD, request percent = "cell" in the tbl_cross() function. The probabilities (expressed as percentages, i.e., probability times 100) appear in the Total column of the positive row for MDE (24%) and in the Total row of the positive column for SUD (5.4%).

nsduh_2019 |> 
  filter(sex == "female") |> 
  select(mde_pastyear, substance_disorder) |> 
  drop_na() |> 
  tbl_cross(row = mde_pastyear, col = substance_disorder, percent = "cell")

	substance_disorder		Total
	negative	positive	Total
mde_pastyear
negative	4,637 (74%)	171 (2.7%)	4,808 (76%)
positive	1,317 (21%)	170 (2.7%)	1,487 (24%)
Total	5,954 (95%)	341 (5.4%)	6,295 (100%)

In a cross-tabulation (also known as a contingency table), the “margins” often refer to the row and column totals. These are typically displayed in the right-most column and the bottom row of the table. They are called “margins” because they are displayed at the edge, or margin, of the table.

For example, let’s refer back to our table. In this example, the margins of the table are the row totals (4,808 and 1,487) and the column totals (5,954 and 341). The “Total” in the bottom right corner (6,295) is the overall total, counting every individual across all categories.

The margins are useful because they allow us to calculate the marginal probabilities, which are the probabilities of a single event occurring without regard to the other event. For instance, in this example, the marginal probability of a person having a Major Depressive Episode (MDE) in the past year would be the total number of MDE positive individuals (1,487) divided by the total population (6,295), regardless of their SUD status.

The term “marginal” refers to the fact that these probabilities are derived from the totals at the margins of the cross-tabulation table.
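To see the margins outside of tbl_cross(), the following base R sketch rebuilds the table from hand-typed counts and uses addmargins() to append the row and column totals (base R labels them "Sum" rather than "Total"). The object names are illustrative assumptions.

```r
# Counts copied by hand from the Step 5 cross tabulation (females, 2019)
counts <- as.table(matrix(
  c(4637, 171,
    1317, 170),
  nrow = 2, byrow = TRUE,
  dimnames = list(mde = c("negative", "positive"),
                  sud = c("negative", "positive"))
))

# Append row and column totals; these are the margins of the table
with_margins <- addmargins(counts)
with_margins

# Dividing by the grand total first turns the margins into marginal probabilities
prob_margins <- addmargins(prop.table(counts))
round(prob_margins, 3)

# Marginal probability of MDE: the "Sum" entry of the MDE-positive row
p_mde <- prob_margins["positive", "Sum"]
round(p_mde, 3)
```

The interior cells of prob_margins are joint probabilities, while the "Sum" row and column hold the marginal probabilities.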

The joint probability is also recovered from the table with percent = "cell" (see the cell for positive/positive with 170 cases, which represents 2.7% of the population).

The conditional probability of substance use disorder given MDE, P(SUD | MDE) = 170 / 1,487 = 0.114, is obtained by requesting percent = "row".

nsduh_2019 |> 
  filter(sex == "female") |> 
  select(mde_pastyear, substance_disorder) |> 
  drop_na() |> 
  tbl_cross(row = mde_pastyear, col = substance_disorder, percent = "row")

	substance_disorder		Total
	negative	positive	Total
mde_pastyear
negative	4,637 (96%)	171 (3.6%)	4,808 (100%)
positive	1,317 (89%)	170 (11%)	1,487 (100%)
Total	5,954 (95%)	341 (5.4%)	6,295 (100%)

The conditional probability of MDE given SUD, P(MDE | SUD) = 170 / 341 = 0.499, is obtained by requesting percent = "column".

nsduh_2019 |> 
  filter(sex == "female") |> 
  select(mde_pastyear, substance_disorder) |> 
  drop_na() |> 
  tbl_cross(row = mde_pastyear, col = substance_disorder, percent = "column")

	substance_disorder		Total
	negative	positive	Total
mde_pastyear
negative	4,637 (78%)	171 (50%)	4,808 (76%)
positive	1,317 (22%)	170 (50%)	1,487 (24%)
Total	5,954 (100%)	341 (100%)	6,295 (100%)

There is one exception: tbl_cross() does not directly report the probability of the union of two events, P(A ∪ B). The union counts every case where A occurs, B occurs, or both occur, so it combines cells from more than one row and column of the table, and no single cell, row, or column percentage gives it. It can still be calculated from the counts or percentages in the table, as in Step 8.
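Although the union is not printed in any tbl_cross() table, it is straightforward to compute from respondent-level data with a logical OR. The sketch below does not read the NSDUH file; as a stand-in, it expands the four hand-typed cell counts from the Step 5 table into one row per respondent. The data frame and column names (cells, people, mde, sud) are illustrative assumptions.

```r
# Cell counts copied by hand from the Step 5 cross tabulation (females, 2019)
cells <- data.frame(
  mde = c("negative", "negative", "positive", "positive"),
  sud = c("negative", "positive", "negative", "positive"),
  n   = c(4637, 171, 1317, 170)
)

# Expand to one row per respondent by repeating each cell n times
people <- cells[rep(seq_len(nrow(cells)), times = cells$n), c("mde", "sud")]
nrow(people)

# The union is the share of respondents positive on MDE, on SUD, or on both;
# the mean of a logical vector is the proportion of TRUE values
p_union <- mean(people$mde == "positive" | people$sud == "positive")
round(p_union, 3)
```

The same logical condition could be applied to the real data frame after filtering and dropping missing cases, though the exact pipeline would depend on how you have set up your notebook.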

In your .qmd analysis notebook, practice writing the tbl_cross() syntax for these examples — requesting the percentages for cells, rows, and columns, and matching up the corresponding probabilities that we’ve learned about so far.

Step 11

Let’s finish up by looking at one other cross tabulation — specifically, the cross tabulation between MDE and severe MDE. We’ll use males and females for this example.

In your analysis notebook, create a first level header

# Cross tabulation of past-year MDE and severe MDE

Insert a code chunk, then from the nsduh_2019 data frame, create a cross tabulation of mde_pastyear and mde_pastyear_severe. To do this, first select these two variables, then use drop_na() to drop missing cases, then call tbl_cross() with percent = "cell". NSDUH classifies an MDE as severe if the adolescent has a score of 7 or higher on any of the four severity items (i.e., MDE’s negative impact on chores, school/work, family, or social life). Note that mde_pastyear_severe is only relevant if mde_pastyear is positive. Your output should look like this:

	mde_pastyear_severe		Total
	negative	positive	Total
mde_pastyear
negative	10,852 (84%)	0 (0%)	10,852 (84%)
positive	586 (4.5%)	1,497 (12%)	2,083 (16%)
Total	11,438 (88%)	1,497 (12%)	12,935 (100%)
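The "7 or higher on any of the four severity items" rule described above can be illustrated with a toy example. This is only a sketch under stated assumptions: the scores are made up, the column names are hypothetical shorthand for the four severity items, and the real NSDUH coding (including how missing values are handled) may differ.

```r
# Made-up severity scores for three hypothetical adolescents with a past-year MDE
toy <- data.frame(
  chores = c(3, 8, 5),
  work   = c(2, 4, 6),
  family = c(6, 1, 7),
  social = c(5, 2, 3)
)

# "Any item is 7 or higher" is the same as "the largest item is 7 or higher";
# pmax() takes the row-wise maximum across the four columns
toy$max_score <- pmax(toy$chores, toy$work, toy$family, toy$social)
toy$severe <- ifelse(toy$max_score >= 7, "positive", "negative")

toy
```

Under this rule the first respondent (highest score 6) is not classified as severe, while the second (8) and third (7) are.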

In your notebook, add a second cross tabulation that requests percent = "row" and a third cross tabulation that requests percent = "column".

Using these three cross tabulations, we can run through the same set of probabilities we calculated for the last example.

Compute the following quantities using your cross tabulations:

Simple (Marginal) Probabilities
- What is the probability of having a Major Depressive Episode (MDE) in the past year?
Joint Probability
- What is the probability of having both a Major Depressive Episode (MDE) in the past year and it being severe?
Conditional Probability
- Given that an individual had a Major Depressive Episode (MDE) in the past year, what is the probability that it was severe?
Union of the Two Events
- What is the probability of either having a Major Depressive Episode (MDE) in the past year, a severe MDE, or both?

Record these quantities in your analysis notebook. Additionally, write a paragraph to describe two or three interesting elements that you’ve garnered from this cross tabulation.

Step 12

Finalize and submit.

Now that you’ve completed all tasks, to help ensure reproducibility, click the down arrow beside the Run button toward the top of your screen then click Restart R and Clear Output. Scroll through your notebook and see that all of the output is now gone. Now, click the down arrow beside the Run button again, then click Restart R and Run All Chunks. Scroll through the file and make sure that everything ran as you would expect. You will find a red bar on the side of a code chunk if an error has occurred. Taking this step ensures that all code chunks are running from top to bottom, in the intended sequence, and producing output that will be reproduced the next time you work on this project.

Now that all code chunks are working as you’d like, click Render. This will create an .html output of your report. Scroll through to make sure everything is correct. The .html output file will be saved alongside the corresponding .qmd notebook file.

Follow the directions on Canvas for the Apply and Practice Assignment entitled “NSDUH Cross Tabulations Apply and Practice Activity” to get credit for completing this assignment.