Apply and Practice Activity

ARCOS: Compare States

Introduction

For this activity you will work with the WAPO compiled data repository on opioid pills that you studied in the Module 4 Handout on Data Wrangling. We will work with the county level data. Each county in the US has up to nine rows of data in this data frame, one for each year from 2006 to 2014. The table below describes the variables that we will utilize:

Variable Description
fips A 5 digit code that identifies each county
county County name
state State abbreviation
year Corresponding year for data
number_pills The number of Oxycodone & Hydrocodone pills purchased by pharmacies in the county
population The total number of people living in the county

Step by step directions

Step 1

Let’s get started. Navigate to the Posit Cloud and enter the foundations project.

Navigate to the apply_and_practice folder in the programs folder of the foundations project. Open up the file called arcos_compare_states.qmd.

To ensure you are working in a fresh session, close any other open tabs (save them if needed). Click the down arrow beside the Run button toward the top of your screen then click Restart R and Clear Output.

Once the .qmd file is open, add your name to the author section of the YAML metadata.

Step 2

First we need to load the packages that are needed for this activity. Find the code chunk labeled: Load packages in the .qmd file. Then load these packages:

library(here)   
library(tidyverse)

Once entered, click run on the Load packages code chunk. Now, the packages are ready for you to use.

Step 3

Under the first level header

# Import data

Import the opioid counties data frame that we worked with in Module 4 (it’s in the data folder of the course project — and called opioid_counties.Rds). Call the data frame opioid_counties.

Step 4

Under the first level header

# Subset the data

Create a new data frame called states that includes only US states and the District of Columbia. This is necessary because, in the initial data frame, there are some other territories — like Puerto Rico for example — that we’ll exclude for this examination.

To accomplish this task, you can use this vector of states/DC:

valid_states <- c("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
                  "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",                 
                  "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",                 
                  "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",                 
                  "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY", "DC")

Copy and paste this vector into the code chunk under # Subset the data, then write a pipe to create the new data frame (called states) which is based on opioid_counties, but filters to include only the states listed in the vector and only counties that have a fips code. Hint — use the %in% operator that you discovered in Module 4 to limit to US states and DC: filter(state %in% valid_states), and use drop_na(fips) to drop rows with a missing fips code.

Once the code is complete, run this code chunk to create the data frame and inspect it to make sure the code worked properly. Your new data frame should have 26,998 observations.

Step 5

Under the first level header

# Create a line plot for Colorado

Create a line graph that depicts the opioid pills per capita (at the state level) for each year from 2006 to 2014 for Colorado. Accomplishing this task will require several steps.

  1. First, create a new data frame (call it co_by_year) that is based on the states data frame that you just created.
  2. Filter to keep only counties in Colorado
  3. Group the data frame by year.
  4. Use summarize to create two new variables — total_pills (which is the sum of the pills sold for the year) and total_pop (which is the sum of the population for the year). These two values give us the total number of pills sold and the total number of people in the state for each year. Be sure to include na.rm = TRUE.
  5. Last, use mutate to create a new variable called pills_per_capita which is computed by dividing total_pills by total_pop. This provides the pills per capita for each year.

Try to do these steps on your own, but if you need hints — please check the code below:

Once these steps are complete, take a look at the created data frame (co_by_year). You should have a data frame with 9 observations — one row for each year. The primary variable we will use in the next step is pills_per_capita for each year.

Now, create the line graph where year is on the x-axis and pills_per_capita is on the y-axis. Be sure to include a title and axis labels.

Once created, write a few sentences to describe what you see and your thoughts on the graph.

co_by_year <-
  states |> 
  filter(state == "CO") |> 
  group_by(year) |> 
  summarize(total_pills = sum(number_pills, na.rm = TRUE),
            total_pop = sum(population, na.rm = TRUE)) |> 
  mutate(pills_per_capita = total_pills/total_pop) 

co_by_year |> 
  ggplot(mapping = aes(x = year, y = pills_per_capita)) +
  geom_line() +
  theme_minimal() +
  labs(title = "The change in opioid pills per capita from 2006 to 2014 in Colorado",
       x = "Year",
       y = "Opioid pills per capita")

Step 6

Now, let’s create a faceted plot that shows the change in opioid pills per capital for all states and the District of Columbia.

First create a data frame called all_by_year which calculates the pills per capita in each state for each year. This will be similar to what you did for CO, but now you will now have a data frame that has 459 observations — 9 rows of data for each state plus DC.

Using the all_by_year data frame, create a line graph that depicts the opioid pills per capita (at the state level) from 2006 to 2014 faceted by state. Notice in the Quarto notebook on the Posit Cloud that I added coded block directions to provide the desired height and width of the outputted figure — here I request a figure that is 12 inches high and 8 inches wide. Once you render your report later in the activity, you can change these values if you’d prefer a larger or smaller graph in your report. Be sure to include a title and axis labels.

Try to write this section of the code on your own, but if you need help, please check the code below:

all_by_year <-
  states |> 
  group_by(state, year) |> 
  summarize(total_pills = sum(number_pills, na.rm = TRUE),
            total_pop = sum(population, na.rm = TRUE), .groups = "drop") |> 
  mutate(pills_per_capita = total_pills/total_pop) 

all_by_year |> 
  ggplot(mapping = aes(x = year, y = pills_per_capita)) +
  geom_line() +
  theme_minimal() +
  facet_wrap(~state) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "The change in opioid pills per capita from 2006 to 2014 across US states and DC",
       x = "Year",
       y = "Opioid pills per capita")

After you produce the graph, write a few sentences to provide your thoughts on the graph.

Step 7

From your faceted line graph, it’s evident that many of the states with the highest overall pills per capita are located in the Appalachian Region. Appalachia spans 423 counties across 13 states, covering 206,000 square miles from southern New York to northern Mississippi. This region’s 26.4 million residents are distributed across parts of Alabama, Georgia, Kentucky, Maryland, Mississippi, New York, North Carolina, Ohio, Pennsylvania, South Carolina, Tennessee, Virginia, and all of West Virginia.

The opioid epidemic, significantly fueled by pharmaceutical companies aggressively marketing pain pills, has had a devastating impact on Appalachia. Court documents and investigative reporting reveal that Purdue Pharma specifically targeted this region, leading to widespread addiction. Executives at Purdue Pharma even referred to OxyContin as “hillbilly heroin,” viewing the economically disadvantaged Appalachian region as a prime market for their products. This targeted marketing strategy significantly contributed to the severe opioid crisis experienced by Appalachian communities.

If you’d like to learn more about the atrocious marketing strategies of Purdue Pharma and others, please watch this video (please note there is some lewd language — you can skip with missing critical material).



The Appalachian Regional Commission (ARC) is an economic development partnership entity of the federal government and 13 state governments focusing on the counties of the Appalachian Region. ARC’s mission is “to innovate, partner, and invest to build community capacity and strengthen economic growth in Appalachia.” They provide a wealth of data about the Appalachian counties. Let’s download, import, and merge some of their data to our opioids data frame.

  1. First, go to this website. To familiarize yourself, take a look at the map of Appalachia, then scan down to the link for an excel file of the ARC’s recent economic data for the region — the link is called: County Economic Status FY 2023 Data Tables. Click on the link to download the excel file. The file is called CountyEconomicStatusandDistressAreasFY2023DataTables.xls. Drag it from your downloads folder to your desktop.

  2. Next, upload this data file to the Posit Cloud. In the Files tab (lower right quadrant) of your foundations project, navigate to the data folder: Cloud > project > data. Click on the icon with the yellow upward arrow (when hovering over it, it will say “Upload files to server.)” Click on Choose File then choose the excel file from your desktop, then click Upload. You should see the file now exists in your data folder.

  3. Under the header # Examine pills in Appalachia, copy and paste the code below to read in the excel file. Study the code (and the original excel file that you downloaded) to get a sense of what is happening here. The tip box below walks through each step. Notice the variable called county_economic_status_fy_2023. ARC analyzes three-year average unemployment rates, per capita market income, and poverty rates, for each one of Appalachia’s 423 counties and then classifies the counties into one of five economic status designations — distressed, at-risk, transitional, competitive, or attainment. This is the key variable will we consider from ARC.

appalachia <- readxl::read_excel(here("data", "CountyEconomicStatusandDistressAreasFY2023DataTables.xls"), 
                         sheet = "ARC Counties", 
                         range = "A5:D429") |> 
  filter(!is.na(FIPS)) |> 
  janitor::clean_names() |> 
  select(fips, county_economic_status_fy_2023) |> 
  mutate(county_economic_status_fy_2023 = factor(county_economic_status_fy_2023,
                                                 levels = c("Attainment",
                                                            "Competitive",
                                                            "Transitional",
                                                            "At-Risk",
                                                            "Distressed")))

appalachia |> head()
Tip

Code explanation

  • Read the Excel File:

    • readxl::read_excel: This function reads data from an Excel file.

    • here("data", "CountyEconomicStatusandDistressAreasFY2023DataTables.xls"): Constructs the file path to the Excel file located in the “data” folder.

    • sheet = "ARC Counties": Specifies the sheet named “ARC Counties” to read from.

    • range = "A5:D429": Specifies the range of cells to read (from row 5 to row 429, and columns A to D).

  • Filter Data:

    • filter(!is.na(FIPS)): Removes rows where the FIPS column has missing (NA) values. There is a blank first row in the initial excel data file.
  • Clean Column Names:

    • janitor::clean_names(): Renames the columns to a consistent, snake_case format, making them easier to work with.
  • Select Needed Variables:

    • select(fips, county_economic_status_fy_2023): Selects just the variables we need.
  • Make county_economic_status_fy_2023 a Factor:

    • You’ve learned that a factor is a special type of character variable in R. It’s special because we can specify the ordering of the levels. Here I use the factor() function to specify that the order for these levels is “Attainment”, “Competitive”, “Transitional”, “At-Risk”, “Distressed”, corresponding to the most prosperous to the most distressed economic status.

Step 8

Underneath the code to import the Appalachian data from ARC, let’s merge the states data frame with the appalachia data frame just created. Specifically create a data frame called ap_states by left joining states with appalachia (i.e., list states first, then left join appalachia). Also, add a new variable called is_appalachia using mutate() and case_when(). The new variable should be coded “yes” if the county is in Appalachia (i.e., !is.na(county_economic_status_fy_2023) ~ "yes" — where !is.na means “is not missing”, and the new variable should be coded as “no” if the county is not in Appalachia (i.e., is.na(county_economic_status_fy_2023) ~ "no" — where is.na means “is missing”). You’ll also need to recreate the pills_per_capita variable in this data frame.

Now, create a line graph that has two lines on it — one to present the total pills per capita for the Appalachian counties, and one for all non-Appalachian counties. To do this, your code will be quite similar to Step 6, however, for this step you will need to group by is_appalachia and year before forming the necessary summaries and then plotting.

Write a few sentences below the graph to describe what you see.

Try to write the code on your own, but if you need help, you can check the code below:

ap_states <-
  states |> 
  left_join(appalachia, by = "fips") |> 
  mutate(is_appalachia = case_when(!is.na(county_economic_status_fy_2023) ~ "yes",
                                   is.na(county_economic_status_fy_2023) ~ "no")
  ) |> 
  mutate(pills_per_capita = number_pills/population)

# ap_states |> count(is_appalachia)

ap_states |> 
  group_by(is_appalachia, year) |> 
  summarize(total_pills = sum(number_pills, na.rm = TRUE),
            total_pop = sum(population, na.rm = TRUE), .groups = "drop") |> 
  mutate(pills_per_capita = total_pills/total_pop) |> 
  ggplot(mapping = aes(x = year, y = pills_per_capita, group = is_appalachia, color = is_appalachia)) +
  geom_line() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "The change in opioid pills per capita from 2006 to 2014 \nfor Appalachian and non-Appalachian counties",
       x = "Year",
       y = "Opioid pills per capita",
       color = "Is the county in Appalachia?")

Step 9

As a final exploration let’s determine if the maximum opioid pills per capita during the prescription phase of the opioid epidemic (2006-2014 — the years contained in the WAPO data frame) are associated with the economic status of Appalachian counties as classified by the Appalachian Regional Commission (ARC).

To assess this question, please do the following tasks in the code chunk labeled # Maximum pills and economic status of Appalachian counties in 2023.

  1. Create a data frame called max based on the ap_states data frame — group by fips then use summarize to create a new variable called max_pills_per_capita which is the max pills per capita during the observation period (2006-2014).
  2. Merge the appalachia data frame with the max data frame just created so that you end up with a data frame that include all of the Appalachian counties. Note that there are a few counties in the ARC data frame that are not in the WAPO data frame — therefore, use the drop_na() function to drop counties with missing data for max_pills_per_capita.
  3. Create a box plot using the merged data frame — put county_economic_status_fy_2023 on the x-axis, and max_pills_per_capita on the y-axis. Note that because we specified county_economic_status_fy_2023 to be a factor earlier in the activity — the specified order of the economic status as it ranges from the most prosperous to most distressed status will appear on the x-axis, as opposed to the default which is to order alphabetically. Provide a suitable title and axis labels.

When done, study the graph and write a few sentences to provide your thoughts on the results.

Try to write the code on your own, but if you need help, check the code below:

max <- 
  ap_states |> 
  group_by(fips) |> 
  summarize(max_pills_per_capita = max(pills_per_capita))

appalachia |> 
  left_join(max, by = "fips") |> 
  drop_na(max_pills_per_capita) |> 
  ggplot(mapping = aes(x = county_economic_status_fy_2023, y = max_pills_per_capita)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Were more pills per capita during the prescription phase of the opioid epidemic \nassociated with more economic distress in 2023 among Appalachian counties?",
       y = "Maximum pills per capita from 2006-2014",
       x = "Economic Status of County in 2023",
       caption = "Opioid data from WAPO data repository, Economic status data from Appalachian Regional Commission")

Step 10

Now that you’ve completed all tasks, to help ensure reproducibility, click the down arrow beside the Run button toward the top of your screen then click Restart R and Clear Output. Scroll through your notebook and see that all of the output is now gone. Now, click the down arrow beside the Run button again, then click Restart R and Run All Chunks. Scroll through the file and make sure that everything ran as you would expect. You will find a red bar on the side of a code chunk if an error has occurred. Taking this step ensures that all code chunks are running from top to bottom, in the intended sequence, and producing output that will be reproduced the next time you work on this project.

Now that all code chunks are working as you’d like, click Render. This will create an .html output of your report. Scroll through to make sure everything is correct. The .html output file will be saved along side the corresponding .qmd notebook file.

Step 11

Follow the directions on Canvas for the Apply and Practice Assignment entitled “ARCOS Compare States Apply and Practice Activity” to get credit for completing this assignment.