Apply and Practice Activity

ARCOS: Examine the Pill Buyers

Introduction

For this activity you will work with the WAPO (Washington Post) compiled data repository on opioid pills. We will explore a data file that is different from the one used in Module 4: here, the data include every opioid purchase transaction between 2006 and 2014. These data frames are very large, so we’ll consider just one county, Larimer County, CO. Each row of data represents one purchase transaction.

We’ll work with the following set of variables:

  • BUYER_NAME: The name of the buyer.
  • TRANSACTION_DATE: The date of the transaction, stored as a string in MMDDYYYY format.
  • DOSAGE_UNIT: The number of opioid pills delivered to the pharmacy for that specific transaction.
  • DRUG_NAME: The name of the opioid purchased.

Step by step directions

Step 1

Let’s get started.

Navigate to the apply_and_practice folder in the programs folder of the course project. Open up the file called arcos_buyers.qmd.

To ensure you are working in a fresh session, close any other open tabs (saving them first if needed). Click the down arrow beside the Run button toward the top of your screen, then click Restart R and Clear Output.

Once the .qmd file is open, add your name to the author section of the YAML metadata.

Step 2

First, we need to load the packages needed for this activity. Find the code chunk labeled Load packages in the .qmd file, then load these packages:

library(janitor)
library(here)   
library(tidyverse)

Once entered, click Run on the Load packages code chunk. The packages are now ready for you to use.

Step 3

Go to this website and scroll down to the section that says “Find the data for where you live”. Choose Colorado and Larimer County. The page will provide a summary that you can peruse.

You can try downloading the data yourself from that page and moving it to your desktop. Alternatively, I have already downloaded it for you: click the link below to download the file, then move it to your desktop.

Download the data file

Once the data file is downloaded to your desktop, you can upload it to your foundations Posit Cloud project. In the lower right quadrant of your RStudio session, click on the Files tab, then navigate to the data folder. Once the data folder contents are showing, click the Upload button. A box will pop up. Under File to Upload, click Choose File and navigate to the file called arcos-co-larimer-08069-itemized.csv that you saved on your desktop. Then click OK. You should now see that data file in your data folder on the Posit Cloud.

Now, we’re ready to import the data frame into our RStudio session. Copy and paste the code below into the code chunk under the # Import data header in your analysis notebook.

orig <- read_csv(here("data", "arcos-co-larimer-08069-itemized.csv"))

Once this is added to your arcos_buyers.qmd analysis notebook, run the code chunk and then take a look at the orig data frame you just created (in the Environment tab).
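If you prefer to inspect the data from code rather than the Environment tab, here is a quick optional sketch (assuming the import above ran without errors) using glimpse(), which is loaded with the tidyverse:

# Peek at the structure of orig: column names, types, and the first few values
glimpse(orig)

# Count the number of rows (transactions) in the data frame
nrow(orig)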

Step 4

Under the first level header

# Prepare the data

Inside the code chunk, create a new data frame called larimer that is based on orig (the data frame you just imported). Use the function clean_names() from the janitor package to clean the names. This handy function makes the column names more consistent and easier to work with by performing several transformations:

  1. It converts all characters to lower case. This can help prevent errors caused by case-sensitivity.
  2. It replaces spaces and other non-alphanumeric characters with underscores. This makes it easier to reference the column names in code, as you can directly use the column name without back ticks or quotes.
  3. It removes leading and trailing white space. These can often be hard to notice but can cause unexpected behavior when trying to reference a column by its name.
  4. It makes sure that all column names are unique. Duplicate column names can lead to confusion and unexpected results.

Click here for documentation on the janitor package.
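To see these transformations in action, here is a small, hypothetical example. The messy column names below are made up for illustration; they are not from the ARCOS data:

# A tiny, made-up tibble with messy column names
messy <- tibble(
  `BUYER NAME`    = "Hypothetical Pharmacy",
  `DOSAGE_UNITS ` = 100,
  `Drug-Name`     = "HYDROCODONE"
)

# Before cleaning: mixed case, a space, a trailing space, and a hyphen
names(messy)                         # "BUYER NAME"  "DOSAGE_UNITS "  "Drug-Name"

# After cleaning: lowercase snake_case names with no stray characters
messy |> clean_names() |> names()    # "buyer_name"  "dosage_units"  "drug_name"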

With clean_names() added, your code chunk should look like this:

larimer <-
  orig |> 
  clean_names()

Run these lines of code, then take a look at the data frame and see how the names have changed.

Step 5

Building on the pipe from Step 4, after the clean_names() call, create a new variable called date by parsing transaction_date (which is in the format MMDDYYYY). We can use the lubridate package for this, which is part of the tidyverse. The mdy() function converts a character string into a date object; it expects dates to be in “month-day-year” order. So, for instance, “03252021” represents March 25, 2021.
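If you want to see this for yourself before editing the pipe, you can try mdy() directly in the console (lubridate is loaded along with the tidyverse; the date string here is just an example):

# Parse a MMDDYYYY string into a proper Date object
mdy("03252021")    # returns the Date 2021-03-25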

Your code should now look like this:

larimer <- 
  orig |> 
  clean_names() |> 
  mutate(date = mdy(transaction_date))

Open up the data frame and confirm that a new variable called date, formatted as an R-recognizable date, is present.

Once the new date variable is in a proper date format, we can easily pull out the month and year as standalone variables.

larimer <- 
  orig |> 
  clean_names() |> 
  mutate(date = mdy(transaction_date)) |> 
  mutate(year = year(date),
         month = month(date, label = TRUE)) 

Specifically, this new code accomplishes the following tasks:

  • mutate(year = year(date), ...): This line uses the mutate() function from the dplyr package to add a new column named year to the data frame. The values in this column are obtained by applying the year() function from the lubridate package to the date column of the data frame. The year() function extracts the year from a date or date-time object.

  • month = month(date, label = TRUE): This line continues the mutate() call from the previous line. It adds another new column named month to the data frame. The values in this column are obtained by applying the month() function from the lubridate package to the date column of the data frame. The month() function extracts the month from a date or date-time object. The label = TRUE argument specifies that the function should return abbreviated month names (e.g., “Jan”, “Feb”, “Mar”) instead of numerical month values (1, 2, 3, etc.). A short, standalone example follows this list.
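Here is a quick illustration of what year() and month() return, using a made-up date rather than one from the ARCOS data:

# A single example date
d <- as.Date("2021-03-25")

year(d)                   # 2021
month(d)                  # 3
month(d, label = TRUE)    # Mar (an ordered factor with levels Jan < Feb < ... < Dec)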

Add this new mutate() call (to create year and month) to your pipe in your arcos_buyers.qmd analysis notebook (this should all be one pipe under the # Prepare the data code chunk). Run the code chunk and inspect the new variables.

Step 6

With the data frame complete, let’s perform two tasks. First, let’s compute the total pills purchased each month, for each year the data are available (2006-2014). We’ll then use that information to determine whether there is variability in total pills purchased by month. This will address the research question: “Are there certain times of the year when more pills were purchased?”

Under the first level header

# Create a boxplot

Insert a code chunk, then create a new data frame called larimer_by_month that is based on the larimer data frame you created earlier. Here, sum the number of pills purchased in each month of the years 2006 to 2014. To accomplish this, group by year and month, and then use summarize() to sum the total number of pills purchased in each month/year combination. Call the summed variable total_pills_purchased. With 9 years of data and 12 months per year, this operation will give you a data frame with 108 rows.

Once the summarized data frame is created, use it to create a boxplot of total_pills_purchased by month across the years. This will allow us to determine whether there were differences in the number of pills coming into pharmacies in Larimer County as a function of month. If you need help formulating the code, click on Step 7.

Write a couple of sentences to describe the result.

Step 7

larimer_by_month <-
  larimer |> 
  group_by(year, month) |> 
  summarize(total_pills_purchased = sum(dosage_unit, na.rm = TRUE), .groups = "drop") 

larimer_by_month |> 
  ggplot(mapping = aes(x = month, y = total_pills_purchased)) +
  geom_boxplot(fill = "violet") +
  theme_minimal() +
  labs(title = "The distribution of pills purchased in Larimer County, CO by calendar month",
       subtitle = "Data collected between 2006 and 2014",
       x = "",
       y = "Number of opoiod pills purchased")

In this code chunk, .groups = "drop" instructs R to drop groups after summarization. This isn’t necessary, but it can be helpful if you plan to do additional analyses after the summarization where the groups should no longer be considered.

Let’s consider an example. Suppose you’re summarizing some data grouped by multiple variables. After summarization, if the resulting data frame has one row per group, the grouping structure can become unnecessary or even confusing for further data manipulation and analysis.

Here is an example:

# Data
df <- data.frame(
  group1 = c("A", "A", "B", "B"),
  group2 = c("X", "Y", "X", "Y"),
  value = c(1, 2, 3, 4)
)

# Summarization without .groups = "drop"
result1 <- df |>
  group_by(group1, group2) |>
  summarize(sum = sum(value))

# Summarization with .groups = "drop"
result2 <- df |>
  group_by(group1, group2) |>
  summarize(sum = sum(value), .groups = "drop")

Here, result1 will retain a grouping structure (by default, summarize() drops only the last grouping variable, so result1 stays grouped by group1), whereas result2 will not. This might not be a big deal in this simple example, but in a complex data manipulation pipeline, unintended grouping structures can sometimes cause unexpected results. Therefore, using .groups = "drop" when you don’t need the grouping anymore is a good practice.

However, you should note that this is not always necessary. If you’re going to continue manipulating data with operations that should respect the grouping structure, then you might want to keep the grouping in place. It really depends on your specific use case and what you plan to do with the data frame after summarization.
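To make this concrete, here is a small follow-up sketch using the result1 and result2 data frames created above, showing how a leftover grouping changes the answer:

# result1 is still grouped by group1, so sum(sum) is computed within each group1 value
# and prop is a within-group proportion (1/3, 2/3, 3/7, 4/7)
result1 |>
  mutate(prop = sum / sum(sum))

# result2 is ungrouped, so sum(sum) is computed over all rows
# and prop is an overall proportion (1/10, 2/10, 3/10, 4/10)
result2 |>
  mutate(prop = sum / sum(sum))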

Another approach to ungroup a data frame that you no longer want to be grouped is to use the ungroup() function.

Here’s an example of that:

df |>
  group_by(group1, group2) |>
  summarize(sum = sum(value)) |>
  ungroup()

For more information about group_by(), see this documentation.

Step 8

Let’s perform one last task. Under the first level header

# Examine the most frequent buyers

Create a new code chunk, then take the larimer data frame and filter it to keep only year 2010.

Select the variables buyer_name and dosage_unit.

Group by buyer_name, and then summarize to create a variable called buyer_total which is the sum of dosage_unit for each buyer in the data frame. This will create the sum of all pills purchased in 2010 by each buyer in the data frame. Use the arrange() function to sort buyer_total in descending order. Take a look at the organizations who purchased the most pills in 2010.

Try to write this code yourself. If you have trouble getting the code just right — click on Step 9.

Once you create the sorted data frame, page through the results and then write a few sentences to describe what you observe.

Step 9

larimer |> 
  filter(year == 2010) |> 
  select(buyer_name, dosage_unit) |>
  group_by(buyer_name) |> 
  summarize(buyer_total = sum(dosage_unit, na.rm = TRUE)) |> 
  arrange(desc(buyer_total))

Step 10

Now that you’ve completed all tasks, to help ensure reproducibility, click the down arrow beside the Run button toward the top of your screen, then click Restart R and Clear Output. Scroll through your notebook and see that all of the output is now gone. Next, click the down arrow beside the Run button again, then click Restart R and Run All Chunks. Scroll through the file and make sure that everything ran as you would expect; a red bar will appear beside a code chunk if an error occurred. Taking this step ensures that all code chunks run from top to bottom, in the intended sequence, and produce output that will be reproduced the next time you work on this project.

Now that all code chunks are working as you’d like, click Render. This will create an .html output of your report. Scroll through it to make sure everything is correct. The .html output file will be saved alongside the corresponding .qmd notebook file.

Step 11

Follow the directions on Canvas for the Apply and Practice Assignment entitled “ARCOS Buyers Apply and Practice Activity” to get credit for completing this assignment.