Data Visualization

Module 3

Learning objectives

Describe the graphical layers of a plot
Identify the ggplot2 functions and arguments needed to create graphs
Create the most commonly used graphs in data science
Apply graphs to illustrate important information and create new insight

Overview

Unleashing the power of data starts with understanding its stories, and nothing accomplishes this better than the art and science of data visualization. This process transforms voluminous and complex data into engaging, comprehensible visual graphics, serving as the gateway to data-driven insights. As a key cornerstone of data science, data visualization is your first step in unraveling the narratives hidden within the numbers.

In the context of data science, where large volumes of data are commonplace, visualization serves as a fundamental exploratory instrument. It enables data scientists to gain a broad overview of the data, pinpoint outliers, identify irregularities, and discern underlying patterns. This comprehensive view informs and optimizes the data cleaning process, model selection, and model validation.

Data visualization also functions as an excellent communication medium. It allows data scientists to translate their technical findings into a format that a wider audience can comprehend, regardless of their technical knowledge. By visually presenting data, data scientists can more effectively share their insights, strengthen their arguments, and encourage data-driven decision-making processes.

In short, data visualization is a key aspect of data science. It acts as both a microscope for the data scientist to explore and understand data, and a universal language to communicate these discoveries. As we navigate this module, you’ll get hands-on experience with this vital tool, enriching your skill set as a data scientist.

Why start with data visualization?

As we embark on this journey to learn R, our first stop will be the realm of data visualization. I’ve chosen this as our starting point for several compelling reasons.

Firstly, data visualization provides an interactive and vibrant way to comprehend complex data, instantly transforming the abstract into the tangible. It’s an experience where you’ll be able to convert lines of code into impactful, easy-to-understand graphics. This process is not only deeply satisfying but also provides immediate visual feedback, giving a real sense of accomplishment right from the beginning.

Beyond the thrill of creating visually compelling graphics, data visualization is a powerful tool in the hands of a data scientist. It aids in deciphering intricate patterns, spotting trends, and identifying correlations within the noise of raw data. This powerful tool allows you to tell a story that numbers alone cannot express, making it an essential skill in any data scientist’s toolkit and helping the user to grow their intuition about data modeling.

Last, data visualization with ggplot2 (the R package we will use for graphing) offers more than just the ability to create impactful graphics; it lays the groundwork for essential coding skills. As a part of the tidyverse, a collection of R packages designed for data science, ggplot2 operates on principles that are pervasive in coding and data manipulation tasks. It encourages a deep understanding of data structures and the grammar of graphics, a coherent system for describing and building graphs. This not only elevates your visualization abilities but also cultivates fundamental skills such as code debugging, logical thinking, and code construction. Each layer, aesthetic, or theme you add in ggplot2 is a small step towards learning how to manipulate data and write effective code. In essence, mastering ggplot2 becomes your springboard into becoming a skilled coder.

The utility of data visualization

Before we begin creating beautiful and meaningful graphics with R, please watch the two following introductory videos from Crash Course Statistics on data visualization that will introduce you to the usefulness of graphs and give an overview of different types of graphs.

Creating graphs in R

Data visualization in R offers a world of possibilities, providing extensive tools and techniques to create diverse and engaging graphs. As we navigate this module, we will lay the groundwork for understanding how to set up and customize graphs using the powerful package, ggplot2, part of the tidyverse.

The ggplot2 package provides a robust and flexible platform for creating visualizations. There’s much to learn about how to use R to create just the right graph for your project. You’ll learn the basics in this Module, and there are many wonderful resources to help you along your journey. I strongly encourage the use of the R-graph library, an extensive repository of R graph examples complete with their code. This resource will inspire you and guide you as you create compelling visualizations. In addition, the ggplot2 cheat sheet is a concise guide to the wealth of options available within ggplot2 and serves as a quick reference to help you make the most of the package.

Remember, mastering ggplot2 and data visualization in R is not a sprint but a marathon. As you replicate examples, utilize resources, and practice your skills, you’ll gradually unravel the expansive capacity of ggplot2 and become proficient in creating compelling, data-driven visualizations.

New users of R often find that searching for an example graph (like in the R graph library), copying and pasting the code to create the example graph, and then modifying the example to fit their needs is a great way to start. All of the available arguments can be a bit overwhelming at first. However, by exploring what different options do, you will slowly build your knowledge base and skill level. I encourage you to experiment: change options (e.g., the size of elements, colors, legend placement) to see what effects they have on the graph. This will help you gain a deeper understanding of the inner workings of ggplot2 and build your experience in creating aesthetically pleasing graphs that display the pertinent information in the best way possible.

Introduction to ggplot2

ggplot2 is a package that was developed to implement the graphing framework put forth by Dr. Leland Wilkinson in his book entitled The Grammar of Graphics. The grammar of graphics and ggplot2 are built around the 7 layers of a graph. These are listed in the table below. The primary ggplot2 function for creating a graph is called ggplot(). Each layer of a ggplot2 graph provides an element of the graph — and when they come together they tell the complete story of the data.

The layers of a ggplot2 graph

Layer	Description
Data	The data to be plotted.
Aesthetics	The mapping of variables to elements of the plot.
Geometries	The visual elements used to display the data (e.g., bar graph, scatterplot).
Facets	Plotting of small groupings within the plot area — i.e., subplots.
Statistics and Transformations	Additional statistics imposed onto plot or numerical transformations of variables to aid interpretation.
Coordinates	The space on which the data are plotted — e.g., Cartesian coordinate system (x- and y-axis), polar coordinates (pie chart).
Themes	All non-data elements — e.g., background color, labels.

Detailed Explanation of Each Layer

Data: This layer is the starting point and includes the dataset you want to visualize. The data is usually provided in the form of a data frame.
Aesthetics (aes): Aesthetics map the data variables to visual properties of the graph. For example, aes(x = var1, y = var2, color = var3) maps var1 to the x-axis, var2 to the y-axis, and var3 to the color of the points or lines in the plot.
Geometries (Geoms): Geometries define the type of plot. Common geoms include geom_point() for scatter plots, geom_line() for line plots, geom_bar() for bar charts, and geom_histogram() for histograms. Each geom function adds a layer to the plot representing a specific type of geometry.
Facets: Facets allow you to create multiple plots based on a factor variable. For example, facet_wrap(~ var4) will create a series of plots, each showing a subset of the data corresponding to the levels of var4.
Statistics and Transformations: This layer can include statistical summaries or transformations, such as stat_smooth() to add a regression line. Transformations can also include scaling data or computing aggregates.
Coordinates: The coordinate system defines how data points are placed on the plot. The default is Cartesian coordinates, but you can also use coord_polar() for circular plots or coord_flip() to swap the x- and y-axes.
Themes: Themes control the overall appearance of the plot. This includes elements like font size, font family, colors, and the presence of grid lines. The theme() function and pre-built themes like theme_minimal() or theme_classic() help customize the look and feel of the plot.

By understanding and utilizing these layers, you can create complex and informative visualizations that effectively communicate your data’s story. Each layer can be customized and combined to produce a wide variety of plots tailored to your specific needs.
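To make the layers concrete, here is a minimal sketch that touches all seven in one plot. It uses mpg, a small example data frame that ships with ggplot2, rather than any data from this module; the choice of variables (displ, hwy, drv, year) is only for illustration.

```r
library(ggplot2)

# mpg is an example data frame included with ggplot2 (not part of this module's data)
seven_layer_plot <- mpg |>                                  # 1. Data
  ggplot(mapping = aes(x = displ, y = hwy, color = drv)) +  # 2. Aesthetics
  geom_point() +                                            # 3. Geometry
  facet_wrap(~ year) +                                      # 4. Facets
  stat_smooth(method = "lm", formula = y ~ x) +             # 5. Statistics
  coord_cartesian() +                                       # 6. Coordinates (the default, written out)
  theme_minimal()                                           # 7. Theme

seven_layer_plot
```

Try removing one line at a time (other than the first two) to see what each layer contributes to the finished graph.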

Let’s apply the grammar of graphics to the plot below, which appeared in a 2020 issue of the Centers for Disease Control and Prevention’s Morbidity and Mortality Weekly Report (MMWR).

The scientists aimed to determine if, in the very early days of the COVID-19 pandemic, adoption of more COVID-19 mitigation policies was associated with fewer COVID-19 deaths in European countries through the end of June 2020. The graphic below presents their findings.

Early policy stringency and cumulative mortality† from COVID-19 — 37 European countries, January 23–June 30, 2020

Plot footnotes:

Abbreviations: ALB = Albania; AUT = Austria; BEL = Belgium; BGR = Bulgaria; BIH = Bosnia and Herzegovina; BLR = Belarus; CHE = Switzerland; CI = confidence interval; COVID-19 = coronavirus disease 2019; CYP = Cyprus; CZE = Czechia; DEU = Germany; DNK = Denmark; ESP = Spain; EST = Estonia; FIN = Finland; FRA = France; GBR = United Kingdom; GRC = Greece; HRV = Croatia; HUN = Hungary; IRL = Ireland; ISL = Iceland; ITA = Italy; LTU = Lithuania; LUX = Luxembourg; LVA = Latvia; MDA = Moldova; NLD = Netherlands; NOR = Norway; POL = Poland; PRT = Portugal; ROU = Romania; SRB = Serbia; SVK = Slovakia; SVN = Slovenia; SWE = Sweden; TUR = Turkey; UKR = Ukraine.

Based on the Oxford Stringency Index (OSI) on the date the country reached the mortality threshold. The OSI is a composite index ranging from 0–100, based on the following nine mitigation policies: 1) cancellation of public events, 2) school closures, 3) gathering restrictions, 4) workplace closures, 5) border closures, 6) internal movement restrictions, 7) public transport closure, 8) stay-at-home recommendations, and 9) stay-at-home orders. The mortality threshold is the first date that each country reached a daily rate of 0.02 new COVID-19 deaths per 100,000 population, based on a 7-day moving average of the daily death rate. The color gradient represents the calendar date that each country reached the mortality threshold.

† Deaths per 100,000 population.

Please watch this video that further describes how the layers of the graph are displayed in this CDC graph.

To summarize, in this table, I list each of the possible layers of a graph, describe them, and then indicate if and how the layer is represented in the example graph above. Take a moment to match up each layer with the corresponding element of the MMWR graph.

The layers of a ggplot2 graphic — mapped to the example MMWR plot

Layer	Description	Example
Data	The data to be plotted.	Data compiled by the CDC
Aesthetics	The mapping of variables to elements of the plot.	The OSI is mapped to the x-axis; cumulative mortality is mapped to the y-axis; population size is mapped to the size of the points; the date the mortality threshold was reached is mapped to the color of the points; country abbreviations are mapped as text labels on the points.
Geometries	The visual elements used to display the data (e.g., bar graph, scatterplot).	Scatterplot
Facets	Plotting of small groupings within the plot area — i.e., subplots.	There are no facets in this example — there is just a single plot.
Statistics and Transformations	Additional statistics imposed onto plot or numerical transformations of variables to aid interpretation.	A best fit line (the black line running through the points), corresponding confidence interval (the grayed area around the best fit line)
Coordinates	The space on which the data are plotted (e.g., Cartesian coordinate system with x- and y-axes, polar coordinates for a pie chart).	Cartesian coordinate system
Themes	All non-data elements — e.g., background color, labels.	Axis labels for x and y-axis

With this introduction to graphing in our pocket, let’s begin our journey to learn about data visualization and the power of ggplot2!

The substantive topic for this module

All of the graphs in the remainder of this Module focus on income inequality and its impact on our society. Income inequality is characterized by a significant disparity in the distribution of wealth and earnings among different socioeconomic groups. This concept underscores the pronounced gaps between the affluent and the underprivileged, with the wealthier segments amassing a disproportionately large share of income. This economic chasm is critical to address because it hinders social mobility, exacerbates poverty rates, and fuels socioeconomic disparities. Furthermore, it has wide-ranging implications for the health of the economy, social cohesion, and overall quality of life. Addressing income inequality is pivotal in promoting fairness, improving economic sustainability, and cultivating a society where opportunities for prosperity are accessible to all citizens, not just a privileged few.

To set the stage for digging into these data examples, please watch the following video:

The general structure of code for a ggplot2 graph

Before we begin creating and interpreting graphs, let’s take a moment to get familiar with the code structure for ggplot2. Every ggplot2 graph must have three components:

Data
A set of aesthetic¹ mappings between variables in the data and visual properties
At least one layer which describes how to display the graph — this is typically the geometry. All geometries in ggplot2 start with geom_ — followed by the type of geometry desired. For example, geom_point() creates a point/scatter plot, geom_histogram() creates a histogram, geom_line() creates a line plot.

Here’s a simple example that we’ll build on in the next section:

real_time_ineq_2022 |>
  ggplot(mapping = aes(x = group.f, y = factor_income_share)) +
  geom_col()

Let’s break down this code line by line. The numbered items below correspond, in order, to the three lines of code above.

1. Start with the name of the data frame, real_time_ineq_2022, and use the pipe operator (|>) to feed the data into the ggplot() function.
2. Use the ggplot() function to initialize a ggplot object. Map the aesthetics: here we use the aes() function to assign the variable group.f to the x-axis and the variable factor_income_share to the y-axis. The line ends with the plus (+) operator, which signals that another layer follows. The + operator combines, or adds, the different layers of a plot, so it must appear at the end of each line that is followed by an additional layer. Each layer can include different elements like geometries (geom), aesthetic mappings (aes), scales (scale), and themes (theme).
3. Define the geometry of the graph. geom_col() is ggplot2’s name for a bar graph²; the _col stands for column.
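The way + adds one layer at a time can be seen by building a plot in steps and saving each step. This is a minimal sketch using a small made-up data frame (toy_shares, with hypothetical values) in place of the module’s data.

```r
library(ggplot2)

# Hypothetical toy data, used only for illustration
toy_shares <- data.frame(
  group = c("Bottom 50%", "Middle 40%", "Top 10%"),
  share = c(0.10, 0.40, 0.50)
)

# Step 1: data plus aesthetic mappings; with no geometry yet, only empty axes are drawn
base_plot <- toy_shares |>
  ggplot(mapping = aes(x = group, y = share))

# Step 2: adding a geometry with + draws the bars
bar_plot <- base_plot + geom_col()

# Step 3: further layers are added in exactly the same way
final_plot <- bar_plot + theme_minimal()

final_plot
```

Printing base_plot, bar_plot, and final_plot in turn shows the graph taking shape one layer at a time.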

With this brief introduction in our pocket — we’re ready to begin creating beautiful and meaningful graphics.

Create a series of graph types

Recall from Module 1 that we need to load the packages needed for our session using the library() function. For Module 3, we will need the here package (to refer to the data frames that will be read in) and the tidyverse package (which includes ggplot2 — the package used to plot data). By loading these two packages (i.e., writing and then running the code) — we will pull the tools we need into our work space (i.e., session).

library(here)
library(tidyverse)

Bar graph

A bar graph is a type of plot that presents categorical data (nominal or ordinal) with rectangular bars proportional to the values they represent. In a vertical bar plot, categories are typically displayed along the horizontal axis (x-axis), while corresponding values are plotted on the vertical axis (y-axis). Conversely, in a horizontal bar plot, categories are shown on the y-axis, and their values extend along the x-axis.

For example, notice the difference between the vertical bar graph on the left, and the horizontal bar graph on the right in the set below.

Bar graphs can be used for comparing individual categories, displaying frequency distributions, or showing trends among groups. They can be displayed in a simple format with a single set of bars, or they can be more complex, featuring grouped or stacked bars to represent multiple sub-categories within each main category.

Let’s begin by creating a bar graph that depicts the degree of income inequality in the United States in 2022. We’ll use data compiled by Drs. Thomas Blanchet, Emmanuel Saez, and Gabriel Zucman of the Department of Economics, University of California, Berkeley. The data are hosted on a webpage called Realtime Inequality. Realtime Inequality delivers premier up-to-date statistics revealing the distribution of economic growth among different groups. The data highlight the benefits experienced by every income and wealth segment.

A paper describing the methodology can be found here.

The data we will consider are stored in a file called real_time_ineq_2022.Rds, located in the data folder of our project. The data were downloaded from the Realtime Inequality website in May of 2023. The following variables are included in the data frame:

Variable	Description
group.f	Mutually exclusive income groups based on total income for 2022 for all U.S. adults. The groups include: 1. The bottom 50% of income earners (0th to the 50th percentile), 2. The middle 40% of income earners (50th percentile to the 90th percentile), 3. The top 10% (excluding the top 1%), 4. The top 1% (excluding the top .1%), 5. The top .1% (excluding the top .01%), and 6. The top .01%
population	The total number of people in the group.
factor_income_share	The proportion of total income that belongs to the group. Income can come from several sources, such as wages, salaries, interests, dividends, and rents. For this study it includes all labor and capital income before taxes.
wealth_share	The proportion of total wealth that belongs to the group. Wealth includes all assets individuals own, such as houses, cars, savings, retirement accounts, and investments, minus their debts like mortgages and student loans. Wealth, therefore, represents the accumulation of resources over time and is more related to long-term financial security and opportunity. For this study wealth includes all financial and non-financial assets owned by households, net of all debts. Assets include all funded pensions (IRAs, 401(k)s, and funded defined benefits pensions). Vehicles and unfunded pensions (such as promises of future Social Security benefits and other unfunded defined benefits pensions) are excluded.

Let’s import the data frame and create a table of the data that we will consider here.

real_time_ineq_2022 <- read_rds(here("data", "real_time_ineq_2022.Rds")) 
real_time_ineq_2022

We will begin with a very simple version of the graph — this is the code displayed and described earlier in the module which uses a bar plot to display the share of income belonging to each group.

real_time_ineq_2022 |> 
  ggplot(mapping = aes(x = group.f, y = factor_income_share)) + 
  geom_col()

This very basic graph sets the main structure — but it’s not very informative or appealing. Let’s add additional layers to improve the graph.

Fix the x-axis labels: First, note that in the initial graph, the x-axis labels are difficult to read because they print over one another. To fix this, we can use the scale_x_discrete() function to wrap the x-axis labels of a discrete variable once they reach a certain width.
```
real_time_ineq_2022 |> 
   ggplot(mapping = aes(x = group.f, y = factor_income_share)) + 
   geom_col() +
   scale_x_discrete(labels = scales::wrap_format(10))
```
This piece of code, scale_x_discrete(labels = scales::wrap_format(10)), customizes the x-axis labels of the plot. The function scale_x_discrete() is used to format the x-axis for a discrete (categorical) variable. The labels argument within scale_x_discrete() specifies how the labels on the x-axis should be displayed. Here, scales::wrap_format() is a function from the scales package that wraps text to a specified width, in this case, 10 characters. Recall from Module 1 that the :: operator in scales::wrap_format() is used to access a specific function from a package without the need to load the entire package with the library() function. Putting it all together, this line of code is saying: “Take the text of the labels for the variable mapped to the x-axis (i.e., group.f) and wrap them so that each line is no more than 10 characters wide.”
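You can also try the wrapping function on its own, outside of a plot, to see what it does to a single label. In this small sketch the label text is a made-up example, not necessarily the exact wording stored in group.f.

```r
# wrap_format(10) returns a function that wraps text to a width of about 10 characters
wrap_10 <- scales::wrap_format(10)

# A long label, similar in spirit to those in group.f (the exact text is an assumption)
long_label <- "The bottom 50% of income earners"

wrapped_label <- wrap_10(long_label)

# cat() prints the embedded line breaks, so the wrapping is visible in the console
cat(wrapped_label)
```

The words are unchanged; only line breaks are inserted, which is why the labels stop printing over one another on the axis.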
Display the y-axis as a percentage and specify the range of values for it: Note that the variable mapped to the y-axis — factor_income_share — is expressed as a proportion. Let’s ask ggplot2 to instead display it as a percentage, and let’s also specify the range for the y-axis that we desire (rather than relying on the default). We use the scale_y_continuous() function to modify the continuous variable mapped to the y-axis. Inside this function we use the label argument in conjunction with the percent_format() function from the scales package to indicate that we want the value displayed as a percentage. Also, we use the limits argument to indicate that we want the y-axis to display values from 0 to .5 (note that you must refer to the scale of the original variable here — which is expressed as a proportion — so 0 to .5 will cause the y-axis to range from 0 to 50%).
```
real_time_ineq_2022 |> 
  ggplot(mapping = aes(x = group.f, y = factor_income_share)) + 
  geom_col() +
  scale_x_discrete(labels = scales::wrap_format(10)) +
  scale_y_continuous(label = scales::percent_format(), limits = c(0,.5))
```
The graph is looking better — let’s keep improving on it though.
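To see what percent_format() does, it can be applied directly to a few proportions. The values below are illustrative and are not taken from the data frame.

```r
# Proportions on the same 0-to-1 scale as factor_income_share (illustrative values)
proportions <- c(0, 0.25, 0.5)

# percent_format() returns a function; calling that function converts proportions to labels
to_percent <- scales::percent_format()
axis_labels <- to_percent(proportions)

print(axis_labels)
# [1] "0%"  "25%" "50%"
```

This is the same conversion ggplot2 applies to each tick mark on the y-axis, which is why the limits are still given as proportions (0 to .5) even though the axis displays percentages.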
Display the percentage for each group above the bar: Let’s request that the percentage for each group be displayed at the top of the corresponding bar. For this, we need to add an additional geometry to the plot, namely a text geometry, which is added using the function geom_text(). Here we map the variable factor_income_share to the label aesthetic, which determines the text that is displayed as a label above the bar. The percent() function from the scales package is used here to convert factor_income_share into a percentage format for display (rather than a proportion, as it is recorded in the data frame). Notice the other argument to geom_text(), vjust = -0.5, which adjusts the vertical justification of the text. Justification refers to how the text is aligned relative to its anchor point, which here is the top of the bar (the value of factor_income_share). A value of 0 places the bottom edge of the text at the top of the bar (so the text sits directly above it), a value of 1 places the top edge of the text at the top of the bar (so the text sits just inside it), and values outside this range push the text farther away. The value -0.5 therefore positions the text slightly above the top of the bar, with a small gap. You can experiment with different values to find a position that you like best.
```
real_time_ineq_2022 |> 
  ggplot(mapping = aes(x = group.f, y = factor_income_share)) + 
  geom_col() +
  scale_x_discrete(labels = scales::wrap_format(10)) +
  scale_y_continuous(label = scales::percent_format(), limits = c(0,.5)) +
  geom_text(mapping = aes(label = scales::percent(factor_income_share)), vjust = -0.5)
```
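One way to explore vjust is to draw several identical bars and give each label a different setting. The sketch below uses a made-up data frame (toy_bars) and maps a hypothetical column v to the vjust aesthetic so the three settings can be compared side by side.

```r
library(ggplot2)

# Hypothetical toy data: three identical bars, each labeled with a different vjust value
toy_bars <- data.frame(
  setting = c("vjust = 1", "vjust = 0", "vjust = -0.5"),
  share   = c(0.3, 0.3, 0.3),
  v       = c(1, 0, -0.5)
)

vjust_plot <- toy_bars |>
  ggplot(mapping = aes(x = setting, y = share)) +
  geom_col(fill = "grey80") +
  # vjust is mapped inside aes() here so that each bar's label uses its own value
  geom_text(mapping = aes(label = scales::percent(share), vjust = v)) +
  scale_y_continuous(limits = c(0, 0.4))

vjust_plot
```

Compare where each label lands relative to the top of its bar, then change the values in v to see how the position shifts.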

Add a title, labels, and a caption: Moving on, let’s now add a title, a caption to acknowledge the data source, and axis labels. We use the labs() function to accomplish this.

real_time_ineq_2022 |> 
  ggplot(mapping = aes(x = group.f, y = factor_income_share)) + 
  geom_col() +
  scale_x_discrete(labels = scales::wrap_format(10)) +
  scale_y_continuous(label = scales::percent_format(), limits = c(0,.5)) +
  geom_text(mapping = aes(label = scales::percent(factor_income_share)), vjust = -0.5) +
  labs(title = "Share of income belonging to each income group",
      subtitle = "2022 income data from U.S. adults",
      caption = "Data retrieved from Realtime Inequality (https://realtimeinequality.org)",
      x = "",
      y = "% of total income belonging to each group")

Notice that the x-axis label is set with x = "", which leaves it blank. An alternative would be to use x = "Income Group". However, the title already points the reader to this, and the group labels are sufficient for the reader to understand, so the graph looks less cluttered without the label.

Style the graph: Let’s make a few stylistic changes in the code below. Notice that a different theme is specified that changes the background; theme_minimal() is one that I like. You can explore many different themes here. The color of the bars is changed to a shade of blue, “cadetblue”. We’ll explore additional color options later. One thing to note here: in the geom_col() function, fill = "cadetblue" is specified without any aesthetic mapping. This causes all bars to be filled with the same color. If instead the code read geom_col(mapping = aes(fill = group.f)), then each group’s bar would be a different color (i.e., a variable would be aesthetically mapped to an element of the graph).
```
real_time_ineq_2022 |> 
  ggplot(mapping = aes(x = group.f, y = factor_income_share)) + 
  geom_col(fill = "cadetblue") +
  scale_x_discrete(labels = scales::wrap_format(10)) +
  scale_y_continuous(label = scales::percent_format(), limits = c(0,.5)) +
  geom_text(mapping = aes(label = scales::percent(factor_income_share)), vjust = -0.5) +
  labs(title = "Share of income belonging to each income group",
     subtitle = "2022 income data from U.S. adults",
     caption = "Data retrieved from Realtime Inequality (https://realtimeinequality.org)",
     x = "",
     y = "% of total income belonging to each group") +
  theme_minimal()
```
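The difference between setting a color and mapping a variable to color can be checked directly. This is a minimal sketch with a made-up data frame (toy_shares, hypothetical values); layer_data() is a ggplot2 helper that returns the data a layer actually draws.

```r
library(ggplot2)

# Hypothetical toy data, used only for illustration
toy_shares <- data.frame(
  group = c("A", "B", "C"),
  share = c(0.2, 0.3, 0.5)
)

# Set: fill is given outside aes(), so every bar gets the same color
constant_fill <- toy_shares |>
  ggplot(mapping = aes(x = group, y = share)) +
  geom_col(fill = "cadetblue")

# Mapped: fill is inside aes(), so the color varies with the group variable (and a legend appears)
mapped_fill <- toy_shares |>
  ggplot(mapping = aes(x = group, y = share)) +
  geom_col(mapping = aes(fill = group))

unique(layer_data(constant_fill)$fill)        # one color: "cadetblue"
length(unique(layer_data(mapped_fill)$fill))  # three colors, one per group
```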

Earlier I showed you two versions of a bar graph, a vertical bar graph and a horizontal bar graph. We’ve been working with the vertical version. How can we change the code to create a horizontal version? It’s easy — we just add a layer using the coord_flip() function.

In addition, because we are using geom_text() to add the corresponding percentage of each bar — I need to change vjust to hjust to accommodate the horizontal bar graph. I also adjust the value a bit — a value of -.1 puts the number just slightly to the right of the bar. Which version of the graph do you prefer — vertical or horizontal?

real_time_ineq_2022 |> 
  ggplot(mapping = aes(x = group.f, y = factor_income_share)) + 
  geom_col(fill = "cadetblue") +
  scale_x_discrete(labels = scales::wrap_format(10)) +
  scale_y_continuous(label = scales::percent_format(), limits = c(0,.5)) +
  coord_flip() +
  geom_text(mapping = aes(label = scales::percent(factor_income_share)), hjust = -.1) +
  labs(title = "Share of income belonging to each income group",
       subtitle = "2022 income data from U.S. adults",
       caption = "Data retrieved from Realtime Inequality (https://realtimeinequality.org)",
       x = "",
       y = "% of total income belonging to each group") +
  theme_minimal()

In summary, the graph provides us with important information about economic inequality in the United States. Half of the population (i.e., the bottom 50%, which comprises about 125 million people) received just over 10% of the income. This indicates a significant income disparity, as half of the population holds a very small fraction of the total income. On the other end of the spectrum, the share of the top 1% can be ascertained by summing the shares for the top three groups. For income share, that is 10.85 + 5.24 + 4.74, or 20.83, meaning that the top 1% received nearly 21% of the income in 2022. These figures paint a clear picture of substantial income inequality in the United States, with a significant concentration of income in the hands of the top 1%, and especially in the top 0.1% and 0.01%. This is a key aspect of ongoing discussions around economic policy, tax reform, and social equity. In the next section, we’ll take a closer look at the top 1%.
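The arithmetic for the top 1% can be reproduced in R. The three values are the shares quoted in the paragraph above, entered as proportions.

```r
# Income shares of the top three groups, as quoted in the text (expressed as proportions)
top_three_shares <- c(0.1085, 0.0524, 0.0474)

# The top 1% is the union of these three mutually exclusive groups, so their shares add up
top_1pct_share <- sum(top_three_shares)

# Convert to a percentage and round to two decimals
print(round(100 * top_1pct_share, 2))
# [1] 20.83
```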

Line graph

A line graph is a type of chart typically used to display information that changes over time³. It is called a line graph because it uses lines to connect individual data points which represent a quantity at a certain point in time (an interval or ratio variable). These graphs are often used to show trends over intervals, such as days, months, years or even minutes and seconds, making them a great tool for tracking changes or trends over time.

The line graph consists of a horizontal axis (x-axis) and a vertical axis (y-axis). The x-axis often represents time intervals, while the y-axis represents the quantity measured and displayed on the graph. Each point on the line corresponds to a data value at a particular time. By connecting these points, the line graph gives a sense of direction, revealing an upward or downward trend, or fluctuations in the data.

Line graphs can have one or more lines, each representing different categories of data. For instance, you could use a line graph to compare the revenue growth of multiple companies over a number of years. Each company would have its own line on the graph, allowing you to easily compare their growth trends.

Overall, line graphs provide a clear and concise way to visualize complex data, making it easier to identify patterns, trends, and relationships in the data.

A line graph with one group

In the prior graph we saw that in 2022, about 21% of the total income in the US went to the top 1%. Now, let’s use a line graph to examine how this percentage has changed over the past century.

We will use data from the World Inequality Database to create our graph.

There are just two variables in the data frame:

Variable	Description
year	The year the data come from
share_top_1pct	The proportion of income belonging to the top 1 percent of the US adult population

Let’s import the data frame and take a look at the first few rows of data:

us_ineq_past_century <- read_rds(here("data", "us_ineq_past_century.Rds"))
us_ineq_past_century |> head(n = 20)

The code to create the graph is presented below.

us_ineq_past_century |> 
  ggplot(mapping = aes(x = year, y = share_top_1pct)) +
  geom_line() +
  scale_x_continuous(breaks = seq(from = 1910, to = 2025, by = 10)) +
  scale_y_continuous(label = scales::percent_format()) +
  labs(title = "Share of income belonging to the top 1 percent",
       subtitle = "1913 to 2021",
       caption = "Data retrieved from World Inequality Database (https://wid.world/country/usa/)",
       x = "",
       y = "% of total pre-tax national income belonging to the top 1 percent") +
  theme_minimal()

Let’s walk through the code (please carefully match the code above with the description below).

year is mapped to the x-axis, and share_top_1pct is mapped to the y-axis.
The geom_line() geometry is used to create a line graph.
The scale_x_continuous() function controls the scaling of the x-axis variable (year — which is a continuous variable, thus we use scale_x_continuous() rather than scale_x_discrete() as in the bar graph example above where the variable on the x-axis was discrete). This function call sets the breaks on the x-axis to be every 10 years, from 1910 to 2025.
Similar to the bar graph example, the scale_y_continuous() function is used to modify the continuous variable mapped to the y-axis. Inside this function the label argument is used in conjunction with the percent_format() function to indicate that the value should be displayed as a percentage.
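
To see what these two scale arguments produce, the short sketch below evaluates them on their own, outside of a plot. The proportions passed to the formatter are made up for illustration, and accuracy = 1 is an optional argument of percent_format() that I add here (it is not in the graph code above) so that the labels are rounded to whole percentages.

```r
# Breaks for the x-axis: every 10 years, starting at 1910.
# seq() stops at the last value that does not exceed `to`,
# so the final break is 2020 rather than 2025.
year_breaks <- seq(from = 1910, to = 2025, by = 10)
year_breaks

# percent_format() returns a function that converts proportions
# into percent labels. The proportions below are illustrative only.
to_percent <- scales::percent_format(accuracy = 1)
pct_labels <- to_percent(c(0.10, 0.21))
pct_labels
```

This is why the y-axis of the graph shows values such as 10% and 20% even though share_top_1pct is stored as a proportion.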

What does this graph tell us? In the early part of the 20th century, during the era known as the “Gilded Age,” income inequality was very high, with the top 1% controlling a large share of the nation’s income. The share of income going to the top 1% reached its peak in the late 1920s, just before the Great Depression. During the mid-20th century, from the 1930s to the 1970s, the share of income going to the top 1% decreased significantly. This period, sometimes referred to as “The Great Compression,” saw policies such as progressive taxation and the strengthening of labor unions, which led to a more equitable distribution of income. However, from the 1980s onwards, the trend reversed, and the income share of the top 1% started increasing again. This period, referred to as the “Great Divergence” or the “New Gilded Age,” saw significant increases in income inequality. By the early 21st century, the share of income going to the top 1% had returned to levels not seen since the early 20th century. Income inequality remains very high today.

A line graph with multiple groups

Let’s now return to the data collected by Realtime Inequality. Instead of just looking at 2022 as we did with the bar graph, we’ll take a longer-term view of the six identified income groups. In addition, we’ll look at income growth over time instead of the share of income belonging to each group. Income growth is calculated from dollar figures that have been annualized and adjusted for price inflation to March 2023 dollars⁴.

Variables in the data frame include:

Variable	Description
year	Year of data.
group.f	Mutually exclusive income groups based on total income for 2022 for all U.S. adults.
real_factor_income_per_unit	Average income (including all labor and capital income before taxes) per adult (in March 2023 dollars).
real_factor_income_per_unit_growth	Growth in income (including all labor and capital income before taxes) since base year per adult (in March 2023 dollars).

Let’s read in the data frame, called real_time_ineq_income_growth. The table below presents the first few rows of data.

real_time_ineq_income_growth <- read_rds(here("data", "real_time_ineq_income_growth.Rds"))
real_time_ineq_income_growth |> head(n = 20)

The graph is then created with the following code:

real_time_ineq_income_growth |> 
  ggplot(mapping = aes(x = year, y = real_factor_income_per_unit_growth, group = group.f, color = group.f)) +
  geom_line() +
  theme_minimal() +
  scale_y_continuous(label = scales::percent_format()) +
  labs(title = "Cumulative percent change in income since 1976 by income group",
       subtitle = "Income growth is calculated from dollar figures that have been annualized and adjusted for price inflation to March 2023 dollars.",
       y = "Cumulative percent change in income",
       x = "",
       caption = "Data retrieved from Realtime Inequality (https://realtimeinequality.org)",
       color = "Income group")

Much of the code to create this graph should be familiar to you. Let’s focus in on the parts that are new.

First, notice that in the aesthetic mappings, in addition to mapping the desired variables to the x- and y-axes, we now indicate that the variable group.f (which denotes our income groups) is a grouping variable and that we want to map group.f to color (specifically, line color, since that is the geometry we will use)⁵. ggplot2 automatically creates a legend. In the labs() function, we indicate color = "Income group" to provide a name for the legend (ggplot2 defaults to labeling the legend with the variable name, here “group.f”, which isn’t very informative). The placement of the legend can be moved from the default (i.e., to the right of the graph); click here to learn about how to change this as desired.

Let’s make a few modifications to enhance this graph.

First and foremost, notice that the subtitle is cut off. We can force a line break by adding the characters \n where a break is desired (here it is placed before the “and” in the subtitle). The \n character is a special character in R (and many other programming languages) that represents a new line: it tells R to start a new line at that point in the text. In the context of ggplot2, you can use \n in titles, labels, or text annotations to break the text into separate lines, which is useful when long titles or labels need to be formatted in a more readable way. Check the code below to see this addition, and then observe how the graph looks with the fixed subtitle.
```
real_time_ineq_income_growth |> 
  ggplot(mapping = aes(x = year, y = real_factor_income_per_unit_growth, group = group.f, color = group.f)) +
  geom_line() +
  theme_minimal() +
  scale_y_continuous(label = scales::percent_format()) +
  labs(title = "Cumulative percent change in income since 1976 by income group",
       subtitle = "Income growth is calculated from dollar figures that have been annualized \nand adjusted for price inflation to March 2023 dollars.",
       y = "Cumulative percent change in income",
       x = "",
       caption = "Data retrieved from Realtime Inequality (https://realtimeinequality.org)",
       color = "Income group")
```
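
Typing \n by hand works well for a single subtitle. As an alternative sketch, the str_wrap() function from the stringr package (part of the tidyverse) can insert the line breaks for you. The width of 80 characters is an assumption chosen for illustration; adjust it to suit the size of your figure.

```r
library(stringr)

long_subtitle <- "Income growth is calculated from dollar figures that have been annualized and adjusted for price inflation to March 2023 dollars."

# str_wrap() inserts \n so that each line stays within `width` characters
wrapped_subtitle <- str_wrap(long_subtitle, width = 80)

# cat() prints the text with the line break applied
cat(wrapped_subtitle)
```

You could then pass wrapped_subtitle to the subtitle argument of labs() instead of typing the text directly.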

Second, we can change the colors. I think the colors in this graph are perfectly fine — these are the default colors of ggplot2. However, we can change them to any other color desired. Let’s take a moment to learn about how to create custom colors. First, we can use one of the many pre-defined color packages for R. Here, I’ll show you color palettes from the RColorBrewer package. In the figure below, the set of letters on the left names the corresponding color palette. To use one of these palettes (let’s pick PuOr), we add: scale_color_brewer(palette = "PuOr") to our graph code.

real_time_ineq_income_growth |> 
  ggplot(mapping = aes(x = year, y = real_factor_income_per_unit_growth, group = group.f, color = group.f)) +
  geom_line() +
  theme_minimal() +
  scale_y_continuous(label = scales::percent_format()) +
  scale_color_brewer(palette = "PuOr") +
  labs(title = "Cumulative percent change in income since 1976 by income group",
       subtitle = "Income growth is calculated from dollar figures that have been annualized \nand adjusted for price inflation to March 2023 dollars.",
       y = "Cumulative percent change in income",
       x = "",
       caption = "Data retrieved from Realtime Inequality (https://realtimeinequality.org)",
       color = "Income group")

There are many R packages that offer attractive color palettes. Another package that is very easy to integrate with ggplot2 is viridis, which focuses on color blind-friendly colors. Alternatively, you can manually assign a color to each group. Click here to choose the colors that you like, noting either the color name or the HEX color code (I’ll show both options below). You can also search for colors and obtain HEX color codes here. Once your desired colors are selected, define an R object that assigns each level of the group (i.e., group.f) to a color. Then, add scale_color_manual(values = group.colors1) to the graph code. This tells R that you’d like to color the lines based on the colors you assigned to each group in the group.colors1 object.

# These two lines define the same colors,
# one with names, one with corresponding HEX values. 
# You just need one or the other.

group.colors1 <- c("Bottom 50%" = "darkslategrey", "Middle 40%" = "turquoise1", "Top 10% (excluding the Top 1%)" = "violetred", "Top 1% (excluding the Top .1%)" = "thistle3", "Top .1% (excluding the Top .01%)" = "darkseagreen", "Top .01%" = "darkorchid")

group.colors2 <- c("Bottom 50%" = "#2F4F4F", "Middle 40%" = "#00F5FF", "Top 10% (excluding the Top 1%)" = "#D02090",
        "Top 1% (excluding the Top .1%)" = "#CDB5CD", "Top .1% (excluding the Top .01%)" = "#8FBC8F", "Top .01%" = "#9932CC")


real_time_ineq_income_growth |> 
  ggplot(mapping = aes(x = year, y = real_factor_income_per_unit_growth, group = group.f, color = group.f)) +
  geom_line() +
  theme_minimal() +
  scale_y_continuous(label = scales::percent_format()) +
  scale_color_manual(values = group.colors1) +
  labs(title = "Cumulative percent change in income since 1976 by income group",
       subtitle = "Income growth is calculated from dollar figures that have been annualized \nand adjusted for price inflation to March 2023 dollars.",
       y = "Cumulative percent change in income",
       x = "",
       caption = "Data retrieved from Realtime Inequality (https://realtimeinequality.org)",
       color = "Income group")
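
If you would rather use a color blind-friendly palette than pick colors by hand, ggplot2 itself includes the viridis scales. The sketch below uses scale_color_viridis_d() (the _d stands for discrete) on a small made-up data frame. The toy data frame and its values are hypothetical and simply stand in for real_time_ineq_income_growth.

```r
library(ggplot2)

# A tiny made-up data frame: three groups observed in three years
toy <- data.frame(
  year   = rep(c(2000, 2010, 2020), times = 3),
  growth = c(0.00, 0.05, 0.10, 0.00, 0.15, 0.30, 0.00, 0.40, 0.90),
  group  = rep(c("Group A", "Group B", "Group C"), each = 3)
)

p_viridis <- toy |>
  ggplot(mapping = aes(x = year, y = growth, group = group, color = group)) +
  geom_line() +
  scale_color_viridis_d() +   # discrete viridis palette, one color per group
  theme_minimal() +
  labs(color = "Group")

p_viridis
```

To try this with the income data, swap scale_color_manual(values = group.colors1) for scale_color_viridis_d() in the graph code above.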

Substantively, how should we interpret this graph? To help you better understand, I constructed the table below. It shows the income in 1976 and 2022 (in 2023 dollars) for each income group. Consider the bottom 50%. Their income went from an average of 15,588 USD to 18,455 USD — a growth of about 18.4%. For the top .01%, their income went from an average of 5,414,213 USD to 42,115,401 USD — a growth of about 678%. This vast difference in income growth rates reveals staggering income inequality that continues to grow at an alarming rate in the U.S. These data underscore the need for policies that promote more equitable income and wealth distribution and economic opportunities for all strata of society.
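
The growth figures quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the four dollar amounts given in the text; the object names are my own.

```r
# Average income per adult reported in the text (in 2023 dollars)
income_1976 <- c("Bottom 50%" = 15588, "Top .01%" = 5414213)
income_2022 <- c("Bottom 50%" = 18455, "Top .01%" = 42115401)

# Percent growth = (new value - old value) / old value
growth <- (income_2022 - income_1976) / income_1976

# Express as percentages rounded to one decimal place
round(growth * 100, 1)
```

The results, 18.4 and 677.9, match the rounded values of about 18.4% and about 678% reported above.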

Scatterplot

A scatterplot is a type of data visualization used to display the relationship between two numerical variables (interval or ratio scales). Each point on the scatterplot represents an observation from the data frame and its position along the x-axis (horizontal) and y-axis (vertical) represents the values of two variables for that observation.

Scatterplots are often used to investigate if there is a correlation or pattern between the two variables. For instance, you might see a positive trend (where the points tend to go up from left to right) if one variable increases when the other one does. Conversely, you might see a negative trend (where the points tend to go down from left to right) if one variable decreases as the other one increases.

Take a look at the following plots that demonstrate different degrees of relatedness between the X and Y variables. These plots demonstrate a key concept in statistics known as correlation. Correlation quantifies the relationship between two variables. Correlation coefficients range from -1 to +1. A correlation of -1 indicates a perfect negative linear relationship: as one variable increases, the other decreases at a consistent rate. Conversely, a correlation of +1 indicates a perfect positive linear relationship: as one variable increases, so does the other. The closer the correlation is to zero, the weaker the relationship.

Looking at the plots, you can see how this works in practice. Notice the correlation coefficient printed at the top of each graph. The plots for correlation coefficients -1 and 1 show perfectly straight lines, illustrating perfect negative and positive relationships, respectively. As we move towards 0 from either end, the points start to spread out and the relationship between the variables becomes less clear. At a correlation near 0, there’s no discernible pattern, meaning that the variables do not move in any predictable way relative to each other.

It’s important to remember that correlation does not imply causation; just because two variables move together does not necessarily mean that one is causing the other to move. Nonetheless, understanding correlation is a fundamental aspect of many fields that rely on statistical analysis.
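
To build some intuition for these values, here is a small sketch that computes correlation coefficients with R's cor() function. All of the data are simulated for illustration; the variable names and the amount of noise are arbitrary choices.

```r
set.seed(123)  # makes the simulated noise reproducible

x <- 1:50

y_perfect_pos <- 2 * x + 3        # exact increasing straight line
y_perfect_neg <- -0.5 * x + 10    # exact decreasing straight line
y_noisy       <- x + rnorm(50, mean = 0, sd = 20)  # upward trend plus noise

cor(x, y_perfect_pos)   # perfect positive relationship
cor(x, y_perfect_neg)   # perfect negative relationship
cor(x, y_noisy)         # positive, but weaker than a perfect relationship
```

Increasing the sd value in rnorm() adds more scatter around the line and pulls the last correlation closer to zero.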

Scatterplots, such as those in the example graphs, can also reveal clusters of points, suggesting subgroups within the data, or outliers, which are points that sit apart from the main body of points (e.g., points that have an unusual score for the Y variable given their score on the X variable). In this way, scatterplots are a fundamental tool for statistical analysis — and we will work with scatterplots frequently as we move through the course.

So far in this Module, we examined graphs that depict the degree of economic inequality in the United States. In this example, let’s expand beyond the U.S. to consider economic inequality across countries and the extent to which economic inequality is related to social mobility. Social mobility refers to the ability of individuals or families to move up or down the social and economic ladder within a society — here, we focus on the ability for children born into poor families to move up in social class as they reach adulthood.

We’ll consider the Gatsby Curve, a concept in economics that illustrates the relationship between income inequality and social mobility across generations. The term was coined by economist Dr. Alan Krueger in 2012 and was inspired by the book The Great Gatsby, which offers a poignant and incisive depiction of the economic and social disparities within the American upper class during the opulent era of the 1920s.

The image below depicts the Gatsby Curve.

Click here to view a dynamic version of the image. And, please watch the video below by Dr. Miles Corak which further describes the Gatsby Curve and discusses its implications.

The Gatsby Curve demonstrates that in societies with high income inequality, there is less opportunity for social mobility. In other words, it’s harder for children from low-income families to move up the income ladder as adults, thereby perpetuating economic disparities across generations. The greater the income inequality in a society, the higher that society tends to sit on the Gatsby Curve, indicating reduced social mobility. This relationship depicted in the Gatsby Curve raises significant concerns about fairness, social cohesion, and economic stability, as it suggests that the “American Dream” (the idea that everyone has an equal opportunity to succeed) could be harder to achieve for those born into lower-income households in highly unequal societies.

Let’s now create a scatterplot that depicts the Gatsby Curve. To create our version of the Gatsby Curve, we’ll consider two variables for a set of countries.

The first variable is a measure of economic inequality called the Gini Coefficient. It is a statistical measure used to represent the income or wealth distribution of a nation’s residents. Named after the Italian statistician Corrado Gini who developed it in 1912, it is a popular tool for quantifying income inequality within a population. The Gini Coefficient ranges between 0 and 100. A Gini Coefficient of 0 represents perfect equality, where everyone has the same income or wealth. On the other hand, a Gini Coefficient of 100 signifies perfect inequality, where one person has all the income or wealth, and everyone else has none. In practice, most countries have a Gini Coefficient between 25 and 60. High income inequality countries, like South Africa and Brazil, tend to have a Gini Coefficient over 50, while more egalitarian countries like those in Scandinavia tend to have a Gini Coefficient under 30. The Gini Coefficient is a useful tool for economists and policymakers, helping them to understand inequality trends over time, and to compare income or wealth distribution across different countries or regions.
The second variable is intergenerational earnings elasticity (IEE). It is a measure used in economics to quantify the extent to which a person’s income is determined by the income of their parents. It’s a crucial indicator of social mobility. IEE is coded so that a score of 0 means that a person’s income is not related at all to their parents’ income, signifying perfect social mobility — that is, everyone has an equal chance to land anywhere on the income scale, regardless of their starting point. On the other hand, a score of 1 means that a person’s income is entirely determined by their parents’ income, indicating zero social mobility — in this scenario, if you’re born into a low-income family, you’re destined to remain in that income bracket, and the same holds true for those born into wealth.
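
To make the two endpoints of the Gini Coefficient concrete, here is a minimal sketch that computes it for a handful of made-up incomes. It uses one common formulation (the sum of absolute differences between all pairs of incomes, divided by twice the number of people squared times the mean income) rescaled to run from 0 to 100. gini_coefficient() is a helper written for this illustration, not a function from a package, and it is not necessarily how the values in our data frame were calculated.

```r
# Hypothetical helper: Gini Coefficient on a 0 to 100 scale
gini_coefficient <- function(income) {
  n <- length(income)
  # Absolute difference between every pair of incomes
  pairwise_diff <- abs(outer(income, income, FUN = "-"))
  100 * sum(pairwise_diff) / (2 * n^2 * mean(income))
}

equal_incomes   <- c(50, 50, 50, 50)   # everyone earns the same
unequal_incomes <- c(0, 0, 0, 200)     # one person earns everything

gini_coefficient(equal_incomes)    # 0: perfect equality
gini_coefficient(unequal_incomes)  # 75
```

With only four people, the most unequal case gives 75 rather than 100. Under this formula the maximum is 100 × (n − 1) / n, which approaches 100 as the population grows.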

The data frame that we’ll use includes three variables:

Variable	Description
country	The name of the country
iee	Intergenerational earnings elasticity
gini	The Gini Coefficient

The data frame is called gatsby.Rds. Let’s import the data frame and take a look at the contents.

gatsby <- read_rds(here("data", "gatsby.Rds"))
gatsby

We can use the following code to create the scatterplot.

gatsby |> 
  ggplot(mapping = aes(x = gini, y = iee)) +
  geom_point() +
  ggrepel::geom_label_repel(mapping = aes(label = country),
                            color = "grey35", fill = "white", size = 2, box.padding =  0.4, 
                            label.padding = 0.1) +
  theme_minimal() +
  labs(title = "The Great Gatsby Curve",
       subtitle = "High inequality tends to mean intergenerational economic immobility",
       x = "Gini Coefficient",
       y = "Intergenerational Earnings Elasticity")

Most of the code here should look familiar to you. A new element is the use of the ggrepel package’s geom_label_repel() function to label the data points. In this graph, it provides the name of the country for each point. If you’d like to learn more about all the options of the geom_label_repel() call in this graph, please read the Advanced Tip below.

Tip

Advanced Tip for the Curious

Here is an explanation of each argument in the following code snippet: ggrepel::geom_label_repel(mapping = aes(label = country), color = "grey35", fill = "white", size = 2, box.padding = 0.4, label.padding = 0.1)

mapping: This argument specifies the mapping for the label annotations. In this case, it is using the aesthetic mapping aes() function to map the “country” variable to the label parameter. This means that the label annotations will display the values of the “country” variable.
color: This argument determines the color of the label text. It is set to “grey35”, which is a shade of gray.
fill: This argument determines the fill color of the label background. It is set to “white”, indicating a white background.
size: This argument determines the size of the label text. It is set to 2; ggplot2 measures text size in millimeters, so this produces fairly small labels.
box.padding: This argument controls the padding around the label background box. It is set to 0.4 (ggrepel interprets a plain number here in units of lines of text).
label.padding: This argument determines the padding between the label text and the label background box. It is set to 0.1, again in units of lines.

NOTE: Another option would be to replace this code snippet with the following: geom_text(mapping = aes(label = country), vjust = -.1, hjust = -.1, size = 3, nudge_x = .1, check_overlap = TRUE)

This uses the geom_text() geometry native to ggplot2. However, I prefer the way ggrepel adds labels to data points in a scatterplot.

Let’s enhance this graph by adding a best fit line through the scatter of points. Adding a best fit line, or a trend line, to a scatterplot can be highly useful in understanding the relationship between two numerical variables. The line essentially condenses the overall pattern of the scattered points into a single, clear trajectory, making it easier to interpret the relationship. We can add this with one additional line of code, calling a new geometry with the geom_smooth() function: geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "darkorange"). method = "lm" specifies a linear model, and formula = y ~ x states that the y-axis variable is modeled as a function of the x-axis variable. We’ll dive into linear modeling later in the course and you’ll have a chance to explore how the best fit linear model is calculated. se = FALSE specifies that we don’t want the standard error band around the line to be displayed (we’ll also further explore this option later in the course when we begin studying statistical inference). color = "darkorange" is there just for appearances; the default line color is blue.

Here’s the updated code with these enhancements:

gatsby |> 
  ggplot(mapping = aes(x = gini, y = iee)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "darkorange") +
  theme_minimal() +
  labs(title = "The Great Gatsby Curve",
       subtitle = "High inequality tends to mean intergenerational economic immobility",
       x = "Gini Coefficient",
       y = "Intergenerational Earnings Elasticity")
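
As a small preview of what method = "lm" does, the sketch below fits the same kind of straight line with R's lm() function and prints its intercept and slope. The toy data frame is simulated for illustration and is not the Gatsby data; the numbers used to generate it are arbitrary.

```r
library(ggplot2)

set.seed(42)  # reproducible simulated data
toy <- data.frame(x = seq(from = 20, to = 60, length.out = 25))
toy$y <- 0.1 + 0.01 * toy$x + rnorm(25, mean = 0, sd = 0.05)

# Fit the straight line y = intercept + slope * x
fit <- lm(y ~ x, data = toy)
coef(fit)

# geom_smooth(method = "lm") draws this same fitted line on the plot
p_fit <- toy |>
  ggplot(mapping = aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "darkorange") +
  theme_minimal()

p_fit
```

The slope reported by coef(fit) is the change in y for a one-unit increase in x, which is what the steepness of the orange line represents.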

What are the substantive findings depicted in this graph? Recall that intergenerational earnings elasticity, our measure of social mobility and the variable on the y-axis, captures the degree to which parents’ income determines their children’s income. A higher score means that there is little opportunity for moving up the economic ladder. The Gini Coefficient (the variable on the x-axis), on the other hand, measures income inequality within a country: a higher Gini Coefficient indicates a higher level of income inequality (income is concentrated at the top).

From the data, it can be observed that countries with higher income inequality (higher Gini Coefficients), such as Peru, China, and Brazil, tend to have more intergenerational economic immobility, while countries with lower income inequality (lower Gini Coefficients), such as Denmark, Norway, and Finland, tend to have less.

This suggests a positive relationship between income inequality and the intergenerational earnings elasticity, indicating that higher income inequality is associated with greater intergenerational economic immobility. This is the primary assertion of the Gatsby curve and underscores the argument that addressing income inequality is essential for enhancing opportunities for all children to succeed, regardless of their family background.

Histogram

A histogram is a type of graphical representation that organizes a group of data points into specified intervals. It is an effective tool for displaying the frequency or proportion of data within different intervals, and it is commonly used in statistics to visually represent the distribution and variability of numerical (i.e., interval and ratio scales) data.

A histogram consists of rectangles, where the area of each rectangle is proportional to the frequency of a variable within the range, also known as a bin or class interval. The x-axis shows these bins, while the y-axis reflects the frequency (how many cases (e.g., people) are in the bin). For instance, in a histogram representing the ages of people in a sample, each bin could represent a decade of life. The height of the bar corresponding to each bin would indicate the number of individuals whose age falls within that decade. Histograms allow us to grasp the overall shape and distribution of a variable.

In this section, we will consider data from a study conducted by Raj Chetty, Ph.D. and colleagues. The academic paper describing the study we’ll consider can be downloaded here.

Please watch the following TED talk by Dr. Chetty, which gives a nice overview of the example we’ll consider.

Briefly, Chetty and colleagues compiled administrative records on the incomes of more than 40 million children and their parents to quantify social mobility across areas within the U.S. — called commuting zones. One of the key findings of the study is the substantial variability in social mobility across the U.S. For example, the authors found that the likelihood that a child reaches the top quintile⁶ (i.e., the top 20%) of the national income distribution starting from a family in the bottom quintile (i.e., the bottom 20%) is 4.4% in Charlotte, NC but 12.9% in San Jose, CA. The map below depicts how this likelihood of intergenerational social mobility varies across the U.S.

The map shows that the degree of social mobility is lowest for individuals raised in the Southeast, while it’s highest in the regions of the Mountain West and rural Midwest. There are certain commuting zones (CZs) in the U.S. that display levels of social mobility on par with countries known for their high mobility, like Canada and Denmark. On the other hand, some zones show mobility rates that are lower than those recorded in any other developed country for which data is accessible.

The data frame, called chetty_mobility.Rds, includes data on 709 commuting zones and includes the following variables:

Variable	Description
cz_name	The name of the commuting zone
state	The state the commuting zone belongs to
p_bottom_to_top	The likelihood of the child getting to the top income quintile given their parents were in the bottom income quintile.
abs_up_mobility	The mean income rank of children with parents in the bottom half of the income distribution.
gini	The Gini Coefficient
gini_rank.f	The Gini Coefficient binned into 5 equal parts (i.e., quintiles)

Let’s import the data and take a peek at the head of the data frame.

chetty <- read_rds(here("data", "chetty_mobility.Rds"))
chetty |> head(n = 12)

Let’s begin by creating a histogram of absolute upward mobility (i.e., the variable abs_up_mobility).

chetty |> 
  ggplot(mapping = aes(x = abs_up_mobility)) +
  geom_histogram(binwidth = 1, fill = "grey") +
  theme_minimal() +
  labs(title = "The distribution of absolute upward mobility across commuting zones",
       x = "Absolute upward mobility",
       y = "Frequency")

The new line of code here is geom_histogram(), the geometry for a histogram in ggplot2. Notice the argument binwidth = 1. The bin width is the range of values that each bar in the histogram represents, so a bin width of 1 means that each bar represents a range of 1 unit of absolute upward mobility. An alternative to binwidth is the bins argument. Instead of specifying the width of the bins, you can specify the number of bins you want in the histogram, for example, bins = 20 would create 20 bins of equal width. This means that the range of the data is split into 20 intervals, and the height of each bar in the histogram represents the number of data points that fall into each interval.
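
The two ways of controlling the bins can be compared side by side. The sketch below uses simulated scores rather than the Chetty data; the toy data frame, its mean, and its standard deviation are assumptions made purely for illustration.

```r
library(ggplot2)

set.seed(2024)  # reproducible simulated scores
toy <- data.frame(score = rnorm(500, mean = 43, sd = 5))

# Option 1: fix the width of each bar (every bar spans 2 units)
p_binwidth <- toy |>
  ggplot(mapping = aes(x = score)) +
  geom_histogram(binwidth = 2, fill = "grey") +
  theme_minimal()

# Option 2: fix the number of bars and let ggplot2 work out the width
p_bins <- toy |>
  ggplot(mapping = aes(x = score)) +
  geom_histogram(bins = 20, fill = "grey") +
  theme_minimal()

p_binwidth
p_bins
```

Both graphs show the same 500 values. Only the grouping into bars changes, so it is worth trying a few settings to see which one best reveals the shape of the distribution.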

In looking at the graph, we find that a large concentration of commuting zones have an upward mobility score between about 40 and 45. We also find a large range, with some commuting zones having a very low upward mobility score (i.e., less than 30) and some having a quite high upward mobility score (i.e., above 60).

Let’s enhance the histogram and consider how the distribution of the variable of interest, abs_up_mobility, differs across Gini quintiles. Let’s use facet_wrap() to produce one histogram per Gini Coefficient group. The arguments for facet_wrap() are the variable to facet by, and the number of columns and/or rows that you desire (i.e., ncol for columns, and nrow for rows).

chetty |> 
  ggplot(mapping = aes(x = abs_up_mobility)) +
  geom_histogram(binwidth = 1, fill = "lightskyblue4") +
  theme_minimal() +
  labs(title = "Differences in the distribution of absolute upward mobility across commuting zones",
       subtitle = "Faceted by Gini Coefficient Quintile",
       x = "Absolute upward mobility",
       y = "Frequency") +
  facet_wrap(~gini_rank.f, ncol = 1)

It’s clear to see that as the Gini Coefficient of the commuting zone increases (i.e., meaning the commuting zone has more inequality), the distribution of upward mobility shifts to the left. This means there tends to be less upward mobility in high inequality commuting zones. In other words, areas in the U.S. that have more inequality tend to provide less opportunity for children from poorer families to move up the income ladder as they become adults.

Density plot

A density plot is a graphical representation used for visualizing the distribution of a continuous (i.e., quantitative) variable (interval and ratio scales). It’s similar to a histogram but provides a smoothed curve rather than bars. A density plot can also help us to understand the shape of the data distribution and the degree of variability/spread/dispersion. Here’s a simple example that is similar to the first histogram we created.

chetty |> 
  ggplot(mapping = aes(x = abs_up_mobility)) +
  geom_density(fill = "grey") +
  theme_minimal() +
  labs(title = "The distribution of absolute upward mobility across commuting zones",
       x = "Absolute upward mobility",
       y = "Density")

Notice that the key difference from the histogram code is that geom_density() is used instead of geom_histogram().

In a density plot, the y-axis represents the estimated density of the data at different values of the variable plotted on the x-axis. It’s important to note that the values on the y-axis are not probabilities, but rather, density values. The density values give you an idea of how closely packed or spread out the data points are around specific values.

In our example, we can see that the highest concentration of upward mobility scores is between 40 and 45. Of course, this corresponds with the histogram of the same variable produced earlier.

Please watch the video below, which describes the differences between histograms and density plots — and builds up intuition for understanding each of them.

Here are some key points for interpreting the y-axis in a density plot:

Higher Values Indicate Greater Density: A higher value on the y-axis indicates that more data points are concentrated around the corresponding value on the x-axis.
Area Under the Curve: The total area under the density curve is equal to 1. This is akin to saying that the probability of a data point falling somewhere along the x-axis is 100%.
Comparing Densities: By comparing the heights of different regions of the curve, you can make inferences about the relative densities of data points. For example, if one part of the curve is higher than another, it means that data points are more densely packed around that value of the x-axis compared to where the curve is lower.
Units: The units of the y-axis are not counts or probabilities, but density, which can be read as the proportion of observations per unit of the x-axis. This sometimes causes confusion, as the values on the y-axis can be higher than 1, especially if the range of X is small.
Peaks and Troughs: Peaks in the density plot indicate modes or clusters within the data, whereas troughs indicate regions where there are fewer data points.
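
Two of these points (the area under the curve equals 1, and the y-axis can exceed 1) are easy to check numerically. The sketch below uses base R's density() function, which performs the same kind of kernel density estimation, on simulated data. The sample size and standard deviation are arbitrary choices for illustration.

```r
set.seed(1)  # reproducible simulated data

# Simulated values packed into a narrow range (standard deviation of 0.1)
x <- rnorm(1000, mean = 0, sd = 0.1)

# Estimate the density curve on a grid of x values
d <- density(x)

# The peak of the curve is well above 1 because the x-axis range is small
max(d$y)

# Approximate the area under the curve with the trapezoid rule
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area  # very close to 1
```

Even though the curve rises above 1, the total area stays at about 1 because the curve covers only a narrow stretch of the x-axis.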

Understanding the y-axis in density plots helps in analyzing the distribution of data, identifying the range of values where the data points are concentrated, and observing patterns such as skewness, multimodality, and spread. Density plots are often considered alongside histograms, and each has benefits. Histograms are more straightforward and easier to interpret, especially for those not well-versed in statistics. The bars represent raw counts or frequencies, which are intuitive to understand. Density plots use smoothing to create a continuous curve, which can be more appropriate for continuous data. This makes it easier to observe the underlying distribution and detect patterns. Importantly, when comparing multiple groups that differ in size, density plots have a particular advantage because they normalize the data (i.e., density plots are normalized so that the area under the curve equals 1), allowing for a better comparison of the shapes of the distributions regardless of the size of each group. We’ll see an example of this benefit of density plots versus histograms later in the course.

In the code chunk below, notice that I set colors for both the line of the density plot (specified as color = "azure3") and the fill of the density plot (specified as fill = "azure2").

chetty |> 
  ggplot(mapping = aes(x = abs_up_mobility)) +
  geom_density(color = "azure3", fill = "azure2") +
  theme_minimal() +
  labs(title = "The distribution of absolute upward mobility across commuting zones",
       x = "Absolute upward mobility",
       y = "Density")

As you can see, the information in this density plot is quite similar to the information in the histogram we created earlier. We can modify this density plot to create separate curves by Gini quintile by specifying the quintile as a facet.

chetty |> 
  ggplot(mapping = aes(x = abs_up_mobility, fill = gini_rank.f)) +
  geom_density(alpha = .5) +
  theme_minimal() +
  facet_wrap(~ gini_rank.f) +
  labs(title = "The distribution of absolute upward mobility across commuting zones",
       x = "Absolute upward mobility",
       y = "Density",
       fill = "Gini Coefficient Quintile")

Or, we can put them all onto one graph:

chetty |> 
  ggplot(mapping = aes(x = abs_up_mobility, fill = gini_rank.f)) +
  geom_density(alpha = .5) +
  theme_minimal() +
  labs(title = "The distribution of absolute upward mobility across commuting zones",
       x = "Absolute upward mobility",
       y = "Density",
       fill = "Gini Coefficient Quintile")

Note here that I added an argument to geom_density() called alpha — which refers to the opacity of the selected geometry. Alpha ranges from 0 to 1, with lower values corresponding to more transparent colors. You can play around with changing alpha to .2 or .8, for example, to see what looks best.

Boxplot

A box plot, also known as a box-and-whisker plot, is a graphical representation used to display the distribution of a continuous variable (interval or ratio scale), often across levels of a grouping variable (nominal or ordinal scale). Like histograms and density plots, box plots also display variability/spread/dispersion of the variable — but instead of the full range — box plots highlight the interquartile range. The geometry for a box plot in ggplot2 is geom_boxplot().

Here is an example of a box plot — in which we again consider absolute upward mobility across commuting zones and how it differs across Gini quintiles.

chetty |> 
  ggplot(mapping = aes(x = gini_rank.f, y = abs_up_mobility)) +
  geom_boxplot(fill = "yellow") +
  theme_minimal() +
  labs(title = "A boxplot of absolute upward mobility across Gini Coefficient rankings",
       x = "Quintiles of the Gini Coefficient",
       y = "Absolute upward mobility")

Here’s how to interpret the components of a box plot as printed by ggplot2:

Box: The main part of the plot is a box, which represents the interquartile range (IQR), which is the range within which the middle 50% of the data falls. The bottom of the box represents the first quartile (Q1, or the 25th percentile), and the top of the box represents the third quartile (Q3, or the 75th percentile).
Line inside the Box: There is a horizontal line inside the box which represents the median (Q2 or the 50th percentile) of the variable. This line divides the data into two halves.
Whiskers: Extending from the top and bottom of the box, there are lines known as “whiskers”. Each whisker extends from the edge of the box to the most extreme observed value that still falls within a defined range. For ggplot2, that range is Q3 + 1.5 times the IQR for the upper whisker and Q1 - 1.5 times the IQR for the lower whisker.
Outliers: Data points that are beyond the ends of the whiskers are often considered outliers and are usually plotted as individual points. These are data points that are outside the range of the whiskers.

Illustration of box plot components for the first Gini quintile
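The components described above can also be computed by hand. The following is a minimal sketch on a small made-up vector (not the Chetty data), assuming the quartiles are taken from quantile() and the whisker limit is 1.5 times the IQR, which is my understanding of the geom_boxplot() defaults.

```r
# Hypothetical values for one group (made up for illustration)
y <- c(2, 4, 5, 6, 7, 8, 9, 10, 11, 30)

# Box: first quartile, median, and third quartile
q <- quantile(y, probs = c(0.25, 0.50, 0.75))
iqr <- unname(q[3] - q[1])

# Fences: the farthest a whisker is allowed to reach
lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[3]) + 1.5 * iqr

# Outliers fall beyond the fences; whiskers end at the most extreme remaining values
is_outlier <- y < lower_fence | y > upper_fence
outliers <- y[is_outlier]
whisker_ends <- range(y[!is_outlier])

q
outliers
whisker_ends
```

For these values the box runs from 5.25 to 9.75 with the median at 7.5, the whiskers end at 2 and 11, and the value 30 is flagged as an outlier. Notice that the whiskers stop at actual observations rather than at the fences themselves.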

Box plots are widely used because they are a concise way to visualize the distribution and variability of a variable. They are particularly useful for comparing distributions across multiple groups or categories at once, and for identifying outliers. This plot displays similar information to the histogram that we created earlier — the one that also included the gini_rank.f as a grouping variable. This is simply a different way of examining the distribution of absolute upward mobility across the Gini Coefficient groups. Which one do you prefer?

Specialty plots (bonus material)

In this last section, I would like to introduce you to some specialty plots that require more advanced features of ggplot2 or additional R packages. As you embark on your journey of learning R, the immediate necessity of these specific types of graphs may not be apparent. However, with time, as you delve deeper into the realms of data analysis and visualization, you’ll likely find these visual representations increasingly valuable.

Scatterplot matrix

The GGally package has a function called ggpairs() that creates a scatterplot matrix. A scatterplot matrix is useful when you want to examine the pairwise relationships among three or more variables. For example, in the final aim of the Chetty and colleagues’ study described earlier, the authors examined factors correlated with upward mobility. They found that high upward mobility areas have (1) less residential segregation, (2) less income inequality, (3) better primary schools, (4) greater social capital, and (5) greater family stability.

Let’s use the ggpairs() function to create a scatterplot matrix that relates our outcome of interest (absolute upward mobility) to several variables from their data frame. The data frame is called chetty_cov.Rds. It includes data on 709 commuting zones and the following variables:

Variable	Description
cz_name	The name of the commuting zone.
state	The state the commuting zone belongs to.
abs_up_mobility	The mean income rank of children with parents in the bottom half of the income distribution.
seg_racial	The Theil index of racial segregation (a higher score means more segregation).
gini	The Gini Coefficient.
test_scores_incadj	Income adjusted standardized test scores for students in grades 3 through 8.
social_capital_index	A computed index comprised of voter turnout rates, the fraction of people who return their census forms, and various measures of participation in community organizations.
frac_singlemother	The proportion of households headed by a single mother.

Let’s import the data, and examine the first few rows:

chetty_cov <- read_rds(here("data", "chetty_cov.Rds"))
chetty_cov |> head()

To create the scatterplot matrix, we feed our data frame into the ggpairs() function. By default, the function will include all of the variables in the data frame. If you want a subset of these, you can specify them with the columns argument. The c() function here stands for concatenate or combine, and it gathers the variable names into a single vector. For this function, you must put the variable names in quotation marks. There is a second argument that I like to include: progress = FALSE. By default, ggpairs() prints progress information while it builds the matrix, and progress = FALSE suppresses this output.

chetty_cov |> 
  GGally::ggpairs(columns = c("abs_up_mobility", "seg_racial", "gini", "test_scores_incadj", "social_capital_index", "frac_singlemother"),
                  progress = FALSE)

With the default settings, ggpairs() produces the plot above — which is a grid of plots that allows you to visualize relationships between multiple variables at once. The diagonal running from the top-left to the bottom-right shows density plots of each variable. This gives you an idea of the distribution of each variable. Below the diagonal, you’ll find scatterplots, each of which represents the relationship between two of the variables. For instance, the plot in the ith row and jth column shows how the variable represented in the ith row is related to the variable in the jth column. Can you locate the scatterplot of abs_up_mobility and gini? Above the diagonal are the correlation coefficients that quantify the relationship between the variables. For example, the correlation between abs_up_mobility and gini is -.578.
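If you want the numbers shown above the diagonal without drawing the plot, base R's cor() function returns the same kind of pairwise correlations as a matrix. This is a small sketch using the built-in mtcars data frame as a stand-in (an assumption made so the example runs without the course files); the same pattern should apply to the columns of chetty_cov.

```r
# Stand-in data: three numeric columns from the built-in mtcars data frame
vars <- mtcars[, c("mpg", "wt", "hp")]

# Pairwise Pearson correlations, rounded to three decimals
cor_matrix <- round(cor(vars), 3)
cor_matrix

# Look up one pair by row and column name
cor_matrix["mpg", "wt"]
```

The matrix is symmetric with 1s on the diagonal, which is why a scatterplot matrix only needs to print each correlation once, above the diagonal.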

Pie chart

A pie chart is a circular graphic that is used to display quantitative data by dividing a circle into sections or slices. Each slice represents a proportion of the whole, and the size of each slice is proportional to the quantity it represents in the data frame. The entire circle represents the total sum of the data, and each slice represents a part or percentage of the total.

To demonstrate, let’s re-create a pie chart produced by the Pew Research Center, an independent, nonpartisan research organization that serves as a “fact tank,” focusing on providing data-driven insights to the public about key issues, attitudes, and trends that are shaping the world. Part of their mission is to conduct the American Trends Panel, a nationwide survey of Americans on relevant topics of concern. In 2019, one of their surveys focused heavily on income inequality. A report of their findings can be found here, and we’ll utilize some of the data from this survey in the remainder of this module. Let’s begin by replicating the data from the pie chart below, which is featured in their report:

Here is a data frame, called pew_too_much_ineq.Rds, that includes the data from the graph. The variable category represents the name of the slice, and percentage corresponds to the percentage displayed in the slice.

pew_too_much_ineq <- read_rds(here("data", "pew_too_much_ineq.Rds"))
pew_too_much_ineq

The highcharter package makes it easy to create a pie chart. Let’s load the package into our workspace and then create a pie chart of the data.

library(highcharter)

pew_too_much_ineq |>
  hchart(type = "pie", 
         hcaes(x = category, y = percentage, drilldown = category), 
         name = "Inequality") |>
  hc_tooltip(enabled = TRUE) |> 
  hc_title(text = "Most Americans say there's too much inequality in the U.S.") |> 
  hc_subtitle(text = "% saying there is ___ economic inequality in the country these days")

There are options to modify the colors and other features of the chart — which you can explore at the highcharter link provided above. One nice feature of the package is that it’s interactive — hover over the categories and it will show you the percentage in each group.
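If you need a static pie chart (for a PDF, say), one alternative is to stay within ggplot2. This is not the highcharter approach used above but a different, commonly used technique: draw a single stacked bar with geom_col() and wrap it around a circle with coord_polar(). The sketch below uses made-up categories and percentages, not the Pew values.

```r
library(ggplot2)

# Hypothetical slices and percentages (made up for illustration)
slices <- data.frame(
  category = c("Option A", "Option B", "Option C"),
  percentage = c(60, 25, 15)
)

# A single stacked bar, wrapped around a circle, becomes a pie chart
pie <- ggplot(slices, aes(x = "", y = percentage, fill = category)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  theme_void() +
  labs(title = "A static pie chart built with ggplot2", fill = NULL)

# Run print(pie) to draw the chart
```

The trade-off is that this version is not interactive, so there is no hover tooltip, but it uses the same layered grammar you have seen throughout the module.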

Stacked bar chart for Likert and Likert-type items

Likert items are statements used in surveys and questionnaires that allow respondents to indicate their level of agreement or disagreement on a scale. They are named after the psychologist Rensis Likert, who developed this approach as a way to measure attitudes and opinions.

Likert items typically present respondents with a statement, and respondents are asked to indicate their level of agreement with that statement on a scale. The scale usually includes a neutral or middle option, and an equal number of agreement and disagreement options on either side.

A common example of a Likert item scale is a 5-point scale, such as:

Strongly disagree
Disagree
Neither agree nor disagree (Neutral)
Agree
Strongly agree

Other variations include 7-point scales or other ranges. The options might be labeled differently, such as:

Very unhappy
Unhappy
Somewhat unhappy
Neither unhappy nor happy
Somewhat happy
Happy
Very happy

Likert and Likert-type items are widely used in social sciences, marketing, education, and other fields for research, feedback, and evaluations. They are valuable for understanding the attitudes, perceptions, and opinions of respondents.

The Pew Research Center American Trends Panel conducted in September of 2019 (Wave 54) focused on issues of economic inequality. We’ll explore these data with the next couple of graphs.

One of the question sets presented respondents with a set of issues facing the country:

Making health care more affordable
Reducing illegal immigration
Reducing economic inequality
Addressing climate change
Dealing with terrorism
Reducing gun violence

For each of these issues — respondents were asked: How much of a priority should each of the following be for the federal government to address?

In responding to each issue, respondents selected from one of 4 choices:

A top priority
Important, but lower priority
Not too important
Should not be done

For the graph that we will create, “Not too important” and “Should not be done” were combined.

The variables for the stacked bar chart that we will create include:

Variable	Description
issue	The name of the issue.
level	The level of importance of the issue.
n	The number of people selecting the level for the corresponding issue.
percentage	The percentage of people who selected the level for the corresponding issue.

Let’s read in the data frame, called pew_priorities_tally.Rds, and examine it.

pew_priorities_tally <- read_rds(here("data", "pew_priorities_tally.Rds"))
pew_priorities_tally
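A tally like this one, with a count (n) and a percentage for every combination of issue and level, can be built from respondent-level data. The sketch below shows one way to do that with dplyr on a tiny made-up set of responses. The responses data frame and its values are hypothetical, and published survey percentages are typically computed with survey weights, which this unweighted sketch ignores.

```r
library(dplyr)

# Hypothetical raw responses: one row per respondent per issue (made-up data)
responses <- tibble(
  issue = c(rep("Issue A", 5), rep("Issue B", 5)),
  level = c("A top priority", "A top priority", "A top priority",
            "Important, but lower priority",
            "Not too important/Should not be done",
            "A top priority",
            "Important, but lower priority", "Important, but lower priority",
            "Important, but lower priority",
            "Not too important/Should not be done")
)

# Count each issue-by-level combination, then convert counts to percentages within issue
tally <- responses |>
  count(issue, level) |>
  group_by(issue) |>
  mutate(percentage = 100 * n / sum(n)) |>
  ungroup()

tally
```

Because the percentages are computed within each issue, they sum to 100 for every issue, which is what lets the stacked bars all reach the same total length.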

Here is the code to create the graph:

group.colors <- c("A top priority" = "mediumseagreen", 
                  "Important, but lower priority" = "khaki", 
                  "Not too important/Should not be done" = "tomato")
  
pew_priorities_tally |> 
  ggplot(mapping = aes(x = issue, y = percentage, fill = level)) +
  geom_col(position = "stack", color = "black") +
  scale_fill_manual(values = group.colors) +
  labs(
    title = "Despite saying there is too much economic inequality, \nrelatively few people choose it as a top policy priority",
    x = NULL,
    y = "Percentage",
    fill = NULL
  ) +
  theme_minimal() +
  theme(legend.position = "right") +
  coord_flip()

Let’s break down the unfamiliar parts of the code:

The data frame pew_priorities_tally is passed to the ggplot() function. The variable issue is aesthetically mapped to the x-axis and the variable percentage is aesthetically mapped to the y-axis. The fill aesthetic (which determines the color of the bars) is mapped to the variable called level.
The geometry is a bar chart — specified using geom_col(). The position = "stack" argument indicates that the bars should be stacked on top of each other. The color = "black" argument sets the border color of the bars to black.
The code theme(legend.position = "right") positions the legend on the right of the plot.
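To see what stacking does to the numbers, here is a small sketch on made-up tallies (the toy data frame, its level names, and toy_colors are hypothetical). It builds the same kind of chart and then uses ggplot2's layer_data() function to inspect the values computed for each bar segment.

```r
library(ggplot2)

# Hypothetical tallies: two issues, three levels each, summing to 100 per issue
toy <- data.frame(
  issue = rep(c("Issue A", "Issue B"), each = 3),
  level = rep(c("High", "Medium", "Low"), times = 2),
  percentage = c(50, 30, 20, 40, 35, 25)
)

# Named vector: each level name is matched to its color
toy_colors <- c("High" = "mediumseagreen", "Medium" = "khaki", "Low" = "tomato")

p <- ggplot(toy, aes(x = issue, y = percentage, fill = level)) +
  geom_col(position = "stack", color = "black") +
  scale_fill_manual(values = toy_colors) +
  coord_flip()

# layer_data() returns the values ggplot2 computed for drawing the bars
bars <- layer_data(p, 1)

# The top of each stack is the largest ymax within each issue
stack_tops <- tapply(bars$ymax, bars$x, max)
stack_tops
```

Each segment starts where the previous one ends (its ymin and ymax), so the top of every stack lands at 100. The named color vector means the colors follow the level names, not the order of the rows.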

Substantively, what does this mean? In the pie chart that we created earlier, we saw that approximately 60% of adults in the U.S. believe that there is too much economic inequality in the nation. Nevertheless, when compared to other concerns, enacting policy to reduce economic inequality is not a foremost concern that the public wants the federal government to focus on. Some 43% say this should be a top priority, which is considerably lower than the share for other issues, like making health care more affordable (72%), addressing terrorism (64%), or curbing gun violence (61%).

Dumbbell plot

A dumbbell plot is a type of data visualization used to compare the values of two data points for each category or group. It is especially useful for illustrating the change between two time points or conditions.

In a dumbbell plot, each category or group is represented by a pair of dots connected by a line. The position of the dots represents the values of the two data points for each category, and the line connecting them helps to visualize the magnitude and direction of the change. Dumbbell plots are particularly useful for:

Tracking changes over time within categories
Comparing pre- and post-intervention data
Highlighting the differences between two groups or conditions
Making it easy to see which categories experienced growth or decline and by how much

In essence, dumbbell plots provide an alternative to paired bar charts or line graphs for displaying differences in values across two points (e.g., time, condition, group), while giving a clear view of the magnitude of the differences.

To demonstrate a dumbbell plot we’ll use data from the September 2019, Wave 54 American Trends Panel conducted by the Pew Research Center (and described earlier). As part of the survey, respondents were asked:

How much, if at all, do you think each of the following contributes to economic inequality in this country?

The following contributors were displayed:

The different life choices people make
Some people work harder than others
The growing number of legal immigrants working in the US
Discrimination against racial and ethnic minorities
Not enough regulation of major corporations
Some people start out with more opportunities than others
The tax system
Problems with educational system

For each contributor, respondents selected from one of 4 choices:

Contributes a great deal
Contributes a fair amount
Contributes not too much
Contributes not at all

The respondents taking the survey identified their political party affiliation — here, we will consider those who identified as 1.) Republican or 2.) Democrat.

The data are tabulated to calculate the percentage of Republicans and Democrats who indicated that the contributor contributes a great deal.

The variables for the dumbbell chart that we will create include:

Variable	Description
contributor	The name of the contributor.
reps	The percentage of Republicans who indicated the contributor contributes a great deal.
dems	The percentage of Democrats who indicated the contributor contributes a great deal.

The data frame is called pew_ineq_contributors.Rds.

pew_contrib <- read_rds(here("data", "pew_ineq_contributors.Rds"))
pew_contrib

To create a dumbbell chart the ggalt package is used for the geom_dumbbell() geometry, which is specifically designed to create dumbbell plots. We will also make use of the ggtext package, which will allow us to do something spiffy with the title.

library(ggalt)
library(ggtext)

pew_contrib |> 
  ggplot(mapping = aes(y = contributor, x = reps, xend = dems)) +
  geom_dumbbell(colour = "lightgrey",
                colour_x = "firebrick2", colour_xend = "dodgerblue",
                size_x = 3, size_xend = 3,
                dot_guide = FALSE) +
  labs(
    x = NULL, y = NULL,
    title = "Reasons for income inequality as perceived <br> by <span style = 'color:firebrick2'>Republicans</span> and <span style = 'color:dodgerblue'>Democrats</span>"
  ) +
  theme_minimal() +
  theme(plot.title = element_markdown())

Let’s break down the code step by step since much of this is new:

The data frame called pew_contrib is piped into ggplot() to initialize a ggplot object. In the aes() function, the y-axis is mapped to a variable called contributor, while the x-axis has two variables: reps for Republicans and dems for Democrats. The x and xend aesthetics specify the positions of the two points that are being compared for each category on the y-axis.
The geom_dumbbell() code line adds the dumbbell geometry. Here, colour specifies the color of the line connecting the dumbbell points, colour_x and colour_xend set the colors of the points for Republicans and Democrats, respectively, and size_x and size_xend set the dot size. Setting dot_guide = FALSE turns off the optional dotted guide line that would otherwise lead up to each dumbbell.
The x- and y-axes labels are set to NULL, indicating that they will not be displayed since they are not needed in this type of graph.
The title is set with HTML content to include colors matching the data points for Republicans and Democrats. The code line theme(plot.title = element_markdown()) enables the processing of markdown elements in the title. This is necessary because the title includes HTML tags for styling — in this case it colors certain words to correspond with the graph.

How might we interpret this graph? As described in the Pew Research Center report on these findings, Democrats are more prone than Republicans to highlight systemic issues, like the tax system, with 56% of Democrats compared to 30% of Republicans believing it considerably contributes to economic inequality. Additionally, 49% of Democrats, as opposed to 38% of Republicans, believe that shortcomings in the U.S. education system play a significant role in economic inequality.

On the other hand, Republicans are more inclined than Democrats to attribute economic inequality to individual choices and work ethics. For instance, 60% of Republicans, compared to 27% of Democrats, think that the diverse life decisions made by individuals substantially contribute to economic inequality. Moreover, 48% of Republicans believe that variations in individual work efforts significantly contribute to economic inequality, whereas only 22% of Democrats share this belief.
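The partisan gaps quoted above are simply the distance between the two dots on each dumbbell. As a small sketch, the code below enters the four pairs of percentages mentioned in the text by hand (the gaps data frame and the gap column are my own illustrative names) and computes the Democrat-minus-Republican difference.

```r
# Percentages quoted in the text above, entered by hand for illustration
gaps <- data.frame(
  contributor = c("The tax system",
                  "Problems with educational system",
                  "The different life choices people make",
                  "Some people work harder than others"),
  reps = c(30, 38, 60, 48),
  dems = c(56, 49, 27, 22)
)

# Positive gap: Democrats more likely to say "a great deal"; negative: Republicans
gaps$gap <- gaps$dems - gaps$reps

# Sort from the most Republican-leaning to the most Democrat-leaning contributor
gaps <- gaps[order(gaps$gap), ]
gaps
```

A column like this could also be used to order the rows of the dumbbell plot so that the widest gaps sit together, which often makes the pattern easier to read.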

The differing perceptions between Democrats and Republicans on the causes of economic inequality have implications for how the issue is approached and addressed in terms of policy-making and public discourse. When there are divergent views on the root causes of economic inequality, it becomes challenging for policymakers to reach a consensus on the solutions. Democrats may push for reforms in the tax and education systems, while Republicans might focus on individual responsibility and work ethic. This disparity in perspectives can lead to polarization in the policy proposals, where Democrats may advocate for progressive taxation, increased funding for education, and social safety nets, whereas Republicans might favor deregulation, reduced taxes, and incentives for personal responsibility.

The public’s engagement with and attitudes toward policies aimed at reducing economic inequality might also be influenced by their political leanings. For instance, individuals who believe that structural factors contribute to inequality might be more supportive of systemic reforms, whereas those who believe in personal choices as significant factors might oppose government intervention. The lack of agreement on the causes can result in delayed action, as policies might get stalled due to political disagreements. Moreover, when policies are eventually enacted, they might not be as effective if they do not address the complexities and multiple facets of economic inequality. Understanding that perspectives on economic inequality are diverse, there might be a need for more comprehensive solutions that take into account both structural factors and individual choices. Policies that strike a balance between systemic reforms and promoting personal responsibility may have a better chance of gaining bipartisan support. In summary, to effectively address economic inequality and its consequences, it’s essential to recognize and navigate the complexities of differing perceptions and beliefs, and to foster dialogue and collaboration across political lines to develop more comprehensive and informed solutions.

Maps

As a final graph — I’d like to show you an example of a choropleth map created by ggplot2 via the geom_polygon() geometry. Choropleth maps display divided geographical areas or regions that are colored, shaded, or patterned in relation to a variable. This provides a way to visualize how a measurement varies across a geographic area.

These applications are quite advanced, so here I will just demonstrate what’s possible so you can get a feel for the capability. In the code below I use data from the Chetty study described above, but this time the absolute upward mobility score is calculated at the county level rather than the commuting zone level. The resulting map shows how absolute upward mobility scores vary across counties and gives us a sense of which regions in the US have higher upward mobility. The Mountain West and rural Midwest regions seem to offer the best environments for supporting upward economic mobility.

chetty_map <- read_rds(here::here("data", "chetty_counties.Rds"))

chetty_map |> 
  ggplot(mapping = aes(x = long, y = lat, group = group, fill = abs_up_mobility)) +
  geom_polygon() +
  coord_map(projection = "albers", lat0 = 39, lat1 = 45) + 
  ggthemes::theme_map() +
  viridis::scale_fill_viridis(option = "turbo") +
  labs(title = "Absolute upward mobility across US counties",
       subtitle = "Data from Opportunity Insights: https://opportunityinsights.org/",
       fill = "Absolute upward mobility",
       caption = "Greyed areas have insufficient data"
       )

Saving your creations

Once you create a graph, you may be happy to display it inside your Quarto document. Other times, you may want to save the graph as a file. In this instance, the ggsave() function is your solution. There are many options for saving your graph; I will demonstrate some basic ones below.

chetty |> 
  ggplot(mapping = aes(x = abs_up_mobility, fill = gini_rank.f)) +
  geom_density(alpha = .5) +
  theme_minimal() +
  labs(title = "The distribution of absolute upward mobility across commuting zones",
       x = "Absolute upward mobility",
       y = "Density",
       fill = "Gini Coefficient Quintile")

ggsave(filename = "my_graph.png", width = 20, height = 20, units = "cm")

Here’s a breakdown of how ggsave() works. First, note that by default it saves the last graph produced. Use the filename argument to indicate what you want to call the file. Here, “my_graph.png” is the name of the file where the plot will be saved, and the .png extension specifies that the plot should be saved in PNG format. You can specify other formats by changing the file extension (e.g., my_graph.jpeg for JPEG). By default, the graph is saved where the .qmd file lives (i.e., alongside your Quarto analysis notebook); you can change that save location using the here() function. The arguments width = 20 and height = 20 set the dimensions of the saved plot, and units = "cm" specifies that those dimensions are in centimeters. Other options include “in” for inches and “mm” for millimeters.
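Relying on "the last graph produced" can be fragile if you make several plots in a row. A slightly more explicit pattern is to store the plot in an object and pass it to ggsave() through its plot argument, with the path argument naming the folder. The sketch below uses the built-in mtcars data and a temporary folder as stand-ins (both are assumptions so the example runs anywhere); in a project you might instead point path at a folder built with here().

```r
library(ggplot2)

# Store the plot in an object instead of relying on the last plot drawn
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_density(fill = "azure2") +
  theme_minimal()

# A temporary folder stands in for a project folder
out_dir <- tempdir()

# Save the named plot; dpi controls the resolution of raster formats like PNG
ggsave(filename = "my_graph.png", plot = p, path = out_dir,
       width = 20, height = 20, units = "cm", dpi = 300)

# Confirm that the file was written
file.exists(file.path(out_dir, "my_graph.png"))
```

Naming the plot object also makes it easy to save the same graph in more than one format by calling ggsave() again with a different file extension.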

Wrap up

In this Module we’ve explored many types of data visualizations. Starting with basic plots and progressing to more complex visualizations, we’ve equipped ourselves to transform raw data into insightful, understandable, and impactful visual stories.

Throughout this module, you’ve learned the essential layers and grammar of ggplot2, which provides a systematic approach to creating graphs. By manipulating these elements, you’ve seen how data visualization serves not only as a method of data exploration but also as a robust communication tool, allowing us to present complex information in an accessible and persuasive manner.

Additionally, we’ve delved into practical applications and best practices for crafting visual narratives that resonate with audiences, ensuring that our data not only informs but also engages. As you move forward, remember that the power of data visualization lies in its ability to highlight trends, patterns, and outliers, providing a foundation for informed decision-making and insightful analysis. Continue to experiment with different types of plots, explore new data frames, and refine your ability to discern and depict the stories hidden within the numbers.

Your journey through data visualization doesn’t end here. With the skills you’ve developed, you are well-prepared to tackle more advanced visualizations and to continue enhancing your proficiency in this essential domain of data science. Keep exploring, learning, and visualizing, and you’ll find that the world of data is rich with stories waiting to be told.

Resources

There are several excellent open source books and resources that will help you continue to grow your graphing skills:

Dr. Hadley Wickham’s ggplot2: Elegant Graphics for Data Analysis
Dr. Kieran Healy’s Data Visualization: A Practical Introduction
Dr. Claus Wilke’s Fundamentals of Data Visualization.
Here is a nice resource on general principles for creating scientific graphs.

That concludes the module on data visualization. You learned how to create many wonderful graphs, and hopefully you are starting to get familiar with how R code works.

Footnotes

The term “aesthetics” in ggplot2 is borrowed from the field of visual design and art, where it refers to the characteristics of an object that make it visually appealing or distinctive. These characteristics might include attributes such as color, shape, size, or texture, all of which are aspects that we can also control in a data visualization. In ggplot2, “aesthetics” are used to describe how variables in the data are mapped to visual properties in the plot. These visual properties can include position (x and y coordinates), color, shape, size, and others. By using the term “aesthetics”, the designers of ggplot2 underscore the importance of thoughtfully deciding how to make data visualizations not only informative but also visually engaging and easy to interpret. Essentially, using the term ‘aesthetics’ highlights the blend of art and science that is inherent in creating effective data visualizations, encapsulating the dual goal of making graphs both visually pleasing and informative.↩︎
There are two geometries that create bar charts in ggplot2: geom_bar() and geom_col(). The difference between the two lies in how they treat the y-values. By default, geom_bar() uses stat = "count", which makes the height of the bar equal to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights). In other words, it is useful for creating histogram-like plots where you’re counting the number of observations. If you want to plot pre-computed values, you have to supply stat = "identity". Calling geom_col() is equivalent to geom_bar(stat = "identity"). It makes the height of the bar equal to the value in the data. It is useful for when you have your data aggregated and you want to visualize it. The Y value is already calculated, and you merely want to represent it in a bar format. So, the primary difference between geom_bar() and geom_col() is whether you want to count the number of cases (with geom_bar()) or use a pre-existing calculated variable of counts (with geom_col()).↩︎
Line graphs that put a categorical variable on one of the axes can be quite useful in some circumstances. For example, consider the chart below from Alberto Cairo’s book The Functional Art: An Introduction to Information Graphics and Visualization (Figure 5.17).

↩︎
You might wonder: what does “Income growth is calculated from dollar figures that have been annualized and adjusted for price inflation to March 2023 dollars” really mean? This conveys that the calculation for income growth takes into account several factors to make the data more comparable and meaningful over time. Here’s the breakdown. Calculated from dollar figures means that the raw data used for the calculation is in terms of monetary values, likely representing incomes. Annualized implies that the income data, which may have been originally recorded on a different time scale (e.g., weekly, monthly, quarterly), has been converted to an annual rate. This is important for consistency and comparability, especially when examining trends over time. Adjusted for price inflation indicates that the dollar figures have been corrected for the effects of inflation. Inflation reduces the purchasing power of money over time, so 100 USD in the year 2000, for example, could buy more than 100 USD in 2023. Adjusting for inflation means converting all the dollar amounts into constant dollars of a specific year (in this case, March 2023) to account for changes in purchasing power. This process is known as converting nominal values into real values. To March 2023 dollars specifies the reference point in time to which the dollar figures are adjusted for inflation. By adjusting all the dollar figures to what they would be equivalent to in March 2023, it allows for a fair comparison of incomes across different years by removing the effects of inflation. In summary, this statement means that the income growth has been calculated in a way that ensures the data is on an annual basis and that it reflects the real value of income in March 2023 dollars, making it possible to accurately analyze trends and changes in income over time without the distortion caused by inflation.↩︎
There is an equivalent alternative to including this information. Rather than placing it on the primary ggplot() call — one could include it in the geom_line() call itself. To accomplish this, replace the geom_line() + call in the code chunk with: geom_line(mapping = aes(color = group.f)) +. The result will be the same.↩︎
A quintile is a statistical term referring to dividing a set of data into five equal parts. Each quintile contains 20% of the data points. Quintiles are often used in economics, statistics, and data analysis to understand the distribution of data, especially in the context of income, wealth, or other socioeconomic indicators. For example, when discussing income distribution, the data can be divided into quintiles to understand how income is distributed among the population. The first quintile represents the bottom 20% of the income distribution (the 20% of people with the lowest incomes), while the fifth quintile represents the top 20% (the 20% of people with the highest incomes). In a more technical sense, quintiles are the four cut points that divide a data set into five equal groups. These cut points are the 20th, 40th, 60th, and 80th percentiles.↩︎