A No-Code Introduction to Describing Data

Module 2

Artwork by @allison_horst

Learning objectives

  • Summarize the primary purpose of statistics
  • Identify a few questions that can be answered with statistics
  • Describe a situation where mathematical thinking is important
  • Discuss the relationship between theoretical constructs, measures, and variables
  • Describe methods for assessing a measure’s validity
  • Describe methods for assessing a measure’s reliability
  • List and describe common measurement scales
  • Contrast qualitative and quantitative variables
  • Define descriptive statistics
  • Define and describe common measures of central tendency
  • Define and describe common measures of dispersion
  • Describe the difference between crude and adjusted descriptive statistics

Overview

In this module, you will immerse yourself in the foundational concepts needed to summarize and describe data. Known as descriptive statistics, this indispensable area plays a key role in distilling complex and seemingly chaotic raw data into clear, comprehensible information. We can use descriptive statistics to paint a vivid picture of data’s key characteristics, thereby facilitating insightful understanding of patterns and trends. In this module, we will take a no-code approach. I want to ensure that you have a strong point of departure from which to build as you progress through this course.

Throughout this course, you will have the opportunity to solidify your knowledge of the content by watching episodes of Crash Course Statistics. Crash Course Statistics is known for breaking down complex concepts with engaging visuals and real-world examples. Statistics is all about making sense of data — and figuring out how to put that information to use. To introduce this idea, please watch this video that answers the question “What is Statistics?”

A critical goal of this course is to help students develop their number sense and ability to think mathematically. Mathematical thinking is a fundamental aspect of data literacy. Please watch the Crash Course Statistics video below that introduces the key concepts of mathematical thinking.

Measurement

In this course, we will work with a diverse range of data frames, each representing various topics and phenomena. Within these data frames, we will encounter an array of variables that capture different aspects of research in the social and behavioral sciences. While the majority of our time will be spent exploring and analyzing these collected variables, it’s important to recognize that behind every variable lies a complex web of decisions and methodologies that shape how it is measured. The way we assign numbers or labels to objects or phenomena can profoundly impact the validity and reliability of our findings. Thus, understanding the intricacies of measurement is essential to ensure that the data we work with accurately represent the underlying constructs we aim to investigate.

By critically examining the measurement approaches taken in data collection, we gain insight into the strengths and limitations of the variables we encounter. We can assess the reliability and validity of measurements, identify potential biases, and make informed interpretations of the data. Moreover, considering the measurement process allows us to understand the context in which the data were collected, providing valuable insights into the specific research design and methods employed.

Moreover, grasping the nuances of measurement empowers us to think beyond the mere act of data collection. It encourages us to be mindful of the context in which measurements are obtained, including potential biases, limitations, and sources of error. By considering the measurement process as an integral part of our analytical journey, we gain a more comprehensive perspective on the data and can make informed decisions when applying descriptive statistics and drawing conclusions.

What exactly do we mean by “measurement”? Measurement is a nuanced process that involves assigning numbers or labels to objects or phenomena. Let’s consider a few common measurements that we can collect from people:

  • “My height is 70 inches.” (numeric)
  • “I am married.” (label)
  • “I have 0 children.” (numeric)
  • “I am employed full time.” (label)
  • “I average 8 hours of sleep per night.” (numeric)

Each of these examples of measurement is relatively straightforward and can be measured with a finite set of labels or numerically. However, many of the concepts that we seek to measure in the social and behavioral sciences are more ambiguous. For example, an Industrial-Organizational Psychologist might be interested in the concept of work-life balance, a Clinical Psychologist might be interested in the concept of recovery from a traumatic event, or a Cognitive Psychologist might be interested in the concept of information processing bias. How do we measure such concepts?

Operationalization of our concepts

When we embark on the task of measuring something, it is necessary to engage in the process of operationalization. Operationalization involves transforming concepts of interest, which can initially be meaningful yet somewhat ambiguous, into precise and measurable variables. This process entails several key aspects that contribute to the development of a well-defined and reliable measurement:

  1. Precisely Defining What is Being Studied: Before measuring a concept, it is crucial to clearly define and articulate what exactly is being studied. This involves specifying the boundaries, characteristics, and components of the construct of interest. By providing a clear and comprehensive definition, researchers can ensure that the subsequent measurement aligns with their conceptual understanding.

  2. Determining the Measurement Method: Choosing an appropriate measurement method is crucial for capturing the desired information accurately. Researchers must decide whether to use self-reporting, authority reporting, or official records, depending on the nature of the construct and the research objectives. If self-reporting is employed, careful attention should be given to phrasing questions effectively, ensuring they are clear, unbiased, and capable of eliciting meaningful responses from participants.

  3. Establishing the Set of Allowable Values for the Measurement: While numerical values are common in many measurements, it is essential to consider the full range of potential response options. Some measurements may involve non-numerical options, such as categorical variables like marital status or gender. In such cases, researchers need to define the allowable categories and provide clear guidelines for participants to respond. It is crucial to consider how individuals may perceive and interpret these response options, ensuring that the measurement captures the intended information accurately.

By addressing these key aspects of operationalization, we as researchers can ensure that our measurements are well-defined, reliable, and aligned with the conceptual framework of the study. This thoughtful approach enhances the validity of the data collected and allows for more accurate and meaningful interpretations of the findings. Operationalization is an essential step in the research process that bridges the gap between abstract concepts and concrete measurements, enabling researchers to study and understand the complexities of the social and behavioral sciences.

Operationalization is a complex endeavor without a one-size-fits-all approach. The process depends on the specific research needs and may be influenced by established practices within the scientific community. The graphic below provides a general model of how we move from a theoretical construct to a variable in our data frame.

Validity of measurements

Measurement validity refers to the extent to which a measure accurately captures the construct or concept it intends to assess. It is a fundamental aspect of research design and plays a critical role in ensuring the reliability and meaningfulness of study findings. Establishing validity involves providing evidence that the measure is appropriate, relevant, and reliable for the intended purpose.

There are different types of validity that researchers consider when assessing the quality of their measures.

  • Content validity focuses on the representativeness and comprehensiveness of the items or questions in relation to the construct being measured. A measure with high content validity contains items that adequately and comprehensively represent the construct of interest. For example, a depression questionnaire with items covering various symptoms and domains related to depression would demonstrate strong content validity.

  • Construct validity examines the extent to which a measure aligns with theoretical expectations and hypotheses about the construct being measured. It assesses whether the measure is capturing the intended underlying construct accurately. Researchers typically employ multiple methods to establish construct validity, such as convergent validity (demonstrating that the measure is correlated with other measures of related constructs) and discriminant validity (showing that the measure is not strongly correlated with measures of unrelated constructs).

  • Criterion validity evaluates the extent to which a measure corresponds with a specific external criterion or gold standard. It is concerned with how well the measure predicts or correlates with an outcome or criterion of interest. There are two types of criterion validity: concurrent validity and predictive validity. Concurrent validity refers to the measure’s ability to distinguish between groups or individuals at the same time. Predictive validity, on the other hand, examines whether the measure can predict future outcomes or behaviors.

In addition to these types of validity, researchers must also consider the cultural and contextual validity of their measures. This involves assessing whether the measure is appropriate and applicable to the population and cultural context under investigation. It ensures that the measure is meaningful and interpretable within the specific cultural and social framework in which it is used.

Overall, establishing validity is a meticulous and iterative process that involves careful consideration of the measure’s content, theoretical alignment, and relationship to external criteria. By ensuring the validity of measures, we as researchers can have confidence in the accuracy and reliability of our findings, ultimately contributing to the advancement of knowledge in our respective fields.

Reliability of measurements

Measurement reliability is a crucial aspect of research design and is concerned with the consistency and stability of measured variables. A reliable measure yields consistent results when applied to the same individuals or objects under similar conditions. It ensures that the observed variations in scores reflect true differences or changes in the construct being measured, rather than measurement error or random fluctuations.

There are different types of reliability that researchers consider when assessing the quality of their measures.

  • Test-retest reliability examines the consistency of measurements over time. It involves administering the same measure to the same group of participants on two separate occasions and assessing the correlation between the scores obtained. A high test-retest correlation indicates that the measure produces consistent results over time.

  • Internal consistency reliability focuses on the consistency of items within a measure. It is commonly assessed using measures such as Cronbach’s alpha, which examines the intercorrelations among items within a scale. High internal consistency suggests that the items in the measure are measuring the same underlying construct consistently.

  • Inter-rater reliability assesses the consistency of measurements when different raters or observers are involved. It is particularly relevant in research involving subjective ratings or observations. Inter-rater reliability is measured by calculating the agreement or correlation between the ratings of different raters or observers. High inter-rater reliability indicates that the measure produces consistent results regardless of who is administering it.

  • Parallel forms reliability, also known as alternate forms reliability, is a measure of consistency that assesses the degree of agreement or correlation between two or more equivalent forms of a measure. It is particularly useful when researchers want to ensure that different versions of a measurement instrument produce consistent results.

Reliability is essential because it provides confidence in the consistency and stability of measured variables. A reliable measure reduces measurement error and increases the likelihood of obtaining accurate and meaningful results. Researchers can have greater confidence in the interpretations and conclusions drawn from reliable measures, knowing that the observed variations in scores are likely to be true reflections of the construct being measured.

It is important to note that reliability is necessary but not sufficient for validity. A measure can be reliable but not necessarily valid. While reliability ensures consistent results, validity ensures that the measure is measuring what it intends to measure accurately. Therefore, it is important for us to consider both reliability and validity when selecting and evaluating measures for our studies to ensure the robustness and meaningfulness of findings.

In summary, it is variables that we will focus on in this course. But, it’s important to keep in mind that the variables collated in your data frame should have emerged from a careful process of defining constructs and implementing rigorous methods for measuring those constructs.

Types of Variables

In research and practice, we encounter various types of variables. Understanding these different types is important for conducting proper data exploration and analysis.

Scales of Measurement

Variables are often categorized into different scales of measurement. These scales provide a framework for understanding the nature and properties of the variables we measure.

Nominal Scale

Variables measured on a nominal scale are categorical, representing different categories or groups without an inherent order or hierarchy. They're characterized by their qualitative nature and lack of numerical meaning; hence, nominal variables are often referred to as qualitative variables.

Examples include gender (e.g., man, woman, non-binary), ethnicity (e.g., Asian, African American, Hispanic), and political affiliation (e.g., Democrat, Republican, Independent).

Ordinal Scale

The ordinal scale adds a sense of order and rank among the categories. While maintaining the qualitative nature of nominal variables, ordinal variables allow us to rank categories based on their relative position or magnitude. Categories are ordered in an ordinal scale, but differences between categories aren’t necessarily equal or quantifiable.

Examples include education level (e.g., high school, bachelor’s degree, master’s degree) and satisfaction ratings (e.g., very satisfied, somewhat satisfied, neutral, somewhat dissatisfied, very dissatisfied).

Continuous Variables

Continuous or quantitative variables are numerical variables that can take an unlimited number of values within a given range. They provide detailed information as they can capture fine differences between individuals or observations.

Continuous variables fall into two main categories: interval and ratio variables. The difference lies in whether the variable has a true zero point.

Interval variables are those where the difference between two values is meaningful. The “intervals” between values are interpretable, allowing us to measure the degree of difference between values, but not the ratio of one value to another. A common example is temperature in Celsius or Fahrenheit. The difference between 20 and 30 degrees is the same as between 30 and 40 degrees. However, since zero doesn’t represent an absence of temperature, saying that 20 degrees is twice as hot as 10 degrees doesn’t make sense.

In contrast, ratio variables have a clear definition of zero. When the variable equals zero, there is none of that variable. Hence, it’s sensible to talk about one value being twice as large as another. An example is weight measurements. For instance, a weight of 0 kg represents an absence of weight, and something that weighs 6 kg is indeed twice as heavy as something that weighs 3 kg.

In many situations, particularly in social and behavioral sciences, interval and ratio variables are treated similarly because most analyses don’t distinguish between the two. Still, understanding the difference is vital in certain contexts, such as performing multiplicative operations.

Examples include variables like height, weight, time, and temperature.

Other Distinctions

Let’s explore other distinctions that we will encounter in this course.

Binary Variables

Binary variables, also known as dichotomous variables, are a type of nominal variable with only two possible values or categories, representing the simplest type of categorical variable. The two categories could be “yes” and “no”, “true” and “false”, “success” and “failure”, “passed” and “failed”, “present” and “absent”, or any other pair of mutually exclusive categories.

When encoding binary variables numerically for use in statistical models, we often use the numbers 0 and 1. This numerical coding, often referred to as binary or dummy coding, doesn’t imply any quantitative relationship between the categories. For example, in a binary variable representing sex, assigning “male” as 1 and “female” as 0 doesn’t mean that “male” is greater than or equal to “female” in any numerical sense.
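Although this module is otherwise no-code, here is a brief, optional R sketch of dummy coding; the vector name sex and its values are hypothetical and only for illustration.

# Optional R preview (hypothetical vector named sex): dummy coding a binary variable.
sex <- c("male", "female", "female", "male")
sex_male <- ifelse(sex == "male", 1, 0)  # 1 = male, 0 = female; the 0/1 values are labels, not quantities
sex_male                                 # returns 1 0 0 1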

Discrete Variables

Discrete variables are quantitative variables that can take on a countable number of values, often representing counts of things, like the number of objects in a set, or events in a time interval. The key characteristic of a discrete variable is that between any two values, there’s a finite, countable number of other possible values. Discrete variables can only take on certain specific values and not any values in between.

For example, the number of children in a family is a discrete variable because it can only take on whole numbers like 0, 1, 2, 3, etc. You can’t have 2.5 children. Similarly, the number of visits to a website, the number of cars in a parking lot, or the number of defects in a product are examples of discrete variables.

Discrete variables contrast with continuous variables, which can take on an infinite number of values within a certain range. For example, temperature, time, height, weight, and distance are all continuous variables because they can be measured to any level of precision and can take on any value within a certain range.

Introduction to the data

In the remainder of this Module, we will explore results from the National Health and Nutrition Examination Study (NHANES). This national study is conducted by the Centers for Disease Control and Prevention, and is one of their surveillance initiatives to monitor the health and morbidity of people living in the US.

Public health surveillance programs like the NHANES are essential for multiple reasons, as they play a critical role in understanding, monitoring, and improving population health.

  • Data Collection and Analysis: NHANES collects data through interviews and physical examinations to assess the health and nutritional status of adults and children in the United States. This data is vital for understanding various health conditions, dietary habits, and risk factors within the population.

  • Identifying Trends and Patterns: Surveillance programs help identify emerging health trends and patterns over time. For example, through NHANES, it is possible to observe changes in obesity rates, dietary intake, or prevalence of chronic diseases such as diabetes or hypertension. Recognizing these trends is crucial for public health planning and resource allocation.

  • Informing Policy and Programs: The data gathered through surveillance programs like NHANES is essential in informing public health policies and programs. Policymakers can use this data to understand which health issues need prioritization and to create evidence-based policies and interventions that address the identified needs.

  • Evaluating Interventions: Surveillance data is not only useful for informing policies and programs but also for evaluating their effectiveness. Through continuous monitoring, it is possible to assess whether implemented interventions are having the desired impact on population health and make necessary adjustments.

  • Identifying Health Disparities: NHANES and similar programs often collect data on various demographic groups, which is crucial in identifying and understanding health disparities among different populations. This information is vital for developing targeted interventions to address health inequalities.

  • Responding to Health Emergencies: Surveillance programs are crucial in identifying and responding to health emergencies such as outbreaks, epidemics, or other public health crises. For instance, monitoring trends in infectious diseases can help public health officials detect outbreaks early and respond more effectively.

  • Educating the Public and Health Professionals: The findings from NHANES and similar programs are often disseminated through various channels and used to educate both the public and health professionals about important health issues. This education is critical for promoting preventive health behaviors and informed decision-making.

  • Global Health Comparisons: Data from national surveillance programs like NHANES can be used in international comparisons to help understand how the health of the population in one country compares to others. This can be useful for global health initiatives and collaborations.

In conclusion, public health surveillance programs like NHANES are fundamental components of a robust public health system. They provide the necessary information and tools for public health officials, policymakers, healthcare professionals, and the general public to make informed decisions and take actions that promote and protect the health of populations.

In this module, we will use NHANES data collected during 2011-2012. NHANES data are publicly available for download; click here for more information.

We’ll consider several variables from NHANES:

Variable         Description                                                             Type
age              Age of respondent                                                       interval
marital_status   Marital status of respondent (only ascertained for age >= 20)          nominal
education        Highest level of education completed (only ascertained for age >= 20)  ordinal
SBP              Systolic Blood Pressure in mm Hg (only measured for age >= 8)          ratio


Here is a glimpse of the data. Notice that age and SBP are listed as the R data type "double" (<dbl>), while marital_status and education are listed as the R data type "factor" (<fct>).

Rows: 5,000
Columns: 4
$ age            <dbl> 14, 43, 80, 80, 5, 34, 80, 35, 17, 15, 57, 57, 57, 57, …
$ marital_status <fct> NA, Single, Married, Married, NA, Married, Widowed, Mar…
$ education      <fct> NA, High School Graduate or GED, College Graduate, Coll…
$ SBP            <dbl> 107, 103, 97, 97, NA, 107, 121, 107, 108, 113, 110, 110…


Here are the first few rows of data for you to peruse. Notice that some people have an NA recorded for their scores. NA is R's system missing indicator, meaning the score is missing/unknown. For example, marital_status and education weren't ascertained for people under 20, while SBP wasn't measured for children under 8. These values are missing by design. In other cases, data are missing for other reasons (e.g., a person preferred not to answer a question, or didn't wish to have biological measures taken). These values aren't missing by design, and the reasons were likely out of the researchers' control.



Systolic blood pressure (SBP), the top number in a blood pressure reading, signifies the force with which your heart pumps blood around your body. It’s an essential marker of cardiovascular health. High SBP, also known as systolic hypertension, can indicate a higher risk of developing serious health conditions such as heart disease, stroke, kidney diseases, and other cardiovascular complications. Given that heart disease and stroke are among the leading causes of death worldwide, the importance of maintaining healthy SBP cannot be overstated.

Describing qualitative data

When describing qualitative data (nominal and ordinal scales), the emphasis is on summarizing and presenting the distribution and characteristics of the different categories or ranks within the data. For example, among the 3,587 adults in the NHANES sample (i.e., age >= 20), questions pertaining to marital status (a nominal variable) and education (an ordinal variable) were ascertained.

We can describe the number (and percentage) of adults falling into each category for these variables. The table below accomplishes this.

Characteristic N = 3,587¹
Marital status
    Divorced 352 (9.8%)
    Live with Partner 294 (8.2%)
    Married 1,896 (53%)
    Single 737 (21%)
    Separated 84 (2.3%)
    Widowed 222 (6.2%)
    Unknown 2
    Highest level of education achieved
    8th Grade or Less 212 (5.9%)
    9 - 11th Grade 405 (11%)
    High School Graduate or GED 679 (19%)
    Some College 1,160 (32%)
    College Graduate 1,128 (31%)
    Unknown 3
¹ n (%)

For example, 352 of the 3,587 NHANES adults were divorced. The percentage displayed in this table provides the percentage of divorced adults based on the number of people who responded to this question — which is 3,587 minus the 2 unknown (i.e., missing) cases or 3,585. To replicate the percentage, we take the number of people divorced divided by the number of people observed for marital status — that is 352/3,585 = .098, then we multiply this proportion by 100 to arrive at the percentage. That is, 9.8% of adults were divorced at the time of the study.
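If you are curious how this could be reproduced in R (we stay no-code in this module, so treat this as an optional preview), a minimal sketch might look like the following; the data frame name nhanes_adults is an assumption made only for illustration.

# Optional R preview (assumes a data frame named nhanes_adults with a marital_status column):
# counts and percentages among respondents with observed marital status.
counts <- table(nhanes_adults$marital_status)  # missing (NA) values are excluded by default
percents <- 100 * counts / sum(counts)         # divide by the number observed (3,585)
round(percents, 1)                             # e.g., Divorced should be about 9.8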

We might alternatively describe these data in a graph — for example a pie-chart is often used to describe nominal variables. The pie-chart below describes the data for marital status — again, among those with observed/known/non-missing (all terms for the same idea) data.

Describing quantitative data

When it comes to quantitative data (interval and ratio scales), the options for descriptive statistics become much greater, and there is more that we can do to describe the data.

In this section, we will work with data that describes the SBP of participants in NHANES. A total of 4,281 participants have SBP observed in the data frame. The SBP of participants is expressed in the units of mm Hg — and is considered a continuous or quantitative variable.

Before delving into the intricate statistics that describe systolic blood pressure (SBP), let’s first build a foundational understanding of this variable through visualization. A practical approach to summarizing SBP is by constructing a frequency table, which efficiently organizes the data into intervals or bins. This method is particularly useful for quantitative variables like SBP, allowing us to categorize individuals based on their blood pressure readings.

For this exercise, we’ll categorize SBP into bins with a width of 5 mm Hg, starting from the minimum observed value of 79 mm Hg. Each bin represents a range of 5 mm Hg, grouping individuals whose SBP falls within these intervals. The table includes three key pieces of information for each bin:

  • The bin: The notation used in intervals, such as [79,84), specifies the range of values each interval includes, with brackets and parentheses indicating whether the endpoints are included in the interval. For example, [79,84) means that the interval starts at 79 and goes up to, but does not include, 84. The last interval, [219,224], is slightly different in that both endpoints are included, as indicated by the square brackets on both ends. This means all values from 219 up to and including 224 are part of this interval.

  • Frequency (labelled frequency): This column indicates the total number of individuals whose SBP falls within a specific bin. For instance, if the bin spans from 79 to 84 mm Hg, and we have 14 individuals in this range, it means these 14 people have SBP values between 79 and 84 mm Hg.

  • Relative Frequency (labelled relative_frequency): This represents the proportion of the total population found within each bin. To illustrate, if there are 14 individuals in the first bin and the total sample size is 4,281, then the relative frequency for this bin is \(\frac{14}{4281} = 0.0033\). Converting this proportion to a percentage provides a clearer picture: \(0.0033 \times 100 = 0.33\%\). Thus, 0.33% of the sample has an SBP within the range of 79 to 84 mm Hg.

By organizing SBP into bins and analyzing the frequencies and relative frequencies, we can easily visualize the distribution of SBP across the population. This step is foundational for understanding the overall behavior of SBP before advancing to more complex statistical analyses.
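For readers who want a preview of how such a frequency table might be built in R (not required for this no-code module), here is a minimal sketch; it assumes a numeric vector named SBP holding the 4,281 observed values.

# Optional R preview (assumes a numeric vector SBP): bin SBP into 5 mm Hg intervals
# starting at 79, then tabulate frequencies and relative frequencies.
breaks <- seq(79, 224, by = 5)
bins <- cut(SBP, breaks = breaks, right = FALSE, include.lowest = TRUE)  # [79,84), [84,89), ..., [219,224]
frequency <- table(bins)
relative_frequency <- frequency / sum(frequency)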


Instead of a table, we can also present the information in a bar graph:

Notice that the height of each bar records the corresponding value in the table. The top graph puts frequency on the y-axis (also called the vertical axis), while the bottom graph puts relative frequency on the y-axis. Notice that the two graphs look the same except for the y-axis/vertical scale. For example, for the first group, [79,84), the height of the bar in the top graph is 14 on the y-axis, indicating 14 people are in this bin. The height of the bar in the bottom graph is 0.33%, indicating 0.33% of the 4,281 study participants are in this bin.
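A corresponding bar graph can be drawn in one line; this optional sketch assumes the frequency table created in the sketch above.

# Optional R preview: a bar graph of the binned frequencies (uses the frequency table from the earlier sketch).
barplot(frequency, las = 2, xlab = "SBP bin (mm Hg)", ylab = "Frequency")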

Some notation

Before we explore the common statistical measures used to describe quantitative variables, it’s beneficial to establish some basic notation. Consider a dataset composed of a series of observations, such as systolic blood pressure (SBP) readings from the NHANES study. Assuming there are n observations for variable X (i.e., SBP) in total, we can represent these observations as follows:

\(x_1, x_2, x_3, ... x_n\)

Here, \(x_1\) is the SBP score for the first person in the data frame, and \(x_n\) is the SBP score for the last person in the data frame. This concise mathematical formulation allows us to succinctly refer to any individual observation within the dataset.

In statistical and mathematical contexts, it is quite common to use uppercase letters to denote variables that represent a whole dataset (i.e., the variable that represents all of the SBP scores for the participants — X) or a series of values, and lowercase letters to denote individual elements or observations (\(x_i\)) within that dataset (i.e., the \(i_\text{th}\) person’s score is denoted as \(x_i\), in other words, the first person’s score for SBP is referred to as \(x_1\)).

Measures of Central Tendency

Central tendency measures are used to summarize a variable with one representative number. The term “central tendency” gets its name because it’s all about identifying the center or the middle value of a variable. Imagine you’re faced with a large set of values for some variable of interest - for example, the systolic blood pressure for the 5000 NHANES study participants. That’s quite a lot of information — how can we describe those scores in a succinct way?

The central tendency is like a representative or an ambassador for this group. It’s the go-to number that you can point to and say, “This value right here kind of sums up what’s typical for the whole group.” So, “central” in “central tendency” means we’re focusing on the middle, and “tendency” implies that this is a typical or representative value that the data tends to hover around. It’s called “central tendency” because it helps us understand what’s common or typical by looking at the central or middle values of the variable.

Measures of central tendency help us to condense the information in a variable into a single value, which you can then use to describe the general character of the variable. However, it’s essential to choose the appropriate measure for the data you’re examining.

There are different ways to pinpoint this central value. The three most common measures of central tendency are:

  • The mode (the number that shows up the most)
  • The median (the middle number when everything is lined up in order)
  • The mean (average)

Each of these measures gives you a different take on where the center of the data lies. Let’s see how they’re each calculated.

Median: The Middle Ground

The median represents the middle score for the variable. To find the median, first, arrange the numbers from smallest to largest and then identify the number that has an equal count of numbers above and below it. Let’s consider these 9 scores.

SBP
120
135
128
152
180
140
138
122
158

Let’s take a look at the scores arranged from smallest to largest:

SBP
120
122
128
135
138
140
152
158
180

The median is 138 for these 9 scores, as there are four scores below it and four scores above it.

What if there’s an even number of data points? Consider these 8 scores arranged from smallest to largest.

SBP
120
122
128
135
138
152
158
180

Here, there’s no single middle score, so the median is calculated as the average of the two middle numbers, \((135 + 138) / 2 = 136.5\).

Unlike the mode, the median will always represent the middle of the variable’s distribution.
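As an optional preview, R's built-in median() function reproduces both results.

# Optional R preview: the medians computed above.
sbp_9 <- c(120, 135, 128, 152, 180, 140, 138, 122, 158)
median(sbp_9)  # 138
sbp_8 <- c(120, 122, 128, 135, 138, 152, 158, 180)
median(sbp_8)  # 136.5 (average of the two middle values)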

Mean: The Balancing Point

The mean, or average, is calculated by adding all the scores for a variable and then dividing by the number of data points. Here’s the formula:

Mean = (Sum of all scores) / (Total number of scores)

We can also write it in more formal statistical notation:

The mean, often denoted as \(\bar{x}\), of a series of observations \(x_1, x_2, x_3, \ldots, x_n\) is given by:

\[\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\]

where:

  • \(n\) is the total number of observations.

  • \(x_i\) represents the \(i^{th}\) observation in the dataset.

  • \(\sum_{i=1}^{n}\) denotes the sum of all observations from \(1\) to \(n\).

Now, let’s calculate the mean. For example, consider the following set of 10 scores:

SBP
140
115
125
132
118
142
129
148
130
115

\[\bar{SBP} = \frac{1}{10}\sum_{i=1}^{10} SBP_i = \frac{1294}{10} = 129.4\]

The mean, or average, can be seen as the “balance point” of a set of numbers.

However, it is critical to recognize a major drawback of the mean — outliers (scores that are substantially different from the other scores) can significantly influence the mean, making a mean calculated with outliers potentially unrepresentative of the center of the data. For example, what if we add a person with an extremely high SBP?

SBP
140
115
125
132
118
142
129
148
130
115
220

Now, the mean \(= 1514 / 11 = 137.6.\) The mean is highly influenced by one person with a very high score.
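As an optional preview, R's mean() and median() functions make the outlier's influence easy to see.

# Optional R preview: one extreme score pulls the mean noticeably; the median barely moves.
sbp_10 <- c(140, 115, 125, 132, 118, 142, 129, 148, 130, 115)
mean(sbp_10)            # 129.4
mean(c(sbp_10, 220))    # about 137.6
median(sbp_10)          # 129.5
median(c(sbp_10, 220))  # 130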


To summarize, let’s consider all of the individuals in the NHANES sample. For each observed score, we can count how many people have that value of SBP. Here’s a listing of the 114 unique scores for SBP observed in the data frame and the number of people with each of those scores in the sample:



We can also create a graph that show this same information. Here, on the x-axis (the bottom of the graph) all of the observed scores of SBP are displayed. On the y-axis (the left side of the graph) the number of people with the corresponding score for SBP are displayed.



There are two modes for SBP — scores of 110 and 114. For each of these SBP scores, 137 people had this score for their SBP. These are the highest two bars on the graph (denoted in blue). The median, the middle score, is 116. The mean, the average score, is just slightly higher at 119. Notice that there are more extreme scores for SBP on the higher end of the distribution (i.e., there are a few people way out in the right tail of the distribution, with SBP scores over 200). Because the mean is highly influenced by large scores, the mean is pulled toward that tail, and is therefore larger than the median. The median is not influenced by outliers.


To summarize, the Mode, Median, and Mean are measures of central tendency that summarize a set of scores (data points) by identifying a central or typical value within that dataset.

  • Mode is the value that appears most frequently in a dataset.
  • Median is the middle value when the data are ordered from smallest to largest.
  • Mean (arithmetic average) is the sum of all the values divided by the number of values.

In the examples thus far, we computed these estimates using the full data frame. But, we can accomplish the same task with a summary of the data. Suppose we have a summarized dataset of SBP from a group of individuals, rather than individual data points. The table below provides summarized data from 15 people. The first column provides 5 levels of SBP observed among the 15 people, and the second column provides the number of people with the corresponding SBP (often referred to as a weight). For example, among these 15 people, 2 of them have a SBP of 110.

Systolic Blood Pressure (mm Hg) Number of People
110 2
120 5
130 3
140 4
150 1

Using this table, let’s calculate the mode, median, and mean.

Calculating Mode

In this dataset, the weighted mode is straightforward: it’s the blood pressure reading with the highest frequency (number of people), which is 120 mm Hg (5 people).

Calculating Median

To find the weighted median, we need to understand that there are a total of 15 observations (110, 110, 120, 120, 120, 120, 120, 130, 130, 130, 140, 140, 140, 140, 150). The median is the middle value, so we need the 8th value when all observations are ordered. Using the counts, we see:

  • The first 2 values are 110.
  • The next 5 values are 120.
  • The 8th value falls into the group with a blood pressure of 130 mm Hg.

Thus, the median SBP, considering the weights, is 130 mm Hg.

Calculating Mean

To calculate the weighted mean:

  1. Multiply each blood pressure value by the number of people with that reading.
  2. Sum these products.
  3. Divide by the total number of people.

Mathematically, this looks like:

To calculate the weighted mean:

\[ \text{Mean} = \frac{(110 \times 2) + (120 \times 5) + (130 \times 3) + (140 \times 4) + (150 \times 1)}{2 + 5 + 3 + 4 + 1} \]

Expanding and simplifying the calculation:

\[ \text{Mean} = \frac{(220) + (600) + (390) + (560) + (150)}{15} \]

Finally, calculating the value:

\[ \text{Mean} = \frac{1920}{15} = 128 \]

Thus, the weighted mean SBP for these 15 individuals is 128 mm Hg. If you calculate the mean in the usual way using all 15 data points, you will find that the same mean is obtained — 128 mm Hg.
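As an optional preview, the same weighted calculations can be done in R, either by expanding the summary back into 15 scores with rep() or by using weighted.mean() directly.

# Optional R preview: central tendency from summarized (weighted) data.
sbp_levels <- c(110, 120, 130, 140, 150)
n_people <- c(2, 5, 3, 4, 1)
all_scores <- rep(sbp_levels, times = n_people)  # the 15 individual values
median(all_scores)                               # 130
mean(all_scores)                                 # 128
weighted.mean(sbp_levels, w = n_people)          # 128, without expanding the data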

This example demonstrates how to calculate the mode, median, and mean from a summarized dataset, incorporating the concept of weighting. This approach is particularly useful in statistics when individual data points are not available, but aggregated data is, enabling accurate computation of central tendency measures. This point will become important throughout the course.

To finish up our review of measures of central tendency, please watch the following video by Crash Course Statistics.

Measures of variability, dispersion, and spread

We began our review of descriptive statistics by exploring measures of central tendency, which provide insight into the central or typical values within a dataset. However, understanding the distribution and variability of the data goes beyond just identifying the central values. It is akin to being at a gathering, where central tendency tells you who is mingling in the center of the party, but you also want to get a sense of how the crowd is dispersed. Are people tightly clustered on the dance floor, or are they scattered throughout the venue? This concept relates to measures of variability, dispersion, or spread, which reveal the extent to which the data points of a variable are spread out across the dataset. These measures provide valuable information about the range and diversity of the data, allowing us to grasp the full picture of the distribution and understand the nuances beyond just the central values.

Range

The range is the difference between the highest and the lowest scores of a variable. Building on the systolic blood pressure (SBP) example — the lowest score for the NHANES participants was 79 and the highest score was 221 — so the range is 221 - 79 = 142. This tells you how wide the spread of SBP readings is among the participants.

Percentile distribution

Earlier we defined the median as the middle point of the distribution of data, when the data were ordered from smallest to largest. In this way, with the data ordered from smallest to largest, we can also define the median as the 50th percentile score.

Imagine we had systolic blood pressure for exactly 100 people, and we lined them up from the lowest SBP (79) to the highest SBP (221). Imagine the median score for these 100 people is 116. Percentiles are like checkpoints along this line that tell you what SBP value a certain percentage of people fall below. For example, if we talk about the 50th percentile (i.e., the median) of SBP for the NHANES sample, it’s like saying “50% of the people have a SBP at or below 116”.

In this way, we might be interested in examining the full distribution of percentile scores — not just the 50th percentile. For example, the 0th percentile is the lowest score — that’s a SBP of 79 in the NHANES sample, and the 100th percentile is the highest score — that’s a SBP of 221 in the NHANES sample. Notice then that the lower bound used to form the range can be defined as the 0th percentile and the upper bound used to define the range can be defined as the 100th percentile.

Other common percentiles of interest are:

  • The 25th percentile, which is one-quarter of the way through the lined-up data, is known as the first quartile. A quartile divides a rank-ordered variable into four equal parts, and the values that divide the parts are called the quartiles. Being at the 25th percentile means that 25% of people have a SBP at or below this value. In the NHANES sample, a SBP of 107 is at the 25th percentile.

  • The 75th percentile, which is three-quarters of the way through the lined up data, is known as the third quartile. This means 75% of the people in the line have a SBP at or below this value. In the NHANES sample, a SBP of 128 is at the 75th percentile.

In reality, you usually find percentiles using formulas and data, rather than lining people up. But this line analogy helps to visualize what percentiles represent. We’ll learn how to calculate these percentiles using R later in the course.

In medical studies, percentiles are super useful. For example, doctors might say that if your blood pressure is higher than the 90th percentile for people your age, you might be at a higher risk for certain health issues. So, in the context of this example, percentiles are essentially a way to understand where an individual’s SBP stacks up compared to the rest of the population.

Interquartile Range (IQR)

Related to these percentiles, another form of range is often of interest — that is, a range called the interquartile range or IQR. The traditional range (defined above) is the difference between the 100th percentile and the 0th percentile. The IQR is the difference between the 75th percentile and the 25th percentile. This range shows you where the middle 50% of the scores fall. In the NHANES sample, the IQR for SBP is 128 - 107 = 21.
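As an optional preview of the R functions we will use later in the course, quantile() returns any percentiles you ask for and IQR() returns the interquartile range; the vector name SBP is assumed here for illustration.

# Optional R preview (assumes a numeric vector SBP with missing values coded as NA).
quantile(SBP, probs = c(0, 0.25, 0.50, 0.75, 1), na.rm = TRUE)  # min, quartiles, max
IQR(SBP, na.rm = TRUE)                                          # 75th percentile minus 25th percentile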

Mean absolute deviation

So far, we’ve explored two tools to understand the spread of data: the range and the interquartile range. Both of these tools are like detectives that search for clues in the percentiles of the data. But there’s another way to describe spread. Instead of focusing on percentiles, we can choose a reference point that has a special meaning, like the mean or the median, and examine how much the data stray from this point. You might wonder what counts as a “normal” amount of straying; usually, it’s the average or the midpoint of these deviations. This investigation leads us to two useful tools: the “mean absolute deviation,” which looks at how far data points are from the mean, and the “median absolute deviation,” which looks at how far data points are from the median. Let’s focus on the mean absolute deviation.

The mean absolute deviation helps us figure out how much the scores of a variable tend to stray from the average value. Let’s break down the steps to calculate it:

  1. Find the Mean: First things first, calculate the average score. For example, the average SBP for the NHANES sample is 119.

  2. Absolute Differences from the Mean: Now, for each data point (i.e., person in the sample), calculate how far it is from the mean. Don’t worry about whether it’s above or below; just look at the raw distance (which means you’ll take the absolute value of the differences). So for example, for a person who has a SBP of 140, their absolute difference from the mean in the sample is 21, that is: \(|119 - 140| = 21\). Vertical bars around a number or an equation denote the absolute value. The absolute value of a real number is its distance from zero on the number line, regardless of the direction. Therefore, for example, \(|-5| = 5\) and \(|5| = 5\).

  3. Find the Mean of the Absolute Differences: Finally, find the mean of these absolute differences. This resultant value is the mean absolute deviation.

Formally, the Mean Absolute Deviation (MAD) around the mean, \(\bar{x}\), for a series of observations \(x_1, x_2, x_3, \ldots, x_n\) is given by:

\[MAD = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|\]

where:

  • \(n\) is the total number of observations.

  • \(x_i\) represents the \(i^{th}\) observation in the dataset.

  • \(\bar{x}\) is the mean of the observations.

  • \(|x_i - \bar{x}|\) denotes the absolute value of the deviation of each observation from the mean.

Imagine the mean absolute deviation like a bodyguard for the mean, letting you know how much the data points are trying to push and shove around the mean. The higher the mean absolute deviation, the more chaotic (and spread out) the crowd; the lower the mean absolute deviation, the more orderly and close-knit everyone is around the mean.

Here’s a small example of calculating the mean absolute deviation for 10 of the people in our data frame. Can you solve for each of these difference scores: diff_SBP = |SBP - 119|?


SBP diff_SBP
107 12
103 16
97 22
97 22
107 12
121 2
107 12
108 11
113 6
110 9


The first person has a SBP of 107 — the difference between 107 and 119 (the mean in the sample) is 12. The sixth person listed has a systolic blood pressure of 121 — the difference between 121 and 119 is 2. Notice that we record the absolute difference — that is, it doesn’t matter if the person’s score is above or below the mean.

The mean of these 10 differences is about 12.4.

\((12 + 16 + 22 + 22 + 12 + 2 + 12 + 11 + 6 + 9) / 10 = 12.4\)

Therefore, the mean absolute deviation for the 10 people listed above is about 12.4 (based on the overall mean in the sample of 119).
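As an optional preview, this small calculation is one line in R. Note that R's built-in mad() function computes a (scaled) median absolute deviation rather than this mean version, so the calculation is spelled out here.

# Optional R preview: mean absolute deviation for the 10 scores above,
# using the full-sample mean of 119 as the reference point.
sbp <- c(107, 103, 97, 97, 107, 121, 107, 108, 113, 110)
mean(abs(sbp - 119))  # about 12.4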

If we calculate the mean absolute deviation for the whole NHANES sample using this same technique — we get 13.

Instead of the mean absolute deviation, you will sometimes come across the median absolute deviation. The basic idea behind the median absolute deviation is very similar to the idea behind the mean absolute deviation. The difference is that you use the median in each step, instead of the mean. The median absolute deviation tells us the median of the absolute differences between each data point and the overall median.

Variance

Closely related to the mean absolute deviation is the variance. The variance is calculated in a very similar way, but rather than taking the average of the absolute deviations, we take the average of the squared deviations.

The variance, often denoted as \(s^2\), for a series of observations \(x_1, x_2, x_3, \ldots, x_n\) with mean \(\bar{x}\), is given by:

\[s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2\]

where:

  • \(n\) is the total number of observations.

  • \(x_i\) represents the \(i^{th}\) observation in the dataset.

  • \(\bar{x}\) is the mean of the observations.

  • \((x_i - \bar{x})^2\) denotes the squared deviation of each observation from the mean.

  • The division by \(n-1\) (instead of \(n\)) makes \(s^2\) an unbiased estimator of the variance if the observations come from a normally distributed population (we’ll cover this in more detail later in the semester).

The table below shows the squared deviations for the first 10 people in the data frame. For example, we previously calculated the first person to have an absolute deviation of 12; squaring that value (i.e., \(12*12\) or \(12^2\)) yields 144.


SBP diff_SBP squared_diff_SBP
107 12 144
103 16 256
97 22 484
97 22 484
107 12 144
121 2 4
107 12 144
108 11 121
113 6 36
110 9 81


If we sum the squared deviations for the 10 people in the table above and divide by n - 1 (i.e., 9), we get the following.

\((144 + 256 + 484 + 484 + 144 + 4 + 144 + 121 + 36 + 81) / 9 = 1898 / 9 \approx 211\)
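As an optional preview, the same arithmetic in R looks like this. Note that var() applied to these 10 scores alone would use their own mean rather than the full-sample mean of 119, so the calculation is spelled out.

# Optional R preview: sum of squared deviations from 119, divided by n - 1.
sbp <- c(107, 103, 97, 97, 107, 121, 107, 108, 113, 110)
sum((sbp - 119)^2) / (10 - 1)  # about 211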

If we compute the squared deviations for all people in the NHANES data frame, sum them, and divide by n - 1, we get 302; this is the variance of SBP in our sample.

So we have the calculated variance, but what does this number represent? You might find the variance a bit puzzling because it’s not on the same scale as your original data. This is due to squaring the differences before averaging them. How can we make better sense of this?

If the thought of taking the square root crossed your mind, then bingo! You’re on the right track. Taking the square root is the reverse process of squaring, which brings the numbers back to their original scale. In this case, taking the square root of 302 yields approximately 17. This is a more meaningful number that is easier to relate to the original data points, and this is a perfect segue to our final measure of spread: the standard deviation.

Standard deviation

The standard deviation (s) is a statistical measure that quantifies the amount of variation or dispersion in a variable. As you just learned at the end of the last section — the standard deviation is the square root of the variance and is useful for understanding how spread out the data points are around the mean (average) of the variable. A small standard deviation means that the data points tend to be close to the mean, while a large standard deviation means that the data points are spread out over a wider range of values.

\[ s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 } \]
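As an optional preview, R's sd() function returns the standard deviation directly; the vector name SBP is assumed here for illustration.

# Optional R preview (assumes a numeric vector SBP): the standard deviation
# is the square root of the variance.
sqrt(var(SBP, na.rm = TRUE))
sd(SBP, na.rm = TRUE)  # same value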

Empirical Rule

The concept of standard deviation can be a bit tricky to grasp at first because it’s rooted in the variance, which itself is somewhat abstract. However, there’s a handy guideline that can help you make sense of standard deviation in practice. This rule is effective if the distribution of the variable that you are considering is bell-shaped (i.e., a symmetric curve). It’s known as the Empirical Rule or the 68-95-99.7 Rule.

The Empirical Rule states that:

  • About 68% of the scores will be within 1 standard deviation (sd) from the mean. So if you look at the range defined by (mean - 1 standard deviation) to (mean + 1 standard deviation), approximately 68% of the scores should be within this range.

  • Approximately 95% of the scores will be within 2 standard deviations from the mean. This is the range from (mean - 2 standard deviations) to (mean + 2 standard deviations).

  • Almost all (around 99.7%) of the scores will be within 3 standard deviations from the mean. This is the range from (mean - 3 standard deviations) to (mean + 3 standard deviations).

The cartoon below depicts this information. The shaded blue areas in the graphs show that, in the first graph, 68% of the distribution is within 1 standard deviation (sd) of the mean; in the second graph, 95% of the distribution is within 2 standard deviations of the mean; and in the third graph, 99.7% (nearly all!) of the distribution is within 3 standard deviations of the mean.

Graphs depicting the Empirical Rule (68-95-99.7)

Recall that the Empirical Rule can be used if the distribution of scores is symmetric and bell-shaped. In a symmetric bell-shaped curve, the highest point, or peak, of the curve is located at the center, and the curve tapers off gradually in both directions. The left and right sides of the curve are mirror images of each other, reflecting the symmetry of the data. This means that the mean, median, and mode of the distribution coincide at the center of the curve.

Can we use the Empirical Rule for our blood pressure example? Here’s the distribution of the SBP scores that we observed earlier.


Systolic blood pressure loosely follows a bell-shaped curve — although we do see a long right-hand tail. This is largely because as people age, there is greater variability in SBP scores, and much higher SBP scores are more likely to be observed. To explore this a bit, let’s take a look at the distribution of SBP across age groups. I’ll color the parts of the distribution in accordance with the SBP classifications presented in the American Heart Association Chart below. Since we’re only considering systolic, and not diastolic, blood pressure now — we’ll ignore the contribution of diastolic blood pressure to the classifications.


Healthy and Unhealthy Blood Pressure Ranges from the American Heart Association


Our chart shows that the mean and median of SBP become larger as people age. Moreover, the distribution of SBP exhibits a noticeable increase in variability as people age, and the right tail of the distribution becomes more drawn out as we move to older age groups. This changing central tendency and variability is a result of physiological and lifestyle changes that are more pronounced in older age groups.

A Substantive Take

As individuals age, the diversity in health profiles escalates due to factors such as varied disease processes, differing responses to medications, lifestyle alterations, and the accumulation of environmental exposures. These differences contribute to a broader distribution of SBP values among older individuals. Additionally, conditions like hypertension, which yield higher SBP values, are more common among older populations. Hence, in older populations, the SBP distribution is more shifted to the higher pressures, is wider, and has a longer right-hand tail — making readings classified as hypertensive more likely to be observed in these groups.

While it’s true that older adults may face higher risks of elevated SBP, it’s important to notice that there are many individuals in each of the older groups who maintain healthy blood pressure levels (the green parts of the distribution). Maintaining a healthy SBP is not simply a factor of aging, but often the result of deliberate lifestyle choices and effective medical management. Regular physical activity, a balanced diet low in sodium and high in fruits and vegetables, maintaining a healthy weight, avoiding too much stress, reducing alcohol intake, and avoiding tobacco are all proactive measures individuals can take to support cardiovascular health. Furthermore, regular health check-ups and timely management of other health conditions such as diabetes can greatly assist in keeping blood pressure within healthy limits. The advances in medical science also provide us with effective blood pressure-lowering medications when necessary. In other words, aging does not inevitably equate to high SBP. With appropriate lifestyle modifications, regular monitoring, and proactive healthcare, it’s entirely possible for older adults to sustain a healthy SBP well into advanced age.

So, based on these graphs, we can see that the distribution of SBP for some of the age groups is pretty darn close to a bell-shaped curve. For example, if we limit our data frame to only participants between the ages of 16 and 25 (the first age group), then the distribution of SBP appears much more bell-shaped and symmetric. Take a look at the histogram below of individuals age 16 to 25, with a perfect bell-shaped curve overlaid. In this subsample of participants, the mean SBP = 112, the median SBP = 112, and the standard deviation of SBP = 12.

Let’s apply the Empirical Rule to the NHANES participants aged 16 to 25.

The Empirical Rule states that:

  • About 68% of the scores will be within 1 standard deviation of the mean. Therefore, if you look at the range defined by (mean - 1 standard deviation) to (mean + 1 standard deviation), approximately 68% of the scores should be within this range.

In our subsample of NHANES participants aged 16 to 25, that means that about 68% of young adults will have a systolic blood pressure that is between 100 and 124, that is, \(112 \pm 12\).

  • Approximately 95% of the scores will be within 2 standard deviations of the mean. This is the range from (mean - 2 standard deviations) to (mean + 2 standard deviations).

In our subsample, that means that about 95% of young adults will have a systolic blood pressure that is between 88 and 136, that is, \(112 \pm (2 \times 12)\).

  • Almost all (around 99.7%) of the scores will be within 3 standard deviations of the mean. This is the range from (mean - 3 standard deviations) to (mean + 3 standard deviations).

In our subsample, that means that about 99.7% of all young adults will have a systolic blood pressure that is between 76 and 148, that is, \(112 \pm (3 \times 12)\).
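
If you'd like to see the Empirical Rule in action, here is a small simulation in R. It does not use the NHANES data; instead it draws 10,000 values from a bell-shaped (normal) distribution with the same mean (112) and standard deviation (12) as the 16-to-25 subsample, and then checks what proportion of the simulated scores fall within 1, 2, and 3 standard deviations of the mean.

```r
# A small simulation to check the Empirical Rule (simulated values, not the
# actual NHANES measurements).
set.seed(1)
sbp_sim <- rnorm(10000, mean = 112, sd = 12)

m <- mean(sbp_sim)
s <- sd(sbp_sim)

# Proportion of simulated scores within 1, 2, and 3 standard deviations
mean(sbp_sim >= m - 1 * s & sbp_sim <= m + 1 * s)   # approximately 0.68
mean(sbp_sim >= m - 2 * s & sbp_sim <= m + 2 * s)   # approximately 0.95
mean(sbp_sim >= m - 3 * s & sbp_sim <= m + 3 * s)   # approximately 0.997
```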

Examples of normal, and not so normal, distributions

Keep in mind that this rule is a rough guide and is based on the assumption that the data follow a symmetric, bell-shaped curve. As we’ll study later in the course, this is referred to as a normal distribution. In a normal distribution, the mean, median, and mode are all the same. Here is an example of a normal distribution.

For some variables, you will find that their distribution is skewed to the left or skewed to the right, as demonstrated in the graphs below.

When a distribution is skewed to the left, also called left-skewed or negatively skewed, the tail of the distribution extends toward the left side of the graph, while the bulk of the data is concentrated toward the right side. In a left-skewed distribution, the presence of extreme low values in the left tail pulls the mean in that direction. Consequently, the mean will be less than the median.

When a distribution is skewed to the right, also known as right-skewed or positively skewed, the tail of the distribution extends toward the right side of the graph, while the majority of the data is concentrated toward the left side. In a right-skewed distribution, the presence of extremely high values in the right tail pulls the mean in that direction. As a result, the mean will be greater than the median.

The standard deviation can still be a meaningful statistic in skewed distributions. It accurately represents the spread of the data points from the mean, just as it is supposed to do. But interpreting the standard deviation (and the mean) can be challenging in skewed distributions, because these measures are influenced by the skewness. In these cases it is often more insightful to use additional or alternative measures, like the median (a measure of central tendency that is less sensitive to outliers and skewness) and the interquartile range (a measure of spread that is also less sensitive to outliers and skewness), rather than (or in addition to) the mean and the standard deviation.

So, to summarize, while standard deviation is still a meaningful statistic in skewed distributions, its interpretation can be more complex and it can be helpful to consider other statistics as well.
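
If you are curious, a quick simulation in R can make this concrete. The numbers below are made up; the point is that in a right-skewed distribution the mean is pulled above the median, while the median and the interquartile range are much less affected by the long tail.

```r
# Simulate a right-skewed variable (arbitrary values, for illustration only)
set.seed(1)
skewed <- rexp(10000, rate = 1/10)   # long right-hand tail

mean(skewed)     # pulled toward the tail, so larger than the median
median(skewed)   # less sensitive to the extreme high values
sd(skewed)       # spread around the mean, inflated by the tail
IQR(skewed)      # a spread measure that is less sensitive to the tail
```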

Measures of variability, also known as measures of spread or dispersion, are valuable statistical tools used to describe the degree to which the scores of a variable are spread out or clustered together. This is important for descriptive statistics, and, as you’ll find later in the course, variability also plays a pivotal role in inferential statistics.

Please watch the following two videos below from Crash Course Statistics on Measures of Spread and Distributions.

Standardized scores

Standardized scores, or z-scores, are a useful concept in statistics. They provide a way to understand an individual observation in the context of a distribution of data. A z-score tells us how many standard deviations an observation is from the mean of a distribution, providing a measure of relative location.

To calculate a z-score, we subtract the mean of the variable from an individual data point (which gives us the deviation of the data point from the mean), and then divide that by the standard deviation of the variable. The formula for calculating a z-score is:

\[z_{i} = \frac{x_{i} - \bar{x}}{s_{x}}\]

Where:

  • \(z_{i}\) is the calculated z-score for each case (i.e., individual in the study).

  • \(x_{i}\) is the score of the variable of interest (e.g., SBP) for each case.

  • \(\bar{x}\) is the mean of the scores in the sample (mean SBP).

  • \(s_{x}\) is the standard deviation of the scores in the sample (sd of SBP).

Let’s use SBP as an example. In the full NHANES sample, the mean SBP is 119 mm Hg and the standard deviation is 17 mm Hg.

If someone has a SBP of 102, then their z-score is: \(z = \frac{102 - 119}{17} = -1\). That is, their SBP is 1 standard deviation below the mean.

Let’s try another. If someone has a SBP of 153, then their z-score is: \(z = \frac{153 - 119}{17} = 2\). That is, their SBP is 2 standard deviations above the mean.

One last example. If someone has a SBP of 119, then their z-score is: \(z = \frac{119 - 119}{17} = 0\). That is, their SBP is at the mean.
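
Here is a minimal sketch of these same calculations in R, using the full-sample mean (119) and standard deviation (17) quoted above. The three readings are simply the hypothetical examples we just worked through.

```r
# z-scores for three example SBP readings, using the full-sample mean and sd
sbp <- c(102, 153, 119)

z <- (sbp - 119) / 17    # (score - mean) / standard deviation
z                        # -1  2  0

# With raw data you would typically use the sample statistics directly,
# for example: (sbp - mean(sbp)) / sd(sbp)
```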

z-scores are particularly useful because they are unitless and thus allow for direct comparison between different types of data. They also form the basis of many statistical tests and procedures (which we’ll learn about later in the term). However, it’s important to note that z-scores are most meaningful when the data follow a normal distribution. If the data are not normally distributed, the z-score may not provide an accurate picture of an observation’s relative location.

Please watch the video below from Crash Course Statistics on z-scores and percentiles that further explains these concepts.

Wrap up

In this module, we’ve laid the groundwork for understanding how to describe and summarize data, crucial for any data science endeavor. Here are the key insights:

  1. Purpose of Statistics: At its core, statistics transform raw data into meaningful information, enabling us to uncover patterns and insights.

  2. Mathematical Thinking and Measurement: We’ve emphasized the importance of developing a strong number sense and thinking mathematically. Additionally, we explored how abstract concepts are operationalized into measurable variables, ensuring accuracy in data collection and interpretation.

  3. Types of Variables: We distinguished between nominal, ordinal, and continuous variables, each serving different purposes in data analysis. Understanding these differences is key to choosing the right descriptive techniques.

  4. Descriptive Statistics:

    • Central Tendency: Measures like the mode, median, and mean help pinpoint the central value in a dataset.

    • Dispersion: Metrics such as range, interquartile range (IQR), mean absolute deviation, variance, and standard deviation reveal the spread and variability in data.

  5. Empirical Rule: This rule aids in interpreting data distributions, especially those resembling a bell curve, and sets the stage for more advanced statistical concepts.

  6. Standardized Scores (z-scores): z-scores provide a standardized way to understand the relative position of data points within a distribution, essential for comparing different datasets.

Throughout this module, we’ve used real-world data from the National Health and Nutrition Examination Survey (NHANES) to illustrate these concepts, ensuring practical understanding. By mastering these foundational skills, we’ve set the stage for delving into data science with R. In upcoming modules, we’ll build on this knowledge, leveraging R to perform sophisticated data analyses, visualize complex patterns, and ultimately make data-driven decisions. We’ll start in the most fun way possible: by creating beautiful graphics using the ggplot2 package for R in Module 3!

Credits

  • The Measurement section of this module drew from the excellent commentary on this subject by Dr. Danielle Navarro in her book entitled Learning Statistics with R.

Footnotes

  1. The 5,000 individuals from NHANES that are considered in this module are resampled from the full NHANES study to mimic a simple random sample of the US population.↩︎

  2. A quartile is a type of quantile. A quantile is a statistical term that refers to dividing a probability distribution into continuous intervals with equal probabilities, or dividing a variable into several parts of equal volume. In essence, if you have a variable and you want to split it into groups, each group being a certain percentage of the total, each group represents a quantile. For example, if you were to divide a variable into two equal parts, the point that separates the groups is the median. The data points less than the median make up the first half, and the data points greater than the median make up the second half. Here are a few types of quantiles you might come across: quartiles split the data into four equal parts, so each part contains 25% of the data; deciles split the data into ten equal parts, so each part contains 10% of the data; and percentiles split the data into one hundred equal parts, so each part contains 1% of the data.↩︎
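
As an aside, quantiles are easy to obtain in R with the base function quantile(). Here is a tiny sketch using a made-up vector of values.

```r
# Quantiles of a small made-up vector (illustration only)
x <- c(4, 8, 8, 12, 15, 16, 19, 23, 27, 42)

quantile(x, probs = c(0.25, 0.50, 0.75))      # quartiles (0.50 is the median)
quantile(x, probs = seq(0.1, 0.9, by = 0.1))  # deciles
```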