Modeling Non-Linear Relationships
Module 15
Learning objectives
- Explain the importance of capturing the true relationship between variables in statistical modeling
- Discuss the limitations of linear regression when dealing with non-linear relationships
- Describe how non-linear transformations and polynomial regression extend the linear regression framework
- Apply logarithmic transformations to real-world data using R
- Interpret the effects of log-transformed variables in a regression model
- Apply polynomial regression to real-world data using R
- Interpret the coefficients of polynomial regression models
Introduction
In the realm of statistical modeling, capturing the true relationship between variables is paramount for accurate predictions and insightful interpretations. While linear regression is a powerful and widely used tool, its straightforward approach can sometimes fall short when dealing with complex, non-linear relationships inherent in real-world data. A key assumption of the linear regression model is that the relationship between the predictor and outcome variables is linear. This assumption, if violated, can lead to biased estimates, poor predictions, and misleading conclusions. To bridge this gap, it is essential to extend our toolkit beyond simple linear models to include techniques that can handle non-linearity while retaining the familiarity and interpretability of linear regression.
Non-linear transformations, such as logarithmic transformations, offer a practical solution for linearizing relationships that are otherwise exponential or multiplicative in nature. For instance, applying a logarithmic transformation to a right-skewed variable like income or population size can result in a more symmetrical distribution. This makes the data more suitable for linear regression analysis. Such transformations help meet the linearity assumption, stabilize variance, and normalize residuals, topics we will explore in Module 17. Furthermore, coefficients in a logarithmic regression model can be interpreted as percentage changes, providing a more intuitive understanding of the relationships between variables.
Polynomial regression, on the other hand, extends the linear model by adding polynomial terms of the predictor variables (\(X\)). By including quadratic (\(X^2\)), cubic (\(X^3\)), and higher-order terms, polynomial regression can fit more complex curves to the data that simple linear regression cannot capture. This method is particularly useful in situations where the effect of a predictor on the outcome is not constant but changes in a non-linear fashion. For example, the relationship between age and healthcare costs is often best described by a quadratic function, where costs increase at both ends of the age spectrum. Polynomial regression allows us to model these nuances more accurately, providing a better fit and more precise predictions.
Together, non-linear transformations and polynomial regression significantly expand the capabilities of applied data analysts to use the lm() machinery to fit curvilinear models. These techniques enable more accurate modeling of complex relationships, improving predictive performance and yielding deeper insights into the data. By understanding and applying these methods, analysts can better capture the intricacies of real-world phenomena, making more informed and reliable decisions based on their data analyses. The ability to handle non-linearity effectively not only enhances the robustness of statistical models but also ensures that the assumptions underlying these models are adequately addressed, leading to more credible and actionable results.
This module explores how non-linear transformations of our variables (namely through logarithmic transformation), as well as polynomial regression, can be employed to model non-linear relationships effectively within a linear framework. By transforming variables or adding polynomial terms, we can flexibly adapt linear regression to fit a broader range of data patterns. These techniques not only enhance the accuracy of our models but also allow us to uncover underlying trends and interactions that may not be immediately apparent in their raw form.
Understanding and applying these methods is crucial for several reasons:
- Versatility: Non-linear transformations and polynomial terms expand the versatility of linear regression, enabling it to capture a wider variety of relationships between variables.
- Interpretability: Maintaining a linear framework allows for easier interpretation and communication of model results, preserving the straightforwardness of linear models while enhancing their capacity.
- Prediction Accuracy: Fitting non-linear patterns more accurately can lead to better predictions, which is particularly important in fields where precise forecasting is critical.
- Insight into Data: These techniques can reveal hidden structures and insights in the data, providing a deeper understanding of the phenomena under study.
Throughout this Module, we will delve into practical applications and examples, demonstrating how to implement these methods using R. By the end, you will be equipped with a robust framework for tackling non-linear relationships, enriching your analytical skill set and enhancing the quality of your research and data analysis.
Non-linear transformations
As we journey further into the world of data analysis, we occasionally encounter relationships that are not straightforward. Many times, the data we collect doesn’t fit neatly into a linear pattern, making it challenging to apply basic linear regression techniques. In some cases, by applying a non-linear transformation to one or more of the variables in the regression model (i.e., the outcome, the predictor, or both), we can model what appears to be a non-linear relationship at the level of the raw variable(s), as a linear relationship at the level of the transformed variable(s). Logarithmic transformations are the most commonly employed non-linear transformations for this case.
A primer on logarithms
Linear transformations
To understand non-linear transformations, let’s first consider what it means to transform data linearly. A linear transformation preserves the linear relationships between data points. For example, converting measurements from inches to centimeters is a linear transformation because the relationship between the units is constant: every inch corresponds to 2.54 centimeters. In R, you can perform this transformation using the code mutate(centimeters = inches * 2.54).
Let’s consider the measurements of a series of objects in inches — these are displayed in the first column of the table below. The second column presents the corresponding measurement in centimeters. We can graph the correspondence between inches and centimeters for these measurements. The graph below the table is a visualization of the linear transformation from inches to centimeters. Notice how the data points align along a straight line, reflecting the consistent relationship between the two units.
inches | centimeters |
---|---|
1 | 2.54 |
2 | 5.08 |
3 | 7.62 |
4 | 10.16 |
5 | 12.70 |
6 | 15.24 |
7 | 17.78 |
8 | 20.32 |
9 | 22.86 |
10 | 25.40 |
The conversion from inches to centimeters is a linear transformation. We know this because the original and transformed values can be joined by a straight line. Linear transformations preserve relative spacing. That is, values that are evenly spaced before transformation remain evenly spaced after transformation. For example, values that are spaced twice as far apart before transformation (i.e., going from 2 to 4 on the x-axis) remain twice as far apart after transformation (going from 5.08 to 10.16 on the y-axis).
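As a minimal sketch of how this table could be built in R (assuming the tidyverse is loaded; the object name conversions is just an illustrative choice):
library(tidyverse)

conversions <- tibble(inches = 1:10) |>
  mutate(centimeters = inches * 2.54) # a linear transformation: multiply by a constant

conversions |>
  ggplot(aes(x = inches, y = centimeters)) +
  geom_point() +
  geom_line() +
  theme_minimal() +
  labs(title = "A linear transformation: inches to centimeters",
       x = "Inches", y = "Centimeters")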
Non-linear transformations
Non-linear transformations are also possible. With a non-linear transformation, the initial and transformed values do not fall on a straight line, and non-linear transformations do not preserve spacing. A logarithm is an example of a non-linear transformation, and it is one of the most common transformations used in statistics.
The logarithm is the power (i.e., the exponent) to which a base must be raised to produce a given number:
\[log_b(x) = c\]
In other words, if we have a logarithm with base “b” and result “c,” it means that “b” raised to the power of “c” gives us the number we are interested in (“x”). Logarithms help simplify calculations involving large or small numbers and have widespread applications in various fields. Commonly encountered bases for logarithms include the natural logarithm with base “e,” the common logarithm with base 10, and the binary logarithm with base 2, each with its own unique properties and applications.
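As a quick illustration (a minimal sketch using only base R), each of these bases has a corresponding function, and log() also accepts an arbitrary base:
log(100)         # natural logarithm (base e), about 4.61
log10(100)       # common logarithm (base 10): 2
log2(8)          # binary logarithm (base 2): 3
log(8, base = 2) # equivalent to log2(8)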
The binary logarithm
Let’s explore the binary logarithm: \(log_2(x) = c\)
Consider the equation: \(log_2(8) = 3\)
This equation asserts that “2 raised to the power of 3 gives 8.”
In other words, in converting x to the base-2 log of x, \(log_2(x) = c\) would be the value c that satisfies the equation \(2^c = x\), in this example, that’s \(2^3 = 8\).
Let’s create a few scores for a variable x and then compute the base-2 logarithm of each value, storing the result in a new variable, log2x. In R this is accomplished with the following code: log2x = log2(x), where the function log2() applies the binary logarithm. I’ll also plot these against one another as we did for inches and centimeters.
x | log2x |
---|---|
2 | 1 |
4 | 2 |
8 | 3 |
16 | 4 |
32 | 5 |
Notice that the initial values of x and the transformed values (log2x) do not fall on a straight line, and the spacing is not preserved — each 1 unit increase in a base-2 logarithm (e.g., going from 3 to 4) represents a doubling of x (e.g., going from 8 to 16).
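A quick check in R (a small sketch) makes the spacing point concrete: the raw values are not evenly spaced, but their base-2 logarithms are.
x <- c(2, 4, 8, 16, 32)
diff(x)       # gaps grow: 2, 4, 8, 16
diff(log2(x)) # gaps are constant: 1, 1, 1, 1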
The natural logarithm
The most often used logarithm in statistics is the natural logarithm. The natural log transformation, often denoted as ln, is a specific type of logarithm with a base of the mathematical constant “e” (approximately equal to 2.71828).
The natural logarithm is defined as: \(log_e(x) = c\)
In converting x to the natural log of x (lnx), lnx would be the value c that satisfies the equation \(e^c = x\).
Let’s create a few scores for a variable x and then compute the natural logarithm of each value, storing the result in a new variable, lnx. In R this is accomplished with the following code: lnx = log(x), where the function log() applies the natural logarithm. I’ll also plot these against one another as we did for the base-2 logarithm.
x | lnx |
---|---|
10 | 2.302585 |
20 | 2.995732 |
30 | 3.401197 |
40 | 3.688879 |
50 | 3.912023 |
60 | 4.094345 |
70 | 4.248495 |
80 | 4.382027 |
90 | 4.499810 |
100 | 4.605170 |
The plot for the natural logarithm looks similar to the plot for the base-2 logarithm, illustrating that both are non-linear transformations. In fact, logarithms in different bases are related by a constant factor, which means a logarithm in one base can be expressed as a linear transformation of a logarithm in another base. Due to this relationship, analysts can choose whichever base is most appropriate for their specific application.
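To see that constant-factor relationship in R (a small sketch), note that the natural logarithm of any positive x equals its base-2 logarithm multiplied by ln(2):
x <- c(10, 20, 30, 40, 50)
log(x)           # natural logarithm
log2(x) * log(2) # base-2 logarithm rescaled by ln(2) gives the same values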
It’s important to note that taking the logarithm of 0 or a negative number is undefined. This is because the logarithmic function is only valid for positive numbers—logarithms tell us what power we need to raise a base (like 10 or e) to produce a number, and there is no power that will yield a non-positive result.
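R reflects this restriction; here is a small sketch of what happens when we try:
log(0)  # -Inf: the logarithm diverges to negative infinity as x approaches 0
log(-1) # NaN, accompanied by a warning ("NaNs produced")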
With that primer in our pocket — let’s see how application of a natural log transformation can assist us in modeling a non-linear relationship using our lm() machinery.
Applied example
In a perfect scenario, the relationship between our independent variable (predictor) and dependent variable (outcome) would form a straight line when plotted on a graph. This makes it easy to apply a linear regression model, which assumes a linear relationship. However, real-world data often tells a different story. Relationships can be curved or follow exponential patterns, making the application of linear regression less effective.
The Power of Transformations
This is where non-linear transformations come into play. By applying mathematical transformations to our data, we can convert these non-linear relationships into linear ones. One of the most commonly used transformations is the logarithmic transformation.
What is a Logarithmic Transformation?
A logarithmic transformation involves applying the logarithm function to one or more variables in our model. This transformation can help straighten out a curved relationship, making it easier to analyze with linear regression methods. Here’s why logarithmic transformations are so useful:
- Linearizing Relationships: When the relationship between variables is exponential or multiplicative, a logarithmic transformation can help convert it into a linear form.
- Managing Skewed Data: Logarithmic transformations can reduce skewness in variables, making them more normally distributed and easier to work with.
- Simplifying Interpretation: In a transformed model, the coefficients can represent more intuitive concepts, such as percentage changes instead of unit changes — which is helpful when the meaning of a unit change on the raw scale is not constant across the full range of our variables.
As mentioned above, logarithmic transformations can reduce skewness in data, making it more normally distributed and easier to work with. Certain types of variables are particularly prone to skewness and often benefit from non-linear transformations. Income-related data is a prime example of this. For instance, the distribution of individual incomes or the gross domestic product (GDP) of different countries typically exhibits a right-skewed pattern. In a right-skewed distribution, a small number of extremely high values can distort the overall picture, making it difficult to analyze the data accurately.
By applying a logarithmic transformation to these variables, we can normalize their distributions. This transformation compresses the range of the data, pulling in high values and spreading out lower values, resulting in a more symmetric, bell-shaped distribution. This normalization can make the data more suitable for analysis, enhancing the performance of statistical techniques that assume normality.
Let’s take a look at an example. Here, we’ll use data compiled by Our World in Data on gross domestic product (GDP) and life expectancy of residents in 166 countries.
To begin, let’s import and wrangle the data a bit:
gdp_lifeexp <- read_csv(here("data", "gdp_lifeexp.csv")) |>
  janitor::clean_names() |>
  filter(year == 2021) |>
  rename(gdp = gdp_per_capita,
         le = period_life_expectancy_at_birth_sex_all_age_0,
         pop = population_historical_estimates) |>
  select(entity, code, gdp, le, pop) |>
  drop_na()

gdp_lifeexp
Now, we can create a scatter plot of the raw data.
gdp_lifeexp |>
  ggplot(aes(x = gdp, y = le)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Life expectancy vs. GDP per capita, 2021",
       subtitle = "GDP per capita is adjusted for inflation and differences in the cost of living between countries.",
       x = "GDP per capita",
       y = "Life expectancy at birth")
Let’s imagine that we want to fit a model to predict life expectancy from GDP per capita. We might consider a linear model:
gdp_lifeexp |>
  ggplot(aes(x = gdp, y = le)) +
  geom_point() +
  geom_smooth(method = "lm", formula = "y ~ x", color = "#F26B38") +
  theme_minimal() +
  labs(title = "Life expectancy vs. GDP per capita, 2021",
       subtitle = "GDP per capita is adjusted for inflation and differences in the cost of living between countries.",
       x = "GDP per capita",
       y = "Life expectancy at birth")
The best-fit line does not pass through the middle of the data points in many parts of the scatter plot. This indicates that the line does not accurately capture the central trend of the data across different ranges of GDP per capita.
Given our understanding of logarithmic transformations, applying a log transformation to GDP per capita might improve the model fit. Before doing so, let’s examine a density plot of GDP in its raw form, as well as a plot of the log-transformed version of GDP:
gdp_lifeexp <- gdp_lifeexp |>
  mutate(gdp_ln = log(gdp))

p1 <- gdp_lifeexp |>
  ggplot(aes(x = gdp)) +
  geom_density(fill = "azure3") +
  theme_minimal() +
  labs(title = "GDP in its raw metric",
       x = "GDP per capita")

p2 <- gdp_lifeexp |>
  ggplot(aes(x = gdp_ln)) +
  geom_density(fill = "azure4") +
  theme_minimal() +
  labs(title = "The natural logarithm of GDP",
       x = "ln(GDP per capita)")

library(patchwork)
p1 + p2 + plot_annotation(title = "GDP per capita in its raw and transformed metric")
In its raw metric, GDP per capita is severely skewed to the right. Although there’s no requirement that variables in a linear regression model be normally distributed, variables that are severely skewed often present modeling challenges and end up violating assumptions of the normality of the residuals (a topic we will learn about in Module 17). Thus, the observation that taking the natural log of GDP helps to move the distribution toward a normal distribution is encouraging.
Now, let’s determine if the transformed version of GDP provides a better approximation of a linear relationship with life expectancy.
gdp_lifeexp |>
  ggplot(aes(x = gdp_ln, y = le)) +
  geom_point() +
  geom_smooth(method = "lm", formula = "y ~ x", color = "#F26B38") +
  theme_minimal() +
  labs(title = "Life expectancy vs. ln(GDP per capita), 2021",
       subtitle = "GDP per capita is adjusted for inflation and differences in the cost of living between countries.",
       x = "ln(GDP per capita)",
       y = "Life expectancy at birth")
Indeed, the transformed version of GDP — ln(GDP) — allows its relationship to life expectancy to display a linear pattern. This means that we can use our linear regression machinery (i.e., the lm() function) to estimate the relationship between the two variables.
Let’s fit the model:
lm1 <- lm(le ~ gdp_ln, data = gdp_lifeexp)
lm1 |> tidy() |> select(term, estimate, std.error)
Interpretation of the Model
The intercept (21.10) provides the expected life expectancy when the log of GDP per capita is zero. The exp() function applies the antilog (or inverse log) — returning the transformed score back to its original metric.
exp(0)
[1] 1
This calculation shows that a log of GDP per capita of 0 corresponds to a GDP per capita of 1. Therefore, according to our model, a country with a GDP of 1 would have an expected life expectancy of 21.1 years. However, there are no countries in our data frame with a GDP of 1; the lowest GDP per capita in our data is 607. Thus, interpreting this intercept in practical terms is not useful.
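A more practically useful prediction uses a GDP value that actually appears in the data, such as the minimum of 607. The sketch below assumes the fitted model object lm1 from above; the arithmetic in the comment uses the rounded coefficients reported in the output.
predict(lm1, newdata = tibble(gdp_ln = log(607)))
# roughly 21.10 + 5.44 * log(607), or about 56 years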
The slope (5.44) indicates that for every one-unit increase in the natural logarithm of GDP per capita, the life expectancy at birth increases by approximately 5.44 years. This indicates a strong positive relationship between the natural log of GDP per capita and life expectancy.
But what if we want to interpret this effect in terms of change in GDP per capita in its raw metric, rather than ln(GDP per capita)? That would be a more intuitive interpretation, as a one-unit increase in ln(GDP) is difficult to imagine. We can accomplish this task, but it requires an extra step.
To achieve this, the following function calculates the expected change in life expectancy for an x_chg% increase in GDP per capita:
ln.x <- function(slope, x_chg) {
  new_slope <- slope * (log(1 + (x_chg/100)))
  return(new_slope)
}
We use this function by plugging in the slope from our fitted model (5.441468) and the desired percent change in GDP per capita that we want to describe. To begin, we’ll solve for the expected change in life expectancy for a one-percent (x_chg = 1) increase in GDP per capita.
ln.x(5.441468, 1)
[1] 0.05414441
Now, we can interpret the effect as follows: a one-percent increase in GDP per capita is associated with an increase in life expectancy of about 0.05 years. Of course, a 1 percent increase in GDP per capita is very small — it is probably more informative to consider a larger increase, for example a 100 percent increase in GDP per capita. In this instance, we’d ask the question: “How much would we expect life expectancy to differ between two countries where one has a GDP per capita that is twice the size of the other?” We can accomplish this by applying the function as follows (notice that here x_chg = 100).
ln.x(5.441468, 100)
[1] 3.771738
A 100% increase in GDP is associated with about a 3.8 year increase in life expectancy.
Interpretation in context
We can interpret the slope that relates GDP to life expectancy, when GDP is natural log-transformed, as the expected difference in life expectancy for a given percentage increase in GDP (rather than a unit change). This is because a natural logarithmic transformation allows us to express the effect in terms of percentage changes in GDP, effectively capturing the non-linear nature of the relationship between raw GDP and life expectancy.
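To see where this percentage-change interpretation (and the ln.x() function above) comes from, consider the predicted change in life expectancy when GDP per capita increases by p percent, so that x becomes \(x(1 + p/100)\):
\[ \Delta\hat{y} = b_1\ln\left(x\left(1 + \frac{p}{100}\right)\right) - b_1\ln(x) = b_1\ln\left(1 + \frac{p}{100}\right) \]
Because the x terms cancel, the change depends only on the percentage increase and not on the starting level of GDP, which is exactly the property illustrated by the two-country comparison below.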
For example, a 100% increase in GDP (i.e., doubling the GDP) results in a constant increase in life expectancy, regardless of the initial GDP level. To see this, let’s compare two countries with different GDP levels: the Democratic Republic of Congo (where GDP per capita = $817 in 2021) and the United States (where GDP per capita = $57,523 in 2021).
Consider a 100% increase in GDP (doubling the GDP) for each country:
Democratic Republic of Congo: Original GDP = 817, Doubled GDP = 1634. The natural logarithm changes from ln(817) ≈ 6.706 to ln(1634) ≈ 7.399. This is an increase of 0.69 units in ln(GDP) — that is, 7.399 - 6.706 = 0.69.
United States: Original GDP = 57,523, Doubled GDP = 115,046. The natural logarithm changes from ln(57,523) ≈ 10.960 to ln(115,046) ≈ 11.653. This is also an increase of 0.69 units in ln(GDP) — that is, 11.653 - 10.960 = 0.69.
In both cases, a 100% increase in GDP corresponds to an increase in life expectancy by \(0.69 \times 5.44 = 3.75\) (i.e. 3.75 years), where 5.44 is the regression estimate associated with gdp_ln in the fitted lm() model. This highlights that the impact of a percentage change in GDP on life expectancy is consistent regardless of the absolute level of GDP.
In contrast, a constant unit increase in GDP per capita would have a very different meaning for countries at different economic levels. For instance, a 100 unit increase in GDP for a country with a GDP of 817 USD (i.e., going from 817 to 917) would be a very substantial increase in GDP, but a 100 unit increase for a country with a GDP of 57,523 USD (i.e., going from 57,523 to 57,623) would be a trivial increase in GDP. This explains why the raw GDP data does not fit well with a linear model, while the log-transformed data does.
Take a look at the plots below, which overlay the best-fit line implied by our regression model of life expectancy on the natural log of GDP per capita. On the left plot, you can see that the best-fit line relating GDP in its raw metric to life expectancy is curvilinear. I’ve marked the GDP of the two countries we considered earlier — the Democratic Republic of Congo (in red) and the USA (in blue). Note that as the best-fit line extends from the red line, the slope relating GDP to life expectancy is very steep, but as the best-fit line extends from the blue line, the slope relating GDP to life expectancy is shallower. This captures the observation that a certain unit increase in GDP among poorer countries will have a larger impact on life expectancy than the same unit increase for richer countries.
On the right plot, you can see that the best-fit line relating ln(GDP per capita) to life expectancy is linear, and the effect of ln(GDP per capita) on life expectancy is constant across the entire range of ln(GDP per capita). Therefore, using the technique of applying a natural log transformation to one or more of the variables that will be input into a regression model, we have the opportunity to fit a linear regression model that meets the assumptions of linearity at the level of the transformed variable(s). Then, we have the tools to interpret the effect(s) in terms of percent change for the raw variables. That’s pretty slick!
Polynomial regression
The basic regression model assumes that the relationship between \(X\) and \(Y\) is linear. However, in some cases, the effect of a given predictor may change as the predictor itself increases. In other words, the “effect of \(X\) differs at different levels of \(X\)”.
A primer on polynomials
Consider the relationship between age and annual health care expenditures. As humans, we tend to cost a lot when we’re first born and when we’re in old age. The relationship between age and median medical expenditures can be represented by a quadratic function that takes on the shape of a parabola. As demonstrated by the graph, there is a relationship between age (the variable on the x-axis) and medical costs in dollars (the variable on the y-axis), but it is not linear. A close examination of the graph indicates that the effect of each additional year for the age variable on medical expenses changes depending on where on the x-axis we focus. During childhood, the effect of age on medical expenses is negative (each additional year of age is associated with lower costs). During old age, the effect of age on medical expenses is positive (each additional year of age is associated with higher costs).
Polynomial regression can capture relationships between \(X\) and \(Y\) that are not merely linear, by incorporating higher order terms of \(X\), such as squared (\(X^2\)), cubed (\(X^3\)), and so forth. This accounts for the shifting influence of the predictor (i.e., \(X\)) on the outcome at various levels of the predictor.
For instance, when a squared term is added (i.e., \(y_i\) regressed on \(x_i\) and \(x_i^2\), termed a quadratic function), the model can accommodate a single curve bend. Each subsequent higher order term allows for an additional bend; thus, a model with \(y_i\) regressed on \(x_i\), \(x_i^2\), and \(x_i^3\) (a cubic model) can accommodate two bends to the relationship relating the predictor to the outcome.
However, it’s worth noting that in the social and behavioral sciences, polynomial terms beyond the second order (\(x_i^2\)) are seldom used. This is largely due to the difficulty of interpreting higher-order terms and the risk of overfitting the model to the data.
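As a brief sketch of what this looks like in lm() notation (using simulated data with a single bend, not the healthcare-cost example; the object name sim is illustrative), we can recover a quadratic relationship by adding a squared term with I():
set.seed(123)
sim <- tibble(
  x = runif(200, 0, 10),
  y = 2 + 1.5 * x - 0.2 * x^2 + rnorm(200, sd = 1) # true curve has one bend
)

lm(y ~ x + I(x^2), data = sim) |> tidy() # I() protects the squared term inside the formula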
Applied example
Consider a straightforward scenario where we aim to understand the impact of practice duration on a study participant’s score on a visual discrimination task. Participants are randomly assigned to varying amounts of practice, then complete a visual discrimination test, and each participant’s number of correct responses is recorded. Each of the forty participants was randomly assigned a practice duration of 0, 2, 4, 6, 8, 10, 12, or 14 minutes.
The experiment revolves around two variables:
- practice: the experimentally assigned duration of practice
- score: the total correct responses during the test.
Let’s import the data and take a look at the scores:
cogtest <- read_csv(here("data", "cogtest.csv")) |> select(-subject)
cogtest
The skim() output shows that the mean for the performance score is 16.38, and the standard deviation is 7.67.
cogtest |> skim()
Name | cogtest |
Number of rows | 40 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
practice | 0 | 1 | 7.00 | 4.64 | 0.00 | 3.50 | 7.00 | 10.50 | 14.00 | ▇▃▇▃▇ |
score | 0 | 1 | 16.38 | 7.67 | 1.06 | 11.67 | 19.04 | 22.23 | 25.54 | ▂▂▂▃▇ |
Let’s plot the relationship between minutes spent practicing and performance on the task, and overlay a best-fit line using the argument formula = y ~ x in the call to geom_smooth().
cogtest |>
  ggplot(mapping = aes(x = practice, y = score)) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x, color = "orange") +
  geom_point() +
  scale_x_continuous(limits = c(0, 14),
                     breaks = seq(0, 14, by = 2)) +
  theme_minimal() +
  labs(title = "Does more practice lead to a better score?",
       subtitle = "Overlay best fit line",
       x = "Minutes spent practicing",
       y = "Score")
From the graph, we see a clear increasing trend. However, the increasing trend appears to slow down at higher levels of practice. To better capture this pattern, we can change the linear best-fit line to a quadratic curve. This is accomplished by changing formula = y ~ x to formula = y ~ poly(x, 2), where the second argument to poly() specifies the polynomial order (2, or squared, denotes a quadratic relationship).
cogtest |>
  ggplot(mapping = aes(x = practice, y = score)) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ poly(x, 2), color = "#EC2049") +
  geom_point() +
  scale_x_continuous(limits = c(0, 14),
                     breaks = seq(0, 14, by = 2)) +
  theme_minimal() +
  labs(title = "Does more practice lead to a better score?",
       subtitle = "Overlay quadratic curve",
       x = "Minutes spent practicing",
       y = "Score")
The quadratic function appears to offer a substantially better fit to the data, with the curve passing through the middle of the data points at each level of practice.
Let’s now explore how polynomial regression equips us with the capability to capture this curvilinear relationship between practice duration and performance outcomes.
The gist of a polynomial regression model
In building a polynomial regression model, our standard approach involves testing an array of models, each progressively incorporating an additional polynomial term (e.g., \(x_i\), \(x_i^2\), \(x_i^3\), etc.). We strive to adopt the most parsimonious model, so we begin with the simplest model and continue testing higher-order models until we find that the latest added term no longer noticeably enhances the model’s fit to the data. At this point, we opt for the previous model, where the highest-order term significantly influenced the model’s ability to capture the curvature in the relationship. Crucially, all lower-order terms are retained in the model if the highest-order term is deemed necessary, regardless of whether the lower terms have a regression parameter estimate close to zero.
First, we need to create the polynomial terms, we’ll consider up to 2 bends.
cogtest <- cogtest |>
  mutate(practice2 = practice^2, # squared term for one bend
         practice3 = practice^3) # cubed term for two bends

cogtest
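Creating explicit practice2 and practice3 columns is one approach. An equivalent alternative (a sketch, not required for what follows) is to specify the powers directly in the model formula, either with I() or with poly(..., raw = TRUE); both reproduce the same fit without adding columns to the data.
# equivalent to lm(score ~ practice + practice2, data = cogtest)
lm(score ~ practice + I(practice^2), data = cogtest)

# raw = TRUE requests raw (not orthogonal) polynomial terms
lm(score ~ poly(practice, 2, raw = TRUE), data = cogtest)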
Now, we can fit the polynomial models.
Test a series of models
Linear model
Let’s start with the linear model.
lm_lin <- lm(score ~ practice, data = cogtest)
lm_lin |> tidy(conf.int = TRUE, conf.level = .95) |> select(term, estimate, conf.low, conf.high)
lm_lin |> glance() |> select(r.squared, sigma)
cogtest |>
  ggplot(mapping = aes(x = practice, y = score)) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x, color = "orange") +
  geom_point() +
  scale_x_continuous(limits = c(0, 14),
                     breaks = seq(0, 14, by = 2)) +
  theme_minimal() +
  labs(title = "Does more practice lead to a better score?",
       subtitle = "Overlay best fit straight line",
       x = "Minutes spent practicing",
       y = "Score")
In this simple linear regression, the intercept is the predicted score for someone who receives no practice. The slope is the predicted change in the score for a one unit increase in minutes spent practicing. It is meaningfully different from zero, indicating a positive linear trend — each additional minute spent practicing is associated with a 1.5 unit increase in performance. However, the graph suggests that the relationship between practice and score is not strictly linear, or in other words, there is not a constant rate of increase. At lower levels of practice, each additional minute matters quite a lot, but at high levels of practice, the increment to performance for each additional minute is smaller.
The \(R^2\) is .86 – indicating that about 86% of the variability in performance is predicted by this linear trend. Sigma, which represents the standard deviation of the residuals, is 2.95. The standard deviation of performance is 7.67 – indicating a substantial reduction in variability once the linear trend is accounted for.
Quadratic model
Next, let’s fit the quadratic model.
lm_quad <- lm(score ~ practice + practice2, data = cogtest)
lm_quad |> tidy(conf.int = TRUE, conf.level = .95) |> select(term, estimate, conf.low, conf.high)
lm_quad |> glance() |> select(r.squared, sigma)
cogtest |>
  ggplot(mapping = aes(x = practice, y = score)) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ poly(x, 2), color = "#EC2049") +
  geom_point() +
  scale_x_continuous(limits = c(0, 14),
                     breaks = seq(0, 14, by = 2)) +
  theme_minimal() +
  labs(title = "Does more practice lead to a better score?",
       subtitle = "Overlay quadratic curve",
       x = "Minutes spent practicing",
       y = "Score")
Notice that the estimate for the quadratic term (practice2) is meaningfully different from zero (-.14, and the 95% CI doesn’t include 0). Moreover, the \(R^2\) increases quite a bit (from .86 in the linear model to .97 in the quadratic model). In addition, sigma is further reduced. This provides evidence that there is a substantial curve to the relationship. This initial curvature is an important aspect of the relationship between minutes spent practicing and subsequent performance.
Cubic model
Let’s finish with the cubic model.
lm_cubic <- lm(score ~ practice + practice2 + practice3, data = cogtest)
lm_cubic |> tidy(conf.int = TRUE, conf.level = .95) |> select(term, estimate, conf.low, conf.high)
lm_cubic |> glance() |> select(r.squared, sigma)
cogtest |>
  ggplot(mapping = aes(x = practice, y = score)) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ poly(x, 3), color = "#2F9599") +
  geom_point() +
  scale_x_continuous(limits = c(0, 14),
                     breaks = seq(0, 14, by = 2)) +
  theme_minimal() +
  labs(title = "Does more practice lead to a better score?",
       subtitle = "Overlay cubic curve",
       x = "Minutes spent practicing",
       y = "Score")
The estimate for the cubic term (practice3) is very small (-.003, and the 95% CI includes 0), and the \(R^2\) and sigma barely budged compared to the quadratic model. This provides evidence that there is not a second bend to the relationship. In addition, the cubic graph doesn’t seem to add anything over the quadratic graph. Therefore, the quadratic model appears to be the best fit for these data, and we will move forward with it.
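Beyond comparing \(R^2\) and sigma by eye, one common way to formalize the comparison of these nested models (a sketch, not the only option) is an incremental F-test with anova(); a small p-value for the quadratic row alongside a large p-value for the cubic row would point to the same conclusion we reached above.
anova(lm_lin, lm_quad, lm_cubic)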
How do we describe the effects?
Interpretation of polynomial regression models requires extra care. Let’s break down the elements for a quadratic model. For convenience, here are the results of the quadratic model that we determined was the best fit for the data. Also, as I refer to effects in the results below, I will refer to the following equation:
\[ \hat{y_i} = {b_0} + ({b_1}\times{x_i}) + ({b_2}\times{x^2_i}) \]
Where \({x_i}\) refers to practice and \({x^2_i}\) refers to practice2.
lm_quad |> tidy(conf.int = TRUE, conf.level = .95) |> select(term, estimate, conf.low, conf.high)
Interpretation of intercept
The estimate for the intercept (\({b_0}\)) is the predicted test score for people who practice 0 minutes (i.e., practice equals 0). That is, if someone doesn’t practice, we predict they will score 1.7 points on the test.
Interpretation of the shape of the parabola
The graph of a quadratic regression model is the shape of a parabola. This shape can be mound shaped (i.e., an inverted U — also referred to as concave) or bowl shaped (i.e., a U — also referred to as convex). Our example is mound shaped (see the quadratic graph above). The sign of the squared term (i.e., \({b_2}\) in the equation or the estimate for practice2, which is -.14 in our example) determines whether the shape is a mound or bowl. If the estimate of the squared term is positive, then the parabola is bowl-shaped (U-shaped), and if it’s negative, the parabola is mound-shaped (inverted U-shaped).
Interpretation of the changing slope
The slope of a line drawn tangent to the parabola at a certain value of \(x_i\) is estimated by:
\[ {b_1} + (2\times{b_2}\times{x_i}) \]
We can calculate the slope drawn tangent to the line at any desired level of \(x_i\) (i.e., minutes of practice). In other words, the slope at different minutes of practice shows how the predicted score changes for each additional minute of practice. Here’s how we can calculate this at various points:
For example:
- When practice equals 0 the slope is: \(3.517 + (2\times-.142\times0) = 3.52\)
- When practice equals 8 the slope is: \(3.517 + (2\times-.142\times8) = 1.25\)
- When practice equals 12 the slope is: \(3.517 + (2\times-.142\times12) = .11\)
- When practice equals 14 the slope is: \(3.517 + (2\times-.142\times14) = -.46\)
This means that at 0 minutes of practice time, one additional minute of practice is predicted to increase the test score by about 3.52 points (i.e., we predict they’ll get more than 3 additional problems correct). But at 8 minutes, one additional minute of practice is predicted to increase the test score by about 1.25 points. By the time we get to 12 minutes of practice time, the beneficial effect of additional time spent practicing is nearly gone, and by 14 minutes, each additional minute appears to be detrimental (indicated by the negative slope). These changing slopes for 0, 8, 12, and 14 minutes are marked on the graph below.
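These tangent slopes are easy to compute directly from the fitted model. The sketch below pulls the coefficients from lm_quad; slope_at() is a hypothetical helper name.
b <- coef(lm_quad) # named vector: (Intercept), practice, practice2

slope_at <- function(x) {
  unname(b["practice"] + 2 * b["practice2"] * x) # b1 + 2*b2*x
}

slope_at(c(0, 8, 12, 14)) # approximately 3.52, 1.25, 0.11, -0.46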
Interpretation of the vertex
There is a value of \(x_i\) along the curve when the slope drawn tangent to the line is 0. In other words, this is the point at which the predicted value of \(y_i\) (i.e., score) takes a maximum value if the parabola is a mound shape or a minimum value if the parabola is a bowl shape. This point is called the vertex and the x-coordinate of the vertex can be estimated with the following formula:
\[ {-b_1} \div (2\times{b_2}) \]
In our case, the vertex represents the practice time at which the predicted score is maximized. For our example, this is \(-3.517 \div (2\times{-.142}) = 12.4\). This is the point where the effect of practicing goes from positive to negative, and this indicates that there is no need to practice more than about 12.4 minutes. At this point, there is no additional benefit (i.e., no increment to the test score), and it would seem that any longer will begin to deteriorate the participant’s score. In any regression model, we should not extrapolate beyond the range of our observed data, so we wouldn’t want to speculate what will happen beyond 14 minutes. If we want to know, we’d need to conduct a new study with a wider range of practice time. The graph below depicts the vertex for our example.
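The x-coordinate of the vertex can likewise be computed from the stored coefficients (a sketch continuing from the coefficient vector b defined above):
vertex_x <- -b["practice"] / (2 * b["practice2"])
unname(vertex_x) # approximately 12.4 minutes of practice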
Polynomial regression is an extension of linear regression that models non-linear relationships between the predictor variable (\(X\)) and the response variable (\(Y\)). It does this by including polynomial terms of the predictor, such as quadratic, cubic, and higher powers. This method allows for more flexibility in fitting curves to the data.
Why Polynomial Regression is Useful:
- Captures Non-Linear Relationships: While linear regression assumes a straight-line relationship, polynomial regression can model more complex, curvilinear relationships between variables. This is particularly useful in real-world scenarios where the effect of a predictor on the outcome changes at different levels of the predictor.
- Increased Model Flexibility: By adding polynomial terms, the model can better fit data that deviates from a straight line. For example, a quadratic model can capture a single bend in the data, and a cubic model can capture two bends, providing a more accurate fit for certain datasets.
- Improved Predictive Performance: For datasets where the relationship between variables is inherently non-linear, polynomial regression can provide better predictions and a higher \(R^2\) value, indicating a better fit compared to linear models.
- Practical Applications: Polynomial regression is widely used in fields (including Psychology and Public Health) where relationships between variables are often non-linear. For instance, it can model the progression of disease, the effects of aging, or the diminishing returns of investment.
Key Considerations:
- Avoiding Overfitting: While adding higher-order terms can improve the model fit to the sample data, it also increases the risk of overfitting, where the model captures noise rather than the underlying trend. It is crucial to balance model complexity with generalizability.
- Interpretability: Higher-order polynomial terms can make the model harder to interpret. Typically, polynomial terms beyond the second order are less commonly used in practice for this reason.
In summary, polynomial regression is a powerful tool for modeling non-linear relationships, offering increased flexibility and better fit for complex datasets. However, it requires careful application to avoid overfitting and to maintain interpretability.
Wrap up
In applied data analysis, capturing the true relationship between variables is crucial for making accurate predictions and deriving meaningful insights. While linear regression provides a powerful and straightforward tool for modeling relationships, it can fall short when dealing with complex, non-linear patterns inherent in real-world data. Non-linear transformations, such as logarithmic transformations, and polynomial regression offer versatile solutions to these challenges. These techniques allow us to extend the linear regression framework to handle non-linearity, thereby enhancing the flexibility and predictive power of our models.
Non-linear transformations, such as logarithms, help to linearize relationships that are otherwise exponential or multiplicative in nature. By transforming skewed data, these techniques make it more normally distributed, simplifying the analysis and improving the performance of regression models. For instance, in economic data, where variables like income or GDP often exhibit right-skewed distributions, logarithmic transformations can normalize the data and reveal more straightforward linear relationships. This not only aids in meeting the assumptions of linear regression but also facilitates the interpretation of model coefficients in terms of percentage changes, which are often more intuitive.
Polynomial regression, on the other hand, directly incorporates higher-order terms of the predictor variables to model non-linear relationships. By adding quadratic and higher-order terms, polynomial regression can fit curves to data that linear models cannot capture. This technique is particularly useful in fields where the effects of predictors are often non-linear. For example, the relationship between age and healthcare costs can be modeled more accurately with a quadratic function, reflecting the increased costs at both ends of the age spectrum. While polynomial regression increases model flexibility and improves fit, it is essential to balance this with the risk of overfitting the model to sample data.
Together, non-linear transformations and polynomial regression significantly enhance the toolkit of applied data analysts. These methods enable more accurate modeling of complex relationships, improve predictive performance, and provide deeper insights into the data. By understanding and applying these techniques, analysts can better capture the intricacies of real-world phenomena and make more informed decisions based on their data analyses.