Non-Linear Transformations
Non-linear transformations are mathematical modifications applied to variables in a regression model to better capture the underlying relationship between predictors and outcomes.
They help address non-linear patterns that simple linear regression cannot explain by transforming the scale of a variable.
Reveal hidden patterns: Non-linear relationships can become more interpretable after transformation.
Meet assumptions: Improve model fit by addressing skewness, stabilizing variance, or making relationships more linear.
Control for large ranges: Manage variables with wide ranges, e.g., converting income or population size to a logarithmic scale.
Upward Transformations: Increase the magnitude of changes, often using powers
Quadratic: \(x^2\)
Exponential: \(e^x\)
Downward Transformations: Compress changes, making large values less extreme
Logarithmic: \(\log(x)\)
Square Root: \(\sqrt{x}\)
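To make the two families concrete, here is a minimal sketch in base R. The values in `x` are illustrative assumptions, not taken from the data used later.

```r
# Illustrative values only (an assumption for this sketch)
x <- c(1, 2, 4, 8)

transformed <- data.frame(
  x           = x,
  quadratic   = x^2,     # upward: gaps between large values grow
  exponential = exp(x),  # upward: grows even faster
  logarithmic = log(x),  # downward: natural log compresses large values
  square_root = sqrt(x)  # downward: milder compression than the log
)

print(transformed)
```

Reading across a row shows how the same value of `x` is stretched by the upward transformations and compressed by the downward ones.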
Logarithmic transformations are very common in statistics, in particular the natural log transformation. Denoted ln, the natural logarithm uses the mathematical constant e (approximately 2.71828) as its base.
The natural logarithm of \(x\), written \(\ln(x)\) or \(\log_e(x)\), is the power (or exponent) to which the base \(e\) must be raised to produce \(x\): \(\ln(x) = c \quad \text{means that} \quad e^c = x\).
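The inverse relationship between \(\ln\) and \(e\) can be checked directly in R, where `log()` computes the natural log by default. The number 20 is an arbitrary value chosen for this sketch.

```r
e <- exp(1)         # the constant e, approximately 2.71828

c_value <- log(20)  # the power to which e must be raised to give 20
exp(c_value)        # raising e to that power recovers 20

log(1)              # 0, because e^0 = 1
log(e)              # 1, because e^1 = e
```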
What can non-linear transformations buy us in terms of linear regression modeling?
We’ll use data compiled by Our World in Data on gross domestic product (GDP) per capita and life expectancy of residents in 166 entities/countries.
Press Run Code on the code chunk below to import the data and wrangle it for analysis — including selecting year 2021, creating shorter names, and dropping countries with missing data.
Press Run Code on the code chunks below to create density plots of life expectancy and GDP per capita.
Press Run Code on the code chunk below to create a scatter plot of the raw variables.
Press Run Code on the code chunks below to create density plots of GDP per capita and ln(GDP per capita).
In the code chunk below, create a new variable that is the natural log of GDP (call it gdp_ln). Then, regress life expectancy (le) on ln(GDP) (gdp_ln). Request the tidy() output with 95% CIs.
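One possible shape for these steps is sketched below in base R. Because the course data are not loaded here, the sketch runs on simulated data: the data frame `sim`, its 166 rows, and the coefficients used to generate it are assumptions made only so the code is runnable, and its output will not match the real results.

```r
set.seed(1)

# Simulated stand-in for the real data (hypothetical values, for illustration only)
n   <- 166
gdp <- exp(runif(n, min = log(700), max = log(100000)))  # right-skewed GDP per capita
le  <- 21.1 + 5.44 * log(gdp) + rnorm(n, sd = 2)         # life expectancy plus noise
sim <- data.frame(le = le, gdp = gdp)

# Step 1: create the natural log of GDP per capita
sim$gdp_ln <- log(sim$gdp)

# Step 2: regress life expectancy on ln(GDP per capita)
fit <- lm(le ~ gdp_ln, data = sim)

# Step 3: estimates with 95% confidence intervals
# (broom's tidy(fit, conf.int = TRUE, conf.level = 0.95) reports the same
#  estimates and intervals as a tibble)
cbind(estimate = coef(fit), confint(fit, level = 0.95))
```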
The intercept (21.10) is the expected life expectancy when ln(GDP per capita) equals zero. Because ln(1) = 0, this corresponds to a GDP per capita of $1. Therefore, according to our model, a country with a GDP per capita of $1 would have an expected life expectancy of 21.1 years.
The slope (5.44) indicates that each one-unit increase in the natural logarithm of GDP per capita is associated with an increase of approximately 5.44 years in life expectancy at birth. This indicates a positive relationship between the natural log of GDP per capita and life expectancy.
A one-percent increase in GDP per capita is associated with an increase of about 0.05 years in life expectancy: \(5.44 \times \ln(1.01) \approx 0.054\).
Of course, a 1 percent increase in GDP per capita is very small. It is probably more informative to consider a larger increase, for example a 100 percent increase in GDP per capita. In this instance, we’d ask the question: “How much would we expect life expectancy to differ between two countries where one has a GDP per capita that is twice the size of the other?”
If one country has a GDP per capita that is twice that of another country (a 100% increase), then we expect the country with the higher GDP per capita to have a life expectancy that is about 3.8 years longer: \(5.44 \times \ln(2) \approx 3.77\).
“How much would we expect life expectancy to differ between two countries where one has a GDP per capita that is 50% larger than the other?”
A 50% increase in GDP per capita is associated with about a 2.2 year increase in life expectancy: \(5.44 \times \ln(1.5) \approx 2.21\).
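The three percent-change interpretations above all follow the same rule: when only the predictor is logged, a p% difference in x corresponds to an expected difference of slope × ln(1 + p/100) in y. A small helper makes this explicit. The function name `level_log_change()` is a hypothetical one introduced for this sketch, and the slope is the rounded value reported in the text.

```r
# Expected difference in y for a given percent difference in x,
# for a model of the form y = b0 + b1 * ln(x)
level_log_change <- function(slope, pct_change_x) {
  slope * log(1 + pct_change_x / 100)
}

slope <- 5.44  # rounded slope from the fitted model, as reported in the text

level_log_change(slope, 1)    # 1% higher GDP per capita: about 0.05 years
level_log_change(slope, 50)   # 50% higher: about 2.2 years
level_log_change(slope, 100)  # twice as large: about 3.8 years
```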
Let’s compare two countries with different GDP levels: the Democratic Republic of Congo (where GDP per capita = $817 in 2021) and the United States (where GDP per capita = $57,523 in 2021).
To aid with using marginaleffects tools, let’s fit the same model, but with slightly different syntax:
We can use the average_comparisons() function from marginaleffects to examine the expected change in life expectancy if the Democratic Republic of Congo experienced a $500 increase in GDP per capita, and if the USA experienced a $500 increase in GDP per capita.
Now, let’s see what happens if we double GDP per capita for each country.
The red lines represent the GDP per capita of the Democratic Republic of Congo and the blue lines represent the GDP per capita of the USA. Solid lines mark GDP per capita in 2021, dashed lines mark a $500 increase, and dotted lines mark a 100% increase.
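The same two comparisons can be approximated by hand, which helps show why a fixed dollar increase matters far more at low GDP while a doubling has the same expected effect everywhere. This is a sketch that uses the rounded slope (5.44) and the two GDP per capita values from the text rather than the marginaleffects output, so small rounding differences are to be expected. The helper `le_change()` is a hypothetical name.

```r
slope <- 5.44                      # rounded slope from the text
gdp   <- c(DRC = 817, USA = 57523) # GDP per capita in 2021, from the text

# Expected change in life expectancy when GDP per capita moves from `from` to `to`
le_change <- function(from, to, b1 = slope) {
  b1 * (log(to) - log(from))
}

plus_500 <- le_change(gdp, gdp + 500)  # a $500 increase for each country
doubled  <- le_change(gdp, gdp * 2)    # a 100% increase for each country

round(rbind(plus_500, doubled), 2)
```

Under these assumptions, the $500 increase corresponds to roughly 2.6 years for the Democratic Republic of Congo but only about 0.05 years for the USA, whereas doubling GDP per capita corresponds to about 3.8 years for both.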
Rather than consider GDP and life expectancy, let’s consider GDP and CO2 emissions. These data also come from Our World in Data. Let’s consider the total GDP (not GDP per capita) and the annual total emissions of carbon dioxide (CO₂) measured in tonnes for the year 2021.
Click Run Code on the code chunk below.
Press Run Code on the code chunks below to create density plots of CO2 emissions and GDP.
Click Run Code on the code chunk below. After creating the initial plot — explore the following changes:
Request the natural log of gdp.
Request the natural log of co2.
Request the natural log of gdp and co2.
Create the natural log transformed versions of GDP and CO2 emissions, then fit a SLR to regress ln(CO2 emissions) on ln(GDP). Request the tidy() output with 95% CIs.
Each one-unit increase in ln(GDP) is associated with a 0.988-unit increase in ln(CO2).
The function below can be used to interpret a regression slope in which both x and y have been natural log transformed. Enter in the regression slope from the fitted model, and the desired percent change for the x variable.
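One way such a function could be written is sketched here. The name `log_log_pct_change()` and its argument names are illustrative assumptions. It relies on the fact that, when both variables are logged, multiplying x by (1 + p/100) multiplies the expected y by (1 + p/100) raised to the slope.

```r
# Percent change in y associated with a given percent change in x,
# for a model of the form ln(y) = b0 + b1 * ln(x)
log_log_pct_change <- function(slope, pct_change_x) {
  ((1 + pct_change_x / 100)^slope - 1) * 100
}

# Slope from the fitted model, with a 100% increase in GDP
log_log_pct_change(slope = 0.9881734, pct_change_x = 100)  # roughly 98
```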
A 100% increase in GDP is associated with a 98% increase in CO2 emissions: \(2^{0.988} \approx 1.98\).
What are the predicted CO2 emissions for the USA?
The GDP for the USA is 21,131,600,000,000. What is the natural log of this value?
Solve for y-hat (the predicted ln(CO2)):
Back-transform y-hat so it is in its original metric (CO2 in tonnes):
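The three steps can be collected into one small function. This is a sketch only: `predict_co2()` is a hypothetical helper, and the intercept `b0` is left as an argument because its value has to be read from the tidy() output of the fitted model.

```r
gdp_usa    <- 21131600000000  # GDP for the USA, from the text
ln_gdp_usa <- log(gdp_usa)    # step 1: natural log of GDP
ln_gdp_usa

# b0 and b1 are the intercept and slope of the fitted log-log model
predict_co2 <- function(b0, b1, gdp) {
  y_hat_ln <- b0 + b1 * log(gdp)  # step 2: y-hat on the ln(CO2) scale
  exp(y_hat_ln)                   # step 3: back-transform to tonnes of CO2
}

# Example call once the intercept has been read off the model output:
# predict_co2(b0 = <intercept>, b1 = 0.9881734, gdp = gdp_usa)
```

Exponentiating undoes the natural log, so the result is expressed in tonnes of CO2 rather than in log units.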
And, we can save ourselves some work by using augment().