Intro to Regression Models
Description
Prediction
Causal Inference
The primary purpose of fitting a regression model is to understand and quantify the relationship between a dependent variable (outcome) and one or more independent variables (predictors). By doing so, you can:
Describe Relationships: Determine how changes in the predictor variables are associated with changes in the outcome variable.
Make Predictions: Use the established relationship to predict the value of the outcome variable for given values of the predictor variables.
Sleep efficiency is a measure of how effectively you sleep, calculated as the ratio of the time you spend asleep to the total time you spend in bed. It is expressed as a percentage, with higher percentages indicating more efficient sleep.
Let’s imagine that a person records their sleep efficiency every night using an Apple Watch.
Standard deviation (sleep efficiency): \[ s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar{y})^2} \]
Sum of squares (sleep efficiency) \[ SS = \sum_{i=1}^{n} (y_i - \bar{y})^2 \]
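The two quantities above are directly related: dividing the sum of squares by \(n - 1\) and taking the square root gives the standard deviation.

\[ s = \sqrt{\frac{SS}{n - 1}} \]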
What factors lead to better or worse sleep?
Is night-to-night variability in sleep efficiency related to alcohol consumption?
Press Run Code to simulate data for the next example.
Some key assumptions: (1) for each value of x, the average value of y sits at the center of the spread of y values; (2) the spread of y around this average follows a bell-shaped, or normal, distribution; (3) these average values can be connected with a straight line, which represents the relationship between x and y.
\[ \hat{y}_i \approx 85 + (- 5) \times x_i \]
\[ \hat{y}_i \approx 85 - 5 \times x_i \]
\[ \hat{y}_i = 84.9 - 5.0 \times x_i \]
If this individual consumes 2.5 drinks, what do we predict their sleep efficiency score will be?
The average drinks is 0.85. What is the predicted sleep efficiency?
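Plugging the mean number of drinks into the fitted equation gives the worked answer. Note that a least-squares line always passes through \((\bar{x}, \bar{y})\), so this prediction is also the mean sleep efficiency:

\[ \hat{y} = 84.9 - 5.0 \times 0.85 = 84.9 - 4.25 \approx 80.7 \]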
Let’s imagine that for a night when 2.5 drinks were consumed, the actual sleep efficiency is 80. What is the residual for this case?
\[ e_i = y_i - \hat{y}_i = 80 - 72.3 = 7.7 \]
With the augment() function, we can obtain fitted values and residuals for all observed cases:
\(R^2\) represents the proportion of the variability in the outcome that can be predicted by the predictors. SSR is the Sum of Squares Regression, SSE is the Sum of Squares Error, SST is Sum of Squares Total.
\[ R^2 = \frac{\text{SSR}}{\text{SSR + SSE}} = \frac{\text{SSR}}{\text{SST}} \]
Sum of Squares Regression, \(\text{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2\), captures how much better our model’s predictions are than simply predicting the average outcome, \(\bar{y}\), which is marked by the black horizontal line in the graph below.
Sum of Squares Error, \(\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\), captures the error in the predictions, that is, the difference between each observed score and its predicted score.
\[ R^2 = \frac{\text{SSR}}{\text{SSR} + \text{SSE}} = \frac{12128.8}{12128.8 + 9266.0} = 0.567 \]
The total sum of squares (referred to as SST) is the sum of SSR and SSE: \(\text{SST} = \text{SSR} + \text{SSE}\).
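Using the SSR and SSE values from this example, that identity gives:

\[ \text{SST} = \text{SSR} + \text{SSE} = 12128.8 + 9266.0 = 21394.8 \]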
\[ R^2 = \frac{\text{SSR}}{\text{SST}} = \frac{12128.8}{21394.8} = 0.567 \]
Standard Deviation of \(y\): The total variability in \(y\) may be described by the standard deviation. Recall that this measures how much the values of \(y\) vary around the mean of \(y\).
\[ s_y = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar{y})^2} \]
Residual Standard Deviation (\(\sigma\), sigma): After fitting a regression model, sigma represents the variability in \(y\) that remains unexplained by \(x\). It is calculated from the sum of squared residuals (SSE):
\[ \sigma = \sqrt{\frac{1}{n - 2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \]
When \(x\) is not informative about \(y\), the regression model does not improve much over simply using the mean \(\bar{y}\) to predict \(y\), and \(\sigma\) will be large, similar to the standard deviation of \(y\), \(s_y\).
When \(x\) is informative, the regression model explains some of the variability in \(y\), and the residuals (differences between observed \(y_i\) and predicted \(\hat{y}_i\)) will be smaller. This means \(\sigma\) will be smaller compared to \(s_y\), indicating that \(x\) successfully explains some of the variance in \(y\).
\(R^2 = 1 - \left(\frac{\sigma^2}{s_y^2}\right)\)
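This relationship follows from the definitions above, up to a degrees-of-freedom adjustment: \(\sigma^2 = \text{SSE}/(n-2)\) while \(s_y^2 = \text{SST}/(n-1)\), so the identity holds approximately (and would hold exactly if both were divided by the same quantity):

\[ 1 - \frac{\sigma^2}{s_y^2} = 1 - \frac{\text{SSE}/(n-2)}{\text{SST}/(n-1)} \approx 1 - \frac{\text{SSE}}{\text{SST}} = \frac{\text{SSR}}{\text{SST}} = R^2 \]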
In the code chunk below, please calculate the correlation of drinks and sleep efficiency.
In the code chunk below, create z-scores of both sleep efficiency and number of drinks. Then regress the z-score for sleep efficiency on the z-score for number of drinks.
\[ \hat{y}_i = 0 + (-.753) \times x_i \]
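The slope here equals the Pearson correlation because z-scoring sets both standard deviations to 1, so the ratio \(s_y/s_x\) in the slope formula \(b_1 = r \times s_y/s_x\) reduces to 1:

\[ b_1 = r \times \frac{s_{z_y}}{s_{z_x}} = r \times \frac{1}{1} = r = -.753 \]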
\[ b_1 = r \times \frac{s_y}{s_x} \]
Where:
\(b_1\) is the unstandardized regression slope.
\(r\) is the Pearson correlation coefficient between \(x\) and \(y\).
\(s_y\) is the standard deviation of \(y\).
\(s_x\) is the standard deviation of \(x\).
This equation shows that the unstandardized regression slope is the product of the correlation and the ratio of the standard deviations of \(y\) and \(x\).
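As a check against this example’s numbers: the standardized slope gives \(r \approx -.753\), and the fitted unstandardized slope is about \(-5.0\), which implies \(s_y/s_x \approx 6.6\) for these data:

\[ b_1 = r \times \frac{s_y}{s_x} \approx -.753 \times 6.6 \approx -5.0 \]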