Intro to Regression Models
Description
Prediction
Causal Inference
The primary purpose of fitting a regression model is to understand and quantify the relationship between a dependent variable (outcome) and one or more independent variables (predictors). By doing so, you can:
Describe Relationships: Determine how changes in the predictor variables are associated with changes in the outcome variable.
Make Predictions: Use the established relationship to predict the value of the outcome variable for given values of the predictor variables.
Sleep efficiency is a measure of how effectively you sleep, calculated as the ratio of the time you spend asleep to the total time you spend in bed. It is expressed as a percentage, with higher percentages indicating more efficient sleep.
Let’s imagine that a person records their sleep efficiency every night using an Apple Watch.
Standard deviation (sleep efficiency): \[ s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar{y})^2} \]
Sum of squares (sleep efficiency) \[ SS = \sum_{i=1}^{n} (y_i - \bar{y})^2 \]
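The two quantities above are directly related: dividing the sum of squares by \(n - 1\) and taking the square root gives the standard deviation.

\[ s = \sqrt{\frac{SS}{n - 1}} \]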
What factors lead to better or worse sleep?
Is night-to-night variability in sleep efficiency related to alcohol consumption?
Press Run Code to simulate data for the next example.
Some key assumptions: (1) for each value of x, the average value of y sits at the center of the spread of y values; (2) the spread of y around this average follows a bell-shaped, or normal, distribution; (3) these average values can be connected with a straight line, which represents the relationship between x and y.
\[ \hat{y}_i \approx 85 + (- 5) \times x_i \]
\[ \hat{y}_i \approx 85 - 5 \times x_i \]
\[ \hat{y}_i = 84.9 - 5.0 \times x_i \]
If this individual consumes 2.5 drinks, what do we predict their sleep efficiency score will be?
The average drinks is 0.85. What is the predicted sleep efficiency?
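Plugging the mean number of drinks into the fitted equation gives the worked answer. Note that a least-squares line always passes through \((\bar{x}, \bar{y})\), so this prediction is also the mean sleep efficiency:

\[ \hat{y} = 84.9 - 5.0 \times 0.85 = 84.9 - 4.25 \approx 80.7 \]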
Let’s imagine that for a night when 2.5 drinks were consumed, the actual sleep efficiency is 80. What is the residual for this case?
\[ e_i = y_i - \hat{y}_i = 80 - 72.3 = 7.7 \]
With the augment() function, we can obtain fitted values and residuals for all observed cases:
\(R^2\) represents the proportion of the variability in the outcome that can be predicted by the predictors. SSR is the Sum of Squares Regression, SSE is the Sum of Squares Error, SST is Sum of Squares Total.
\[ R^2 = \frac{\text{SSR}}{\text{SSR + SSE}} = \frac{\text{SSR}}{\text{SST}} \]
Sum of Squares Regression, \(\text{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2\), captures how much better our model’s predictions are than simply predicting the average outcome, \(\bar{y}\), which is marked by the black horizontal line in the graph below.
Sum of Squares Error, \(\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\), captures the error in the predictions, that is, the difference between each observed score and its predicted score.
\[ R^2 = \frac{\text{SSR}}{\text{SSR} + \text{SSE}} = \frac{12128.8}{12128.8 + 9266.0} = 0.567 \]
The total sum of squares (referred to as SST) is the sum of SSR and SSE: \(\text{SST} = \text{SSR} + \text{SSE}\).
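Using the SSR and SSE values from this example, that identity gives:

\[ \text{SST} = \text{SSR} + \text{SSE} = 12128.8 + 9266.0 = 21394.8 \]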
\[ R^2 = \frac{\text{SSR}}{\text{SST}} = \frac{12128.8}{21394.8} = 0.567 \]
Standard Deviation of \(y\): The total variability in \(y\) may be described by the standard deviation. Recall that this measures how much the values of \(y\) vary around the mean of \(y\).
\[ s_y = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar{y})^2} \]
Residual Standard Deviation (\(\sigma\), sigma): After fitting a regression model, sigma represents the variability in \(y\) that remains unexplained by \(x\). It is calculated from the sum of squared residuals (SSE):
\[ \sigma = \sqrt{\frac{1}{n - 2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \]
When \(x\) is not informative about \(y\), the regression model does not improve much over simply using the mean \(\bar{y}\) to predict \(y\), and \(\sigma\) will be large, similar to the standard deviation of \(y\), \(s_y\).
When \(x\) is informative, the regression model explains some of the variability in \(y\), and the residuals (differences between observed \(y_i\) and predicted \(\hat{y}_i\)) will be smaller. This means \(\sigma\) will be smaller compared to \(s_y\), indicating that \(x\) successfully explains some of the variance in \(y\).
\(R^2 = 1 - \left(\frac{\sigma^2}{s_y^2}\right)\)
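This relationship follows from the definitions above, up to a degrees-of-freedom adjustment: \(\sigma^2 = \text{SSE}/(n-2)\) while \(s_y^2 = \text{SST}/(n-1)\), so the identity holds approximately (and would hold exactly if both were divided by the same quantity):

\[ 1 - \frac{\sigma^2}{s_y^2} = 1 - \frac{\text{SSE}/(n-2)}{\text{SST}/(n-1)} \approx 1 - \frac{\text{SSE}}{\text{SST}} = \frac{\text{SSR}}{\text{SST}} = R^2 \]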
In the code chunk below, please calculate the correlation of drinks and sleep efficiency.
In the code chunk below, create z-scores of both sleep efficiency and number of drinks. Then regress the z-score for sleep efficiency on the z-score for number of drinks.
\[ \hat{y}_i = 0 + (-.753) \times x_i \]
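The slope here equals the Pearson correlation because z-scoring sets both standard deviations to 1, so the ratio \(s_y/s_x\) in the slope formula \(b_1 = r \times s_y/s_x\) reduces to 1:

\[ b_1 = r \times \frac{s_{z_y}}{s_{z_x}} = r \times \frac{1}{1} = r = -.753 \]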
\[ b_1 = r \times \frac{s_y}{s_x} \]
Where:
\(b_1\) is the unstandardized regression slope.
\(r\) is the Pearson correlation coefficient between \(x\) and \(y\).
\(s_y\) is the standard deviation of \(y\).
\(s_x\) is the standard deviation of \(x\).
This equation shows that the unstandardized regression slope is the product of the correlation and the ratio of the standard deviations of \(y\) and \(x\).
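As a check against this example’s numbers: the standardized slope gives \(r \approx -.753\), and the fitted unstandardized slope is about \(-5.0\), which implies \(s_y/s_x \approx 6.6\) for these data:

\[ b_1 = r \times \frac{s_y}{s_x} \approx -.753 \times 6.6 \approx -5.0 \]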