Apply and Practice Activity

Build Your Own Model with MIDUS Twins

Introduction

In predictive analytics, the goal is to build a model that not only fits the current data but also generalizes well to unseen data. To achieve this, you must navigate two key challenges: (1) capturing meaningful patterns in the training data, and (2) avoiding overfitting, where the model becomes too tailored to the specifics of the training data and performs poorly on new data. Balancing these challenges is known as the bias-variance tradeoff.

The Bias-Variance Tradeoff

The bias-variance tradeoff is a fundamental concept in model development. It refers to the tension between two sources of error that can affect the performance of a predictive model:

  1. Bias: Bias refers to errors introduced by making simplifying assumptions in the model. A model with high bias might be too simple, underfitting the data by failing to capture important patterns. This typically results in poor performance because the model cannot accurately reflect the relationships in the data.

  2. Variance: Variance refers to errors introduced by the model being too sensitive to small fluctuations in the data. A model with high variance is likely overfitting: it captures noise and quirks that are specific to the data used to build it and don’t generalize to new data. This leads to excellent performance on the data used to develop the model but poor performance on new, unseen data.

The goal of model building is to find the optimal balance between bias and variance, resulting in a model that generalizes well to new, unseen data.
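To make this concrete, here is a minimal, self-contained sketch in R using simulated data (not the MIDUS data, and not part of the activity): a very flexible model fits its own training data closely but can predict held-out data worse than a simpler model.

library(tidyverse)

# Hypothetical simulated data, for illustration only
set.seed(123)
sim <- tibble(x = runif(225, 0, 10),
              y = 2 + 0.5 * x + rnorm(225, sd = 1))

sim_train <- sim |> slice(1:25)    # small training set
sim_test  <- sim |> slice(26:225)  # held-out data

simple_fit  <- lm(y ~ x, data = sim_train)            # simple model (risk of bias)
complex_fit <- lm(y ~ poly(x, 15), data = sim_train)  # very flexible model (risk of variance)

# Root mean squared prediction error for a model on a given data frame
rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

c(simple_train  = rmse(simple_fit, sim_train),
  simple_test   = rmse(simple_fit, sim_test),
  complex_train = rmse(complex_fit, sim_train),
  complex_test  = rmse(complex_fit, sim_test))

Typically, the flexible model shows a much lower error on its own training data but a larger error on the held-out data, which is exactly the pattern this activity asks you to watch for.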

Why Split the Data?

To assess this balance and avoid overfitting, we split the data frame into two parts: a training set and a test set.

  1. Guided Model Development: The training set is used to shape and refine the model. During this phase, you adjust the model’s structure, select variables, and tune parameters to best capture the patterns in the data. However, using only the training set might lead to overfitting, where the model becomes too complex and perfectly fits the training data but fails on new data.

  2. Objective Evaluation on Unseen Data: The test set remains completely separate from the model-building process. After you finalize the model using the training data, the test set serves as a proxy for new, unseen data. By evaluating the model’s performance on this data, you obtain an objective measure of its ability to generalize beyond the training set.

  3. Detecting Overfitting: If your model performs exceptionally well on the training data but poorly on the test data, this is a clear sign of overfitting. The model has likely captured noise in the training data rather than true underlying patterns. This discrepancy highlights the importance of using the test set for an unbiased evaluation.

  4. Mitigating Underfitting: Conversely, if the model performs poorly on both the training and test sets, it might be too simplistic, resulting in underfitting. In this case, the model has too much bias and is not flexible enough to capture the true complexity of the data.

  5. Balancing Complexity and Generalization: In practice, you want to select a model that balances complexity with generalization. A simpler model may have higher bias, but it is often more generalizable to new data. A more complex model might have low bias but may suffer from high variance, leading to overfitting. This is why testing the model on unseen data is crucial for determining how well it can generalize.
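For context, one common way to create such a split is to sample cases at random; a small hypothetical sketch is below. In this activity, the split will instead be made by twin, so the two siblings in each pair end up in different sets.

library(tidyverse)

# Hypothetical example of a random 50/50 train/test split (not the MIDUS data)
set.seed(2025)                                    # make the random split reproducible
example_data <- tibble(id = 1:100, y = rnorm(100))

train_ids <- sample(example_data$id, size = 50)   # randomly choose half the ids
train_set <- example_data |> filter(id %in% train_ids)
test_set  <- example_data |> filter(!(id %in% train_ids))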

Enhancing Robustness

By using this two-step process — training on one set and validating on another — you enhance the robustness of your model. This reduces the chance that the model is overly influenced by anomalies or quirks in the training data, ensuring it is reliable and versatile for real-world applications. A model that generalizes well is better suited for making predictions in dynamic and complex environments.

Conclusion

In summary, splitting the data into training and test sets, along with understanding concepts like the bias-variance tradeoff, are key strategies for building models that not only fit the training data but also perform well in real-world predictive challenges. This process helps ensure your model captures meaningful patterns while avoiding overfitting and underfitting, ultimately leading to a more accurate and robust predictive tool.

The focus for the activity

In this Apply and Practice Activity, you will build your own model to predict work interfering with family (WIF) using the MIDUS twins dataset from Module 14. The data will be split into two sets: one set representing one sibling from each twin pair and the other set representing the other sibling. You will develop a model using one set, selecting the variables, transformations, recoding, and interactions as you see fit. Once you’ve built what you consider to be your ideal model, you will test it on the second dataset to evaluate how well it performs on unseen data.

Let’s get started

Step 1

In the Posit Cloud foundations project, start a new quarto document. Click File -> New File -> Quarto Document. A dialog box will pop up — in it, give your new document a title (e.g., Model Building with MIDUS), then type your name (beside Author). Uncheck the box beside Use visual markdown editor. Then click Create.

Once the file is created, click File -> Save As, then save your file in the apply_and_practice_programs folder inside the programs folder of the foundations course project. Name it model_building_midus.qmd.

Quarto auto-populates some text and code chunk examples to help you get started. You can delete all of this. Just highlight everything BELOW the YAML header (i.e., the part that starts and ends with the three dashes ---) and then click delete.
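After deleting the template content, the only thing left in the file should be the YAML header. It will look roughly like the following, with the title and author you entered (the exact fields Quarto generates may differ slightly):

---
title: "Model Building with MIDUS"
author: "Your Name"
format: html
---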

To ensure you are working in a fresh session, close any other open tabs (save them if needed). Click the down arrow beside the Run button toward the top of your screen then click Restart R and Clear Output.

Step 2

First we need to load the packages that are needed for this activity. Create a first level header called

# Load packages

Then insert a code chunk and load the following packages.

library(broom)
library(here) 
library(skimr)
library(gt)
library(tidyverse)

Step 3

Create a first level header

# Import data

Insert a code chunk, then import the MIDUS twins data frame.

orig_data <- read_rds(here("data", "midus_twins_workfamily.Rds"))

Here’s a listing of all the variables in the data frame.

| Variable | Description |
|----------|-------------|
| id | Individual ID |
| fam_id | Family ID |
| which_twin | Distinguishes sibling 1 from sibling 2 in the twin pair |
| time | Study wave: w1, w2, w3 for waves 1 through 3 |
| zyg | Zygosity of twin pair: dz - same sex = dizygotic male-male or female-female; dz - different sex = dizygotic male-female; mz = monozygotic |
| wif | Work Interference with Family: how often job demands affected home life in the past year. Responses range from all the time (1) to never (5). Higher scores indicate greater conflict. |
| fiw | Family Interference with Work: how often family responsibilities impacted work in the past year. Responses range from all the time (1) to never (5). Higher scores indicate greater conflict. |
| jd | Job Demands: intensity of job requirements, including too many demands, insufficient time, and frequent interruptions. Responses range from all the time (1) to never (5), reverse-scored so that higher scores indicate greater job demands. |
| fd | Family Demands: pressure from family obligations, including excessive demands and frequent interruptions. Responses range from all the time (1) to never (5). Higher scores indicate greater family demands. |
| ext | Extraversion: personality trait indicating sociability. Responses range from not at all (1) to a lot (4). Higher scores reflect greater extraversion. |
| agr | Agreeableness: propensity for kindness and cooperation. Responses range from not at all (1) to a lot (4). Higher scores reflect greater agreeableness. |
| neu | Neuroticism: personality trait indicating emotional instability. Responses range from not at all (1) to a lot (4). Higher scores reflect greater neuroticism. |
| opn | Openness: trait related to creativity and willingness to try new experiences. Responses range from not at all (1) to a lot (4). Higher scores reflect greater openness. |
| con | Conscientiousness: trait indicating reliability and diligence. Responses range from not at all (1) to a lot (4). Higher scores reflect greater conscientiousness. |
| lifesat | Life Satisfaction: 1 = not at all, 2 = a little, 3 = somewhat, 4 = a lot |
| mh | Mental Health Rating: 1 = poor, 2 = fair, 3 = good, 4 = very good, 5 = excellent |
| ph | Physical Health Rating: 1 = poor, 2 = fair, 3 = good, 4 = very good, 5 = excellent |
| comph | Comparison of Overall Health to Others Your Age: 1 = much worse, 2 = somewhat worse, 3 = about the same, 4 = somewhat better, 5 = much better |
| sex | Sex of twin: 1 = male, 2 = female |
| age | Age of twins in years |
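Before moving on, you may find it helpful to get a quick overview of the data frame. One optional way to do this uses glimpse() from the tidyverse and skim() from the skimr package you loaded in Step 2:

# Optional: quick overview of the imported data
glimpse(orig_data)  # variable names, types, and first few values
skim(orig_data)     # summary statistics for each variable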

Step 4

Create a first level header

# Format and split the data

Copy and paste the following code to begin preparing the data. First, we’ll perform some data wrangling to focus on wave 1. Then, we’ll split the data into two separate data frames: one representing one sibling from each twin pair, and the other representing the second sibling from each pair.

wave1 <-
  orig_data |> 
  filter(time == "w1") 

train <- wave1 |> filter(which_twin == "twin1")
test <- wave1 |> filter(which_twin == "twin2")

You will develop your model using the train data frame. Importantly, don’t look at or use the test data frame until you have fully developed your model with the train data frame.
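If you’d like, you can add a quick optional check that the split behaved as expected before you start modeling:

# Optional check: each data frame should contain one sibling from every twin pair
count(wave1, which_twin)  # number of rows for twin1 and twin2 at wave 1
nrow(train)
nrow(test)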

Step 5

Create a first level header

# Develop your ideal model

Now, build your model to predict Work Interference with Family (WIF) using your knowledge of the variables, theoretical understanding, and intuition. Call your model wif_model_train. Below are some guidelines to help you get started.

  • Understanding WIF

    • WIF measures how much job demands interfere with family life. You should start by reflecting on what factors are likely to contribute to this interference. For instance, do higher job demands lead to more interference? Does personality, such as neuroticism or extraversion, play a role?
  • Review the Variables

    • Examine the variables in the data frame carefully. Think about which ones are likely to affect WIF based on your theoretical understanding of work-family conflict, personality traits, and life satisfaction. Some variables you might consider:

      • Job Demands (jd): Likely a direct predictor of WIF.

      • Family Demands (fd): Could contribute to WIF by adding additional pressure on the individual.

      • Personality Traits (e.g., neuroticism, extraversion, agreeableness): Could moderate the relationship between job/family demands and WIF.

      • Life Satisfaction (lifesat) and Mental Health (mh): May act as buffers or reflect how an individual perceives and manages conflict between work and family.

  • Theoretical Basis for Relationships

    • Use existing theories of work-family conflict, stress, or personality to guide your model-building process. For example:

      • The Demand-Control Model suggests that high demands combined with low control predict stress and work-family interference.

      • Personality traits like neuroticism may exacerbate stress responses, while conscientiousness may reduce the likelihood of work interfering with family life.

  • Start with a Simple Model

    • Begin by including variables you believe are most strongly related to WIF based on theory or intuition. A good starting point might be:
wif_model_train <- lm(wif ~ jd + fd + ext + neu, data = train)

  • Consider Interaction Terms

    • It’s possible that some variables interact with each other. For example:

      • Job Demands might interact with Personality Traits: someone high in neuroticism might experience more WIF under high job demands than someone lower in neuroticism (see the sketch after this list).

  • Transformation and Recoding

    • Review the distribution of the variables. You may want to transform variables if they are skewed, or use categorical versions of variables if theory suggests it (e.g., splitting mental health into categories based on thresholds).

  • Fit, Evaluate, and Adjust

    • After fitting your initial model, evaluate its fit by examining the model summary with the broom functions tidy(), glance(), and augment() (see the sketch after this list).

    • Pay attention to the regression coefficients for the predictors and the adjusted R-squared. Adjust your model based on these evaluations, but always keep the larger goal in mind: to develop a model using the training data that can optimally predict WIF in the test data. It’s important to avoid overfitting to the training data, where the model performs perfectly on what it has seen but fails to generalize to new, unseen data. This is where the bias-variance tradeoff comes into play: as you refine your model, strive to balance capturing the complexity of the data (reducing bias) without making it too specific to the training data (increasing variance). Your adjustments should help the model generalize, making it both flexible enough to detect patterns and robust enough to perform well on the test data.
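As a concrete illustration of the interaction and evaluation points above (a sketch only, not a recommended final model), you might specify an interaction between job demands and neuroticism and then inspect the fit with the broom functions:

# Illustration only: allow job demands and neuroticism to interact
wif_model_train <- lm(wif ~ jd * neu + fd + ext, data = train)

# Inspect the fit with the broom functions
wif_model_train |> tidy()     # coefficients, standard errors, and p-values
wif_model_train |> glance()   # overall fit statistics (R-squared, adjusted R-squared, sigma, ...)
wif_model_train |> augment()  # fitted values and residuals for each case in train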

Step 6

Once you have your ideal model specified, compute the following fit statistics for the training data frame.

wif_model_train |> glance() |> select(r.squared, adj.r.squared, sigma)

Step 7

Create a first level header

# Evaluate model performance

In the code chunk, use the following code to obtain predictions for the participants in the test data frame. Here, we take the developed model from the training data frame (wif_model_train) and apply it to the new data via the augment() function from broom.

The results will be stored in a data frame named unseen_predictions, defined in the code chunk below.

unseen_predictions <- 
  wif_model_train |> 
  augment(newdata = test)

Run this code and then inspect the contents of unseen_predictions. Notice that it includes the .fitted and .resid values, which represent the predicted wif (based on the model we developed on the training data frame) and the residual (the difference between the observed and predicted scores) for the siblings in the test data frame. Essentially, this helps us see whether the model developed with the siblings in the training data frame does a good job of predicting the outcome (wif) for the siblings in the test data frame.
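If you want a visual check as well, one optional approach is to plot the predicted against the observed wif scores in the test data frame; points near the dashed identity line indicate accurate predictions.

# Optional: predicted vs. observed wif in the test data frame
unseen_predictions |> 
  ggplot(aes(x = .fitted, y = wif)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(x = "Predicted wif", y = "Observed wif")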

We can generate fit metrics to determine how well our model performed in predicting the outcome in the test data frame. We’ll consider three metrics – namely \(R^2\), adjusted-\(R^2\), and Sigma. By comparing these statistics between our training and test data frames, we can gain a comprehensive understanding of how well our model generalizes to new, unseen data.

When comparing these statistics between training and test data frames, here’s what you should look for:

  1. \(R^2\): Ideally, the \(R^2\) values between the training and test data frames should be close. A significantly lower \(R^2\) in the test data frame might suggest that the model doesn’t generalize well to new data or may have been overfit to the training data.
  2. Adjusted-\(R^2\): Like \(R^2\), you’d hope the adjusted-\(R^2\) values are comparable between training and test data frames. A large drop in the test data frame might indicate overfitting.
  3. Sigma (Residual Standard Error): A comparable sigma between the training and test data frames is ideal. A significantly larger sigma in the test data frame suggests the model predictions are less accurate on new, unseen data.

In summary:

  • Consistency between these metrics in the training and test data frames indicates good generalization and a model that’s likely not overfit.
  • Significant discrepancies, especially if the test metrics are noticeably worse, suggest potential overfitting or that the model doesn’t generalize well to new data.
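For reference, the code below computes these metrics from the test-set predictions as follows:

\[
SSR = \sum_i (\hat{y}_i - \bar{y})^2, \qquad
SSE = \sum_i (y_i - \hat{y}_i)^2,
\]

\[
R^2 = \frac{SSR}{SSR + SSE}, \qquad
R^2_{adj} = 1 - \frac{n - 1}{n - p - 1}\,(1 - R^2), \qquad
\sigma = \sqrt{\frac{SSE}{n - p - 1}},
\]

where \(y_i\) is an observed wif score, \(\hat{y}_i\) its predicted value, \(\bar{y}\) the mean observed score, \(n\) the number of cases used, and \(p\) the number of predictors in the model (excluding the intercept).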

Let’s calculate these metrics for our example.

Create a second level header

## Evaluate predictions

In the code chunk, use the following code to obtain commonly used evaluation metrics.

unseen_predictions <-
  unseen_predictions |> 
  select(wif, .fitted, .resid) |> 
  drop_na() |>  # drop cases where wif, .fitted or .resid are missing
  mutate(for_SSR = .fitted - mean(wif), # calculate the difference between each predicted score and the mean of y
         for_SSR2 = for_SSR^2) |>  # square the difference
  mutate(for_SSE2 = .resid^2) |>  # square the residual 
  select(wif, .fitted, .resid, for_SSR, for_SSR2, for_SSE2)

n <- nrow(unseen_predictions)  # number of cases actually used (complete cases from the test data frame)
p <- length(wif_model_train$coefficients) - 1  # number of predictors (excluding the intercept) in the developed model

fits_test <- 
  unseen_predictions |> 
  summarize(SSR = sum(for_SSR2),
            SSE = sum(for_SSE2),
            sigma = sqrt(SSE/(n - p - 1)),
            R2 = SSR/(SSR + SSE),
            R2_adj = 1 - ( (n - 1)/(n - p - 1) ) * (1 - (SSR/(SSR + SSE)))  # same adjustment used by adj.r.squared from glance()
            )

fits_test

Let’s compare the fit metrics from the training data frame and the test data frame.

Create a second level header

## Compare fits for train and test

In the code chunk, use the code below to get a side by side comparison. Take a look at the metrics and jot down your thoughts.

# fit metrics from the model fit to the training data frame
wif_model_train |> glance() |> select(r.squared, adj.r.squared, sigma)

# fit metrics from applying that model to the test data frame
fits_test |> select(R2, R2_adj, sigma)

Additionally, let’s create a nicely formatted table to compare the model fits from the training and test data frames. Use the code below:

# Extract the fit indices for the training model
train_indices <- wif_model_train |> 
  glance() |> 
  select(r.squared, adj.r.squared, sigma) |> 
  rename(
    `R-squared` = r.squared,
    `Adjusted R-squared` = adj.r.squared,
    `Residual Std. Error` = sigma
  ) |> 
  mutate(Model = "Training")

# Extract the fit indices for the test model
test_indices <- fits_test |> 
  select(R2, R2_adj, sigma) |> 
  rename(
    `R-squared` = R2,
    `Adjusted R-squared` = R2_adj,
    `Residual Std. Error` = sigma
  ) |> 
  mutate(Model = "Test")

# Combine the indices into a single data frame
fit_indices <- bind_rows(train_indices, test_indices)

# Create the gt table
fit_table <- fit_indices |> 
  gt(rowname_col = "Model") |> 
  tab_header(
    title = "Model Fit Indices for Training and Test Data"
  ) |> 
  fmt_number(
    columns = c(`R-squared`, `Adjusted R-squared`, `Residual Std. Error`),
    decimals = 3
  ) |> 
  cols_label(
    `R-squared` = "R-squared",
    `Adjusted R-squared` = "Adjusted R-squared",
    `Residual Std. Error` = "Residual Std. Error"
  )

# Display the table
fit_table

# Save the table
fit_table |> gtsave(filename = "midus_table.html")
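By default, gtsave() writes the file to the current working directory. If you would rather save the table to a specific folder in the project, you could build the path with here(); for example (assuming the folder structure described in Step 1):

# Optional: save the table into the apply_and_practice_programs folder instead
fit_table |> gtsave(filename = here("programs", "apply_and_practice_programs", "midus_table.html"))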

Step 8

Finalize and submit.

Now that you’ve completed all tasks, to help ensure reproducibility, click the down arrow beside the Run button toward the top of your screen then click Restart R and Clear Output. Scroll through your notebook and see that all of the output is now gone. Now, click the down arrow beside the Run button again, then click Restart R and Run All Chunks. Scroll through the file and make sure that everything ran as you would expect. You will find a red bar on the side of a code chunk if an error has occurred. Taking this step ensures that all code chunks are running from top to bottom, in the intended sequence, and producing output that will be reproduced the next time you work on this project.

Now that all code chunks are working as you’d like, click Render. This will create an .html output of your report. Scroll through to make sure everything is correct. The .html output file will be saved alongside the corresponding .qmd notebook file.

Follow the directions on Canvas for the Apply and Practice Assignment entitled “Model Building with MIDUS Apply and Practice Activity” to get credit for completing this assignment.