A webR tutorial
Which colleges promote the most upward mobility?
Background
In a 2017 study, Dr. Raj Chetty and colleagues analyzed intergenerational income mobility at 2,202 U.S. colleges using data from over 30 million students. They calculated the median parental income for students attending college in the early 2000s and the median income of these students at age 34. Parental income was defined as the average pre-tax household income when the child was 15-19, adjusted to 2015 dollars. Children's income was based on individual earnings in 2014, ranked within their birth cohort, as were parents' incomes. In the data frame, each row of data represents one of the 2,202 colleges.
We'll explore Chetty and colleagues' data in this webR activity.
Import the data
We'll use data provided by the authors, downloaded from the Opportunity Insights data repository. We'll focus on the following key variables within the data frame:
The name of the college.
The median income of parents (as described above and called par_median).
The median income of children at age 34 (called k_median).
Press Run Code on the code chunk below to import the data.
Visualize the data
Data visualization is always the best place to start. Please create a scatter plot of the data: map parents' median income to the x-axis and children's median income to the y-axis (i.e., fill in the XXX spots with the variable names in the code chunk below).
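For reference, here is a minimal sketch of what such a scatter plot could look like. It assumes the ggplot2 package is available and uses a tiny made-up data frame, toy_mobility, so that it runs on its own; the numbers are invented for illustration and are not the Opportunity Insights data. Drawing the best fit line with geom_smooth(method = "lm") is also an assumption about how the line is produced.

```r
library(ggplot2)

# Made-up illustrative values, NOT the Opportunity Insights data.
toy_mobility <- data.frame(
  name       = c("College A", "College B", "College C", "College D", "College E"),
  par_median = c(40000, 65000, 90000, 115000, 140000),
  k_median   = c(24000, 35000, 38000, 52000, 55000)
)

# Parent median income on the x-axis, child median income on the y-axis,
# with a least-squares best fit line drawn over the points.
mobility_plot <- ggplot(toy_mobility, aes(x = par_median, y = k_median)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Median parent income", y = "Median child income at age 34")

mobility_plot
```

To use the real data, swap toy_mobility for the college_mobility data frame imported above.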
Interpret the plot
With your neighbor, construct a description of this plot. What does the best fit line tell us? How should we interpret data points that are above the best fit line? How about data points that are below the best fit line? What does the data point for CSU tell us?
Positive Correlation: The regression line shows a positive slope, indicating that higher median parent income is associated with higher median child income across colleges.
Variation Around the Line: Points above the line represent colleges where children earn more than expected based on parent median income, suggesting upward mobility. Conversely, points below the line suggest downward mobility: students are not reaching the expected income level given parent median income.
What's the story with CSU?: Based on these data and this model, CSU students don't fare quite as well as we would expect. That is, given parents' median income, we would expect the students to earn a bit more than is actually observed (i.e., observed \(y_i\) is less than \(\hat{y}_i\)).
Fit the model
In the code chunk below, fit a linear regression model to predict children's median income using parents' median income across the colleges. Call the model fitted_model. Request the tidy() output and the glance() output. Once fitted, with your neighbor, interpret this output, including the estimates for the intercept and the slope for median parent income, as well as the \(R^2\) and sigma.
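The steps above can be sketched as follows. This is a minimal example under stated assumptions: it uses the broom package for tidy() and glance(), and it fits the model to a made-up data frame, toy_mobility, so the estimates it prints will not match the values discussed in this tutorial. The names coef_table and fit_summary are hypothetical and used only for illustration.

```r
library(broom)

# Made-up illustrative values, NOT the Opportunity Insights data.
toy_mobility <- data.frame(
  name       = c("College A", "College B", "College C", "College D", "College E"),
  par_median = c(40000, 65000, 90000, 115000, 140000),
  k_median   = c(24000, 35000, 38000, 52000, 55000)
)

# Regress child median income on parent median income.
fitted_model <- lm(k_median ~ par_median, data = toy_mobility)

# One row per coefficient: term, estimate, std.error, statistic, p.value.
coef_table <- tidy(fitted_model)
coef_table

# One-row model summary; r.squared and sigma are the columns of interest here.
fit_summary <- glance(fitted_model)
fit_summary
```

With the real data, the same calls would be made with data = college_mobility.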
Intercept (\(b_0\)): The intercept estimate is 10,945. This value represents the predicted median income of the students in their mid-30s if the parent median income were zero. While not realistic in this context, as no colleges have a median parent income of zero, it serves as a baseline reference point for the model.
Slope (\(b_1\)): The slope estimate is 0.334. This value suggests that for every additional dollar of parent median income, the median income of the children in their mid-30s is expected to increase by approximately 33 cents. This positive relationship indicates that higher median parental income is associated with higher child median income.
R-Squared: The \(R^2\) value is 0.548, meaning that approximately 54.8% of the variability in median income of students attending the college/university can be explained by the parent median income. This suggests a moderate to strong relationship, indicating that the median income of the parents is a robust predictor of the median income of the children.
Sigma: The sigma value, or residual standard error, is 8,639. This value is roughly the typical distance, in dollars, between the observed children's median income and the median income predicted by the model. A smaller sigma would indicate a closer fit of the model to the data, while a larger sigma suggests more variation around the predicted values.
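As a compact restatement of the four quantities above, and assuming the usual least-squares definitions for a simple linear regression (with \(x_i\) the parent median income and \(y_i\) the child median income at college \(i\), \(\bar{y}\) the mean of the \(y_i\), and \(n\) the number of colleges used in the fit), the fitted line and the two fit summaries can be written as:

\[
\hat{y}_i = b_0 + b_1 x_i \approx 10{,}945 + 0.334\,x_i
\]

\[
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \approx 0.548,
\qquad
\hat{\sigma} = \sqrt{\frac{\sum_i (y_i - \hat{y}_i)^2}{n - 2}} \approx 8{,}639
\]

For example, two colleges whose parent median incomes differ by $10,000 have predicted child median incomes that differ by about 0.334 × 10,000 = $3,340.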
Compute the fitted values and residuals
Use the augment() function to compute the fitted values and residuals for all colleges. Add these fitted values and residuals to the college_mobility data frame. Sort the data frame in descending order by residual.
Then print out the first five rows of the data frame, as well as the bottom five rows of the data frame. Also print the row for CSU.
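One way these steps might look is sketched below. It assumes the broom and dplyr packages and, so that it runs on its own, refits the model on a made-up data frame, toy_mobility, instead of college_mobility. Because the toy data has only five rows, it prints the top and bottom two rows instead of five. The exact spelling of CSU's name in the real data is not shown in this text, so the final filter() uses a toy college name as a stand-in.

```r
library(broom)
library(dplyr)

# Made-up illustrative values, NOT the Opportunity Insights data.
toy_mobility <- data.frame(
  name       = c("College A", "College B", "College C", "College D", "College E"),
  par_median = c(40000, 65000, 90000, 115000, 140000),
  k_median   = c(24000, 35000, 38000, 52000, 55000)
)

fitted_model <- lm(k_median ~ par_median, data = toy_mobility)

# augment() appends .fitted and .resid (plus other diagnostics) to the data;
# arrange(desc(.resid)) puts the largest positive residuals first.
toy_mobility <- augment(fitted_model, data = toy_mobility) %>%
  arrange(desc(.resid))

slice_head(toy_mobility, n = 2)            # colleges furthest above the line
slice_tail(toy_mobility, n = 2)            # colleges furthest below the line
filter(toy_mobility, name == "College C")  # a single college of interest
```

With the real data, n = 5 would reproduce the first and last five rows requested above.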
With your neighbor, describe what you observe in the output.
The residuals in this context represent the difference between the actual median income of children (k_median) and the predicted income (\(\hat{y}_i\) or .fitted value) based on the median parent income (par_median). Specifically, each residual value indicates how much a college's observed outcome (median child income) deviates from the expected outcome according to the fitted model:
Positive Residuals: For colleges with positive residuals (e.g., Saint Louis College Of Pharmacy), children are earning significantly more than expected based on parent income, suggesting higher upward mobility.
Negative Residuals: For colleges with negative residuals (e.g., Landmark College), children are earning less than predicted, suggesting downward mobility.
Interpretations for CSU
For Colorado State University (CSU), the residual of -3702 indicates that CSU students have a median income that is lower (by $3,702) than what the model predicts it should be based on parent median income.
Looking at the specific values for CSU:
Parent Median Income (par_median): $115,400. This is the median income of parents of CSU students, placing them relatively high in terms of socioeconomic background.
Observed Child Median Income (k_median): $45,800. This is the actual median income observed among CSU students.
Predicted Child Income (.fitted): $49,502. Based on the model, the expected median income for CSU students, given the parent median income, would be around $49,502.
The Residual (.resid): -$3,702 (calculated as k_median - .fitted = 45800 - 49502). This indicates that CSU students earn about $3,702 less than the model's prediction. This gap suggests that, compared to similar institutions, CSU students earn less than would be expected based solely on their parents' median income level.
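As a rough check of the fitted value, the rounded coefficients reported earlier (intercept 10,945 and slope 0.334) can be plugged in by hand. Because those coefficients are rounded, the result is only expected to land near the .fitted value of 49,502, not exactly on it:

\[
\hat{y}_{\text{CSU}} \approx 10{,}945 + 0.334 \times 115{,}400 = 10{,}945 + 38{,}543.6 = 49{,}488.6
\]

\[
e_{\text{CSU}} = y_{\text{CSU}} - \hat{y}_{\text{CSU}} = 45{,}800 - 49{,}502 = -3{,}702
\]

The difference of about $13 between the hand calculation and 49,502 is within what rounding the slope to three decimals can produce (up to roughly 0.0005 × 115,400 ≈ $58).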
Dig further
In case you want to explore a bit more, here's the whole data frame with the .fitted and .resid scores from the fitted model.
Check out a cool graphic series
If these data are of interest to you, please take a look at this interactive graphic series from the New York Times that uses the Chetty data. In addition to exploring income mobility, it also uses the data to explore economic segregation at US colleges and universities. It's super interesting work!