A webR tutorial
Which Colleges Produce the Most Upward Mobility?

Background
In a 2017 study, Dr. Raj Chetty and colleagues analyzed intergenerational income mobility at 2,202 U.S. colleges using data from over 30 million students. They calculated the median parental income for students attending college in the early 2000s and the median income of these students at age 34. Parental income was defined as the average pre-tax household income when the child was 15-19, adjusted to 2015 dollars. Children’s income was based on individual earnings in 2014. In the data frame, each row of data represents one of the 2,202 colleges.
We’ll explore Chetty and colleagues’ data in this webR activity.
Import the data
We’ll use data provided by the authors, downloaded from the Opportunity Insights data repository. We’ll focus on the following key variables within the data frame:
The name of the college.
The median income of parents (as described above and called par_median).
The median income of children at age 34 (called k_median).
Press Run Code on the code chunk below to import the data.
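For reference, the import step looks something like the sketch below. The file name is only a placeholder (the tutorial’s code chunk points at the actual data file for you).

```r
# A minimal sketch of the import step -- the file name here is hypothetical
library(tidyverse)

college_mobility <- read_csv("college_mobility.csv")
```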
Visualize the data
When tackling a new analysis, data visualization is always the best place to start. Please create a scatter plot of the data, mapping parents’ median income to the x-axis and children’s median income to the y-axis (i.e., fill in the instances of “___” with the variable names in the code chunk below). Then press Run Code to see the result.
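If you get stuck, the completed chunk might look something like this (the point and line layers are reasonable guesses, not necessarily the exact template used in the chunk):

```r
# One way to fill in the blanks for the scatter plot
ggplot(college_mobility, aes(x = par_median, y = k_median)) +
  geom_point(alpha = 0.3) +               # one point per college
  geom_smooth(method = "lm", se = FALSE)  # the best fit line
```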
Interpret the plot
With your neighbor, construct a description of this plot. What does the best fit line tell us? How should we interpret data points that are above the best fit line? How about data points that are below the best fit line? What does the data point for CSU tell us? After you discuss, click on the Interpretations tab to check your interpretations.
Positive Correlation: The regression line shows a positive slope, indicating that higher median parent income is associated with higher median child income across colleges.
Variation Around the Line: Points above the line represent colleges where children earn more than expected based on parent median income, suggesting upward mobility. Conversely, points below the line suggest downward mobility: students are not reaching the expected income level given parent median income.
What’s the story with CSU?: Based on these data and this model, CSU students don’t fare quite as well as we would expect. That is, given parent’s median income, we would expect the students to earn a bit more than is actually observed (i.e., observed \(y_i\) is less than \(\hat{y}_i\)).
Fit the model
In the code chunk below, fit a linear regression model to predict child’s income using parent’s income across the colleges. Call the model fitted_model. Request the tidy() output and the glance() output. Once fitted, with your neighbor, interpret this output, including the estimates for the intercept and the slope for median parent income, as well as the \(R^2\) and sigma. Click the Interpretations tab to check your answers.
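For reference, here is one way the completed chunk might look (a sketch assuming the data frame is named college_mobility, as it is later in the activity):

```r
# A sketch of fitting the model and requesting the summaries
library(broom)

fitted_model <- lm(k_median ~ par_median, data = college_mobility)

tidy(fitted_model)    # coefficient table: intercept and slope
glance(fitted_model)  # model-level summaries, including R-squared and sigma
```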
Replace the ___ instances with the correct information.
Intercept (\(b_0\)): The intercept estimate is approximately $10,945. This value represents the predicted median income of the students in their mid-30s if the parent median income were zero. While not realistic in this context, as no colleges have a median parent income of zero, it serves as a baseline reference point for the model.
Slope (\(b_1\)): The slope estimate is 0.334. This value suggests that for every additional dollar of parent median income, the median income of the children in their mid-30s is expected to increase by approximately 33 cents. This positive relationship indicates that higher median parental income is associated with higher child median income. To put this in more meaningful terms, a $10,000 increase in parental median income is associated with an expected increase of approximately $3,340 in children’s median income (0.334 × $10,000).
R-Squared: The \(R^2\) value is 0.548, meaning that approximately 54.8% of the variability in median income of students attending the college/university can be explained by the parent median income. This suggests a moderate to strong relationship, indicating that the median income of the parents is a robust predictor of the median income of the children.
Sigma: The sigma value (\(\sigma\)), also called the residual standard error or the residual standard deviation, is approximately 8639. This value represents the typical distance between the observed children’s median income and the median income predicted by the model, measured in dollars. A smaller sigma would indicate a closer fit of the model to the data, while a larger sigma suggests more variation around the predicted values. In the very unlikely case where our predictor perfectly predicted the outcome, then sigma would be 0 (and \(R^2\) would be 1).
If you don’t like the scientific notation, you can add a mutate() step that requests the estimate be rounded to a certain number of decimal places (e.g., 3):
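```r
fitted_model |>
  tidy() |>
  select(term, estimate) |> # select just the items we need for now
  mutate(estimate = round(estimate, 3)) # specify rounding
```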
Compute the fitted values and residuals
Let’s use the augment() function to compute the fitted values and residuals for all colleges.
- The fitted values (\(\hat{y}\)) represent the predicted child income (median for the college) based on the model. These are called .fitted in the output.
- The residuals (\(e\)) represent the difference between the observed child income and the predicted value (observed - predicted, or, \(e = y - \hat{y}\)). These are called .resid in the output.
In the code chunk below, we’ll add these fitted values and residuals to the college_mobility data frame, then we’ll sort the data frame in descending order by residual. Last, we’ll print out the first five rows of the data frame, as well as the bottom five rows of the data frame. And, for interest, we’ll print the row for CSU.
Press Run Code to accomplish this task.
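The chunk does something along these lines (a sketch; the college-name column is assumed to be called name, and the exact label for CSU in the data may differ):

```r
# Add fitted values and residuals, then sort by residual (largest first)
college_mobility <- augment(fitted_model, data = college_mobility) |>
  arrange(desc(.resid))

slice_head(college_mobility, n = 5)   # five largest positive residuals
slice_tail(college_mobility, n = 5)   # five largest negative residuals

# The row for CSU (assumes the name column contains "Colorado State")
filter(college_mobility, str_detect(name, "Colorado State"))
```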
With your neighbor, describe what you observe in the output.
The residuals in this context represent the difference between the actual median income of children (k_median) and the predicted income (\(\hat{y}_i\) or .fitted value) based on the median parent income (par_median). Specifically, each residual value indicates how much a college’s observed outcome (median child income) deviates from the expected outcome according to the fitted model:
Positive Residuals: For colleges with positive residuals (e.g., Saint Louis College Of Pharmacy), children are earning significantly more than expected based on parent income, suggesting higher upward mobility.
Negative Residuals: For colleges with negative residuals (e.g., Landmark College), children are earning less than predicted, suggesting downward mobility.
Interpretations for CSU: For Colorado State University (CSU), the residual of -3702 indicates that CSU students have a median income that is lower (by $3,702) than what the model predicts it should be based on parent median income.
Looking at the specific values for CSU:
Parent Median Income (par_median): $115,400. This is the median income of parents of CSU students, placing them relatively high in terms of socioeconomic background.
Observed Child Median Income (k_median): $45,800. This is the actual median income observed among CSU students.
Predicted Child Income (.fitted): $49,502. Based on the model, the expected median income for CSU students, given the parent median income, would be around $49,502.
The Residual (.resid): -$3,702 (calculated as k_median - .fitted = 45800 - 49502). This indicates that CSU students earn about $3,702 less than the model’s prediction. This gap suggests that, compared to similar institutions, CSU students earn less than would be expected based solely on their parents’ median income level.
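As a quick check, you can reproduce CSU’s fitted value by hand from the rounded coefficients reported above; the small gap between this hand calculation and the reported .fitted value is just rounding in the coefficients.

```r
# Reproducing CSU's fitted value from the rounded coefficients
10945 + 0.334 * 115400   # about 49489, close to the reported .fitted of 49502

# And the residual
45800 - 49502            # -3702
```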
Explore standardized variables
When we run a regression with the original income variables, the slope depends on the units of measurement (dollars). But what if we standardize both parent and child income into z-scores? This removes the units and places both variables on the same scale—revealing some elegant mathematical relationships. Let’s check it out.
First, click Run Code below to create standardized scores (z-scores) for both the predictor and outcome, then visualize the relationship.
Compute z-scores and plot
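One way to carry out this step is sketched below; the *_z column names are illustrative choices, not necessarily the ones used in the chunk.

```r
# Create z-scores for the predictor and the outcome
college_mobility <- college_mobility |>
  mutate(
    par_median_z = (par_median - mean(par_median)) / sd(par_median),
    k_median_z   = (k_median - mean(k_median)) / sd(k_median)
  )

# Plot the standardized variables
ggplot(college_mobility, aes(x = par_median_z, y = k_median_z)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE)
```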
Fit regression with z-scores and compare to correlation
Now click Run Code to:
- Fit a simple linear regression where z-scored child income is regressed on z-scored parent income.
- Calculate the correlation between the original variables.
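The completed chunk might look like this sketch (same column-name assumptions as above):

```r
# Regress z-scored child income on z-scored parent income
fitted_model_z <- lm(k_median_z ~ par_median_z, data = college_mobility)

tidy(fitted_model_z)    # intercept near 0; slope equals the correlation
glance(fitted_model_z)  # same R-squared as before; sigma is now on the z-score scale

# Correlation between the original (unstandardized) variables
cor(college_mobility$par_median, college_mobility$k_median)
```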
What just happened?
The intercept becomes 0, but this isn’t magic — it’s a property of all simple linear regressions. The regression line always passes through the point (mean X, mean Y). When we standardize, both means are zero, so the intercept must be zero. Standardization just makes this visible.
The slope becomes the correlation coefficient (i.e., 0.7402807). When you standardize both variables, the slope sheds its original units and shows you the pure strength and direction of the relationship.
\(R^2\) and correlation are shown to be two sides of the same coin. See that \(R^2\) = 0.548. Now square the correlation: 0.7402807² = 0.548. Same number!
The residuals tell a story too. Notice that sigma = 0.672. Since we standardized child income (giving it a variance of 1), we can account for where that variance goes: 0.548 is explained by the relationship with parent income (that’s the \(R^2\)), so 0.452 remains unexplained (i.e., 1 – 0.548). Since standard deviation equals the square root of variance, the residual standard deviation (sigma) can also be computed as sqrt(0.452) ≈ 0.672.
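You can verify these relationships directly with a little arithmetic in the console:

```r
# Checking the arithmetic from the text
0.7402807^2       # about 0.548, the R-squared
sqrt(1 - 0.548)   # about 0.672, the residual SD of the z-scored outcome
```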
Why should you care? Think about the difference this makes. With raw variables, a slope of 0.334 (about 33 cents per dollar of parent income) gives the magnitude of the effect, but it can be hard to judge how sizable that effect is. Is that a strong relationship? How does it compare to, say, the relationship between child income and college GPA? You can’t easily tell because the units are different.
With standardized variables, a slope of 0.740 tells you immediately: for every standard deviation increase in parent income, child income goes up by 0.740 standard deviations. Now you can compare across completely different contexts. A correlation of 0.740 in economics means the same strength of relationship as a correlation of 0.740 in psychology; both describe how tightly the points cluster around the regression line.
The big insight? Standardization strips away the (sometimes arbitrary) units and reveals the underlying geometry of the relationship. It’s like converting different currencies to a common standard: suddenly you can compare apples to apples.
Want to explore more?
In case you want to explore a bit more, here’s the whole data frame with the .fitted and .resid scores from the fitted model.
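One way to display it (again assuming the college-name column is called name):

```r
# The full data frame with fitted values and residuals, sorted by residual
college_mobility |>
  select(name, par_median, k_median, .fitted, .resid) |>
  arrange(desc(.resid))
```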
Check out a cool graphic series
If these data are of interest to you, please take a look at this interactive graphic series from the New York Times that uses the Chetty data. In addition to exploring income mobility, it also uses the data to explore economic segregation at US colleges and universities. It’s super interesting work!