A webR tutorial
College upward mobility sample size exploration
Background
In a 2017 study, Dr. Raj Chetty and colleagues analyzed intergenerational income mobility at 2,202 U.S. colleges using data from over 30 million students. They calculated the median parental income for students attending college in the early 2000s and the median income of these students at age 34. Parental income was defined as the average pre-tax household income when the child was 15-19, adjusted to 2015 dollars. Children’s income was based on individual earnings in 2014, ranked within their birth cohort, as were parents’ incomes. In the data frame, each row of data represents one of the 2,202 colleges.
We’ll explore Chetty and colleagues’ data in this webR activity.
Import the data
We’ll use data provided by the authors, downloaded from the Opportunity Insights data repository. We’ll focus on the following key variables within the data frame:
The name of the college.
The median income of parents (as described above and called par_median).
The median income of children/students at age 34 (called k_median).
Press Run Code on the code chunk below to import the data.
Goal of this activity
In the lecture slides, we computed the regression parameter estimates and their associated uncertainty estimates (e.g., the standard error and 95% Confidence Interval) for the intercept and slope of the fitted model, with the sample size set at 500. Now, we’ll explore how these estimates change for different sample sizes and different levels of confidence.
A function to do the work
The function defined in the code chunk below simulates the process of sampling from the population (i.e., the 2,202 colleges and universities) and fitting a regression model to describe the relationship between parent median income and child median income. Here’s a broad overview of what the function does (you don’t need to understand exactly what’s happening in the function code):
Draws Multiple Samples: The function takes many random samples (resamples) from the full population data frame, each with a user-defined sample size. This allows us to observe how estimates vary from one sample to another due to random chance.
Fits a Regression Model: For each sample, it fits a linear regression model to estimate the relationship between two variables: median parental income and median child income for colleges. This provides an intercept and slope for each sample. In this example, parent median income is centered at the mean in the population so that the intercept represents the predicted child median income when parent median income is $77,695.
Calculates Summary Statistics: After gathering estimates from all samples, the function summarizes them by calculating the average estimate, the standard error, and a confidence interval (at the user-defined confidence level) for both the intercept and slope. These summaries help illustrate the distribution of estimates across samples.
Displays Results: Finally, the function prints out the summary results, allowing us to see how the intercept and slope estimates vary depending on sample size and confidence level.
Press Run Code on the code chunk below to define the function.
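For readers curious about the mechanics, the steps above can be sketched roughly as follows. This is a minimal illustration, not the function used in this activity. It runs on simulated data rather than the Opportunity Insights file. The function name simulate_estimates, the n_resamples argument, sampling without replacement, and the percentile-based interval are all assumptions made for the sketch.

```r
# Simulated stand-in for the population of 2,202 colleges (NOT the real data).
set.seed(123)
n_colleges <- 2202
par_median <- round(rlnorm(n_colleges, meanlog = log(75000), sdlog = 0.4))
k_median <- round(15000 + 0.3 * par_median + rnorm(n_colleges, sd = 6000))
colleges <- data.frame(par_median = par_median, k_median = k_median)

# Hypothetical helper: draw many samples, fit a model to each, summarize.
simulate_estimates <- function(data, sample_size, confidence_level,
                               n_resamples = 1000) {
  # Center parental income at the population mean so the intercept is the
  # predicted child income at the average parental income.
  data$par_centered <- data$par_median - mean(data$par_median)

  # One row of (intercept, slope) per random sample of colleges.
  estimates <- t(replicate(n_resamples, {
    rows <- sample(nrow(data), size = sample_size)  # without replacement
    coef(lm(k_median ~ par_centered, data = data[rows, ]))
  }))

  # Assumption: percentile interval taken across the sampled estimates.
  alpha <- 1 - confidence_level
  data.frame(
    term = c("intercept", "slope"),
    estimate = colMeans(estimates),
    se = apply(estimates, 2, sd),
    CI_lower = apply(estimates, 2, quantile, probs = alpha / 2),
    CI_upper = apply(estimates, 2, quantile, probs = 1 - alpha / 2),
    row.names = NULL
  )
}

results <- simulate_estimates(colleges, sample_size = 500,
                              confidence_level = 0.95)
print(results)
```

Here the standard error is simply the standard deviation of the estimates across samples, which is one common way to summarize sampling variability in a simulation like this.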
Employ the function
Now, please press Run Code to use the function to examine the regression intercept and slope, along with their associated uncertainty estimates: the standard error (labeled se) and the lower and upper bounds of the Confidence Interval (labeled CI_lower and CI_upper). This run reproduces the example we studied in the lecture slide deck, with a sample size of 500 colleges and a confidence level of 0.95.
Explore
Now, it’s your turn to explore. In the code chunk below, choose different values for sample_size and confidence_level to see how the estimates change.
For example, you might explore:
What happens if you leave sample_size at 500 but change the confidence_level to 0.99 (for a 99% Confidence Interval)?
What happens if you leave confidence_level at 0.95 but change the sample_size to 50 (i.e., randomly select 50 colleges rather than 500)?
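As a back-of-the-envelope companion to these two questions, the short sketch below uses the usual large-sample normal approximation. That approximation is an assumption here, and the function in this activity may compute its interval differently. Under it, an interval's half-width is a critical value times the standard error, and the standard error shrinks roughly with the square root of the sample size.

```r
# Critical values under the normal approximation for two confidence levels.
confidence_levels <- c(0.95, 0.99)
z <- qnorm(1 - (1 - confidence_levels) / 2)
print(round(z, 2))  # 1.96 2.58

# Rough ratio of standard errors when the sample shrinks from 500 to 50
# colleges (ignores the finite size of the population).
se_ratio <- sqrt(500 / 50)
print(round(se_ratio, 2))  # 3.16
```

Compare these rough figures with what you see when you change sample_size and confidence_level yourself. Small differences are expected because each run relies on random samples.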
Explore on your own, develop your intuition, and then share your thoughts with a neighbor about the influence of sample size and confidence level on the resulting estimates.