Introduction To Machine Learning in R: Predicting Income with Social Data

Hey all,

I’m getting to the end of the course on R and I’m really struggling with the results from this linear regression model.
The lesson is at

The data seems to suggest that income should increase with years of education. However, with the code below, both the model I’ve created and a linear fit plotted with ggplot show no correlation: the line just stays flat at the bottom of the plot. I’m at a bit of a loss as to where I’ve gone wrong.

Thanks for any help!

---
title: "Predicting Income with Social Data"
output: html_notebook
---

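```{r message=FALSE, warning=FALSE}
# load packages and data
library(dplyr)    # provides %>%, filter(), between(), group_by(), summarise()
library(ggplot2)  # provides ggplot() and the geoms
psid <- read.csv("psid_2017.csv")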

# view data structure
str(psid)

# plot age
plot_age <- psid %>%
  ggplot(aes(age)) +
  geom_bar()  # geom was cut off in the post; geom_bar() is an assumed stand-in
plot_age

# filter to reasonable age group
psid_clean <- psid %>% filter(between(age, 18, 75))

# plot filtered age
plot_age_clean <- psid_clean %>%
  ggplot(aes(age)) +
  geom_bar()  # geom was cut off in the post; geom_bar() is an assumed stand-in
plot_age_clean

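# plot education
edu_boxplot <- psid_clean %>%
  ggplot(aes(y = education_years)) +
  geom_boxplot()  # plot call was cut off in the post; assumed from the variable name
edu_boxplot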

# filter to reasonable education levels
psid_clean <- psid_clean %>% filter(between(education_years, 5, 25))

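# plot income
labor_boxplot <- psid_clean %>%
  ggplot(aes(y = labor_income)) +
  geom_boxplot()  # plot call was cut off in the post; assumed from the variable name
labor_boxplot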

# view income summary statistics
summary(psid_clean$labor_income)

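# plot mean income by age
mean_income_by_age <- psid_clean %>%
  group_by(age) %>%
  summarise(mean_income = mean(labor_income)) %>%
  ggplot(aes(age, mean_income)) +
  geom_point()  # geom was cut off in the post; geom_point() is an assumed stand-in
# view plot
mean_income_by_age
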
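# subset data points into train and test sets (roughly 60/40)
set.seed(123)  # arbitrary seed, added here as an assumption so the split is reproducible
train_rows <- sample(c(TRUE, FALSE), nrow(psid_clean), replace = TRUE, prob = c(0.6, 0.4))

# define train and test (train_rows avoids masking base::sample)
train <- psid_clean[train_rows, ]
test <- psid_clean[!train_rows, ]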

# build model
model <- lm(labor_income ~ education_years, data = train)

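# plot linear fit against LOESS model (red curve)
# note: geom_smooth() without a method uses loess only below 1,000 rows, gam otherwise
plot_lm <- train %>% ggplot(aes(education_years, labor_income)) +
  geom_point() +
  geom_smooth(method = "lm") +
  geom_smooth(se = FALSE, color = "red")
# view plot
plot_lm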

# compute r-squared
r_sq <- summary(model)$r.squared * 100

# write out r-squared interpretation (rounded to two decimals)
sprintf("Based on a simple linear regression model, we have determined that %.2f percent of the variation in respondent income can be predicted by a respondent's education level.", r_sq)
```

I just finished this activity and ran into the same problem. My education_years had a non-significant p-value (0.281) and a negative coefficient (-41.178), which contradicts the hint under #19:

(Do education_years, age, and gender all have a significant impact on labor_income? All variables are highly significant, with a p-value < 0.01.)

The negative coefficient also seemed counterintuitive and wrong. Not sure if it was something I did or if the activity was set up incorrectly. Would appreciate it if anyone could clarify for us!
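
For anyone wanting to check the hint against their own fit, here is a minimal sketch of how to pull the estimates and p-values out of a model summary. The `toy` data frame is simulated purely for illustration (it is not the PSID file, and its numbers say nothing about the real data); only `lm()`, `summary()` and the standard column names of the coefficient table are relied on.

```r
# Toy data (simulated, NOT the PSID file), just to show how to read the table
set.seed(1)
toy <- data.frame(
  education_years = sample(5:25, 200, replace = TRUE),
  age             = sample(18:75, 200, replace = TRUE)
)
# Simulated assumption: income rises by 1000 per education year, plus noise
toy$labor_income <- 1000 * toy$education_years + rnorm(200, sd = 5000)

toy_model <- lm(labor_income ~ education_years + age, data = toy)

# Coefficient table: one row per term, with estimate and p-value columns
coefs <- summary(toy_model)$coefficients
print(coefs[, c("Estimate", "Pr(>|t|)")])

# Flag terms that are significant at the 0.01 level mentioned in the hint
significant <- coefs[, "Pr(>|t|)"] < 0.01
print(significant)
```

Running the same two lines on your own `model` object (in place of `toy_model`) shows at a glance which terms pass the 0.01 threshold and what sign each estimate has.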


I have the same problem. Slightly different figures but same bottom line: predictor (education_years) is non-significant.

lm(formula = labor_income ~ education_years + age + gender, data = train)
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     7359.145    578.715  12.716  < 2e-16 ***
education_years  -39.781     36.320  -1.095    0.273    
age              -95.562      5.545 -17.234  < 2e-16 ***
gender          -757.383    166.888  -4.538 5.74e-06 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8214 on 9767 degrees of freedom
Multiple R-squared:  0.03248,   Adjusted R-squared:  0.03218 
F-statistic: 109.3 on 3 and 9767 DF,  p-value: < 2.2e-16

I have the exact same results (-39.781 etc.). Does anyone know the reason? I reviewed the data, and from what I see this could be a legitimate result given the data that was provided: I saw many records with no college education and some income, and records with degrees and an income of 0.
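
One way to check that impression, rather than eyeballing rows, is to count how many respondents report zero income overall and per education level. The sketch below uses a small made-up data frame (the values are hypothetical, not taken from psid_2017.csv) and base R only; on the real data you would swap in `psid_clean`.

```r
# Toy rows (made up for illustration, NOT taken from psid_2017.csv)
toy <- data.frame(
  education_years = c(10, 12, 12, 16, 16, 18, 20, 12),
  labor_income    = c(0, 25000, 0, 0, 60000, 0, 90000, 30000)
)

# Overall share of respondents reporting zero labor income
zero_share <- mean(toy$labor_income == 0)

# Share of zero incomes within each education level
zero_by_edu <- tapply(toy$labor_income == 0, toy$education_years, mean)

print(zero_share)
print(zero_by_edu)
```

If the zero share turns out to be large and spread across all education levels, that would be consistent with the flat fitted line described at the top of the thread.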

Had the exact same output, really thought I had my filters set wrong or something, but I guess it’s just the data…

The output was quite counterintuitive, I agree. I tried a few different things, but the data still didn’t make sense. Then I decided to filter out all the 0 incomes, and this returned results that looked plausible.
If you think about it, it makes sense. If you include people with 0 income, you are actually predicting income and unemployment in one model; you would be better off analyzing the two separately, if you ask me.
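
Here is a rough sketch of that idea on simulated data, assuming (purely for illustration) that about half of respondents have no labor income and that income among earners rises with education. None of these numbers come from the PSID file, and the `sim` data frame and its `has_income` column are hypothetical names. It fits the regression with zeros included, then again on earners only, and treats "has any income at all" as its own logistic model.

```r
set.seed(42)
n <- 2000
sim <- data.frame(education_years = sample(5:25, n, replace = TRUE))

# Simulated assumption: about half of respondents report no labor income
employed <- rbinom(n, 1, 0.5) == 1
sim$labor_income <- ifelse(
  employed,
  20000 + 2000 * sim$education_years + rnorm(n, sd = 5000),
  0
)

# Model 1: all rows, zeros included (the setup discussed in this thread)
fit_all <- lm(labor_income ~ education_years, data = sim)

# Model 2: only respondents with positive labor income
earners <- subset(sim, labor_income > 0)
fit_earners <- lm(labor_income ~ education_years, data = earners)

# Separate question: who has any income at all? (logistic regression)
sim$has_income <- as.integer(sim$labor_income > 0)
fit_any <- glm(has_income ~ education_years, data = sim, family = binomial)

# Compare the education slope with and without the zero-income rows
print(coef(fit_all)["education_years"])
print(coef(fit_earners)["education_years"])
```

In this simulation the zero-income rows only dilute the education slope; they do not reproduce the negative coefficient reported above, so treat it as a way to see the mechanism rather than an explanation of the exact PSID numbers. On the course data the equivalent filter would be something like `filter(labor_income > 0)` on `psid_clean` before splitting into train and test.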
