Introduction To Machine Learning in R: Predicting Income with Social Data

Hey all,

I'm getting to the end of the R course and really struggling with the results I’m getting from this linear regression model.
The lesson is at:
https://www.codecademy.com/paths/analyze-data-with-r/tracks/introduction-to-machine-learning-in-r/modules/linear-regression-in-r/projects/predicting-income-psid

The data seems to suggest that income should increase with years of education. However, with the code below, both the model I’ve created and the linear fit plotted with ggplot show no correlation: the line just stays flat at the bottom of the plot. I’m at a bit of a loss as to where I’ve gone wrong.

Thanks for any help!

---
title: "Predicting Income with Social Data"
output: html_notebook
---

```{r message=FALSE, warning=FALSE}
# load packages and data
library(ggplot2)
library(dplyr)
library(modelr)
psid <- read.csv("psid_2017.csv")

# view data structure
str(psid)

# plot age
plot_age <- psid %>%
  ggplot(aes(age)) +
  geom_bar()
plot_age



# filter to reasonable age group
psid_clean <- psid %>% filter(between(age, 18, 75))

# plot filtered age
plot_age_clean <- psid_clean %>%
  ggplot(aes(age)) +
  geom_bar()
plot_age_clean


# plot education
edu_boxplot <- psid_clean %>% 
  ggplot(aes(education_years))+
  geom_boxplot()
edu_boxplot


# filter to reasonable education levels
psid_clean <- psid_clean %>% filter(between(education_years, 5, 25))


# plot income
labor_boxplot <- psid_clean %>% 
  ggplot(aes(labor_income))+
  geom_boxplot()
labor_boxplot


# view income summary statistics
summary(psid_clean$labor_income)

# plot mean income by age
mean_income_by_age <- psid_clean %>%
  group_by(age) %>%
  summarise(mean_income = mean(labor_income)) %>%
  ggplot(aes(age, mean_income)) +
  geom_point() 
 
# view plot
mean_income_by_age



# subset data points into train and test sets (roughly 60/40)
set.seed(123)
# named in_train rather than sample to avoid shadowing base::sample()
in_train <- sample(c(TRUE, FALSE), nrow(psid_clean), replace = TRUE, prob = c(0.6, 0.4))

# define train and test
train <- psid_clean[in_train, ]
test <- psid_clean[!in_train, ]


# build model
model <- lm(labor_income ~ education_years, data = train)


# plot linear fit (default blue line) against the default smoother (red line)
plot_lm <- train %>% ggplot(aes(education_years, labor_income)) +
  geom_point() +
  geom_smooth(method = "lm") +
  geom_smooth(se = FALSE, color = "red")
 
# view plot
plot_lm 
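
# A possible diagnostic, sketched under an unverified assumption: a few very
# large labor_income values may stretch the y-axis so far that the fitted line
# only looks flat. coord_cartesian() zooms the view without dropping any rows
# from the fit. The 95th-percentile cutoff is an arbitrary illustrative choice,
# and income_cap / plot_lm_zoom are names introduced here for illustration.
income_cap <- unname(quantile(train$labor_income, 0.95, na.rm = TRUE))
plot_lm_zoom <- plot_lm + coord_cartesian(ylim = c(0, income_cap))
plot_lm_zoom

# The fitted intercept and slope can also be read directly from the model
coef(model)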



# compute r-squared
r_sq <- summary(model)$r.squared * 100

# write out r-squared interpretation (rounded to two decimal places)
sprintf("Based on a simple linear regression model, we have determined that %.2f percent of the variation in respondent income can be predicted by a respondent's education level.", r_sq)