Hey all,

I’m getting to the end of the course on R and really struggling with the results I’m getting from this linear regression model.

The lesson is at:

https://www.codecademy.com/paths/analyze-data-with-r/tracks/introduction-to-machine-learning-in-r/modules/linear-regression-in-r/projects/predicting-income-psid

The data seems to suggest that income should increase as years of education increase. However, with the code below, both the model I’ve created and the linear model I plot with ggplot show no correlation: the line just stays flat at the bottom of the plot. I’m at a bit of a loss as to where I’ve gone wrong.

Thanks for any help!

```
---
title: "Predicting Income with Social Data"
output: html_notebook
---
```{r message=FALSE, warning=FALSE}
# load packages and data
library(ggplot2)
library(dplyr)
library(modelr)
psid <- read.csv("psid_2017.csv")
```

```
# view data structure
str(psid)
# plot age
plot_age <- psid %>%
  ggplot(aes(age)) +
  geom_bar()
plot_age
# filter to reasonable age group
psid_clean <- psid %>% filter(between(age, 18, 75))
# plot filtered age
plot_age_clean <- psid_clean %>%
  ggplot(aes(age)) +
  geom_bar()
plot_age_clean
# plot education
edu_boxplot <- psid_clean %>%
  ggplot(aes(education_years)) +
  geom_boxplot()
edu_boxplot
# filter to reasonable education levels
psid_clean <- psid_clean %>% filter(between(education_years, 5, 25))
# plot income
labor_boxplot <- psid_clean %>%
  ggplot(aes(labor_income)) +
  geom_boxplot()
labor_boxplot
# view income summary statistics
summary(psid_clean$labor_income)
# plot mean income by age
mean_income_by_age <- psid_clean %>%
  group_by(age) %>%
  summarise(mean_income = mean(labor_income)) %>%
  ggplot(aes(age, mean_income)) +
  geom_point()
# view plot
mean_income_by_age
# subset data points into train and test sets
set.seed(123)
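# logical row mask: roughly 60% TRUE (train), 40% FALSE (test);
# named in_train so it does not shadow base::sample()
in_train <- sample(c(TRUE, FALSE), nrow(psid_clean), replace = TRUE, prob = c(0.6, 0.4))
# define train and test
train <- psid_clean[in_train, ]
test <- psid_clean[!in_train, ]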
# build model
model <- lm(labor_income ~ education_years, data = train)
# plot against LOESS model
plot_lm <- train %>%
  ggplot(aes(education_years, labor_income)) +
  geom_point() +
  geom_smooth(method = "lm") +          # linear fit
  geom_smooth(se = FALSE, color = "red") # default smoother, no confidence band
# view plot
plot_lm
# compute r-squared
r_sq <- summary(model)$r.squared * 100
# write out r-squared interpretation (rounded to two decimals)
sprintf("Based on a simple linear regression model, we have determined that %.2f percent of the variation in respondent income can be predicted by a respondent's education level.", r_sq)
```
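
In case it helps narrow things down, here is a small sketch of one thing I could check: whether a handful of extreme incomes is what squashes the fit. It runs on made-up data, not the PSID file. The simulated slope, the five injected outliers, and the 99th percentile cutoff are all my own assumptions for illustration, so I'm not claiming this is the cause in the real data.

```r
# Simulated data (assumption: NOT the PSID file). Income rises with education.
set.seed(42)
n <- 500
education_years <- sample(8:20, n, replace = TRUE)
labor_income <- 3000 * education_years + rnorm(n, mean = 0, sd = 15000)

# Inject a few hypothetical extreme earners to mimic heavy outliers
labor_income[1:5] <- labor_income[1:5] + 5e6
sim <- data.frame(education_years, labor_income)

# Fit on everything, outliers included
fit_all <- lm(labor_income ~ education_years, data = sim)

# Fit again after dropping incomes above the 99th percentile (arbitrary cutoff)
cutoff <- quantile(sim$labor_income, 0.99)
sim_trim <- sim[sim$labor_income <= cutoff, ]
fit_trim <- lm(labor_income ~ education_years, data = sim_trim)

# Compare slope and R-squared between the two fits
slope_all <- unname(coef(fit_all)["education_years"])
slope_trim <- unname(coef(fit_trim)["education_years"])
print(c(all = slope_all, trimmed = slope_trim))
print(c(all = summary(fit_all)$r.squared, trimmed = summary(fit_trim)$r.squared))
```

If the real data behaves like this, the line in the plot would only look flat because the y-axis is stretched by the largest incomes, and the comparison of the two R-squared values would show how much the outliers hide the relationship.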