 # Introduction To Machine Learning in R: Predicting Income with Social Data

Hey all,

I’m getting to the end of the R course and really struggling with the results I’m getting from this linear regression model.
the lesson is at

The data seems to suggest that income should rise as years of education increase. However, with the code below, both the model I’ve created and the linear fit drawn with ggplot show no correlation: the line just stays flat at the bottom of the plot. I’m at a bit of a loss as to where I’ve gone wrong.

Thanks for any help!

``````
---
title: "Predicting Income with Social Data"
output: html_notebook
---

```{r message=FALSE, warning=FALSE}
library(ggplot2)
library(dplyr)
library(modelr)

``````
``````
# view data structure
str(psid)

# plot age
plot_age <- psid %>%
ggplot(aes(age)) +
geom_bar()
plot_age

# filter to reasonable age group
psid_clean <- psid %>% filter(between(age, 18, 75))

# plot filtered age
plot_age_clean <- psid_clean %>%
ggplot(aes(age)) +
geom_bar()
plot_age_clean

# plot education
edu_boxplot <- psid_clean %>%
ggplot(aes(education_years))+
geom_boxplot()
edu_boxplot

# filter to reasonable education levels
psid_clean <- psid_clean %>% filter(between(education_years, 5, 25))

# plot income
labor_boxplot <- psid_clean %>%
ggplot(aes(labor_income))+
geom_boxplot()
labor_boxplot

# view income summary statistics
summary(psid_clean$labor_income)

# plot mean income by age
mean_income_by_age <- psid_clean%>%
group_by(age) %>%
summarise(mean_income = mean(labor_income)) %>%
ggplot(aes(age, mean_income)) +
geom_point()

# view plot
mean_income_by_age

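# subset data points into train and test sets
set.seed(123)
# logical row index (about 60% TRUE); named train_rows to avoid reusing the name of base::sample()
train_rows <- sample(c(TRUE, FALSE), nrow(psid_clean), replace = TRUE, prob = c(0.6, 0.4))

# define train and test
train <- psid_clean[train_rows, ]
test <- psid_clean[!train_rows, ]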

# build model
model <- lm(labor_income ~ education_years, data = train)

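# plot linear fit (default blue) against a smoothed fit (red)
# note: geom_smooth() only defaults to LOESS below 1,000 rows; above that it uses a GAM
plot_lm <- train %>% ggplot(aes(education_years, labor_income)) +
geom_point() +
geom_smooth(method = "lm") +
geom_smooth(se = FALSE, color = "red")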

# view plot
plot_lm

# compute r-squared as a percentage
r_sq <- summary(model)$r.squared * 100

# write out r-squared interpretation, rounded to two decimals
sprintf("Based on a simple linear regression model, we have determined that %.2f percent of the variation in respondent income can be predicted by a respondent's education level.", r_sq)
``````
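
One way to narrow down a flat fitted line is to check how many incomes are exactly zero and how far out the largest values sit, since both can squash the line against the bottom of the plot. The sketch below uses made-up data, not the course's `psid` file: the data frame `toy`, the 40% zero-income share, and the single extreme earner are all assumptions for illustration. It fits the same kind of model twice, once on all rows and once on positive incomes only, so the two slopes can be compared. Nothing in the post confirms that this is the cause here.

```r
# Synthetic stand-in for the course data; every number below is made up.
set.seed(42)
n <- 2000
education_years <- sample(8:20, n, replace = TRUE)

# Income rises with education, plus noise, floored at zero.
labor_income <- pmax(5000 * education_years + rnorm(n, sd = 20000), 0)

# Assumption: about 40% of respondents report no labor income at all.
no_income <- runif(n) < 0.4
labor_income[no_income] <- 0

# Assumption: one extreme earner stretches the y-axis.
labor_income[1] <- 5e6

toy <- data.frame(education_years, labor_income)

# Share of zero-income rows and the overall spread.
zero_share <- mean(toy$labor_income == 0)
print(zero_share)
print(summary(toy$labor_income))

# Same model on all rows and on positive incomes only.
fit_all <- lm(labor_income ~ education_years, data = toy)
fit_pos <- lm(labor_income ~ education_years,
              data = subset(toy, labor_income > 0))

# Pull out the education slope from each fit for comparison.
slope_all <- coef(fit_all)[["education_years"]]
slope_pos <- coef(fit_pos)[["education_years"]]
print(c(all_rows = slope_all, positive_only = slope_pos))
```

If zero incomes or a few extreme values do dominate, `coord_cartesian(ylim = ...)` in ggplot2 zooms the plot without dropping rows, while `filter(labor_income > 0)` changes what the model is fitted on. Which of the two, if either, suits the course data is a judgment call.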

I just finished this activity and ran into the same problem. My `education_years` term had a non-significant p-value (0.281) and a negative coefficient (-41.178), which was wrong according to the hint under #19.

(**Do `education_years`, `age`, and `gender` all have a significant impact on `labor_income`?** All variables are highly significant, with a p-value `< 0.01`.)

The negative coefficient also seemed counterintuitive and wrong. Not sure if it was something I did or if the activity was set up incorrectly. Would appreciate it if anyone could clarify for us!
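
For comparing results against that hint, the estimate and p-value of each term can be read straight from the coefficient table of the model summary. This is a minimal sketch on made-up data: `toy`, its effect sizes, and the formula `labor_income ~ education_years + age + gender` are assumptions based on the variables the hint names, not the activity's actual data or code.

```r
# Synthetic data with the three predictors named in the hint (values are made up).
set.seed(1)
n <- 500
toy <- data.frame(
  education_years = sample(8:20, n, replace = TRUE),
  age = sample(18:75, n, replace = TRUE),
  gender = factor(sample(c("female", "male"), n, replace = TRUE))
)

# Assumed relationship: income depends on education and age, plus noise.
toy$labor_income <- 4000 * toy$education_years + 500 * toy$age +
  rnorm(n, sd = 15000)

model <- lm(labor_income ~ education_years + age + gender, data = toy)

# The coefficient table is a matrix with one row per term.
coefs <- summary(model)$coefficients
print(coefs[, c("Estimate", "Pr(>|t|)")])

# Estimate and p-value for a single term.
edu_estimate <- coefs["education_years", "Estimate"]
edu_p_value <- coefs["education_years", "Pr(>|t|)"]
print(c(estimate = edu_estimate, p_value = edu_p_value))
```

Running the same two lookups on the model from the activity shows whether the sign and significance of `education_years` change once the data are filtered differently.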
