Introduction To Machine Learning in R: Predicting Income with Social Data

Hey all,

I'm getting to the end of the R course and really struggling with the results I'm getting from this linear regression model.
The lesson is at
https://www.codecademy.com/paths/analyze-data-with-r/tracks/introduction-to-machine-learning-in-r/modules/linear-regression-in-r/projects/predicting-income-psid

The data seems to suggest that as education years increase, income should increase too. However, with the code below, both the model I've built and the linear fit plotted with ggplot show no correlation: the line just stays flat at the bottom of the plot. I'm at a bit of a loss as to where I've gone wrong.

Thanks for any help!

---
title: "Predicting Income with Social Data"
output: html_notebook
---

```{r message=FALSE, warning=FALSE}
# load packages and data
library(ggplot2)
library(dplyr)
library(modelr)
psid <- read.csv("psid_2017.csv")

# view data structure
str(psid)

# plot age
plot_age <- psid %>%
  ggplot(aes(age)) +
  geom_bar()
plot_age



# filter to reasonable age group
psid_clean <- psid %>% filter(between(age, 18, 75))

# plot filtered age
plot_age_clean <- psid_clean %>%
  ggplot(aes(age)) +
  geom_bar()
plot_age_clean


# plot education
edu_boxplot <- psid_clean %>% 
  ggplot(aes(education_years))+
  geom_boxplot()
edu_boxplot


# filter to reasonable education levels
psid_clean <- psid_clean %>% filter(between(education_years, 5, 25))


# plot income
labor_boxplot <- psid_clean %>% 
  ggplot(aes(labor_income))+
  geom_boxplot()
labor_boxplot


# view income summary statistics
summary(psid_clean$labor_income)

# plot mean income by age
mean_income_by_age <- psid_clean%>%
  group_by(age) %>%
  summarise(mean_income = mean(labor_income)) %>%
  ggplot(aes(age, mean_income)) +
  geom_point() 
 
# view plot
mean_income_by_age



# subset data points into train and test sets
set.seed(123)
sample <- sample(c(TRUE, FALSE), nrow(psid_clean), replace = TRUE, prob = c(0.6, 0.4))

# define train and test
train <- psid_clean[sample, ]
test <- psid_clean[!sample, ]


# build model
model <- lm(labor_income ~ education_years, data = train)


# plot against LOESS model
plot_lm <- train %>% ggplot(aes(education_years, labor_income)) +
  geom_point() + 
  geom_smooth(method = "lm") + 
  geom_smooth(se = F, color = "red")
 
# view plot
plot_lm 



# compute r-squared
r_sq <- summary(model)$r.squared * 100

# write out the r-squared interpretation
sprintf("Based on a simple linear regression model, we have determined that %s percent of the variation in respondent income can be predicted by a respondent's education level.", r_sq)
```

I just finished this activity and ran into the same problem. My education_years coefficient had a non-significant p-value (0.281) and a negative estimate (-41.178), which contradicts the hint under #19:

(Do education_years, age, and gender all have a significant impact on labor_income? All variables are highly significant, with a p-value < 0.01.)

The negative coefficient also seemed counterintuitive. I'm not sure if it was something I did or if the activity was set up incorrectly. Would appreciate it if anyone could clarify for us!
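
If you want to check your own numbers, here is a minimal sketch (assuming a fitted object called `model`, as in the code above) for pulling the estimate and p-value for `education_years` straight out of the summary table:

```r
# coefficient matrix: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(model)$coefficients

# estimate and p-value for the education_years term
coefs["education_years", "Estimate"]
coefs["education_years", "Pr(>|t|)"]
```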


I have the same problem. Slightly different figures but same bottom line: predictor (education_years) is non-significant.

```
Call:
lm(formula = labor_income ~ education_years + age + gender, data = train)

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     7359.145    578.715  12.716  < 2e-16 ***
education_years  -39.781     36.320  -1.095    0.273    
age              -95.562      5.545 -17.234  < 2e-16 ***
gender          -757.383    166.888  -4.538 5.74e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8214 on 9767 degrees of freedom
Multiple R-squared:  0.03248,   Adjusted R-squared:  0.03218 
F-statistic: 109.3 on 3 and 9767 DF,  p-value: < 2.2e-16
```

I have the exact same results (-39.781, etc.). Does anyone know the reason? I reviewed the data, and from what I see this could be a legitimate result given the data provided: there are many records with no college education but some income, and records with degrees but an income of 0.
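
A quick way to see that pattern is to tabulate the zero-income records by education level (a rough sketch, assuming the `psid_clean` data frame built in the original post, with dplyr loaded):

```r
# for each education level, count respondents and the share reporting zero labor income
psid_clean %>%
  group_by(education_years) %>%
  summarise(
    n = n(),
    zero_income = sum(labor_income == 0),
    share_zero = mean(labor_income == 0)
  )
```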

Had the exact same output, really thought I had my filters set wrong or something, but I guess it’s just the data…

The output was quite counterintuitive, I agree. I tried some things left and right, but the data still didn't make sense. Then I decided to filter out all the 0 incomes; this returned results that looked plausible.
If you think about it, it makes sense: if you include people with 0 income, you are actually predicting income and unemployment in one model; you would be better off analyzing those separately if you ask me.
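
In case it helps, this is roughly what that filter looks like (a sketch reusing the variable names from the original post; your exact numbers will depend on the rest of your cleaning steps):

```r
# keep only respondents who report some labor income
psid_earners <- psid_clean %>% filter(labor_income > 0)

# re-split into train/test and refit the simple model on earners only
set.seed(123)
sample <- sample(c(TRUE, FALSE), nrow(psid_earners), replace = TRUE, prob = c(0.6, 0.4))
train_earners <- psid_earners[sample, ]

model_earners <- lm(labor_income ~ education_years, data = train_earners)
summary(model_earners)
```

With the zero-income rows dropped, the model only describes how education relates to income among people who earn something, which matches the point above about separating income from unemployment.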
