Date-A-Scientist - Poor ML Scores

My code can be found here on GitHub:

I am working on analyzing the data, and have spent a good twenty hours on the project. What I am trying to develop is a regression model that predicts a person’s income based on the other provided information in the data set.

However, I have tried several approaches to get a better result, to no avail. I scaled the data to prevent one set of large numbers, like income, from biasing the results. I also removed columns from the data set that do not have income information.
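As a rough illustration of the scaling and filtering described above, here is a minimal sketch. It assumes scikit-learn and uses made-up data: the `age` and `height` features, the sample size, and the idea that missing incomes are stored as -1 are all assumptions for the example, not details taken from my notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical features, not the real data set)
n = 200
age = rng.integers(18, 70, size=n)
height = rng.normal(68, 4, size=n)
income = 1000 * age + rng.normal(0, 5000, size=n)
income[rng.random(n) < 0.5] = -1  # assumed placeholder for "no income given"

X = np.column_stack([age, height])

# Drop the rows without a reported income before fitting
mask = income != -1
X, y = X[mask], income[mask]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Putting the scaler in a pipeline fits it on the training split only,
# and only the features are scaled, not the income target
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
model.fit(X_train, y_train)

score = model.score(X_test, y_test)  # R^2 on the held-out rows
print(f"R^2 on test split: {score:.3f}")
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training, which is one thing worth ruling out when scores look odd.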

I also fed different combinations of features into the model to “trim the forest,” but no combination drove the score close to one.

Also, I am very confused as to why all of my attempts with the KNN model score at or below zero. I would have expected at least one combination to go above zero, but everything appears to be negatively correlated.
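One thing that may matter here, assuming the scores come from scikit-learn's `.score()` on a regressor: that method returns R², which is not a correlation. R² is 0 when the model does no better than always predicting the mean of the target, and it goes negative when the model does worse than that. The toy numbers below are invented purely to show the behaviour.

```python
from sklearn.metrics import r2_score

# Toy target values (invented for illustration)
y_true = [1, 2, 3, 4]

# Always predicting the mean of y_true gives R^2 of exactly 0
mean_pred = [2.5, 2.5, 2.5, 2.5]
r2_mean = r2_score(y_true, mean_pred)

# Predictions that are worse than the mean give a negative R^2
bad_pred = [4, 3, 2, 1]
r2_bad = r2_score(y_true, bad_pred)

print(r2_mean)  # 0.0
print(r2_bad)   # -3.0
```

So a negative score reads as "worse than guessing the average income for everyone" rather than "negatively correlated".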

I’m tempted to post the project as a way to show a null result. Basically, an ML regression model that can predict the values couldn’t be built. Is there something major I am missing in my approach that is preventing a better score?

I did think of doing part three to try a classifier approach. It wouldn’t necessarily accomplish what I set out to do, which is predicting values, but I could try to classify which income bracket people fall into with the model. That may be a way to at least build something that has a certain level of predictive value.
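A minimal sketch of how the income values could be turned into brackets for a classifier, assuming pandas. The bracket edges, the label names, and the sample incomes are all hypothetical and would need to be chosen from the real distribution.

```python
import pandas as pd

# Made-up reported incomes (rows without an income already removed)
incomes = pd.Series([20000, 35000, 50000, 80000, 150000, 1000000])

# Hypothetical bracket edges; each bin includes its right edge
bins = [0, 30000, 60000, 100000, float("inf")]
labels = ["low", "mid", "high", "very_high"]

# Map each income to its bracket label
brackets = pd.cut(incomes, bins=bins, labels=labels)

print(brackets.value_counts())
```

The resulting `brackets` column could then serve as the target for a classifier instead of the raw income value.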

Any suggestions to improve the regression models would be great!

I love my life,


I’m not quite sure. I’m not an expert, but if you use .describe() or create a histogram of the ‘income’ column, you will see that the majority of users have an income of -1. That doesn’t make sense as a real value. There might be a reason for it; for example, they might not have wanted to reveal their real income. I don’t think we can predict income well with regression given that data set.
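Something like the following shows what I mean. This is only a sketch with a tiny made-up DataFrame; the only name taken from the real data is the ‘income’ column, and the values are invented.

```python
import pandas as pd

# Hypothetical stand-in for the real data; values are made up
df = pd.DataFrame(
    {"income": [-1, -1, -1, 20000, -1, 50000, -1, -1, 80000, -1]}
)

# Summary statistics: a min of -1 and low quartiles hint at the placeholder
print(df["income"].describe())

# Fraction of rows where income is the -1 placeholder
missing_share = (df["income"] == -1).mean()
print(f"share of rows with income == -1: {missing_share:.0%}")

# Rows that actually report an income
reported = df[df["income"] != -1]
print(len(reported), "rows with a reported income")
```

If most rows end up in the -1 group, there are far fewer usable rows for a regression than the data set size suggests.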

By the way, I didn’t select this topic for my ML model. At first, I chose to predict religion, but I changed my mind and followed the solution to predict zodiac signs (in my own way). However, the models I created cannot score better than 0.3. Still, I learned a lot from this project even though it doesn’t give a good result.