Data Science - Ok Cupid - Date-A-Scientist Portfolio Project - Feedback Request


I have spent about twenty-five hours putting this together, with at least another twenty waiting for things to compile. I built some rather computationally intense programs for my MacBook to handle.

This project was challenging. That was not because I was poorly prepared; the course did a fantastic job of preparing me for this work. The challenge was that I did a poor job of data exploration at the start of the project, which led to a difficult finish. I chose to predict incomes from all of the data without realizing that 80% of the responses didn’t have real income values. They had an income of -1, which was misleading because I didn’t see any NaN values. I did come across a great bit of code that does scatter and histogram plots that I’ll be using for future analysis. I ended up with a very small data set because I had to drop all the -1’s, or 80% of my data. I didn’t use an average of the income or some other filler value because that would have heavily skewed my results.
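For anyone who hits the same problem, here is a minimal sketch of the check I wish I had run first. It assumes pandas and an `income` column that uses -1 as a placeholder; the small DataFrame is made-up toy data, not the real profiles.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the profiles data (values are invented for illustration).
df = pd.DataFrame({
    "age": [22, 35, 29, 41, 27],
    "income": [-1, 50000, -1, -1, 80000],
})

# isna() does not flag a placeholder like -1, so count it explicitly.
nan_share = df["income"].isna().mean()
sentinel_share = (df["income"] == -1).mean()
print(f"NaN share: {nan_share:.0%}, -1 share: {sentinel_share:.0%}")

# Turn the placeholder into a real missing value, then keep reported incomes only.
df["income"] = df["income"].replace(-1, np.nan)
reported = df.dropna(subset=["income"])
print(f"Rows with a reported income: {len(reported)} of {len(df)}")
```

Running `df["income"].value_counts()` early on would also have surfaced the -1 spike right away.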

I also found that putting together a correlation matrix helped me make sense of what I was seeing. I didn’t originally make one as part of my data analysis. I was getting low R^2 scores and couldn’t make sense of them. When I saw that almost everything was correlated below 0.3, I realized I was going to have a hard time coming up with any predictive information.
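A rough sketch of the kind of correlation check described above, assuming pandas and numeric columns. The column names other than `income` and all the values are hypothetical toy data, so the numbers it prints say nothing about the real data set.

```python
import pandas as pd

# Toy numeric data (invented values; "age" and "height" are assumed column names).
df = pd.DataFrame({
    "age": [22, 35, 29, 41, 27, 33],
    "height": [65, 70, 68, 72, 66, 69],
    "income": [20000, 50000, 40000, 90000, 30000, 60000],
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

# Rank the other features by the strength of their correlation with income.
income_corr = corr["income"].drop("income").abs().sort_values(ascending=False)
print(income_corr)

# Features below the 0.3 threshold mentioned above.
weak = income_corr[income_corr < 0.3]
print(f"{len(weak)} of {len(income_corr)} features are below 0.3")
```

Keep in mind that this only measures linear, one-feature-at-a-time relationships, so a low value here does not rule out every kind of signal.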

I could have spent more time looking for smaller subsets of the data to improve the score. I did this to a certain extent and showed a 30% improvement over my original scores. However, I never got into the 0.7 or 0.8 range, where I could have a higher degree of confidence in the results.
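For reference, this is a small sketch of how an R^2 score can be computed by hand, assuming NumPy. The helper `r_squared` and the data are made up for illustration and are not taken from my project code.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

# Invented, nearly linear toy data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a straight line and score its predictions.
slope, intercept = np.polyfit(x, y, 1)
score = r_squared(y, slope * x + intercept)

# Always predicting the mean gives the baseline score of 0.
baseline = r_squared(y, np.full_like(y, y.mean()))
print(f"fitted line R^2: {score:.3f}, mean-only baseline R^2: {baseline:.3f}")
```

A score is only comparable across subsets if it is measured on held-out rows from the same subset, since each subset has its own total variation.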

I would love some feedback on my approach or other ways I might have increased my predictive score through subsets, another formula, or a better approach to shaping the data.

My code can be found here: