OkCupid Date-A-Scientist project

Hello everyone!

I am uploading my Date-A-Scientist project. Any feedback or suggestions would be much appreciated, as it may be my final project on the Data Science skill path.

Thanks so much in advance for any answer!

I’m going through your project and here is my feedback:

  • You state that “This file contains information from the profiles registered at the Ok-Cupid dating app from mid-2016 to mid-2017”. However, the dataset is actually from mid-2011 to 2012.
  • Your code that filters out the essay columns, columns = [c for c in df.columns if c[:5] != 'essay'], can also be written with the pandas string method df.columns.str.startswith('essay').
  • You have a comment “The essays columns were not useful for machine learning modules, so I proceeded to take them out of the dataframe”. This is not true: in previous lessons in the course we trained a Naive Bayes classifier on text data like the essays. Additionally, the Codecademy project portfolio page lists Natural Language Processing as a recommended prerequisite for this project.
  • In the evaluation of your drug logistic regression model, you show the value counts of the labels the model predicted. It would be nicer to see how many classifications it got right and wrong for each of those labels rather than just the value counts, or something similar to what you did later on using sklearn’s evalclassifier().
  • In your income prediction model, you use the terms ‘k’ and ‘neighbors’ even though the model is a Random Forest. The appropriate term here would be the number of ‘trees’ (the n_estimators parameter in scikit-learn).
  • You often use the map() method to convert your category labels to integers. A more compact way is dataframe['c'].cat.codes, which works on a column with the category dtype. See docs
  • It’s really great to see that you were using scikit-learn functions such as LabelEncoder() and evalclassifer()!
  • In your conclusion, I have an issue with the language in the following statement: “It is not recommended to predict the income of the users with categorical information containing a lot of options, as the model would not have a high level of accuracy”. Just my opinion, but I think the methods you tried are not exhaustive enough to conclude that, in general, we cannot build a good prediction model for the income variable.
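
To make the two pandas suggestions above concrete, here is a minimal sketch. The toy dataframe and its column names are my own stand-ins, not your actual data, so adapt them as needed. Note that .cat.codes numbers the categories in sorted order by default and encodes missing values as -1.

```python
import pandas as pd

# Toy stand-in for the profiles table; column names are illustrative assumptions.
df = pd.DataFrame({
    "age": [22, 35, 29],
    "drugs": ["never", "sometimes", "never"],
    "essay0": ["a", "b", "c"],
    "essay1": ["d", "e", "f"],
})

# Boolean mask over the column index: True where the name starts with "essay".
is_essay = df.columns.str.startswith("essay")

# Keep everything except the essay columns (same result as the list comprehension).
non_essay_columns = df.columns[~is_essay].tolist()
essay_columns = df.columns[is_essay].tolist()

# Convert the labels to the category dtype, then read off the integer codes.
drug_codes = df["drugs"].astype("category").cat.codes

print(non_essay_columns)
print(drug_codes.tolist())
```

If the integer order matters (for example, "never" < "sometimes" < "often"), it is probably safer to pass an explicit ordered pd.CategoricalDtype rather than rely on the default sorted order.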
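
For the point about per-label evaluation, one option (my suggestion, not necessarily what the course expects) is scikit-learn's confusion_matrix together with classification_report. The labels and predictions below are made up purely for illustration.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and model predictions, for illustration only.
labels = ["never", "sometimes", "often"]
y_true = ["never", "never", "sometimes", "often", "sometimes", "never"]
y_pred = ["never", "sometimes", "sometimes", "never", "sometimes", "never"]

# Rows are true labels, columns are predicted labels, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# The diagonal counts correct predictions per true label;
# the rest of each row counts the mistakes for that label.
correct = cm.diagonal()
wrong = cm.sum(axis=1) - correct

for label, n_right, n_wrong in zip(labels, correct, wrong):
    print(f"{label}: {n_right} right, {n_wrong} wrong")

# Precision, recall and F1 per label; zero_division=0 avoids warnings
# for labels the model never predicted.
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```

This shows at a glance which labels the model never predicts, which a plain value count of predictions can hide.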
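
On the Random Forest terminology, here is a rough sketch of what tuning the number of trees could look like. It uses synthetic data from make_classification rather than the income column, and the tree counts are arbitrary values I picked for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data as a placeholder for the real features and income labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# For a random forest the hyperparameter being swept is the number of trees
# (n_estimators), not k or neighbors as in k-nearest neighbors.
scores = {}
for n_trees in [10, 50, 100]:
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(X_train, y_train)
    scores[n_trees] = forest.score(X_test, y_test)

for n_trees, accuracy in scores.items():
    print(f"{n_trees} trees: test accuracy {accuracy:.3f}")
```

Labelling the plot axis or loop variable as the number of trees keeps the write-up consistent with the model you actually used.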