OKCupid Capstone Project

It took ages to do because I wanted the write-up to be just right, but I finally finished my OKCupid Capstone Project for the Data Science career path!!! I’m incredibly proud and excited. Check it out!

Part One: Analyzing the Data

Part Two: Using Naive Bayes and Random Forests to create a “Gaydar”


Nice work! I’m also exploring prediction of queerness in this project, and had a lot of fun seeing your approach/success. It was really interesting to see the resampling technique! I used imblearn.under/over_sampling to adjust for my data imbalance and found it to have little effect. I’ll have to try resampling, maybe with better luck!

Also, the website looks stellar–a super helpful example of how to display your work for someone like me who is just finishing up the Data Science Career Course this week (and panicking about how to present myself to this industry).

All around great work–thanks for sharing!


Here’s my feedback on the Data Analysis Blog Post. It is praise mixed with constructive criticism.

  • You definitely have a Front-End engineer’s eye for making the article look inviting, despite the fact that I see the article is long when I look at the vertical scrollbar on the right. Instead of thinking “This is going to be a long article that is dense and dry”, because of the initial aesthetic which hijacks my brain I end up thinking “this is going to be a nice long article full of interesting stuff which will keep me engaged”.
  • I definitely noticed the consistent color palette throughout matching the LGBT flag. Gave a sense of coherence and thematic effect.
  • I love your “language”. Nice use of commas and separation of clauses, concise sentences, varied vocabulary, consistent tenses and perspective. You hold up as an academic in linguistics.
  • Definitely appreciate the complete labelling and titles on charts and plots
  • Some statements of fact don’t have bibliographic reference or inline citation. Like this statement: “while biphobia runs rampant regardless of gender, male bisexuality is seen as less socially acceptable”. However I must say that most of your statements do have references and external links, and the ones that don’t are the rare case. It is actually very very nice to see that you are contrasting the OKCupid data with other academic literature! It helps legitimize what we are seeing in the data and provides some causality on top of the correlation.
  • You flesh out facts and findings with relevant social contexts to allow proper nuanced interpretation. Super!
  • “The median age was 30, 27, and 30 years for straight, bi, and gay users respectively”. I just want to point out that the median age if we look at all users is also 30. It is the bi users which seem to come from a different distribution in terms of age. You might find this interesting to incorporate or consider.
  • Regarding the comparison with US census data: it is visually appealing and digestible, and the point is communicated. I only have one suggestion for improvement. It would be great if the two tables, the US census distribution and the OKCupid distribution, were merged into a single one where we can see the numbers side by side. A single table with 5 columns: [‘Race’, ‘US Population’, ‘OKC Straight’, ‘OKC Bi’, ‘OKC Gay’]
  • The stratified clustered bar plots titled “ethnicities listed”, “highest education completed”, “Career Listed”, and “Median Salary by Field”: I am struggling with the legend colors. With the exception of the “white” ethnicity, I’m finding it hard to identify which bars represent straight, bi, and gay.
  • The conclusion: in addition to the closing paragraph, I would like to see the specific observations made throughout the article restated in a TL;DR way (“Bi people did more drugs, gays more likely to complete a PhD”, etc.).
  • You might want to add a link to the part 2 blog at the bottom of the conclusion so that readers can seamlessly jump into the next part.

To make things easier to read and respond to, I will put my feedback on the machine learning blog post and on the Jupyter notebook in separate posts.

Here’s my feedback on the Machine Learning Blog Post

  • You use the pandas map() function to convert the category labels into integers. This can also be accomplished using pandas.Categorical.codes
  • “The speaks column wasn’t terrible, either. I had considered breaking it down by language, but decided it would be more efficient to just count the number of languages spoken by each user”. If you want to easily break it down by language you may use pandas.Series.str.get_dummies
  • Regarding the way you split ethnicity. You can also accomplish this using pandas.Series.str.get_dummies. I noticed that using get dummies seems to take less time than using lambda-apply (Don’t take my word for it, try it yourself).
  • Regarding the plot “Blank Essays on OKCupid”. Perhaps exchange the x and y axes? So that the essay tick labels will be horizontal for easier reading.
  • Nice to see that you tested your model on the OKCupid data of your friends
  • Nice to see that the feature selection for the model is explained
  • The part where you used the predicted labels of your NB classifier as training features of a Random Forest model: I wonder if the model performance would improve if you also added the NB classifier’s confidence in its prediction? For example, you would have two things the random forest considers, [‘nb_prediction’, ‘nb_prediction_confidence’], instead of just [‘nb_prediction’].
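
For anyone who wants to try the pandas suggestions above, here is a minimal sketch. The tiny DataFrame and its column values are made up for illustration (the real `speaks` strings are messier), and the names `orientation_code`, `language_dummies`, and `n_languages` are my own, not from the blog post.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the OKCupid data.
df = pd.DataFrame({
    "orientation": ["straight", "gay", "bisexual", "straight"],
    "speaks": ["english, spanish", "english", "english, french, spanish", "french"],
})

# Integer codes for the category labels, without a hand-written map() dictionary.
# Categories are sorted alphabetically by default: bisexual=0, gay=1, straight=2.
df["orientation_code"] = pd.Categorical(df["orientation"]).codes

# One indicator column per language; sep must match the delimiter in the raw strings.
language_dummies = df["speaks"].str.get_dummies(sep=", ")

# The "number of languages spoken" feature is then just a row sum.
df["n_languages"] = language_dummies.sum(axis=1)

print(df[["orientation", "orientation_code", "n_languages"]])
print(language_dummies)
```

The same `str.get_dummies` call should work for a comma-separated ethnicity column, so you get both the per-category breakdown and the count from one step.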

Learning Highlights from this Project that are extracurricular with respect to the career path:

  • The use of resampling, a bootstrapping technique, which can improve performance when working with imbalanced datasets. Imbalanced means that the distribution of category labels is not uniform. Resampling here is a form of data augmentation where the smaller category is sampled with replacement until its size matches the larger one.
  • The nesting of an ML model in another ML model. The predicted classification labels of a Naive Bayes classifier were used as one of the training features, alongside other features, in a Random Forest model. This would certainly be a way to consider unstructured as well as structured data at the same time.
  • In the Jupyter Notebook, and not in the blog post, there are nested donut plots which look great.
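
To make the first two highlights concrete, here is a rough sketch of how I understand them, combined with my `nb_prediction_confidence` suggestion. It is not the code from the blog post: the toy essays, the `age` feature, the labels, and all variable names are invented, and I am assuming scikit-learn's `resample`, `MultinomialNB`, and `RandomForestClassifier` as stand-ins for whatever the author actually used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.utils import resample

# Hypothetical toy data: essay text (unstructured), age (structured), binary label.
essays = np.array([
    "hiking dogs coffee weekends",
    "football friends road trips",
    "travel cooking movies music",
    "finance sushi gym running",
    "books tea gardening cats",
    "climbing camping photography",
    "theater poetry museums art",
    "art museums concerts poetry",
])
age = np.array([30, 31, 29, 35, 28, 33, 27, 26])
label = np.array([0, 0, 0, 0, 0, 0, 1, 1])  # 1 marks the smaller category

# Resampling: draw minority rows with replacement until they match the majority.
majority = np.flatnonzero(label == 0)
minority = np.flatnonzero(label == 1)
upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=0)
idx = np.concatenate([majority, upsampled])
essays_bal, age_bal, label_bal = essays[idx], age[idx], label[idx]

# Inner model: Naive Bayes on bag-of-words counts from the essays.
nb = make_pipeline(CountVectorizer(), MultinomialNB())
nb.fit(essays_bal, label_bal)
nb_prediction = nb.predict(essays_bal)
# Confidence = probability the NB model assigns to the class it picked.
nb_prediction_confidence = nb.predict_proba(essays_bal).max(axis=1)

# Outer model: Random Forest on the structured feature plus both NB outputs.
X_forest = np.column_stack([age_bal, nb_prediction, nb_prediction_confidence])
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_forest, label_bal)
```

One caveat with this sketch: the NB features are computed on the same rows the NB model was trained on, which makes them look better than they would on new data. In a real project it is safer to resample only the training split and to generate the NB features out of fold (for example with `cross_val_predict`).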