OkCupid Date-a-Scientist

Here we are! This is my last capstone project before the final one.
I see the finish line and I’m excited to get there.

But before tackling the last part of this Data Scientist path, a few things about this date-a-scientist project:

  • I liked the lessons on NLP; it was something new for me, so I decided to explore it further. I focused my work on spaCy models and spent some time understanding how spaCy works: it was a good exercise.
  • Nice dataset. Its size adds some complexity to the analysis; in fact, I had to discard some algorithms. I selected a model that is fast to train because that makes it easier to experiment with.
  • It took me 12 to 16 hours to complete the project (from notebook draft to Medium story).

I know this dataset could be explored more deeply than I did.
Still, I'm satisfied: my goal was to test a specific idea and see if it gave good results, and I think I was able to gather some useful insights.

Feedback is welcome.

Happy reading!

Link to the repo at the end of the story.


Here’s my feedback on the blog post:

  • “Looking at the far right of the distribution plot, it seems that we have some people over 100 years old.” The data points over 100 years old are barely noticeable visually on the distribution plot.
  • “We will complete this step in Word Vectors and Age Labels.” The link reloads the page rather than taking me to the “Word Vectors and Age Labels” section.
  • Very nice to see careful deliberation and explanation of choosing a proper evaluation metric, which in this case was the weighted f1 score.
  • Easy to follow along with the discussion. Great writing and explanation skills. Could benefit from correction of typos.
  • In the part where you prepare a confusion matrix, instead of doing a crosstab you could also use:
from sklearn.metrics import confusion_matrix
confusion_matrix(test_labels, predictions)
  • In addition to what you have mentioned in next steps, resampling or some other data augmentation can also be suggested to address the imbalance and artificially increase the sample size of the ‘old’ category. See the Gaydar project for implementation of resampling.
  • Very good for not simply declaring that because the accuracy is 85% the model has learned a lot, beyond just the distribution of the categories. You established a baseline by using the Zero Rule which put the score into proper context.
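The Zero Rule baseline mentioned above can be sketched with scikit-learn's DummyClassifier. This is a minimal illustration on toy data (the labels and class proportions here are hypothetical stand-ins, not the project's actual age groups):

```python
from sklearn.dummy import DummyClassifier

# Toy imbalanced data: hypothetical stand-in for the project's age categories
X = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
y = ["young"] * 8 + ["old"] * 2

# strategy="most_frequent" implements the Zero Rule: always predict the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

print(baseline.score(X, y))  # → 0.8, the accuracy any real model must beat
```

A real model scoring 85% on data with this kind of imbalance has only learned a little beyond the class distribution itself, which is why the baseline puts the score into context.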

Project Highlights:

  • Use of a pre-trained NLP model from the spaCy package.
  • Use of the scikit-learn Pipeline object, which lets you easily chain multiple transformation and model-training steps.
  • Use of scikit-learn's PCA for dimensionality reduction.
  • Use of grid-search cross-validation (GridSearchCV) with scikit-learn, which lets the computer discover on its own which parameters of your models and transforms give the best score, according to your chosen evaluation metric. The project uses it to tune the parameters of PCA and logistic regression.
  • Use of repeated stratified k-fold cross-validation, which assesses your model by comparing performance across different subsets of your data used for training and validation.
  • Use of scikit-learn's DummyClassifier to establish a baseline against your model's performance, to see whether the model is better than just guessing based on the existing class distribution.
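The Pipeline, PCA, and GridSearchCV highlights above can be combined in a few lines. This is a sketch on a built-in dataset, not the project's actual setup; the parameter grid values are hypothetical:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)

# Chain PCA and logistic regression so both steps are tuned together
pipe = Pipeline([
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: "stepname__parameter" addresses each stage of the pipeline
grid = GridSearchCV(
    pipe,
    param_grid={"pca__n_components": [16, 32], "clf__C": [0.1, 1.0]},
    scoring="f1_weighted",  # the same metric the project chose
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```

Nesting the transform inside the pipeline also prevents data leakage: PCA is re-fitted on each training fold rather than on the full dataset.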

Hi @careershifter ,
thank you for your precious feedback, I really appreciate it.
I’d like to add some details to enrich what you highlighted.

You’re totally right, and this is exactly what happened!
I spotted them by chance (compared to Medium, the blue line is a bit brighter in Jupyter Notebook… at least on my monitor) and then investigated further with display(profiles[profiles.age > 100]).

Nice catch!
This is the article I used to learn how to link to a paragraph in Medium.
After reading your comment I discovered that it’s better to write the link differently, especially if you are referencing a paragraph on the same site you are on. I solved the issue by using:
See the extra / before the paragraph ID #114e?
Without this additional slash, some browsers will reload the page without scrolling to the selected paragraph, as you experienced (I tested this behavior on Safari and Firefox). Chrome works even without the extra /.

I prefer crosstab because it outputs a pandas DataFrame.
It’s sooooo satisfying to see the “automagical” Jupyter-to-Medium paste render a nice-looking table directly in your Medium story.
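For comparison, here is what the crosstab version looks like, with hypothetical labels and predictions standing in for the project's test_labels and predictions:

```python
import pandas as pd

# Hypothetical stand-ins for the project's test_labels and predictions
test_labels = pd.Series(["young", "young", "old", "old", "young"], name="Actual")
predictions = pd.Series(["young", "old", "old", "old", "young"], name="Predicted")

# Unlike sklearn's confusion_matrix (a bare ndarray), crosstab
# returns a DataFrame with labeled rows and columns
cm = pd.crosstab(test_labels, predictions)
print(cm)
```

The named axes ("Actual" vs "Predicted") come straight from the Series names, so the table is self-documenting when pasted into the story.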

In this regard, I addressed the problem by adopting stratified k-fold cross-validation (RepeatedStratifiedKFold in sklearn), which is a resampling technique, as suggested here and here.
Bootstrapping could give different results; I will add it to the conclusions.
Thank you for suggesting it!
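The RepeatedStratifiedKFold setup I mentioned can be sketched like this (on a built-in dataset rather than the project's data; the split and repeat counts here are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Each of the 3 repeats reshuffles the data into 5 stratified folds,
# so every fold preserves the overall class proportions
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)

scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1_weighted"
)
print(scores.mean(), scores.std())  # 5 splits x 3 repeats = 15 scores
```

Averaging over the repeats smooths out the luck of any single shuffle, which matters more when one class (like 'old') is rare.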

For your info, an example with drawbacks of over/undersampling is available here.