Presenting my OKCupid Project April 2021

My review of the project:
I learned a lot working on this project. I appreciate that it is the first time in the course that we work with a dataset of this size. I found the project challenging because there is so much you can possibly do and you are not following guided instructions. You can get lost, lose focus, and end up with something that is not cohesive. Having finished the Computer Science Path, and now being 90% through the Data Scientist Path, I felt this was the hardest (but rewarding) thing I have ever done on Codecademy.

Time estimate:
It took me around 7 weeks to complete this. Why? I lacked experience working with a dataset this large, and along the way I kept learning things that are not covered in the existing Career Path. Those things are:

  1. Converting a Pandas DataFrame to a SciPy compressed sparse row matrix, and dealing with sparse data in Pandas in general
  2. Interactive data visualization using Plotly
  3. Interactivity in Jupyter Notebooks through the use of Jupyter Widgets (ipywidgets)
  4. How to make a Donut Plot using matplotlib
  5. How to make wordclouds
  6. How to share a Jupyter Notebook online using nbviewer, because neither the notebook nor its HTML export would render on GitHub
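For anyone unfamiliar with item 1, here is a minimal sketch of the idea. It is not the code from my notebook: the toy DataFrame and its column names are made up for illustration, and it assumes pandas and SciPy are installed.

```python
import pandas as pd
from scipy import sparse

# Hypothetical toy data; the real OKCupid columns and values differ.
df = pd.DataFrame({
    "age": [25, 31, 42],
    "drinks": ["socially", "rarely", "socially"],
})

# One-hot encode the categorical column into sparse-backed columns.
dummies = pd.get_dummies(df["drinks"], sparse=True, dtype="float64")

# The .sparse accessor gives a SciPy COO matrix, which converts to CSR.
drinks_csr = dummies.sparse.to_coo().tocsr()

# Dense numeric columns can be wrapped as CSR and stacked alongside.
age_csr = sparse.csr_matrix(df[["age"]].to_numpy(dtype="float64"))
features = sparse.hstack([age_csr, drinks_csr], format="csr")

print(features.shape)
```

The CSR result can be passed directly to most scikit-learn estimators, which avoids building the full dense matrix in memory.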

You will see examples and code for all of these things in my project notebook. The sections involving ipywidgets need to be run in a Jupyter Notebook in order to load. In its current state, the notebook takes 10-20 minutes to run all cells on my 10-year-old computer with an Intel(R) Core™ i5-3210M.
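As a quick taste of item 4 in the list above, a donut plot in matplotlib is just a pie chart whose wedges are given a width smaller than the radius. This is a minimal sketch with made-up labels and counts, not the plot from my notebook.

```python
import matplotlib.pyplot as plt

# Made-up category counts, purely for illustration.
labels = ["single", "seeing someone", "available"]
counts = [70, 20, 10]

fig, ax = plt.subplots()

# width=0.4 keeps only the outer ring of each wedge, leaving a hole in the middle.
ax.pie(
    counts,
    labels=labels,
    startangle=90,
    wedgeprops={"width": 0.4, "edgecolor": "white"},
)
ax.set_aspect("equal")  # keep the ring circular
ax.set_title("Relationship status (toy data)")
```

In a Jupyter Notebook the figure renders inline when the cell finishes; in a script you would call `fig.savefig(...)` or `plt.show()`.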

Here is my project code repository
Here is the notebook displayed in nbviewer

Feedback that I would appreciate:

  • The project in its current form is not exactly friendly to non-data scientists. Any tips on that?
  • Critique of the Machine Learning Model’s parameters, feature selection, etc.
  • More ways to build on the NLP analysis
  • Feedback on whether everything loads properly when you view the notebook in the links
  • Overall project review, thoughts, comments, and feedback in general

After some time, I will revisit this to retouch and revise, and to create a version that is easily presentable to a non-data scientist.

After reviewing the portfolio project kanban provided by Codecademy, here are some things the project still lacks in its current form, despite its already considerable length:

  • Robust Model evaluation metrics
  • Model comparison charts
  • Presentation Slideshow

Will work on the above features and update when done.

For everyone reading, this is what I would do if I could go back to the start of this project:

I suggest you don’t follow what I did, which was visualizing the whole dataset and adding interaction first before making ML models. Note that this project is specifically positioned in the career path right after learning supervised and unsupervised machine learning. It’s not supposed to be visualization first with machine learning as an afterthought, but rather machine learning first with visualization as an afterthought.

Go straight to choosing a single multi-class or binary label column whose values you will try to predict with a machine learning classifier. Create multiple ML models with the same purpose of predicting classification labels. Present a variety of ML evaluation metrics and explain why you focus on those particular metrics. Show summary comparisons of the performance between your models, including the time it took to prepare and train each one. Draw conclusions.
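To make that workflow concrete, here is a rough sketch of what such a comparison loop could look like. It is not taken from my notebook: it assumes scikit-learn is installed, uses a synthetic dataset from `make_classification` in place of the OKCupid data, and the three classifiers and two metrics are only example choices.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and the chosen label column.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Several models with the same purpose: predicting the same label.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=5),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

results = []
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    preds = model.predict(X_test)
    results.append({
        "model": name,
        "accuracy": accuracy_score(y_test, preds),
        "macro_f1": f1_score(y_test, preds, average="macro"),
        "fit_seconds": fit_seconds,
    })

# Summary comparison, best macro F1 first.
for row in sorted(results, key=lambda r: r["macro_f1"], reverse=True):
    print(f"{row['model']:<22} acc={row['accuracy']:.3f} "
          f"f1={row['macro_f1']:.3f} fit={row['fit_seconds']:.4f}s")
```

Macro F1 is included next to accuracy because it weights every class equally, which matters when the label column is imbalanced; which metrics to emphasize depends on the label you choose.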

Only then can you decide if you have enough time to go ahead and do other fancy stuff with this project. (Hopefully you have the time; no shame if you don’t.)

Hi Careershifter,
Thank you so much for the detailed feedback, we really appreciate it! I’m looking into these issues and how we can address some of them in future revisions of the course.
Best of luck as you evolve this project - going through it to the end has definitely set you up to be successful in your journey.
All best,
Michelle

Just want to make sure this is not being misinterpreted:

I should have said “my take on” the project. I am referring to my own work, not the way Codecademy set up the project.
Could I be allowed to edit the post one more time so that I can fix the wording?

No worries at all, and thank you!! We don’t take it personally, and are just grateful for such detailed and attentive feedback!

I have met the objectives I outlined previously.

Updated my repository with the following:

Any feedback and comments would be appreciated!

Updated the projects to be part of my portfolio.

Here are the new links:
