My review of the project:
I learned a lot working on this project. I appreciate that it is the first time in the course that we are exposed to a dataset of this size. I found the project challenging because there is so much you can possibly do and you are not following guided instructions. You can get lost, lose focus, and end up with something that is not cohesive. Having finished the Computer Science Path, and now being 90% through the Data Scientist Path, I feel this was the hardest (but most rewarding) thing I have ever done on Codecademy.
It took me around 7 weeks to complete this. Why? I lacked experience working with a dataset this large, and along the way I kept learning things that are not covered in the existing Career Path. Those things are:
- Converting a Pandas DataFrame to a SciPy compressed sparse row matrix, and dealing with sparse data in Pandas in general
- Interactive data visualization using Plotly
- Interactivity in Jupyter Notebooks through the use of Jupyter Widgets (ipywidgets)
- How to make a Donut Plot using matplotlib
- How to make wordclouds
- How to share a Jupyter Notebook online using nbviewer, because neither the notebook nor its HTML export will render on GitHub
You will see examples and code for all of these things in my project notebook. The sections involving ipywidgets need to be run in a live Jupyter Notebook in order to load. In its current state, the notebook takes 10-20 minutes to run all cells on my 10-year-old computer with an Intel(R) Core™ i5-3210M.
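To give an idea of the first item in the list above, here is a minimal sketch of moving between a Pandas DataFrame and a SciPy compressed sparse row (CSR) matrix. The toy data and column names are made up for illustration, and this is not the code from the project notebook.

```python
import pandas as pd
from scipy import sparse

# Hypothetical toy data: mostly zeros, like a bag-of-words or one-hot table
df = pd.DataFrame(
    {"a": [0.0, 1.0, 0.0], "b": [0.0, 0.0, 2.0], "c": [3.0, 0.0, 0.0]}
)

# Option 1: go through a dense NumPy array (fine for small frames)
csr = sparse.csr_matrix(df.to_numpy())

# Option 2: store the columns as sparse in Pandas first, then convert
sdf = df.astype(pd.SparseDtype("float64", 0.0))
csr_from_sparse = sdf.sparse.to_coo().tocsr()

# Fraction of values that are actually stored (non-fill) in the sparse frame
density = sdf.sparse.density

print(csr.shape, csr.nnz, density)
```

Option 2 avoids building the full dense array, which may matter once the DataFrame has many mostly-empty columns.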
Feedback that I would appreciate:
- The project in its current form is not exactly friendly to non-data scientists. Any tips on that?
- Critique of the Machine Learning Model’s parameters, feature selection, etc.
- More ways to build on the NLP analysis
- Feedback on whether everything loads properly when you view the notebook in the links
- Overall project review, thoughts, comments, and feedback in general
After some time, I will revisit this to retouch and revise, and to create a version that is easily presentable to a non-data scientist.