Final Portfolio Project: Run to the Hills

michelealberti · May 16, 2021, 6:44pm

I’m so satisfied!
Submitting this post is like crossing the finish line of my run.

Some thoughts about this experience with Codecademy:
My journey on the Data Scientist Career path was fruitful.
In addition to completing projects (which are a great motivator) and improving my skills, I have built a simple SQLite database at work to automate some tasks.
I already knew Python and some of the packages I used before subscribing to Codecademy, but I first approached SQL during this course, and being able to put what I learned into practice immediately was a great satisfaction.

And now, ladies and gentlemen, it’s time to introduce my final project …

I was listening to Iron Maiden’s Run to the Hills when I had the idea for this project and I found it to be an appropriate title, given the main topic.
I’ll leave it to you to find out what it’s all about by reading the story on Medium.

I had to work with XML files, a new challenge that led me to develop a dedicated Python module to better handle the dataset and a notebook describing some details of this process.
This second notebook, which explains data cleaning and EDA, is not included in the Medium story but you will find it on GitHub.
I applied some domain knowledge I learned from my job to a different subject. It was engaging!
It took almost every evening for two weeks to complete this project. I spent most of the time converting, cleaning and exploring the dataset (approx. 60-70% of the overall time). The last days were dedicated to data visualization and storytelling.
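
For readers who have not handled this kind of export, here is a minimal sketch of streaming records out of an XML file with Python's standard library. It assumes the data is an Apple Health export (a later reply mentions a notebook named Import Apple Health) whose entries are `Record` elements with `type`, `sourceName`, `unit`, `startDate` and `value` attributes. The sample data and the `iter_records` helper are hypothetical and are not taken from the project's module.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample in the assumed Apple Health layout (not real project data).
SAMPLE_XML = """<HealthData>
  <Record type="HKQuantityTypeIdentifierDistanceWalkingRunning" sourceName="Watch" unit="km" startDate="2021-05-01 08:00:00 +0200" value="1.5"/>
  <Record type="HKQuantityTypeIdentifierDistanceWalkingRunning" sourceName="Phone" unit="km" startDate="2021-05-01 09:00:00 +0200" value="0.5"/>
  <Record type="HKQuantityTypeIdentifierStepCount" sourceName="Phone" unit="count" startDate="2021-05-01 09:00:00 +0200" value="700"/>
</HealthData>"""


def iter_records(source, record_type):
    """Yield one dict per <Record> of the requested type.

    iterparse reads the file incrementally, so a large export
    does not have to be loaded into memory in one go.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Record" and elem.get("type") == record_type:
            yield {
                "source": elem.get("sourceName"),
                "start": elem.get("startDate"),
                "unit": elem.get("unit"),
                "value": float(elem.get("value")),
            }
        # Release the element's content once it has been processed.
        elem.clear()


distance = list(
    iter_records(
        io.BytesIO(SAMPLE_XML.encode("utf-8")),
        "HKQuantityTypeIdentifierDistanceWalkingRunning",
    )
)
total_km = sum(record["value"] for record in distance)
print(len(distance), "records,", total_km, "km")
```

With a real export you would pass the path of the XML file instead of the in-memory sample; the resulting dicts can then be loaded into a DataFrame for cleaning.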

One last thing worth adding: I used a package that is not mentioned in the lessons.
I made friends with Plotly several months ago.
I use this tool extensively in this project, and I hope you will find the interactive figures as interesting as I do.

Feedback is welcome.

Happy reading!

CODECADEMY WALKTHROUGH: RUN TO THE HILLS
Link to repo at the end of the story

gavingro · May 18, 2021, 5:34pm

Hey, I’m still just a novice halfway through this path but I learned a lot from reading through your work!

Loved the concept.
Also, I’m definitely going to yoink your method of combining pipelines with a gridCV. I’ve definitely spent some time scratching my head after using both methods individually, and getting frustrated with the documentation and Stack Overflow over what felt like a very normal use case.
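
For anyone else scratching their head over the same thing, here is a generic sketch of the combination, assuming "gridCV" refers to scikit-learn's `GridSearchCV`. The project's own code is not shown in this thread, so the synthetic dataset, the step names `scale` and `clf`, and the parameter grid are all illustrative choices rather than the author's setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, only to make the sketch runnable.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The pipeline bundles preprocessing and the model into one estimator,
# so the scaler is re-fitted on the training folds of every CV split.
pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The key detail is the double underscore in the grid keys: it is how the search reaches into a named step of the pipeline.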

Thanks for writing up your process so clearly!

For curiosity’s sake, what makes you opt for plotly over seaborn?
I’d imagine making mini-widgets for the visualizations is a huge plus, but at the same time it looks like it takes a little more finessing to do what you want.

michelealberti · May 18, 2021, 9:21pm

Hi @gavingro ,
thank you for your feedback.

Well, mini-widgets are not easy to get working in both Jupyter and Medium with the same framework, especially because Medium requires you to host the interactive components on a separate service (this is how Medium embeds work).

The reason for my choice is that Plotly creates interactive HTML figures: pan, zoom, and the other controls are part of Plotly and are powered by its engine (I think it is written in JavaScript).
They work in notebooks (similar to the %matplotlib notebook magic command) and, with a bit more effort, in Medium too, which is what I was looking for.
To display the interactive features in my Medium story as well, I used chart-studio: this service hosts my plots so that I can embed them in all their interactive glory.
Last but not least, you can upload your figures directly from Jupyter with a few simple commands (if export_to_chart_studio in the import section is set to true, all figures are uploaded to chart-studio).

gavingro · May 18, 2021, 10:26pm

Really cool! Thanks for the extra info!

careershifter · May 19, 2021, 6:03am

Can I ask why some charts are static images while others are fully interactive? Is there a limitation with chart studio?

michelealberti · May 19, 2021, 8:36pm

Yes, there is a 500 kB limit for the free plan (see the end of this page for details).
The distribution plots were too big for my free subscription (I even tried changing the number of bins to reduce the size). Maybe histograms would fit where distribution plots do not, but I liked the distribution plots better.
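
One likely reason changing the bins does not help: a Plotly histogram trace keeps the raw samples in the figure and bins them in the browser, so the payload stays roughly the same size. The sketch below illustrates this with plain figure dicts and the standard library only. It estimates the payload as the length of the JSON text, which is an approximation of what actually gets uploaded, and it reads the 500 kB limit as 500,000 bytes (an assumption). The `payload_size` and `prebin` helpers and the random data are hypothetical.

```python
import json
import random

SIZE_LIMIT_BYTES = 500_000  # assumption: "500 kB" read as 500,000 bytes


def payload_size(spec):
    """Approximate figure size as the byte length of its JSON form."""
    return len(json.dumps(spec).encode("utf-8"))


def prebin(values, n_bins):
    """Count values in equal-width bins; return bin centers and counts."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width for constant data
    counts = [0] * n_bins
    for value in values:
        index = min(int((value - lo) / width), n_bins - 1)
        counts[index] += 1
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    return centers, counts


random.seed(0)
samples = [random.gauss(5.0, 1.5) for _ in range(50_000)]

# Histogram trace: every raw sample is stored in the figure.
raw_spec = {"data": [{"type": "histogram", "x": samples, "nbinsx": 50}]}

# Bar trace built from counts computed up front: only 50 pairs are stored.
centers, counts = prebin(samples, 50)
binned_spec = {"data": [{"type": "bar", "x": centers, "y": counts}]}

for name, spec in [("histogram", raw_spec), ("pre-binned bar", binned_spec)]:
    size = payload_size(spec)
    print(name, size, "bytes,", "fits" if size <= SIZE_LIMIT_BYTES else "too big")
```

Either dict can be passed to `plotly.graph_objects.Figure` to render it. The pre-binned version loses the ability to re-bin interactively, which is the trade-off for the smaller size.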

The first static image (distance walking/running with different data sources) is from my first EDA: I used a snapshot instead of uploading another plot to chart-studio.
You can re-create the interactive figure by making some changes to the notebook named Import Apple Health (as described on Medium).

careershifter · May 20, 2021, 5:02pm

Could you try this process and see if it works, especially for plots exceeding the limit?

Save the Plotly chart as an HTML file.
Commit and push the HTML file to any of your GitHub repositories.
Embed the HTML file in your Medium blog post.

michelealberti · May 23, 2021, 10:16am

I did a quick test using an old HTML figure (from Plotly) that I had available and it doesn’t work, even with GitHub Pages.
I’m not sure it should work, but I haven’t dug into it much: I suggest you run some more tests.
You can convert figures to html from my notebook (figures bigger than the upload limit are the ones with upload statements commented out).

If you are really looking for a workaround, you should read the Medium docs (Supported Embed Providers).