- Took me over 3 days to do the project, but great revision;
- Re: Jupyter notebook
a. I have to constantly ‘run’ every single cell from the top;
b. a perfect cell runs perfectly one day, the next day, becomes ‘error’;
c. how to keep cell numbers from 1 to …, nice and neat?
- Do I need more explanation or insights in my Presentation?
Judy from Melbourne
Thank you for posting your project!
Yep, Jupyter sessions will time out, and yes, you have to re-run the cells from the top (which is where all your imported libraries, files, etc. are). The same thing will happen in Google’s Colab (an app based on Jupyter).
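If it helps to see why a cell that ran fine yesterday errors today, here’s a rough sketch in plain Python. The two "cells" and the `run_cells` helper are made up for illustration; a fresh dictionary stands in for a restarted kernel, which remembers nothing from the last session.

```python
# Two hypothetical notebook cells, stored as source text.
setup_cell = "import math\nradius = 2.0"
analysis_cell = "area = math.pi * radius ** 2"

def run_cells(cells):
    """Run cells in order in a brand-new namespace, like a freshly started kernel."""
    namespace = {}
    for source in cells:
        exec(source, namespace)
    return namespace

# Running from the top works: the setup cell defines what the analysis cell needs.
fresh = run_cells([setup_cell, analysis_cell])
print(round(fresh["area"], 2))

# Running only the later cell after a restart fails, because nothing is defined yet.
try:
    run_cells([analysis_cell])
except NameError as err:
    print("NameError:", err)
```

So the cell itself hasn’t changed overnight; the session just lost the imports and variables it depends on.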
Is there a .ipynb file in this repo?
Looks like you have a solid understanding of the data and the goals of the project.
It’s a good idea to have an introductory slide that describes what your data set(s) is/are and what you’ve investigated or what story you’re about to tell. In addition to looking for correlations and possible causation & making predictions with data sets, data scientists (and data analysts) are storytellers.
You might want to create an Appendix portion of your presentation b/c the slides become a bit cluttered with the Python code. (Plus, your audience might not be technical and might not understand what you’re conveying.) A mentor once told me, ‘less is more on slides’ and ‘know your audience’. Those are two things I always keep in mind when creating a presentation. I also think about what it would be like to be an audience member viewing the slides and not knowing what the focus of each slide is.
So, you’re ultimately telling a story about the data set. You could have slides featuring the tables and plots with one or two sentences of a description of them and the findings for each step & then elaborate from your notes in your presentation. If someone wants to see your code, you can refer them to the Appendix of your presentation or show them your Jupyter Notebook.
You could also have a Final Thoughts or, Summary slide at the end where you tie up everything that you looked at in the data. (it’s a thought)
Thank you Lisa for giving me detailed feedback. Your time and efforts are much appreciated. Jude
Based on your detailed suggestions, I spent another 4 hours:
- completely re-organised the PowerPoint presentation slides, they look so much more professional now;
- attached the Jupyter Notebook separately, to declutter;
- I found one of my fellow buddies at codecademy who also posted their work. I like their clean line of slides, so I modified mine based on the inspiration.
- Would you please have a quick look at my new work?
Thank you for your time and effort commenting on my project. You are the reason that ‘codecademy’ is so successful. Go Lisa.
Jude from Melbourne, Australia
ping-s-code-academy-projects/Final_PDF_ Ping_ Capstone2_Biodiversity_python_pandas_matplotlib_chi2_contingency.pdf at main · judyping2436/ping-s-code-academy-projects · GitHub
ping-s-code-academy-projects/Final_PYNB_Ping_Capstone2_Biodiversity_python_pandas_matplotlib_chi2_contingency.ipynb at main · judyping2436/ping-s-code-academy-projects · GitHub
Cool! I think that by having a separate jupyter notebook it makes your thought process and the analysis a bit easier to understand. (Plus, if there are technical ppl in your audience, you can always share it with them).
There is one thing though: the chi-square test in cells 146, 147 & 96. There seem to be some numbers missing from the output. (You also don’t need to define “significance = 0.05”, so I’d remove that.)
The pval for the first test mammals vs. birds is 0.68.
The chi-square test is supposed to return 4 values, which you can then print out:
    from scipy.stats import chi2_contingency
    chi2, pval, dof, expected = chi2_contingency(contingency)
    print(pval)
    print(chi2, pval, dof, expected)
Also, in cell 96, you can delete the defined variable you have, “significance = 0.05”.
The pval is 0.038356, which is less than 0.05 and is significant. You’re trying to determine if there is an association between the two variables: species category (mammals vs. reptiles) and conservation status. Mammals are more likely to be endangered than reptiles but not more likely to be endangered than birds, right?
This is a (somewhat) detailed description of the chi square test that I like to refer to sometimes. Scroll down to the Contingency Tables section:
There are also some really useful articles in the “Communicating Data Science Findings” section of the DS path. Or is this project part of Analyze Data w/Python? I can’t recall.
The presentation is great!! I like the cleaner slides and the beautiful picture on the first slide. Getting inspiration from others is part of the process.
You have created an ‘animal’. Under your inspiration, I have gone on and created a short video of my Biodiversity Presentation on YouTube. My Qs are:
- it’s about 3 minutes, and I find I am repeating myself in the ‘Executive Summary’ at the end;
- I’d love your feedback.
Jude from Melbourne
I think this is a great idea!!
Also, if you’re looking for another course to take there’s a new one, “Master Statistics with Python” and every Tuesday there’s a YouTube event where Alex and Sophie go over a dataset. They record them and they are here: https://www.youtube.com/watch?v=YwadRm2sfpQ
It’s pretty cool (the course and events). I’m watching the videos and have downloaded the dataset.
You have been the most important person in my code learning journey so far. Looking back 5 days ago when I first presented the “Capstone2 Biodiversity Project” to the work I am showing today, your feedback is an integral part of my codecademy learning success. Give yourself a big pat on the shoulder.
cheers, Jude from warm & sunny Melbourne
That’s very kind of you to say. I’m happy to have helped out!
- Would you please give me some feedback on the “Executive Summary” (3rd page) of the slide PDF? I find my explanation of the p-value needs some help.
- I choose the front cover of ‘Muscle Hub’ especially for you.
Thank you. Judy from beautiful sunny Melbourne
Capstone1_MuscleHub_AB_Test-/Final_PDF_PPT_MuscleHub_AB_Test_ Ping_ Capstone1_python_pandas_SQL_matplotlib_chi2_contingency.pdf at main · judyping2436/Capstone1_MuscleHub_AB_Test- · GitHub
Capstone1_MuscleHub_AB_Test-/Final_Ping_MuscleHub_AB_test_Project.ipynb at main · judyping2436/Capstone1_MuscleHub_AB_Test- · GitHub
Sure, I can take a look.
I also encourage others to review as well. (more than one opinion is a good thing sometimes! )
Definitely, Thank you Lisa.
I think it depends on how technical your audience is that you’re presenting to. So, let’s say it’s a tech savvy audience and they know what Chi Square contingency testing is and what p values are.
In your notebook and slides I think you can show the actual p value you found (rather than stating <0.05).
I’m confused about the second bullet point in item 3 of your Executive Summary.
Visitors who did not take the test had a higher rate of purchasing a membership (10% vs 7%) AND the test showed that the difference between the two groups was significant at 0.0147… (My numbers are slightly different than yours: A: member: 200, not member: 2304; B: member: 250, not member: 2250. contingency = [[200, 2304], [250, 2250]])
Ultimately the a/b tests showed that:
1.) More people in group B (who did not take the fitness test) submitted an application and the chi square test results were statistically significant at 0.0009647827. (My numbers are slightly diff. for A and B. contingency = [[250, 2254], [325, 2175]])
2.) People in group A (took fitness test) were more likely to purchase a membership if they had already picked up an application. BUT, the chi sq. test showed that the difference between the groups was not statistically significant (the p value was above 0.05).
3.) More people overall from group B who didn’t take the fitness test did purchase a membership. (the chi sq test results were significant.)
So, I think that we can conclude that the fitness test is prohibitive towards purchasing a membership. My recommendation (after reading some of the sample interviews as well) was to ditch the test.
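If you want to re-check tests 1 and 3 yourself, here’s a minimal sketch using the two contingency tables I quoted above. The variable names and the row/column labels in the comments are just my own labels, and your counts may differ slightly from mine.

```python
from scipy.stats import chi2_contingency

# Counts quoted above: rows are group A (fitness test) and group B (no test).
applications = [[250, 2254], [325, 2175]]  # [applied, did not apply]
purchases = [[200, 2304], [250, 2250]]     # [member, not member]

results = {}
for label, table in [("applications", applications), ("purchases", purchases)]:
    # chi2_contingency returns the statistic, p value, degrees of freedom
    # and the table of expected counts.
    chi2, pval, dof, expected = chi2_contingency(table)
    results[label] = pval
    verdict = "significant" if pval < 0.05 else "not significant"
    print(f"{label}: chi2={chi2:.3f}, p={pval:.6f}, dof={dof} -> {verdict}")
```

Printing the actual p value next to the verdict is also what I’d show on the slide for a technical audience.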
Oops, I didn’t fully answer your question about p values.
Just remember that the p value is the probability of seeing a result at least as extreme as the one in your data due to random chance alone, if the null hypothesis were true. (Basically, the Null (Ho) is that there is no difference between the two variables, and the Alternative hypothesis (Ha) is that there is a difference between the two variables.) A low p value (< 0.05) means the result is unlikely to have happened randomly, so you reject the null.
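To make that definition a bit more concrete, here’s a rough by-hand version of the test for a 2x2 table, using only the standard library. It’s a sketch, not what scipy literally runs: `chi_square_2x2` is a name I made up, and it assumes the simple case where Yates’ continuity correction (which `chi2_contingency` applies by default to 2x2 tables) subtracts a full 0.5 from each gap.

```python
import math

def chi_square_2x2(table):
    """Chi-square test of independence for a 2x2 table, with Yates' correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    # Counts we would expect in each cell if the null (no difference) were true.
    expected = [[r * c / total for c in col_totals] for r in row_totals]

    # Yates' correction: shrink each |observed - expected| gap by 0.5
    # (assumes every gap is larger than 0.5, which holds for the table below).
    chi2 = sum(
        (abs(obs - exp) - 0.5) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )

    # With 1 degree of freedom, the chance of a statistic at least this large
    # under the null is erfc(sqrt(chi2 / 2)).
    pval = math.erfc(math.sqrt(chi2 / 2))
    return chi2, pval, expected

# Membership purchases: group A (test) vs group B (no test), counts quoted above.
chi2, pval, expected = chi_square_2x2([[200, 2304], [250, 2250]])
print(f"chi2={chi2:.3f}, p={pval:.4f}")
```

The `expected` table is the "no difference" world; the further your real counts sit from it, the bigger chi2 gets and the smaller the p value. For this table it should land close to the 0.0147 I mentioned.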
I have found some YT videos that really help explain p values, null/alt hypotheses, standard deviation, confidence intervals, etc. Mr. Nystrom is a high school AP stats teacher and his explanations are very thorough and understandable (plus, he’s funny). https://youtu.be/-MKT3yLDkqk
- Your explanation is so clear, it makes so much sense.
- H_null – no difference; H_alt – is difference; I always thought that H_null – independent; H_alt – associated
- Just a quick note, I tried to upload ‘dog_data.csv’ to my own folder, so I can do the project on ipynb. I clicked ‘upload’, but it’s not working. Maybe there is another way, please advise. (please see pic below)
Thank you for your valuable feedback, you are an integral part of codecademy’s success.
Judy from sunny Melbourne
It can be a difficult thing to wrap one’s head around. (trust me! lol)
YouTube to the rescue again (and Mr. Nystrom) about Null and Alt. hypotheses:
You understand why this project for A/B testing requires a chi-square test, right? (Because the variables are categorical.) Here are a couple of articles:
I highly recommend reading up on hypothesis testing and the different types of tests according to your data set (off the CC platform. Though, they have added a ton of helpful articles that expand what they’d previously taught).
Are you on the Analyze Data with Python course?
I just checked that project and you shouldn’t be uploading anything. If you want to download the file, I suggest copying the data and pasting it into something like Excel (using “paste special”, choose the option “text”), so that all the data lands in one column.
Then click on the Data tab, select the “text to columns” option and follow the steps in the Text Wizard there. All the data is separated by commas, so each comma-separated value ends up in its own column. Then you can save the file as a csv and load that into Jupyter Notebook.
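If you’d rather skip Excel, something like this might also work from inside the notebook. It’s only a sketch: the sample rows and column names are made up (I’m not quoting the real file), so you’d paste the copied text in their place.

```python
import io
import pandas as pd

# Hypothetical sample standing in for the text copied from the project page.
pasted_text = """breed,weight,age
poodle,20,3
terrier,15,5
"""

# read_csv splits each row on commas, so every value lands in its own column.
dogs = pd.read_csv(io.StringIO(pasted_text))
print(dogs.shape)

# Save a real csv file next to the notebook so it can be reloaded later.
dogs.to_csv("dog_data.csv", index=False)
```

After that, `pd.read_csv("dog_data.csv")` in any later session should give you the same table back.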
Thank you Lisa for another detailed suggestion. I have now created a special file named ‘Feedback from Lisa - Codecademy’, so I can follow your suggestions.
Thank you for making my code journey much more pleasant.
Jude from Sunny Melbourne
I am on the ‘Data Scientist’ Career path, but I also do some short ‘skill path’, to get a piece of paper, to keep me motivated. Your valuable feedback is a big part of motivation now.
You’re welcome! It’s a good community of learners here and we’re all here to help one another.
Some more good references for DS stuff:
And, if you’re ever stuck looking for data sets, Kaggle is a good resource: https://www.kaggle.com/datasets