Hi,
I’d be really grateful if anyone could spare the time to review my portfolio project. I thoroughly enjoyed making it, and it really pushed me, including using some stats I haven’t used for some years!
It’s hard to estimate how much time it took me. I worked on it over probably 4 days, and I doubt it was less than 12 hours - maybe more.
You can see it in the Portfolio repo on my GitHub page here, or, if that’s less bother than downloading the Jupyter Notebook, as a zipped HTML file here.
Thanks in advance for your time and consideration of this.
Ian
Congrats on completing the project.
It must seem weird & incredibly expensive to view this data set as a non-US person. One can see why medical debt is the number one reason people claim bankruptcy here. Anyway, some thoughts…
- You do a really thorough job of introducing the notebook and the scope. Your use of commentary about your process and questions of the data is helpful for walking anyone through the analysis. And, you wrap it up nicely at the end w/your conclusions and further questions for research.
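- To get some initial, basic descriptive stats, you might consider using .describe().
ex:
df.describe(include='all')
Which gives you a summary table for every column (count, mean, std, min, quartiles and max for the numeric ones; count, unique, top and freq for the others).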
- No need to do a crosstab for counts of men v. women. You can just use value_counts().
(Also, this is just my preference: I use built-in methods rather than writing my own functions for everything. My thought is, if it’s there, use it.)
df["sex"].value_counts()
>> male      676
   female    662
#could also use groupby for median charges:
insurance[["sex", "charges"]].groupby("sex").median().round(2)
>>         charges
sex
female     9412.96
male       9369.62
#numbers in each region
insurance.groupby(['region', 'sex'])['sex'].count()
>> region   sex
northeast  female    161
           male      163
northwest  female    164
           male      161
southeast  female    175
           male      189
southwest  female    162
           male      163
#Or,
insurance.groupby(['region', 'sex'])['charges'].median().round(2)
>> region   sex
northeast  female    10197.77
           male       9957.72
northwest  female     9614.07
           male       8413.46
southeast  female     8582.30
           male       9504.31
southwest  female     8530.84
           male       9391.35
- In the section that looks at men v. women and who pays more, you used np.mean rather than the median.
Also, a slight copy+paste typo: the second print(), for female median costs, says “male median insurance charge”, so that label appears twice.
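To show why the mean v. median choice matters for skewed data like medical charges, here is a minimal sketch using only the standard library. The numbers are made up for illustration and are not taken from your data set; the variable names are my own.

```python
from statistics import mean, median

# Hypothetical charges (made-up numbers, NOT from the insurance data set):
# four modest bills and one very large one.
charges = [1200, 1500, 1800, 2100, 48000]

mean_charge = mean(charges)      # pulled upward by the single large bill
median_charge = median(charges)  # the middle value, unaffected by the outlier

print(f"mean:   {mean_charge:.2f}")
print(f"median: {median_charge:.2f}")
```

With one large bill in the mix, the mean lands far above what a typical person in this toy list pays, while the median stays at the middle value. So if the label says “median”, np.median is probably what you want in the code too.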
- Good to include the chi-square test with M&W and smoking & charges.
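For anyone following along, here is a rough sketch of what a chi-square test of independence computes for a 2x2 table such as sex v. smoker. The counts are made up (not from the insurance data), and chi_square_2x2 is a hypothetical helper of mine, not something from your notebook. It skips the continuity correction, so, as far as I know, the result will differ a little from scipy.stats.chi2_contingency, which applies Yates' correction to 2x2 tables by default.

```python
import math

# Hypothetical 2x2 contingency table (made-up counts, NOT from the insurance data):
# rows = sex (female, male), columns = smoker (yes, no)
observed = [[20, 80],
            [30, 70]]


def chi_square_2x2(table):
    """Return (chi-square statistic, p-value) for a 2x2 table, without continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected

    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = chi_square_2x2(observed)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```

The closer the observed counts are to the expected ones, the smaller the statistic and the larger the p-value; a table whose rows are exactly proportional gives a statistic of 0.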
- I like the inclusion of the Census population density viz. (I am a Census data nerd.)
Good work!
Thanks so much for this - invaluable.
Yes - the sums involved in US healthcare are a little baffling! (Not that private healthcare doesn’t exist in the UK, but this is a whole other thing…!)
Your points were hugely helpful, and I learnt lots from your reply. I did actually have whole chunks of me coding unnecessary functions “to show I could” from a portfolio point of view, and in the end just thought, “This is crazy, and all I’m showing I can do is iterate over loops - they can see from the rest of my portfolio that I can do that - it doesn’t need to be here!”
Thanks again.