Review requested for US Medical Insurance project

Hi,

I’d be really grateful if anyone could spare the time to review my portfolio project. I thoroughly enjoyed making it, and it really pushed me, including using some stats I haven’t used for some years!

It’s hard to estimate how much time it took me. I worked on it over probably 4 days, and I doubt it was less than 12 hours - maybe more.

You can see it in the Portfolio repo on my GitHub page here, or alternatively as a zipped HTML file here, if that’s less bother than downloading the Jupyter Notebook.

Thanks in advance for your time and consideration of this.

Ian

Congrats on completing the project.
It must seem weird & incredibly expensive to view this data set as a non-US person. One can see why medical debt is the number one reason people claim bankruptcy here. Anyway, some thoughts…

  • You do a really thorough job of introducing the notebook and the scope. Your use of commentary about your process and questions of the data is helpful for walking anyone through the analysis. And, you wrap it up nicely at the end w/your conclusions and further questions for research.

  • To get some initial, basic descriptive stats, you might consider using .describe().
    ex:
    df.describe(include='all')

Which gives you this:

  • No need to do a crosstab for counts of men v. women. You can just use value_counts(). (This is just my preference: I use built-in methods rather than writing my own functions for everything. My thought is, if it’s there, use it.)
df["sex"].value_counts()

>>male      676
female      662

#could also use groupby for median charges:

insurance[["sex", "charges"]].groupby("sex").median().round(2)

>>       charges
sex
female   9412.96
male     9369.62

#numbers in each region
insurance.groupby(['region', 'sex'])['sex'].count()

>>region     sex   
northeast  female    161
           male      163
northwest  female    164
           male      161
southeast  female    175
           male      189
southwest  female    162
           male      163

#Or, 
insurance.groupby(['region', 'sex'])['charges'].median().round(2)

>>region     sex   
northeast  female    10197.77
           male       9957.72
northwest  female     9614.07
           male       8413.46
southeast  female     8582.30
           male       9504.31
southwest  female     8530.84
           male       9391.35

  • In the section that looks at men v. women and who pays more, you used np.mean, rather than median.
    Also, slight copy+paste typo: the second print() for female median costs has a typo–it says “male median insurance charge” twice.
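    To show why the mean v. median distinction matters for skewed charges, here is a minimal sketch on made-up numbers. The `toy` frame and its values are hypothetical, not taken from the insurance data; only the groupby/agg pattern is the point.

```python
import pandas as pd

# Hypothetical toy data (NOT the real insurance dataset):
# one very large charge in the female group skews the mean.
toy = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male"],
    "charges": [1000.0, 2000.0, 30000.0, 1500.0, 2500.0, 3500.0],
})

# Compute mean and median side by side for each group.
summary = toy.groupby("sex")["charges"].agg(["mean", "median"])
print(summary)
```

    In this toy example a single large bill pulls the female mean well above the female median, which is why labelling a mean as a "median" in the print() output can mislead.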

  • Good to include the chi-square test with M&W and smoking & charges.
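    For anyone following along, a chi-square test of independence on two categorical columns might look roughly like the sketch below. It assumes the columns are named "sex" and "smoker" and uses a tiny made-up frame, so the numbers mean nothing; it is not necessarily how the notebook did it.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data (NOT the real insurance dataset).
toy = pd.DataFrame({
    "sex": ["female"] * 4 + ["male"] * 4,
    "smoker": ["yes", "no", "no", "no", "yes", "yes", "no", "no"],
})

# Build the contingency table of counts, then run the test on it.
table = pd.crosstab(toy["sex"], toy["smoker"])
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

    With a sample this small the expected counts are far too low for the test to be trustworthy; the sketch only shows the crosstab-then-test shape.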

  • I like the inclusion of the Census population density viz. (I am a Census data nerd).

Good work!


Thanks so much for this - invaluable.

Yes - the sums involved in US healthcare are a little baffling! (Not that private healthcare doesn’t exist in the UK, this is a whole other thing…!)

Your points were hugely helpful, and I learnt lots from your reply. I did actually have whole chunks of me coding unnecessary functions “to show I could” from a portfolio point of view, and in the end, just thought, “This is crazy, and all I’m showing I can do is iterate over loops - they can see from the rest of my portfolio that I can do that - it doesn’t need to be here!”

Thanks again.
