U.S. Medical Insurance Costs Project

Hi everyone,

The link to my project is https://github.com/TashenM/U.S.-Medical-Insurance-Costs-Portfolio-Project.git

It was daunting at first, but it eventually became fun to make this project my own. It took me about a week to complete because I took my time with it, and I found the project easier the further along I got.

Please check it out and give me your feedback on what I could’ve done better or what you thought was good.
Thank you! :blush:

Congrats on completing the project. You put a lot of work into it.

Some thoughts:

  • the readme file made me chuckle at the “TLDR;” part. You could always add what you put at the end of your notebook to the readme file, including where you got the data from (Kaggle).

  • goals are clearly stated at the top of the notebook.

  • good use of comments so anyone viewing the notebook can follow along as you analyze.

  • if you’re more comfortable creating classes, then do so. But you’ve imported Pandas, which is a pretty powerful library with a ton of built-in functionality, and it hasn’t really been used.

#.describe() will give you basic stats about the data:
>>>            age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010

df['smoker'].value_counts()
>>>
no     1064
yes     274

df[['smoker', 'charges']].groupby('smoker').mean()
>>>       charges
no    8434.268298
yes  32050.231832
  • But remember that outliers affect the mean and can skew the results, so it might be better to check the median.
df[['smoker', 'charges']].groupby('smoker').median()
>>>      charges
no    7345.40530
yes  34456.34845
  • you can use value_counts() to find the totals in a column,
df['region'].value_counts()
>>>
southeast    364
southwest    325
northwest    325
northeast    324

#and groupby()

df[['region', 'charges']].groupby('region').median()
>>>             charges
northeast  10057.652025
northwest   8965.795750
southeast   9294.131950
southwest   8798.593000
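
The count, mean, and median checks above can also be collected in a single call. Here is a minimal sketch, assuming a tiny made-up DataFrame (the six rows are illustrative only, not taken from the Kaggle file) that reuses the `smoker`, `region`, and `charges` column names:

```python
import pandas as pd

# Hypothetical sample rows; the real data would be loaded from the Kaggle CSV.
df = pd.DataFrame({
    "smoker": ["no", "no", "no", "yes", "yes", "no"],
    "region": ["southeast", "southeast", "northwest",
               "southwest", "southeast", "northwest"],
    "charges": [1000.0, 2000.0, 3000.0, 20000.0, 30000.0, 50000.0],
})

# Number of rows per region, most frequent first.
region_counts = df["region"].value_counts()

# Mean, median, and group size side by side for each smoker group.
summary = df.groupby("smoker")["charges"].agg(["mean", "median", "count"])

print(region_counts)
print(summary)
```

In this toy sample a single 50000 charge pulls the non-smoker mean far above the non-smoker median, which is the outlier effect mentioned earlier.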

Seaborn docs

  • further, smokers and non-smokers can be pulled out of the data set (using df.iloc) and analyzed separately too if you’re so inclined:
non_smokers = df.iloc[(df['smoker'] == 'no').values]
non_smokers.head()
>>>  age     sex     bmi  children smoker     region      charges
1     18    male  33.770         1     no  southeast   1725.55230
2     28    male  33.000         3     no  southeast   4449.46200
3     33    male  22.705         0     no  northwest  21984.47061
4     32    male  28.880         0     no  northwest   3866.85520
5     31  female  25.740         0     no  southeast   3756.62160
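
Splitting the data this way is often written with a boolean mask instead of df.iloc. The following is only a sketch on made-up rows (the values are hypothetical and merely borrow the column names from the insurance data):

```python
import pandas as pd

# Hypothetical sample rows with the same column names as the insurance data.
df = pd.DataFrame({
    "age": [19, 18, 28, 33, 62],
    "smoker": ["yes", "no", "no", "no", "yes"],
    "charges": [16000.0, 1700.0, 4400.0, 22000.0, 27000.0],
})

# A boolean Series marks the rows to keep; indexing with it filters the frame.
is_smoker = df["smoker"] == "yes"
smokers = df[is_smoker]
non_smokers = df[~is_smoker]  # ~ inverts the mask

print(non_smokers.head())
print(non_smokers["charges"].describe())
```

Each subset is a regular DataFrame, so describe(), median(), and groupby() work on it in the same way as on the full data set.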

Sorry, that was a bit long-winded. I guess I am a Pandas advocate. :panda_face:

Good work! :woman_technologist:t2: :technologist:t2:

Thank you for your feedback @lisalisaj

Looking back on it now, I don’t know why I didn’t put the “TLDR” part in the readme file. I will change that! :smile:

With respect to the Pandas built-in functionality: I did originally make use of most of the examples you illustrated here, and I was not very comfortable using classes at the time.

I ended up changing my mind because I wanted to get out of my comfort zone for this project, so I dove into using classes to rebuild what I had planned to do with some of the Pandas built-in functionality.

I really do appreciate your response and taking the time to look at my project! Thanks again!
