Hi everyone,
The link to my project is https://github.com/TashenM/U.S.-Medical-Insurance-Costs-Portfolio-Project.git
It was daunting at first, but it eventually became fun to make this project my own. It took me about a week to complete, but I took my time with it. I found the project easier the further along I got.
Please check it out and give me your feedback on what I could’ve done better or what you thought was good.
Thank you! 
Congrats on completing the project. You put a lot of work into it.
Some thoughts:
- the readme file made me chuckle at the “TLDR;” part. You could also add what you put at the end of your notebook to the readme file, including where you got the data from (Kaggle).
- goals are clearly stated at the top of the notebook.
- good use of comments, so anyone viewing the notebook can follow along as you analyze.
- if you’re more comfortable creating classes, then do so. But you’ve imported Pandas, which is a pretty powerful library with a ton of built-in functionality, and it hasn’t really been used.
Ex:
#.describe() will give you basic stats about the data:
df.describe()
>>             age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010
df["smoker"].value_counts()
>>>no 1064
yes 274
df[["smoker", 'charges']].groupby('smoker').mean()
>>> charges
smoker
no 8434.268298
yes 32050.231832
- But remember that outliers pull the mean toward them and can skew the results, so it might be better to check the median.
ex:
df[["smoker", 'charges']].groupby('smoker').median()
>>>charges
smoker
no 7345.40530
yes 34456.34845
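To see why the median can be the safer summary, here is a minimal sketch on a made-up frame (the numbers are invented for illustration and are not from the insurance data). It puts the mean and the median side by side per group, so the effect of a single large value is easy to spot:

```python
import pandas as pd

# Made-up numbers for illustration only (not taken from the insurance data).
toy = pd.DataFrame({
    "smoker": ["no", "no", "no", "no", "yes", "yes", "yes"],
    "charges": [1000.0, 2000.0, 3000.0, 50000.0, 30000.0, 32000.0, 34000.0],
})

# Compute the mean and the median of charges for each group in one call.
summary = toy.groupby("smoker")["charges"].agg(["mean", "median"])
print(summary)
```

In the "no" group the single 50000.0 value drags the mean far above the median, while the "yes" group has no outlier and the two agree.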
- you can use value_counts() to find the totals in a column,
df["region"].value_counts()
>> southeast 364
southwest 325
northwest 325
northeast 324
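#and groupby()
# 'sex' is left out of the selection: median() only works on numeric columns,
# and recent pandas versions raise an error instead of silently dropping text columns
df[['region', 'charges']].groupby('region').median()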
>>> charges
region
northeast 10057.652025
northwest 8965.795750
southeast 9294.131950
southwest 8798.593000
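If you want to combine the two ideas, groupby() also accepts a list of columns. This is just a sketch on made-up rows (the column names follow the insurance data, the values are invented), grouping by region and smoker at the same time:

```python
import pandas as pd

# Made-up rows for illustration only; column names follow the insurance data.
toy = pd.DataFrame({
    "region": ["southeast", "southeast", "southeast",
               "northwest", "northwest", "northwest"],
    "smoker": ["no", "no", "yes", "no", "yes", "yes"],
    "charges": [2000.0, 4000.0, 30000.0, 5000.0, 20000.0, 40000.0],
})

# Selecting "charges" after groupby keeps non-numeric columns out of the median.
by_region_smoker = toy.groupby(["region", "smoker"])["charges"].median()
print(by_region_smoker)

# unstack() pivots the smoker level into columns for a side-by-side view.
print(by_region_smoker.unstack())
```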
Seaborn docs
- further, smokers and non-smokers can be pulled out of the data set using df.iloc and analyzed separately too, if you’re so inclined:
non_smokers = df.iloc[(df['smoker']=='no').values]
non_smokers.head()
>>>  age     sex     bmi  children  smoker     region      charges
1     18    male  33.770         1      no  southeast   1725.55230
2     28    male  33.000         3      no  southeast   4449.46200
3     33    male  22.705         0      no  northwest  21984.47061
4     32    male  28.880         0      no  northwest   3866.85520
5     31  female  25.740         0      no  southeast   3756.62160
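For what it's worth, the same filter is more commonly written as plain boolean indexing. A small sketch on made-up rows (again, the values are invented and only the column names match the insurance data) showing that both forms select the same rows:

```python
import pandas as pd

# Made-up rows for illustration only; column names follow the insurance data.
toy = pd.DataFrame({
    "age": [19, 18, 28, 62],
    "smoker": ["yes", "no", "no", "yes"],
    "charges": [16000.0, 1700.0, 4400.0, 27000.0],
})

# Boolean Series: True where the row is a non-smoker.
mask = toy["smoker"] == "no"

non_smokers = toy[mask]             # boolean indexing on the DataFrame
same_rows = toy.iloc[mask.values]   # positional version with a NumPy boolean array
smokers = toy[~mask]                # ~ inverts the mask to get the other group

print(non_smokers)
print(non_smokers.equals(same_rows))
```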
Sorry, that was a bit long-winded. I guess I am a Pandas advocate. 
Good work!

Thank you for your feedback @lisalisaj
Looking back on it now, I don’t know why I didn’t put the “TLDR” part in the readme file. I will change that! 
With respect to the Pandas built-in functionality: I did originally make use of most of the examples you illustrated here, and I was not very comfortable using classes at the time.
I ended up changing my mind because I wanted to get out of my comfort zone for this project, so I dove into classes and rebuilt what I had planned to do with the Pandas built-in functionality.
I really do appreciate your response and taking the time to look at my project! Thanks again!