Thanks for coming to share your portfolio Project with other learners!
When posting your project for review, please be sure to include the following:
Your review of the Project. Was it easy, difficult, just right?
An estimate of how long it took you to complete
The link to your code repo
We hope you enjoyed this project!
Hi, I am currently 45% through my Data Science foundation. This is the link to my first stab at the US medical project.
I found the project quite good: I was able to practice most of the techniques I have learnt. However, I am still a little confused about the pivot table method.
Any feedback anyone can give would be much appreciated:
Congrats on completing the project. It’s clear you know how to write functions to arrive at descriptive stats, but since you’ve imported Pandas, why not use its built-in functions?
Ex:
df['age'].median()
39.0
#use .value_counts()
df['sex'].value_counts()
sex     count
male      676
female    662
df["smoker"].value_counts()
smoker  count
no       1064
yes       274
df[["age", "sex", 'charges']].groupby('sex').median().round(2)
sex      age  charges
female  40.0  9412.96
male    39.0  9369.62
When you used the .describe() method, you could see that the charges column has some extreme max and min values that pull the mean, so perhaps the median is a better stat to look at for charges. Just something to consider.
There’s a record where charges hit the max:
print(df[df.charges == df.charges.max()])
     age     sex    bmi  children  smoker     region      charges
543   54  female  47.41         0     yes  southeast  63770.42801
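To see why a single extreme record like that matters, here is a toy sketch. The numbers are made up for illustration and are not from the insurance dataset; the point is only that one large value moves the mean a lot while the median stays put.

```python
import pandas as pd

# Made-up charges (NOT the insurance data): four typical values and one extreme one.
charges = pd.Series([4000.0, 5000.0, 6000.0, 7000.0, 60000.0])

mean_charge = charges.mean()      # pulled upward by the single large value
median_charge = charges.median()  # middle value, unaffected by the extreme one

print(f"mean:   {mean_charge:.2f}")
print(f"median: {median_charge:.2f}")
```

When the two numbers are far apart, as here, that is usually a hint that the column is skewed and the median is the safer summary.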
I think your average charges for smokers is a bit off; you might want to revisit that.
This is what I got:
df[["smoker", 'charges']].groupby('smoker').mean().round(2)
smoker   charges
no       8434.27
yes     32050.23
#or, median:
df[["smoker", 'charges']].groupby('smoker').median().round(2)
smoker   charges
no       7345.41
yes     34456.35
It’s a good idea that you separated out the smokers from the dataset and analyzed it from there.
Also, it’s good that you looked at region and its potential effect, if any, on charges.
Pandas might help here again:
df.groupby(['region', 'sex', 'smoker'])['charges'].median().round(2)
region     sex     smoker   charges
northeast  female  no       8681.14
                   yes     22331.57
           male    no       8334.46
                   yes     33993.37
northwest  female  no       7731.86
                   yes     28950.47
           male    no       6687.44
                   yes     26109.33
southeast  female  no       7046.72
                   yes     35017.72
           male    no       6395.95
                   yes     38282.75
southwest  female  no       7348.14
                   yes     34166.27
           male    no       7318.96
                   yes     35585.58
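Since you mentioned being confused by the pivot table method: one way to think of it is as a groupby whose result is spread out into a grid. Here is a rough sketch on a tiny made-up frame (the column names match the insurance data, but the `toy` frame and its values are invented for illustration), showing the same medians computed both ways.

```python
import pandas as pd

# Tiny made-up frame; column names mirror the insurance data, values do not.
toy = pd.DataFrame({
    "region": ["northeast", "northeast", "northeast",
               "southeast", "southeast", "southeast"],
    "smoker": ["no", "no", "yes", "no", "yes", "yes"],
    "charges": [8000.0, 9000.0, 30000.0, 7000.0, 35000.0, 37000.0],
})

# groupby: one row per (region, smoker) combination, in "long" form.
grouped = toy.groupby(["region", "smoker"])["charges"].median()

# pivot_table: same medians, with region as rows and smoker as columns.
pivoted = toy.pivot_table(
    index="region", columns="smoker", values="charges", aggfunc="median"
)

print(grouped)
print(pivoted)
```

The numbers are identical; only the layout differs. The pivot layout is often easier to read when you want to compare two categories side by side.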
Basically, for this EDA you’re trying to discern which variables (if any) have an effect on charges, or whether there is any correlation (keeping in mind that correlation ≠ causation). You did a good job looking at smoking: one of the first questions you’re asked when applying for insurance is “Are you a smoker?”, and rates for smokers are higher. Maybe add a couple of sentences at the beginning of the notebook about the dataset (if you know its source, state it there) and what you want to look at in the EDA. Additionally, wrap up your thoughts, findings, and next steps in a brief conclusion at the end of the notebook.
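If you want a quick first look at correlation with charges, something along these lines might help. It is only a sketch on made-up data: the `toy` frame, its values, and the `smoker_flag` column are invented for illustration, and encoding smoker as 0/1 is just one simple choice.

```python
import pandas as pd

# Made-up rows (NOT the insurance data), with the same column names.
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60, 25],
    "smoker": ["no", "no", "no", "yes", "yes", "yes"],
    "charges": [3000.0, 4000.0, 6000.0, 32000.0, 40000.0, 30000.0],
})

# Encode the yes/no column as 1/0 so it can take part in a correlation.
toy["smoker_flag"] = (toy["smoker"] == "yes").astype(int)

# Pearson correlation of each numeric column with charges.
corr_with_charges = (
    toy[["age", "smoker_flag", "charges"]].corr()["charges"].drop("charges")
)

print(corr_with_charges.round(2))
```

A value near 1 or -1 suggests a strong linear relationship, but it still says nothing about cause.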