This was my first ever project in Data Science! I’m trying to change my career and was pursuing JavaScript full-stack pretty heavily, but more recently I realized that Data Science is a better fit for my personality type. I enjoyed the coding, but the documentation and markup were new to me. I think that as I get further along in this Codecademy Path I will learn more about how the analysis should go, and I can return to this project for improvements.
This project took me two days: one for the coding and one for the documentation.
Thanks in advance for reviewing my project!
Congrats on completing the project.
- You’re adept at writing functions to extract insights from the dataset.
- Good use of comments describing the results from the functions.
- One thing I’ll say is that the mean of charges will be skewed because of outliers, so the median is perhaps a better stat to use.
Ex:
df['charges'].mean().round(2)
>> 13270.42
vs:
df['charges'].median().round(2)
>> 9382.03
Or,
df.describe()
>>
               age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010
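If it helps to see why the mean gets pulled around, here is a minimal sketch with made-up numbers (they are not from the insurance data, and the variable names are just for illustration). A single large value drags the mean well above the typical charge, while the median stays at the middle value:

```python
import pandas as pd

# Hypothetical charges, NOT taken from the insurance dataset:
# four typical values and one very large outlier.
charges = pd.Series([2000.0, 3000.0, 4000.0, 5000.0, 60000.0])

mean_charge = charges.mean()      # pulled upward by the 60000 outlier
median_charge = charges.median()  # middle value; the outlier's size does not matter

print(f"mean:   {mean_charge:.2f}")    # mean:   14800.00
print(f"median: {median_charge:.2f}")  # median: 4000.00
```

Four of the five values are at or below 5000, yet the mean lands at 14800, which is why the median is often the safer summary for skewed data like charges.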
Grouping by sex shows the same effect:
df[["age", "sex", 'charges']].groupby('sex').median().round(2)
         age  charges
sex
female  40.0  9412.96
male    39.0  9369.61
vs:
df[["age", "sex", 'charges']].groupby('sex').mean().round(2)
          age   charges
sex
female  39.50  12569.58
male    38.92  13956.75
The same goes for smokers, with a slightly smaller difference:
df[["smoker", 'charges']].groupby('smoker').median().round(2)
         charges
smoker
no       7345.41
yes     34456.35
vs:
df[["smoker", 'charges']].groupby('smoker').mean().round(2)
         charges
smoker
no       8434.27
yes     32050.23
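To compare both stats in one table instead of running two separate calls, something like the following might work. It is only a sketch: the small DataFrame is a made-up stand-in for the insurance data, and `mean_minus_median` is a column name I invented for the example.

```python
import pandas as pd

# Hypothetical stand-in for the insurance data (values are made up).
df = pd.DataFrame({
    "smoker": ["no", "no", "no", "yes", "yes", "yes"],
    "charges": [2000.0, 4000.0, 12000.0, 20000.0, 35000.0, 38000.0],
})

# One row per group, with mean and median as adjacent columns.
summary = df.groupby("smoker")["charges"].agg(["mean", "median"]).round(2)

# Gap between the two stats: positive when the mean sits above the median,
# negative when it sits below.
summary["mean_minus_median"] = summary["mean"] - summary["median"]

print(summary)
```

A large gap in either direction is a hint that a group's distribution is lopsided, so it is worth looking at the values themselves before picking one summary stat.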
Keep up the good work!