Hi guys, I'm 27% of the way through the Machine Learning course, and this is my project:
U.S. Medical Insurance Costs
Git: GitHub - WilliamGaleanoM/python-portfolio-project-starter-files: Code to Codecademy project
I would appreciate your comments on it.
This is the module 6 project, done without the pandas library.
Congrats on completing the project.
A few thoughts:
- You have a solid grasp of how to write functions to glean insights from the data set.
- You clearly describe your goals at the top of the notebook.
- Don't forget to cite where the dataset came from in the README file or at the top of the notebook.
- Rather than computing the mean cost of insurance, it might be better to find the median. There are outliers in the data set, and they will skew the mean.
Ex:
df.describe()
>>
               age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010
# or,
df['charges'].mean().round(2)
>> 13270.42
df['charges'].median().round(2)
>> 9382.03
- That said, it might be good to show both the mean and the median of the charges column in the notebook.
- The difference in costs between smokers and non-smokers comes out a bit smaller when you use the median ($27110.94):
df[['smoker', 'charges']].groupby('smoker').median().round(2)
>>
         charges
smoker
no       7345.41
yes     34456.35
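Since the project avoids pandas, here is a rough sketch of how the same mean vs. median comparison and the per-smoker median could be done with only the standard library. The helper names (`summarize`, `median_by_group`) and the sample rows are made up for illustration; the rows are not taken from the real dataset, so the numbers will not match the output above.

```python
import statistics
from collections import defaultdict


def summarize(values):
    """Return (mean, median) of a list of numbers, rounded to 2 decimals."""
    return round(statistics.mean(values), 2), round(statistics.median(values), 2)


def median_by_group(rows, key, value):
    """Group dict rows by rows[key] and return the median of rows[value] per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(float(row[value]))
    return {name: round(statistics.median(vals), 2) for name, vals in groups.items()}


# Hypothetical rows for illustration only; NOT real values from the dataset.
# Values are strings because that is how csv.DictReader would deliver them.
sample_rows = [
    {"smoker": "no", "charges": "2000.0"},
    {"smoker": "no", "charges": "3000.0"},
    {"smoker": "no", "charges": "4000.0"},
    {"smoker": "yes", "charges": "20000.0"},
    {"smoker": "yes", "charges": "61000.0"},
]

charges = [float(row["charges"]) for row in sample_rows]
mean_charge, median_charge = summarize(charges)
medians = median_by_group(sample_rows, "smoker", "charges")

print(mean_charge, median_charge)  # the large values pull the mean above the median
print(medians)
```

With the real file, the rows could probably come from `csv.DictReader` instead of the hard-coded list, assuming the CSV has `smoker` and `charges` columns as in the pandas output above.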
Short and to the point. Good work. 