Review Request: U.S. Medical Insurance Costs Portfolio Project

Hi! I just finished this project as a part of the Data and Programming Foundations for AI Skill Path. This took me about 3 hours to complete. Since I am not familiar with any Machine Learning algorithms yet, I applied my knowledge of Python Fundamentals taught in the skill path and attempted some data analysis tasks.
I would love to update this project with things like predictive modelling and data visualization once I learn these concepts in other Codecademy lessons.

Here’s the link to the GitHub repository: GitHub - aby18/U.S.-Medical-Insurance-Costs: Codecademy Portfolio Project from the Data and Programming Foundations for AI Skill Path

Hi there!

Looks awesome!
I like that you used key:value pairs to store the data per age group; I didn’t really use this in mine.

One minor thing I would change is in this part:
print(average_cost_by_smoker_status)
{'smoker': 32050.232, 'non smoker': 8434.268}
Here I would have printed a $ sign and rounded the value to two decimals.
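
A minimal sketch of that formatting, assuming the dictionary printed above and only the standard library. The helper name `format_usd` is hypothetical and is not taken from the project:

```python
# Values copied from the printed output above; the dict itself stays numeric.
average_cost_by_smoker_status = {"smoker": 32050.232, "non smoker": 8434.268}


def format_usd(amount):
    """Return amount as a dollar string with thousands separators and two decimals.

    Hypothetical helper for illustration only.
    """
    return f"${amount:,.2f}"


# Format only at print time so the stored values remain usable in calculations.
for status, cost in average_cost_by_smoker_status.items():
    print(f"{status}: {format_usd(cost)}")
```

Formatting at print time rather than storing the strings keeps the values numeric, so nothing has to be stripped before later calculations.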

The rest of the code looks nice, and although we follow different paths, most (if not all) of what I learned in the data science path is included in your project.

Greetings!

A suggestion:

  • Rather than looking at the average costs, perhaps look at the median instead. There are outliers in the data that pull the mean upward (you can see that if you run some descriptive stats on that column: min, max, mean, median).
    Example:
#I'm using Pandas here

df['charges'].describe().round(2)
count     1338.00
mean     13270.42
std      12110.01
min       1121.87
25%       4740.29
50%       9382.03
75%      16639.91
max      63770.43

df[["sex", "charges"]].groupby("sex").median().round(2)

        charges
sex
female  9412.96
male    9369.62

#by region:
df.groupby(['region', 'sex'])['charges'].median().round(2)
region     sex   
northeast  female    10197.77
           male       9957.72
northwest  female     9614.07
           male       8413.46
southeast  female     8582.30
           male       9504.31
southwest  female     8530.84
           male       9391.35

#smoker v. non

df.groupby(['region', 'sex', 'smoker'])['charges'].median().round(2)

region     sex     smoker
northeast  female  no         8681.14
                   yes       22331.57
           male    no         8334.46
                   yes       33993.37
northwest  female  no         7731.86
                   yes       28950.47
           male    no         6687.44
                   yes       26109.33
southeast  female  no         7046.72
                   yes       35017.72
           male    no         6395.95
                   yes       38282.75
southwest  female  no         7348.14
                   yes       34166.27
           male    no         7318.96
                   yes       35585.58

#etc
  • Also, FWIW, I wouldn’t add a $, b/c you’d have to strip it if you were to do any calculations.

  • I like the granularity of your analysis.
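
For anyone sticking to plain Python rather than Pandas, here is a rough sketch of the mean vs. median point using the standard-library `statistics` module. The `charges` values are made up for illustration and are not taken from the dataset:

```python
from statistics import mean, median

# Hypothetical charges (not from the dataset): four typical values and one outlier.
charges = [4000.0, 5000.0, 6000.0, 7000.0, 60000.0]

mean_charge = mean(charges)      # pulled upward by the 60000.0 outlier
median_charge = median(charges)  # middle value of the sorted list

print(f"mean:   {mean_charge:.2f}")
print(f"median: {median_charge:.2f}")
```

With these made-up numbers the single large value lifts the mean well above the median, which is the same pattern the `describe()` output above shows for the real `charges` column.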

Thanks for the feedback!

Really appreciate the feedback! I'll definitely keep outliers in mind in future projects.
