Please review: US Medical Insurance Project

Hi, I’ve just finished this project and would appreciate any advice or suggestions on how it could be improved. Thanks in advance!

  • This project was just right for my level
  • It probably took me about 4 hours to complete
  • My code repo can be found here

Some things to consider:

  • Add your conclusions at the end of the notebook.

  • Rather than looking at the total amount of the charges, perhaps look at the median or mean. The median is probably more useful here, as there are some outliers in the data set that pull the mean upward. Related: look at the median costs for smokers vs. non-smokers.

# I'm using pandas here; df is the insurance data loaded as a DataFrame.
df['charges'].describe().round(2)

count     1338.00
mean     13270.42
std      12110.01
min       1121.87
25%       4740.29
50%       9382.03
75%      16639.91
max      63770.43

This is interesting:

df.groupby(['region', 'sex', 'smoker'])['charges'].median().round(2)

region     sex     smoker
northeast  female  no         8681.14
                   yes       22331.57
           male    no         8334.46
                   yes       33993.37
northwest  female  no         7731.86
                   yes       28950.47
           male    no         6687.44
                   yes       26109.33
southeast  female  no         7046.72
                   yes       35017.72
           male    no         6395.95
                   yes       38282.75
southwest  female  no         7348.14
                   yes       34166.27
           male    no         7318.96
                   yes       35585.58

  • Other things to look at: How many total records are in the data? Were there any NULLs? How many women vs. men are in the data? Smokers vs. non-smokers? What were the median charges for the different groups of children? Which region of the country had the highest median charges? Etc.

  • I would ignore BMI (for the reasons mentioned in the project description), along with any subjective language surrounding it, because it’s not an accurate measure of one’s health. (It’s a number insurance companies use to charge people higher premiums.)

  • In the last part, with all the functions, it’s a little unclear what you want people to glean from the output.

  • I’m not sure what course you’re taking, or whether EDA (exploratory data analysis) has been mentioned yet. If not, I recommend reading up on it as a guide for first steps with any data set.
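
If it helps, here is a rough sketch of what those first EDA checks might look like in pandas. It answers the questions listed above (record count, NULLs, group sizes, medians per group) and compares the mean with the median. The six rows are invented purely for illustration and are not the real data; the column names (sex, smoker, region, children, charges) are my assumption about how the project's data set is laid out, so swap in your own df.

```python
import pandas as pd

# Tiny made-up sample standing in for the real data set.
# Column names are assumed, not taken from the project's file.
df = pd.DataFrame({
    "sex":      ["female", "male", "male", "female", "male", "female"],
    "smoker":   ["yes", "no", "no", "no", "yes", "no"],
    "region":   ["southwest", "southeast", "southeast",
                 "northwest", "northeast", "northwest"],
    "children": [0, 1, 3, 0, 2, 1],
    "charges":  [16000.0, 2000.0, 4000.0, 3000.0, 40000.0, 5000.0],
})

# How many records are there, and are any values missing?
n_records = len(df)
null_counts = df.isnull().sum()

# How many women vs. men, and smokers vs. non-smokers?
sex_counts = df["sex"].value_counts()
smoker_counts = df["smoker"].value_counts()

# Mean vs. median: a few very large charges pull the mean upward.
mean_charge = df["charges"].mean()
median_charge = df["charges"].median()

# Median charges per group.
median_by_smoker = df.groupby("smoker")["charges"].median()
median_by_children = df.groupby("children")["charges"].median()
median_by_region = (
    df.groupby("region")["charges"].median().sort_values(ascending=False)
)

print("records:", n_records)
print("nulls per column:", null_counts.to_dict())
print("sex counts:", sex_counts.to_dict())
print("smoker counts:", smoker_counts.to_dict())
print("mean vs. median charge:", round(mean_charge, 2), median_charge)
print(median_by_smoker.round(2))
print(median_by_children.round(2))
print(median_by_region.round(2))
```

Sorting the regional medians in descending order puts the region with the highest median charges first, which makes that question easy to answer at a glance.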

A good start!

Thanks a lot for the feedback, you really are a super user!
