First attempt at US medical insurance project


Hi, I am currently 45% through my Data Science foundation. This is the link to my first stab at the US medical project.

I found the project quite good: I was able to practice most of the techniques I have learnt. However, I am still a little confused about the pivot table method.

Any feedback anyone can give would be much appreciated.

Congrats on completing the project. It’s clear you know how to write functions to arrive at descriptive stats, but since you’ve imported Pandas, why not use its built-in functions?

  • Ex:

df['age'].median()
39.0

# use .value_counts()
df['sex'].value_counts()

        count
sex
male      676
female    662

df["smoker"].value_counts()

        count
smoker
no       1064
yes       274

df[["age", "sex", 'charges']].groupby('sex').median().round(2)

         age  charges
sex
female  40.0  9412.96
male    39.0  9369.62

  • When you used the .describe() method, you could see that there are some extreme max and min values in the charges column that pull the mean, so perhaps the median is a better stat to look at for charges. Just something to consider.

Here’s the record with the max charges:

print(df[df.charges == df.charges.max()])

    age     sex    bmi  children smoker     region      charges
543   54  female  47.41         0    yes  southeast  63770.42801
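
To make the mean vs. median point concrete, here is a minimal sketch. The numbers are made up for illustration and are not from the insurance dataset; the names `charges`, `mean_charge` and `median_charge` are just placeholders.

```python
import pandas as pd

# Made-up charges (not from the insurance dataset):
# four ordinary values and one very large one.
charges = pd.Series([2000.0, 3000.0, 4000.0, 5000.0, 60000.0])

mean_charge = charges.mean()      # pulled upward by the single large value
median_charge = charges.median()  # middle value; the size of the outlier does not matter

print(mean_charge, median_charge)  # 14800.0 4000.0
```

One large record moves the mean a long way, while the median stays at the middle value, which is why the median may be the safer summary for a skewed column like charges.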

  • I think your average charges for smokers are a bit off; you might want to revisit that.
    This is what I got:
df[["smoker", 'charges']].groupby('smoker').mean().round(2)

         charges
smoker
no       8434.27
yes     32050.23

# or, median:
df[["smoker", 'charges']].groupby('smoker').median().round(2)

         charges
smoker
no       7345.41
yes     34456.35

  • It’s a good idea that you separated out the smokers from the dataset and analyzed it from there.

  • Also, it’s good that you looked at region and its potential effect, if any, on charges.
    Pandas might help here again:

df.groupby(['region', 'sex', 'smoker'])['charges'].median().round(2)

                           charges
region     sex     smoker
northeast  female  no      8681.14
                   yes    22331.57
           male    no      8334.46
                   yes    33993.37
northwest  female  no      7731.86
                   yes    28950.47
           male    no      6687.44
                   yes    26109.33
southeast  female  no      7046.72
                   yes    35017.72
           male    no      6395.95
                   yes    38282.75
southwest  female  no      7348.14
                   yes    34166.27
           male    no      7318.96
                   yes    35585.58
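
Since the pivot table method came up as a sticking point, here is a rough sketch of how it relates to groupby. The `sample` DataFrame is a tiny made-up one that only borrows the dataset's column names, so the numbers mean nothing; `table` and `same` are placeholder names.

```python
import pandas as pd

# Tiny made-up sample that reuses the dataset's column names.
sample = pd.DataFrame({
    "region":  ["northeast", "northeast", "southeast",
                "southeast", "southeast", "northeast"],
    "smoker":  ["no", "yes", "no", "yes", "yes", "no"],
    "charges": [1000.0, 20000.0, 2000.0, 30000.0, 40000.0, 3000.0],
})

# Rows = region, columns = smoker, cells = median of charges.
table = pd.pivot_table(
    sample,
    values="charges",
    index="region",
    columns="smoker",
    aggfunc="median",
)

# The same numbers via groupby; unstack() moves 'smoker' into the columns.
same = sample.groupby(["region", "smoker"])["charges"].median().unstack()

print(table)
```

One way to think about it: a pivot table is a groupby whose last grouping key is spread across the columns instead of stacked in the index, which can be easier to read when comparing smokers and non-smokers side by side.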

  • Basically, for this EDA you’re trying to discern which variables, if any, have an effect on charges, or whether there is any correlation (keeping in mind that correlation ≠ causation). You did a good job looking at smoking: one of the first questions asked when applying for insurance is “Are you a smoker?”, and rates for smokers are higher. Maybe add a couple of sentences at the beginning of the notebook about the data set (if you know its source, state it there) and what you want to look at in the EDA. Additionally, wrap up your thoughts, findings, and next steps in a brief conclusion at the end of the notebook.
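
If you want a quick numeric look at correlation, one possible approach is sketched below. It uses a small made-up DataFrame rather than the real data, and `smoker_flag` is a hypothetical helper column introduced here just to turn yes/no into 1/0.

```python
import pandas as pd

# Made-up rows that reuse the dataset's column names.
sample = pd.DataFrame({
    "age":     [25, 40, 55, 30, 45, 60],
    "smoker":  ["no", "no", "no", "yes", "yes", "yes"],
    "charges": [1000.0, 2000.0, 3000.0, 20000.0, 30000.0, 40000.0],
})

# Encode smoker as 1/0 so it can take part in a correlation
# ('smoker_flag' is a hypothetical helper column).
sample["smoker_flag"] = (sample["smoker"] == "yes").astype(int)

# Pearson correlation between the numeric columns.
corr = sample[["age", "smoker_flag", "charges"]].corr()

# How each variable correlates with charges.
print(corr["charges"].round(2))
```

A high value here only says the two columns move together in this sample; it does not say why, which is the correlation ≠ causation caveat.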

Good job! :panda_face:
