U.S. Medical Insurance Costs

Hello!
Yeah, this project had me stumped, though not the coding portion; I get all that. I’m trying to figure out how these skills are used irl. Anyway, I’ll keep going. It took me a few days, off and on, to really get going, mostly because of my lack of motivation. What was I supposed to be showcasing? Averages? How many people did what? Median ranges, or finding and eliminating outliers? I didn’t really know. I kinda just went and did a bunch of averages, figured out what the base cost was for everything, and checked whether BMI actually affected the cost. I noticed there was this comment about it: “One of the columns contains BMI data. While insurance companies do use BMI in their calculations, and that is reflected in this project, BMI is not necessarily an accurate predictor of health.” I focused on this too much, I believe; I wanted to see if it was a factor or not. Turns out it is a constant increase of 1.39 in cost per 0.1 increase in BMI. Weird.
Anyway, it was ok.

Here is the link, even though it doesn’t seem like anyone looks at these.

Congrats on completing the project. :partying_face:

Basically, with this project (which you will revisit after learning Pandas, etc.) you’re doing EDA and trying to see if there are any correlations between variables in the sample data. All these skills are used by data analysts and scientists: obtaining data, cleaning it, exploring it, extracting insights from it, and presenting your findings in a way that speaks to both technical and non-technical folks (i.e., knowing your audience).
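
For the BMI question in particular, one rough way to check whether two columns move together is a Pearson correlation plus a least-squares slope. Here is a minimal sketch in plain Python. The numbers are made up for illustration (they are not from the project's csv), and the function names are hypothetical, not anything from the notebook.

```python
from math import sqrt

# Toy values invented for illustration; NOT taken from the project's csv.
bmi = [22.0, 25.5, 27.1, 30.4, 33.0, 36.2]
charges = [4200.0, 5100.0, 6900.0, 8800.0, 9100.0, 12500.0]


def pearson_r(xs, ys):
    """Pearson correlation: +1 is a perfect rising line, 0 is no linear link."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def least_squares_slope(xs, ys):
    """Slope of the best-fit line: average change in y per 1.0 change in x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov / var_x


slope = least_squares_slope(bmi, charges)
print(f"correlation: {pearson_r(bmi, charges):.3f}")
print(f"change in charges per 1.0 BMI: {slope:.2f}")
print(f"change in charges per 0.1 BMI: {slope / 10:.2f}")
```

A single slope over all rows ignores everything else in the data (age, smoker status, and so on), so any "cost per BMI point" it gives is a rough summary rather than a rule.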

An aside: Nice to see they changed the wording for that project’s BMI column. (Perhaps based on my objections to it being in the data at all).

Anyway, some thoughts:

  • You have a solid grasp of writing the code to dig through the data. Though, you might want to re-work some of the functions (see third comment below).

  • You’re clear with your intentions and use comments well (so anyone looking at the notebook will see your thought processes).

  • I did this project a while back and there were only 1338 rows in the data. (So, I wonder if this is an updated data set?) News to me! :joy: I was just wondering how you arrived at 10704 customers when there are only 1338 rows (1339 lines counting the header) in the csv file you used?

See:

df.info()
>>>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)

#rows, cols:
df.shape
>>(1338, 7)
  • One thing to remember, which you also mentioned, is the presence of outliers in the data. Rather than calculating the mean of charges, it might be better to look at the median, because outliers skew the mean.

Ex:

df['charges'].mean()
>>13270.42

df['charges'].median()
>> 9382.033

Further:

df[['age', 'sex', 'charges']].groupby('sex').median()
>>
         age  charges
sex
female  40.0  9412.96
male    39.0  9369.61

  • And for the regional calculations, it might be better to look at the median.
    Ex:
df[['region', 'charges']].groupby('region').median()

>>
            charges
region
northeast  10057.65
northwest   8965.79
southeast   9294.13
southwest   8798.59

#as opposed to:

df[['region', 'charges']].groupby('region').mean()

>>
                charges
region
northeast  13406.384516
northwest  12417.575374
southeast  14735.411438
southwest  12346.937377
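
To see why the two numbers drift apart, here is a small sketch using only the standard library. The charges below are made up (a handful of modest bills plus two very large ones), and the 1.5 × IQR cutoff is just one common convention for flagging outliers, not something the project requires.

```python
from statistics import mean, median, quantiles

# Made-up charges for illustration: mostly modest bills plus two huge ones.
charges = [1800.0, 2500.0, 3200.0, 4100.0, 5600.0, 6400.0,
           7300.0, 8100.0, 9400.0, 11200.0, 48000.0, 62000.0]

# The two large bills drag the mean up; the median barely notices them.
print(f"mean:   {mean(charges):.2f}")
print(f"median: {median(charges):.2f}")

# Flag values more than 1.5 * IQR beyond the first or third quartile.
q1, _, q3 = quantiles(charges, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [c for c in charges if c < lower or c > upper]
print(f"outliers: {outliers}")
```

Whether to drop the flagged values or just report the median alongside the mean depends on the question you're answering; in insurance data the big bills are often real customers, not errors.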

Just something to think about when you’re looking through a data set. :slight_smile:


Hey! Thank you, I didn’t think anybody actually looked at these, lol. But yeah, you mentioned that there was an obscene number of customers, and there was, which is odd. So I went back, restarted the kernel, and reran all the cells. Somewhere along the line a lot of extra entries had been added; I’m not sure where, but after rerunning everything the count is back to 1338.


# charges_total() is defined earlier in the notebook (not shown here)
def average_insurance_cost(charges):
    return round(charges_total(charges) / len(charges), 2)

average_insurance_cost(charges)

print(f'''
The average cost of insurance is ${average_insurance_cost(charges)} from {len(charges)} customers. This does not
take into account the outliers.
''')
The average cost of insurance is $13270.42 from 1338 customers. This does not
take into account the outliers.

But again, thank you for the feedback. You don’t know how much it means! Really!


Okay, cool. Whew.
I kept looking at that function, thinking, ‘no, that’s correct… But how is the number so high?’ :joy::woman_facepalming:t2:
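
For what it's worth, 10704 is exactly 8 × 1338, which is what you'd get if the same 1338 values were appended to a list eight times. This is only a guess at what happened, but a notebook cell that appends to a list created in an earlier cell will do that every time it is rerun. A minimal sketch, with hypothetical names and placeholder values standing in for the csv data:

```python
# Placeholder for the 1338 values read from the csv (values are made up).
rows_from_csv = [100.0] * 1338

charges = []  # imagine this line sits in an earlier cell that ran only once


def load_cell():
    """Stands in for a notebook cell that appends without resetting the list."""
    for value in rows_from_csv:
        charges.append(value)


# Rerunning the "cell" eight times while experimenting keeps growing the list.
for _ in range(8):
    load_cell()
inflated_count = len(charges)
print(inflated_count)


def load_cell_safely():
    """Builds a fresh list on every run, so reruns cannot inflate the count."""
    return list(rows_from_csv)


charges = load_cell_safely()
charges = load_cell_safely()
print(len(charges))
```

Restarting the kernel and running all cells top to bottom, as you did, clears that kind of leftover state.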
