What is with these strata? - My Medical Insurance Portfolio Project

Hi all,

Link to the repository: GitHub - aaaaaaaana/codecademy_insurance_costs_project: This exploratory data analysis looks at correlations between different person characteristics and their insurance cost.

I am posting my work on the medical insurance project. It was a fun experience and I got to make use of some topics learned in the Codecademy courses. This work took me about one evening, let’s say 4-5h. I don’t remember exactly.

I took the chance to experiment with graphical representations, since I find this the easiest and most fun way to approach exploratory data analysis. I did find an interesting feature in the data, but I couldn’t work out what was causing it.
If anyone here has an idea please leave a reply!

Also, any feedback on the project is welcome!

Cheers


Some thoughts after review:

  • You imported Pandas, so why not just use that to load the dataset and to inspect it, rather than the csv library? (It’s great that you know how to write functions though.) You could look at the first 5 rows (or 10 like you’ve done) or so and that way, you won’t have a huge wall of text (1338 rows of data) for people to scroll through to get to the rest of your analysis. I almost missed the part where you did use Pandas b/c I was constantly scrolling to get to the next cell in the notebook.
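As a minimal sketch of what loading and inspecting with Pandas could look like: the inline text below stands in for the real CSV so the snippet runs on its own, and the rows are copied from the outputs further down in this review. With the real file you would pass its path to read_csv instead (the name insurance.csv in the comment is an assumption).

```python
import io

import pandas as pd

# A few rows copied from the outputs in this thread, standing in for the
# real CSV file so the snippet is self-contained.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
54,female,47.41,0,yes,southeast,63770.42801
45,male,30.36,0,yes,southeast,62592.87309
52,male,34.485,3,yes,northwest,60021.39897
31,female,38.095,1,yes,northeast,58571.07448
33,female,35.53,0,yes,northwest,55135.40209
60,male,32.8,0,yes,southwest,52590.82939
28,male,36.4,1,yes,southwest,51194.55914
18,male,23.21,0,no,southeast,1121.8739
"""

# With the real data this would be something like
# pd.read_csv("insurance.csv")  (file name assumed).
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.head())   # first 5 rows only, instead of the whole file
print(df.shape)    # (number of rows, number of columns)
```

head() takes an optional row count, so df.head(10) would show ten rows like in your notebook.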

  • You’re looking at potential correlations here with this EDA (descriptive stats). You haven’t done any significance testing (inferential stats). So, in your readme file, “This exploratory data analysis looks at correlations between…” should say “potential correlations”. Same w/your notebook. And, remember: correlation ≠ causation.

  • You could do a df.info() and see the types of data:

RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)

It also might be better to keep age and children as int64 dtype. It’s up to you. I guess the object dtype is a little slower, and if you’re worried about your computer’s RAM, then change it. See here.
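One possible reading of "change it" is converting the text columns to the category dtype. This is only a sketch under that assumption, run on a few rows copied from the outputs in this thread, so the memory numbers it prints say nothing about the full 1338-row file.

```python
import pandas as pd

# Small sample built from rows shown in this thread (not the full dataset).
df = pd.DataFrame({
    "age": [54, 45, 52, 31, 33, 60, 28, 18],
    "sex": ["female", "male", "male", "female", "female", "male", "male", "male"],
    "bmi": [47.41, 30.36, 34.485, 38.095, 35.53, 32.8, 36.4, 23.21],
    "children": [0, 0, 3, 1, 0, 0, 1, 0],
    "smoker": ["yes", "yes", "yes", "yes", "yes", "yes", "yes", "no"],
    "region": ["southeast", "southeast", "northwest", "northeast",
               "northwest", "southwest", "southwest", "southeast"],
    "charges": [63770.42801, 62592.87309, 60021.39897, 58571.07448,
                55135.40209, 52590.82939, 51194.55914, 1121.8739],
})

# Convert only the text columns; age and children stay as integers.
df = df.astype({"sex": "category", "smoker": "category", "region": "category"})

print(df.dtypes)
# Per-column memory in bytes; deep=True also counts the strings themselves.
print(df.memory_usage(deep=True))
```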

  • Before you jump into plots (which are great, btw!), it might be beneficial for you to explore some basic descriptive stats about the data (because that’s what EDA is). Ask questions like:

    • What’s the mean/median age?

    • How many women vs. men are in the data?

    • What are the regions represented?

    • How many smokers vs. non-smokers?

    • Where do the smokers live? Are those costs different between regions?

    • etc.

df['charges'].mean().round(2)
13270.42

df['charges'].median()
9382.03
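The other questions in the list can be answered in much the same way. Here is a rough sketch using value_counts(), unique() and groupby(), on a handful of rows copied from the outputs in this thread. It is a biased mini-sample (mostly the most expensive rows), so the numbers are only there to show the calls, not to describe the real data.

```python
import pandas as pd

# Small sample built from rows shown in this thread (not the full dataset).
df = pd.DataFrame({
    "age": [54, 45, 52, 31, 33, 60, 28, 18],
    "sex": ["female", "male", "male", "female", "female", "male", "male", "male"],
    "bmi": [47.41, 30.36, 34.485, 38.095, 35.53, 32.8, 36.4, 23.21],
    "children": [0, 0, 3, 1, 0, 0, 1, 0],
    "smoker": ["yes", "yes", "yes", "yes", "yes", "yes", "yes", "no"],
    "region": ["southeast", "southeast", "northwest", "northeast",
               "northwest", "southwest", "southwest", "southeast"],
    "charges": [63770.42801, 62592.87309, 60021.39897, 58571.07448,
                55135.40209, 52590.82939, 51194.55914, 1121.8739],
})

# Mean/median age
print(df["age"].mean(), df["age"].median())

# How many women vs. men, and smokers vs. non-smokers?
sex_counts = df["sex"].value_counts()
smoker_counts = df["smoker"].value_counts()
print(sex_counts)
print(smoker_counts)

# Which regions are represented?
regions = sorted(df["region"].unique())
print(regions)

# Where do the smokers live? Count rows per region/smoker combination.
print(df.groupby(["region", "smoker"]).size())

# Do costs differ between regions? Mean charges per region.
print(df.groupby("region")["charges"].mean().round(2))
```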

Another quick way to see that basic info is by using df.describe():

               age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010

The max and min of the charges column also hint at possible outliers: the max (63770.43) is far above the 75th percentile (16639.91).
Further,

print(df[df.charges == df.charges.max()])

     age     sex    bmi  children smoker     region      charges
543   54  female  47.41         0    yes  southeast  63770.42801

print(df[df.charges == df.charges.min()])

     age   sex    bmi  children smoker     region    charges
940   18  male  23.21         0     no  southeast  1121.8739

  • If you do a basic search online for ‘what affects insurance costs in the US?’, you’ll see that the variables with the most influence on costs are age, smoker status, and region. One of the first questions asked of someone when they sign up for insurance is, “Do you smoke?”

From the data:

df.sort_values('charges', ascending=False).head(7)

      age     sex     bmi  children smoker     region      charges
543    54  female  47.410         0    yes  southeast  63770.42801
1300   45    male  30.360         0    yes  southeast  62592.87309
1230   52    male  34.485         3    yes  northwest  60021.39897
577    31  female  38.095         1    yes  northeast  58571.07448
819    33  female  35.530         0    yes  northwest  55135.40209
1146   60    male  32.800         0    yes  southwest  52590.82939
34     28    male  36.400         1    yes  southwest  51194.55914

# you can do the same with min:
df.sort_values('charges', ascending=True).head()

You can do all sorts of EDA w/Pandas. :panda_face:

Good work. Keep at it.


Hi @lisalisaj
Thanks a lot for your constructive feedback! Indeed, I am rather a beginner, so I cannot say I know much about the capabilities of pandas. I imported it for a specific purpose and didn’t think to explore what else I could have used it for.

Regarding the remark about potential correlations, I can definitely see your point. I tried to stay away from making any statements about causation, but I honestly wasn’t fully aware of the need to also prove correlations (here I imagine you are referring to measures like the correlation coefficient). It was more a visual observation of points fitting a trend.

Also thank you for the resources provided and overall the effort you put into answering! I appreciate it!

Cheers


YW! :slightly_smiling_face:
Yes, correlation is the degree to which two variables are related. Pearson’s correlation coefficient is one measure of it.
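For what it’s worth, here is a rough sketch of computing Pearson’s r between age and charges, once with Pandas and once by hand from the definition. It uses only a handful of rows copied from the outputs earlier in this thread (mostly the most expensive ones), so the resulting number says nothing about the full dataset. It also gives no p-value; for significance testing you would need something like scipy.stats.pearsonr, which is not used here.

```python
import numpy as np
import pandas as pd

# Small, biased sample built from rows shown in this thread.
df = pd.DataFrame({
    "age": [54, 45, 52, 31, 33, 60, 28, 18],
    "charges": [63770.42801, 62592.87309, 60021.39897, 58571.07448,
                55135.40209, 52590.82939, 51194.55914, 1121.8739],
})

# Series.corr uses Pearson's method by default.
r_pandas = df["age"].corr(df["charges"])

# Same thing from the definition:
# r = sum((x - mean_x) * (y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
x = df["age"].to_numpy(dtype=float)
y = df["charges"].to_numpy(dtype=float)
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(round(r_pandas, 3), round(r_manual, 3))
```

On the full DataFrame, df.corr(numeric_only=True) would give the whole matrix of pairwise coefficients for the numeric columns in one go.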