Hi all,
Link to the repository: GitHub - aaaaaaaana/codecademy_insurance_costs_project: This exploratory data analysis looks at correlations between different person characteristics and their insurance cost.
I am posting my work on the medical insurance project. It was a fun experience and I got to make use of some topics learned in the Codecademy courses. This work took me about one evening, let’s say 4-5h. I don’t remember exactly.
I took the chance to experiment with graphical representations, since I find them the easiest and most fun way to approach exploratory data analysis. I did find an interesting feature in the data, but I couldn't work out what was causing it.
If anyone here has an idea please leave a reply!
Also, any feedback on the project is welcome!
Cheers
Some thoughts after review:
- You imported Pandas, so why not just use that to load the dataset and to inspect it, rather than the csv library? (It's great that you know how to write functions though.) You could look at just the first 5 rows or so (or 10 like you've done); that way, you won't have a huge wall of text (1338 rows of data) for people to scroll through to get to the rest of your analysis. I almost missed the part where you did use Pandas b/c I was constantly scrolling to get to the next cell in the notebook.
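A minimal sketch of that suggestion. The rows are copied from the output quoted in this thread and stand in for the real file; the file name `insurance.csv` in the comment is an assumption, not something taken from the repository.

```python
import io

import pandas as pd

# Stand-in for the project's CSV file: a few rows copied from the output
# quoted in this thread. With the real file you would pass its path
# (e.g. "insurance.csv", an assumed name) to pd.read_csv instead.
csv_text = """age,sex,bmi,children,smoker,region,charges
54,female,47.41,0,yes,southeast,63770.42801
45,male,30.36,0,yes,southeast,62592.87309
52,male,34.485,3,yes,northwest,60021.39897
31,female,38.095,1,yes,northeast,58571.07448
18,male,23.21,0,no,southeast,1121.8739
"""

df = pd.read_csv(io.StringIO(csv_text))

# Show only the first rows instead of printing the whole dataset.
print(df.head(3))

# The shape gives the row and column counts without displaying any data.
print(df.shape)
```

`df.head()` with no argument returns the first 5 rows, which is usually enough to check that the file loaded as expected.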
- You're looking at potential correlations here with this EDA (descriptive stats). You haven't done any significance testing (inferential stats). So, in your readme file, this: "This exploratory data analysis looks at correlations between…" should say "potential correlations". Same w/your notebook. And, remember: correlation ≠ causation.
- You could do a df.info() and see the types of data:
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
It also might be better to keep age and children as int64 dtype. It's up to you. I guess the object dtype is a little slower, and if you're worried about your computer's RAM, you could change it. See here.
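One possible way to change the object columns, sketched under the assumption that the categorical dtype is what you'd switch to (the post above doesn't say which dtype to use). The rows are copied from the output quoted in this thread.

```python
import pandas as pd

# A few rows copied from the output quoted in this thread.
df = pd.DataFrame({
    "age": [54, 45, 52, 18],
    "smoker": ["yes", "yes", "yes", "no"],
    "region": ["southeast", "southeast", "northwest", "southeast"],
    "charges": [63770.42801, 62592.87309, 60021.39897, 1121.8739],
})

# Convert the repetitive text columns to the categorical dtype.
# Each distinct label is then stored once, plus a small integer code per row.
for col in ["smoker", "region"]:
    df[col] = df[col].astype("category")

print(df.dtypes)

# Per-column memory in bytes; deep=True also counts the string contents.
print(df.memory_usage(deep=True))
```

On a dataset this small the saving is unlikely to matter in practice, so treat this as optional tidying rather than a required step.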
df['charges'].mean().round(2)
13270.42
df['charges'].median()
9382.03
Another quick way to see that basic info is by using df.describe():
              age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010
The max and min of the charges column also hint at possible outliers: the max (63770.43) is far above the 75% value (16639.91).
Further,
print(df[df.charges == df.charges.max()])
age sex bmi children smoker region charges
543 54 female 47.41 0 yes southeast 63770.42801
print(df[df.charges == df.charges.min()])
age sex bmi children smoker region charges
940 18 male 23.21 0 no southeast 1121.8739
- If you do a basic search online about 'what affects insurance costs in the US?' you'll see that the variables that have the most influence on costs are age, smoker status, and region. One of the first questions that's asked of someone when they sign up for insurance is, "Do you smoke?"
From the data:
df.sort_values('charges', ascending=False).head(7)
      age     sex     bmi  children smoker     region      charges
543    54  female  47.410         0    yes  southeast  63770.42801
1300   45    male  30.360         0    yes  southeast  62592.87309
1230   52    male  34.485         3    yes  northwest  60021.39897
577    31  female  38.095         1    yes  northeast  58571.07448
819    33  female  35.530         0    yes  northwest  55135.40209
1146   60    male  32.800         0    yes  southwest  52590.82939
34     28    male  36.400         1    yes  southwest  51194.55914
#you can do the same with min:
df.sort_values('charges', ascending=True).head()
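To compare smokers and non-smokers directly rather than by eye, a groupby is one option. This is only a sketch: the rows are the seven highest charges and the single lowest charge quoted in this thread, so the numbers it prints are not representative of the full dataset.

```python
import pandas as pd

# The seven highest-charge rows and the lowest-charge row quoted in this thread.
# This is a deliberately tiny, biased sample used only to show the call.
df = pd.DataFrame({
    "smoker": ["yes", "yes", "yes", "yes", "yes", "yes", "yes", "no"],
    "charges": [
        63770.42801, 62592.87309, 60021.39897, 58571.07448,
        55135.40209, 52590.82939, 51194.55914, 1121.8739,
    ],
})

# Summarise charges separately for each smoker status.
summary = df.groupby("smoker")["charges"].agg(["count", "mean", "median"])
print(summary)
```

Run on the full 1338-row DataFrame, the same call would give a per-group summary you could quote in the notebook next to the plots.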
You can do all sorts of EDA w/Pandas. 
Good work. Keep at it.
Hi @lisalisaj
Thanks a lot for your constructive feedback! Indeed, I am rather a beginner, so I cannot say I know much about the capabilities of pandas. I imported it for a specific purpose and didn't think to explore what else I could have used it for.
Regarding the remark about potential correlations, I can definitely see your point. I tried to stay away from making any statements about causation, but I honestly wasn't fully aware of the need to also prove correlations (here I imagine you are referring to measures like the correlation coefficient). It was more a visual observation of points fitting a trend.
Also thank you for the resources provided and overall the effort you put into answering! I appreciate it!
Cheers
YW! 
Yes, correlation is the degree to which two variables are related. Pearson's correlation coefficient is one measure of it.
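In case it helps, here is a small sketch of computing Pearson's coefficient with pandas. The rows are the handful quoted earlier in this thread (the seven highest charges plus the lowest), so the values it prints say nothing reliable about the full dataset; it only shows the calls.

```python
import pandas as pd

# Numeric columns from the rows quoted earlier in this thread.
df = pd.DataFrame({
    "age": [54, 45, 52, 31, 33, 60, 28, 18],
    "bmi": [47.41, 30.36, 34.485, 38.095, 35.53, 32.8, 36.4, 23.21],
    "charges": [
        63770.42801, 62592.87309, 60021.39897, 58571.07448,
        55135.40209, 52590.82939, 51194.55914, 1121.8739,
    ],
})

# Pearson's r between two columns (Pearson is the default method).
r = df["age"].corr(df["charges"])
print(round(r, 3))

# Pairwise Pearson coefficients for all numeric columns at once.
corr_matrix = df.corr(method="pearson")
print(corr_matrix)
```

Note that the coefficient alone is still descriptive. A significance test (a p-value for r) would be a separate step and is not covered by this sketch.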