Medical Insurance Data Analysis Project

Hi! This is my first time feeling like a project is complete enough to share it here.
(I was working on the web dev career path for over a year and recently made the switch to data analytics and I couldn’t be happier!)

I’m really proud of it but would also like some feedback on ways I could improve. For example, there is quite a lot of repeated code, because to wrap it in a function I would need to insert one variable into another variable’s name, and that didn’t seem possible. I feel like there’s got to be ways I can rework the functions and clean it up a little.
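
One common way around "putting a variable into a variable name" is to store the results in a dictionary keyed by that name, so a single function can serve every category. This is only a minimal sketch: the sample rows are made up, the column names are assumed to match the dataset, and `average_charges_by` is a hypothetical helper, not something from the notebook.

```python
from collections import defaultdict

# Made-up sample rows; in the project these would come from csv.DictReader.
records = [
    {"sex": "female", "smoker": "yes", "region": "southwest", "charges": "6000"},
    {"sex": "male", "smoker": "no", "region": "southeast", "charges": "1000"},
    {"sex": "male", "smoker": "no", "region": "southeast", "charges": "2000"},
    {"sex": "female", "smoker": "no", "region": "northwest", "charges": "3000"},
]


def average_charges_by(rows, column):
    """Return {category: mean charge} for any column, e.g. 'sex' or 'region'."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = row[column]
        totals[key] += float(row["charges"])
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# One dictionary replaces separate variables like male_average, female_average, ...
averages = {
    column: average_charges_by(records, column)
    for column in ("sex", "smoker", "region")
}

for column, result in averages.items():
    print(column, result)
```

Looking up `averages["sex"]["male"]` then plays the role that a variable named `male_average` would have played.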

The other thing I personally would change is adding some charts to better visualize the data. I’m hoping we cover that in the course soon; if not, I’ll do some personal research and update it one of these days.

Anyway…
Here it is!

Yes, this is a project that you will revisit as you progress along in the course.

It’s quite clear you understand how to write functions to pull details out of the dataset while looking for potential correlations as to what affects charges. (This could lead to doing two sample t-tests later on). I also like the level of granularity here. Solid work.

A few observations:

  • “This data set excludes children and seniors who qualify for medicare”. The data set includes ppl 18-64 b/c insurance is tied to work here (unless one doesn’t have insurance thru work and has ACA instead). Who knows if the data set includes commercial insurance holders and ACA. :woman_shrugging:

  • The EDA should read like a story: intro (data source citation, initial questions), analysis, conclusions. I would add the conclusions at the end of the notebook. But, I do see why you put observations after each set of functions. I do that too with v. brief comments in my notebook.

  • I would recommend avoiding the bmi variable altogether. It’s not an accurate measure of one’s overall health (It’s a number that was originally based on white men). It doesn’t take into account one’s bone density, muscle mass, sex & racial differences or genetics.
    More importantly, it’s best to be mindful of using subjective/biased language like, “underweight”, “healthy”, “overweight”, “obesity1”, “obesity2”, “obesity3”. If you’re going to keep the variable in your analysis, then it’s better to use numerical bin ranges rather than charged language.

  • Rather than look at the mean of the charges variable, it might be better to look at the median value b/c there are outliers in the data that pull the mean. Maybe check for min, max, mean, median of that column to see.

  • I’d avoid using words like “bias” too when analyzing the dataset. Data analysts/scientists should strive to be objective, rather than subjective in their analyses.
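
The suggestion to label BMI groups by numeric range rather than by name could be sketched as follows. The cut points 18.5, 25, 30, 35, and 40 are an assumption (they are the commonly used boundaries), and `bmi_bin` is a hypothetical helper name.

```python
from bisect import bisect_right

# Assumed cut points; each label names only the numeric range it covers.
BMI_EDGES = [18.5, 25.0, 30.0, 35.0, 40.0]
BMI_LABELS = ["< 18.5", "18.5 to < 25", "25 to < 30", "30 to < 35", "35 to < 40", ">= 40"]


def bmi_bin(bmi):
    """Map a BMI value to a neutral numeric-range label."""
    return BMI_LABELS[bisect_right(BMI_EDGES, bmi)]


for value in (17.2, 23.21, 30.36, 47.41):
    print(value, "->", bmi_bin(value))
```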

For further, off-platform practice, if you’re interested in US population demographics, the US Census website is a good place to start.

Thanks so much for your feedback! :grin:

A few questions about your few observations.

  • Thanks for helping me understand why the phrasing I use to explain the age range doesn’t fully capture what is known and isn’t known about the dataset. I’ll be sure to change that verbiage.

  • Do you have suggestions for examples / templates to base the narrative flow of the EDA around?

  • Because I’m analyzing how much insurance companies charge based on these factors, and those insurance companies do increase prices because of BMI, it is an extremely relevant variable in this analysis. I myself would fall in class 3 obesity and have been discriminated against by insurance companies. BMI is not an indicator of health for me or anybody, but denying the fact that it currently does impact fat people’s ability to access care doesn’t lead to more objective analysis. The language I used for the categories is the official labels given to the numerical bin ranges. Should I add commentary to my notebook that explains how I am approaching BMI, including the choice to refer to the numerical ranges by their official names?

  • I focused on the averages because in a data set like this all known factors besides individual insurance companies are present in the data. So I don’t feel like for these categories outliers could be disregarded. The extremes in any one category were because those same people could be traced to another. For example those with the highest charges in the general male category were because they were class 3 obese smokers. If I were to take out those outliers at the start of the analysis and focus on the median I worry I would have missed the data that pointed to men’s average prices being higher because there were more men in high price impact categories and not because there was an innate bias against men. Could you explain to me how analyzing the median could deepen my analysis? Or conclusions I came to based on average that would be more sound if I used median to back them up? I did use minimum and maximum at times, mostly for the full range at once. Would there be a benefit to looking at the min and max in specific subcategories? I do want my logical conclusions to be sound, and genuinely want to know if there is something off about my current approach and logic.

  • I used bias here to imply how much weight those that set the insurance prices put on these different categories. Not necessarily my own subjective bias. Is there a better way to quickly phrase this that I can use in place of bias in my subheadings?

Hope this isn’t too many questions. If you chose to take the time to break this down for me, I would really appreciate it. But I do already appreciate you giving my project the once over and do feel encouraged and motivated by your feedback! Thank you so much :grin:

Data analysis at its very core is supposed to be objective and based on empirical evidence, rather than subjective opinions or assumptions. You’re looking for potential correlations and possible causation. But, you’ve not yet done any statistical analysis to look for a significant difference between the means. You can research EDA as there’s a lot out there written about the process. For some reason I thought that there were articles about the process in all DA/DS courses. Perhaps I’m wrong(?).
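
As a rough idea of what a test for a difference between two means involves, here is a minimal sketch of Welch's two-sample t statistic using only the standard library. The two samples are made-up numbers, not values from the dataset, and `welch_t` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(sample_a) / len(sample_a)
    se2_b = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t, dof


# Made-up charges for two groups, for illustration only.
smoker_charges = [30000, 35000, 40000, 45000]
nonsmoker_charges = [4000, 6000, 8000, 10000]

t_stat, dof = welch_t(smoker_charges, nonsmoker_charges)
print(round(t_stat, 2), round(dof, 2))
```

Turning the statistic into a p-value needs a t distribution; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` computes both in one call.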

This is your project and these are just some observations from a stranger. In addition to BMI not being an accurate measure of one’s health, terms like “healthy”, “underweight”, “overweight”, and “obese” are charged and subjective, and should be avoided. I’ve reviewed way too many projects here that used this type of language. What is “healthy” or “overweight”? Those are not objective. Again, this is something you can research. I think it was in 2023 that the AMA adopted a policy acknowledging BMI’s limitations and suggesting that doctors not rely on this number alone as a health measurement.

As for mean vs. median: outliers can pull the mean toward them and skew the summary, which is why I mentioned looking at the median.
ex:

# this is pandas
df['charges'].describe().round(2)

         charges
count    1338.00
mean    13270.42
std     12110.01
min      1121.87
25%      4740.29
50%      9382.03
75%     16639.91
max     63770.43
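
In the summary above, the mean (13270.42) sits well above the 50% row (9382.03), which is the median. That gap is the pull toward the high-charge rows. A toy sketch with made-up numbers shows the same effect in isolation:

```python
from statistics import mean, median

# Made-up charges: five typical values, then the same list plus one extreme value.
typical = [2000, 4000, 6000, 8000, 10000]
with_outlier = typical + [60000]

print(mean(typical), median(typical))            # both are 6000
print(mean(with_outlier), median(with_outlier))  # mean 15000, median 7000.0
```

One extreme value more than doubles the mean while the median barely moves, so reporting both gives a fuller picture than either alone.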

Here’s something else that I found interesting about the dataset.
You can pull out the rows with the max & min values in a column (using pandas):

df.nlargest(5, columns=['charges'])

      age     sex     bmi  children  smoker     region      charges
543    54  female  47.410         0     yes  southeast  63770.42801
1300   45    male  30.360         0     yes  southeast  62592.87309
1230   52    male  34.485         3     yes  northwest  60021.39897
577    31  female  38.095         1     yes  northeast  58571.07448
819    33  female  35.530         0     yes  northwest  55135.40209

# so, there are 7 rows with charges greater than 50,000

df.sort_values('charges', ascending=False).head(7)

      age     sex     bmi  children  smoker     region      charges
543    54  female  47.410         0     yes  southeast  63770.42801
1300   45    male  30.360         0     yes  southeast  62592.87309
1230   52    male  34.485         3     yes  northwest  60021.39897
577    31  female  38.095         1     yes  northeast  58571.07448
819    33  female  35.530         0     yes  northwest  55135.40209
1146   60    male  32.800         0     yes  southwest  52590.82939
34     28    male  36.400         1     yes  southwest  51194.55914

#you can do the same with minimum values:

df.sort_values('charges', ascending=True).head()

      age   sex    bmi  children  smoker     region    charges
940    18  male  23.21         0      no  southeast  1121.8739
808    18  male  30.14         0      no  southeast  1131.5066
1244   18  male  33.33         0      no  southeast  1135.9407
663    18  male  33.66         0      no  southeast  1136.3994
22     18  male  34.10         0      no  southeast  1137.0110

#you can also isolate the smokers from the data set and analyze them. Though, that's a small sample of 274

smokers = df[df['smoker'] == 'yes']
smokers.head()

    age     sex    bmi  children  smoker     region     charges
0    19  female  27.90         0     yes  southwest  16884.9240
11   62  female  26.29         0     yes  southeast  27808.7251
14   27    male  42.13         0     yes  southeast  39611.7577
19   30    male  35.30         0     yes  southwest  36837.4670
23   34  female  31.92         1     yes  northeast  37701.8768

smokers.groupby(['region', 'sex'])['charges'].median().round(2)

region     sex      charges
northeast  female  22331.57
           male    33993.37
northwest  female  28950.47
           male    26109.33
southeast  female  35017.72
           male    38282.75
southwest  female  34166.27
           male    35585.58

Just other ways to investigate the data.
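
Another grouping in the same spirit speaks to the earlier question about men's averages: putting count, mean, and median side by side, first by sex alone and then by sex and smoker status, shows whether an overall gap comes from group composition. This is a sketch on a made-up toy frame; only the column names are taken from the dataset.

```python
import pandas as pd

# Made-up toy data: more male smokers than female smokers.
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female", "female"],
    "smoker": ["yes", "yes", "no", "yes", "no", "no"],
    "charges": [40000.0, 36000.0, 5000.0, 38000.0, 5000.0, 7000.0],
})

# Overall summary by sex: the male mean is much higher.
by_sex = toy.groupby("sex")["charges"].agg(["count", "mean", "median"])

# Split by smoker status as well: the gap within each subgroup is small.
by_sex_smoker = toy.groupby(["sex", "smoker"])["charges"].agg(["count", "mean", "median"])

print(by_sex)
print(by_sex_smoker)
```

In this toy frame the overall male mean is higher only because two of the three men are smokers; within the smoker and non-smoker subgroups the means are nearly the same. Running the same two groupings on the real `df` would show whether that pattern holds there.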

The Codecademy forum thread discusses a Medical Insurance Data Analysis project, focusing on Python skills like data cleaning, exploration, and visualization. Participants analyze insurance costs and identify trends, enhancing their understanding of data science concepts and practical implementation.

What is the point of posting CGPT replies, rather than actual, helpful commentary?

Please stop posting CGPT replies in this community. Posting such content adds zero value here, plus, it violates Community Guidelines.

Hasn’t posted more than once in a single day. Possibly trying to keep the daily/weekly streak alive when completing a lesson/exercise is not convenient. Just a theory.
Might even be a bot or some SEO-related activity.

Incidentally, the link to these forums has already disappeared from the Codecademy main page.

Well that stinks. :pensive: