Hi everyone,
I would like to share my version of the Medical Insurance project. This was my first Data Science project, and I had a great experience working on it, though it took longer than I expected, probably because I got stuck writing up the discussion sections and because English is not my first language.
I would appreciate your feedback on those discussion sections and a review of the code. Thank you for helping me!
Link to the repository: GitHub - mcanoff/Medical-insurance-costs: A Python-driven project that examines medical insurance costs in the United States, uncovering key demographic and behavioral factors that influence pricing.
Congrats on completing the project.
Some random observations:
- Really solid work. I also like the comments in the notebook. Anyone reading it could follow your step-by-step analysis, which is clear and supported with visualizations throughout.
- This part, ‘bmi: The Body Mass Index, which indicates whether the person is at a healthy weight or not.’ isn’t true. Using words like “healthy” and “not healthy” is biased. BMI is a controversial number and is not an indicator of one’s overall health. Historically, it was created in the 19th century by a mathematician, not a physician. It was based on data from white men and was never meant to be used as a measure of individual health. It doesn’t take into account family history, bone density, muscle mass, etc. It’s just bad statistics overall.
I do like that you further explain it here, “It is important to note, however, that BMI is not a comprehensive measure of an individual’s health. While it provides a useful framework for understanding weight categories, it does not capture the full complexity of a person’s health.”
I have reviewed so many projects where people aren’t mindful of the biased language they use with this variable. Personally, I would avoid it completely. There are other variables that could possibly correlate with charges (region, sex, smoker).
- Region (in addition to age and smoker status) actually plays a large role in how much people are charged for insurance (as does income by region, though that is not reflected in the dataset).
Some other things to consider:
df.groupby(['region', 'sex'])['charges'].median().round(2)
                    charges
region     sex
northeast  female  10197.77
           male     9957.72
northwest  female   9614.07
           male     8413.46
southeast  female   8582.30
           male     9504.31
southwest  female   8530.84
           male     9391.35
# Granted, these are just descriptive stats, so some significance testing would be needed to see whether there's any real difference between the groups' medians.
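One way to follow up on the significance-testing point is a permutation test on the difference in medians. This is only a minimal sketch, not something from the notebook: the function name `median_diff_permutation_test` and its parameters are hypothetical, and the demo data below is synthetic rather than the real insurance dataset. It assumes a DataFrame with a grouping column (such as `sex`) and a numeric `charges` column.

```python
import numpy as np
import pandas as pd


def median_diff_permutation_test(df, group_col, value_col, a, b, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in medians between groups a and b."""
    x = df.loc[df[group_col] == a, value_col].to_numpy()
    y = df.loc[df[group_col] == b, value_col].to_numpy()
    observed = np.median(x) - np.median(y)

    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    hits = 0
    for _ in range(n_perm):
        # Shuffle the pooled values and split them into two groups of the original sizes.
        shuffled = rng.permutation(pooled)
        diff = np.median(shuffled[:len(x)]) - np.median(shuffled[len(x):])
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value


# Synthetic stand-in data, NOT the real insurance dataset.
demo_rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "sex": ["female"] * 200 + ["male"] * 200,
    "charges": np.concatenate([
        demo_rng.lognormal(mean=9.0, sigma=0.5, size=200),
        demo_rng.lognormal(mean=9.0, sigma=0.5, size=200),
    ]),
})
observed, p_value = median_diff_permutation_test(demo, "sex", "charges", "female", "male")
print(f"median difference: {observed:.2f}, p-value: {p_value:.3f}")
```

With the project's own DataFrame, the same call could be run within a single region, for example on `df[df['region'] == 'northwest']`. A permutation test is used here because it compares medians directly and makes no normality assumption about charges; other tests would also be reasonable.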
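# Broken down further, male smokers have higher median charges than female smokers in every region except the northwest, while male non-smokers have slightly lower medians than female non-smokers in every region: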
df.groupby(['region', 'sex', 'smoker'])['charges'].median().round(2)
                            charges
region     sex     smoker
northeast  female  no        8681.14
                   yes      22331.57
           male    no        8334.46
                   yes      33993.37
northwest  female  no        7731.86
                   yes      28950.47
           male    no        6687.44
                   yes      26109.33
southeast  female  no        7046.72
                   yes      35017.72
           male    no        6395.95
                   yes      38282.75
southwest  female  no        7348.14
                   yes      34166.27
           male    no        7318.96
                   yes      35585.58
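To make the male/female comparison in the table above easier to read, one option is to pivot the `sex` level into columns and compute the gap directly. This is a rough sketch under assumptions: the helper name `sex_gap_by_group` and the `male_minus_female` column are made up for illustration, the column names match the ones used above, and the demo rows are synthetic, not the real data.

```python
import pandas as pd


def sex_gap_by_group(df):
    """Median charges per region/smoker group, with female and male side by side."""
    medians = (
        df.groupby(["region", "smoker", "sex"])["charges"]
        .median()
        .unstack("sex")  # move the 'sex' index level into columns
    )
    # Positive values mean the male median is higher than the female median.
    medians["male_minus_female"] = medians["male"] - medians["female"]
    return medians.round(2)


# Tiny synthetic example, NOT the real insurance dataset.
demo = pd.DataFrame({
    "region": ["north"] * 4 + ["south"] * 4,
    "sex": ["female", "female", "male", "male"] * 2,
    "smoker": ["no", "yes", "no", "yes"] * 2,
    "charges": [100.0, 300.0, 90.0, 350.0, 110.0, 320.0, 100.0, 310.0],
})
gap = sex_gap_by_group(demo)
print(gap)
```

On the real data this would be called as `sex_gap_by_group(df)`. The gap column is still descriptive only, so the same caveat about significance testing applies.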
Good work! Keep at it.