US Medical insurance project - feedback

Hi everyone,

I would like to share my version of the Medical Insurance project. This was actually my first project in Data Science, and I had a great experience working on it, though I spent more time than expected, probably because I got stuck on the discussion sections I was trying to write, and also because English is not my first language.

I would appreciate your feedback on the discussions I made and a review of the code. Thank you for helping me!

Link to the repository: GitHub - mcanoff/Medical-insurance-costs: A Python-driven project that examines medical insurance costs in the United States, uncovering key demographic and behavioral factors that influence pricing.

Congrats on completing the project. :partying_face:

Some random observations:

  • Really solid work. I also like the comments in the notebook. Anyone reading it could follow along with your step-by-step analysis. Your analysis is clear and supported with visualizations throughout.

  • This part, ‘bmi: The Body Mass Index, which indicates whether the person is at a healthy weight or not.’ isn’t true. Using words like “healthy” and “not healthy” is biased. BMI is a controversial number and is not an indicator of one’s overall health. Historically, it was created in the 19th century by a mathematician, not a physician. It was based on data from white men and was never meant to be used as a measure of individual health. It doesn’t take into account family history, bone density, muscle mass, etc. It’s just bad statistics overall.
    I do like that you further explain it here, “It is important to note, however, that BMI is not a comprehensive measure of an individual’s health. While it provides a useful framework for understanding weight categories, it does not capture the full complexity of a person’s health.”

I have reviewed so many projects where people aren’t mindful of biased language that they use with this variable. Personally, I would avoid it completely. There are other variables that could possibly correlate with charges (region, sex, smoker).

  • Region (in addition to age and smoker status) actually plays a large role in how much people are charged for insurance (as does income by region, though that is not reflected in the dataset).

Some other things to consider:

df.groupby(['region', 'sex'])['charges'].median().round(2)

region     sex        charges
northeast  female    10197.77
           male       9957.72
northwest  female     9614.07
           male       8413.46
southeast  female     8582.30
           male       9504.31
southwest  female     8530.84
           male       9391.35
#granted, this is just descriptive stats, so some significance testing would need to be done to see if there's any real difference between the group medians.
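
As a rough idea of what that significance testing could look like, here is a minimal sketch of a two-sided permutation test on the difference in medians, using only the standard library. The function name `permutation_test_median` and its parameters are my own for illustration (they are not from the notebook), and the toy lists are made up, not taken from the dataset.

```python
import random
import statistics


def permutation_test_median(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in medians.

    Returns (observed difference, approximate p-value), where the
    difference is median(group_a) - median(group_b).
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = statistics.median(group_a) - statistics.median(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values and split them into two pseudo-groups
        # of the original sizes, as if the labels carried no information.
        rng.shuffle(pooled)
        diff = statistics.median(pooled[:n_a]) - statistics.median(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Toy numbers for illustration only; they are NOT from the insurance data.
toy_a = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_b = [1.5, 2.5, 3.5, 4.5, 5.5]
observed, p_value = permutation_test_median(toy_a, toy_b, n_permutations=2000)
print(observed, p_value)
```

With the real data you would pass in the two groups you want to compare, for example the `charges` values for female and male customers within one region. A rank-based test such as `scipy.stats.mannwhitneyu` would be another reasonable option if SciPy is available.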

#men also seem to pay more in some regions (southeast, southwest) and, in most regions, if they're smokers:

df.groupby(['region', 'sex', 'smoker'])['charges'].median().round(2)

region     sex     smoker    charges
northeast  female  no        8681.14
                   yes      22331.57
           male    no        8334.46
                   yes      33993.37
northwest  female  no        7731.86
                   yes      28950.47
           male    no        6687.44
                   yes      26109.33
southeast  female  no        7046.72
                   yes      35017.72
           male    no        6395.95
                   yes      38282.75
southwest  female  no        7348.14
                   yes      34166.27
           male    no        7318.96
                   yes      35585.58
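
To make the male vs. female comparison easier to read, here is a small sketch that computes the gap (male median minus female median) for each region and smoker status. The medians are copied from the output above; the helper name `male_minus_female` is just something I made up for this example.

```python
# Median charges copied from the region/sex/smoker groupby output above.
medians = {
    ("northeast", "female", "no"): 8681.14,
    ("northeast", "female", "yes"): 22331.57,
    ("northeast", "male", "no"): 8334.46,
    ("northeast", "male", "yes"): 33993.37,
    ("northwest", "female", "no"): 7731.86,
    ("northwest", "female", "yes"): 28950.47,
    ("northwest", "male", "no"): 6687.44,
    ("northwest", "male", "yes"): 26109.33,
    ("southeast", "female", "no"): 7046.72,
    ("southeast", "female", "yes"): 35017.72,
    ("southeast", "male", "no"): 6395.95,
    ("southeast", "male", "yes"): 38282.75,
    ("southwest", "female", "no"): 7348.14,
    ("southwest", "female", "yes"): 34166.27,
    ("southwest", "male", "no"): 7318.96,
    ("southwest", "male", "yes"): 35585.58,
}


def male_minus_female(medians):
    """Return {(region, smoker): male median - female median}."""
    gaps = {}
    for (region, sex, smoker), value in medians.items():
        if sex == "male":
            female_value = medians[(region, "female", smoker)]
            gaps[(region, smoker)] = round(value - female_value, 2)
    return gaps


gaps = male_minus_female(medians)
for (region, smoker), gap in sorted(gaps.items()):
    print(f"{region:<10} smoker={smoker:<3} gap={gap:>10.2f}")
```

A positive gap means the male median is higher. With these numbers the gap is positive for smokers in three of the four regions (northwest is the exception) and negative for non-smokers in all four, so the pattern depends on smoker status.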

Good work! Keep at it. :woman_technologist: