US Medical Insurance Project Feedback

The project was not challenging for me, but since it was my first data analysis project, I think my inexperience shows. I expect to produce better, more technical analysis after completing my career path. I would appreciate any feedback. Happy coding!

Data Science Project

Your project was neatly formatted and included some interesting results. The code was clean and spelled out coherently. Your calculation of the correlation coefficient made it pretty clear that smoking had the biggest impact on insurance charges.
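For anyone following along, here is a minimal sketch of how a correlation between a yes/no column and charges could be computed with pandas. The toy rows and the smoker_flag column name are made up for illustration and are not taken from the insurance data set or from the project itself. Encoding the category as 0/1 and taking the Pearson correlation is one common approach, not necessarily the one used in the project.

```python
import pandas as pd

# Hypothetical toy rows, NOT taken from the insurance data set.
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no", "yes"],
    "charges": [30000.0, 5000.0, 7000.0, 35000.0, 6000.0, 28000.0],
})

# Encode the yes/no column as 1/0 so a Pearson correlation can be computed.
# "smoker_flag" is an illustrative column name, not one from the project.
toy["smoker_flag"] = (toy["smoker"] == "yes").astype(int)

# Series.corr defaults to the Pearson correlation coefficient.
r = toy["smoker_flag"].corr(toy["charges"])
print(round(r, 3))
```

A value close to 1 here only says that, in these made-up rows, smokers have higher charges. On real data the same call gives the strength of the linear association, not proof of cause.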

The code is perfectly fine and I like the granularity of the analysis.

Overall, be careful about making generalizations without evidence to support them. Be especially mindful of the charged language used for the BMI variable (words like “normal”, “obese”, etc.). I would disregard that variable entirely because it’s not a measure of one’s health at all (it doesn’t consider genetics, bone density, muscle mass, race and gender differences, etc.). It’s a made-up number that insurance companies use to charge people higher rates, and the medical community doesn’t really rely on it with regard to one’s health (google it).

Be mindful of assumptions like this:

  • “It shows that individuals with high BMI generally tend to have higher healthcare expenses and therefore insurance costs.”
    And, “This suggests that social and cultural factors may influence smoking habits across genders. Additionally, the high number of non-smokers in both genders may indicate that awareness of healthy living is increasing and smoking is decreasing.”

  • This part: “The age of 39 indicates that you are generally in the middle age group before applying for insurance…”
    is not true. The median age in the data set is just that: the median age of the sample population in the data.

  • In the descriptive stats part, if you checked the .min() & .max(), you’d see that there are outliers that skew the mean of the charges variable. The median is a more representative number.

  • Insurance companies charge smokers higher rates (it’s called a “tobacco rating”), which can generally increase one’s premiums by up to 50%. It also depends on where one lives:

df.groupby(['region', 'sex', 'smoker'])['charges'].median().round(2)

                             charges
region     sex     smoker
northeast  female  no        8681.14
                   yes      22331.57
           male    no        8334.46
                   yes      33993.37
northwest  female  no        7731.86
                   yes      28950.47
           male    no        6687.44
                   yes      26109.33
southeast  female  no        7046.72
                   yes      35017.72
           male    no        6395.95
                   yes      38282.75
southwest  female  no        7348.14
                   yes      34166.27
           male    no        7318.96
                   yes      35585.58
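To illustrate the earlier point about outliers and the mean, here is a small sketch with made-up numbers (not values from the insurance data set). It assumes nothing beyond pandas and shows how a single large value moves the mean while the median stays put.

```python
import pandas as pd

# Hypothetical toy values, NOT taken from the insurance data set.
toy = pd.DataFrame({"charges": [1200.0, 1500.0, 1800.0, 2100.0, 60000.0]})

# Compare the extremes with the two measures of center in one call.
summary = toy["charges"].agg(["min", "max", "mean", "median"])
print(summary)

# How far the single large value drags the mean above the median.
mean_minus_median = summary["mean"] - summary["median"]
print(mean_minus_median)
```

Running the same agg call on df['charges'] in the project would show whether the real column behaves the same way.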

Thank you for the detailed feedback. I will review what you said and try to make my project even better. I will also revise some of the crude language I used in the analysis.