The project was not challenging for me, but since it was my first data analysis project, I think my inexperience shows. I expect to do better, more technical analysis after completing my career path. I would appreciate any feedback. Happy coding!
Your project was neatly formatted and included some interesting results. The code was clean and spelled out coherently. Your calculation of the correlation coefficient made it pretty clear that smoking had the biggest impact on insurance charges.
The code is perfectly fine and I like the granularity of the analysis.
Overall, I think you should be careful about making generalizations without evidence for the claims. Be especially mindful of the charged language used with regard to the BMI variable (words like “normal”, “obese”, etc.). I would disregard that variable entirely because it’s not a measure of one’s health at all (it doesn’t consider genetics, bone density, muscle mass, race & gender differences, etc.). It’s a made-up number that insurance companies use to charge people higher rates. And the medical community doesn’t really rely on it with regard to one’s health (google it).
Be mindful of assumptions like these:
- “It shows that individuals with high BMI generally tend to have higher healthcare expenses and therefore insurance costs.” And, “This suggests that social and cultural factors may influence smoking habits across genders. Additionally, the high number of non-smokers in both genders may indicate that awareness of healthy living is increasing and smoking is decreasing.”
- This part: “The age of 39 indicates that you are generally in the middle age group before applying for insurance…” is not true. The median age in the data set is just that: the median age for the sample population in the data.
- In the descriptive stats part, if you checked the .min() & .max(), you’d see that there are outliers that pull the mean for the charges variable. The median is a more representative number.
- Insurance companies charge smokers higher rates (it’s called a “tobacco rating”), which can generally increase one’s premiums up to 50%. It also depends on where one lives:
df.groupby(['region', 'sex', 'smoker'])['charges'].median().round(2)
                            charges
region     sex     smoker
northeast  female  no       8681.14
                   yes     22331.57
           male    no       8334.46
                   yes     33993.37
northwest  female  no       7731.86
                   yes     28950.47
           male    no       6687.44
                   yes     26109.33
southeast  female  no       7046.72
                   yes     35017.72
           male    no       6395.95
                   yes     38282.75
southwest  female  no       7348.14
                   yes     34166.27
           male    no       7318.96
                   yes     35585.58
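To illustrate the earlier point about outliers pulling the mean, here is a minimal sketch. The numbers are hypothetical toy values, not rows from the insurance data set, and the 1.5 × IQR rule is just one common way to flag outliers, not something the project used. The names toy, summary, and upper_fence are made up for this example.

```python
import pandas as pd

# Hypothetical toy values, NOT taken from the insurance data set:
# five modest charges plus one very large charge acting as an outlier.
toy = pd.DataFrame({"charges": [1200.0, 1500.0, 1800.0, 2100.0, 2400.0, 45000.0]})
charges = toy["charges"]

# Compare the extremes with the two measures of center.
summary = {
    "min": charges.min(),
    "max": charges.max(),
    "mean": charges.mean(),
    "median": charges.median(),
}

# Tukey's 1.5 * IQR rule: flag values far above the upper quartile.
q1, q3 = charges.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = charges[charges > upper_fence]

print(summary)
print(outliers.tolist())
```

In this toy case a single large value drags the mean well above the median, while the median stays close to the typical charge. Running the same checks on the real charges column should show whether the same thing happens there.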
Thank you for the detailed feedback. I will review what you said and try to make my project even better. I will also disclaim some of the crude language I used in the analysis.