Project: Does gender have a causal effect on a person's insurance cost?

Hello everyone!

This is my version of the med project. My main objective was to find out whether gender has a causal effect on insurance cost. I chose this objective because, in addition to being genuinely interested in the results, finding a causal connection also requires me to answer many other side questions. This made the project diverse and interesting for me.

It was very interesting and pretty mentally challenging at times, although from a coding perspective - nothing too crazy.
Timewise, the whole project took me one day, about 7 hours in total.

Thank you for investing your time in my project - I hope you like it!

Hi @vikemarkus,

I had a look at your project and I think the code you wrote is elegant and efficient. It was straightforward and pretty easy for me as an outsider to understand what you did, and why you did it.

With regard to the statistical method you used, you can’t infer a causal relationship between gender and insurance costs this way. To be fair, the course hasn’t paid much attention to this yet at this point, but comparing group averages within a dataset isn’t enough to establish a causal relationship, even if you filter out observations so that the remaining ones are relatively similar to each other apart from their gender.

To get to your final dataset, you made sure that all the people in it are comparable to each other. Your intuition to do this makes sense to me, but it does mean that your final dataset only contains people from South East and doesn’t include anyone who is a smoker or a parent, so you’re left with only 117 of the original 1338 observations. That makes the people in your sample easier to compare to each other, but it also biases your sample. The conclusion you can draw from this is that non-smoking women from South East who don’t have children spend 21% more on average than non-smoking men from South East who don’t have children. You don’t know whether that also holds true for women in general, or whether the effect is truly causal (the correlation could exist for some reason other than gender being a causal factor).
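To make the filtering issue concrete, here is a rough sketch in plain Python. Everything in it is my own assumption rather than your actual code or data: the rows are randomly generated stand-ins, the field names (`sex`, `smoker`, `children`, `region`, `charges`) are hypothetical, and the charges are built so that they do not depend on sex at all.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

def make_row():
    """Build one synthetic person. This is NOT real insurance data."""
    smoker = random.random() < 0.2
    children = random.choice([0, 0, 1, 2, 3])
    return {
        "sex": random.choice(["female", "male"]),
        "smoker": smoker,
        "children": children,
        "region": random.choice(
            ["southeast", "southwest", "northeast", "northwest"]
        ),
        # Charges depend on smoking and children only, never on sex.
        "charges": 4000 + 20000 * smoker + 500 * children
        + random.gauss(0, 2000),
    }

rows = [make_row() for _ in range(1338)]

# Keep only non-smokers without children from one region.
subset = [
    r for r in rows
    if r["region"] == "southeast" and not r["smoker"] and r["children"] == 0
]

def mean_by_sex(data):
    """Average charges per sex for the given rows."""
    return {
        s: mean(r["charges"] for r in data if r["sex"] == s)
        for s in ("female", "male")
    }

share_kept = len(subset) / len(rows)
print(f"rows kept: {len(subset)} of {len(rows)} ({share_kept:.0%})")
print("mean charges in full data:", mean_by_sex(rows))
print("mean charges in subset:   ", mean_by_sex(subset))
```

Because sex has no effect in this made-up data, any gap between the two subset averages is pure noise, and the filter throws away most of the rows along the way.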

I would recommend looking into regression analysis, which is a group of methods that allows you to find relationships between a dependent variable and independent variables, and shows you how each of your independent variables (not just your variable of interest, but also the control variables) affects your dependent variable.

The most common and simplest type is linear regression, especially ordinary least squares (OLS) regression. It allows you to find relationships (and see whether they’re statistically significant) while controlling for confounding factors (i.e. the control variables). It still doesn’t prove a causal effect (unless you work with more advanced methods, and even then it’s very difficult with observational data), but it does allow you to use all of the observations you have instead of just a filtered subset.

If you want to do this using Python, you can do it in Jupyter using one of many packages for this (for example sklearn, statsmodels or linearmodels).
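As a starting point, an OLS regression with statsmodels could look roughly like the sketch below. Please treat it as an illustration under assumptions: the column names (`age`, `sex`, `bmi`, `children`, `smoker`, `region`, `charges`) are my guess at what your dataset looks like, and the data frame here is synthetic so that the snippet runs on its own. In your notebook you would load your own data into `df` instead.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1338

# Synthetic stand-in data (assumed column names, NOT the real dataset).
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "sex": rng.choice(["female", "male"], n),
    "bmi": rng.normal(30, 6, n),
    "children": rng.integers(0, 4, n),
    "smoker": rng.choice(["no", "yes"], n, p=[0.8, 0.2]),
    "region": rng.choice(
        ["northeast", "northwest", "southeast", "southwest"], n
    ),
})
# Made-up charges: no sex effect is built in on purpose.
df["charges"] = (
    250 * df["age"]
    + 300 * df["bmi"]
    + 500 * df["children"]
    + 20000 * (df["smoker"] == "yes")
    + rng.normal(0, 3000, n)
)

# C(...) marks a column as categorical; the other columns act as controls.
model = smf.ols(
    "charges ~ C(sex) + age + bmi + children + C(smoker) + C(region)",
    data=df,
)
result = model.fit()

# Coefficient and p-value for sex, holding the other variables constant.
sex_coef = result.params["C(sex)[T.male]"]
sex_pvalue = result.pvalues["C(sex)[T.male]"]
print(f"male vs. female difference in charges: {sex_coef:.1f}")
print(f"p-value: {sex_pvalue:.3f}")
```

The `C(sex)[T.male]` coefficient is the estimated difference in charges between men and women with the other variables held constant, and `result.summary()` prints the full table with all coefficients. It uses every row instead of a filtered subset, but it is still an association, not proof of a causal effect.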

I enjoyed looking into your work, hope this helps! :slight_smile: