Hi @vikemarkus,

I had a look at your project and I think the code you wrote is elegant and efficient. It was straightforward and pretty easy for me as an outsider to understand what you did, and why you did it.

With regard to the statistical method you used, you can't infer a causal relationship between gender and insurance costs this way. To be fair, the course hasn't paid much attention to this yet, but comparing group averages within a dataset isn't enough to establish a causal relationship, even if you filter out observations so that the remaining ones are relatively similar to each other apart from their gender.

To get to your final dataset, you made sure that all the people included in it are comparable to each other. Your intuition to do this makes sense to me, but it does mean that your final dataset only contains people from South East and doesn't include anyone who is a smoker or a parent, so you're left with only 117 observations out of the original 1338. That makes the people in your sample easier to compare to each other, but it also biases your sample. The conclusion you can draw from this is that non-smoking women from South East who don't have children spend 21% more on average than non-smoking men from South East who don't have children. You don't know whether that also holds true for women in general, or whether the effect is truly causal (there could be a correlation for some reason other than gender being a causal factor).
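To make that concrete, here is a rough sketch of the filter-then-compare approach as I understand it. The records and field names (`sex`, `smoker`, `children`, `region`, `charges`) are made up for illustration and are not taken from your code or data, so treat it as a toy version rather than your actual method.

```python
from statistics import mean

# Hypothetical records, invented for this example only.
# The real project would load all 1338 rows from its data file.
records = [
    {"sex": "female", "smoker": "no",  "children": 0, "region": "southeast", "charges": 3000.0},
    {"sex": "female", "smoker": "no",  "children": 0, "region": "southeast", "charges": 3600.0},
    {"sex": "male",   "smoker": "no",  "children": 0, "region": "southeast", "charges": 2500.0},
    {"sex": "male",   "smoker": "no",  "children": 0, "region": "southeast", "charges": 3000.0},
    {"sex": "female", "smoker": "yes", "children": 2, "region": "northwest", "charges": 21000.0},
    {"sex": "male",   "smoker": "yes", "children": 0, "region": "southeast", "charges": 19000.0},
    {"sex": "female", "smoker": "no",  "children": 1, "region": "southwest", "charges": 4200.0},
    {"sex": "male",   "smoker": "no",  "children": 3, "region": "northeast", "charges": 5100.0},
]


def is_comparable(row):
    """Keep only non-smokers without children from one region."""
    return (
        row["smoker"] == "no"
        and row["children"] == 0
        and row["region"] == "southeast"
    )


def mean_charges(rows, sex):
    """Average charges for one gender within the given rows."""
    return mean(r["charges"] for r in rows if r["sex"] == sex)


subset = [r for r in records if is_comparable(r)]

female_mean = mean_charges(subset, "female")
male_mean = mean_charges(subset, "male")

# Relative difference in average charges, women versus men, in the subset only.
relative_gap = (female_mean - male_mean) / male_mean

# Share of the original data that survives the filter.
share_kept = len(subset) / len(records)

print(f"kept {len(subset)} of {len(records)} rows ({share_kept:.0%})")
print(f"women pay {relative_gap:.0%} more on average within the subset")
```

The gap this produces only describes the rows that pass the filter. In your project that is 117 of 1338 observations, roughly 9% of the data, and everything else is ignored.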

I would recommend looking into regression analysis, which is a group of methods that allows you to find relationships between a dependent variable and independent variables, and shows you how each of your independent variables (not just your variable of interest, but also the control variables) affects your dependent variable.

The most common and simple type is linear regression, especially OLS regression. It allows you to find relationships (and see whether they're statistically significant) while controlling for confounding factors (i.e. the control variables). It still doesn't prove a causal effect (unless you work with more advanced methods, but it's very difficult with observational data), but it does allow you to use all of the observations you have instead of just a filtered subset.

If you want to do this using Python, you can do it in Jupyter with one of the many packages available (for example sklearn, statsmodels or linearmodels).
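As a starting point, here is a minimal sketch of an OLS regression with statsmodels. I'm assuming column names like `age`, `sex`, `bmi`, `children`, `smoker`, `region` and `charges`, which may not match your file, and the data below is randomly generated just so the example runs on its own. The numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (an assumption for this sketch).
# With the real data you would load your own file instead,
# for example with pd.read_csv(...).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "sex": rng.choice(["female", "male"], size=n),
    "bmi": rng.normal(30, 6, size=n),
    "children": rng.integers(0, 4, size=n),
    "smoker": rng.choice(["no", "yes"], size=n, p=[0.8, 0.2]),
    "region": rng.choice(
        ["northeast", "northwest", "southeast", "southwest"], size=n
    ),
})

# Made-up relationship between the columns and the charges,
# only there so the regression has something to estimate.
df["charges"] = (
    2000
    + 250 * df["age"]
    + 300 * df["bmi"]
    + 500 * df["children"]
    + 20000 * (df["smoker"] == "yes")
    + 1000 * (df["sex"] == "female")
    + rng.normal(0, 3000, size=n)
)

# charges is the dependent variable; sex is the variable of interest and
# the remaining columns act as controls. C(...) marks a column as categorical.
model = smf.ols(
    "charges ~ C(sex) + age + bmi + children + C(smoker) + C(region)",
    data=df,
).fit()

# Full table with coefficients, standard errors and p-values.
print(model.summary())

# Estimated difference for men relative to women (the reference category),
# holding the other variables constant.
sex_coef = model.params["C(sex)[T.male]"]
sex_pvalue = model.pvalues["C(sex)[T.male]"]
print(f"male vs. female: {sex_coef:.0f} (p = {sex_pvalue:.3f})")
```

The nice thing about this setup is that every row stays in the analysis, and the coefficient on `sex` is estimated while the other variables are held constant. It is still a correlation, though, not proof of a causal effect.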

I enjoyed looking into your work, hope this helps!