Analysis of regional influence on average medical charges

Hi all!
I would like to share my take on the US Medical Insurance Cost project.

It is my first standalone data analysis project.
I was interested in how much the region influences the insurance cost. In my analysis I tried to find the least expensive and the most expensive regions in terms of medical costs.
I found the project quite interesting, although it would be great to have a better description of the data set. For example, it is still not clear to me what the charges are: is it the money the person paid for the insurance, or the actual medical costs the insurance company paid to hospitals for that person's medical services? What do you think?

The results I got from the analysis are:

For non-smokers the result of the analysis is quite conclusive:

A non-smoking person of a given age, BMI and number of children will have the lowest insurance charges in the Southeast. However, the difference between the Southeast and the Southwest is not that big, and it is in fact difficult to establish because the Southeast is skewed by a much higher average BMI. The most expensive region is the Northwest, with charges 14% higher on average.

For smokers the results are not conclusive, because BMI has a bigger influence. It looks like the same average BMI skew in the southern regions, combined with smoking status, drives the regional prices there, and it is not feasible to describe that influence quantitatively with the given data set.

I would like to hear any comments or feedback. Let me know if you found similar or different patterns and conclusions.


The dataset is a public one from somewhere…and I’ve also seen it analyzed on Kaggle. It’s my understanding that the costs column is in regards to an individual’s annual medical costs (though, no procedures are listed which made me initially wonder if the charges were for deductibles).

IMO, another detail that would help this dataset is median income of the regions…b/c insurance costs are calculated by insurance co’s based on income levels of zip codes. But, I digress. :wink:

It looks like you have a thorough understanding of the data and how to get insights from it. Thank you for posting your thoughts here too! :slight_smile:
I like that you included explanations of what you were doing before the code cells and that you also analyzed your results. It gives any reader an idea of your mindset as you explore the data. I liked the “Goal” and “Limitations” sections and the map too! Though, it might be more useful to put Limitations at the end of the analysis, where you could talk about what you’d research further if you chose to do so.

If you wanted to do more thorough statistical tests, like a 2-tail T test, you could import SciPy, NumPy and the sqrt function from the math module to create some functions to calculate the means of two samples and then see if there’s a significant difference between them. It’s just a thought for further analysis, beyond EDA.
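In case it helps, here is a rough sketch of what such a function could look like, using only `sqrt` from `math` and the standard `statistics` module. It computes Welch's t statistic (the unequal-variance version of the two-sample t-test) and the approximate degrees of freedom. The function name `welch_t` and the sample charges are made up for illustration; they are not values from the dataset. If SciPy is installed, `scipy.stats.ttest_ind(a, b, equal_var=False)` runs the same kind of test and also returns a p-value.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (sample variance uses n - 1).
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical non-smoker charges for two regions (illustrative numbers only).
northwest = [8200.0, 9100.0, 7600.0, 8800.0, 9400.0]
southeast = [7100.0, 7900.0, 6800.0, 7500.0, 8100.0]

t_stat, dof = welch_t(northwest, southeast)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

A large absolute t value relative to the t distribution with `dof` degrees of freedom would suggest the regional means really differ; a value near zero would suggest the gap could be noise.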

There are some health insurance datasets on the US Census site if you’re interested:


Thanks a lot for your encouraging feedback! It is very important for me to understand the pros and cons of my analysis and how other people read and follow it.
I tried to use a limited set of tools, mostly those covered in the Data Scientist path so far (except matplotlib). For example, I deliberately did not use pandas, because we have not studied it yet and I was curious whether I could do decent work using only dictionaries, mimicking what pandas does myself.
I think I will return to the analysis and update my project after completing other modules in the path.
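For anyone curious what the dictionary-only approach looks like, here is a minimal sketch of a "group by region and average" step. It is not the exact code from my notebook: the function name `mean_charges_by_region`, the keys `region`, `smoker` and `charges`, and the records themselves are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical records; in practice these would come from csv.DictReader.
records = [
    {"region": "southeast", "smoker": "no", "charges": 7000.0},
    {"region": "southeast", "smoker": "no", "charges": 8000.0},
    {"region": "northwest", "smoker": "no", "charges": 9000.0},
    {"region": "northwest", "smoker": "yes", "charges": 30000.0},
    {"region": "southeast", "smoker": "yes", "charges": 35000.0},
]


def mean_charges_by_region(rows, smoker=None):
    """Average charges per region, optionally filtered by smoker status."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        # Skip rows that do not match the requested smoker status, if any.
        if smoker is not None and row["smoker"] != smoker:
            continue
        totals[row["region"]] += row["charges"]
        counts[row["region"]] += 1
    return {region: totals[region] / counts[region] for region in totals}


print(mean_charges_by_region(records, smoker="no"))
```

This is roughly what a filter followed by a group-by mean does in pandas, just spelled out with two dictionaries.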

What puzzles me about the insurance charges is that a lot of them look very similar to each other, so they look more like the price of the insurance (a calculated value) than actual medical costs.
