My US Medical Insurance Cost Project (Updated)

Here is my project:

This project was pretty easy for the most part and took me about 3 hours.

I would appreciate feedback on my queries.

Is this repo currently set to private or something similar? I’m unable to access it with that link.

Sorry I just fixed it. Does it work now?

Yup, got it now. Seems like you have some sensible conclusions in there which is good.

For feedback on the code itself-
There are quite a few similar-looking chunks of code for calculating averages. Perhaps these could be bundled into a single function or reduced in number; consider using some of the methods available on a pandas DataFrame, since you’ve added the data to one anyway. The section on data grouped by region, for example, is much cleaner and more readable, so perhaps extend that style of coding to the earlier functions.
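As a rough illustration of the single-function idea, here is a minimal sketch. The sample rows are made up, and the column names (`age`, `smoker`, `region`, `charges`) are assumptions about how the dataset is laid out, so adjust them to match the actual DataFrame. `average_charges_by` is just a hypothetical helper name.

```python
import pandas as pd

# Made-up sample rows; the column names are assumed, not taken from the project.
df = pd.DataFrame({
    "age": [19, 45, 62, 33],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "southwest", "northwest"],
    "charges": [16000.0, 8000.0, 14000.0, 20000.0],
})

def average_charges_by(data, column):
    """Return the mean of `charges` for each distinct value in `column`."""
    return data.groupby(column)["charges"].mean()

# One helper replaces several near-identical averaging loops.
by_smoker = average_charges_by(df, "smoker")
by_region = average_charges_by(df, "region")
print(by_smoker)
print(by_region)
```

The same helper works for any categorical column, so each new comparison becomes a single line.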

I think almost everything else concerns presentation, which is always tricky and very opinion-based, so consider the following comments and decide for yourself whether they’re worth implementing.

A little introduction on what the dataset is (where it’s from, layout and such) and perhaps a short description on what you’re looking into and why would greatly improve the introduction to this project.

There’s a suggestion that “older people” pay $83,000+ more than “middle-aged” people. More precise terminology might be appropriate, e.g. age bands (16-25, 25-45, 45-65, 65+ or something similar) instead of ‘older people’ (it also saves a viewer having to read your code to understand what you mean). Also double-check that value; it seems far too high for this dataset.
As an extension, is there a relation (linear, quadratic or otherwise) between age and charge? Perhaps not, but it’s worth considering if you have the time.
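The banding suggestion could be sketched with `pd.cut`, which assigns each age to a labelled band. The band edges, labels, and sample values below are purely illustrative assumptions; pick whichever bands suit the data.

```python
import pandas as pd

# Made-up ages and charges for illustration only.
ages_charges = pd.DataFrame({
    "age": [18, 24, 25, 44, 45, 64],
    "charges": [2000.0, 3000.0, 4000.0, 6000.0, 9000.0, 13000.0],
})

# Illustrative band edges; right=False makes each band include its lower edge
# and exclude its upper edge, so the bands do not overlap.
bins = [18, 25, 45, 65]
labels = ["18-24", "25-44", "45-64"]
ages_charges["age_band"] = pd.cut(
    ages_charges["age"], bins=bins, labels=labels, right=False
)

# Mean charge per band, reported with explicit band labels.
mean_by_band = ages_charges.groupby("age_band", observed=True)["charges"].mean()
print(mean_by_band)
```

Explicit labels like these let a reader see what each group means without opening the code.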

You format a number of values and limit them to a sensible number of decimal places, but others are left unformatted. It may be best to format everything you choose to display to a reasonable precision. For differences, consider adding percentage changes for easier inference.
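For the percentage-change idea, a small sketch might look like this. `describe_difference` is a hypothetical helper, and the numbers are invented rather than taken from the dataset.

```python
def describe_difference(label_a, mean_a, label_b, mean_b):
    """Report the difference between two means in absolute and percentage terms."""
    diff = mean_a - mean_b
    pct = diff / mean_b * 100  # change relative to the second group
    return f"{label_a} vs {label_b}: {diff:+,.2f} USD ({pct:+.1f}%)"

# Invented example values.
summary = describe_difference("smokers", 15000.0, "non-smokers", 12000.0)
print(summary)
```

Stating which group the percentage is relative to avoids ambiguity, since the figure changes if the baseline is swapped.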

Consider wrapping up each of your separate analyses with a conclusion or summary. Avoid directly repeating yourself, but a couple of sentences or more, depending on the length of the analysis, can help tie everything together.


So, I have conclusions placed after each code section, but it would be best to summarize at the end?

Would you suggest placing all the questions the analysis is going to answer at the beginning?

Also, if I wanted to use this terminology, would I say something like “50+ aged people”?

I’m a little biased since that’s the typical format for scientific documents, but consider how you’d introduce this as a presentation, for example. Imagine talking through this dataset with someone if you like. What point do you want to make after each section of analysis? Is there anything that needs clarification or expansion? Is there anything you think is an essential point of interest?

A summary isn’t supposed to be a word-for-word repeat of what you’ve already analysed but a take-home message. For those who sat through your presentation, what do you want to make sure they remember?

Launching immediately into the data and analysis can make it hard to follow without prior knowledge of the dataset. Without knowing the audience this is trickier, but it should still be accessible to those who don’t know the background if at all possible. If they do know it, they can always skim-read such an introduction.

The exact questions don’t always have to be added at the start, but do give your audience a reason why you did this analysis and why they should care. In some cases a summary of what you’re looking for can help answer why. Keep it short and sweet if you like, but engaging an audience is worthwhile.

I think that reads as a little more professional. Something along the lines of “Typical costs for those aged 50+ are on average $x higher than for those aged 40-50” seems both more appropriate and much easier to understand than vague terminology.


Actually yes. Excluding the beginning, my whole project was really to see what factors influence how much you are charged.

Thanks so much for your feedback!

Oh one more question. Before every code section I briefly explain how I will answer each question by describing a function. Is that part really necessary?

There’s a good reason why. Since this is on GitHub, consider adding a little detail to your README file about this too. There are some good guides online discussing how to write a good README, but consider how someone who views the repo will understand what your work is about and how best to use it.

Maybe I’m wrong, but didn’t I technically answer this question when I compared the charges of people aged 18-35 to middle-aged people and to people aged 50+?

I was talking more about an actual fit of the data. It might not be worthwhile, but at the moment three data points plus inference do not make a strong argument. Your inference may well be correct, but convince the viewer of this fact.

If you could say there’s a trend of +$300 +/- 50 in extra charges for every year of age, you’d have a much stronger argument. I’m fairly certain that’s not the case, but consider an alternative way of presenting the data before making such an inference; you want a strong argument to make your point clear. Three data points for such a wide spread of data aren’t as convincing as they could be, and they’d miss any interesting patterns (perhaps 30-35 pay more than 40-45).
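To make the "actual fit" idea concrete, here is a minimal sketch of an ordinary least-squares line together with the standard error of its slope, written with NumPy only. The ages and charges are invented toy values, not the real dataset, and `linear_fit_with_error` is a hypothetical helper name.

```python
import numpy as np

def linear_fit_with_error(x, y):
    """Least-squares fit y = intercept + slope * x; also return the slope's standard error."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    x_mean, y_mean = x.mean(), y.mean()
    sxx = np.sum((x - x_mean) ** 2)
    slope = np.sum((x - x_mean) * (y - y_mean)) / sxx
    intercept = y_mean - slope * x_mean
    residuals = y - (intercept + slope * x)
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    slope_err = np.sqrt(np.sum(residuals ** 2) / (n - 2) / sxx)
    return slope, intercept, slope_err

# Toy data for illustration only.
ages = [20, 30, 40, 50, 60]
charges = [3000.0, 5500.0, 9000.0, 11500.0, 15000.0]
slope, intercept, slope_err = linear_fit_with_error(ages, charges)
print(f"Charges change by {slope:.0f} +/- {slope_err:.0f} USD per year of age")
```

Fitting every row rather than three group averages is what lets the error on the slope say how convincing the trend really is.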

If you have time- I changed some things:

Grand so, afraid this will be the last time I check but I hope you think it has improved.

In the introductory sentence you mention the data provided, but perhaps add a little more information than that. Perhaps mention that it’s a medical insurance dataset and where it’s from, or something along those lines. If that is the first thing anyone reads on this project, then make sure they know what they’re dealing with.

You could skip a few uses of round where you’re building strings by using string formatting instead. For example, forcing a float to two decimal places in f-string style would be f'{x:.2f}', or with the format method, '{:.2f}'.format(x).
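A quick sketch of both styles side by side, using an invented value in place of a real result:

```python
average_charge = 13270.422265  # invented value for illustration

# f-string: the format spec after the colon fixes two decimal places.
with_fstring = f"Average charge: ${average_charge:.2f}"

# str.format: the same format spec inside the replacement field.
with_format = "Average charge: ${:.2f}".format(average_charge)

# Adding a comma to the spec inserts thousands separators as well.
with_separator = f"Average charge: ${average_charge:,.2f}"

print(with_fstring)
print(with_format)
print(with_separator)
```

Formatting at display time also keeps the underlying number unrounded for any later calculations.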

As far as I’m aware, there’s nothing special about the use of lambda in the DataFrame grouping, so you could just remove the lambda in this instance-

ages_charges['age_group'] = ages_charges.apply(
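The snippet above is cut off, so the following is only a sketch of the general idea with made-up names: `age_group` is a hypothetical helper and the band boundaries are assumptions. When a lambda does nothing except call one function with the same argument, the function itself can be passed to `apply`.

```python
import pandas as pd

def age_group(age):
    """Hypothetical helper: map an age to a coarse band label."""
    if age < 36:
        return "18-35"
    if age < 50:
        return "36-49"
    return "50+"

# Made-up rows for illustration.
ages_charges = pd.DataFrame({"age": [22, 41, 57], "charges": [2500.0, 7000.0, 12000.0]})

# Wrapping the call in a lambda adds nothing here...
with_lambda = ages_charges["age"].apply(lambda age: age_group(age))

# ...because passing the function directly gives the same result.
ages_charges["age_group"] = ages_charges["age"].apply(age_group)
print(ages_charges)
```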

I like the fit and I think it adds something to the analysis. For a fit like this adding the error is commonplace so perhaps include it- $4200 +/- 500 for example (your decimal accuracy is limited by the errors).

I like the image grouping ages and charges; perhaps a legend might suit better than a colour bar since there are only 3 colours.

Those percentages seem a little off to me; you might want to double-check their calculation.

Glad to see there’s an addition of a summary. Since you went to the trouble of assessing the increase of charges with age maybe include that value (ideally with errors).

So something like $4184.07 +/- 1019.66 (this is my std_err)?

Yes, that’s the kind of error I was referring to. Significant figures with errors are typically limited to only a couple of digits, so $4200 +/- 1000 would be a sensible enough representation.
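One way to automate that rounding is sketched below. `round_to_error` is a hypothetical helper, and rounding both numbers to the decimal place of the error's second significant digit is just one reasonable convention, chosen here because it reproduces the figures above.

```python
import math

def round_to_error(value, error, sig_figs=2):
    """Round `value` and `error` to the decimal place of the error's last kept significant digit."""
    if error <= 0:
        raise ValueError("error must be positive")
    exponent = math.floor(math.log10(error))  # position of the error's leading digit
    decimals = sig_figs - 1 - exponent        # negative means rounding to tens, hundreds, ...
    return round(value, decimals), round(error, decimals)

value, error = round_to_error(4184.07, 1019.66)
report = f"${value:.0f} +/- {error:.0f}"
print(report)
```

The `:.0f` display format suits errors of 10 or more; smaller errors would need decimal places shown.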

Okay, thanks. Sorry for all the questions. I just have one more. Is it common to include that error in the graph (like in a legend) or put it in the caption?

Yes, it’s fairly standard to include it somewhere with the figure. Where it goes may vary a bit depending on the purpose of the figure (a fully fledged caption might suit an essay, a short one might fit a presentation); perhaps it works best with both? Have a search for linear regression if you want some inspiration, as I don’t think there’s a strict answer to that.
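As one possible layout, here is a sketch that puts the slope and its error in the legend entry for the fit line using matplotlib. The data points and fit values are placeholders, not results from the project.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Placeholder data and fit results for illustration only.
ages = [20, 30, 40, 50, 60]
charges = [3000.0, 5500.0, 9000.0, 11500.0, 15000.0]
slope, slope_err, intercept = 300.0, 10.0, -3200.0

fig, ax = plt.subplots()
ax.scatter(ages, charges, label="Observed charges")

# The legend entry for the fit line carries the slope and its error.
fit_label = f"Fit: {slope:.0f} +/- {slope_err:.0f} USD per year of age"
ax.plot(ages, [intercept + slope * a for a in ages], color="tab:red", label=fit_label)

ax.set_xlabel("Age (years)")
ax.set_ylabel("Charges (USD)")
ax.legend()
# fig.savefig("age_vs_charges.png") would write the figure to disk.
```

The same `fit_label` string could be reused in a caption so the figure and the text always quote identical numbers.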
