Request for feedback on US Medical Insurance Cost Project

Hi there!

I just finished my first portfolio project from the Data Science career path and I am pretty satisfied with how it turned out. Of course I am still a novice, so any form of constructive (or maybe even positive) feedback is more than welcome! I really want to improve by learning from you guys!

It took me around 5-7 hours to complete, mainly because I was unable to stop combining more variables with each other. When I started this project, I was quite confused about its goal, because it is less guided than previous projects. I had some difficulty defining the scope, which resulted in me not knowing when to stop. Overall I had fun. I felt powerful when my code worked (sometimes on the first try) and I really think this project is a nice test of my understanding of the fundamentals of Python.

Enough talk, here is the link to my project!

Thanks in advance!

Very descriptive! I learned a lot going through it!
Now I see that my own project is lacking in analysis, and the way you approached it is very simple and clear. Congratulations. It gave me ideas to work with :slight_smile:


Congrats on completing the project. You do a good job of describing what you’re doing in the code cells, so ppl can follow along with your thought process, and it’s clear that you understand how to use the csv library and write functions.

I’m assuming that only the csv library has been taught thus far and not other libraries like Pandas. You’re right–there’s a lot more one can do with this data set, once more libraries are known (Seaborn, Pandas, SciPy), along with creating data vizzes and writing functions for testing a difference between the means of 2 variables.
But, what’s cool is–and you’re right–you can get lost in the data while exploring it and it is fun! :slight_smile:

From one data person to another:
One thing that’s always bothered me about this project & the subsequent ones (and this has nothing to do with you, but with the people who created these learning modules using this data set) is the assumptions that learners will make about bmi (which is a made-up, unreliable number from 100 years ago that has no indication of one’s overall health). Essentially, it’s a number that insurance companies use to charge people more $ and not an accurate measure of one’s health. So words like “obese”, “normal”, etc. are charged and personally, I’d avoid using them altogether. There’s lots of professional, medical research out there on the whole bmi controversy if one is curious about it.

I’d also be mindful of using phrases like ‘It is safe to conclude…’ (b/c you really haven’t done any sort of statistical testing here). Rather, use non-charged statements like ‘the data shows us this…’
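
To make the point about testing a bit more concrete, here is a minimal sketch of Welch’s t statistic for the difference between two group means, using only the standard library. The function name `welch_t` and the sample numbers are made up for illustration. A real analysis would also need the degrees of freedom and a p-value (for example from SciPy), so treat this as a starting point rather than a finished test.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic: difference in means divided by its estimated standard error.

    Assumes each group has at least two values (sample variance needs n >= 2).
    """
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


# Hypothetical sample values, not taken from the insurance data set.
group_a = [2, 4, 6]
group_b = [1, 2, 3]
print(round(welch_t(group_a, group_b), 3))
```

The larger the absolute value of the statistic, the less plausible it is that the two means differ only by chance, but how large is "large enough" depends on the sample sizes.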

Thanks for the kind words, toust. I’m happy I got to inspire you to delve deeper into the data!

Wow, thanks for the extensive comment, Lisa. I really appreciate that you took the time to look so closely into my project. Big thanks! Your assumptions about my Python skills are right. I’ve only been learning about Python itself. Can’t wait to start with pandas, numpy, seaborn and the like.

As for your comment on BMI: I agree with you that it isn’t the ideal statistic for one’s health. It does, however, give us a figure that can be compared between countries, which can lead to interesting comparisons. In this project specifically, I just worked with the variables given in the dataset.

The most important thing from your post that will stick with me is to be careful about how I interpret the data and draw conclusions from it (or don’t :slight_smile:). Thanks for that!


First off, amazing work! I’m new to this course, but your work has really inspired me. You have a great understanding of the code. I don’t see anything functionally wrong with it. My only suggestions concern small typos and light recommendations for improved readability, clarity and output consistency. Sorry if it seems overwhelming; I just like to clarify and give examples. Take it all with a grain of salt!

Having worked in the insurance field, I find the term “subject” a bit general. Even the term “insuree” (a person covered by insurance) includes children and spouses. The term I would recommend for this study is “policyholder”, the sole person who pays for the insurance, makes changes, and has dependents on the policy (children/spouses).

Cell 89 - Small typo in your #comments explaining the code. You use “minimal” in the comments for both the minimum and the maximum number of children. Also, the output of this cell would be consistent with the other cells if it included a string of text describing it.

print("The minimum number of kids a policyholder reported having is " + …)

print("The maximum number of kids a policyholder reported having is " + …)
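
As a rough sketch of how those two lines could be filled in: the `num_children` list below is made-up sample data and `describe_children` is a hypothetical helper, not something from your notebook.

```python
# Hypothetical sample data; the real project reads this column from the CSV file.
num_children = [0, 1, 3, 0, 5, 2]


def describe_children(children):
    """Return one descriptive sentence each for the smallest and largest child count."""
    return (
        "The minimum number of kids a policyholder reported having is "
        + str(min(children)) + ".",
        "The maximum number of kids a policyholder reported having is "
        + str(max(children)) + ".",
    )


for sentence in describe_children(num_children):
    print(sentence)
```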

Cell 118 - To make the paragraph more readable, perhaps use commas in the dollar values and a space between the values and the “USD”.

“smokers is 32050USD, nonsmokers is 8434USD.”

Change to …

“smokers is 32,050 USD, nonsmokers is 8,434 USD.”
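
If you build that sentence in code, Python’s format specification can add the commas for you. A minimal sketch, where `format_usd` is a hypothetical helper name and the two amounts are the ones quoted above:

```python
def format_usd(amount):
    """Format a number as whole dollars with thousands separators and a space before 'USD'."""
    return f"{round(amount):,} USD"


print("smokers is " + format_usd(32050) + ", nonsmokers is " + format_usd(8434) + ".")
```

Rounding to whole dollars is my assumption here; use a spec like `:,.2f` instead if you want to keep the cents.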

Cell 126 - In your sentence describing the data you write “From our 1338 subjects we 274 subjects are smokers, which is 20.5% of the population.” I believe you added the word “we” in error.

To make the output consistent with the others, perhaps move the sentence that follows the output data into a string within the print function, and round to one decimal place in your round function if you want to show 20.5% instead of 20.48%. I would also recommend changing “subjects” to “policyholders” here.


print("From our 1,338 policyholders, " + str(smoker_counter) + " are smokers, which is " + str(round(smoker_counter / len(smoker_status) * 100, 1)) + "% of the population.")

Output- “From our 1,338 policyholders, 274 are smokers, which is 20.5% of the population.”
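
Here is a self-contained sketch of the same idea that also takes the total from the data instead of hard-coding 1,338. It assumes `smoker_status` is a list of `"yes"`/`"no"` strings, which may not match how your notebook stores it, and `smoker_summary` is a hypothetical helper name.

```python
def smoker_summary(smoker_status):
    """Build the summary sentence from a list of 'yes'/'no' smoker flags (assumed encoding)."""
    smoker_counter = smoker_status.count("yes")
    total = len(smoker_status)
    percentage = round(smoker_counter / total * 100, 1)
    return (
        f"From our {total:,} policyholders, {smoker_counter:,} are smokers, "
        f"which is {percentage}% of the population."
    )


# Stand-in data with the same counts as the project: 274 smokers out of 1,338.
smoker_status = ["yes"] * 274 + ["no"] * 1064
print(smoker_summary(smoker_status))
```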

Cell 123 - “bmi_upto_25” may leave some readers thinking a bmi value of 25 may be included in that group. And “bmi_from_25” is not as intuitive as it could be. I think more intuitive definitions would be “bmi_below_25” and “bmi_25_andabove”, for example. Printing the output data as a string would make this cell consistent with others.

Cell 125 - In the very last sentence you write “the second group (BMI over 25)”. This group includes the bmi value of 25, so it would be better to write “the second group (BMI of 25 and above)”, while the first group description “(BMI below 25)” matches intuitively with my suggestion earlier (Cell 123). Also, the values in the paragraph describing the data, “13.940USD and 10.280USD”, use periods instead of commas as thousands separators, which may confuse some people. I would also suggest adding a space between the dollar amount and the “USD” to make it more readable, and perhaps clarifying that these are averages would send a clearer message.

Ex. “We can see that the second group (BMI of 25 and above) is paying more for their medical insurance than policyholders from the first group (BMI below 25), averaging around 13,940 USD and 10,280 USD, respectively.”
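
To show how the naming and the boundary fit together, here is a minimal sketch of the split. The function name and the sample values are made up; only the variable names `bmi_below_25` and `bmi_25_andabove` come from my suggestion above.

```python
def average_charges_by_bmi(bmis, charges, threshold=25):
    """Average charges for BMI below the threshold and for BMI at or above it.

    Assumes both groups end up non-empty; otherwise the division fails.
    """
    bmi_below_25, bmi_25_andabove = [], []
    for bmi, charge in zip(bmis, charges):
        if bmi < threshold:  # a BMI of exactly 25 falls into the second group
            bmi_below_25.append(charge)
        else:
            bmi_25_andabove.append(charge)
    return (
        sum(bmi_below_25) / len(bmi_below_25),
        sum(bmi_25_andabove) / len(bmi_25_andabove),
    )


# Hypothetical sample values, not taken from the insurance data set.
bmis = [22.0, 25.0, 31.5, 24.9]
charges = [9000.0, 12000.0, 15000.0, 11000.0]
below, at_or_above = average_charges_by_bmi(bmis, charges)
print(f"BMI below 25: {below:,.0f} USD, BMI of 25 and above: {at_or_above:,.0f} USD")
```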

Thank you for your kind words regarding the code in my project! Really appreciate it. I also just started with this career path and sometimes coding feels somewhat alien to me. Not with this project though; going through it felt solid and it’s nice to see how people react to it.

Also, thank you for your extensive and detailed feedback. I read it all and agree with you on most points. I did struggle with coming up with a proper name for the rows; “subjects” was the best I got :wink: