U.S. Medical Insurance Projects - Feedback please

At first, I had no idea what I needed to do. Compared to doing exercises during a course, the project helped me learn how to import data in practice and how to document more details, so that other people (or even I) can understand my notebook when reading it later (I hope so!).

I spent a day on this project, but after I had a look at a friend's project I came up with more ideas. So I took another day to analyze more and relearned how to publish it on GitHub (I forget so fast!).

Here is my link : https://github.com/Ranchana-K/US_medical_insurance_by_Ranchana_K/blob/main/us-medical-insurance-costs.ipynb

Thank you in advance for any feedback and review. I am going to review some of your projects as well.
Enjoy your coding.

Thank you for posting your project and offering to review others’ work! :slight_smile:

It’s good that you offer a brief description of the data and then the questions you want to explore in your analysis. Remember, your reader isn’t always going to be familiar with your dataset, so, a little description goes a long way.

I like that you described what you were doing before you defined each function as well. That’s helpful!

It might be a little confusing to see the number of men in each age group represented by a negative number (or maybe that's just me). Maybe tweak that.

I like how in your summary you mention things that you want to investigate further. Good work!

Some ideas to ponder… Just suggestions, really. Also, I wasn't sure where this project fell in the DS path, so I wasn't sure how much exposure you'd had to Pandas.

  • You could maybe do a df.head() at the beginning so the reader can see what the data frame looks like (or at least the first 5 rows).

  • To get the count, mean, std, min, 25%, 50%, 75% and max of a column, you can use the .describe() method in Pandas. Ex: df['ages'].describe(). That could also give you a baseline for testing whether a difference between means is statistically significant, e.g. insurance costs across age groups or for male vs. female (using scipy stats and math).

  • To get a series of counts for each unique value in a column, you can use .value_counts().
    Exs: new_df = old_df['colname'].value_counts()
    regions_df = med_ins['regions'].value_counts()
    See:
    https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.value_counts.html

  • For each one of your findings after your functions, it might be neat to create a viz using Seaborn, for example for the age groups and the numbers of children in the df.

  • When you're summing up your findings at the end, rather than using words like "majority" and "minority", say "the mean (or avg.) age was ___" and "the age groups that were not really represented in the data were ___".
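As a rough sketch of the .head(), .describe() and .value_counts() suggestions above: the rows below are made up for illustration, and the column names ('ages', 'regions', 'charges') are assumptions, so swap in whatever your notebook actually uses.

```python
import pandas as pd

# Made-up rows for illustration only; the column names are assumptions.
med_ins = pd.DataFrame({
    "ages": [19, 33, 45, 52, 27, 61],
    "sex": ["female", "male", "male", "female", "male", "female"],
    "children": [0, 1, 2, 0, 3, 1],
    "regions": ["southwest", "southeast", "northwest",
                "southeast", "northeast", "southeast"],
    "charges": [1600.0, 4400.0, 8200.0, 11000.0, 3900.0, 14000.0],
})

# First 5 rows, so the reader can see what the data frame looks like.
first_rows = med_ins.head()

# count, mean, std, min, quartiles and max for one column.
age_summary = med_ins["ages"].describe()

# Number of rows per unique value, most frequent first.
region_counts = med_ins["regions"].value_counts()

print(first_rows)
print(age_summary)
print(region_counts)
```

In a notebook you can drop the print() calls and just leave the expression as the last line of a cell to get the nicely formatted table.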

Great work! You should be proud of what you accomplished. :partying_face:

Kudos on finishing up. I like that you properly introduce your dataset; together with the summary, it makes this read like a well put together investigation, which is good. I get the impression that someone viewing this file could get a grasp of what you were doing without reading any code, and that is another useful target to meet. My only addition for this section is that you could add a little more detail on why this analysis was done (what's interesting about it, what the main goals are, what possible benefit it has).

For some general feedback-
It's up to you, but since you went to the trouble of including pandas and matplotlib, consider making more use of them. @lisalisaj lists several methods of doing so that could simplify the coding. I'd also note that Pandas has some excellent tools for grouping data that could eliminate or at least reduce some of your more complex if statements. Since you imported matplotlib I'm kinda sad there's only one figure, so maybe add a few more if you have the time :grinning:.
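To illustrate the grouping idea, here is a minimal sketch using pd.cut and groupby in place of chained if statements. The data is made up, and the column names, bin edges and labels are assumptions chosen only for the example.

```python
import pandas as pd

# Made-up rows for illustration only; the column names are assumptions.
med_ins = pd.DataFrame({
    "ages": [19, 33, 45, 52, 27, 61],
    "sex": ["female", "male", "male", "female", "male", "female"],
    "charges": [1600.0, 4400.0, 8200.0, 11000.0, 3900.0, 14000.0],
})

# Bin ages into labelled groups instead of writing if/elif tests.
# Each bin includes its right edge: (17, 29], (29, 39], (39, 49], (49, 64].
bins = [17, 29, 39, 49, 64]
labels = ["18-29", "30-39", "40-49", "50-64"]
med_ins["age_group"] = pd.cut(med_ins["ages"], bins=bins, labels=labels)

# Mean charges per age group, and number of rows per sex.
mean_cost_by_group = (
    med_ins.groupby("age_group", observed=False)["charges"].mean()
)
counts_by_sex = med_ins.groupby("sex").size()

print(mean_cost_by_group)
print(counts_by_sex)
```

The result of groupby is a Series, so calling .plot(kind="bar") on it is one quick way to get another matplotlib figure.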

Perhaps it doesn't concern you, but there are a few return statements that have very long lines. Consider taking a few tips from a style guide and cleaning this up, as it's more prone to error, harder to read and harder to edit in its current set-up. It is worthwhile considering readability in code.

I'm afraid I can't check if the following works with your data at present, but there's a neat way to concatenate strings in Python by just lining them up between some parentheses:

return (
    f"Average age: {total_age / len(ages_list) :.2f}\n"
    f"Maximum age: {max(ages_list)}\n"
    f"Minimum age: {min(ages_list)}"
    )

At present that snippet needs Python 3 for real (not integer) division and for the f-strings, but a similar trick can be done with a little alteration in Python 2. Note that instead of Decimal I used formatting placeholders; you may prefer not to.
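For anyone who wants to try the snippet above on its own, here is a self-contained version. The function name summarize_ages and the sample list are made up for the example; they are not taken from the notebook.

```python
def summarize_ages(ages_list):
    """Return a three-line text summary of a list of ages."""
    total_age = sum(ages_list)
    # Adjacent string literals inside the parentheses are joined
    # into one string, so no "+" is needed between the lines.
    return (
        f"Average age: {total_age / len(ages_list):.2f}\n"
        f"Maximum age: {max(ages_list)}\n"
        f"Minimum age: {min(ages_list)}"
    )


summary = summarize_ages([19, 33, 45, 52])
print(summary)
```

The `:.2f` placeholder rounds the average to two decimal places when the string is built, which is what replaces Decimal here.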

You mention some inferences in the summary that I think would benefit greatly from either a reference to a figure displaying the data in question or some actual statistics (that insurance cost increases with age is a valid inference, but a fit of the data would give you evidence for it).

Thank you very much for reading my project thoroughly. This project in the DS path is supposed to use Python fundamentals, but I saw some examples in the course introduction that use pandas and found them interesting. So I tried it by myself.

Your suggested functions are very useful, and so are the other comments. I'll try to play with those functions and improve my project :innocent:

Thank you very much for your review and suggestions. I’ll learn more about what I can do with matplotlib and pandas to simplify functions and use some interesting methods.

Yes, I do agree with you about the return statements. I would like to simplify them as well, but at first I didn't have any clue how to write them neatly. :blush: I'll redo them for sure when I've got time.
