US medical insurance costs - seeking feedback and improvements!

I found my take on this project pretty straightforward. However, I know what I have done is quite simplistic, so any advice for greater complexity would be appreciated. The project took me about 30 minutes to complete.

GitHub - analytical breakdown of the CSV data

Doesn’t work:

It would be better if you uploaded a Python notebook (.ipynb) to your GitHub repo.
You can create one with Jupyter or Colab.

@lisalisaj Hey! Thanks for the reply. I have added a Python notebook copy to the GitHub repo now :slight_smile:

A few thoughts:

  • It’s good that you describe your thought process in comments in each cell. Data people are also storytellers, and if someone in your audience isn’t familiar with the dataset, describing what you’re doing in the analysis helps them follow along.

  • If you wanted to share your notebook with others, they wouldn’t be able to run the cells because you’ve imported the CSV locally, from your machine.
    Instead, 1.) you could create a repo in your GitHub account (call it “csv files” or whatever) and upload the CSV there, as long as it’s 25MB or under (over that, you need a workaround to upload larger files). Then, 2.) to load the CSV into your notebook, go to the file on GitHub and view it raw, copy the web address, then 3.) add this snippet of code to your notebook (Colab or Jupyter) like so:

df = pd.read_csv("https://raw.githubusercontent.com/etc/etc/medical_insurance.csv")
  • You imported pandas, so you really don’t need to write your own functions for things it already provides as built-in methods, like df.head(), .mean(), .value_counts(), etc. Maybe that’s personal preference, but if there are built-in methods, I say use them. :slight_smile:
    Ex:
df.head()
df.columns
df.info()
df.describe(include='all')
df['column name'].mean()
df['column name'].median()
etc
or,
df['col name'].value_counts()
Ex:
df["sex"].value_counts()
male      676
female    662

df["region"].value_counts()

southeast    364
southwest    325
northwest    325
northeast    324
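
As a rough sketch of the built-in methods above: the rows below are made up for illustration (they are not taken from the real CSV), and normalize=True is an extra option that isn’t used elsewhere in this thread.

```python
import pandas as pd

# Hypothetical toy rows, invented for illustration (not the real dataset).
df = pd.DataFrame({
    "sex": ["female", "male", "male", "female", "male"],
    "smoker": ["yes", "no", "no", "no", "yes"],
    "charges": [100.0, 200.0, 300.0, 400.0, 500.0],
})

# Raw counts per category, the same call as in the examples above.
sex_counts = df["sex"].value_counts()

# normalize=True returns proportions instead of raw counts.
sex_share = df["sex"].value_counts(normalize=True)

# Built-in summary statistics on a numeric column.
mean_charges = df["charges"].mean()
median_charges = df["charges"].median()

print(sex_counts)
print(sex_share)
print(mean_charges, median_charges)
```

Proportions can be easier to compare than raw counts when the groups are close in size, as the sexes and regions are here.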

Further, if you wanted to, you could break the df down even more granularly by creating, say, a women-only df and a men-only df (or a smokers df, or one per region) like so:
.iloc is position-based, but when it is given a boolean array it pulls out the rows where the condition is True.

women_only = df.iloc[(df['sex']=='female').values]
women_only.head()

   age     sex    bmi  children smoker     region      charges
0   19  female  27.90         0    yes  southwest  16884.92400
5   31  female  25.74         0     no  southeast   3756.62160
6   46  female  33.44         1     no  southeast   8240.58960
7   37  female  27.74         3     no  northwest   7281.50560
9   60  female  25.84         0     no  northwest  28923.13692

or:
smokers = df.iloc[(df['smoker']=='yes').values]
smokers.head()

or:

southwest = df.iloc[(df['region']=='southwest').values]
southwest.head()
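
For what it’s worth, here is a small sketch on made-up rows (not the real CSV) suggesting that df[mask] and df.loc[mask] select the same rows as the .iloc version above. The groupby line is an alternative I’m assuming would suit this dataset when you only want a per-group statistic and don’t need a separate df for each group.

```python
import pandas as pd

# Hypothetical toy rows, invented for illustration (not the real dataset).
df = pd.DataFrame({
    "sex": ["female", "male", "female", "male"],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "southwest", "northwest"],
    "charges": [1000.0, 2000.0, 3000.0, 6000.0],
})

# Boolean mask: True for the rows where the condition holds.
mask = df["sex"] == "female"

women_iloc = df.iloc[mask.values]  # the style used in the post above
women_loc = df.loc[mask]           # label-based selection with the same mask
women_plain = df[mask]             # shorthand boolean indexing

# All three selections contain the same rows.
same = women_iloc.equals(women_loc) and women_loc.equals(women_plain)
print(same)

# Mean charges per smoker group, without creating one df per group.
mean_by_smoker = df.groupby("smoker")["charges"].mean()
print(mean_by_smoker)
```

Separate dataframes are still handy if you plan to run many different steps on one subgroup, so which one to use is mostly a matter of what you do next.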

Good work on finishing the project! :partying_face:

Thanks! Super useful advice!
