I found my take on this project pretty straightforward. However, I know what I have done is quite simplistic, so any advice on adding more complexity would be appreciated. The project took me about 30 minutes to complete.
GITHUB - analytical breakdown of the CSV data
Doesn’t work:
It would be better if you uploaded a Python notebook (.ipynb) to your GH repo.
Use Jupyter or Colab.
@lisalisaj Hey! Thanks for the reply. I have added a Python notebook copy to the GitHub repo now.
A few thoughts:
- It’s good that you describe your thought process in comments in each cell. Data people are also storytellers, and if someone in the audience isn’t familiar with the dataset, it’s good to describe what you’re doing in your analysis so they can follow along.
- If you wanted to share your notebook with others, they wouldn’t be able to run the cells because you’ve imported the CSV locally, from your machine.
Rather: 1.) create a repo in your GH account (call it “csv files” or whatever) and upload the file there (as long as it’s 25MB or under; over that, you have to run some sneaky code that will allow you to upload larger files). 2.) To load the CSV into your notebook, go to the file on GH, view it raw, and copy the web address. 3.) Add this snippet of code to your notebook (Colab or Jupyter), like so:
df = pd.read_csv("https://raw.githubusercontent.com/etc/etc/medical_insurance.csv")
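A minimal sketch of steps 2 and 3, in case it helps. The account name, repo name, and branch below are made-up placeholders, and the helper functions are only for illustration (they are not part of pandas or GitHub); swap in whatever address you copied from the raw view of your own file.

```python
import pandas as pd


def raw_github_url(user: str, repo: str, branch: str, path: str) -> str:
    """Build the raw-file address for a file in a public GitHub repo."""
    return f"https://raw.githubusercontent.com/{user}/{repo}/{branch}/{path}"


def load_csv_from_github(user: str, repo: str, branch: str, path: str) -> pd.DataFrame:
    """Read a CSV straight from GitHub; pd.read_csv accepts a URL as well as a local path."""
    return pd.read_csv(raw_github_url(user, repo, branch, path))


# Hypothetical values: replace with your own account, repo, and branch.
url = raw_github_url("your-username", "csv-files", "main", "medical_insurance.csv")
print(url)

# Once the file really exists at that address, anyone can run:
# df = load_csv_from_github("your-username", "csv-files", "main", "medical_insurance.csv")
```

Because the notebook then reads from a public address rather than a path on your machine, anyone who opens it can run every cell.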
- You imported pandas, so you really don’t need to write functions for things that already have built-in methods, like df.head(), .mean(), .value_counts(), etc. Maybe that’s personal preference, but if there are built-in methods, I say use them.
ex:
df.head()
df.columns
df.info()
df.describe(include='all')
df['column name'].mean()
df['column name'].median()
etc
or,
df['col name'].value_counts()
Ex:
df["sex"].value_counts()
male      676
female    662
df["region"].value_counts()
southeast    364
southwest    325
northwest    325
northeast    324
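If you want to try these methods without the full file, here is a small self-contained sketch. It only uses the five rows from the women_only.head() output shown in this thread as a stand-in for the real dataset, so the numbers it prints describe that tiny sample, not the whole CSV.

```python
import pandas as pd

# Five rows copied from the women_only.head() output in this thread,
# used only as a small stand-in for the full dataset.
df = pd.DataFrame(
    {
        "age": [19, 31, 46, 37, 60],
        "sex": ["female"] * 5,
        "bmi": [27.90, 25.74, 33.44, 27.74, 25.84],
        "children": [0, 0, 1, 3, 0],
        "smoker": ["yes", "no", "no", "no", "no"],
        "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
        "charges": [16884.92400, 3756.62160, 8240.58960, 7281.50560, 28923.13692],
    },
    index=[0, 5, 6, 7, 9],
)

# Built-in methods replace hand-written helper functions.
mean_age = df["age"].mean()
median_age = df["age"].median()
region_counts = df["region"].value_counts()

print(mean_age, median_age)
print(region_counts)
print(df.describe(include="all"))
```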
Further, if you wanted to, you could break the df out even more granularly by creating a women-only df and a men-only df (or smokers-only, or one per region), like so.
Passing a boolean array to .iloc pulls out the rows where the particular condition is met.
women_only = df.iloc[(df['sex']=='female').values]
women_only.head()
   age     sex    bmi  children smoker     region      charges
0   19  female  27.90         0    yes  southwest  16884.92400
5   31  female  25.74         0     no  southeast   3756.62160
6   46  female  33.44         1     no  southeast   8240.58960
7   37  female  27.74         3     no  northwest   7281.50560
9   60  female  25.84         0     no  northwest  28923.13692
or:
smokers = df.iloc[(df['smoker']=='yes').values]
smokers.head()
or:
southwest = df.iloc[(df['region']=='southwest').values]
southwest.head()
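For what it’s worth, the same filtering can also be written with a plain boolean mask, which many people find a little easier to read than the .iloc form. The sketch below is just one way to do it. It again uses only the five sample rows shown earlier in this thread, and the groupby line at the end is an optional extra for comparing groups without building one df per group.

```python
import pandas as pd

# Same five sample rows as shown in the women_only.head() output above.
df = pd.DataFrame(
    {
        "age": [19, 31, 46, 37, 60],
        "sex": ["female"] * 5,
        "bmi": [27.90, 25.74, 33.44, 27.74, 25.84],
        "children": [0, 0, 1, 3, 0],
        "smoker": ["yes", "no", "no", "no", "no"],
        "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
        "charges": [16884.92400, 3756.62160, 8240.58960, 7281.50560, 28923.13692],
    },
    index=[0, 5, 6, 7, 9],
)

# A boolean Series: True for every row where the condition holds.
is_smoker = df["smoker"] == "yes"

smokers = df[is_smoker]                    # plain boolean indexing
smokers_iloc = df.iloc[is_smoker.values]   # the .iloc form used in this thread
same_rows = smokers.equals(smokers_iloc)   # both select the same rows

# .loc with a condition works the same way.
southeast = df.loc[df["region"] == "southeast"]

# Optional: average charges per region in one step.
mean_charges_by_region = df.groupby("region")["charges"].mean()

print(same_rows)
print(southeast)
print(mean_charges_by_region)
```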
Good work on finishing the project!
Thanks! Super useful advice!