Hello, and thank you for reviewing my code!
I started the Data Science Path because we did a half-semester lesson on Python last year in school. I was super interested in it, so I decided to learn more. Even with that brief background, this career path has always challenged me! And this project was no exception.
This portfolio project took me about 4 days to do, working on the project on and off for about 1hr 30min each day (I get distracted sometimes lol).
I didn’t have too much trouble coding it, but I was kind of at a loss for how to analyze what my code was saying. It’s much different from my other classes. Tips on how I could have improved on that would be super helpful!
Thanks again,
mews_mochi
Do you have a .py or .ipynb file to view?
Sorry, I was trying to figure out how to share files through GitHub but I guess I messed it up.
http://localhost:8888/files/US%20Medical%20Insurance%20Proj?_xsrf=2|c37ddddb|3cef9a06bceab113e381beb8c5d2719f|1688766539
Tell me if this one works!
No, that one won’t work b/c it’s local on your computer.
In the repo on GH, you can select “add file” and then locate the .py file or the Jupyter Notebook and upload it that way.
Yep, that one worked.
Congrats on finishing the project. This is one that you’ll return to as you accumulate more Python skills.
Some thoughts:
- Good job on writing the functions and using the csv module.
- That said, there are only 1338 rows in the csv file, so I think you need to double-check the counts for men & women in the dataset. Or maybe this is an issue of re-running the code cell in the notebook. (You have: Count for female: 1324, Count for male: 1352.)
- Good use of comments and describing your thought processes as you sift through the data.
- B/c there are outliers in the data, you might want to use the median rather than the mean when looking at the charges column.
- You also might not want to print out the (lengthy) results of the columns, b/c it’s a lot to scroll through. Maybe just return the results for your own use but not show them in the notebook?
See:
df['sex'].value_counts()
male 676
female 662
df['charges'].mean().round(2)
13270.42
vs:
df['charges'].median().round(2)
9382.03
Or:
df.describe()
               age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010
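If you're not on pandas yet, here is a rough sketch of the mean vs. median point using only the standard library. The numbers in `sample_charges` are made up for illustration (they are not rows from the dataset); one deliberately large value stands in for an outlier.

```python
import statistics

# Made-up charges for illustration only (NOT taken from the real dataset).
# The last value is deliberately large to act as an outlier.
sample_charges = [1200.0, 3400.0, 4700.0, 9400.0, 63000.0]

mean_charge = statistics.mean(sample_charges)      # pulled upward by the outlier
median_charge = statistics.median(sample_charges)  # middle value, barely affected

print(f"mean:   {mean_charge:.2f}")
print(f"median: {median_charge:.2f}")
```

In this toy list the single large value drags the mean far above the median, which is the same pattern the charges column shows in the output above.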
An aside, if you don’t want to use Jupyter Notebook, look into Google Colab. It’s built like Jupyter, but your files are in your Drive. There’s a menu option in the dropdown File menu to push a copy of the notebook directly to a GH repo (you just have to select the correct repo when doing so).
Good work! Keep at it. 
Hi! Thank you so much for your input,
I just have a question about the values in the male and female datasets. When I use len() to look at the length of the two dictionaries, it adds up. Maybe I’m looking at a different part than you?
Also, for the bottom example code, are you using pandas there? I just started using pandas so I don’t know a whole lot about it, but I’m going through that part of the career course now.
You are a lifesaver,
mews_mochi
But if there are only 1338 records/rows in the data set, you can’t have 1324 women and 1352 men. It doesn’t add up. Your function is doubling the numbers, so go back and check that (indentation).
df.info()
>><class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
df["sex"].value_counts()
>>male 676
female 662
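Note that 1324 is exactly 2 × 662 and 1352 is exactly 2 × 676, so each row looks like it got counted twice. I haven't seen the exact line that causes it, so this is only a sketch of one way that can happen with the csv module (a counter that lives outside the function and keeps growing when the cell is re-run), plus a quick sanity check. The rows in `SAMPLE_CSV` and the names `count_by_sex` and `running` are made up for illustration.

```python
import csv
import io
from collections import Counter

# Made-up rows in the same column layout as the dataset (illustration only).
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
30,female,25.0,0,no,northeast,3000.00
41,male,31.5,2,no,southwest,7000.00
52,male,28.2,1,yes,southeast,24000.00
23,female,22.4,0,no,northwest,2500.00
37,male,35.1,3,no,southeast,6500.00
"""

def count_by_sex(csv_file):
    """Count rows per 'sex' value, starting from zero on every call."""
    counts = Counter()  # created inside the function, so re-runs cannot pile up
    total_rows = 0
    for row in csv.DictReader(csv_file):
        counts[row["sex"]] += 1
        total_rows += 1
    return counts, total_rows

counts, total_rows = count_by_sex(io.StringIO(SAMPLE_CSV))
# Sanity check: the group counts should add up to the number of rows.
assert sum(counts.values()) == total_rows
print(dict(counts), total_rows)

# One possible way to get doubled numbers: a counter defined outside the loop
# that is never reset, while the counting code runs twice.
running = Counter()
for _ in range(2):  # simulates running the same notebook cell two times
    for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
        running[row["sex"]] += 1
print(dict(running))  # every count is twice the real one
```

Comparing the sum of the group counts against the total row count, as the `assert` does here, catches this kind of problem right away.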
Yep, that’s 