Review my technique in the U.S. Medical Insurance Portfolio Project

The project itself wasn’t hard to “assemble”, but it surely was something else when I realized I had to come up with my own requirements.
I already had some experience with NumPy and Pandas from a previous freeCodeCamp course, so I decided to use them, since the project said they were allowed :slight_smile:
It took about 2 hours because of the brainstorming and because I didn’t have any practical experience with the libraries, but it was definitely fun.

git repo ascii file: codecademy_portfolio_projects/us-medical-insurance-costs.asciidoc at main · Tofan-afk/codecademy_portfolio_projects · GitHub
Thank you for taking the time.

A few considerations:

  • Add a brief intro at the top of the notebook including the citation for the data. Pretend you’re presenting the project to someone who knows nothing about the data set. You’re telling a story, so you’d want to have an intro, then analyze the data w/ some comments, and then a conclusion and possible next steps.

  • It might be better to look at the median, rather than the mean of charges in the data b/c there are outliers that pull the mean.
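To see why, here is a small sketch using only the standard library. The numbers are made up for illustration and are not taken from the data set: one large value drags the mean up a lot, while the median moves much less.

```python
from statistics import mean, median

# Hypothetical charges, invented for this illustration only
charges = [1200, 2500, 4700, 9400, 11000]
with_outlier = charges + [63000]  # add a single very large bill

mean_before, median_before = mean(charges), median(charges)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print("without outlier:", mean_before, median_before)
print("with outlier:   ", mean_after, median_after)
```

With these numbers the mean shifts by several thousand more than the median does, which is the effect described above.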

You could see some basic descriptive stats by using methods from the Pandas library. Stuff like: the .describe() method on the df, or, use .value_counts() on a specific column too. Don’t forget about .groupby() as well. Just a suggestion.

Ex:

df.describe()

>              age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010

#or
df['charges'].describe().round(2)
> count     1338.00
  mean     13270.42
  std      12110.01
  min       1121.87
  25%       4740.29
  50%       9382.03
  75%      16639.91
  max      63770.43
#or:
df['charges'].mean().round(2)
>13270.42

df['charges'].median().round(2)
> 9382.03

df["smoker"].value_counts()

no     1064
yes     274
#etc
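The examples above don't show .groupby(), so here is a minimal sketch of it. The tiny DataFrame is made up; only the column names smoker and charges come from the real data set, and median_by_smoker is just an illustrative variable name.

```python
import pandas as pd

# Tiny made-up frame; only the column names match the real data set
df = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no"],
    "charges": [30000.0, 4000.0, 6000.0, 40000.0, 5000.0],
})

# Median charges within each smoker group
median_by_smoker = df.groupby("smoker")["charges"].median()
print(median_by_smoker)
```

On the full data set the same one-liner would let you compare groups without writing your own loops.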
  • Are you trying to show some sort of correlation with the functions for bmi & charges and children & charges? There are no hypothesis tests or significance testing here.
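One possible way to back up a correlation claim is a Pearson coefficient plus a simple permutation test. This is only a sketch under assumptions: the bmi and charges values below are made up, the helper names pearson_r and permutation_p_value are hypothetical, and a permutation test is just one of several options (a library test such as one from SciPy would also work).

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Two-sided p-value: how often shuffled data is at least as correlated."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # +1 in numerator and denominator counts the observed ordering itself
    return (hits + 1) / (n_permutations + 1)

# Made-up values for illustration, not taken from the data set
bmi = [22, 25, 27, 30, 31, 33, 35, 38, 40, 42]
charges = [3000, 4200, 4100, 6500, 7000, 6800, 9000, 11000, 10500, 13000]

r = pearson_r(bmi, charges)
p = permutation_p_value(bmi, charges)
print(f"r = {r:.3f}, permutation p-value = {p:.4f}")
```

A small p-value here only says the made-up sample is unlikely under "no relationship"; on the real data you would run the same check on the actual columns and state the hypothesis up front.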

Good start! Keep at it. :technologist:

That’s actually really cool. Thank you for the suggestions. It’s been a long time since I had the opportunity to actually apply my Pandas skills.
