U.S. Medical Insurance Costs: Comments and reviews welcome!

This project was a lot of fun. I thought I was finished several times, but I kept learning more with each draft. I'm sure I could keep going for many more days, but I think I will call it good.

I learned a lot from fellow students, expert reviewers, and AI.

It’s great that you used all those different approaches in order to understand the data set. You cover a lot and lay it out clearly in the readme file.
I think it might be a good idea to focus on one library for the analysis. The EDA is like a presentation, and with several approaches side by side it’s difficult for the reader to see the focus. If you’re going to use Pandas, then use it throughout. It’s pretty powerful and has a ton of built-in methods for descriptive statistics, so you don’t have to write classes and functions to pull insights out of the data.

Things like:

df.head()  # returns the first 5 rows by default; pass a number to get that many rows, e.g. df.head(10)

df.describe()
>>             age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010

df.info()
>><class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
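
To give a rough idea of what the built-in methods can replace, here is a minimal sketch of two grouped summaries. It assumes a toy DataFrame that reuses three column names from the output above (smoker, region, charges); the rows are made up for illustration and are not taken from the insurance data.

```python
import pandas as pd

# Toy rows with the same column names as the insurance data; values are made up.
df = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northeast"],
    "charges": [30000.0, 2000.0, 4000.0, 20000.0, 6000.0],
})

# Number of rows per region, without writing a custom counter class.
region_counts = df["region"].value_counts()

# Median charges for each smoker group.
median_by_smoker = df.groupby("smoker")["charges"].median()

print(region_counts)
print(median_by_smoker)
```

On the real data set, the same two calls should work on the DataFrame you already loaded, since those columns show up in df.info().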

  • Rather than looking at the mean, try the median for the charges column. There are some outliers that pull the mean (about 13,270) well above the median (about 9,382).

  • Good use of comments, but I think there needs to be a conclusion: just a short blurb about what you found in the data.

  • bmi isn’t a string; it’s a float (see the df.info() output above), so you can already perform calculations on it:

df['bmi'].describe()

>>count    1338.000000
mean       30.663397
std         6.098187
min        15.960000
25%        26.296250
50%        30.400000
75%        34.693750
max        53.130000
Name: bmi, dtype: float64
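
On the mean vs. median point, here is a small sketch of why one large value matters. The numbers are made up for illustration (they are not rows from the insurance data); only the mean() and median() calls carry over.

```python
import pandas as pd

# Made-up illustrative values, not rows from the insurance data set.
charges = pd.Series([2000.0, 3000.0, 4000.0, 5000.0, 46000.0], name="charges")

mean_charge = charges.mean()      # pulled upward by the single large value
median_charge = charges.median()  # middle value, unaffected by the outlier

print(f"mean:   {mean_charge:.2f}")
print(f"median: {median_charge:.2f}")
```

On your DataFrame the equivalent calls would be df['charges'].mean() and df['charges'].median(). The 50% row in the describe() output is the median, which is why it already shows the gap.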