This project was a lot of fun. I thought I was finished several times, but I kept learning more with each draft. I am sure I could keep going for many more days, but I think I will call it good.
I learned a lot from fellow students, expert reviewers, and AI.
It’s great that you used all those different approaches in order to understand the data set. You cover a lot and lay it out clearly in the readme file.
I think it might be a good idea to focus on one library for the analysis. The EDA is like a presentation, and with several approaches mixed together it's difficult for the reader to see the focus. If you're going to use Pandas, use it throughout. It's pretty powerful and has a ton of built-in methods for descriptive statistics, so you don't have to write classes and functions to pull insights out of the data.
Things like:
df.head() # returns the first 5 rows by default; pass a number, e.g. df.head(10), to get that many rows
df.describe()
>>             age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010
df.info()
>><class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
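Rather than looking at the mean, try the median for the charges column. There are some outliers that skew the mean: in the df.describe() output above, the mean is about 13270 while the median (the 50% row) is about 9382.
Good use of comments, but I think there needs to be a conclusion: just a short blurb about what you found in the data.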
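To make the mean-versus-median point concrete, here is a minimal sketch. The five values are made up for illustration and are not taken from the data set; they only mimic the pattern of a few very large charges among many modest ones.

```python
import pandas as pd

# Hypothetical toy values, NOT from the real data set:
# four modest charges and one very large one.
charges = pd.Series([2000.0, 3000.0, 4000.0, 5000.0, 60000.0], name="charges")

mean_charge = charges.mean()      # pulled upward by the single large value
median_charge = charges.median()  # middle value, unaffected by the outlier

print(f"mean:   {mean_charge:.2f}")
print(f"median: {median_charge:.2f}")
```

On the real DataFrame the same comparison would be df['charges'].mean() against df['charges'].median().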
bmi isn't a string; it's a float (see the df.info() output above). You can already perform calculations on it:
df['bmi'].describe()
>>count 1338.000000
mean 30.663397
std 6.098187
min 15.960000
25% 26.296250
50% 30.400000
75% 34.693750
max 53.130000
Name: bmi, dtype: float64
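As a rough sketch of what "already numeric" means in practice, the snippet below checks the dtype and then does arithmetic on the column without any conversion. The three-row frame and the cutoff of 30 are stand-ins chosen for illustration, not values or thresholds from the project.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real data set.
df = pd.DataFrame({"bmi": [26.3, 30.4, 34.7]})

# Confirm the column is numeric before deciding whether to convert it.
is_numeric = pd.api.types.is_numeric_dtype(df["bmi"])

# Numeric columns support arithmetic and comparisons directly.
bmi_range = df["bmi"].max() - df["bmi"].min()
above_30 = int((df["bmi"] > 30).sum())  # 30 is an arbitrary illustrative cutoff

print(is_numeric, round(bmi_range, 2), above_30)
```

If the check ever came back False for a column that should be numeric, pd.to_numeric would be the usual way to convert it.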