Hi! This is my first time feeling like a project is complete enough to share it here.
(I was working on the web dev career path for over a year and recently made the switch to data analytics and I couldn’t be happier!)
I’m really proud of it but would also like some feedback in ways I could improve. For example there is quite a lot of repeated code because if I were to try to make a function around it I would need to insert another variable into a variable name and that didn’t seem possible. I feel like there’s got to be ways I can rework the functions and clean it up a little.
The other thing I personally would change is adding some charts to better visualize the data. I’m hoping we cover that in the course soon; if not, I’ll do some personal research and update it one of these days.
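On the repeated-code question: a common way around "putting a variable into a variable name" is to pass the column name into one function and keep the results in a dictionary keyed by that name. The sketch below is only an illustration; the records, the `average_charges_by` helper, and the `averages` dictionary are hypothetical and not taken from the notebook.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records shaped like the insurance data (values are made up).
records = [
    {"sex": "female", "smoker": "yes", "region": "southwest", "charges": 200.0},
    {"sex": "male", "smoker": "no", "region": "southeast", "charges": 100.0},
    {"sex": "male", "smoker": "yes", "region": "southeast", "charges": 400.0},
    {"sex": "female", "smoker": "no", "region": "northwest", "charges": 300.0},
]


def average_charges_by(rows, column):
    """Return {category: mean charges} for any categorical column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row["charges"])
    return {category: mean(values) for category, values in groups.items()}


# One dictionary keyed by column name stands in for separate variables
# such as sex_averages, smoker_averages, and region_averages.
averages = {
    column: average_charges_by(records, column)
    for column in ("sex", "smoker", "region")
}

print(averages["smoker"])
```

The lookup `averages["region"]` then plays the role that a separately named variable would have played, so the same function body serves every category.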
Yes, this is a project that you will revisit as you progress along in the course.
It’s quite clear you understand how to write functions to pull details out of the dataset while looking for potential correlations as to what affects charges. (This could lead to doing two sample t-tests later on). I also like the level of granularity here. Solid work.
A few observations:
“This data set excludes children and seniors who qualify for medicare”. The data set covers people aged 18-64 b/c insurance is tied to work here (unless one doesn’t have insurance through work and has ACA coverage instead). It isn’t clear whether the data set includes both commercial insurance holders and ACA enrollees.
The EDA should read like a story: intro (data source citation, initial questions), analysis, conclusions. I would add the conclusions at the end of the notebook. But, I do see why you put observations after each set of functions. I do that too with v. brief comments in my notebook.
I would recommend avoiding the bmi variable altogether. It’s not an accurate measure of one’s overall health (It’s a number that was originally based on white men). It doesn’t take into account one’s bone density, muscle mass, sex & racial differences or genetics.
More importantly, it’s best to be mindful of using subjective/biased language like, “underweight”, “healthy”, “overweight”, “obesity1”, “obesity2”, “obesity3”. If you’re going to keep the variable in your analysis, then it’s better to use numerical bin ranges rather than charged language.
Rather than look at the mean of the charges variable, it might be better to look at the median value b/c there are outliers in the data that pull the mean. Maybe check for min, max, mean, median of that column to see.
I’d avoid using words like “bias” too when analyzing the dataset. Data analysts/scientists should strive to be objective, rather than subjective in their analyses.
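On the numerical-bin suggestion: a minimal sketch of grouping BMI into numeric ranges with `pd.cut`. The BMI values and the bin edges here are assumptions chosen for illustration; the edges would need to match whatever ranges the notebook actually uses.

```python
import pandas as pd

# Illustrative BMI values; the bin edges are an assumption, not the notebook's.
bmi = pd.Series([17.2, 23.21, 27.9, 30.36, 34.1, 38.095, 47.41], name="bmi")
edges = [0, 18.5, 25, 30, 35, 40, 100]

# right=False makes each bin include its lower edge: [18.5, 25), [25, 30), ...
bmi_range = pd.cut(bmi, bins=edges, right=False)

# Count rows per numeric range, listed in bin order.
counts = bmi_range.value_counts().sort_index()
print(counts)
```

The resulting column holds interval labels rather than category names, and it can be passed to `groupby` like any other categorical column.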
Thanks for helping me understand why the phrasing I use to explain the age range doesn’t fully capture what is known and isn’t known about the dataset. I’ll be sure to change that verbiage.
Do you have suggestions for examples / templates to base the narrative flow of the EDA around?
Because I’m analyzing how much insurance companies charge based on these factors, and those insurance companies do increase prices because of BMI, it is an extremely relevant variable in this analysis. I myself would fall in class 3 obesity and have been discriminated against by insurance companies. BMI is not an indicator of health for me or anybody, but denying the fact that it currently does impact fat people’s ability to access care doesn’t lead to more objective analysis. The language I used for the categories is the official labels given to the numerical bin ranges. Should I add commentary to my notebook that explains how I am approaching BMI, including the choice to refer to the numerical ranges by their official names?
I focused on the averages because in a data set like this all known factors besides individual insurance companies are present in the data. So I don’t feel like for these categories outliers could be disregarded. The extremes in any one category were because those same people could be traced to another. For example those with the highest charges in the general male category were because they were class 3 obese smokers. If I were to take out those outliers at the start of the analysis and focus on the median I worry I would have missed the data that pointed to men’s average prices being higher because there were more men in high price impact categories and not because there was an innate bias against men. Could you explain to me how analyzing the median could deepen my analysis? Or conclusions I came to based on average that would be more sound if I used median to back them up? I did use minimum and maximum at times, mostly for the full range at once. Would there be a benefit to looking at the min and max in specific subcategories? I do want my logical conclusions to be sound, and genuinely want to know if there is something off about my current approach and logic.
I used bias here to imply how much weight those that set the insurance prices put on these different categories. Not necessarily my own subjective bias. Is there a better way to quickly phrase this that I can use in place of bias in my subheadings?
Hope this isn’t too many questions. If you choose to take the time to break this down for me, I would really appreciate it. But I do already appreciate you giving my project the once-over and do feel encouraged and motivated by your feedback! Thank you so much!
Data analysis at its very core is supposed to be objective and based on empirical evidence, rather than subjective opinions or assumptions. You’re looking for potential correlations and possible causation. But, you’ve not yet done any statistical analysis to look for a significant difference between the means. You can research EDA as there’s a lot out there written about the process. For some reason I thought that there were articles about the process in all DA/DS courses. Perhaps I’m wrong(?).
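For the significance check mentioned above, here is a rough sketch of the arithmetic behind a two-sample comparison of means (Welch's version, which does not assume equal variances). The two samples are made-up numbers, and `welch_t` is a hypothetical helper that returns only the t statistic; in practice `scipy.stats.ttest_ind(a, b, equal_var=False)` is the usual call and also returns a p-value.

```python
from math import sqrt
from statistics import mean, variance

# Made-up charges (in thousands) for two groups; not rows from the dataset.
smokers = [30.0, 34.0, 38.0, 42.0]
non_smokers = [8.0, 10.0, 12.0, 14.0]


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    # variance() is the sample variance (n - 1 in the denominator).
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


t_stat = welch_t(smokers, non_smokers)
print(round(t_stat, 2))
```

A large absolute t statistic suggests the gap between the group means is big relative to the spread within the groups; deciding whether it is significant still requires the p-value.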
This is your project and these are just some observations from a stranger. In addition to BMI not being an accurate measure of one’s health, charged/biased terms like “healthy”, “underweight”, “overweight”, and “obese” are subjective and should be avoided. I’ve reviewed way too many projects here that utilized this type of language. What is “healthy” or “overweight”? Those are not objective. Again, this is something you can research. I think it was in 2023 that the AMA released a report suggesting that doctors stop using this number as a health measurement.
As far as mean vs. median: outliers in the data can pull the mean away from the typical value, which is why I mentioned looking at the median.
ex:
#this is Pandas
df['charges'].describe().round(2)
charges
count 1338.00
mean 13270.42
std 12110.01
min 1121.87
25% 4740.29
50% 9382.03
75% 16639.91
max 63770.43
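To see why the mean sits so far above the 50% value in that summary, here is a small sketch with made-up numbers (not rows from the dataset) showing how a single very large charge moves the mean much more than the median.

```python
from statistics import mean, median

# Made-up charges: five ordinary values, then the same list plus one extreme.
typical = [1100.0, 4700.0, 9400.0, 12000.0, 16600.0]
with_outlier = typical + [63800.0]

mean_before, median_before = mean(typical), median(typical)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(mean_before, median_before)          # 8760.0 9400.0
print(round(mean_after, 2), median_after)  # 17933.33 10700.0
```

Adding one extreme value roughly doubles the mean here, while the median shifts by a much smaller amount.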
Here’s something else that I found interesting about the dataset.
You can pull out the rows with the max & min values in a column (using pandas):
df.nlargest(5, columns=['charges'])
age sex bmi children smoker region charges
543 54 female 47.410 0 yes southeast 63770.42801
1300 45 male 30.360 0 yes southeast 62592.87309
1230 52 male 34.485 3 yes northwest 60021.39897
577 31 female 38.095 1 yes northeast 58571.07448
819 33 female 35.530 0 yes northwest 55135.40209
#so, there's 7 rows that have charges greater than 50,000
df.sort_values('charges', ascending=False).head(7)
age sex bmi children smoker region charges
543 54 female 47.410 0 yes southeast 63770.42801
1300 45 male 30.360 0 yes southeast 62592.87309
1230 52 male 34.485 3 yes northwest 60021.39897
577 31 female 38.095 1 yes northeast 58571.07448
819 33 female 35.530 0 yes northwest 55135.40209
1146 60 male 32.800 0 yes southwest 52590.82939
34 28 male 36.400 1 yes southwest 51194.55914
#you can do the same with minimum values:
df.sort_values('charges', ascending=True).head()
age sex bmi children smoker region charges
940 18 male 23.21 0 no southeast 1121.8739
808 18 male 30.14 0 no southeast 1131.5066
1244 18 male 33.33 0 no southeast 1135.9407
663 18 male 33.66 0 no southeast 1136.3994
22 18 male 34.10 0 no southeast 1137.0110
#you can also isolate the smokers from the data set and analyze them. Though, that's a small sample of 274
smokers = df[df['smoker'] == 'yes']
smokers.head()
age sex bmi children smoker region charges
0 19 female 27.90 0 yes southwest 16884.9240
11 62 female 26.29 0 yes southeast 27808.7251
14 27 male 42.13 0 yes southeast 39611.7577
19 30 male 35.30 0 yes southwest 36837.4670
23 34 female 31.92 1 yes northeast 37701.8768
smokers.groupby(['region', 'sex'])['charges'].median().round(2)
charges
region sex
northeast female 22331.57
male 33993.37
northwest female 28950.47
male 26109.33
southeast female 35017.72
male 38282.75
southwest female 34166.27
male 35585.58
The Codecademy forum thread discusses a Medical Insurance Data Analysis project, focusing on Python skills like data cleaning, exploration, and visualization. Participants analyze insurance costs and identify trends, enhancing their understanding of data science concepts and practical implementation.
Hasn’t posted more than once in a single day. Possibly trying to keep the daily/weekly streak alive when completing a lesson/exercise is not convenient. Just a theory.
Might even be a bot or some seo related activity.
Incidentally, the link to these forums has already disappeared from the Codecademy main page.