I feel like I might need more data on why some factors could affect the results! And maybe I should have written functions instead…

I hope you enjoy some part of this project!

Congrats on completing the project.

  • I’m curious as to the reasoning behind only analyzing the first 200 rows of a 1338 row dataset?
    From here: “***Remember that these calculation only represents the first 200 rows/patients from the csv file!”

  • Rather than calculating the average cost, maybe look at the median cost for each group and compare them that way. There are outliers in the data that pull the mean, so the median is a better measure of a typical cost.

  • I like that you didn’t include any analysis about bmi.

  • Seems like you understand how to write functions and pull out key pieces of info from the dataset. But I wonder how different the numbers would be if you looked at more than just 200 people.

  • Might be beneficial to include a brief summary of your findings, or concluding thoughts at the bottom of the notebook. You know, to wrap it up.
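To illustrate the mean vs. median point, here is a minimal sketch in plain Python. The charges are made-up numbers, not values from the project's CSV, and the group labels and the `summarize` helper are just names I picked for the example.

```python
from statistics import mean, median

# Made-up charges for illustration only; NOT values from the project dataset.
charges_by_group = {
    "smoker": [18000.0, 21000.0, 23000.0, 60000.0],
    "non-smoker": [2000.0, 3000.0, 4000.0, 5000.0, 36000.0],
}


def summarize(values):
    """Return both the mean and the median so they can be compared."""
    return {"mean": mean(values), "median": median(values)}


summary = {group: summarize(values) for group, values in charges_by_group.items()}

for group, stats in summary.items():
    print(f"{group}: mean={stats['mean']:.2f}, median={stats['median']:.2f}")
```

In each made-up group a single large value drags the mean well above the median, which is the effect outliers can have on an average.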

Good work, keep at it. :slight_smile:

Hey lisalisaj, first of all thank you for the feedback.

So the reason I only used the first 200 rows was because of conditions like smokers/non-smokers: the number of patients who don’t smoke wasn’t equal to the number of patients who smoke, so I balanced them out to make the calculations more precise. I then applied the same “first 200 rows” condition to every other calculation to get a balanced outcome/conclusion.

For the median/mean, I guess I forgot I could just use them, even though the exercises after this project remind you about them with pandas.

Anyway, thank you again and I hope it helps. :slightly_smiling_face:

Fwiw, you’re not building or training an ML model; you’re analyzing what’s represented in the data. The groups don’t have to be even. It’s better to look at all the data in this case.

I understand, but we’re working with data: the result has to be correct even if I sacrifice a certain number of observations. My calculations always come with conditions, so I have no choice but to balance the number of rows I use.

What I mean is this: With exploratory data analysis, you don’t need a balanced data set (luckily there weren’t any missing values or anything that one had to clean up here). You’re exploring all of the data for potential relationships between the variables. If you were going to do statistical testing, then you could maybe randomly select 200 rows and then go with that, or remove the outliers in the data which would affect the mean.
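As a rough sketch of the "randomly select 200 rows" idea, compared with taking the first 200: the rows below are a synthetic stand-in for the CSV (the field names `patient_id` and `smoker` and the helper `smoker_share` are assumptions made for the example), and the seed is fixed only so the sample is reproducible.

```python
import random

# Synthetic stand-in for the CSV rows; field names here are assumptions.
# In the real project the rows would come from the file instead.
rows = [
    {"patient_id": i, "smoker": "yes" if i % 5 == 0 else "no"}
    for i in range(1338)
]

# Taking the first 200 rows depends entirely on how the file happens to be ordered.
first_200 = rows[:200]

# A random sample gives every row the same chance of being picked, without repeats.
rng = random.Random(42)  # fixed seed so the sample can be reproduced
random_200 = rng.sample(rows, 200)


def smoker_share(subset):
    """Fraction of rows in the subset that are marked as smokers."""
    return sum(row["smoker"] == "yes" for row in subset) / len(subset)


print(f"all rows:   {smoker_share(rows):.3f}")
print(f"first 200:  {smoker_share(first_200):.3f}")
print(f"random 200: {smoker_share(random_200):.3f}")
```

For plain exploration, though, running the calculation on `rows` directly (all 1338 of them) is the simplest option.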

By only looking at the first 200 rows you’re not really getting an accurate look at the data (which is what EDA is), especially when it comes to finding any averages (or median values) when looking at any potential relationships & differences between the variables: women vs. men, non-smokers vs. smokers, people in the four different regions. You could start with a correlation matrix to see the positive and negative correlations between the variables. If you were to, say, only plot the first 200 records, that would not be an accurate representation of the distribution of the data set.
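Here is a small sketch of what a correlation matrix boils down to, written in plain Python with a hand-rolled Pearson coefficient. The column names and the six rows of values are invented for the example and are not taken from the project's data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented numeric columns for illustration only; NOT the project's data.
columns = {
    "age": [19, 28, 33, 45, 52, 61],
    "bmi": [27.9, 33.0, 22.7, 30.1, 25.4, 29.8],
    "charges": [1700.0, 4400.0, 5200.0, 8100.0, 11000.0, 13500.0],
}

# Correlate every column with every other column (including itself).
matrix = {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}

for name in columns:
    print(name, [round(matrix[name][other], 2) for other in columns])
```

Values near 1 or -1 point to a strong positive or negative linear relationship, and values near 0 to a weak one. If the data is loaded into a pandas DataFrame, `DataFrame.corr()` produces the same kind of matrix for the numeric columns.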