I did the US Medical Insurance Costs Project in the BI Analyst Track (Python for DS II course).
I only had two questions that came to mind: How does BMI differ regionally among men and women? And, how does BMI differ between people with children and people without?
I’m not really satisfied with what I have so far bc neither question revealed much insight or differences (other than the southeast region averaging the highest BMIs).
Okay, so those are good questions to investigate (even if BMI is a controversial number & not an indicator of one’s overall health & can lead to value judgments). I haven’t looked at the course you’re taking but have there been any articles about EDA yet?
When you approach any data set, you’ll want to start with EDA (exploratory data analysis), which includes descriptive statistics & data visualization. You’ll also want to see for yourself the basic info about the data: column names, how large the dataset is (here, it’s 1338 rows if I’m not mistaken), which variables are quantitative and which are qualitative, the data type of each column, stuff like that.
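Here’s a rough sketch of what that first look could be in pandas. The rows below are made up just so the snippet runs on its own, and the column names are my assumption about what the file looks like; in the project you’d load the real CSV with `pd.read_csv` instead.

```python
import pandas as pd

# Made-up rows for illustration only; column names are assumed, not confirmed.
# In the project you would load the real file with pd.read_csv(...) instead.
df = pd.DataFrame({
    "age": [25, 40, 33, 52, 46, 19, 30, 61],
    "sex": ["female", "male", "female", "male", "female", "male", "female", "male"],
    "bmi": [31.0, 34.5, 27.2, 29.8, 24.1, 22.6, 26.3, 28.4],
    "children": [0, 2, 1, 0, 3, 0, 0, 1],
    "smoker": ["yes", "no", "no", "yes", "no", "no", "no", "yes"],
    "region": ["southeast", "southeast", "southwest", "southwest",
               "northeast", "northeast", "northwest", "northwest"],
    "charges": [20000.0, 4000.0, 5000.0, 18000.0, 6000.0, 3000.0, 2000.0, 22000.0],
})

# Size of the dataset as (rows, columns)
print(df.shape)

# Column names and the data type of each column
print(df.columns.tolist())
print(df.dtypes)

# Split columns into quantitative (numeric) and qualitative (everything else)
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print("quantitative:", numeric_cols)
print("qualitative:", categorical_cols)

# Descriptive statistics for the numeric columns
print(df.describe())
```

`df.info()` gives a similar overview in one call if you prefer that.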
So, for this data set, you see that charges is the main quantitative variable of interest (age, BMI, and number of children are numeric too). Think about questions like:
what are the mean and median of charges in each region? is one higher? lower?
how many people are in each region?
how many women and men are in the dataset? in each region?
what are the mean and median of charges for men vs. women?
what are the mean & median of charges for people who have children vs. those who don’t?
what are the age ranges, and what is the range of charges for each?
how many smokers vs. non-smokers are in the dataset?
do charges differ for smokers vs. non-smokers?
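Most of those questions come down to `value_counts`, `groupby`, and `crosstab`. A minimal sketch, again on made-up rows with assumed column names (the `has_children` column is a helper I’m adding for illustration, it isn’t in the data):

```python
import pandas as pd

# Made-up rows for illustration only; column names are assumed, not confirmed.
df = pd.DataFrame({
    "age": [25, 40, 33, 52, 46, 19, 30, 61],
    "sex": ["female", "male", "female", "male", "female", "male", "female", "male"],
    "bmi": [31.0, 34.5, 27.2, 29.8, 24.1, 22.6, 26.3, 28.4],
    "children": [0, 2, 1, 0, 3, 0, 0, 1],
    "smoker": ["yes", "no", "no", "yes", "no", "no", "no", "yes"],
    "region": ["southeast", "southeast", "southwest", "southwest",
               "northeast", "northeast", "northwest", "northwest"],
    "charges": [20000.0, 4000.0, 5000.0, 18000.0, 6000.0, 3000.0, 2000.0, 22000.0],
})

# Mean and median of charges in each region
region_stats = df.groupby("region")["charges"].agg(["mean", "median"])

# How many people are in each region, and how many women and men per region
region_counts = df["region"].value_counts()
sex_by_region = pd.crosstab(df["region"], df["sex"])

# Mean and median of charges for men vs. women
sex_stats = df.groupby("sex")["charges"].agg(["mean", "median"])

# People with children vs. without (has_children is a hypothetical helper column)
df["has_children"] = df["children"] > 0
children_stats = df.groupby("has_children")["charges"].agg(["mean", "median"])

# Youngest and oldest age in the data
age_range = (int(df["age"].min()), int(df["age"].max()))

# Smokers vs. non-smokers: counts, then mean and median of charges
smoker_counts = df["smoker"].value_counts()
smoker_stats = df.groupby("smoker")["charges"].agg(["mean", "median"])

print(region_stats)
print(sex_by_region)
print(children_stats)
print(smoker_stats)
```

The numbers it prints mean nothing since the rows are invented, but the same calls should work on the real data if the columns are named this way.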
Since you only imported the Pandas library, I’m not sure if you’ve gotten to data visualization in the course yet. That is also really helpful: you can plot variables, see how they fall on a plot & see if the data is skewed or not, etc. All of this will eventually lead to inferential statistics and hypothesis testing to see if there are any relationships in the data. For example, is there a significant difference in mean charges between women and men? (That would be a t-test, which compares a quantitative variable across a binary categorical variable.) Etc. etc.
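If you want a preview of where that goes, here’s a small sketch of a skew check and a two-sample t-test. I’m assuming SciPy is available and using Welch’s version of the test (`equal_var=False`), which is my choice and not something the course necessarily uses. The rows are made up, so the result itself is meaningless.

```python
import pandas as pd
from scipy import stats

# Made-up rows for illustration only; column names are assumed, not confirmed.
df = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "female", "male", "female", "male"],
    "charges": [20000.0, 4000.0, 5000.0, 18000.0, 6000.0, 3000.0, 2000.0, 22000.0],
})

# A positive value suggests a long right tail (a few very large charges)
charges_skew = df["charges"].skew()

# Split the quantitative variable by the binary categorical variable
female_charges = df.loc[df["sex"] == "female", "charges"]
male_charges = df.loc[df["sex"] == "male", "charges"]

# Welch's two-sample t-test: does mean charges differ between the two groups?
result = stats.ttest_ind(female_charges, male_charges, equal_var=False)
t_stat = float(result.statistic)
p_value = float(result.pvalue)

print("skew:", charges_skew)
print("t statistic:", t_stat)
print("p-value:", p_value)
```

A small p-value (commonly below 0.05) would suggest the difference in means is unlikely to be due to chance alone, but you’d want to check the test’s assumptions on the real data first.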