Congrats on finishing the project.
It seems like you have a solid understanding of how to create functions to look at the data set. I like that you state you’re going to return to the data once you know more Python libraries. I can say that an understanding of Pandas, NumPy, and SciPy for analysis and statistical testing, plus Seaborn or Matplotlib/PyPlot for visualization, adds a lot to the analysis and lets you check whether there are indeed any statistically significant relationships between the variables.
In data analysis an important step is EDA (which you’ve started). One thing to be really careful about is making assumptions about the data set and then letting those assumptions influence your analysis. As scientists, we strive to be objective. The data needs to speak for itself, and our job as data people is to uncover what’s already there. I mention this b/c I wondered how you decided on the categories/labels/buckets for the BMI column.
(I also say this b/c I’ve done this project and explored the data. What concerns me is the treatment of BMI in the data, as there is much debate (in the medical community & beyond) about how well that number reflects a person’s overall health and, further, about using it to decide whether someone pays more for insurance. But I guess that’s a topic for another time & forum.) The data set is from Kaggle and it’s (presumably) just a random sample. Or it could be dummy data; I’m not entirely sure.
Pandas has a lot of built-in methods. You could use the df.describe() method to get a snapshot of the count, mean, std, min, max, and quartile values for every numeric column. That would give you the range of numbers for the bmi column. The min value is 15.96, the max is 53, and the avg in the data set is 30.66. You could also plot the data to see the spread of that column.
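Here’s a rough sketch of what that could look like. The rows below are made-up numbers just to keep the example self-contained (they are not the Kaggle data), and the file name in the comment is an assumption about how you saved the CSV.

```python
import pandas as pd

# Hypothetical rows for illustration only. In the real project you would load
# the file instead, e.g. df = pd.read_csv("insurance.csv")  (file name assumed).
df = pd.DataFrame({
    "age": [19, 33, 45, 52, 28, 61],
    "bmi": [27.9, 22.7, 30.1, 35.2, 18.4, 41.0],
})

# describe() on a single column returns count, mean, std, min, quartiles, max.
summary = df["bmi"].describe()
print(summary)

# Individual statistics can be pulled out by label.
bmi_min = summary["min"]
bmi_max = summary["max"]
print(f"bmi ranges from {bmi_min} to {bmi_max}")
```

Seeing the actual min, max, and quartiles first is one way to let the data suggest where bucket boundaries might go. If Matplotlib is installed, df["bmi"].plot(kind="hist") should give a quick look at the spread.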
To keep investigating the associations between the variables, one could use either the math module or SciPy’s stats module to calculate the variance and standard deviation, and then test the difference between two means to see whether it is statistically significant. There’s a lot that could be done here, and you’ve got a solid project and a great start! I hope that you do return to the data as you learn more. IMO it’s always fun to go back to data sets and apply what one has learned.
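On the difference-between-two-means idea, here’s a minimal sketch using only the standard library. I’m assuming Welch’s t-test (which doesn’t require the two groups to have equal variances) as one reasonable choice; the function name welch_t and the two sample lists are hypothetical, not taken from the data set.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # statistics.variance is the sample variance (divides by n - 1).
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)

    # Squared standard error contributed by each group.
    se_a, se_b = var_a / n_a, var_b / n_b

    t_stat = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof


# Hypothetical values standing in for one variable split into two groups.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]

t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof:.1f}")
```

This only gives the t statistic and degrees of freedom. To get a p-value without looking it up in a table, scipy.stats.ttest_ind(group_a, group_b, equal_var=False) runs the same kind of test and returns the p-value as well.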