My US Medical Insurance Cost Project: Please help me review

Thanks a lot for coming here. In this project, I’ve branched out to explore the correlations between quite a lot of variables. I’ve just finished Python Fundamentals, so I applied most of what I learned in that unit. I’ll probably come back and revise this using more Python libraries like NumPy.

In the Jupyter Notebook, I’ve included explanations of what I intend to do.

Much appreciated.

https://github.com/breyenguyen/datascienceportfolio/blob/8afa50f5ea08ab080834a775cc032691e4d7d2bd/us-medical-insurance-costs.ipynb

By the way, I’ve reviewed some other people’s work. I think if I knew pandas, it would be much easier to work with this data. Interesting.

Congrats on finishing the project. :partying_face:

It seems like you have a solid understanding of how to create functions to look at the data set. I like that you state you’re going to return to the data once you know more Python libraries. I can say that an understanding of Pandas, NumPy, and SciPy for analysis and statistical testing, and of Seaborn or Matplotlib/PyPlot for visualization, adds to the analysis and helps show whether there are indeed any statistically significant relationships between the variables.

In data analysis an important step is EDA (which you’ve started). One of the things you really want to be careful about is making assumptions about the data set and then letting those assumptions influence your analysis. As scientists, we strive to be objective. The data needs to speak for itself, so to speak, and our job as data people is to uncover what’s already there. I mention this b/c I wondered: how did you decide on the categories/labels/buckets for the BMI column?
(I also say this b/c I’ve done this project and explored the data. The treatment of BMI in the data is concerning, as there is much debate (in the medical community & beyond) about the validity of that number as a measure of a person’s overall health and, further, about using it to decide whether someone pays a higher cost for insurance. But I guess that’s a topic for another time & forum.) The data set is from Kaggle and it’s just (presumably) a random sample. Or it could be dummy data; I’m not entirely sure.

There are a lot of built-in functions in Pandas. You could use the df.describe() method to get a snapshot of the count, mean, min, max, std, etc. for all numeric columns. That would give you the range of numbers for the column bmi. The min value is 15.96, the max is 53 and the avg in the data set is 30.66. You could also plot the data to see the spread of that column.
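For anyone following along, here is a minimal sketch of what that describe() call could look like. The rows below are made up purely for illustration (they are not the Kaggle data), and the variable names are my own, so the numbers will not match the ones quoted above.

```python
import pandas as pd

# Toy rows invented for illustration; NOT the real insurance data set.
insurance = pd.DataFrame({
    "bmi": [22.5, 27.9, 31.2, 35.4, 40.0],
    "charges": [3200.0, 4500.0, 8700.0, 12900.0, 21000.0],
})

# describe() returns a DataFrame indexed by statistic name
# (count, mean, std, min, 25%, 50%, 75%, max) for each numeric column.
summary = insurance.describe()

# Individual statistics can be pulled out by row and column label.
bmi_min = summary.loc["min", "bmi"]
bmi_max = summary.loc["max", "bmi"]
bmi_mean = summary.loc["mean", "bmi"]

print(summary)
```

With the real file loaded into a DataFrame, the same three lookups would give the min, max and mean of the bmi column.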

To continue investigating the associations between the variables, one could use either the math module or SciPy’s stats module to calculate the variance and standard deviation, and then test the difference between two means to see whether it is statistically significant. There’s a lot that could be done here and you’ve got a solid project and a great start! I hope that you do return to the data as you learn more. IMO it’s always fun to go back to data sets and apply what one has learned. :slight_smile:
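As a rough sketch of the "difference between two means" idea using only the standard library, the snippet below computes Welch's t statistic for two groups. The helper name welch_t and the two small samples are hypothetical, chosen only to keep the arithmetic easy to follow. If SciPy is installed, scipy.stats.ttest_ind(a, b, equal_var=False) should report the same statistic along with a p-value.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic for the difference between two sample means."""
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    # statistics.variance is the sample variance (divides by n - 1).
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


# Made-up numbers for illustration only; not taken from the data set.
group_a = [20.0, 22.0, 24.0, 26.0]
group_b = [10.0, 12.0, 14.0, 16.0]

t_stat = welch_t(group_a, group_b)
print(f"t statistic: {t_stat:.3f}")
```

A large absolute t value suggests the two means differ by more than the spread within each group would explain, but the statistic alone is not a p-value; for that you would still need the t distribution, which is where SciPy saves a lot of work.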

Happy coding!

Thanks a lot for your response and recommendations.

As for the assumptions, to be honest, I was quite concerned about them when dividing the data into small categories. Still, I just wanted to challenge myself by using what I’ve learned to play with the data a little bit.

Speaking of the analyses, there’s one more thing that I think is misleading, which is the average/mean. If you’ve read Factfulness, you might understand what I’m trying to say here. The average gives us a biased view of the real world; only someone who wants to manipulate data would use it as an indicator of anything. So basically, what I’m trying to do here is just to review my knowledge. This is not going to be the final version.

Anyway, thanks a lot. I’m gonna come back to this dataset with more knowledge of Python libraries to see how much more convenient it is to use library methods instead of writing my own functions to sort the data.

Yep, the averages are only for this sample data set; not the population at large.

So happy to receive your feedback. I’m gonna keep learning and come back when I have more knowledge to analyze data more efficiently and, of course, objectively.

You’re welcome. That’s what we’re all here for—to help one another out! :slight_smile:

Thanks a lot for the encouragement. As I promised, after finishing Data Manipulation with Pandas, I came back and did some further analysis on the data, and visualized it using my basic knowledge of Matplotlib and Seaborn.

Here is the new version of the project. If you have some time, could you help me review it briefly?
https://github.com/breyenguyen/project/blob/main/us-medical-insurance-costs%20(EDAver).ipynb

Much appreciated.

Hooray! Good on you for returning to the data and adding more value to your analysis.

I’m not familiar with the IPython display library and html module. Does it create html objects in the notebook? I will read up more on it. Thank you for bringing it to my attention. I always like to learn new things.

You could also break out the data into regions by using .iloc with a boolean mask (via the .values attribute):
Ex:
southwest = insurance.iloc[(insurance['region']=='southwest').values]
southwest.head()
and then use .describe() on the columns to see the basic stats.
(I like shortcuts/timesavers. lol)
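Here is a small sketch of that region filter, assuming a DataFrame named insurance with a region column as in the example above. The rows are invented for illustration. It also shows two alternatives that are my own suggestion rather than something from the notebook: .loc accepts the boolean Series directly (no .values needed), and groupby gives the basic stats for every region in one call.

```python
import pandas as pd

# Toy rows invented for illustration; NOT the real insurance data set.
insurance = pd.DataFrame({
    "region": ["southwest", "northeast", "southwest", "southeast"],
    "bmi": [27.9, 33.0, 30.5, 22.7],
})

# Boolean mask: True for rows in the southwest region.
is_southwest = insurance["region"] == "southwest"

# .iloc is positional, so it needs the raw boolean array (.values).
southwest_iloc = insurance.iloc[is_southwest.values]

# .loc is label-based and accepts the boolean Series as-is.
southwest_loc = insurance.loc[is_southwest]

# Basic stats for bmi in every region at once.
by_region = insurance.groupby("region")["bmi"].describe()

print(southwest_loc.head())
print(by_region)
```

Both filters select the same rows, so which one to use is mostly a matter of taste.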

Now you have another project to add to your portfolio. Good work! :dancer:t2:

Oh, right! I forgot that I could use .iloc to make things simpler with less code.

As for the display, html module, it helps create a better-looking table in the notebook. Like this:

Thanks a lot for your suggestions!
