Can you review my Medical Insurance Portfolio Project?

Hello everyone!

I would like to share this portfolio project with you all and get your feedback. While I’m pleased with the outcome so far, I feel it’s not as polished and well-structured as I’d like it to be.

Here’s the link to the project: GitHub - raserr11/Medical-Insurance-Data-Analysis-Course-Project: This project is based on a dataset provided by Codecademy, where you're encouraged to freely apply your knowledge and skills.

I would greatly appreciate any comments or suggestions you have to improve it.

Thank you very much for taking the time to review it!

Best regards,
Raúl.

Congrats on completing the project.

Some thoughts:

  • You imported Pandas but didn’t use it for descriptive stats. I was just wondering why.
    For example:
df['age'].median().round(1)
>> 39.0
df['charges'].mean().round(2)
>> 13270.42
df['charges'].median().round(2)
>> 9382.03

or,
df['charges'].describe().round(2)
>>
count     1338.00
mean     13270.42
std      12110.01
min       1121.87
25%       4740.29
50%       9382.03
75%      16639.91
max      63770.43
  • You calculated the correlation coefficient (r), but that’s not all that should be looked at. It only shows how tightly the points cluster around the fitted line. It’s a start for sure though! :slight_smile: (You also didn’t do any significance testing here.) You can say that the variables seem to be related, but you can’t be entirely sure.
    More: you also want to do some regression analysis in order to measure the change in y given a one-unit increase in x. That’s the beta coefficient (b), i.e. how steep the line is. The two numbers can tell quite different stories: a steep line can come with a weak correlation, and a strong correlation can come with a shallow line. TL;DR: Pearson’s correlation doesn’t tell the entire story of the relationship; it shows how closely the points sit around the line, not how steep the line is (change in x -> effect on y).

  • Rather than look at avg/mean costs, you might want to look at median costs b/c there are outliers in the data that pull the mean.

  • Good use of visualizations to emphasize findings. Again, might be better to look at median.

  • It’s better to omit the subjective commentary from EDA. Stuff like, “With that being said, we can conclude that smoking is not only bad for your health but also for your wallet, as smoking is a significant factor in calculating medical insurance prices.” isn’t really necessary b/c this is supposed to be an objective analysis.
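To make the correlation vs. slope point concrete, here is a minimal sketch in plain Python (no Pandas needed). The numbers are made up for illustration and are not from the insurance dataset, and the helper names `pearson_r` and `ols_slope` are just placeholders I picked. Both toy datasets lie exactly on a line, so r comes out at essentially 1 for each, yet the slopes differ by a factor of 100.

```python
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation: how tightly the points sit around a straight line."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def ols_slope(xs, ys):
    """Least-squares slope (b): change in y for a one-unit increase in x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Made-up data: both relationships are perfectly linear.
x = [1, 2, 3, 4, 5]
steep = [10 * v for v in x]     # y = 10x
shallow = [0.1 * v for v in x]  # y = 0.1x

for label, y in (("steep", steep), ("shallow", shallow)):
    print(label, "r =", round(pearson_r(x, y), 3), "b =", round(ols_slope(x, y), 3))
```

For the significance testing part, `scipy.stats.linregress(x, y)` returns the slope, intercept, r and a p-value in one call, so that would probably be the easier route once you are comfortable with the idea.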

You used comments well, so anyone looking at the notebook could follow along as you did EDA. You also have an understanding of how to write functions (assuming you didn’t use ChatGPT to write them, correct?).
You’ll return to the dataset further along in the course…so you can always utilize your newly found Python skills on it. Good job.
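P.S. On the mean vs. median point above, a quick sketch with made-up numbers (not the real charges column) shows how a single large value drags the mean while the median barely moves. It only uses Python’s built-in `statistics` module, and the variable names are just for illustration.

```python
from statistics import mean, median

# Made-up charges: five ordinary values plus one extreme outlier.
typical = [1200, 1500, 1800, 2100, 2400]
with_outlier = typical + [60000]

mean_typical = mean(typical)
median_typical = median(typical)
mean_outlier = mean(with_outlier)
median_outlier = median(with_outlier)

print("without outlier: mean =", mean_typical, "median =", median_typical)
print("with outlier:    mean =", mean_outlier, "median =", median_outlier)
```

With the outlier added, the mean jumps from 1800 to 11500 while the median only moves from 1800 to 1950, which is why the median is usually the safer summary for skewed cost data.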

First of all, thank you very much for the advice and observations. This is exactly what I was looking for when I posted the project here. I will work on fixing the errors today. :heart:

I didn’t use the predefined Pandas functions for descriptive statistics because I’m not familiar with them yet. I’ve tried to rely on my current knowledge, and I hope to improve these aspects over time.

Regarding the use of statistics, as you mentioned (regression analysis), I’m eager to learn more about mathematics and statistics and apply them in my projects. I’ve already requested the syllabus for the next course where these topics are covered, so I can study them over the summer.

As for the functions, yes, they are written by me. The sorting functions are taken from another algorithm analysis project I had to do for university last month (which I will also publish here once I have it on GitHub).

Again, thank you very much for taking the time to review the project. I hope to do better next time. :smiley: