Review needed for Linear Regression on Portfolio Project: US Medical Insurance Costs

Hello fellow Codecademy user,

This is the first portfolio project I have completed in the Codecademy Data Science: Analytics path. The recommended analysis ideas were very simple, so for the extended analysis I tried something similar to Reggie’s Linear Regression project, but with 7 dimensions in the medical data instead of 2!

This took me about a day of coding to complete and can be seen here: GitHub - samWyatt/codecademy_projects.

My two questions are:

  • I think I have found the optimal factors for the linear insurance cost estimate function, but how can I be sure? How do I know I’m not at a local minimum of the total error function instead of the global minimum? Can you find a better set of factors?

  • I have two approaches to looping through all the possible factor combinations. One uses a series of nested for loops and the other uses itertools.product. Which one is better?
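
A minimal sketch of how both questions could be checked, using made-up toy data with a single factor plus an intercept rather than the 7-dimensional insurance data. It assumes the total error is the sum of squared errors (the project may use a different error measure). Under that assumption the error is a convex function of the factors, so any local minimum is also the global minimum, and a closed-form least-squares solution exists to compare against. The names `total_squared_error`, `slopes`, and `intercepts` are hypothetical.

```python
from itertools import product

# Hypothetical toy data (not from the insurance dataset): y = 3*x + 10 exactly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [13.0, 16.0, 19.0, 22.0, 25.0]


def total_squared_error(m, b):
    """Sum of squared differences between the data and the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Candidate factors to try (step size 0.1).
slopes = [i / 10 for i in range(0, 61)]        # 0.0 to 6.0
intercepts = [i / 10 for i in range(0, 201)]   # 0.0 to 20.0

# Approach 1: nested for loops, one loop per factor.
best_nested = None
for m in slopes:
    for b in intercepts:
        err = total_squared_error(m, b)
        if best_nested is None or err < best_nested[0]:
            best_nested = (err, m, b)

# Approach 2: itertools.product visits the same combinations in the same order.
best_product = min(
    (total_squared_error(m, b), m, b) for m, b in product(slopes, intercepts)
)

# Closed-form least squares for one factor plus an intercept.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
m_exact = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
b_exact = mean_y - m_exact * mean_x

print("nested loops:", best_nested)
print("itertools.product:", best_product)
print("closed form:", m_exact, b_exact)
```

Both loop styles try exactly the same combinations, so the choice is mostly about readability: `product` stays flat as the number of factors grows, while nested loops gain one indentation level per factor. A grid search can only be as precise as its step size, so comparing its result against a least-squares solution is one way to confirm the factors are optimal.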

Thanks for your time!

  • I think some visualizations (with Seaborn or Matplotlib) might be useful for this regression analysis because it’s difficult to see what the regression is doing from the numbers alone. Perhaps some basic scatter plots would help? What is the dependent variable? Charges? Are the rest the independent variables? Which independent variable(s) have the greatest effect on charges? Are there any correlations?

  • It would be good to look at the median of the charges rather than the mean, because there are some outliers in the data that skew the mean.

  • How would you explain this to a non-technical audience? Maybe include some brief comments to describe your thought processes about what you’re doing. EDA is telling the story of the data.

  • It might be a good idea to add some concluding comments too about what you found in your analysis, just to wrap things up a bit.
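
To illustrate the point about the median, here is a small sketch with made-up charges (hypothetical values, not taken from the dataset) where one large outlier pulls the mean far above the typical value while the median barely moves.

```python
from statistics import mean, median

# Hypothetical charges in dollars; the last value is a deliberate outlier.
charges = [2000, 3000, 4000, 5000, 60000]

mean_charge = mean(charges)      # pulled upward by the outlier
median_charge = median(charges)  # middle value, unaffected by the outlier's size

print("mean:", mean_charge)
print("median:", median_charge)
```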

Hi lisalisaj,

Thanks for reviewing my project! I’m still a couple of units away from data visualization, but I will come back to this project and add a line chart showing the decrease in total error over time.

I think a scatter plot would be very difficult for this project since there are 7 independent variables, but I could definitely point out which variables have the greatest effect on the cost estimate.

Finally, you are totally right about adding clarification on the regression process. Right now, the project relies too heavily on the reader’s knowledge of linear regression, so to make it more accessible I will add an explanation of the process.

Thanks again lisalisaj


You’re welcome.
Also, keep in mind with regard to correlation that a few factors generally affect one’s health insurance premiums, or charges: age, tobacco use, region of the country, dependents on the plan, and the level of coverage.
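
One way to check such a relationship is a Pearson correlation coefficient between a factor and the charges. The sketch below uses made-up values (not from the dataset) and a hypothetical helper named `pearson`; a value near 1 or -1 suggests a strong linear relationship, and a value near 0 suggests a weak one.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical data: 1 = smoker, 0 = non-smoker, with made-up charges in dollars.
smoker = [0, 0, 1, 0, 1, 1]
charges = [3000, 4000, 20000, 3500, 25000, 22000]

r_smoker = pearson(smoker, charges)
print("correlation between smoking and charges:", round(r_smoker, 3))
```

The helper divides by zero if either list has no variation, so a constant column would need to be skipped first.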

Hi lisalisaj,

I’m sorry this took so long, but I have made some major updates to this project in terms of data visualization and concept explanation. The new file is called ‘us-medical-insurance-costs-v2-1Region.ipynb’ on my GitHub page here: GitHub - samWyatt/codecademy_projects

I added an ‘introduction to linear regression’ section for those who are unfamiliar with the topic. I also removed the overly long explanations of the more technical aspects to improve readability.

I would appreciate some feedback when you get the chance.
Thank you!