Here is my project. I took a couple of days to work on it, and since I was most of the way through the Data Science course when this project became available, I decided to practice some of the tools I had learned. It’s long because I included my reasoning in comments, and tried to maintain a flow of thought.
The graphs produced by the script are in a comment below.
I’d love to hear where I may have made mistakes or misunderstood something, which conclusions I missed, where my reasoning and data interpretation could be improved, and whether there are cleaner and/or more Pythonic ways I could’ve written my code!
I’m afraid I don’t have the time to analyse your code in detail, but from a quick glance it seems quite straightforward and readable (sensible function and variable names, no overcomplicated lines and even some CLI-style functions with docstrings). Perhaps it’s a personal preference, but your code seems advanced enough to warrant using the matplotlib Figure and Axes objects themselves rather than just the pyplot wrapper functions. I think you may find some of the multiple-subplot figures much easier to work with if you treat them as objects (there should be some guidance on this in the matplotlib docs and elsewhere online if you’re interested).
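To give a rough idea of what I mean by treating the figure as objects, here is a minimal sketch. The numbers and variable names (`ages`, `bmis`, `charges`) are made up for illustration and are not taken from your script.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up numbers, purely for illustration
ages = [19, 25, 33, 41, 52, 60]
bmis = [22.1, 27.5, 30.2, 24.8, 31.0, 28.4]
charges = [1700, 2300, 4100, 6200, 9800, 12500]

# plt.subplots returns the Figure plus one Axes object per panel
fig, (ax_age, ax_bmi) = plt.subplots(1, 2, figsize=(10, 4))

# Each panel is configured through its own Axes object,
# so there is no "current plot" state to keep track of
ax_age.scatter(ages, charges)
ax_age.set_title("Charges by age")
ax_age.set_xlabel("Age (years)")
ax_age.set_ylabel("Charges ($)")

ax_bmi.scatter(bmis, charges)
ax_bmi.set_title("Charges by BMI")
ax_bmi.set_xlabel("BMI")
ax_bmi.set_ylabel("Charges ($)")

fig.tight_layout()
```

From there `fig.savefig(...)` or `plt.show()` works as usual; the benefit is mostly that each panel has a name you can pass around or tweak later.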
I think my main points are on the presentation of the data more so than the code or the analysis itself. If your main goal is to practice the coding then that’s fine and you can ignore these comments, but consider the purpose of such an analysis. Chances are you’d be passing this information on to other people, so it’s always worth considering how best to present your deductions and analysis to the viewer.
There are a few figures where I’m not 100% sure what I’m looking at (some of the others are quite clear, so perhaps these weren’t intended for inclusion, in which case my apologies). They have one or more titles but the data themselves are unmarked. Consider adding some labels, legends, colour schemes or captions (add whatever suits best, but not everything) so that the viewer can understand the data as easily as possible.
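As a small illustration of the labelling point, something along these lines would already help. The two groups and all the values here are invented for the example, so treat it as a sketch rather than a rewrite of your plots.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Invented values for two groups, purely for illustration
non_smoker_bmi = [21.0, 24.5, 28.3, 31.2]
non_smoker_charges = [2100, 3400, 4200, 5100]
smoker_bmi = [22.4, 26.1, 30.8, 34.0]
smoker_charges = [15200, 18900, 33400, 41000]

fig, ax = plt.subplots()

# The label= argument is what the legend picks up for each series
ax.scatter(non_smoker_bmi, non_smoker_charges, color="tab:blue", label="Non-smokers")
ax.scatter(smoker_bmi, smoker_charges, color="tab:orange", label="Smokers")

ax.set_title("Insurance charges by BMI")
ax.set_xlabel("BMI")
ax.set_ylabel("Charges ($)")
ax.legend(title="Smoking status")
```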
I’d also advocate giving plots that are meant to be compared the same axis limits wherever possible (a huge difference in magnitude might make that impractical, in which case point the change out to the viewer). Say the charges in one area run from $100 to $200 and in the next from $150 to $4000: with the same axis limits on both (say $0 to $4000) the difference is visually obvious. If the axis limits change from plot to plot, the viewer has to check carefully to notice that a similar visual appearance does not mean a similar dataset (ideally the viewer shouldn’t have to work to understand anything).
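One way to get matching limits, sketched with invented numbers that follow the $100 to $200 versus $150 to $4000 example: ask `plt.subplots` for a shared y-axis and set the limits once. The area names and values are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Invented charges for two hypothetical areas
area_a_charges = [100, 130, 160, 200]
area_b_charges = [150, 900, 2500, 4000]

# sharey=True links the y-axes, so both panels always show the same range
fig, (ax_a, ax_b) = plt.subplots(1, 2, sharey=True, figsize=(8, 4))

ax_a.plot(area_a_charges, marker="o")
ax_a.set_title("Area A")
ax_a.set_ylabel("Charges ($)")

ax_b.plot(area_b_charges, marker="o")
ax_b.set_title("Area B")

# Setting the limits on one shared axis applies to the other as well
ax_a.set_ylim(0, 4000)
```

With the shared range, Area A shows up as a nearly flat line at the bottom of its panel, which is exactly the visual cue that the two areas are on very different scales.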
If you have the time, consider moving some of your code comments (which contain interesting analysis) into a more practical format for viewing, ideally with the figures included; there shouldn’t be a need to read the code itself in order to understand anything. A presentation-style document is a decent target. As always, consider how accessible the information you’ve obtained is to the viewer.
I believe it is a very good analysis. I arrived at the same conclusions as you. There isn’t enough data to determine why some non-smokers with low BMI pay so much in insurance costs.
I was wondering why you didn’t use a notebook instead of a Python file. It would have made it easier to follow your conclusions by showing your graphs alongside your code.
Thanks for the comments and feedback! I’m glad my conclusion could be corroborated.
I chose to use a Python file instead of a Notebook because I don’t really like using Jupyter Notebook, and I wanted practice doing everything in PyCharm. I assume that if I were to get a job in Data Science, I’d be using an IDE.