Hi all- looks like I am posting the first Life Expectancy vs GDP project for 2022.
In the project I plotted Life Expectancy against GDP for all six countries on the same graph. This led me to question whether it was appropriate to compare data from large countries like China and the US with small countries like Zimbabwe without adjusting GDP somehow to account for country size.
I decided to download each country’s population data from the UN so I could merge this with the provided data to get a “GDP per capita” column.
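For anyone wanting to try the same thing, here is a minimal sketch of that merge in pandas. The column names (`Country`, `Year`, `GDP`, `Population`) and every number are placeholders I made up for illustration, not the real project or UN figures, so adjust them to match your own files.

```python
import pandas as pd

# Placeholder rows standing in for the provided project data (values are made up).
gdp = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2000, 2001, 2000, 2001],
    "GDP": [1000.0, 1200.0, 50.0, 60.0],
})

# Placeholder rows standing in for the cleaned UN population table (values are made up).
population = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2000, 2001, 2000, 2001],
    "Population": [100, 100, 10, 12],
})

# Join on both keys; validate raises an error if either table
# has duplicate Country/Year pairs, which would silently inflate rows.
merged = gdp.merge(
    population,
    on=["Country", "Year"],
    how="left",
    validate="one_to_one",
)

# GDP per capita is total GDP divided by population for the same country and year.
merged["GDP_per_capita"] = merged["GDP"] / merged["Population"]

print(merged)
```

A left join keeps every row of the provided data even when a population figure is missing, so a gap shows up as NaN in the new column instead of a row quietly disappearing.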
*** Questions - Your Opinion? ***
Is it at all appropriate to seek additional information on your own if you feel you need it to make an accurate analysis, or should you only work with and make conclusions from the data provided to you by whomever is asking for the analysis?
The UN data was in Excel format and had far, far more data than I wanted. I know Excel so I removed extraneous formatting and ‘cleaned’ the data before importing it into my Jupyter notebook. I know that data cleaning is done -and needs to be done- on imported CSV files, but is it appropriate to modify the source Excel file before importing it?
First, for this project there are definitely other factors to consider, like healthcare and access to it, which vary from country to country, as do diet, access to food, etc. I think it’s good that you had more questions and got the U.N. data.
For Q2, depending on the size and type of the data set, yes, I think it’s totally fine to use Excel to clean data, or whatever tool you feel comfortable with (regex, pandas, OpenRefine). I’ve done it myself, especially with Census data. You’re always going to have to clean up data to some extent. That doesn’t mean that you’re going to omit values, nulls, or pieces of data that you actually need for your overall analysis.
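For anyone who would rather keep the cleaning steps in the notebook, here is a minimal pandas sketch of that idea. The messy table, its column names, and the `clean` helper are all hypothetical; in a real notebook the raw frame would more likely come from `pd.read_excel` on the untouched download.

```python
import pandas as pd

# Made-up stand-in for a messy download: padded headers and values,
# an extra column we do not need, and one missing population figure.
raw = pd.DataFrame({
    " Country ": ["A ", " B", "C"],
    "Year": [2000, 2000, 2000],
    "Population": [100.0, None, 30.0],
    "Notes": ["x", "y", "z"],
})

def clean(df):
    """Return a tidied copy; the original frame is left unchanged."""
    out = df.copy()
    out.columns = out.columns.str.strip()          # tidy header whitespace
    out["Country"] = out["Country"].str.strip()    # tidy value whitespace
    # Keep only the columns needed for the analysis; missing values are kept.
    return out[["Country", "Year", "Population"]]

cleaned = clean(raw)

# Count the nulls instead of dropping them, so the gap stays visible.
missing = cleaned["Population"].isna().sum()

print(cleaned)
print("Rows with missing population:", missing)
```

Working on a copy means the steps can be rerun from the original file at any time, and the row with the missing value is still there when you decide how to handle it.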
For Q1- I think it depends on a number of factors.
What is the scope of the project? Where are you in defining the goals of the project? Have you (the data people) been involved in all the team discussions to define the tasks and goals? What’s the timeframe you are working with? Sometimes when you’re deep into a project you find more questions that weren’t considered when the goals were defined. If that’s the case, then you may have to push the project to the next sprint, talk to the data engineers, and gather more data.
If it’s more straightforward, say you work in marketing at a nonprofit and you want to see which version of an email was more successful (i.e., people made more donations), then the provided data is probably enough. But if you work on, say, a political campaign and you wonder why voter turnout isn’t higher in a certain district, perhaps you didn’t consider a variable like commute time to the polls from people’s places of employment. Or something like that, going by my made-up examples.
The key is making sure that data people are involved in team discussions at all points of a project–from inception to the end.
I just finished my project on the correlation between Life Expectancy and GDP and was wondering if anybody else felt the need to include population data in the analysis, since to me relative GDP (per capita) makes more sense as a parameter than absolute GDP. I’m glad to find someone asking the same question!
I liked your insights, and I reached some new conclusions as well. If you have time, visit and check it out: