I have just finished my project and would like to improve it, so please share any comments you have.
GitHub link for the project: Life_expectancy_GDP/life_expectancy_gdp.ipynb at master · HoangAnhNguyen269/Life_expectancy_GDP · GitHub
Thank you a lot.
#%% md
# Goal
#%% md
# Data
#%% md
Data sources:

- GDP: World Bank national accounts data and OECD National Accounts data files.
- Life expectancy: World Health Organization.

The data for this project is in `all_data.csv`. The dataset has the following columns:

- `Country`: the nation
- `Year`: the year of the observation
- `Life expectancy at birth (years)`: life expectancy value in years
- `GDP`: Gross Domestic Product in U.S. dollars
#%% md
# Analysis
#%% md
## import libraries
#%%
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
#%%
# %matplotlib notebook
%matplotlib inline
#%% md
## load the data
#%%
df = pd.read_csv("all_data.csv")
df.head()
#%%
df.shape #return the shape of the dataframe
#%% md
## check data cleanliness
#%%
df.info()  # the data looks clean: the non-null counts match the 96 rows of the (96, 4) shape
#%%
df.isna().sum()
#%%
df.drop_duplicates(inplace=True)
df.shape  # the shape is unchanged, so there were no duplicate rows
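#%% md
The checks above can also be collected into one helper. This is only a sketch: `cleanliness_report` is a hypothetical helper name, and `toy` is a small made-up dataframe standing in for `all_data.csv`, so the numbers are not from the project data. It also checks that no (Country, Year) pair appears twice, which dropping fully identical rows would miss if two rows differed only in a value column.

```python
import pandas as pd

# Toy stand-in for all_data.csv (values are made up for illustration).
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2000, 2001, 2000, 2001],
    "GDP": [1.0e11, 1.1e11, 5.0e9, 5.5e9],
})


def cleanliness_report(frame):
    """Summarise missing values and duplicates (hypothetical helper)."""
    return {
        # Total number of missing cells across all columns.
        "missing_values": int(frame.isna().sum().sum()),
        # Rows that are exact copies of an earlier row.
        "duplicate_rows": int(frame.duplicated().sum()),
        # Each country should appear at most once per year.
        "duplicate_country_years": int(
            frame.duplicated(subset=["Country", "Year"]).sum()
        ),
    }


report = cleanliness_report(toy)
print(report)
```

Calling `cleanliness_report(df)` on the real dataframe would return the same three counts computed on the project data.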
#%%
# The column name "Life expectancy at birth (years)" is awkward to type in code,
# so rename it to the shorter LEABY
df.rename(columns={'Life expectancy at birth (years)': 'LEABY'}, inplace=True)
df.columns
#%% md
## first look at the data
#%%
# Now let's see which countries and years are covered in the dataset
df["Country"].unique()
#%%
df["Year"].unique()
#%%
df.describe()
# The countries under investigation have a high LEABY on average
#%% md
## Answer the questions
#%% md
### Has life expectancy increased over time in the six nations?
#%%
# df["Year"] = df["Year"].apply(pd.to_datetime)
#%%
df["Year"]
#%%
plt.clf()
for nation in df["Country"].unique():
    # Select this country's rows once, then draw its line
    nation_df = df[df["Country"] == nation]
    plt.plot(nation_df["Year"], nation_df["LEABY"], label=nation)
plt.xlabel("Years")
plt.ylabel("Life expectancy at birth")
plt.legend()
plt.title("Life expectancy over years")
plt.show()
#%%
# Another way to make the above plot, using seaborn
plt.clf()
sns.lineplot(data=df, x="Year", y="LEABY", hue="Country")
plt.xlabel("Years")
plt.ylabel("Life expectancy at birth")
plt.legend()
plt.title("Life expectancy over years")
plt.show()
#%% md
We can see that LEABY trends upward over the years in all countries, most notably in Zimbabwe.
Now I will create an individual plot for each country.
#%%
df[df["Country"]=="Chile"]["LEABY"]
#%%
LEABY_facegrid = sns.FacetGrid(df, col="Country", col_wrap=3,
                               hue="Country", sharey=False)
LEABY_facegrid = (LEABY_facegrid.map(sns.lineplot, "Year", "LEABY")
                  .add_legend()
                  .set_axis_labels("Year", "Life expectancy at birth (years)"))
LEABY_facegrid;
#%% md
We can see that LEABY dipped in some years in Chile, Mexico and Zimbabwe. In the other countries the increase is roughly linear, but not perfectly smooth.
Overall, LEABY in each country has increased over the years.
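#%% md
The dips read off the plots can also be counted numerically. The sketch below is one possible way to do it, assuming the same `Country`, `Year` and `LEABY` columns; the `toy` dataframe is made up for illustration and the names `change` and `summary` are my own. It computes the year-over-year change per country, then reports the net change from the first to the last year and the number of years with a decline.

```python
import pandas as pd

# Toy data (made up): country A dips once, country B rises every year.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year": [2000, 2001, 2002, 2000, 2001, 2002],
    "LEABY": [70.0, 69.5, 71.0, 60.0, 61.0, 62.5],
})

# Sort so that diff() compares each year with the previous one.
toy = toy.sort_values(["Country", "Year"])
toy["change"] = toy.groupby("Country")["LEABY"].diff()

summary = toy.groupby("Country").agg(
    # Last value minus first value within each country.
    net_change=("LEABY", lambda s: s.iloc[-1] - s.iloc[0]),
    # Number of years in which LEABY fell compared with the year before.
    years_with_decline=("change", lambda s: int((s < 0).sum())),
)
print(summary)
```

A positive `net_change` together with a non-zero `years_with_decline` corresponds to the pattern described above: an overall increase with some dips along the way.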
#%% md
### Has GDP increased over time in the six nations?
#%%
# Plot GDP over time for all countries on one set of axes
plt.clf()
sns.lineplot(data=df, x="Year", y="GDP", hue="Country")
plt.xlabel("Years")
plt.ylabel("GDP (U.S. dollars)")  # the GDP column is in dollars, not trillions
# plt.legend()  # the default legend does not look nice here
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), title="Country")
plt.title("GDP of all countries over years")
plt.show()
#%% md
We can see that only China and the US show a sharp increase in GDP over the years. However, a deeper look is needed, because all countries are plotted on the same scale and that hides changes in the smaller economies.
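#%% md
One way around the shared-scale problem, besides separate plots, is to express each country's GDP relative to its own first year. The sketch below assumes the same `Country`, `Year` and `GDP` columns; the `toy` values are made up and `GDP_index` is a column name I introduce only for illustration. Each country starts at 100, so growth in a small economy becomes as visible as growth in a large one.

```python
import pandas as pd

# Toy data (made up): one very large and one very small economy.
toy = pd.DataFrame({
    "Country": ["Big", "Big", "Big", "Small", "Small", "Small"],
    "Year": [2000, 2001, 2002] * 2,
    "GDP": [1.0e13, 1.1e13, 1.2e13, 2.0e9, 3.0e9, 4.0e9],
})

# Sort so that "first" really is the earliest year of each country.
toy = toy.sort_values(["Country", "Year"])

# Broadcast each country's first-year GDP to all of its rows.
first_year_gdp = toy.groupby("Country")["GDP"].transform("first")

# Index: 100 in the first year, 200 means GDP has doubled since then.
toy["GDP_index"] = toy["GDP"] / first_year_gdp * 100
print(toy)
```

Plotting `GDP_index` instead of `GDP` with `hue="Country"` would then put all countries on a comparable scale.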
#%%
GDP_facegrid = sns.FacetGrid(df, col="Country", col_wrap=3,
                             hue="Country", sharey=False)
GDP_facegrid = (GDP_facegrid.map(sns.lineplot, "Year", "GDP")
                .add_legend()
                .set_axis_labels("Year", "GDP (U.S. dollars)"))
GDP_facegrid;
#%% md
We can see that, overall, all countries have increased their GDP over the years.
In particular, the US and China have increased sharply. Compared with China and the US, the other countries' GDP growth appeared small on the shared scale.
The increases in the other countries have not been steady and fluctuate significantly.
#%% md
### Is there a correlation between GDP and life expectancy of a country?
#%%
# A scatter plot is a good choice to investigate the relationship between GDP and LEABY
plt.clf()
sns.scatterplot(x=df.LEABY, y=df.GDP, hue=df.Country)
plt.xlabel("Life expectancy at birth")
plt.ylabel("GDP (U.S. dollars)")  # the GDP column is in dollars, not trillions
# plt.legend()  # the default legend does not look nice here
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), title="Country")
plt.title("GDP and Life expectancy in each country")
plt.show()
#%% md
At first glance, there appears to be no relationship between GDP and LEABY in Zimbabwe and Chile. In the other countries, we can see a positive correlation between LEABY and GDP, especially in China and the USA.
We need to create a plot for each country to take a deeper look.
#%%
LEABY_GDP = sns.FacetGrid(df, col="Country", col_wrap=3, hue="Country", sharey=False, sharex=False)
LEABY_GDP = (LEABY_GDP.map(sns.scatterplot, "LEABY", "GDP")
             .add_legend()
             .set_axis_labels("Life expectancy at birth (years)", "GDP (U.S. dollars)"));
LEABY_GDP
#%% md
Overall, once each country has its own axes, we can see a positive correlation between LEABY and GDP in all countries, including Zimbabwe and Chile.
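#%% md
The visual impression can be backed up with a number. The sketch below is one possible check, assuming the same `Country`, `LEABY` and `GDP` columns: it computes the Pearson correlation coefficient separately for each country with `Series.corr`. The `toy` values are made up and `per_country_r` is a name I chose for illustration, so the results say nothing about the real data.

```python
import pandas as pd

# Toy data (made up): A is perfectly linear, B is noisier but still positive.
toy = pd.DataFrame({
    "Country": ["A"] * 4 + ["B"] * 4,
    "LEABY": [70.0, 71.0, 72.0, 73.0, 60.0, 59.0, 61.0, 62.0],
    "GDP": [1.0e12, 1.2e12, 1.4e12, 1.6e12, 5.0e9, 6.0e9, 5.5e9, 7.0e9],
})

# Pearson correlation between LEABY and GDP within each country.
per_country_r = {
    country: group["LEABY"].corr(group["GDP"])
    for country, group in toy.groupby("Country")
}

for country, r in per_country_r.items():
    print(f"{country}: r = {r:.2f}")
```

Values close to 1 indicate a strong positive linear relationship, values near 0 indicate none. Computing the coefficient per country matters here because pooling countries with very different GDP levels can hide the within-country pattern.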
#%% md
### What is the average life expectancy in these nations?
#%%
df["LEABY"].describe()
#%% md
The average LEABY is 72. Let's take a deeper look by country.
#%%
df.groupby("Country")["LEABY"].mean().reset_index()
#%%
plt.clf()
sns.barplot(x="LEABY", y="Country", data=df, ci=None)
plt.xlabel("Life expectancy at birth (years)")
plt.title("Average life expectancy at birth (years) for countries")
plt.show()
#%% md
The average LEABY of every country except Zimbabwe is between 70 and 80.
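#%% md
The statement about the 70 to 80 range can be checked directly instead of being read off the bar chart. This is a small sketch on made-up `toy` data, assuming the same `Country` and `LEABY` columns; `country_means`, `in_band` and `outside_band` are names I introduce for illustration.

```python
import pandas as pd

# Toy data (made up): two countries inside the 70-80 band, one far below.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C", "C"],
    "LEABY": [75.0, 77.0, 72.0, 74.0, 45.0, 55.0],
})

# Mean LEABY per country, sorted from lowest to highest.
country_means = toy.groupby("Country")["LEABY"].mean().sort_values()

# True for countries whose mean lies between 70 and 80 (inclusive).
in_band = country_means.between(70, 80)
outside_band = country_means[~in_band].index.tolist()

print(country_means)
print("Outside the 70-80 band:", outside_band)
```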
#%% md
### What is the distribution of that life expectancy?
#%%
plt.clf()
# sns.distplot is deprecated; histplot with kde=True draws a comparable plot
sns.histplot(data=df, x="LEABY", kde=True)
plt.xlabel("Life expectancy at birth (years)");
#%% md
The distribution of LEABY in the data is strongly left-skewed.
#%%
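# displot creates its own figure, and plt.title would only label the last facet,
# so set a single title for the whole grid instead
g = sns.displot(data=df, x="LEABY", col="Country", col_wrap=3, kde=True)
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle("Life expectancy at birth (years) in each country")
plt.show()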
#%%
plt.clf()
graph = sns.FacetGrid(df, col="Country", col_wrap=3,
                      hue="Country", sharey=False, sharex=False)
graph = (graph.map(sns.histplot, "LEABY", kde=True, bins=10)
         .add_legend())
graph.fig.subplots_adjust(top=0.9)
graph.fig.suptitle('Life expectancy at birth (years) in each country')
plt.show()
#%%
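# plt.figure already starts a fresh figure, so no plt.clf() is needed here
# (calling it first would leave an extra empty figure in the output)
fig = plt.figure(figsize=(15, 6))
sns.violinplot(data=df, x="Country", y="LEABY")
plt.xticks(fontsize=12)
plt.show()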
#%%
plt.clf()
graph = sns.FacetGrid(df, col="Country", col_wrap=3,
                      hue="Country", sharey=False, sharex=False)
graph = (graph.map(sns.violinplot, "LEABY")
         .add_legend())
graph.fig.subplots_adjust(top=0.9)
graph.fig.suptitle('Life expectancy at birth (years) in each country')
plt.show()
#%% md
In the plots, China and Germany have left-skewed LEABY distributions, while Zimbabwe has a right-skewed LEABY distribution.
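#%% md
Skewness judged by eye can be cross-checked with the sample skewness that pandas provides through `Series.skew()`. The sketch below uses made-up `toy` data with the same `Country` and `LEABY` columns; the group names `Left` and `Right` and the variable `skew_by_country` are only for illustration. A negative value means a longer tail towards low values (left-skewed), a positive value means a longer tail towards high values (right-skewed).

```python
import pandas as pd

# Toy data (made up): one group with a tail to the left, one with a tail to the right.
toy = pd.DataFrame({
    "Country": ["Left"] * 5 + ["Right"] * 5,
    "LEABY": [60.0, 70.0, 74.0, 75.0, 76.0,
              60.0, 61.0, 62.0, 66.0, 76.0],
})

# Sample skewness of LEABY within each country.
skew_by_country = toy.groupby("Country")["LEABY"].apply(lambda s: s.skew())
print(skew_by_country)
```

With only a handful of yearly observations per country, the sign of the skewness is more informative than its exact size.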