Data Cleaning with Pandas - [Cleaning US Census Data]

Hi,
I have just finished the exercise in this link: Cleaning US Census Data

To be honest, it was not easy.
I would like to share what I did and would love to hear feedback on the code I have written.
Some parts of the data cleansing portion might be a little long-winded, so if you have suggestions for improving it, please let me know! Any help would be greatly appreciated.

Importing Pandas and Other Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as pyplot
import codecademylib3_seaborn
import glob

Combining Multiple Files

# Collect every file whose name matches states*.csv
state_files = glob.glob('states*.csv')
df_list = []
for state_file in state_files:
    data = pd.read_csv(state_file)
    df_list.append(data)
# Stack the individual DataFrames into one
us_census = pd.concat(df_list)
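
A slightly shorter variant of the loop above, as a rough sketch only. The temporary folder and the two tiny CSV files are made up so that the snippet runs on its own; in the exercise you would glob the real states*.csv files instead. Passing ignore_index=True to pd.concat gives the combined frame a fresh 0..n-1 index instead of repeating each file's own row numbers.

```python
import glob
import os
import tempfile

import pandas as pd

# Write two small, made-up CSV files so the example is self-contained.
with tempfile.TemporaryDirectory() as folder:
    for i in range(2):
        frame = pd.DataFrame({'State': [f'S{i}a', f'S{i}b'], 'TotalPop': [1, 2]})
        frame.to_csv(os.path.join(folder, f'states{i}.csv'), index=False)

    # sorted() makes the file order predictable; glob alone does not guarantee it.
    state_files = sorted(glob.glob(os.path.join(folder, 'states*.csv')))

    # Read every file and stack the results, renumbering the rows from 0.
    combined = pd.concat([pd.read_csv(path) for path in state_files],
                         ignore_index=True)

print(combined)
```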

Data Cleansing

# '$' is a regex anchor, so it has to be escaped to match a literal dollar sign
us_census['Income'] = us_census['Income'].replace(r'\$', '', regex=True)
us_census['Income'] = pd.to_numeric(us_census['Income'])
# Split GenderPop on the underscore into the men and women counts
us_census['Men'] = us_census['GenderPop'].str.split('_').str[0]
us_census['Women'] = us_census['GenderPop'].str.split('_').str[1]
# Drop the trailing M / F markers and convert to numbers
us_census['Men'] = us_census['Men'].replace('M', '', regex=True)
us_census['Men'] = pd.to_numeric(us_census['Men'])
us_census['Women'] = us_census['Women'].replace('F', '', regex=True)
us_census['Women'] = pd.to_numeric(us_census['Women'])
# Fill missing Women counts with TotalPop minus Men
us_census = us_census.fillna(value={'Women': us_census['TotalPop'] - us_census['Men']})
us_census = us_census.drop_duplicates()
# Strip the percent signs from the race columns, then convert to numbers
us_census['Hispanic'] = us_census['Hispanic'].replace('%', '', regex=True)
us_census['White'] = us_census['White'].replace('%', '', regex=True)
us_census['Black'] = us_census['Black'].replace('%', '', regex=True)
us_census['Native'] = us_census['Native'].replace('%', '', regex=True)
us_census['Asian'] = us_census['Asian'].replace('%', '', regex=True)
us_census['Pacific'] = us_census['Pacific'].replace('%', '', regex=True)
us_census['Hispanic'] = pd.to_numeric(us_census['Hispanic'])
us_census['White'] = pd.to_numeric(us_census['White'])
us_census['Black'] = pd.to_numeric(us_census['Black'])
us_census['Native'] = pd.to_numeric(us_census['Native'])
us_census['Asian'] = pd.to_numeric(us_census['Asian'])
us_census['Pacific'] = pd.to_numeric(us_census['Pacific'])
# Fill missing percentages with each column's mean
us_census.fillna(value={
    'Pacific': us_census['Pacific'].mean(),
    'Hispanic': us_census['Hispanic'].mean(),
    'White': us_census['White'].mean(),
    'Black': us_census['Black'].mean(),
    'Native': us_census['Native'].mean(),
    'Asian': us_census['Asian'].mean(),
}, inplace=True)
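
Since the cleansing part is the long-winded bit, here is one possible way to condense it, as a sketch rather than a definitive version. The clean_census function name and the three-row sample frame are hypothetical and exist only so the snippet runs on its own. It also assumes GenderPop values look like "40M_60F". Two changes from the code above: str.replace with regex=False treats '$' as a literal character, so no escaping is needed, and a loop handles the race columns instead of one line per column.

```python
import pandas as pd

# Made-up sample in the same shape as the census data (values are illustrative only).
sample = pd.DataFrame({
    'State': ['A', 'B', 'C'],
    'TotalPop': [100, 200, 300],
    'Income': ['$1000.50', '$2000.00', '$3000.25'],
    'GenderPop': ['40M_60F', '90M_F', '150M_150F'],
    'Hispanic': ['10.0%', None, '30.0%'],
    'White': ['50.0%', '60.0%', None],
})
race_columns = ['Hispanic', 'White']


def clean_census(df, race_columns):
    """Return a cleaned copy of df (hypothetical helper, not part of the exercise)."""
    df = df.copy()

    # regex=False treats '$' as a plain character instead of a regex anchor.
    df['Income'] = pd.to_numeric(df['Income'].str.replace('$', '', regex=False))

    # Split "40M_60F" into two columns, then drop the trailing letters.
    gender = df['GenderPop'].str.split('_', expand=True)
    df['Men'] = pd.to_numeric(gender[0].str.rstrip('M'))
    # errors='coerce' turns an empty string (a missing count) into NaN.
    df['Women'] = pd.to_numeric(gender[1].str.rstrip('F'), errors='coerce')
    df['Women'] = df['Women'].fillna(df['TotalPop'] - df['Men'])

    # Same steps for every race column: strip '%', convert, fill gaps with the mean.
    for column in race_columns:
        df[column] = pd.to_numeric(df[column].str.rstrip('%'))
        df[column] = df[column].fillna(df[column].mean())

    return df.drop_duplicates()


cleaned = clean_census(sample, race_columns)
print(cleaned)
```

With the real data, race_columns would be the six columns used above ('Hispanic', 'White', 'Black', 'Native', 'Asian', 'Pacific').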

Plotting the Histograms

fig, ax = pyplot.subplots(2,3)
ax[0][0].hist(us_census['Hispanic'])
ax[0][1].hist(us_census['Pacific'])
ax[0][2].hist(us_census['White'])
ax[1][0].hist(us_census['Black'])
ax[1][1].hist(us_census['Native'])
ax[1][2].hist(us_census['Asian'])
ax[0][0].set(title='Hispanic', xlabel='% of population', ylabel='Number of states')
ax[0][1].set(title='Pacific', xlabel='% of population', ylabel='Number of states')
ax[0][2].set(title='White', xlabel='% of population', ylabel='Number of states')
ax[1][0].set(title='Black', xlabel='% of population', ylabel='Number of states')
ax[1][1].set(title='Native', xlabel='% of population', ylabel='Number of states')
ax[1][2].set(title='Asian', xlabel='% of population', ylabel='Number of states')
fig.suptitle('Histograms for different races', y=1.05, fontsize=15)
fig.tight_layout()
pyplot.show()
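
The plotting section repeats the same two calls six times, so it could probably be written as a loop. A minimal sketch, assuming a hypothetical helper called plot_race_histograms and a made-up sample frame (the numbers are not census data). axes.flat walks the 2x3 grid row by row, so each column name is paired with one subplot.

```python
import matplotlib.pyplot as pyplot
import pandas as pd

race_columns = ['Hispanic', 'Pacific', 'White', 'Black', 'Native', 'Asian']

# Made-up percentages so the example runs without the census files.
sample = pd.DataFrame({
    name: [5.0 * i + j for j in range(10)]
    for i, name in enumerate(race_columns)
})


def plot_race_histograms(df, columns, nrows=2, ncols=3):
    """Draw one histogram per column on an nrows x ncols grid (hypothetical helper)."""
    fig, axes = pyplot.subplots(nrows, ncols)
    # axes.flat iterates over the grid row by row.
    for ax, column in zip(axes.flat, columns):
        ax.hist(df[column])
        ax.set(title=column, xlabel='% of population', ylabel='Number of states')
    fig.suptitle('Histograms for different races', y=1.05, fontsize=15)
    fig.tight_layout()
    return fig, axes


fig, axes = plot_race_histograms(sample, race_columns)
```

Calling pyplot.show() afterwards displays the figure, the same as in the code above.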

Histograms plotted below

[image: histograms of the six race columns]


This was amazingly helpful. Thank you so much for taking the time to support the community!