Solution: Cleaning US Census Data (Project)

I was doing this project and just noticed that unlike all projects there is no video walkthrough available for this one. So thought it might help someone who is having trouble completing it. :blush:

Here’s my code-

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import codecademylib3_seaborn
import glob

files = glob.glob(‘states*.csv’)
data_list = [pd.read_csv(file) for file in files]
us_census = pd.concat(data_list)

print(us_census.columns)

us_census[‘Income’] = us_census.Income.str[1:]
us_census[‘Income’] = pd.to_numeric(us_census.Income)

gender_split = us_census.GenderPop.str.split(’_’)
us_census[‘Men’] = gender_split.str.get(0)
us_census[‘Women’] = gender_split.str.get(1)

replace ‘M’ & ‘F’

us_census[‘Men’] = us_census.Men.str[:-1]
us_census[‘Women’] = us_census.Women.str[:-1]

to number conversion

us_census[‘Men’] = pd.to_numeric(us_census.Men)
us_census[‘Women’] = pd.to_numeric(us_census.Women)

us_census = us_census.fillna(value={
‘Women’: us_census.TotalPop - us_census.Men
})

duplicates = us_census.duplicated(subset=[‘State’])
print(duplicates.value_counts())
us_census = us_census.drop_duplicates()

plt.scatter(us_census[‘Women’], us_census[‘Income’], color=[‘red’,‘green’])
plt.xlabel(‘Women’)
plt.ylabel(‘Income’)
plt.show()
plt.cla()

us_census[‘Hispanic’] = us_census.Hispanic.str[:-1]
us_census[‘Hispanic’] = pd.to_numeric(us_census.Hispanic)
us_census[‘White’] = us_census.White.str[:-1]
us_census[‘White’] = pd.to_numeric(us_census.White)
us_census[‘Black’] = us_census.Black.str[:-1]
us_census[‘Black’] = pd.to_numeric(us_census.Black)
us_census[‘Native’] = us_census.Native.str[:-1]
us_census[‘Native’] = pd.to_numeric(us_census.Native)
us_census[‘Asian’] = us_census.Asian.str[:-1]
us_census[‘Asian’] = pd.to_numeric(us_census.Asian)
us_census[‘Pacific’] = us_census.Pacific.str[:-1]
us_census[‘Pacific’] = pd.to_numeric(us_census.Pacific)

us_census = us_census.fillna(value={
‘Hispanic’: us_census.Hispanic.mean(),
‘White’: us_census.White.mean(),
‘Black’: us_census.Black.mean(),
‘Native’: us_census.Native.mean(),
‘Asian’: us_census.Asian.mean(),
‘Pacific’: us_census.Pacific.mean(),
})

plt.hist(us_census[‘Hispanic’])
plt.title(‘Hispanic’)
plt.show()
plt.cla()

plt.hist(us_census[‘White’])
plt.title(‘White’)
plt.show()
plt.cla()

plt.hist(us_census[‘Black’])
plt.title(‘Black’)
plt.show()
plt.cla()

plt.hist(us_census[‘Native’])
plt.title(‘Native’)
plt.show()
plt.cla()

plt.hist(us_census[‘Pacific’])
plt.title(‘Pacific’)
plt.show()
plt.cla()

plt.hist(us_census[‘Asian’])
plt.title(‘Asian’)
plt.show()

print(us_census.head())
print(us_census.dtypes)

11 Likes

How did you find the file names for the .csv files? I assume from your code the naming convention is states1.csv, states2.csv etc, but where did you find that? The question tells us to use the ‘navigator’ where is that exactly…?
Thanks

1 Like

Click on that icon, You’ll be able to see the .csv files, and yeah they are in the pattern ‘states1, states2…etc’

4 Likes

Solution: Below is the link to the Jupyter Notebook I made for this project, as @ method7564419324 states there is no solution provided on the course.
Cleaning US Census Data

6 Likes

Thanks @method7564419324.

Can someone (preferably from CodeAcademy) tell me why there is no video walkthrough for this one?
It seems like the further you get into those courses/paths, the less thoroughly and consistent they are constructed.

8 Likes

Hi there,

Thanks for posting this. Quite helpful.

Can you explain this part and the rationale, please?

Thanks,
Kabir

It removes the percentage sign (which is the last character). Negative indexes represent positions from the end. [:-1] means the slicing from the beginning to the last 1 character (excluded).

1 Like

Thank you very much. I really appreciated.

thanks a lot!!was looking for it desperately

Why not use:

for race in us_census.columns[3:9]:
us_census[race] = us_census[race].str[:-1]
us_census[race] = pd.to_numeric(us_census[race])

And the same for hist plotting

1 Like

Ok. I got the code but i dont quite understand the plot for histogram. Just by looking at the histogram and the values in the table for Asian, does it mean that for the values between 0-5, 5-10,… the number of data sets that fall within that range is plotted?

plt.hist(us_census[‘Asian’])
plt.title(‘Asian’)
plt.show()

Very helpful, thank you!

Agree! I personally found this project pretty challenging.

At Stage 10 I had no duplicates, and eventually realised it was because of the Unnamed: 0 column which had been created when the .csv files had been imported. I got rid of it and created a new unique index as follows:

census = pd.concat(census_dfs).reset_index(drop=True).drop('Unnamed: 0', axis =1)

At Stage 13 I found the instruction to “make a bunch of histograms out of the race data” very vague, especially as we so far haven’t learnt anything about matplotlib. I spend ages trying to get my histogram to display and eventually found that plt.clf() clears the old plot so the new one can display.

All in all it was frustrating and seemed less well thought out than previous projects in the Data Science career path, but it certainly teaches the skill of using Google to find solutions!

2 Likes

Hi and thanks for putting up the code, very helpful!
I ran into a problem with the number conversion, so after

us_census[‘Men’] = pd.to_numeric(us_census.Men)
us_census[‘Women’] = pd.to_numeric(us_census.Women)

‘Men’ becomes an integer and ‘Women’ a float value with a decimal point (for example 2489527.0). So I am wondering why this is?
And when I try to fill the missings with

us_census[‘Women’] = us_census.fillna(value = {‘Women’: us_census.TotalPop - us_census.Men})

the values for Women become same as the index (0,1,2,…). So just wondering what I’m doing wrong.

My code up to that point (without imports):

files = glob.glob(‘states*.csv’)
df_list =

for filename in files:
data = pd.read_csv(filename)
df_list.append(data)
us_census = pd.concat(df_list)

print(us_census.columns)

print(us_census.dtypes)

print(us_census.head())

us_census[‘Income’] = us_census[‘Income’].str[1:]
us_census[‘Income’] = pd.to_numeric(us_census.Income)

split = us_census.GenderPop.str.split("_")

us_census[‘Men’] = split.str.get(0)
us_census[‘Women’] = split.str.get(1)

us_census[‘Men’] = us_census.Men.str[:-1]
us_census[‘Women’] = us_census.Women.str[:-1]

us_census[‘Men’] = pd.to_numeric(us_census.Men)
us_census[‘Women’] = pd.to_numeric(us_census.Women)

print(us_census.Men, us_census.Women)
print(us_census.dtypes)

us_census[‘Women’] = us_census.fillna(value = {‘Women’: us_census.TotalPop - us_census.Men})

Its kind of string indice. The last one elemnet in string will be removed

files = glob.glob(“states*.csv”)
df_list =

for filename in files:
data = pd.read_csv(filename, index_col=0)
df_list.append(data)

df = pd.concat(df_list)

print(df.columns)

print(df.dtypes)

df.Income = df.Income.str.replace(’$’, ‘’)
df.Hispanic = df.Hispanic.str.replace(’%’, ‘’)
df.White = df.White.str.replace(’%’, ‘’)
df.Black = df.Black.str.replace(’%’, ‘’)
df.Native = df.Native.str.replace(’%’, ‘’)
df.Asian = df.Asian.str.replace(’%’, ‘’)
df.Pacific = df.Pacific.str.replace(’%’, ‘’)

Columns to be applied numeric

cols = [‘Hispanic’,‘White’,‘Black’,‘Native’,‘Asian’,‘Pacific’,‘Income’]

# Apply columns into numeric floats.

df[cols] = df[cols].apply(pd.to_numeric, errors=‘coerce’, axis=1)

gender_split = df.GenderPop.str.split(’_’)

Get Men & Women

df[‘Men’] = gender_split.str.get(0)
df[‘Women’] = gender_split.str.get(1)

Remove the [M’F] at the end of str

df[‘Men’] = df.Men.str[:-1]
df[‘Women’] = df.Women.str[:-1]

Columns Men & Women

gender = [‘Men’,‘Women’]

Change columns into numeric values.

df[gender] = df[gender].apply(pd.to_numeric, errors=‘coerce’, axis=1)

plt.scatter(df.Women, df.Income)
plt.show()

Checking how many rows are NaN

print(df.Women.isna().value_counts())

women_per_state = df.TotalPop - df.Men

df.fillna(women_per_state)

Checking if we still have a NaN values

print(df.Women.isna().value_counts())

Checking for duplicates

print(df.duplicated().value_counts())

new_df = df.drop_duplicates()

print(new_df.duplicated().value_counts())

print(new_df.head())

new_df[‘Women’] = new_df[‘Women’].astype(float)

print(new_df.dtypes)

plt.scatter(new_df.Women, new_df.Income)

plt.show()

df = new_df

print(df.head())

print(df.columns)

df = df.fillna(value={‘Hispanic’: df.Hispanic.mean(),
‘White’: df.White.mean(),
‘Black’: df.Black.mean(),
‘Native’: df.Native.mean(),
‘Asian’: df.Asian.mean(),
‘Pasicif’: df.Pacific.mean()})

plt.hist(df.White)
plt.title(‘White’)
plt.show()
plt.cla()

plt.hist(df.Black)
plt.title(‘Black’)
plt.show()
plt.cla()

Hi everyone, this is my code above but is there any short way to do this code?

df.Income = df.Income.str.replace(’$’, ‘’)
df.Hispanic = df.Hispanic.str.replace(’%’, ‘’)
df.White = df.White.str.replace(’%’, ‘’)
df.Black = df.Black.str.replace(’%’, ‘’)
df.Native = df.Native.str.replace(’%’, ‘’)
df.Asian = df.Asian.str.replace(’%’, ‘’)
df.Pacific = df.Pacific.str.replace(’%’, ‘’)

1 Like

I like how you bulk processed the to_numeric function. I think to quickly replace all the %, you can melt/unpivot the race category, so all the population ‘%’ will become one column, and then you only need to replace the ‘%’ once. @fiko_94

Here is my code to"melt"/ tidy the data. I wonder how can I melt both race and gender in one line of code vs have to process them separately?

#step 1: melt race
us_census2=pd.melt(frame=us_census2, id_vars=[“State”,“TotalPop”,“Income”,“Men”,“Women”], value_vars=[“Hispanic”,“White”,“Black”,“Native”,“Asian”,“Pacific”], value_name=“Pop_Pct”, var_name=“Race”)

#step 2: remove %
us_census2[“Pop_Pct”]=us_census2[“Pop_Pct”].str[:-1]

#step 3: melt gender
us_census3=pd.melt(frame=us_census2, id_vars=[“State”,“TotalPop”,“Income”,“Race”,“Pop_Pct”], value_vars=[“Men”,“Women”], value_name=“Pop_by_Gender”, var_name=“Gender”)

2 Likes

Thanks a lot guys for sharing your codes.

Also, I’d like share my codes for converting the race data. I built a function instead to make things quicker.

# Function to convert Races Data
def convert_data(df, column, letter):
  df[column] = df[column].str.strip(letter)
  df[column] = pd.to_numeric(df[column])
  return df[column]

# Convert all races data to float/int
census_df.Hispanic = convert_data(census_df, 'Hispanic', '%')
census_df.White = convert_data(census_df, 'White', '%')
census_df.Black = convert_data(census_df, 'Black', '%')
census_df.Native = convert_data(census_df, 'Native', '%')
census_df.Asian = convert_data(census_df, 'Asian', '%')
census_df.Pafific = convert_data(census_df, 'Pacific', '%')

Also, I think it’s better to build a function for creating histograms. There’re much fewer codes though.

def histogram(df, column):
  plt.hist(df[column])
  plt.title(column)
  plt.show()
  plt.cla()

histogram(census_df,'Hispanic')
histogram(census_df,'White')
histogram(census_df,'Black')
histogram(census_df,'Native')
histogram(census_df,'Asian')
histogram(census_df,'Pacific')
1 Like

I put a solution together on github using jupyter notebook with an additional nice map on population by state using geopandas [U.S. Census - full solution](The solution to the Project is here: Git Hub - US Census Data)

Happy coding!

1 Like

Aquí tienen mi solución, por su alguien gusta revisarla:

My Solution