Solution: Cleaning US Census Data (Project)

I was doing this project and noticed that, unlike the other projects, there is no video walkthrough available for this one. So I thought posting my solution might help someone who is having trouble completing it. :blush:

Here’s my code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import codecademylib3_seaborn
import glob

files = glob.glob('states*.csv')
data_list = [pd.read_csv(file) for file in files]
us_census = pd.concat(data_list)

print(us_census.columns)

us_census['Income'] = us_census.Income.str[1:]
us_census['Income'] = pd.to_numeric(us_census.Income)

gender_split = us_census.GenderPop.str.split('_')
us_census['Men'] = gender_split.str.get(0)
us_census['Women'] = gender_split.str.get(1)

# replace 'M' & 'F'

us_census['Men'] = us_census.Men.str[:-1]
us_census['Women'] = us_census.Women.str[:-1]

# to number conversion

us_census['Men'] = pd.to_numeric(us_census.Men)
us_census['Women'] = pd.to_numeric(us_census.Women)

us_census = us_census.fillna(value={
    'Women': us_census.TotalPop - us_census.Men
})

# check for repeated states, then drop them using the same subset as the check
duplicates = us_census.duplicated(subset=['State'])
print(duplicates.value_counts())
us_census = us_census.drop_duplicates(subset=['State'])

plt.scatter(us_census['Women'], us_census['Income'], color=['red', 'green'])
plt.xlabel('Women')
plt.ylabel('Income')
plt.show()
plt.cla()

us_census['Hispanic'] = us_census.Hispanic.str[:-1]
us_census['Hispanic'] = pd.to_numeric(us_census.Hispanic)
us_census['White'] = us_census.White.str[:-1]
us_census['White'] = pd.to_numeric(us_census.White)
us_census['Black'] = us_census.Black.str[:-1]
us_census['Black'] = pd.to_numeric(us_census.Black)
us_census['Native'] = us_census.Native.str[:-1]
us_census['Native'] = pd.to_numeric(us_census.Native)
us_census['Asian'] = us_census.Asian.str[:-1]
us_census['Asian'] = pd.to_numeric(us_census.Asian)
us_census['Pacific'] = us_census.Pacific.str[:-1]
us_census['Pacific'] = pd.to_numeric(us_census.Pacific)

us_census = us_census.fillna(value={
    'Hispanic': us_census.Hispanic.mean(),
    'White': us_census.White.mean(),
    'Black': us_census.Black.mean(),
    'Native': us_census.Native.mean(),
    'Asian': us_census.Asian.mean(),
    'Pacific': us_census.Pacific.mean(),
})

plt.hist(us_census['Hispanic'])
plt.title('Hispanic')
plt.show()
plt.cla()

plt.hist(us_census['White'])
plt.title('White')
plt.show()
plt.cla()

plt.hist(us_census['Black'])
plt.title('Black')
plt.show()
plt.cla()

plt.hist(us_census['Native'])
plt.title('Native')
plt.show()
plt.cla()

plt.hist(us_census['Pacific'])
plt.title('Pacific')
plt.show()
plt.cla()

plt.hist(us_census['Asian'])
plt.title('Asian')
plt.show()

print(us_census.head())
print(us_census.dtypes)
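
For anyone who wants to see what the string-cleaning steps do without the project files, here is a minimal sketch on two made-up rows. The values are invented for illustration and only imitate the column formats the code above expects (a leading `$` on Income, and a `GenderPop` string such as `48M_52F`).

```python
import pandas as pd

# Made-up rows that imitate the columns used above; not the real census data.
toy = pd.DataFrame({
    'State': ['A', 'B'],
    'TotalPop': [100, 250],
    'Income': ['$43296.36', '$70354.74'],
    'GenderPop': ['48M_52F', '120M_F'],  # second row has no count for women
})

# Drop the leading '$' and convert to a float.
toy['Income'] = pd.to_numeric(toy['Income'].str[1:])

# Split '48M_52F' into ['48M', '52F'], strip the trailing letter, convert.
gender_split = toy['GenderPop'].str.split('_')
toy['Men'] = pd.to_numeric(gender_split.str.get(0).str[:-1])
toy['Women'] = pd.to_numeric(gender_split.str.get(1).str[:-1])  # '' becomes NaN

# Fill the missing women count with TotalPop - Men.
toy['Women'] = toy['Women'].fillna(toy['TotalPop'] - toy['Men'])

print(toy[['State', 'Income', 'Men', 'Women']])
```

The second row shows why the `fillna` step is needed: a bare `F` leaves an empty string after slicing, which `pd.to_numeric` turns into a missing value.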

How did you find the file names for the .csv files? I assume from your code that the naming convention is states1.csv, states2.csv, etc., but where did you find that? The question tells us to use the ‘navigator’, but where is that exactly?
Thanks

Click on that icon and you’ll be able to see the .csv files. And yes, they follow the pattern ‘states1, states2…etc’.
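
If you cannot find the file browser, `glob` itself can tell you which files match the pattern. This is a small sketch that creates a throwaway folder with hypothetical file names just so it runs anywhere; in the project you would only need the `glob.glob('states*.csv')` call.

```python
import glob
import os
import tempfile

# Create a throwaway folder with hypothetical file names to match against.
with tempfile.TemporaryDirectory() as folder:
    for name in ['states1.csv', 'states2.csv', 'notes.txt']:
        open(os.path.join(folder, name), 'w').close()

    # '*' matches any run of characters, so this finds every states file.
    matches = sorted(glob.glob(os.path.join(folder, 'states*.csv')))
    found = [os.path.basename(path) for path in matches]

print(found)
```

Printing the result of `glob.glob` is a quick way to confirm the naming convention before reading anything into pandas.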

Solution: Below is the link to the Jupyter Notebook I made for this project. As @method7564419324 states, there is no solution provided in the course.
Cleaning US Census Data

Thanks @method7564419324.

Can someone (preferably from Codecademy) tell me why there is no video walkthrough for this one?
It seems like the further you get into these courses/paths, the less thoroughly and consistently they are constructed.

Hi there,

Thanks for posting this. Quite helpful.

Can you explain this part and the rationale, please?

Thanks,
Kabir

It removes the percentage sign (which is the last character). Negative indexes count positions from the end, so [:-1] slices from the beginning up to, but not including, the last character.
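
A short sketch of that slice, first on a plain string and then on a column through the `.str` accessor. The percentage values are made up for illustration.

```python
import pandas as pd

text = '3.75%'
print(text[:-1])   # everything except the last character
print(text[-1])    # the last character on its own

# The .str accessor applies the same slice to every value in a column.
percentages = pd.Series(['3.75%', '61.88%', '0.03%'])  # made-up values
numbers = pd.to_numeric(percentages.str[:-1])
print(numbers.tolist())
```

The slice has to come before `pd.to_numeric`, because a string that still ends in `%` cannot be parsed as a number.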

Thank you very much. I really appreciate it.

Thanks a lot!! I was looking for this desperately.

Why not use:

for race in us_census.columns[3:9]:
    us_census[race] = us_census[race].str[:-1]
    us_census[race] = pd.to_numeric(us_census[race])

And the same for hist plotting
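
A sketch of what the same loop could look like for the histograms. It assumes the columns are already numeric and uses a small made-up frame in place of the real `us_census`; the column names are listed explicitly here rather than sliced by position, since the position of the race columns depends on how the files were read.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: no display available; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up, already numeric values standing in for the cleaned us_census columns.
us_census = pd.DataFrame({
    'Hispanic': [3.8, 5.9, 29.6],
    'White': [61.9, 60.9, 57.1],
    'Black': [31.3, 2.8, 3.9],
})

titles = []
for race in ['Hispanic', 'White', 'Black']:
    plt.hist(us_census[race])
    plt.title(race)
    titles.append(plt.gca().get_title())
    plt.show()
    plt.clf()  # start the next histogram on an empty figure
```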

OK, I got the code, but I don’t quite understand the histogram plot. Looking at the histogram and the values in the table for Asian, does it mean that for each range of values (0-5, 5-10, …) the plot shows the number of data points that fall within that range?

plt.hist(us_census['Asian'])
plt.title('Asian')
plt.show()
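
That reading is essentially right: each bar's height is the number of rows whose value falls inside that bar's range. One detail is that the ranges are not fixed at 0-5, 5-10 and so on. By default `plt.hist` splits the span between the smallest and largest value into 10 equal-width bins. The sketch below does the same counting with `np.histogram` on made-up numbers (not the real Asian column) so the bin edges and counts can be printed.

```python
import numpy as np

# Made-up percentages, not the real Asian column.
values = np.array([0.5, 1.2, 1.8, 2.4, 3.1, 4.9, 5.5, 8.0, 14.2, 36.6])

# plt.hist uses 10 equal-width bins by default; np.histogram does the same counting.
counts, edges = np.histogram(values, bins=10)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f'{left:6.2f} to {right:6.2f}: {count} value(s)')
```

Passing `bins=` to `plt.hist` changes how many ranges are used, which is worth trying when most values pile up in the first bar.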

Very helpful, thank you!

Agree! I personally found this project pretty challenging.

At Stage 10 I had no duplicates, and eventually realised it was because of the Unnamed: 0 column which had been created when the .csv files had been imported. I got rid of it and created a new unique index as follows:

census = pd.concat(census_dfs).reset_index(drop=True).drop('Unnamed: 0', axis=1)
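
To illustrate why that column hides the duplicates, here is a minimal sketch with two made-up frames that overlap on one state. The data is invented; only the `Unnamed: 0` column name comes from the project files.

```python
import pandas as pd

# Two made-up frames imitating files that overlap on one state.
first = pd.DataFrame({'Unnamed: 0': [0, 1], 'State': ['A', 'B'], 'TotalPop': [100, 250]})
second = pd.DataFrame({'Unnamed: 0': [0, 1], 'State': ['B', 'C'], 'TotalPop': [250, 300]})

census = pd.concat([first, second]).reset_index(drop=True)

# The repeated 'B' row has a different 'Unnamed: 0' value, so whole-row
# comparison does not flag it.
with_extra_column = census.duplicated().sum()

# After dropping that column the repeated row is detected.
without_extra_column = census.drop('Unnamed: 0', axis=1).duplicated().sum()

print(with_extra_column, without_extra_column)
```

An alternative that leaves the column in place is to compare on chosen columns only, for example `census.duplicated(subset=['State'])`.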

At Stage 13 I found the instruction to “make a bunch of histograms out of the race data” very vague, especially as we haven’t learnt anything about matplotlib so far. I spent ages trying to get my histogram to display and eventually found that plt.clf() clears the old plot so the new one can display.
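
A small sketch of what `plt.clf()` does between two plots, using made-up numbers. The `Agg` backend line is an assumption so that the example runs without a display; it is not needed in the Codecademy environment or a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: no display available
import matplotlib.pyplot as plt

plt.hist([1, 2, 2, 3])
plt.title('first')
bars_before = len(plt.gca().patches)

plt.clf()  # removes every axes from the current figure
axes_after_clf = len(plt.gcf().axes)

plt.hist([4, 5, 5, 6])
plt.title('second')
bars_after = len(plt.gca().patches)

print(bars_before, axes_after_clf, bars_after)
```

Without the `plt.clf()` call, the second histogram would be drawn on top of the first in the same axes.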

All in all it was frustrating and seemed less well thought out than previous projects in the Data Science career path, but it certainly teaches the skill of using Google to find solutions!
