Data Cleaning Pandas Project Feedback: US Census

Hi Codecademy team,

I have just finished the Data Cleaning module in the DS Path. At the end of the module there is a project to clean the US Census data. Overall, I am very happy with the knowledge I gained from this module and with my ability to apply it to the project. I am sure these tools are just the tip of the iceberg.

The reason I am posting is to get some constructive feedback on my Python code from other aspiring data scientists/analysts. Below is my code:

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import codecademylib3_seaborn
import glob

#Number 2
states_files = glob.glob('states[0-9].csv')
df_list = []
for file in states_files:
  df = pd.read_csv(file)
  df_list.append(df)
us_census = pd.concat(df_list)

#Number 3 and 4
print(us_census.columns)
print(us_census.dtypes)
#Renaming a column
us_census.rename(columns = {'Unnamed: 0' : 'ID'}, inplace = True)

# Number 5
#removing the $ sign (raw string avoids an invalid-escape warning; replace with '' rather than a space)
us_census['Income'] = us_census['Income'].replace(r'\$', '', regex = True)
#convert to numeric and round to 2 decimal places
us_census['Income'] = pd.to_numeric(us_census['Income'])
us_census['Income'] = round(us_census['Income'], 2)

# Number 6
gender_split = us_census['GenderPop'].str.split('_')
# print(gender_split)
m = gender_split.str.get(0)
f = gender_split.str.get(1)
us_census['Men'] = m
us_census['Women'] = f

#Number 7
#Separating the numbers from the M and F 
men_split = us_census['Men'].str.split(r'(\d+)', expand = True)
women_split = us_census['Women'].str.split(r'(\d+)', expand = True)
us_census['Men'] = pd.to_numeric(men_split[1])
us_census['Women'] = pd.to_numeric(women_split[1])
us_census = us_census.drop('GenderPop', axis = 1)

# Number 9
us_census = us_census.fillna(value = {'Women': us_census['TotalPop'] - us_census['Men']})

# Number 10
census_duplicates = us_census.duplicated()
# print(census_duplicates.value_counts())

# Number 11
# No duplicates were found and therefore, no need to execute .drop_duplicates() function

# Number 8
plt.scatter(us_census['Women'], us_census['Income'])
plt.xlabel('Number of Women')
plt.ylabel('Income Level')
plt.title('The relationship between Women and Income Level')
plt.show()

# Number 13
print(us_census.dtypes)

# Number 14
for column in us_census.columns[3:9]:
  us_census[column] = us_census[column].replace('%', '', regex = True)
  us_census[column] = pd.to_numeric(us_census[column])
  us_census = us_census.fillna(value = {column : 0})
#print(us_census)

#plotting histogram for each ethnicity 
plt.close('all')
plt.figure(figsize = (10,40))
x = 1
colors = ['blue', 'red', 'green', 'orange', 'pink', 'purple']
for column in us_census.columns[3:9]:
  plt.subplot(6,1,x)
  plt.hist(us_census[column], color = colors[x-1])
  plt.xlabel('{} Population Number in Percentage'.format(column))
  plt.ylabel('Frequency')
  plt.title('{}'.format(column))
  plt.subplots_adjust(hspace = .25)
  x += 1  
plt.show()

Now, I can’t post the output of my code here, as the tables and graphs are pretty big and it would be too messy. But I would like you to pay particular attention to the code for number 14.
I spent about 3 hours and 2 cups of tea on that block of code. Its objective is to loop over the ethnicity columns in the dataset and create a histogram for each one. The code does work, but I am curious whether you have a simpler way to write this loop.
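One possible simplification of that loop, offered only as a sketch: `plt.subplots` can create all the axes up front, and `zip` can pair each axis with its column and color, which removes the manual `x` counter. The helper name `plot_histograms`, the column names, and the numbers in `demo` are made up for illustration; they are assumptions, not the real census data.

```python
import pandas as pd
from matplotlib import pyplot as plt


def plot_histograms(df, columns, colors, figsize=(10, 40)):
    """Draw one histogram per column, stacked vertically in a single figure."""
    # squeeze=False always returns a 2-D array of axes, even for one column
    fig, axes = plt.subplots(len(columns), 1, figsize=figsize, squeeze=False)
    axes = axes.ravel()
    # zip pairs each axis with its column and color, so no counter is needed
    for ax, column, color in zip(axes, columns, colors):
        ax.hist(df[column], color=color)
        ax.set_xlabel('{} Population Number in Percentage'.format(column))
        ax.set_ylabel('Frequency')
        ax.set_title(column)
    fig.subplots_adjust(hspace=0.25)
    return fig, axes


# Hypothetical column names and made-up values, for illustration only
race_columns = ['Hispanic', 'White', 'Black', 'Native', 'Asian', 'Pacific']
colors = ['blue', 'red', 'green', 'orange', 'pink', 'purple']
demo = pd.DataFrame({name: [1.0, 2.5, 4.0, 2.0] for name in race_columns})

fig, axes = plot_histograms(demo, race_columns, colors)
```

Calling `plt.show()` afterwards displays the figure. With the real data, `us_census` and `us_census.columns[3:9]` would take the place of `demo` and `race_columns`.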

To be honest, although that block of code (for number 14) does work, I don’t particularly like the way it looks. It looks messy to me, and I am not sure other data scientists reading it would understand what I have written and what it produces.
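For readability, one option is to move the percent cleanup of number 14 into a small named function and pass the column names explicitly instead of slicing by position. This is a rough sketch under assumptions: `clean_percent_columns`, the column names, and the sample values are hypothetical, and filling missing values with 0 simply mirrors the choice made in the code above.

```python
import pandas as pd


def clean_percent_columns(df, columns):
    """Return a copy of df with '%' stripped and the given columns made numeric."""
    cleaned = df.copy()
    for column in columns:
        # Strip the literal percent sign; missing values stay missing
        stripped = cleaned[column].str.replace('%', '', regex=False)
        # Convert to numbers, then fill gaps with 0 as in the original loop
        cleaned[column] = pd.to_numeric(stripped).fillna(0)
    return cleaned


# Made-up sample rows, for illustration only
sample = pd.DataFrame({
    'State': ['A', 'B'],
    'Hispanic': ['3.5%', None],
    'White': ['61.25%', '60.5%'],
})
cleaned = clean_percent_columns(sample, ['Hispanic', 'White'])
print(cleaned.dtypes)
```

Returning a copy keeps the original frame untouched, so the cleanup step can be rerun or checked in isolation.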

Therefore, I would be “very very very x 10” grateful if you could spare some time to look through my work and give me some tips (if there is anything I can improve).

Thanks heaps,

Jimmy

I’m going through this now…quite slowly.
It would be great if there were hints after all the questions. :slight_smile:
Or, maybe a video in the ‘get unstuck’ portion.(?)

For me, there’s a disconnect between what is learned in the lesson (and my notes) and what’s asked in this project. :grimacing: So, it’s been very slow going.


Hey Lisa,

I am very sorry for the late reply… I have been racking my brain over the Machine Learning Supervised Learning section.
Please do take your time going through this… No rush at all.
It’s better to go slowly than not at all.
I’d really appreciate your effort and am looking forward to your feedback on my code :slight_smile:

Kind Regards,
Jimmy

No worries! I was going through the lesson myself and I got stuck a few times!
I haven’t really looked at your code above. I’ll check it out later.
