Data Cleaning US Consensus with Pandas

Hey!

For task 9, I’m having trouble using the .fillna() function.
The instructions say:

We can fill in those nans by using pandas’ .fillna() function. You have the TotalPop per state, and you have the Men per state. As an estimate for the nan values in the Women column, you could use the TotalPop of that state minus the Men for that state.

But how exactly do we do that? How do you put the subtraction into .fillna()?

Here’s what I have so far:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import codecademylib3_seaborn
import glob

us_census = glob.glob("states*.csv")

df_list = []
for state in us_census:
  df_list.append(pd.read_csv(state))
  
us_census = pd.concat(df_list)

#separate genpop into females and males
# put all ethnic groups into one column "Ethnicity"

us_census['Income'] = us_census['Income'].replace(r'[\$,]', '', regex=True)
us_census['Income'] = pd.to_numeric(us_census['Income'])

split_gender = us_census['GenderPop'].str.split('_', expand=True)
us_census['Female'] = split_gender[1].str.split(r'(\d+)', expand=True)[1]
us_census['Male'] = split_gender[0].str.split(r'(\d+)', expand=True)[1]

#trying to understand how to use the .fillna but this doesn't work. 
values = us_census['TotalPop'] - us_census['Male']
us_census['Female'] = us_census['Female'].fillna(value=values)

us_census['Female'] = pd.to_numeric(us_census['Female'])

us_census['Male'] = pd.to_numeric(us_census['Male'])

# plt.scatter('Female', 'Income')
# plt.show()

print(us_census.columns)
print(us_census.dtypes)
print(us_census.head())

I wish there were a walkthrough video, since I’m not sure whether anything here is correct.

Thank you in advance!

Project link: US Consensus Project

Never mind, I figured it out (or at least I hope so). If anyone is interested, here’s what I did:

difference = us_census['TotalPop'] - us_census['Male']
us_census['Female'] = us_census['Female'].fillna(value=difference)

Actually, now that I’m looking at what I did before, it’s the same, so I don’t know. This is confusing.

That suggests the line you posted isn’t what you changed; something else is.
Running your code produces an error message complaining about types. Were those columns of types that support subtraction? One of the steps was to convert them to numerical values.
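To illustrate the type question above, here is a minimal sketch of the kind of error being described. The two sample rows are made up for illustration; the point is only that a numeric column minus a column of strings fails until the strings are converted.

```python
import pandas as pd

# Hypothetical sample rows standing in for the census data.
df = pd.DataFrame({
    "TotalPop": [4830620, 733375],
    "Male": ["2341093", "384160"],  # still strings, as produced by .str.split
})

# Inspect the column types before doing arithmetic.
print(df.dtypes)

try:
    df["TotalPop"] - df["Male"]  # number minus string
    failed = False
except TypeError as err:
    failed = True
    print("Subtraction failed:", err)

# After conversion the same subtraction is well defined.
df["Male"] = pd.to_numeric(df["Male"])
difference = df["TotalPop"] - df["Male"]
print(difference)
```

Printing `df.dtypes` right before the subtraction is usually the quickest way to spot this.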

The problem was running .fillna() before converting to numerical values.
If I convert the columns first and call .fillna() afterwards, the code runs.
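The ordering described above (convert to numeric first, then fill) can be sketched as follows. This is only an illustration on a small made-up DataFrame, not the project's official solution; the column names follow the first post and the numbers are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the real data comes from the states*.csv files.
df = pd.DataFrame({
    "TotalPop": [4830620, 733375, 6641928],
    "Male": ["2341093", "384160", "3299088"],   # strings after the split
    "Female": ["2489527", np.nan, "3342840"],   # one missing value
})

# Step 1: convert to numbers so that subtraction is defined.
df["Male"] = pd.to_numeric(df["Male"])
df["Female"] = pd.to_numeric(df["Female"])

# Step 2: build the estimate as a Series and pass it to fillna.
# fillna aligns the Series on the index and only fills the NaN slots.
estimate = df["TotalPop"] - df["Male"]
df["Female"] = df["Female"].fillna(estimate)

print(df)
```

Rows that already had a Female value keep it; only the missing row receives TotalPop minus Male.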

I was having the same problem… thanks for posting it in the forum.

I have the same problem: nothing appears in my plot.

The male and female columns are floats, but when I try to plot the graph nothing happens (all my values are NaN).

It would be great if someone from the forum or the Codecademy staff uploaded a video with the solution.

Thanks a lot.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import codecademylib3_seaborn
import glob

# use glob to match file names by pattern
us_census = glob.glob("states*.csv")

# iterate over every file whose name starts with states...
list_1 = []
for x in us_census:
  data = pd.read_csv(x)
  list_1.append(data)
us_census = pd.concat(list_1)

print(us_census.head())
print(us_census.dtypes)

# remove the dollar sign and convert the object column to float
us_census.Income = us_census['Income'].replace(r'[\$,]', '', regex=True)
us_census.Income = pd.to_numeric(us_census.Income)
#print(us_census.dtypes)

# split the GenderPop column into male and female
split_gender = us_census['GenderPop'].str.split('_', expand=True)
us_census['male'] = split_gender[0].str.split('(\d+)', expand=True)[1]
us_census['female'] = split_gender[1].str.split('(\d+)', expand=True)[1]

print(us_census.head())

# remove the M and the F from the column and convert it to float
us_census.male = us_census['male'].replace('[\w,]', '', regex=True)
us_census.male = pd.to_numeric(us_census.male)

us_census.female = us_census['female'].replace('[\w,]', '', regex=True)
us_census.female = pd.to_numeric(us_census.female)

#print(us_census.female)
print(us_census.dtypes)
#print(us_census.head())
print(us_census.female.mean())

# show the M vs. F plot ... but all the values are null
plt.scatter(us_census.male, us_census.female)
#plt.show()

valores = us_census['TotalPop'] - us_census['male']
us_census['female'] = us_census['female'].fillna(value=valores)

us_census['female'] = pd.to_numeric(us_census['female'])
us_census['male'] = pd.to_numeric(us_census['male'])

plt.scatter(us_census.male, us_census.Income)
plt.show()
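A possible cause of the all-NaN columns in the code above, offered as a sketch rather than a confirmed diagnosis: in a regular expression, \w matches letters, digits and the underscore, so replacing '[\w,]' with an empty string removes the digits along with the M and F. The sample strings below are made up and only assume the "2341093M_2489527F" shape implied by the split on '_'.

```python
import pandas as pd

# Hypothetical GenderPop values in the assumed "<digits>M_<digits>F" shape.
gender_pop = pd.Series(["2341093M_2489527F", "384160M_349215F"])

split_gender = gender_pop.str.split("_", expand=True)
# The capture group keeps the digits, so this column already holds
# strings made of digits only.
male = split_gender[0].str.split(r"(\d+)", expand=True)[1]

# \w also matches digits, so this leaves empty strings ...
emptied = male.replace(r"[\w,]", "", regex=True)
# ... and empty strings become NaN when converted.
broken = pd.to_numeric(emptied)
print(broken)

# Converting the digit strings directly keeps the values.
fixed = pd.to_numeric(male)
print(fixed)
```

If that is what is happening, dropping the replace step, or replacing only '[MF]', should leave real numbers in the male and female columns.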

Just posting another way of splitting the data because… diversity :)

gendersplit = us_census['GenderPop'].str.split('_')
us_census['MalePop'] = gendersplit.str.get(0).replace("[MF]","", regex=True)
us_census['MalePop'] = pd.to_numeric(us_census['MalePop'])
us_census['FemalePop'] = gendersplit.str.get(1).replace("[MF]","", regex=True)
us_census['FemalePop'] = pd.to_numeric(us_census['FemalePop'])