Data Wrangling and Tidying, error with DUPLICATES in Cleaning US Census Data

Hi! How are you? This is my first time in the community forums. I got stuck on Step 10 of the Cleaning US Census Data project.
When I run

duplicates = us_census.duplicated()

I get the error:

Traceback (most recent call last):
  File "script.py", line 47, in <module>
    duplicates = us_census.duplicated()
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 4969, in duplicated
    labels, shape = map(list, zip(*map(f, vals)))
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 4947, in f
    vals, size_hint=min(len(self), _SIZE_HINT_LIMIT)
  File "/usr/local/lib/python3.6/dist-packages/pandas/util/_decorators.py", line 208, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py", line 672, in factorize
    values, na_sentinel=na_sentinel, size_hint=size_hint, na_value=na_value
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py", line 508, in _factorize_array
    values, na_sentinel=na_sentinel, na_value=na_value
  File "pandas/_libs/hashtable_class_helper.pxi", line 1798, in pandas._libs.hashtable.PyObjectHashTable.factorize
  File "pandas/_libs/hashtable_class_helper.pxi", line 1718, in pandas._libs.hashtable.PyObjectHashTable._unique
TypeError: unhashable type: 'list'

May I know the reason why I am getting this error and how to solve it?

Anyway, here is the complete code up to that step:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import codecademylib3_seaborn
import glob

# Concatenate the files
files = glob.glob("states*.csv")
census_list = []
for filestates in files:
  data = pd.read_csv(filestates)
  census_list.append(data)

us_census = pd.concat(census_list)

# Strip the $ from Income so it can be converted to a numeric column
us_census.Income = us_census['Income'].replace(r'[\$]', "", regex = True)
us_census.Income = pd.to_numeric(us_census.Income)

print(us_census.GenderPop.head())
# The gender data needs to be divided
# Split on _ and store the result in a new column
us_census['gender_split'] = us_census.GenderPop.str.split('_')
# Create the new Women and Men columns
us_census['Women'] = us_census.gender_split.str.get(1)
us_census['Men'] = us_census.gender_split.str.get(0)

# Keep only the numbers
us_census.Women = us_census['Women'].replace('F',"", regex = True)
us_census.Men = us_census['Men'].replace('M',"", regex = True)

# Convert the columns to numbers
us_census.Women = pd.to_numeric(us_census.Women)
us_census.Men = pd.to_numeric(us_census.Men)
# Fill the NaN values in Women with TotalPop minus Men
us_census = us_census.fillna(value = {"Women": us_census.TotalPop - us_census.Men})

# Check that the types are OK
print(us_census.dtypes)
print(us_census.Women)

# Plot
plt.scatter(us_census.Women, us_census.Income)
plt.show()

duplicates = us_census.duplicated()

print(duplicates)

It looks like one or more list objects ended up in your dataframe at some point. The .duplicated() method relies on hashing to find unique values, and that fails for unhashable Python objects such as lists. You'll want to locate where you're assigning a list to your dataframe and consider using a hashable type instead (a tuple should work). For a list with multiple elements, splitting it across multiple columns in your dataframe may be more appropriate.
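To make that concrete, here is a minimal sketch of the failure on a made-up dataframe. The frame `toy` and its column names are hypothetical and only stand in for a dataframe that has lists in one column; it is not the census data.

```python
import pandas as pd

# Hypothetical toy frame: every cell of "parts" holds a Python list.
toy = pd.DataFrame({
    "name": ["a", "b", "a"],
    "parts": [["1", "2"], ["3", "4"], ["1", "2"]],
})

# Lists are unhashable, so the duplicate check raises TypeError.
try:
    toy.duplicated()
    raised = False
except TypeError as err:
    raised = True
    print("duplicated() failed:", err)

# Tuples are hashable, so the same check runs once the lists are converted.
toy["parts"] = toy["parts"].apply(tuple)
dupes = toy.duplicated()
print(dupes.tolist())
```

After the conversion the third row is flagged as a duplicate of the first, since both hold the same name and the same tuple.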

I guess it is the list that is used to read the files with glob.

files = glob.glob("states*.csv")
census_list = []
for filestates in files:
  data = pd.read_csv(filestates)
  census_list.append(data)

us_census = pd.concat(census_list)

I don’t know how to read the files without the list, or how to modify them after creating the df.

I think the issue occurs after the read. The list is used there, but you never add a list directly to the dataframe. Be careful when creating new columns: make sure you're actually passing what you intend to, and consider specifying a dtype to make that easier.
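If you want to check for this programmatically, one possible approach is to scan each column for cells that are Python lists. The helper `columns_holding_lists` and the toy frame below are hypothetical names made up for this sketch, not part of pandas or the project.

```python
import pandas as pd

def columns_holding_lists(df):
    """Return the names of columns where at least one cell is a Python list."""
    return [
        col for col in df.columns
        if df[col].map(lambda cell: isinstance(cell, list)).any()
    ]

# Hypothetical toy frame with one list-valued column.
toy = pd.DataFrame({
    "label": ["x", "y"],
    "count": [1, 2],
    "pieces": [["a"], ["b", "c"]],
})

found = columns_holding_lists(toy)
print(found)
```

Running the same helper on your own dataframe right before the .duplicated() call should point at the column to look into.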

In case you can't find it

I’d have a very close look at lines like this one:

us_census['gender_split'] = us_census.GenderPop.str.split('_')
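The reason is that .str.split() without expand=True stores a list in every cell of the new column. One way around it, sketched here on made-up data, is to pass expand=True so the pieces land in separate columns and no list is ever stored. The frame `toy` and its values are assumptions for illustration, shaped like the GenderPop column in the project.

```python
import pandas as pd

# Hypothetical rows shaped like the project's GenderPop column.
toy = pd.DataFrame({
    "State": ["A", "B", "A"],
    "GenderPop": ["10M_12F", "20M_21F", "10M_12F"],
})

# expand=True returns a DataFrame with one column per piece (labelled 0 and 1).
parts = toy["GenderPop"].str.split("_", expand=True)
toy["Men"] = pd.to_numeric(parts[0].str.replace("M", "", regex=False))
toy["Women"] = pd.to_numeric(parts[1].str.replace("F", "", regex=False))

# Every column now holds strings or numbers, so the check runs.
duplicates = toy.duplicated()
print(duplicates.tolist())
```

Alternatively, if you'd rather keep your code as it is, dropping the helper column with us_census.drop(columns='gender_split') before calling .duplicated() should also avoid the error.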