US Census Project: Error When Checking for Duplicates

I am getting an error message on Task 10 of this project when I try to check for duplicates. Really not sure what is going on because everything has worked up until this point and this seems like a pretty straightforward task. Code is below. What am I missing here?

import pandas as pd
import numpy as np
import matplotlib.pyplot as pyplot
import codecademylib3_seaborn
import glob

files = glob.glob('states*.csv')
df_list = []
for filename in files:
  data = pd.read_csv(filename)
  df_list.append(data)

us_census = pd.concat(df_list)

print(us_census.columns)

print(us_census.dtypes)

print(us_census.head())

us_census.Income = us_census['Income'].replace('[\$,]', '', regex=True)

us_census['str_split'] = us_census.GenderPop.str.split('_')

us_census['men'] = us_census.str_split.str.get(0)

us_census['women'] = us_census.str_split.str.get(1)

us_census.men = us_census['men'].replace('[M,]', '', regex=True)

us_census.women = us_census['women'].replace('[F,]', '', regex=True)

us_census.Income = pd.to_numeric(us_census.Income)

us_census.men = pd.to_numeric(us_census.men)

us_census.women = pd.to_numeric(us_census.women)

pyplot.scatter(us_census.women, us_census.Income)

pyplot.show()

print(us_census.women)

us_census = us_census.fillna(value={'women': (us_census.TotalPop - us_census.men)})

print(us_census.women)

duplicates = us_census.duplicated()

This is the error I’m getting:

Traceback (most recent call last):
  File "script.py", line 49, in <module>
    duplicates = us_census.duplicated()
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 4954, in duplicated
    labels, shape = map(list, zip(*map(f, vals)))
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 4932, in f
    vals, size_hint=min(len(self), _SIZE_HINT_LIMIT)
  File "/usr/local/lib/python3.6/dist-packages/pandas/util/_decorators.py", line 208, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py", line 672, in factorize
    values, na_sentinel=na_sentinel, size_hint=size_hint, na_value=na_value
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py", line 508, in _factorize_array
    values, na_sentinel=na_sentinel, na_value=na_value
  File "pandas/_libs/hashtable_class_helper.pxi", line 1798, in pandas._libs.hashtable.PyObjectHashTable.factorize
  File "pandas/_libs/hashtable_class_helper.pxi", line 1718, in pandas._libs.hashtable.PyObjectHashTable._unique
TypeError: unhashable type: 'list'

This is just a guess: somewhere along the line, one of the columns in your DataFrame holds lists, and duplicated() is trying to hash them. Lists are mutable, so they are not hashable (i.e., they can’t be used as dictionary keys or members of sets). pandas relies on hashing to detect repeated rows, which would explain why the call fails with TypeError: unhashable type: 'list'.
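To make the hashability point concrete, here is a minimal sketch in plain Python. The helper name is_hashable and the sample string are made up for illustration; the string only mimics the "number + letter, underscore, number + letter" shape that the split and replace calls in the script imply.

```python
def is_hashable(value):
    """Return True if hash(value) succeeds, False if it raises TypeError."""
    try:
        hash(value)
    except TypeError:
        return False
    return True


# Hypothetical value in the shape the script's split('_') expects.
split_result = "10M_12F".split("_")  # str.split always returns a list

list_ok = is_hashable(split_result)          # lists are mutable, so not hashable
tuple_ok = is_hashable(tuple(split_result))  # a tuple of strings is hashable

print(list_ok, tuple_ok)
```

The same split result is rejected as a list and accepted as a tuple, which suggests the contents are fine and the container type is the problem.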

Thanks. Looks like the “str_split” column that was added when I split male and female population values was formatted as a list. Once I dropped that column from the data frame, I stopped getting the error. Really appreciate your help!
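For anyone hitting the same error, the fix described above might look roughly like the sketch below. The three rows are made-up stand-ins for the census files, not real data; only the column names GenderPop and str_split come from the script in the question.

```python
import pandas as pd

# Made-up rows for illustration; the second row deliberately repeats the first.
us_census = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alaska"],
    "GenderPop": ["10M_12F", "10M_12F", "5M_4F"],
})

# As in the original script, this helper column ends up holding Python lists.
us_census["str_split"] = us_census.GenderPop.str.split("_")

# Drop the list-valued helper column so every remaining value is hashable.
cleaned = us_census.drop(columns=["str_split"])

# duplicated() marks later repeats of a row as True (keep='first' by default).
duplicates = cleaned.duplicated()
print(duplicates.tolist())
```

If the helper column is still needed later, another option is to leave it in place and pass only the hashable columns through the subset argument of duplicated().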
