Cleaning US Census Data bug? Can't delete duplicates

Hi,

I’m working on the US Census data project from ‘How to Clean Data with Python’.

In this project we have to delete the duplicate rows, and there definitely are some.

I entered this code:

duplicates = us_census.duplicated()
us_census = us_census.drop_duplicates()

But when I print the us_census dataframe, all the duplicate rows are still there (there is one per file we aggregated earlier).

How can I fix this? I can’t wrap my head around it. I also tried the subset parameter, but that didn’t work either.

Thank you !

The .duplicated() method locates duplicate rows in the df and outputs a boolean value per row, marking the dupes as True.

drop_duplicates() returns a df with the duplicate rows removed.

There’s no need to create a variable called duplicates here. If you just do:

print(us_census.duplicated()) # returns a boolean Series, one entry per row, True where the row is a duplicate

us_census = us_census.drop_duplicates() # returns a df with duplicate rows removed
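As a quick sanity check, here’s a tiny self-contained example (made-up toy data, not the actual census frame) showing what those two calls do:

```python
import pandas as pd

# Made-up toy data, not the actual census frame
df = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Alabama"],
    "TotalPop": [4830620, 733375, 4830620],
})

print(df.duplicated())     # False, False, True -- the third row repeats the first
df = df.drop_duplicates()  # keeps the first occurrence, drops the repeat
print(len(df))             # 2
```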

See:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html

And,
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html


Thank you for your answer.

Sadly, the rows are still there; plain .drop_duplicates() was my first try, and I only deviated from it after it failed…

Even if you write the code like this?

print(us_census.duplicated())

us_census = us_census.drop_duplicates()
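If that still leaves the rows in place, it might be worth checking whether they are truly identical in every column. .duplicated() compares all columns by default, so rows that differ in even one column won’t be flagged. A common culprit after aggregating several files is a leftover per-file index column (‘Unnamed: 0’ below is just a guess at what such a column might be called in your frame):

```python
import pandas as pd

# Toy example: rows that look like duplicates but differ in one column
# ("Unnamed: 0" is a hypothetical leftover index column) are NOT flagged
# by .duplicated(), which compares ALL columns by default.
df = pd.DataFrame({
    "Unnamed: 0": [0, 0, 1],  # differs between the "duplicate" rows
    "State": ["Alabama", "Alaska", "Alabama"],
    "TotalPop": [4830620, 733375, 4830620],
})

print(df.duplicated())  # all False: no row matches another in every column

# Restrict the comparison to the columns that should define a duplicate
deduped = df.drop_duplicates(subset=["State", "TotalPop"])
print(len(deduped))     # 2
```

If that turns out to be the issue, either pass subset= as above or drop the stray column before deduplicating.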