Cleaning US Census Data Bug? Can't delete duplicates


I’m completing the US Census Data project from ‘How to Clean Data with Python’.

In this project we have to delete duplicate rows, and there are some, for example:

I have entered this code :

duplicates = us_census.duplicated()
us_census = us_census.drop_duplicates()

But when I print the us_census dataframe, all the duplicated rows are still there (there is one per file we aggregated earlier).

How do I fix this? I can’t wrap my head around it. I tried with subset too, but it didn’t work either.

Thank you !

The .duplicated() method locates duplicate rows in the DataFrame and outputs a boolean Series, one value per row, marking the duplicates.

.drop_duplicates() returns a new DataFrame with the duplicate rows removed.

There’s no need to create a variable called duplicates here. If you just do:

print(us_census.duplicated()) # prints a boolean Series: True for each duplicate row

us_census = us_census.drop_duplicates() # returns a DataFrame with the duplicate rows removed




Thank you for your answer,

sadly the rows are still there. The plain .drop_duplicates() was my first try, and I only deviated from it after it failed…

Even if you write the code like this?


us_census = us_census.drop_duplicates()

Hello there,
I am facing the same issue. When I call the following code


It returns only “False” for every row.

I had to keep my work moving.
So I just ignored the False results (I could see by printing the DataFrame that there were duplicates) and applied drop_duplicates by specifying the ‘State’ column to make sure I got the result I wanted.
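A quick way to sanity-check this situation (a hypothetical check with made-up data, not from the course) is to count rows per state: even when the full rows are not exact duplicates, a repeated state name still shows up.

```python
import pandas as pd

# Hypothetical mini-frame: the same state appears twice with slightly
# different numbers, so the full rows are not exact duplicates
df = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Alabama"],
    "TotalPop": [4830620, 733375, 4830621],
})

# duplicated() finds nothing because no row matches exactly
print(df.duplicated().any())   # False

# Counting rows per State still reveals the repeated state
counts = df["State"].value_counts()
print(counts[counts > 1])

# Dropping on the 'State' column alone keeps the first row per state
df = df.drop_duplicates(subset="State")
print(len(df))
```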

I had the same issue, and I think it may be due to small differences in the data across the different CSV files.

I used this and it seemed to work.
us_census = us_census.drop_duplicates(subset="State", keep="first")
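That explanation can be demonstrated with a small sketch (invented data, not the actual files): two rows for the same state that differ only by trailing whitespace are not exact duplicates, so a plain drop_duplicates() leaves both in place, while normalizing the key column or using subset collapses them.

```python
import pandas as pd

# Two rows for the same state that differ only by trailing whitespace,
# so an exact-match drop_duplicates() leaves both in place
df = pd.DataFrame({
    "State": ["Alabama", "Alabama "],
    "TotalPop": [4830620, 4830620],
})

print(len(df.drop_duplicates()))   # 2: the rows are not exact matches

# Stripping the key column first makes the exact match succeed...
df["State"] = df["State"].str.strip()
print(len(df.drop_duplicates()))   # 1

# ...and subset="State" also collapses them to one row per state
df = df.drop_duplicates(subset="State", keep="first")
print(len(df))                     # 1
```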