Cleaning US Census Data Bug? Can't delete duplicates


I’m completing the US Census Data project from ‘How to Clean Data with Python’.

In this project we have to delete duplicate rows, and there are some, for example:

I have entered this code :

duplicates = us_census.duplicated()
us_census = us_census.drop_duplicates()

But when I print the us_census dataframe, all the duplicated rows are still there (there is one per file we aggregated earlier).

How do I fix this? I can’t wrap my head around it. I tried with subset too, but it didn’t work either.

Thank you !

The .duplicated() method locates duplicate rows in the DataFrame and outputs a boolean Series, one value per row, marking the duplicates.

.drop_duplicates() returns a new DataFrame with the duplicate rows removed.

There’s no need to create a variable called duplicates here. If you just do:

print(us_census.duplicated()) # prints a boolean Series: True for each duplicate row

us_census = us_census.drop_duplicates() # returns a DataFrame with the duplicate rows removed




Thank you for your answer,

sadly the rows are still there. The plain .drop_duplicates() was my first try, and I only deviated from it after it failed…

Even if you write the code like this?


us_census = us_census.drop_duplicates()

Hello there,
I am facing the same issue. When I call the following code


It returns only “False” for every row.

I had to keep my work moving.
So I just ignored the False results (I could see by printing the DataFrame that there were duplicates) and applied drop_duplicates by specifying the ‘State’ column to make sure I got the result I wanted.
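A quick way to sanity-check this situation (a hypothetical check with made-up data, not from the course) is to count rows per state: even when the full rows are not exact duplicates, a repeated state name still shows up.

```python
import pandas as pd

# Hypothetical mini-frame: the same state appears twice with slightly
# different numbers, so the full rows are not exact duplicates
df = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Alabama"],
    "TotalPop": [4830620, 733375, 4830621],
})

# duplicated() finds nothing because no row matches exactly
print(df.duplicated().any())   # False

# Counting rows per State still reveals the repeated state
counts = df["State"].value_counts()
print(counts[counts > 1])

# Dropping on the 'State' column alone keeps the first row per state
df = df.drop_duplicates(subset="State")
print(len(df))
```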

I had the same issue, and I think it may be due to small differences in the data across the different CSV files.

I used this and it seemed to work.
us_census = us_census.drop_duplicates(subset="State", keep="first")
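That explanation can be demonstrated with a small sketch (invented data, not the actual files): two rows for the same state that differ only by trailing whitespace are not exact duplicates, so a plain drop_duplicates() leaves both in place, while normalizing the key column or using subset collapses them.

```python
import pandas as pd

# Two rows for the same state that differ only by trailing whitespace,
# so an exact-match drop_duplicates() leaves both in place
df = pd.DataFrame({
    "State": ["Alabama", "Alabama "],
    "TotalPop": [4830620, 4830620],
})

print(len(df.drop_duplicates()))   # 2: the rows are not exact matches

# Stripping the key column first makes the exact match succeed...
df["State"] = df["State"].str.strip()
print(len(df.drop_duplicates()))   # 1

# ...and subset="State" also collapses them to one row per state
df = df.drop_duplicates(subset="State", keep="first")
print(len(df))                     # 1
```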