Cleaning US Census Data with R

Hello!

I just finished the exercise Cleaning US Census Data and was surprised to see that the solution reports different counts for the duplicated rows.
The solution shows FALSE 31 and TRUE 5, whereas I get FALSE 52 and TRUE 9, and I don't understand why.
Is there an explanation for the difference?
Have a great day!

A.

---
title: "Cleaning US Census Data"
output: html_notebook
---

```{r message=FALSE, warning=FALSE, error=TRUE}
# load libraries
library(dplyr)
library(readr)
library(tidyr)

# load CSVs
files <- list.files(pattern = "states_.*csv")
df_list <- lapply(files, read_csv)
us_census <- bind_rows(df_list)

# inspect data
print(colnames(us_census))
#str(us_census)
#head(us_census)

# drop X1 column
us_census <- select(us_census, -X1)


# remove % from race columns
us_census <- us_census %>%
  mutate(
    Hispanic = gsub('\\%', '', Hispanic),
    White = gsub('\\%', '', White),
    Black = gsub('\\%', '', Black),
    Native = gsub('\\%', '', Native),
    Asian = gsub('\\%', '', Asian),
    Pacific = gsub('\\%', '', Pacific)
  )
#head(us_census)

# remove $ from Income column
us_census <- us_census %>%
  mutate(Income = gsub('\\$', '', Income))
#head(us_census)


# separate GenderPop column
us_census <- us_census %>%
  separate(GenderPop, c('male_pop', 'female_pop', '_'))
#head(us_census)

# clean male and female population columns, removes M and F
us_census <- us_census %>%
  mutate(male_pop = gsub('\\M', '', male_pop), female_pop = gsub('\\F', '', female_pop))
# mutate(male_pop=gsub('M','',male_pop),female_pop=gsub('F','',female_pop))
#head(us_census)


# update column data types: <chr> becomes <dbl>
us_census <- us_census %>%
  mutate(
    Hispanic = as.numeric(Hispanic),
    White = as.numeric(White),
    Black = as.numeric(Black),
    Native = as.numeric(Native),
    Asian = as.numeric(Asian),
    Pacific = as.numeric(Pacific),
    Income = as.numeric(Income),
    male_pop = as.numeric(male_pop),
    female_pop = as.numeric(female_pop)
  )
#head(us_census)

# update values of race columns
us_census <- us_census %>%
  mutate(
    Hispanic = Hispanic / 100,
    White = White / 100,
    Black = Black / 100,
    Native = Native / 100,
    Asian = Asian / 100,
    Pacific = Pacific / 100
  )
head(us_census)


# check for duplicate rows
us_census %>%
  duplicated() %>%
  table()
us_census


# remove duplicate rows
us_census <- us_census %>%
  distinct()
  # distinct(State, .keep_all = TRUE)


# check for duplicate rows
us_census %>%
  duplicated() %>%
  table()
us_census

# clean data frame
head(us_census)
```
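
One thing that can be checked from the numbers alone: `duplicated()` returns one logical value per row, so the FALSE and TRUE counts always add up to `nrow()`. 52 + 9 is 61 rows, while 31 + 5 is 36 rows, so the two results come from data frames of different sizes. A minimal sketch of that relationship, using a small made-up data frame (`toy` is hypothetical, not the exercise data):

```r
# Hypothetical toy data, not the census files
toy <- data.frame(
  State = c("Alabama", "Alaska", "Alabama", "Arizona", "Alaska"),
  TotalPop = c(100, 200, 100, 300, 200)
)

# duplicated() flags every row that repeats an earlier row
dup_flags <- duplicated(toy)
dup_counts <- table(dup_flags)
print(dup_counts)

# The FALSE and TRUE counts always sum to the number of rows
sum(dup_counts) == nrow(toy)
```

In the notebook, comparing `length(files)` and `nrow(us_census)` right after `bind_rows()` against the solution may show whether the same files were loaded. That is only a place to look, not a confirmed cause.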

Hi,

I also have a problem with this question.
Could someone help? I can't see where the problem is.

```{r error=TRUE}
# check for duplicate rows
us_census <- us_census %>%
  duplicated() %>%
  table()
us_census
```

```{r error=TRUE}
# remove duplicate rows
us_census <- us_census %>%
  distinct()
us_census
```

```{r error=TRUE}
# check for duplicate rows
us_census <- us_census %>%
  duplicated() %>%
  table()
us_census
```

Thank you in advance!
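
A minimal sketch of what the snippets above appear to do, assuming a small made-up data frame (`toy` and `overwritten` are hypothetical names): assigning the result of `duplicated()` and `table()` back to the same name replaces the data frame with a frequency table, so the later `distinct()` step no longer receives a data frame. Base R calls are used here instead of the pipe to keep the sketch self-contained.

```r
# Hypothetical toy data standing in for us_census
toy <- data.frame(
  State = c("Alabama", "Alaska", "Alabama"),
  TotalPop = c(100, 200, 100)
)

# Assigning the check back to the same name overwrites the data frame
overwritten <- toy
overwritten <- table(duplicated(overwritten))
is.data.frame(overwritten)   # FALSE: it is now a frequency table

# Printing the check without assigning leaves the data frame intact
print(table(duplicated(toy)))
is.data.frame(toy)           # TRUE

# Base R counterpart of dplyr::distinct(), applied to the intact data frame
deduped <- unique(toy)
nrow(deduped)
```

Under that reading, dropping the `us_census <-` assignment from the two "check for duplicate rows" chunks would keep `us_census` a data frame for the `distinct()` step.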