Hello!
I just finished the Cleaning US Census Data exercise and was surprised to see that the solution reports different counts for the duplicated rows than my code does.
The solution shows False 31 and True 5, but I get False 52 and True 9, and I don’t understand why.
Is there an explanation for these different values?
Have a great day!
A.
---
title: "Cleaning US Census Data"
output: html_notebook
---
```{r message=FALSE, warning=FALSE, error=TRUE}
# load libraries
library(dplyr)
library(readr)
library(tidyr)
# load CSVs
files<- list.files(pattern="states_.*csv")
df_list<- lapply(files,read_csv)
us_census<- bind_rows(df_list)
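# Sketch (hypothetical toy data, not the exercise files): the duplicate counts
# further down depend on how many rows were loaded, so counting rows per loaded
# file is a quick sanity check. `toy_list` and `toy_rows` are illustrative names.
toy_list <- list(data.frame(x = 1:6), data.frame(x = 1:6), data.frame(x = 1:3))
toy_rows <- vapply(toy_list, nrow, integer(1))  # rows contributed by each "file"
print(length(toy_list))                         # number of "files" loaded
print(sum(toy_rows))                            # total rows after stacking them
# The same check on the real data would be vapply(df_list, nrow, integer(1)).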
# inspect data
print(colnames(us_census))
#str(us_census)
#head(us_census)
# drop X1 column
us_census <- select(us_census,-X1)
# remove % from race columns
us_census <- us_census %>%
  mutate(Hispanic = gsub('\\%', '', Hispanic),
         White = gsub('\\%', '', White),
         Black = gsub('\\%', '', Black),
         Native = gsub('\\%', '', Native),
         Asian = gsub('\\%', '', Asian),
         Pacific = gsub('\\%', '', Pacific))
#head(us_census)
# remove $ from Income column
us_census<- us_census %>%
mutate(Income=gsub('\\$','',Income))
#head(us_census)
# separate GenderPop column
us_census<- us_census %>%
  separate(GenderPop, c('male_pop', 'female_pop'), sep = '_')
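# Sketch (hypothetical toy values): with an explicit sep = "_", separate() splits
# a GenderPop-style string into exactly two columns. `toy_gender` and `toy_split`
# are illustrative names and are not used by the rest of the notebook.
library(tidyr)
toy_gender <- data.frame(GenderPop = c("2341093M_2489527F", "384160M_349215F"))
toy_split <- separate(toy_gender, GenderPop, c("male_pop", "female_pop"), sep = "_")
print(toy_split)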
#head(us_census)
# clean male and female population columns, removes M and F
us_census<- us_census %>%
  mutate(male_pop = gsub('M', '', male_pop),
         female_pop = gsub('F', '', female_pop))
#head(us_census)
# update column data types: <chr> becomes <dbl>
us_census <- us_census %>%
  mutate(Hispanic = as.numeric(Hispanic),
         White = as.numeric(White),
         Black = as.numeric(Black),
         Native = as.numeric(Native),
         Asian = as.numeric(Asian),
         Pacific = as.numeric(Pacific),
         Income = as.numeric(Income),
         male_pop = as.numeric(male_pop),
         female_pop = as.numeric(female_pop))
#head(us_census)
# update values of race columns
us_census <- us_census %>%
  mutate(Hispanic = Hispanic / 100,
         White = White / 100,
         Black = Black / 100,
         Native = Native / 100,
         Asian = Asian / 100,
         Pacific = Pacific / 100)
head(us_census)
# check for duplicate rows
us_census %>%
duplicated() %>%
table()
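# Sketch (hypothetical toy data, not the exercise files): duplicated() flags only
# the second and later copies of a row, so the FALSE/TRUE counts depend on how
# many rows were stacked in total. `toy_a`, `toy_b`, `toy_two` and `toy_three`
# are illustrative names.
toy_a <- data.frame(State = c("Alabama", "Alaska", "Arizona"), TotalPop = c(10, 20, 30))
toy_b <- data.frame(State = c("Arizona", "Arkansas"), TotalPop = c(30, 40))
toy_two <- rbind(toy_a, toy_b)           # two "files" sharing one row
toy_three <- rbind(toy_a, toy_b, toy_b)  # a third "file" repeats two more rows
print(table(duplicated(toy_two)))        # one repeated row
print(table(duplicated(toy_three)))      # three repeated rows, same unique count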
us_census
# remove duplicate rows
us_census<- us_census %>%
distinct()
# distinct(State, .keep_all = TRUE)
# check for duplicate rows
us_census %>%
duplicated() %>%
table()
us_census
# clean data frame
head(us_census)
```