FAQ: Data Cleaning in R - Splitting By Character

This community-built FAQ covers the “Splitting By Character” exercise from the lesson “Data Cleaning in R”.

Paths and Courses
This exercise can be found in the following Codecademy content:

Learn R

A bit confused here over the extra = 'merge' argument: why (and how) does it work? 'merge' is typed as a character string, so how come it behaves like a function?

And what do I do if I want a middle name to belong to the first_name column instead of the last?
Thanks in advance :nerd_face:

I keep getting errors by not putting column names in quotes…why, in certain functions, like separate(), do the column names have to be put in quotes, and others, like select(), they do not? Is this something that has to be memorized, or is there a rule?

Have a look at the R documentation for separate.
https://www.rdocumentation.org/packages/tidyr/versions/0.8.3/topics/separate

As you can see, the data frame (which can be piped as in the exercise), the column name, and the names of new columns (as a character vector) are to be passed to the separate function. We did this with

students %>% 
  separate(full_name, c('first_name', 'last_name'), ...)

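There are other optional parameters for the separate function which you can see in the documentation (such as sep, remove, convert, extra, fill). If you don’t pass arguments for these parameters, their default values are used. The default for extra is 'warn', but when you write extra = 'merge', you choose that value instead of the default. So extra isn’t a function, it is an optional parameter, and 'merge' is just a character string that separate checks internally to decide what to do when a value has more pieces than there are new columns. With 'merge' it splits at most length(into) times, so everything left over stays in the last column.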
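To make the effect of extra concrete, here is a minimal sketch comparing two of its values. The small students data frame is made up for illustration and is not the exercise’s data.

```r
library(tidyr)

# Hypothetical sample data; the exercise's real `students` table is larger.
students <- data.frame(full_name = c("Ada Lovelace", "Mary Ann Evans"))

# extra = "merge": split at most once, so the remainder stays in last_name.
merged <- separate(students, full_name, c("first_name", "last_name"),
                   sep = " ", extra = "merge")

# extra = "drop": split at every space and silently discard the extra pieces.
dropped <- separate(students, full_name, c("first_name", "last_name"),
                    sep = " ", extra = "drop")

print(merged)
print(dropped)
```

With 'merge' the middle name ends up in last_name ("Ann Evans"), while with 'drop' the third piece ("Evans") is thrown away.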

As for keeping a middle name in the first_name column, one way would be to do as Alexey has done in this StackOverflow post.

He has used a regex (regular expression) pattern so that only the last space in the string acts as the separator. In the exercise, we used " " as the separator. Alexey has used " (?=[^ ]+$)" as the separator: a space, followed by a lookahead (?= ) which requires that only non-space characters ([^ ]+) remain before the end of the string ($). For more on the meanings of ?, [^ ], + and $, see the Wikipedia article on regular expressions: https://en.wikipedia.org/wiki/Regular_expression (or some other handy regex reference sheet).
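A minimal sketch of that approach, using a made-up students data frame. It assumes the installed tidyr version accepts Perl-style lookahead in sep, as it did in the linked answer.

```r
library(tidyr)

# Hypothetical sample data for illustration only.
students <- data.frame(full_name = c("Ada Lovelace", "Mary Ann Evans"))

# Split only at the last space: a space followed by non-space
# characters that run to the end of the string.
split_last <- separate(students, full_name, c("first_name", "last_name"),
                       sep = " (?=[^ ]+$)")

print(split_last)
```

Here the middle name stays with first_name ("Mary Ann"), and no extra argument is needed because each name is split exactly once.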

You have to use the documentation. Whenever you want to do something new, first you have to do a search for any packages, commands or functions that can help you accomplish the task. Then, you have to read their documentation. There is no rule which will instantly tell you the interface of functions. If you use some function frequently, you will pretty much remember which arguments make sense. If you take a long break, things can fade from memory. So, a handy reference sheet of common functions or re-reading the documentation should bring you up to speed. Memorizing isn’t the goal. As long as you know where to look if you are unsure about something, you are fine.

If you look at the snippet for separate,

df %>%
  separate(col_to_separate, c('new_col_name_1','new_col_name_2'), ...)

you can see that the name of the column to be separated isn’t in quotes, but the new column names are. Before using the separate function, you would look at the documentation to better understand what it does, what its parameters are, and any examples or remarks worth noting. If you look at the documentation for the separate function
https://www.rdocumentation.org/packages/tidyr/versions/0.8.3/topics/separate
you will see that the col parameter accepts a column name or position, so we don’t need quotes when passing this argument. In contrast, the documentation says that the into parameter accepts “Names of new variables to create as character vector”. Character is one of the data types in R, and a character vector is a vector whose elements are all of character type. Character values are written with quotes around them. Reading the documentation is how we figure out the expected format of the arguments. One way to create character vectors is through the c() function, as done in the exercise. You can read the documentation of this function as well: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/c
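As a rough illustration of the difference between the two arguments, the sketch below builds the new column names as an ordinary character vector first. The df data frame and the new_names variable are made up for this example.

```r
library(tidyr)

# Hypothetical sample data for illustration only.
df <- data.frame(type = c("admin_US", "user_UK"))

# The new column names are plain strings, so they form a character vector.
new_names <- c("user_type", "country")
print(is.character(new_names))

# `type` already exists in df, so it is passed unquoted;
# `new_names` is an ordinary character vector passed to `into`.
result <- separate(df, type, new_names, sep = "_")

print(result)
```

The existing column is referred to by its bare name, while the columns that don’t exist yet can only be described by strings.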

I’m confused about the example versus the solution to the instructional part. In the example given during the lesson, a column is split into two new columns. When outputting the new table, the table shows the original column with the two new columns next to it. In the exercise, after splitting the column into two new columns, the original column is removed. Which is the actual output of using the separate() function? (Does it keep the original column or not?)

Yes, you are correct. The example is a bit inconsistent.
If we look at the documentation for the separate function,
https://www.rdocumentation.org/packages/tidyr/versions/0.8.3/topics/separate
we see that you can pass an optional argument remove, which lets you specify whether to keep or remove the original column. The default value of remove is TRUE, so in the instructional part, where we didn’t pass the remove argument, the full_name column was removed from the new data frame and replaced with two new columns. If we had used the code

students <- students %>%
  separate(full_name, c('first_name', 'last_name'), ' ', extra = 'merge', remove = FALSE)

then the resulting data frame would include all three columns, i.e. full_name, first_name, last_name.

The example should have been coded like

# Create the 'user_type' and 'country' columns
df %>%
  separate(type, c('user_type', 'country'), '_', remove = FALSE)

Since the remove argument wasn’t used in the example, you are correct that the result shouldn’t have the original type column.
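If you want to check this yourself, a small sketch like the following compares the column names with and without remove = FALSE. The df data frame is made up for illustration and is not the lesson’s data.

```r
library(tidyr)

# Hypothetical sample data for illustration only.
df <- data.frame(type = c("admin_US", "user_UK"))

# Default (remove = TRUE): the original `type` column is dropped.
without_original <- separate(df, type, c("user_type", "country"), "_")

# remove = FALSE: the original `type` column is kept alongside the new ones.
with_original <- separate(df, type, c("user_type", "country"), "_",
                          remove = FALSE)

print(names(without_original))
print(names(with_original))
```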

Thank you! For coding language, specifics are important