Why/when do I need two escape characters?

In the Cleaning US Census Data project in the “Learn R: Data Cleaning” module, there is a section that utilizes the gsub function to eliminate % and $ symbols in columns’ values.

In the provided hint for this section, the first argument is the % or $ sign preceded by two escape characters / backslashes, as follows:

df %>%  mutate(column_name=gsub('\\$','',column_name))
df %>%  mutate(column_name=gsub('\\%','',column_name))

Why is this? I understand that $ is a special symbol that needs an escape character to be read as a character, but why does it need two escape characters?

Further, % doesn’t seem to be such a special character. In fact, my code on this project worked perfectly fine without any escape characters for that part, let alone two!

Thanks in advance for your help!


Without knowing the R language, it is difficult to say. Consider that if we wish to make \ a printable character, we would need to escape it, since it is itself a special character. The output would be something like,

\$
\%

You are correct that % doesn’t need to be escaped.
Unless I am missing some subtlety, gsub('%', '', column_name) works fine.

Regarding the need for two backslashes in gsub('\\$', '', column_name), it is because of having to respect valid syntax for both (R) strings AND regular expressions (regex).

$ and % can be included in R strings without needing to be escaped. For example, the following is a valid string in R and doesn’t require any characters (neither the % nor the $) to be escaped.

"After 20% increase in costs, price is $32.46"

However, $ is a meta-character in regex and has a special meaning: it anchors a match to the end of the string. If we want to specify a pattern in which the literal dollar symbol $ is to be matched (and not treated as a meta-character), then we need to escape the meta-character. So, we want the regex pattern that reaches gsub to be \$

But the R interpreter will look at "\$", see the quotes and recognize that this is a string. In R strings, the backslash is interpreted as the start of an escape sequence. You are likely familiar with some of the more common escape sequences such as \n for newline and \t for tab. Other valid escape sequences include \b, \" and \r (for a more detailed listing of valid escape sequences, see R: Quotes).

If we attempt to use the backslash with characters that are not valid escape sequences e.g. \g, \h, \k, \%, \$, then the interpreter will throw an error regarding an "unrecognized escape in character string".

With the double backslash in

gsub('\\$', '', column_name)

two things happen in order:

First, the R interpreter parses the string literal '\\$'. Since \\ is a valid escape sequence (it stands for a single backslash), the string effectively becomes \$
Then \$ is passed as the pattern to the gsub function. \$ is valid regex: the backslash escapes the meta-character, so the pattern matches the literal dollar sign $
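If it helps, the two layers can be checked directly in the console. This is only a small sketch; the variable name pattern is mine, not something from the project or the hint.

```r
pattern <- '\\$'          # typed with two backslashes in the source code

nchar(pattern)            # 2: one backslash and one dollar sign
writeLines(pattern)       # prints \$  (the text the regex engine receives)
print(pattern)            # prints "\\$" (print() re-escapes the backslash for display)

x <- "After 20% increase in costs, price is $32.46"
result <- gsub(pattern, '', x)
result                    # "After 20% increase in costs, price is 32.46"
```

The difference between writeLines() and print() is a common source of confusion here: print() shows the string the way you would have to type it, while writeLines() shows the characters it actually contains.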

x <- "After 20% increase in costs, price is $32.46"
y <- gsub('$', '', x)
# y will be: "After 20% increase in costs, price is $32.46"
# '$' is a valid R string, so there is no error.
# In regex, an unescaped $ is a meta-character (the end-of-string anchor), so it doesn't match the dollar sign.

x <- "After 20% increase in costs, price is $32.46"
y <- gsub('\$', '', x)
# Error because '\$' is not a valid escape sequence in R strings.

x <- "After 20% increase in costs, price is $32.46"
y <- gsub('\\$', '', x)
# y will be: "After 20% increase in costs, price is 32.46"
# This works.
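To make the meta-character behaviour visible, it may help to replace with a marker instead of an empty string. A rough sketch, where the markers <END> and <USD> are made up purely for illustration:

```r
x <- "After 20% increase in costs, price is $32.46"

# Unescaped: '$' is the end-of-string anchor, so the (empty) match sits
# at the very end of the string and the marker is appended there.
appended <- gsub('$', '<END>', x)
appended   # "After 20% increase in costs, price is $32.46<END>"

# Escaped: '\\$' matches the literal dollar sign, which gets replaced.
replaced <- gsub('\\$', '<USD>', x)
replaced   # "After 20% increase in costs, price is <USD>32.46"
```

This is why gsub('$', '', x) appears to do nothing: it does find a match, but the match is the empty position at the end of the string, and replacing that with '' changes nothing.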

Yes, the following will work. % doesn’t need to be escaped.

x <- "After 20% increase in costs, price is $32.46"
y <- gsub('%', '', x)
# "After 20 increase in costs, price is $32.46"


x <- "After 20% increase in costs, price is $32.46"
y <- gsub('\%', '', x)
# Error because '\%' is not a valid escape sequence in R strings.

x <- "After 20% increase in costs, price is $32.46"
y <- gsub('\\%', '', x)
# "After 20 increase in costs, price is $32.46"

From Regular Expression Character Escaping,

However, the specifications for some regular expression implementations (POSIX for example), state in their documentation that when you escape a character that doesn’t need to be escaped, the result will be ‘undefined behaviour’. Usually the behaviour that most implementations default to is just to interpret the character literally, but you should keep this in mind because relying on undefined behaviour can get you into trouble.


As an alternative to escaping, you could make use of the fixed parameter of the gsub function:

x <- "After 20% increase in costs, price is $32.46"
y <- gsub('$', '', x, fixed=TRUE)
# "After 20% increase in costs, price is 32.46"

From grep function - RDocumentation,

fixed:      logical. If TRUE, pattern is a string to be matched as is. Overrides all conflicting arguments.
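For completeness, here is a sketch of a few ways to avoid the double backslash altogether. The variable names are my own, and the raw-string form assumes R 4.0.0 or newer, so treat it as optional.

```r
x <- "After 20% increase in costs, price is $32.46"

# 1. fixed = TRUE: the pattern is matched literally, with no regex involved.
no_dollar <- gsub('$', '', x, fixed = TRUE)
no_dollar   # "After 20% increase in costs, price is 32.46"

# 2. Character class: inside [ ], $ and % are ordinary characters,
#    so both symbols can be removed in a single call.
no_symbols <- gsub('[$%]', '', x)
no_symbols  # "After 20 increase in costs, price is 32.46"

# 3. Raw string (assumes R >= 4.0.0): backslashes are not processed by the
#    R parser, so the regex can be written with a single backslash.
raw_pattern <- r"(\$)"
identical(raw_pattern, '\\$')   # TRUE: same two-character string
```

Which one to use is mostly a matter of taste. fixed = TRUE is probably the least surprising when the pattern is a plain symbol, while the character class is handy when several symbols need to go at once.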

