FAQ: Data Cleaning in R - Dealing with Multiple Files

This community-built FAQ covers the “Dealing with Multiple Files” exercise from the lesson “Data Cleaning in R”.

Paths and Courses
This exercise can be found in the following Codecademy content:

Learn R

Hi everyone,

I have a short question about the pattern parameter in the list.files() function.
In this lesson, to assign the list of files, one needs to type

student_files <- list.files(pattern = 'exams_.*csv')

Usually, I would assume that * would be the wildcard symbol, so that

student_files <- list.files(pattern = 'exams_*.csv')

would make more sense to me. Googling about wildcard characters for the pattern parameter in list.files() didn’t quite help me.
So why is it that the upper line of code works with the files exams_0.csv, exams_1.csv, …, exams_9.csv, while the line of code which makes more sense to me doesn’t?

Thanks,
Fready

I asked myself the exact same question and did not find an answer either.

I have since figured it out myself.

The string for the pattern parameter is a so-called regular expression (regex).
A regex can contain things like ‘\d’, which stands for any digit from 0-9 (inside an R string literal the backslash itself has to be doubled, so one writes '\\d').
The dot . is indeed the wildcard symbol: it matches any single character. An asterisk placed after a character or a parenthesized expression means that this character or expression can be repeated arbitrarily many times, even 0 times.
For an actual dot, one has to put a backslash before the dot (written '\\.' in an R string).

For example, the regex pattern ‘a(ha)*’ would match the string ‘aha’, but also the strings ‘ahahahahahaha’ and ‘a’.
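
To try this out, here is a minimal sketch using base R's grepl(). The test strings are made up for illustration, and the ^ and $ anchors are my addition so that the pattern has to cover the whole string rather than just part of it.

```r
# Test strings: three that fit 'a' followed by any number of 'ha', one that does not.
candidates <- c("a", "aha", "ahahahahahaha", "ah")

# ^ and $ anchor the pattern to the whole string; (ha)* allows zero or more 'ha'.
star_matches <- grepl("^a(ha)*$", candidates)

print(data.frame(candidates, star_matches))
```

Without the anchors, grepl() would report a match for any string that merely contains an 'a', because (ha)* is allowed to match zero times.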

The pattern ‘exams_.*csv’ thus means: Find any filename that contains the substrings ‘exams_’ and, somewhere after that, ‘csv’.
The string ‘yo_mama_so_stupid_she_failed_all_exams_twice_42_csv’ would match the pattern as well, and it doesn’t even contain a dot.
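
The difference between the two patterns from the original question can be checked directly with grepl(). This is only a sketch: the file names below are hypothetical, and grepl() stands in for the matching that list.files() does on file names.

```r
# Hypothetical file names, used only for illustration.
files <- c("exams_0.csv", "exams_9.csv", "exams.csv", "old_exams_backup_csv")

# Regex reading: 'exams_', then any characters (.*), then 'csv'.
regex_style <- grepl("exams_.*csv", files)

# Regex reading of the glob-looking pattern: 'exams', zero or more '_',
# exactly one arbitrary character (.), then 'csv'.
glob_style <- grepl("exams_*.csv", files)

print(data.frame(files, regex_style, glob_style))
```

Read as a regex, 'exams_*.csv' leaves room for only one character between the underscores and 'csv', which is why it finds 'exams.csv' but not 'exams_0.csv'.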

Instead of *, one can also use +, which would mean an arbitrary number of repetitions, but at least one occurrence. Thus the pattern ‘a(ha)+’ would not match ‘a’ anymore, but still ‘aha’ and ‘ahahahahahaha’.
As I mentioned before, \d inside a pattern stands for any digit, so one could also write

student_files <- list.files(pattern = 'exams_\\d+\\.csv')

The backslashes are doubled because R treats a single backslash in a string literal as the start of an escape sequence, so 'exams_\d+\.csv' would be rejected as an unrecognized escape. This means: Find any filenames that contain the substring ‘exams_’, directly followed by one or more digits (given through \d+), directly followed by the substring ‘.csv’ (remember the backslash before the dot). However, the filename ‘bexams_000.csvvv’ also matches this pattern, since a regular expression can match anywhere inside a string unless it is anchored with ^ (start of the string) and $ (end of the string).
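
Here is a small sketch of the anchoring issue with list.files() itself. It creates a throwaway directory via tempfile(); the directory and the file names in it are assumptions for the demo, not the files from the exercise.

```r
# Build a throwaway directory with hypothetical file names for the demo.
demo_dir <- tempfile("regex_demo_")
dir.create(demo_dir)
invisible(file.create(file.path(demo_dir, c("exams_0.csv", "exams_12.csv",
                                            "exams_final.csv", "bexams_000.csvvv"))))

# Unanchored: the pattern may match anywhere inside the file name.
loose <- list.files(demo_dir, pattern = "exams_\\d+\\.csv")

# Anchored: ^ pins the start of the name, $ pins the end.
strict <- list.files(demo_dir, pattern = "^exams_\\d+\\.csv$")

print(loose)
print(strict)

# Clean up the temporary directory again.
unlink(demo_dir, recursive = TRUE)
```

The unanchored call should also pick up 'bexams_000.csvvv', while the anchored one keeps only the names that consist of 'exams_', digits, and '.csv' and nothing else.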

I refer to the following links:
https://www.regular-expressions.info/
https://automatetheboringstuff.com/chapter7/

I read the latter one today, which is why I know about this now and could answer my own question. It’s a bit annoying that no one else answered us in (more than) a week…

Cheers!

I am almost wondering if this is a typo on Codecademy's part, because if you look on, say, StackOverflow, it appears that the * wildcard works like it does in other languages, where it stands in for whatever characters are in that position.

Wish @codeacademy would offer clarity.
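
For anyone comparing the two notations: as far as I can tell, the * that stands in for "any characters" is the shell-style wildcard (a glob), whereas the pattern argument of list.files() is read as a regular expression. Base R ships glob2rx() in the utils package to translate one into the other. A minimal sketch, with hypothetical file names:

```r
# A shell-style wildcard (glob), as one might type it in a terminal.
glob <- "exams_*.csv"

# glob2rx() from the utils package translates the glob into a regex.
rx <- utils::glob2rx(glob)
print(rx)

# Hypothetical file names to compare both patterns against.
files <- c("exams_0.csv", "exams_9.csv", "notes.txt")
translated_matches <- grepl(rx, files)   # glob translated to a regex
raw_glob_matches <- grepl(glob, files)   # glob passed unchanged as a regex

print(translated_matches)
print(raw_glob_matches)
```

So something like list.files(pattern = glob2rx('exams_*.csv')) should behave the way the wildcard version was expected to.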