FAQ: Data Cleaning in R - Dealing with Multiple Files

This community-built FAQ covers the “Dealing with Multiple Files” exercise from the lesson “Data Cleaning in R”.

Paths and Courses
This exercise can be found in the following Codecademy content:

Learn R


Hi everyone,

I have a short question about the pattern parameter in the list.files() function.
In this lesson, to assign the list of files, one needs to type

student_files <- list.files(pattern = 'exams_.*csv')

Usually, I would assume that * would be the wildcard symbol, so that

student_files <- list.files(pattern = 'exams_*.csv')

would make more sense to me. Googling about wildcard characters for the pattern parameter in list.files() didn’t quite help me.
So why is it that the upper line of code works with the files exams_0.csv, exams_1.csv, …, exams_9.csv, while the line of code which makes more sense to me doesn’t?

Thanks,
Fready


I asked myself the exact same question and did not find an answer either.


I found it out by myself by now.

The string for the pattern parameter is a so-called regular expression (regex).
A regex can contain tokens like ‘\d’, which stands for any digit from 0 to 9.
The dot . is indeed the wildcard symbol: it matches any single character. An asterisk placed after a character or a group means that the preceding item can be repeated any number of times, including 0 times.
For an actual dot, one has to put a backslash before the dot.

For example, the regex pattern ‘a(ha)*’ would match the string ‘aha’, but also the strings ‘ahahahahahaha’ and ‘a’.

The pattern ‘exams_.*csv’ thus means: Find any filename that contains the substrings ‘exams_’ and, somewhere after that, ‘csv’.
The string ‘yo_mama_so_stupid_she_failed_all_exams_twice_42_csv’ would match the pattern as well, and it doesn’t even contain a dot.
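
This can be checked without touching the file system by running grepl() on a few filenames. The names below are made up for illustration and are not the files from the exercise. The sketch also suggests why ‘exams_*.csv’ fails: read as a regex, the asterisk applies to the underscore, and the unescaped dot then has to match exactly one character right before ‘csv’.

```r
# Made-up filenames for illustration only
files <- c("exams_0.csv", "exams_9.csv", "notes.txt",
           "exams_twice_42_csv", "examsXcsv")

# 'exams_', then any characters, then 'csv'
regex_hits <- grepl("exams_.*csv", files)

# Read as a regex: 'exams', zero or more '_', any ONE character, then 'csv'
glob_style_hits <- grepl("exams_*.csv", files)

print(files[regex_hits])       # names matched by the lesson's pattern
print(files[glob_style_hits])  # names matched by the shell-style attempt
```

With these names, the second pattern only picks up ‘examsXcsv’, which is probably not what anyone intended.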

Instead of *, one can also use +, which means an arbitrary number of repetitions, but at least one occurrence. Thus the pattern ‘a(ha)+’ would not match ‘a’ anymore, but still ‘aha’ and ‘ahahahahahaha’.
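
A small sketch of the difference between * and +, using grepl() on the three example strings from above (the variable names are just placeholders):

```r
words <- c("a", "aha", "ahahahahahaha")

# (ha)* allows zero repetitions, so a bare 'a' is enough
star_hits <- grepl("a(ha)*", words)

# (ha)+ needs at least one 'ha' after the 'a'
plus_hits <- grepl("a(ha)+", words)

print(data.frame(words, star_hits, plus_hits))
```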
As I mentioned before, \d inside a pattern stands for any digit. In an R string each backslash has to be written twice, so a stricter version would be

student_files <- list.files(pattern = 'exams_\\d+\\.csv')

This means: Find any filenames that contain the substring ‘exams_’, directly followed by one or more digits (given through \d+), directly followed by the substring ‘.csv’ (remember the backslash before the dot). However, the filename ‘bexams_000.csvvv’ also matches this pattern, because a regex matches anywhere in a string unless it is anchored with ^ (start of the string) and $ (end of the string).
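
To illustrate the point about beginnings and endings, here is a sketch that compares the pattern with and without the anchors ^ (start of the string) and $ (end of the string). The filenames are made up, and note that inside an R string every backslash is typed twice.

```r
# Made-up filenames for illustration only
files <- c("exams_0.csv", "exams_12.csv", "bexams_000.csvvv", "exams_.csv")

# Unanchored: the pattern may match anywhere inside the name
loose <- grepl("exams_\\d+\\.csv", files)

# Anchored: the whole name has to be 'exams_', digits, '.csv'
strict <- grepl("^exams_\\d+\\.csv$", files)

print(data.frame(files, loose, strict))
```

The same anchored string could be passed as the pattern argument of list.files() if only names of exactly that shape should be picked up.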

I refer to the following links:
https://www.regular-expressions.info/
https://automatetheboringstuff.com/chapter7/

I read the latter one today, which is why I know about this now and could answer my own question. It’s a bit annoying that no one else answered us in (more than) a week…

Cheers!


I am almost wondering if this is a typo on Codecademy, because if you look on, say, StackOverflow, it appears the * wildcard functions like it does in other languages, where it expects characters to be in that position.

Wish @codeacademy would offer clarity.

Thanks for this helpful reply, I wondered the same thing myself.

Hello everyone! I need your help.

I am trying to read in multiple JSON files (approx 100) using this function. I cannot get it to work!!
Can anyone advise on what I’m doing wrong?

My code is:
df <- list.files("path", pattern="*.json", full.names=TRUE)

dflist <- lapply(df, fromJSON)

I keep getting this error:

Error in parse_con(txt, bigint_as_char) : parse error: trailing garbage
8aaeb38572","type":"custom"} {"name":"DrugUsed","ts":1605020
(right here) ------^

I am out of my depth here, but perhaps your files contain the json in a specific format.
These links may or may not be pertinent:


Thanks for this! It gave me a new direction to follow. Very new to R and these files are driving me mental :grimacing:

If anyone is looking for a quick and dirty solution I used this today and it worked OK for what I need at the moment:

library(geojsonR)
merge_files(INPUT_FOLDER = "path", OUTPUT_FILE = "your filename.json")


I thought the * was a wildcard, too!

Within the explanation of this module, there is a link with the text “regular expression”, which takes you to the regular expression module of “How to Clean Data with Python”:
https://www.codecademy.com/courses/practical-data-cleaning/lessons/nlp-regex-conceptual/exercises/introduction
There is also the usual cheat sheet:
https://www.codecademy.com/learn/practical-data-cleaning/modules/data-cleaning-with-pandas/cheatsheet
which explains how Codecademy uses them.

hello,

I have a question about the following lines:

student_files <- list.files(pattern = 'exams_.*csv')

print(student_files)

df_list <- lapply(student_files, read_csv)

why is the print(student_files) necessary here?

thanks in advance!
Marleen

I don’t think it is necessary, but it is useful. It allows us to see the filenames of the files that we matched. If we omit or comment out print(student_files), then our code will still work.
If you look at the screenshot, it is convenient to see the files we matched.
[screenshot]
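
To see that print() is only there for inspection, here is a self-contained sketch of the same steps. It is an assumption-laden stand-in for the exercise: it writes three tiny made-up files into a temporary folder and uses base R's read.csv() instead of readr's read_csv(), so that it runs without extra packages.

```r
# Work in a temporary folder so no real files are touched
tmp <- file.path(tempdir(), "exams_demo")
dir.create(tmp, showWarnings = FALSE)

# Write three tiny example files (contents are made up)
for (i in 0:2) {
  write.csv(data.frame(student = paste0("s", i), score = 70 + i),
            file.path(tmp, paste0("exams_", i, ".csv")),
            row.names = FALSE)
}

# Collect the matching files, with their full paths
student_files <- list.files(path = tmp, pattern = "exams_.*csv",
                            full.names = TRUE)

# Optional: only shows which files were matched
print(basename(student_files))

# Read every file, then stack the data frames into one
df_list <- lapply(student_files, read.csv)
students <- do.call(rbind, df_list)
print(students)
```

Commenting out the print(basename(student_files)) line changes nothing about df_list or students.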

Thank you! This is super helpful. I’m surprised the material didn’t address this, since I imagine most people would assume the asterisk is the wildcard symbol.
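
For anyone who prefers to think in shell-style wildcards, base R ships a helper, glob2rx() in the utils package, that translates such a wildcard pattern into a regular expression. A short sketch, with made-up filenames for the check:

```r
# Translate the shell-style wildcard into a regex
pattern <- utils::glob2rx("exams_*.csv")
print(pattern)

# Made-up filenames for illustration only
files <- c("exams_0.csv", "exams_9.csv", "bexams_000.csvvv")
hits <- grepl(pattern, files)
print(files[hits])
```

The resulting string can be handed to list.files(pattern = ...) in place of a hand-written regex; it also comes anchored, so names with extra characters before or after are left out.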