FAQ: Data Cleaning with Pandas - Dealing with Multiple Files

This community-built FAQ covers the “Dealing with Multiple Files” exercise from the lesson “Data Cleaning with Pandas”.

Paths and Courses
This exercise can be found in the following Codecademy content:

Practical Data Cleaning

FAQs on the exercise Dealing with Multiple Files

You can contribute to this section by offering your own questions, answers, or clarifications on this exercise. Ask or answer a question by clicking reply below.

If you’ve had an “aha” moment about the concepts, formatting, syntax, or anything else with this exercise, consider sharing those insights! Teaching others and answering their questions is one of the best ways to learn and stay sharp.

Having gone through the Pandas path, my question is: why don’t we just use concat() instead of glob()? What are the limitations of using one instead of the other?

Hi, what does glob.glob() mean? Why can’t we just use glob()?

The call follows the pattern [name of module].[name of function].
The glob module is imported without an alias, so the call uses the module’s own name: glob.glob().
If the import were "import glob as test", the call would be test.glob().
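To illustrate the naming, here is a minimal sketch of three equivalent ways to reach the same function. The alias test comes from the answer above; the name glob_function, the temporary folder, and the file names are made up for the demonstration.

```python
import glob                              # the module, under its own name
import glob as test                      # the same module under the alias "test"
from glob import glob as glob_function   # the function itself, imported directly
import os
import tempfile

# Create a throwaway folder holding two hypothetical CSV files.
with tempfile.TemporaryDirectory() as folder:
    for name in ("file1.csv", "file2.csv"):
        open(os.path.join(folder, name), "w").close()

    pattern = os.path.join(folder, "file*.csv")

    via_module = sorted(glob.glob(pattern))        # [module].[function]
    via_alias = sorted(test.glob(pattern))         # [alias].[function]
    via_function = sorted(glob_function(pattern))  # function called directly

# All three calls run the same function, so the results are identical.
print(via_module == via_alias == via_function)
```

So a bare glob() call only works if the function itself was imported, for example with "from glob import glob". After a plain "import glob", the name glob refers to the module, which cannot be called.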

Hey,
in this exercise, the regex * seems to stand for everything that comes after it,
much like “random%” in SQL, which would match “random7627hckg” etc.
This has been confusing for me, since the cheatsheet says the regex * is a multiplier.
I’d love to understand that clearly.
Thanks!

I’ve got the same question. The * doesn’t seem to work the way it did in the previous lessons.

I have a recommendation regarding the lesson’s description on the glob module.

glob can open multiple files by using regex matching to get the filenames:

Shouldn’t it be “by glob-style matching”? It’s similar to regex, but there are some important differences that show up in the lesson’s own examples. For example, the dots in the filenames are treated as literal dots rather than as wildcards, which is how glob-style matching behaves, not regex. Please take a look at this description in 3. Dealing with Multiple Files.
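The difference can be checked directly. The sketch below applies the same pattern string once with glob-style rules (through the standard library's fnmatch module) and once as a regular expression. The list of file names is made up for illustration.

```python
import fnmatch
import re

pattern = "file*.csv"
names = ["file1.csv", "file2.csv", "filee.csv", "fileXcsv", "fil.csv"]

# Glob-style: * matches any run of characters, and . is a literal dot.
glob_matches = [n for n in names if fnmatch.fnmatchcase(n, pattern)]

# Regex: * repeats the preceding "e", and . matches any single character.
regex_matches = [n for n in names if re.fullmatch(pattern, n)]

print("glob-style:", glob_matches)
print("regex:     ", regex_matches)
```

Read as a glob pattern, the string matches file1.csv, file2.csv and filee.csv. Read as a regex, it matches filee.csv, fileXcsv and fil.csv instead, and rejects file1.csv.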

Can you clarify the code below, please?

import glob
import pandas as pd

files = glob.glob("file*.csv")

df_list = []
for filename in files:
  data = pd.read_csv(filename)
  df_list.append(data)

df = pd.concat(df_list)

print(files)

import glob
import pandas as pd

files = glob.glob("file*.csv")

df_list = []
for filename in files:
  data = pd.read_csv(filename)
  df_list.append(data)

df = pd.concat(df_list)

print(files)

As I understand this code, * means zero or more occurrences of the preceding character, so the pattern ‘file*.csv’ should search for filee.csv, fil.csv, and so on. Wait a second, though: why aren’t we escaping the period (.) here?
I am talking about regex, since the lesson clearly says that we are creating a regex and searching through the files. But it does something different: they say it searches for file1.csv, file2.csv.

Nothing in the glob module mentions regular expressions; in fact, it uses Unix-style pattern matching.

So file*.csv would match file1.csv and file2.csv.

It seems the lesson text is just a little incorrect at present; the Python docs have the correct details.
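For anyone who thinks in regex, the standard library's fnmatch.translate() converts a shell-style pattern into a regex that matches the same names, which shows what file*.csv means under Unix-style rules. This is only a sketch: the sample names are made up, and the exact regex text printed can differ between Python versions.

```python
import fnmatch
import re

# Convert the shell-style pattern into a regex that matches the same names.
regex_text = fnmatch.translate("file*.csv")
compiled = re.compile(regex_text)

names = ["file1.csv", "file2.csv", "file.csv", "fil.csv", "file1xcsv"]
matched = [n for n in names if compiled.match(n)]

print(regex_text)  # roughly "file", then ".*", then an escaped dot and "csv"
print(matched)
```

In the translated regex the * has become .* (any run of characters) and the dot has been escaped, which is why file1.csv and file2.csv match while fil.csv and file1xcsv do not.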

Why do we need to concatenate the DataFrames in df_list into one DataFrame, students, when there seems to be no difference between df_list and students?

import codecademylib3_seaborn
import pandas as pd
import glob

student_files = glob.glob("exams*.csv")
df_list = []

for files in student_files:
  data = pd.read_csv(files)
  df_list.append(data)
students = pd.concat(df_list)
print(students)

I mean, pd.concat should update our indexes, but it does not. To get that result we need to pass the ignore_index option: students = pd.concat(df_list, ignore_index=True)
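A small sketch may help with both points. It uses two made-up DataFrames in place of the exam files (the names and scores are invented for illustration) and assumes pandas is installed.

```python
import pandas as pd

# Two stand-in DataFrames, as if each had been read from its own CSV file.
df_list = [
    pd.DataFrame({"name": ["Ana", "Ben"], "score": [90, 85]}),
    pd.DataFrame({"name": ["Cy", "Dee"], "score": [70, 95]}),
]

# Default behaviour: each piece keeps its original row labels.
students = pd.concat(df_list)

# ignore_index=True discards the old labels and numbers the rows from 0.
students_reindexed = pd.concat(df_list, ignore_index=True)

print(type(df_list).__name__, type(students).__name__)
print(list(students.index))
print(list(students_reindexed.index))
```

df_list is a plain Python list holding separate DataFrames, so DataFrame methods cannot be called on it, while students is a single DataFrame. Without ignore_index=True the row labels repeat (0, 1, 0, 1 here), which is why the printed output looks like the files were simply stacked.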

Note that the regex \d will not work within the glob() function, as in glob.glob("exams\d.csv").

glob does not use regex, but it does share some of the same syntax. Replacing \d with [0-9] will match exams1.csv through exams9.csv.
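This can be tried out in a throwaway folder. In the sketch below, the helper match() and the file names are hypothetical and exist only for the demonstration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create a few empty, made-up files to match against.
    for name in ("exams1.csv", "exams9.csv", "exams10.csv", "examsA.csv"):
        open(os.path.join(folder, name), "w").close()

    def match(pattern):
        # glob.escape protects any special characters in the folder path itself.
        paths = glob.glob(os.path.join(glob.escape(folder), pattern))
        return sorted(os.path.basename(p) for p in paths)

    with_class = match("exams[0-9].csv")     # one digit between "exams" and ".csv"
    with_backslash_d = match(r"exams\d.csv")  # \d is not a digit class in glob
    with_star = match("exams*.csv")          # any run of characters

print(with_class)
print(with_backslash_d)
print(with_star)
```

The [0-9] pattern picks up only the single-digit files, so exams10.csv is skipped as well as examsA.csv, while the \d version finds nothing.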