FAQ: Getting Started with Natural Language Processing - Text Preprocessing

This community-built FAQ covers the “Text Preprocessing” exercise from the lesson “Getting Started with Natural Language Processing”.

Paths and Courses
This exercise can be found in the following Codecademy content:

Natural Language Processing

FAQs on the exercise Text Preprocessing

What exactly does the get_part_of_speech(token) argument do in this situation?


In all seriousness, I have to ask … wat?


get_part_of_speech(token) returns the part of speech of the token.
In this context, it returns whether the token is a verb (‘v’), a noun (‘n’), an adjective (‘a’) or an adverb (‘r’).
This tells the lemmatizer how to reduce the word to its root form, based on the word’s part of speech.

For example:

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("stripes", "v"))  # strip
print(lemmatizer.lemmatize("stripes", "n"))  # stripe



Is there a more step-by-step tutorial available for total NLP beginners? This course jumps in at a relatively advanced level.


So I’ve Googled what re.sub() does and found a very good explanation of it in the Python documentation, which is pretty cool. But I still have questions about this particular line of code:

cleaned = re.sub(r'\W+', ' ', text)

According to the Python documentation,

Regular expressions use the backslash character ( '\' ) to indicate special forms or to allow special characters to be used without invoking their special meaning.

…but why is ‘W’ regarded as a special character here? And I can’t find any connection between it and the text we’ve been given in the exercise. Can someone explain, please?

Extra question: as I understood from Codecademy’s explanation, stemming and lemmatizing are both supposed to clean the text and make it more understandable. So why on Earth do the results in this exercise come out all wrong? For instance, “many” in the text becomes “mani” after stemming, and so on. And the original text from the exercise is already correct when you read it, so why do we need to change it? What am I misunderstanding here?
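
On the `\W` question above: `W` by itself is an ordinary letter, but the two-character sequence `\W` is one of the special forms the quoted documentation refers to. It matches any single character that is not a word character (a letter, digit, or underscore), and the `+` extends the match to a whole run of them. A minimal sketch, assuming a made-up sample string rather than the text from the exercise:

```python
import re

# Hypothetical sample string, NOT the text used in the exercise.
sample = "Hello, world!! It's 9am."

# \W  = one character that is not a letter, digit, or underscore
# \W+ = one or more such characters in a row
# Each run of punctuation/whitespace is replaced by a single space.
cleaned = re.sub(r"\W+", " ", sample)
print(repr(cleaned))  # 'Hello world It s 9am '

# Lowercase \w is the opposite class: it matches the word characters.
words = re.findall(r"\w+", sample)
print(words)  # ['Hello', 'world', 'It', 's', '9am']
```

So the pattern has nothing to do with a literal letter W in the text; it strips punctuation and collapses whitespace. The `r` prefix makes the pattern a raw string, so Python passes the backslash through to the regex engine unchanged.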

Well, the first page of this NLP lesson says:

Don’t worry if you don’t understand much of the content right now — you don’t need to at this point! This lesson is an introductory overview of some of the main topics in NLP and you’ll get to dive deeper into each topic on its own later.

So I assumed the main point of this lesson is just to show you how this works.
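
As for “many” turning into “mani”: a stemmer applies mechanical suffix rules, and its output only needs to be consistent across related word forms, not to be a real word. The sketch below is a toy, simplified version of a single rule from the original Porter algorithm (a final “y” becomes “i” when the rest of the word contains a vowel). The function name `toy_y_to_i` is made up for illustration, and this is not the code NLTK or the exercise actually runs:

```python
VOWELS = set("aeiou")

def toy_y_to_i(word):
    """Toy rule: turn a final 'y' into 'i' if the rest of the word has a vowel.

    Simplified illustration only; a real stemmer applies many more rules.
    """
    rest = word[:-1]
    if word.endswith("y") and any(ch in VOWELS for ch in rest):
        return rest + "i"
    return word

print(toy_y_to_i("many"))   # mani
print(toy_y_to_i("happy"))  # happi
print(toy_y_to_i("sky"))    # sky (no vowel before the final 'y')
```

The stemmed text is not meant to be read by people. Its purpose is to map different forms of a word onto one shared key so they can be counted together.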