FAQ: Getting Started with Natural Language Processing - Text Preprocessing

This community-built FAQ covers the “Text Preprocessing” exercise from the lesson “Getting Started with Natural Language Processing”.

Paths and Courses
This exercise can be found in the following Codecademy content:

Natural Language Processing

FAQs on the exercise Text Preprocessing

There are currently no frequently asked questions associated with this exercise – that’s where you come in! You can contribute to this section by offering your own questions, answers, or clarifications on this exercise. Ask or answer a question by clicking reply below.

If you’ve had an “aha” moment about the concepts, formatting, syntax, or anything else with this exercise, consider sharing those insights! Teaching others and answering their questions is one of the best ways to learn and stay sharp.

Join the Discussion. Help a fellow learner on their journey.

Ask or answer a question about this exercise by clicking reply below!

Agree with a comment or answer? Like to up-vote the contribution!

What exactly does the get_part_of_speech(token) argument do in this situation?

6 Likes

In all seriousness, I have to ask … wat?

3 Likes

get_part_of_speech(token) returns the part of speech of the token.
In this context, it returns whether the token is a verb (‘v’), a noun (‘n’), an adjective (‘a’), or an adverb (‘r’).
This helps the lemmatizer reduce each word to its root form based on the word’s part of speech.

For example:

print(lemmatizer.lemmatize("stripes", "v"))

strip

print(lemmatizer.lemmatize("stripes", "n"))

stripe
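
The helper itself isn’t shown in the exercise, but here is a minimal sketch of how a function like get_part_of_speech could work, assuming NLTK’s WordNet corpus is available (the counting approach is an illustration, not necessarily the exact helper the course ships):

from collections import Counter
from nltk.corpus import wordnet

def get_part_of_speech(word):
    # Count how often each part-of-speech tag appears among the word's
    # WordNet synsets: 'n' (noun), 'v' (verb), 'a'/'s' (adjective), 'r' (adverb).
    pos_counts = Counter(synset.pos() for synset in wordnet.synsets(word))
    if not pos_counts:
        # WordNet has no entry for this word; default to noun.
        return 'n'
    # Return the most frequent tag; lemmatize() accepts these tags directly.
    return pos_counts.most_common(1)[0][0]

lemmatizer.lemmatize(token, get_part_of_speech(token)) then picks the right root for each token instead of always assuming a noun (the lemmatizer’s default).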

5 Likes

Is there a more step-by-step tutorial available for total NLP beginners? This course jumps in at a relatively advanced level.

4 Likes

So I’ve Googled what re.sub() does and found a very good explanation of it in the Python documentation, which is pretty cool. But I still have questions about this particular line of code:

cleaned = re.sub('\W+', ' ', text)

According to the Python documentation,

Regular expressions use the backslash character ( '\' ) to indicate special forms or to allow special characters to be used without invoking their special meaning.

…but why is ‘W’ regarded as a special character here? I literally find no connection between it and the text we’ve been provided with in the exercise. Can someone explain, please?


Extra question: as I understood from Codecademy’s explanation, stemming and lemmatizing are both supposed to clean the text and make it easier to process. But why on Earth do they turn out all wrong in this exercise? For instance, the text’s “many” becomes “mani” after stemming, etc. And in the end, when you read the original text from the exercise, it’s all correct, so why do we need to change it? What am I misunderstanding here?

1 Like

Well, on the first page of this NLP lesson it says:

Don’t worry if you don’t understand much of the content right now — you don’t need to at this point! This lesson is an introductory overview of some of the main topics in NLP and you’ll get to dive deeper into each topic on its own later.

So I assumed the main point of this lesson is just to showcase how this all works.

In regular expressions, the \W character class matches any non-word character (i.e., any character that is not a letter, digit, or underscore). The + quantifier matches one or more occurrences of the preceding character or character class; here, that means one or more consecutive non-word characters.

The re.sub() function replaces all occurrences of a pattern in a string with a specified replacement. In this case, the pattern \W+ matches one or more non-word characters, and the replacement is a single space ' '. Therefore, re.sub('\W+', ' ', text) replaces each run of one or more non-word characters in text with a single space.

The purpose of this statement is to remove all non-word characters and replace them with a space, effectively cleaning up the text and preparing it for further text processing tasks such as tokenization or part-of-speech tagging.
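
For example, here is a quick demo (the sample sentence is made up for illustration; the pattern is written as a raw string, r'\W+', which is the usual way to stop Python itself from interpreting the backslash):

import re

text = "Hello, world!!! NLP -- it's fun."
# Replace every run of non-word characters (anything that is not a
# letter, digit, or underscore) with a single space.
cleaned = re.sub(r'\W+', ' ', text)
print(cleaned)

This prints Hello world NLP it s fun (plus a trailing space from the final period). Note that the apostrophe in “it’s” is also a non-word character, which is why the s gets split off.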

Now, coming to your second question:

In natural language processing (NLP), reducing words to their base or root form is an important step in text preprocessing. The reason for this is that it can help to simplify the text data and reduce the dimensionality of the feature space.

For example, consider a text dataset that contains the following three sentences:

  • The cat is jumping over the fence.
  • The cats are jumping over the fences.
  • The jumped cat will not jump again.

Although these sentences have different grammatical structures and contain different words, they all share a common meaning: a cat or cats jumping over a fence or fences. By reducing each word to its base or root form, we can simplify the text data and represent each sentence as a set of common base or root words, such as “cat”, “jump”, and “fence”.

Reducing words to their base or root form also helps address sparsity in text data, where many words occur only once or a few times in the dataset: grouping together words with a similar meaning or function reduces the number of unique words.

It likewise addresses word variation, where the same word can appear in different forms due to tense, plurality, or case; reducing each form to a common base groups those variants together.

Therefore, reducing words to their base or root form is an important step in NLP text preprocessing, as it can help to simplify the text data, reduce dimensionality, address sparsity and variation problems, and improve the accuracy of downstream NLP tasks such as text classification or sentiment analysis.
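
To make that concrete, here is a small sketch using NLTK on the three example sentences above. It assumes the punkt and wordnet resources have been downloaded, and it treats every token as a verb, which is a simplification of the exercise’s per-token get_part_of_speech lookup:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt', quiet=True)    # one-time tokenizer data download
nltk.download('wordnet', quiet=True)  # one-time lemmatizer data download

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

sentences = [
    "The cat is jumping over the fence.",
    "The cats are jumping over the fences.",
    "The jumped cat will not jump again.",
]

for sentence in sentences:
    tokens = word_tokenize(sentence.lower())
    print([stemmer.stem(token) for token in tokens])
    print([lemmatizer.lemmatize(token, 'v') for token in tokens])

# Stemming chops suffixes by rule, so the result is not always a real
# word; for example, the Porter stemmer turns "many" into "mani".
print(stemmer.stem("many"))

Comparing the two outputs also answers the earlier question about words turning out “wrong”: stems like “fenc” and “mani” are not meant to be read by humans. They only need to be consistent, so that “fence” and “fences” map to the same feature for downstream tasks.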