What do `re`, `.stem`, and `.sub` do?

```python
# re is the standard-library module for regular expressions
import re
# NLTK preprocessing
import nltk
from nltk.tokenize import word_tokenize  # break text into words
from nltk.stem import PorterStemmer      # stemming
from nltk.stem import WordNetLemmatizer  # lemmatization
# part-of-speech helper provided by the exercise:
from part_of_speech import get_part_of_speech

text = "So many squids are jumping out of suitcases these days that you can barely go anywhere without seeing one burst forth from a tightly packed valise. I went to the dentist the other day, and sure enough I saw an angry one jump out of my dentist's bag within minutes of arriving. She hardly even noticed."

# replace each run of non-word characters with a single space
cleaned = re.sub(r'\W+', ' ', text)
tokenized = word_tokenize(cleaned)

stemmer = PorterStemmer()
stemmed = [stemmer.stem(token) for token in tokenized]

## -- CHANGE these -- ##
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized]  # reduce words to roots

print("Stemmed text:")
print(stemmed)
print("\nLemmatized text:")
print(lemmatized)
```

When you ask a question, don’t forget to include a link to the exercise or project you’re dealing with!

If you want to have the best chances of getting a useful answer quickly, make sure you follow our guidelines about how to ask a good question. That way you’ll be helping everyone – helping people to answer your question and helping others who are stuck to find the question and answer! :slight_smile:

`re` is the standard-library module for regular expressions, and `re.sub` is its substitution function — you can look both up in the Python docs. `.stem` is different: it isn’t part of `re` at all, it’s a method of NLTK’s `PorterStemmer` (imported from `nltk.stem`), documented in the NLTK docs.
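Here’s a minimal sketch showing both in isolation (the Porter stemmer works without any downloaded NLTK corpora, so this runs on its own):

```python
import re
from nltk.stem import PorterStemmer

# re.sub(pattern, replacement, string) rewrites every match of the regex
# pattern; r'\W+' matches one or more non-word characters in a row.
print(re.sub(r'\W+', ' ', "it's here!"))  # "it s here " (punctuation becomes spaces)

# .stem is a method on a PorterStemmer instance, not on the re module
stemmer = PorterStemmer()
print(stemmer.stem("jumping"))  # "jump"
```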
