What do re, .stem, and .sub do?
# regex for removing punctuation!
import re
# nltk preprocessing magic
import nltk
from nltk.tokenize import word_tokenize # Break text into words
from nltk.stem import PorterStemmer # Stemming: strip word endings to get a stem
from nltk.stem import WordNetLemmatizer # Lemmatization: reduce words to their root form
# grabbing a part of speech function:
from part_of_speech import get_part_of_speech
text = "So many squids are jumping out of suitcases these days that you can barely go anywhere without seeing one burst forth from a tightly packed valise. I went to the dentist the other day, and sure enough I saw an angry one jump out of my dentist's bag within minutes of arriving. She hardly even noticed."
cleaned = re.sub(r'\W+', ' ', text)
tokenized = word_tokenize(cleaned)
stemmer = PorterStemmer()
stemmed = [stemmer.stem(token) for token in tokenized]
## -- CHANGE these -- ##
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized] # Break down words into roots
print("Stemmed text:")
print(stemmed)
print("\nLemmatized text:")
print(lemmatized)
When you ask a question, don’t forget to include a link to the exercise or project you’re dealing with!
If you want to have the best chances of getting a useful answer quickly, make sure you follow our guidelines about how to ask a good question. That way you’ll be helping everyone – helping people to answer your question and helping others who are stuck to find the question and answer!
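On the question itself: `re` is Python's built-in regular expression module, and `re.sub(pattern, replacement, string)` returns a copy of the string with every match of the pattern swapped for the replacement. `nltk.stem` is the NLTK submodule that holds the stemmer and lemmatizer classes, and `stemmer.stem(token)` is the `PorterStemmer` method that reduces one word to its stem. Below is a minimal sketch of the `re.sub` step only, using a shortened sentence from the post as the sample input (the shortening is mine, not part of the exercise):

```python
import re

# Shortened sample taken from the text in the question.
text = "I saw an angry one jump out of my dentist's bag. She hardly even noticed."

# \W+ matches one or more characters that are NOT letters, digits, or underscore,
# so each run of spaces and punctuation is replaced by a single space.
cleaned = re.sub(r'\W+', ' ', text)

print(repr(cleaned))
```

Note that the apostrophe in "dentist's" is also replaced, so the word comes out as "dentist s", and the final period becomes a trailing space.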