What do re, .stem, and .sub do?
    # regex for removing punctuation!
    import re
    # nltk preprocessing magic
    import nltk
    from nltk.tokenize import word_tokenize  # Break text into words
    from nltk.stem import PorterStemmer  # Present (Stemming)
    from nltk.stem import WordNetLemmatizer  # Root (Lemmatization)
    # grabbing a part of speech function:
    from part_of_speech import get_part_of_speech

    text = "So many squids are jumping out of suitcases these days that you can barely go anywhere without seeing one burst forth from a tightly packed valise. I went to the dentist the other day, and sure enough I saw an angry one jump out of my dentist's bag within minutes of arriving. She hardly even noticed."

    # raw string so that \W reaches the regex engine unchanged
    cleaned = re.sub(r'\W+', ' ', text)
    tokenized = word_tokenize(cleaned)

    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(token) for token in tokenized]

    ## -- CHANGE these -- ##
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized]  # Break down words into roots

    print("Stemmed text:")
    print(stemmed)
    print("\nLemmatized text:")
    print(lemmatized)
When you ask a question, don’t forget to include a link to the exercise or project you’re dealing with!
If you want to have the best chances of getting a useful answer quickly, make sure you follow our guidelines about how to ask a good question. That way you’ll be helping everyone – helping people to answer your question and helping others who are stuck to find the question and answer!
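On the question itself, here is a rough sketch rather than an official answer. `re` is Python's built-in regular expression module, and `re.sub(pattern, replacement, string)` returns a copy of the string with every match of the pattern replaced. `nltk.stem` is the NLTK subpackage that holds the stemmers and lemmatizers, and `stemmer.stem(token)` is the method that reduces a single word to its stem. The snippet below only demonstrates `re.sub`, using a shortened sample sentence of my own (not the exercise text) so it runs without NLTK installed:

```python
import re

# Shortened sample sentence (an assumption for illustration, not the exercise text)
text = "She hardly even noticed. My dentist's bag!"

# \W+ matches one or more characters that are NOT letters, digits, or underscores,
# so punctuation and whitespace runs are each collapsed into a single space.
cleaned = re.sub(r"\W+", " ", text)

print(repr(cleaned))
# 'She hardly even noticed My dentist s bag '
```

Note that the apostrophe in "dentist's" also counts as a non-word character, so the word splits into "dentist" and "s" before tokenizing.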