Why do stopwords have the highest tf-idf scores?

In one of the exercises in the Build Chatbots with Python course, we are asked to find the tf-idf scores for words in some news articles. Why do stopwords like “the” have the highest scores in the tables, but when we find the highest-scoring words they appear to have scores of 0 in the table?

For example, “the” has a score of 6.0 in an article, but the hint in the exercise says “fare” is the highest-scoring word in that article.

Do you have any code or a screenshot for this?
How many times does the word “fare” appear in the document?
Maybe remove the stopwords and then rerun the code(?)

tf-idf (term frequency times inverse document frequency) is calculated as: tf * idf.
tf: how often a word appears in a document.
idf: a measure of how rare a word is across all documents of a corpus. (It should be noted that the idf component penalizes terms that appear in many documents: the more documents contain a term, the lower its idf.)

The idf calculation is:
idf(t) = log(total number of documents / number of documents containing term t)

This means that as the number of documents containing term t increases, the idf decreases. In other words, the more documents a word appears in across the corpus, the less important it is to any individual document.
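
To make that concrete, here is a quick illustration of the formula with made-up numbers (10 articles, “the” appearing in all of them, “fare” in just one):

import math

# toy corpus statistics (made-up numbers for illustration)
total_docs = 10
docs_with_the = 10   # "the" appears in every article
docs_with_fare = 1   # "fare" appears in only one article

idf_the = math.log(total_docs / docs_with_the)    # log(10/10) = 0.0
idf_fare = math.log(total_docs / docs_with_fare)  # log(10/1) ≈ 2.30

print(idf_the, idf_fare)

So even if “the” has a high term frequency, multiplying by an idf of (close to) 0 keeps its tf-idf low, while a single occurrence of “fare” gets the full weight of its high idf. (As an aside, scikit-learn’s TfidfTransformer uses a smoothed variant by default, ln((1 + N) / (1 + df)) + 1, so a word that appears in every document still gets an idf of 1; with norm=None that would explain why “the” can show a score like 6.0: it is just its raw count.)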

So, can’t we infer that the word “the” doesn’t offer much insight about the document, but the word “fare” might?

Logarithm: https://en.wikipedia.org/wiki/Logarithm

You can also remove the stopwords by importing the stopwords corpus from the nltk library. (Maybe a future lesson in the chatbots course will go over this; if not, it’s covered in the DS NLP course.)

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

See:
https://pythonspot.com/nltk-stop-words/
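
Here is a minimal sketch of filtering stopwords with nltk (the example sentence and variable names are made up; the stopword list and tokenizer data need a one-time download):

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stopword list
nltk.download('punkt')      # tokenizer models (newer nltk may need 'punkt_tab')

stop_words = set(stopwords.words('english'))

text = "The fare for the bus increased again this year."
tokens = word_tokenize(text.lower())

# keep only alphabetic tokens that are not stopwords
filtered = [t for t in tokens if t.isalpha() and t not in stop_words]
print(filtered)  # ['fare', 'bus', 'increased', 'year']

You could run the processed articles through a filter like this before vectorizing, and the stopwords would drop out of the tables entirely.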

code:

import codecademylib3_seaborn
import pandas as pd
import numpy as np
from articles import articles
from preprocessing import preprocess_text

# import CountVectorizer, TfidfTransformer, TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

# view article

print(articles[0])

# preprocess articles

processed_articles = []
for article in articles:
    processed_articles.append(preprocess_text(article))

# initialize and fit CountVectorizer

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(processed_articles)

# convert counts to tf-idf

transformer = TfidfTransformer(norm=None)
tfidf_scores_transformed = transformer.fit_transform(counts)

# initialize and fit TfidfVectorizer

vectorizer = TfidfVectorizer(norm=None)
tfidf_scores = vectorizer.fit_transform(processed_articles)

# check if tf-idf scores are equal

if np.allclose(tfidf_scores_transformed.todense(), tfidf_scores.todense()):
    print(pd.DataFrame({'Are the tf-idf scores the same?': ['YES']}))
else:
    print(pd.DataFrame({'Are the tf-idf scores the same?': ['No, something is wrong :(']}))

# get vocabulary of terms

try:
    feature_names = vectorizer.get_feature_names()
except:
    pass

# get article index

try:
    article_index = [f"Article {i+1}" for i in range(len(articles))]
except:
    pass

# create pandas DataFrame with word counts

try:
    df_word_counts = pd.DataFrame(counts.T.todense(), index=feature_names, columns=article_index)
    print(df_word_counts)
except:
    pass

# create pandas DataFrame(s) with tf-idf scores

try:
    df_tf_idf = pd.DataFrame(tfidf_scores_transformed.T.todense(), index=feature_names, columns=article_index)
    print(df_tf_idf)
except:
    pass

try:
    df_tf_idf = pd.DataFrame(tfidf_scores.T.todense(), index=feature_names, columns=article_index)
    print(df_tf_idf)
except:
    pass

# get highest scoring tf-idf term for each article

for i in range(1, len(articles) + 1):
    print(df_tf_idf[[f'Article {i}']].idxmax())
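
(One note on the try/except blocks: in newer versions of scikit-learn, get_feature_names() has been removed in favor of get_feature_names_out(), so depending on your version you may need feature_names = vectorizer.get_feature_names_out() instead.)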

I took another look at the code, and it is true that “fare” has a higher tf-idf score than “the.”

Like I said above, wouldn’t that mean that “fare” has greater importance overall because it has a higher score?
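
If you want to see that directly, you can sort one article’s column of the tf-idf DataFrame (a quick sketch using the df_tf_idf built above; the column label assumes Article 1):

# show the ten highest-scoring terms for Article 1
print(df_tf_idf['Article 1'].sort_values(ascending=False).head(10))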

Also, check out this post about formatting one’s code.
