Codecademy Forums

What is scikit-learn's CountVectorizer object?

Question

We learn here that in order to use scikit-learn for naive Bayes, we must use the CountVectorizer object. What is this object and how does it differ from the method we used to implement naive Bayes with Python’s Counter object?

Answer

When we implemented naive Bayes with the Counter data structure, we first had to prepare the text by splitting the positive and negative reviews into lists of individual words. We then passed those lists to Counter. Scikit-learn lets us do both steps at once with the CountVectorizer object, which implements tokenization (that is, dividing our text documents into words, or tokens) and occurrence counting in a single class. This makes our work easier.
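
To make the comparison concrete, here is a minimal sketch of both approaches side by side. The two reviews are made up for illustration and are not the course data, and the variable names are assumptions. CountVectorizer is used with its default settings, which lowercase the text and keep tokens of two or more characters.

```python
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews, invented for this example.
reviews = ["great great movie", "terrible movie"]

# Manual approach: split every review into words, then count them with Counter.
manual_counts = Counter(word for review in reviews for word in review.split())

# CountVectorizer handles tokenization and counting in one object.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(reviews)

vocab = vectorizer.vocabulary_  # maps each token to its column index
matrix = counts.toarray()       # one row per review, one column per token

print(manual_counts)
print(vocab)
print(matrix)
```

One difference worth noticing: Counter gives a single tally across all the text, while CountVectorizer returns one row of counts per document.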

Why did I get zeros in the resulting array after calling the .transform method on the same data I used to fit?

Printing the .transform result (an array where each element is the number of times a word from .transform(data) appears in .fit(data)):

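from sklearn.feature_extraction.text import CountVectorizer

# neg_list and pos_list are the lists of review strings from the lesson
counter = CountVectorizer()
counter.fit(neg_list + pos_list)
training_counts = counter.transform(neg_list + pos_list)
print(training_counts.toarray())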

I expect each element to be >= 1, since .transform(data) is applied to the same data that was passed to .fit(data). This means that each word in .transform(data) should appear in .fit(data) at least once.

But it gives me many zeros:

[[0 0 1 ... 0 0 0]
 [0 0 0 ... 1 0 0]
 [0 0 0 ... 3 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Why?
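
A likely explanation, sketched with made-up data: each row of the array is one document and each column is one word of the whole fitted vocabulary. An entry counts how often that word occurs in that single document, so it is 0 whenever the document does not contain the word. What fitting and transforming the same data does guarantee is that every column sums to at least 1. The three strings below are hypothetical stand-ins for neg_list + pos_list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for neg_list + pos_list.
docs = ["bad bad plot", "good plot", "good acting"]

counter = CountVectorizer()
counter.fit(docs)
counts = counter.transform(docs).toarray()

# Column order follows the indices stored in the fitted vocabulary.
words = sorted(counter.vocabulary_, key=counter.vocabulary_.get)
print(words)
print(counts)

# Each word occurs somewhere in the fitted data, so every column total is >= 1,
# even though most individual entries are 0.
column_totals = counts.sum(axis=0)
print(column_totals)
```

With a real review set the vocabulary has thousands of words and each review uses only a few of them, which would explain why the printed array is mostly zeros.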