What is scikit-learn's CountVectorizer object?


We learn here that in order to use scikit-learn for naive Bayes, we must use the CountVectorizer object. What is this object and how does it differ from the method we used to implement naive Bayes with Python’s Counter object?


When we implemented naive Bayes with the Counter data structure, we first had to prepare the text by splitting the positive and negative reviews into lists of individual words. We then passed these prepared lists to Counter. Scikit-learn lets us do both steps at once with the CountVectorizer object, which implements both tokenization (that is, dividing our text documents into words, or tokens) and occurrence counting in a single class. In this way, it makes our work easier.
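As a rough illustration of the difference, here is a minimal sketch that counts words both ways. The two toy reviews and the variable names are made up for this example and are not the lesson's data; the manual tokenization here is just lowercasing and splitting on whitespace, which happens to agree with CountVectorizer's default settings for this simple input.

```python
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews, not the lesson's data.
reviews = ["good good movie", "bad movie"]

# Manual approach: split each review into words, then count them.
words = []
for review in reviews:
    words.extend(review.lower().split())
manual_counts = Counter(words)

# CountVectorizer: tokenization and counting happen inside one object.
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(reviews)  # one row per review

# Sum each column to get the total count of each word across all reviews.
totals = np.asarray(matrix.sum(axis=0)).ravel()
vectorizer_counts = {
    word: int(totals[index]) for word, index in vectorizer.vocabulary_.items()
}

print(dict(manual_counts))
print(vectorizer_counts)
```

Note that Counter gives one overall tally, while CountVectorizer keeps a separate row of counts for each review; the column sums are only computed here to make the two results comparable.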


Why did I get nulls in the resulting array after applying the .transform method to the same data that was used to fit?

Printing the .transform result (an array where each element is the number of times a word from .transform(data) appears in .fit(data)):

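from sklearn.feature_extraction.text import CountVectorizer

counter = CountVectorizer()
counter.fit(neg_list + pos_list)  # learn the vocabulary from all reviews
training_counts = counter.transform(neg_list + pos_list)  # sparse count matrix
print(training_counts.toarray())  # dense view of the counts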

I expect each element to be >= 1, since .transform(data) is applied to the same data as .fit(data). This means that each word from .transform(data) should appear in .fit(data) at least once.

But it gives me many nulls:

[[0 0 1 ... 0 0 0]
 [0 0 0 ... 1 0 0]
 [0 0 0 ... 3 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]



Did you find any answers to your question?

A 0 is not a null value. The result you are printing is a NumPy array with shape (100, 1603). If you print the len of pos_list and neg_list, you will see that they add up to 100.
So each inner array (row) represents one of the reviews from neg_list and pos_list, and each column represents one word from the vocabulary learned in .fit. A 0 means that the word is in the vocabulary but does not appear in that review; any other number means that the word appears in the review, and the number is how many times it appears.

In other words, each review gets transformed into a vector (one row) in a bigger matrix.
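A small sketch may make this easier to see. The four short reviews below are hypothetical stand-ins for neg_list and pos_list, chosen only so that the whole matrix fits on screen.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the lesson's neg_list and pos_list.
neg_list = ["terrible plot", "boring and terrible"]
pos_list = ["great plot", "great great acting"]

counter = CountVectorizer()
counter.fit(neg_list + pos_list)
training_counts = counter.transform(neg_list + pos_list)

# Vocabulary words ordered by their column index in the matrix.
columns = sorted(counter.vocabulary_, key=counter.vocabulary_.get)
dense = training_counts.toarray()

print(training_counts.shape)  # (number of reviews, number of vocabulary words)
print(columns)
print(dense)

# Every word occurs somewhere in the data, so each column total is >= 1,
# but a single review contains only a few of the words, so most cells are 0.
print(dense.sum(axis=0))
```

With these four reviews the shape is (4, 6): four rows (reviews) and six columns (vocabulary words). The expectation in the question holds for the column totals, not for every individual cell.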
