What is scikit-learn's CountVectorizer object?



Question

We learn here that in order to use scikit-learn for naive Bayes, we must use the CountVectorizer object. What is this object and how does it differ from the method we used to implement naive Bayes with Python’s Counter object?

Answer

When we implemented naive Bayes with the Counter data structure, we first had to prepare the text by splitting the positive and negative reviews into lists of individual words. We then passed these prepared lists to Counter to get the word counts. Scikit-learn lets us do both steps at once with the CountVectorizer object, which implements both tokenization (that is, dividing our text documents into individual words, or tokens) and occurrence counting in a single class. Instead of one dictionary of counts per class, it produces a matrix with one row per document and one column per word in the vocabulary, which is the input format scikit-learn's naive Bayes classifiers expect.
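To make the comparison concrete, here is a minimal sketch of both approaches side by side. The two reviews are made-up toy data, not the lesson's dataset, and the variable names are only illustrative.

```python
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews, invented for this illustration.
reviews = ["great movie great cast", "boring movie"]

# Manual approach: split each review into words ourselves, then count them.
manual_counts = [Counter(review.split()) for review in reviews]

# CountVectorizer: tokenization and counting happen inside one object.
vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(reviews)  # sparse document-term matrix

# vocabulary_ maps each token to the column that holds its counts.
vocabulary = vectorizer.vocabulary_
dense_counts = count_matrix.toarray()  # dense copy, fine for tiny examples

print(manual_counts[0]["great"])               # count from Counter
print(dense_counts[0][vocabulary["great"]])    # same count from CountVectorizer
```

Both lines print the number of times "great" appears in the first review. Note that with its default settings CountVectorizer also lowercases the text and ignores single-character tokens, so its counts can differ from a plain `split()` on real-world text.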