Increasing the amount of data you train a model on will usually improve its accuracy (assuming it is high-quality, clean data). However, I'm not sure you understand how all the pieces fit together here. It's not immediately clear from the lessons on Naive Bayes Classification, so I'll do my best to break it down.
You have data that you want to use to train your classification model. This is the only data that should be used in the `.fit()` methods. You should never use `.fit()` on any data that you are trying to use the model to classify. This includes `CountVectorizer.fit()`. You do, however, use `CountVectorizer.transform()` on your test data; you just don't fit it before you transform it.
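To make that concrete, here's a minimal sketch (the example reviews are made up purely for illustration) of fitting a `CountVectorizer` on the training data only, then transforming both sets:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and test reviews, just for illustration
train_reviews = ["this movie was great", "terrible acting and plot"]
test_reviews = ["great plot but terrible pacing"]

vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_reviews)  # fit ONLY on training data
test_counts = vectorizer.transform(test_reviews)        # transform (no fit!) on test data

# Both matrices use the same vocabulary, learned from the training data alone;
# test-only words like "but" and "pacing" are simply ignored
print(train_counts.shape, test_counts.shape)  # (2, 8) (1, 8)
```

Because the vocabulary is learned only from the training data, both matrices have the same number of columns, which is exactly what the classifier needs.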
Part of the reason this is confusing is that in the Naive Bayes Classification lessons, you don't actually use `train_test_split()`. In the lessons, all of your data is used to train the model, and then you only test it on one single review. In the real world, however, you will never be testing it on one single data point. Instead, you will want to test it on a LOT of data in order to get a better understanding of how accurate your model is. Typically, you can get a good gauge of your model's accuracy if you have a sufficiently large training set and your test set is about 1/4 the size of your training set. Hence, the 80/20 split.
So why not just use all of your data to fit your model and then find extra data to test on? Because that becomes too much work once you start using datasets of any significant size. Imagine your dataset has 1 million data points and you fit your model with the whole thing. Now, if you want to test your model on a test set that is 1/4 of that size, you need to find 250,000 more data points, then clean and label them, before you can use them for testing. Now imagine you work for a FAANG company (Facebook, Amazon, Apple, Netflix, Google): you will have much more than 1 million points in your dataset. It's a lot easier to just split your data at the beginning, because you already have it ready to go.
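Here's a quick sketch of that 80/20 split using scikit-learn's `train_test_split()` (the toy texts and labels are just placeholders):

```python
from sklearn.model_selection import train_test_split

# Toy labeled dataset; in practice this would be your full cleaned, labeled data
texts = [f"review number {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

# test_size=0.2 gives an 80/20 split, so the test set is
# about 1/4 the size of the training set
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)
print(len(train_texts), len(test_texts))  # 8 2
```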
Now, getting back to the Naive Bayes Classifier, here are the steps you take:

- Split your data into training data and test data.
- `fit()` and `transform()` your training data with `CountVectorizer` (you can also use `.fit_transform()` for this step).
- `transform()` your test data with `CountVectorizer` (no fitting this time).
- Fit your `MultinomialNB` model with the training data vectors and the training data labels.
- Use the `.predict()` and `.score()` methods of your `MultinomialNB` model to see the predictions and accuracy of your model. `.predict()` takes your test data vectors, and `.score()` takes the test data vectors and test data labels.
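Put together, the steps above can be sketched roughly like this (the tiny review dataset is invented purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy labeled reviews (1 = positive, 0 = negative); stand-ins for a real dataset
reviews = ["loved it", "great fun", "wonderful film", "awful mess",
           "hated it", "terrible film", "great film", "awful acting"]
labels = [1, 1, 1, 0, 0, 0, 1, 0]

# 1. Split into training data and test data
train_reviews, test_reviews, train_labels, test_labels = train_test_split(
    reviews, labels, test_size=0.25, random_state=0
)

# 2. Fit and transform the training data; 3. transform (only) the test data
vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_reviews)
test_counts = vectorizer.transform(test_reviews)

# 4. Fit the classifier with the training vectors and training labels
classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)

# 5. Predict and score using the test vectors (and test labels for scoring)
predictions = classifier.predict(test_counts)
accuracy = classifier.score(test_counts, test_labels)
print(predictions, accuracy)
```

With a dataset this tiny the accuracy number is meaningless, but the shape of the workflow is the same at any scale.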
Hope this helps!