Machine Learning Training Data and Test Data

Hi Codecademy team,

I have a question regarding splitting data into training data and test data using the train_test_split() function.

from sklearn.model_selection import train_test_split

train_data, test_data, train_label, test_label = train_test_split(all_text_data, all_text_labels,
                                                                  test_size=0.2, random_state=1)

When this function has successfully split your data and labels (with a test size of 0.2), does every single data point in the training data differ from those in the test data? In other words, will data points in the training set never appear in the test set?

Kind Regards,
Jimmy

Yes, that is correct, though some data points may have the same values for certain attributes.

For example, if your data set is the Titanic passenger list, there are sure to be multiple passengers with the same age, ticket class, etc. The only difference between two data points may be the passenger’s name, or even just the id. However, they are still two distinct data points. Scikit-learn’s train_test_split will allocate each unique data point to either the training data or the test data, but never both.
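
If you want to convince yourself, here is a quick sanity check on a toy dataset (the variable names here are just for illustration):

from sklearn.model_selection import train_test_split

# ten distinct data points with dummy labels
data = ["point_{}".format(i) for i in range(10)]
labels = [0, 1] * 5

train_data, test_data, train_label, test_label = train_test_split(data, labels,
                                                                  test_size=0.2, random_state=1)

# the intersection is empty: no point lands in both sets
print(set(train_data) & set(test_data))  # prints set()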

Hi @el_cocodrilo,

Thank you so much for explaining this to me. This actually brings up my next question.

Prior to using the Naive Bayes classifier, I am aware that we have to build our count vector using scikit-learn's CountVectorizer. Now, my confusion is: which data are we supposed to use to fit the CountVectorizer?

This particular off-platform project (Twitter) has a section where we build our count vector prior to classifying. The instructions directed me to fit the CountVectorizer with my training data. I disagree with this method because, as you and I established, the training data and the test data contain different data points. Furthermore, the training data contains only 80% of the dataset we split.

Isn’t this going to make our model’s vocabulary less comprehensive than if we had used the full dataset?

The code is below:

ny_text = new_york_tweets["text"].tolist()
london_text = london_tweets["text"].tolist()
paris_text = paris_tweets["text"].tolist()

ny_text_label = [0] * len(ny_text)
london_text_label = [1] * len(london_text)
paris_text_label = [2] * len(paris_text)

all_text_data = ny_text + london_text + paris_text
all_text_labels = ny_text_label + london_text_label + paris_text_label

from sklearn.model_selection import train_test_split

train_data, test_data, train_label, test_label = train_test_split(all_text_data, all_text_labels,
                                                                  test_size=0.2, random_state=1)

from sklearn.feature_extraction.text import CountVectorizer

counter = CountVectorizer()
counter.fit(train_data)  # why fit on train_data here, and not all_text_data?

Kind Regards,
Jimmy

Of course it is. But the point is to test out our model on data that: a) the model was not trained on; and b) we know the correct classification for. This is why we do the 80/20 split of our data. That 20 percent is data that we have already labeled, and we want to see if our model can correctly classify it without ever “seeing” it before.

Hi @el_cocodrilo,

Thanks for your prompt response again. Now I get the logic! So in this case, is it safe to say that one of the ways to improve our Naive Bayes model's accuracy is to increase the amount of data we feed into the CountVectorizer, and from there, re-create the training data we pass to the Naive Bayes model?

Kind Regards,
Jimmy

Increasing the amount of data you train a model on will usually improve its accuracy (assuming it is high-quality, clean data). However, I'm not sure you understand how all the pieces fit together here. It's not immediately clear from the lessons on Naive Bayes Classification, so I'll do my best to break it down here.

You have data that you want to use to train your classification model. This is the only data that should be used in the .fit() methods. You should never use .fit() on any data that you are trying to use the model to classify. This includes CountVectorizer.fit(). You do, however, use CountVectorizer.transform() on your test data — you just don’t fit it before you transform it.
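
Using the variable names from your code above, that pattern looks like this (just a sketch; train_counts and test_counts are names I've made up here):

counter = CountVectorizer()
counter.fit(train_data)                       # learn the vocabulary from the training data only
train_counts = counter.transform(train_data)  # vectorize the training data
test_counts = counter.transform(test_data)    # vectorize the test data with that same vocabulary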

Part of the reason this is confusing is because in the Naive Bayes Classification lessons, you don’t actually use train_test_split(). In the lessons, all of your data is used to train the model, and then you only test it on one single review. In the real world, however, you will never be testing it on one single data point. Instead, you will want to test it on a LOT of data in order to get a better understanding of how accurate your model is. Typically, you can get a good gauge on the accuracy of your model if you have a sufficiently large data set for training, and your test set is about 1/4 of the size of your training set. Hence, the 80/20 split.

So why not just use all of your data to fit your model and then find the extra data to test on? Because this gets to be too much work once you start using datasets of any significant size. Imagine your dataset has 1 million data points and you fit your model with the whole thing. Now, if you want to test your model on a test set that is 1/4 of that size, you need to find 250,000 more data points, clean them and label them before you can use them for testing. Now imagine you work for a FAANG company (Facebook, Amazon, Apple, Netflix, Google) — you will have much more than 1 million points in your dataset. It’s a lot easier to just split your data at the beginning, because you already have it ready to go.

Now, getting back to the Naive Bayes classifier, here are the steps you take (a sketch putting them all together follows the list):

  1. Split your data into training data and test data
  2. fit() and transform() your training data with CountVectorizer (you can also use .fit_transform() for this step).
  3. transform() your test data with CountVectorizer
  4. fit() your MultinomialNB model with the training data vectors and the training data labels
  5. Use the .predict() and .score() methods of your MultinomialNB model to see the predictions and accuracy of your model. .predict() takes your test data vectors, and .score() takes the test data vectors and test data labels.
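
Putting all five steps together with your variable names (a rough sketch; train_counts, test_counts, and classifier are names I've chosen for illustration):

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# 1. split the data and labels
train_data, test_data, train_label, test_label = train_test_split(all_text_data, all_text_labels,
                                                                  test_size=0.2, random_state=1)

# 2. fit and transform the training data
counter = CountVectorizer()
train_counts = counter.fit_transform(train_data)

# 3. transform the test data (no fitting here!)
test_counts = counter.transform(test_data)

# 4. fit the classifier on the training vectors and labels
classifier = MultinomialNB()
classifier.fit(train_counts, train_label)

# 5. predict and score on the test vectors and labels
predictions = classifier.predict(test_counts)
print(classifier.score(test_counts, test_label))  # accuracy on the test set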

Hope this helps!
