Email similarity project

In the following code, shouldn’t counter.fit() be called only on train_emails.data to avoid data leakage?

from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

train_emails = fetch_20newsgroups(
  categories = ['rec.sport.baseball', 'rec.sport.hockey'], 
  subset='train', 
  shuffle=True, 
  random_state=108
)

test_emails = fetch_20newsgroups(
  categories = ['rec.sport.baseball', 'rec.sport.hockey'], 
  subset='test', 
  shuffle=True, 
  random_state=108
)

counter = CountVectorizer()

# Shouldn't the vectorizer be fitted only on train_emails.data to avoid data leakage?
counter.fit(test_emails.data + train_emails.data)

train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)

classifier = MultinomialNB()

classifier.fit(train_counts, train_emails.target)
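
For comparison, here is a minimal sketch of what I would expect a leakage-free version to look like. It uses a tiny made-up corpus so it runs without downloading anything: train_texts, train_labels and test_texts are hypothetical stand-ins for train_emails.data, train_emails.target and test_emails.data. The vocabulary is learned from the training texts only, and words that appear only in the test texts are ignored by transform().

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up stand-ins for train_emails.data / train_emails.target / test_emails.data
train_texts = [
    "the pitcher threw a fastball",
    "home run in the ninth inning",
    "the goalie stopped the puck",
    "a hat trick on the ice",
]
train_labels = [0, 0, 1, 1]  # 0 = baseball, 1 = hockey
test_texts = ["the puck hit the zamboni", "a fastball over the plate"]

counter = CountVectorizer()

# Learn the vocabulary from the training texts only
train_counts = counter.fit_transform(train_texts)

# Reuse that vocabulary for the test texts; words never seen in training
# (for example "zamboni") get no column and are dropped
test_counts = counter.transform(test_texts)

classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)

predictions = classifier.predict(test_counts)
print(predictions)
```

Fitted this way, the test set has no influence on the feature space, which seems closer to how the classifier would meet genuinely unseen emails.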

Do you have a link to the lesson/course?

Yes, this one:

Supervised Learning II: Advanced Regressors and Classifiers | Codecademy