Machine Learning - Email Similarity

Hey there everyone!

Working through the ML skill path, I really got into the whole comparison between different inputs.

In the Email Similarity exercise, I’d like to scatter-plot every pair of newsgroups and see how they match up!

But right now my code gets stuck. I don’t know if it’s an issue with Python 3, but I’ve had no output for 10 minutes already. Is there any reason why I’m not getting anything printed?

Here’s my code:

from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

emails = fetch_20newsgroups()
all_categories = ['alt.atheism', '', '', '', 'comp.sys.mac.hardware', '', '', '', '', '', '', 'sci.crypt', 'sci.electronics', '', '', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
predictions = []
categories = []
for i in range(len(all_categories)):
  for j in range(len(all_categories)):
    categories.append(all_categories[i])
    categories.append(all_categories[j])
    train_emails = fetch_20newsgroups(categories=categories, subset='train', shuffle=True, random_state=108)
    test_emails = fetch_20newsgroups(categories=categories, subset='test', shuffle=True, random_state=108)

    counter = CountVectorizer()
    counter.fit(test_emails.data + train_emails.data)
    train_counts = counter.transform(train_emails.data)
    test_counts = counter.transform(test_emails.data)

    classifier = MultinomialNB()
    classifier.fit(train_counts, train_emails.target)
    predictions.append(categories[0] + " " + categories[1] + " " + str(classifier.score(test_counts, test_emails.target)))


Link to the exercise:

I’ve not undertaken that lesson before, but perhaps consider using print inside the loop (or a more advanced debugging tool) to check where you are in the loop. I’d also suggest that, since i and j appear to be used only for indexing into all_categories, you use for item in iterable instead of the unnecessary indexing (if you’re planning to use the indices, then never mind). Based on the ever-expanding categories list, I can see your run time going nuts, so it might be best to just kill it and consider some of the points below.

Like I say, I’ve not done this problem before, so these are a few guesses that might not actually be of use to you:

Is it correct that categories should become larger on every iteration? As it’s written, I can see numerous repeated categories being added to categories. Should it perhaps be emptied after each iteration? Could you avoid the repeats? If you truly intend to create every possible ordered pairing from a list of twenty elements (which sounds problematic), then the itertools module has some nice tools for that (such as fixed-length combinations and permutations).
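To sketch what I mean by the itertools route (using a made-up three-element list in place of your twenty newsgroup names): combinations gives you each unordered pair exactly once, with no category paired against itself, and permutations gives the ordered version if you really want both directions.

```python
from itertools import combinations, permutations

# Stand-in list; your real list has the twenty newsgroup names.
cats = ['alt.atheism', 'sci.crypt', 'sci.electronics']

# Each unordered pair once, no category paired with itself:
pairs = list(combinations(cats, 2))
print(pairs)
# [('alt.atheism', 'sci.crypt'), ('alt.atheism', 'sci.electronics'),
#  ('sci.crypt', 'sci.electronics')]

# Ordered pairs, if direction matters (twice as many):
print(len(list(permutations(cats, 2))))  # 6
```

For twenty categories that is 190 unordered pairs rather than the 400 ordered ones your double loop produces, which also halves the run time.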

Bear in mind that, with the given code structure and use of append, that is most definitely not what is happening: categories just gets longer every iteration. Get your combinations sorted out before crunching the data, even if that means commenting out the actual functionality for now; if you’re passing the wrong data, that’s already an issue. If it’s still slow after that, consider something like NumPy to vectorise your code where possible, since for loops and function calls will always be a bit slow in Python.

There are also a lot of calls to the fetch_20newsgroups function. Could you perhaps pull all the data at the start and only pass certain sections of it? Depending on how that function operates, it could be a very slow operation. Is that what your emails variable was originally planned for?
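As a rough sketch of what I mean (select_pair is a name I’ve made up, and I’ve swapped the real download for a tiny stand-in object so this runs offline): the object fetch_20newsgroups returns exposes .data, .target and .target_names, so you could download everything once and slice out each pair of categories yourself.

```python
from types import SimpleNamespace

# In the real code this would be one network call, reused for every pair:
# from sklearn.datasets import fetch_20newsgroups
# emails = fetch_20newsgroups(subset='train')

# Tiny stand-in with the same attributes as the real result object:
emails = SimpleNamespace(
    target_names=['alt.atheism', 'sci.crypt', 'sci.electronics'],
    data=['doc a', 'doc b', 'doc c', 'doc d'],
    target=[0, 1, 2, 1],
)

def select_pair(bunch, wanted):
    """Return only the documents and labels belonging to the wanted categories."""
    ids = {bunch.target_names.index(name) for name in wanted}
    docs = [d for d, t in zip(bunch.data, bunch.target) if t in ids]
    labels = [t for t in bunch.target if t in ids]
    return docs, labels

docs, labels = select_pair(emails, ['alt.atheism', 'sci.crypt'])
print(docs)    # ['doc a', 'doc b', 'doc d']
print(labels)  # [0, 1, 1]
```

That way the (potentially slow) fetch happens once instead of hundreds of times inside the double loop.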

I’d also be a little careful about shuffling data when trying to make comparisons. If your dataset is imperfect then you might start picking up a little bit of randomisation rather than the actual comparisons you wanted.


Is it correct that categories should become larger on every iteration?

No, definitely not! I think I’ve fixed that! Anyway, I’ve made a note of itertools; I’ll check it out and finish this project off platform :smiley:
I think I’ll also get rid of comparing a category with itself.

There are also a lot of calls to the fetch_20newsgroups function.

I got an error when trying to fetch it all from the instance I created at the beginning: emails! I’ll look into that and try to work through the error.

I’d also be a little careful about shuffling data when trying to make comparisons.

random_state is set to 108; I thought that setting random_state would take care of that issue. Should it be set to 1? I thought random_state just picks a particular sequence of pseudorandom numbers; does a higher number actually increase the randomness, or is it something else? If so, I think I’ve never actually understood random_state.
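For what it’s worth, here is a quick sketch of my current understanding, using Python’s plain random module as a stand-in for sklearn’s random_state (so this is an analogy, not sklearn itself): the seed only fixes which shuffle you get, so the same seed always reproduces the same order, and a bigger number isn’t “more random”.

```python
import random

data = list(range(10))

# Two shuffles seeded identically come out identical:
a = random.Random(108).sample(data, len(data))
b = random.Random(108).sample(data, len(data))
print(a == b)  # True

# A different seed just gives a different (not "more random") order:
c = random.Random(1).sample(data, len(data))
print(c)
```

So if that carries over, 108 vs 1 shouldn’t matter for correctness, only for reproducing the exact same split each run.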

Thank you so much!!! It’s really gratifying to get so much response from someone :smiley:

Replying to you, but I think @codecademy might give us a better answer. Does code run on the platform process everything on our own computers or on servers? Could it be that processes get shut down on the server side if they run past a certain point?