I'm not sure what the code for Mystery Friend in Module 4 is doing

This is in reference to the completed code below. I’ve left comments on parts where I am confused, or parts that give my interpretation of what’s happening. The main part I don’t understand is marked below.

from goldman_emma_raw import goldman_docs
from henson_matthew_raw import henson_docs
from wu_tingfang_raw import wu_docs
# import sklearn modules here:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB 

# Setting up the combined list of friends' writing samples
friends_docs = goldman_docs + henson_docs + wu_docs
# We assign each friend a label: 1, 2, or 3. The magic numbers are the number of documents from each friend.
friends_labels = [1] * 154 + [2] * 141 + [3] * 166


mystery_postcard = """
My friend,
From the 10th of July to the 13th, a fierce storm raged, clouds of
freezing spray broke over the ship, incasing her in a coat of icy mail,
and the tempest forced all of the ice out of the lower end of the
channel and beyond as far as the eye could see, but the _Roosevelt_
still remained surrounded by ice.
Hope to see you soon.
"""


bow_vectorizer = CountVectorizer()

# We build a feature dictionary mapping every word in the combined friends' docs to an index
bow_vectorizer.fit(friends_docs)
# We now count the occurrence of each word in friends_docs, i.e. vectorization
friends_vectors = bow_vectorizer.transform(friends_docs)

# We count the occurrence of each word (in relation to the friends_docs dictionary) in the postcard
mystery_vector = bow_vectorizer.transform([mystery_postcard])
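# (side note, not part of the project: you can peek at what the vectorizer built;
# get_feature_names_out() lists the learned vocabulary, toarray() shows the raw counts)
# print(bow_vectorizer.get_feature_names_out()[:10])
# print(mystery_vector.toarray())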

# Implementing a Naive Bayes classifier
friends_classifier = MultinomialNB()

# **This is the big part that I don't understand.** Isn't friends_vectors just a list representing a vectorization? And friends_labels just a big list like [1,..,2,..,3]? How does friends_classifier know whose words belong to which number?

friends_classifier.fit(friends_vectors, friends_labels)

predictions = friends_classifier.predict(mystery_vector)

mystery_friend = predictions[0] if predictions[0] else "someone else"

print("The postcard was from {}!".format(mystery_friend))

I’m also quite confused about MultinomialNB(). It seems hastily introduced. Help would be much appreciated.

For your first question: the friends_classifier.fit() call trains the model to associate each vectorized sentence (the parsed, counted text) with the corresponding numeric label that indicates the author.

I think of the training data as keys and the training labels as values. We just didn’t assign the labels ourselves; they’re provided for us at the top of the code. Each author gets a categorical code (1 for goldman, 2 for henson, 3 for wu), and a list is built that repeats each author’s number once per sentence that author wrote. So we end up with a training set of individual sentences, ordered goldman, then henson, then wu, and a parallel list of training labels (a run of 1’s equal to the number of goldman sentences, followed by a run of 2’s equal to the number of henson sentences, etc.).
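To make that concrete, here is a minimal sketch of the same label construction using len() instead of the magic numbers (it assumes the three *_docs lists imported at the top of the code):

friends_labels = ([1] * len(goldman_docs)
                  + [2] * len(henson_docs)
                  + [3] * len(wu_docs))
# e.g. [1, 1, ..., 1, 2, 2, ..., 2, 3, 3, ..., 3] -- one label per sentence, in order

This builds exactly the same list as the hard-coded version, as long as the document counts are 154, 141, and 166.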

The MultinomialNB model is then trained so that the first parsed (and counted) sentence in friends_vectors (i.e. friends_vectors[0]) is paired with the label at friends_labels[0], the second with friends_labels[1], and so on.
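That positional pairing is really the whole trick: fit() lines up row i of the vector matrix with element i of the label list, which is why the two must have the same length. A quick sanity check (assuming the variables from the project code above):

print(friends_vectors.shape[0])  # number of sentence rows: 154 + 141 + 166 = 461
print(len(friends_labels))       # must be the same number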

What the MultinomialNB model is actually doing, under the hood, is determining the probability that a sentence belongs to category 1, 2, or 3 based on the word counts the CountVectorizer calculated.
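You can even look at those probabilities directly. This is just a sketch using the trained classifier from the project code; predict() simply returns the label whose probability is highest:

probabilities = friends_classifier.predict_proba(mystery_vector)
print(probabilities)  # something like [[0.01 0.98 0.01]] -- one column per label 1, 2, 3
print(friends_classifier.predict(mystery_vector))  # the label with the highest probability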

I found this explanation about applying Bayesian probabilities to natural language processing really helpful.

I also think it would have helped if the lesson had referred us back to the earlier unit where we used MultinomialNB to classify emails from different listservs (hockey vs. baseball); a scaled-down version of that idea is sketched below.
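For anyone who finds a tiny end-to-end run easier to follow, here is a self-contained sketch in the spirit of that listserv exercise (the sentences and labels below are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["the goalie blocked the puck",
              "he skated across the ice",
              "the batter hit a home run",
              "the pitcher threw a fastball"]
train_labels = [1, 1, 2, 2]  # 1 = hockey, 2 = baseball

vectorizer = CountVectorizer()
train_vectors = vectorizer.fit_transform(train_docs)  # learn the vocabulary and count words

classifier = MultinomialNB()
classifier.fit(train_vectors, train_labels)  # pair row i of the counts with label i

test_vector = vectorizer.transform(["the pitcher hit the ball"])
print(classifier.predict(test_vector))  # expected: [2], i.e. baseball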