Sentiment Analysis Python Script

Hi all,

Is anyone able to help me understand the code below? The idea behind it is to build a model that predicts how positive or negative a new word/phrase/sentence is and generates a score.

If NLTK has been imported, why do we need to specify the positive, negative, and neutral words ourselves? Also, the score that gets generated doesn’t seem to be correct.

If someone could explain to me each part of this code and help me with the above questions in layman’s terms (as I’m pretty new to all of this) that would be amazing! :slight_smile:


import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names
import sys

positive_vocab = ['excellent', 'amazing', 'enjoyed', 'oscar', 'outstanding', 'engrossing', 'funny', 'intense']
negative_vocab = ['batman and superman', 'boring', 'adam sandler', 'bland']
neutral_vocab = ['ok', 'watchable']

def word_feats(words):
    return dict([(word, True) for word in words])

positive_features = [(word_feats(pos), 'pos') for pos in positive_vocab]
negative_features = [(word_feats(neg), 'neg') for neg in negative_vocab]
neutral_features = [(word_feats(neu), 'neu') for neu in neutral_vocab]

train_set = negative_features + positive_features + neutral_features
classifier = NaiveBayesClassifier.train(train_set)

neg = 0
pos = 0
sentence = "it was excellent despite adam sandler"
sentence = sentence.lower()
words = sentence.split()
for word in words:
    classResult = classifier.classify(word_feats(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1

print('Positive: ' + str(float(pos)/len(words)))
print('Negative: ' + str(float(neg)/len(words)))

Thanks in advance,
Mel

The Codebyte doesn’t work on my end (I submitted a bug report for it) so I copied your code and ran it in Colab.
The score was:
Positive: 0.3333333333333333
Negative: 0.6666666666666666

Did you get something else? Have you tried switching up the sentence to change the score? Try adding more negative words or more positive words to see how the score changes.
I’m going to take a stab at this and perhaps others can chime in. :slight_smile:

So, you’re importing those NLTK modules to do a lot of the work but they don’t do everything for us. (I’m just now learning NLP myself and so far, it’s pretty cool!). Anyway…

A good place to start is reading the documentation (which I assume you’ve done):
https://www.nltk.org/api/nltk.classify.html
https://www.nltk.org/howto/corpus.html

You have to create a training set by supplying a list of words for each category: positive, negative and neutral vocabulary. Then you write a feature function, word_feats, that takes one parameter, words, and returns a dictionary mapping each item in it to True. Pairing those dictionaries with the labels 'pos', 'neg' and 'neu' gives you the training examples. You’re training the ML model based on Bayes’ theorem, in short, the probability of A happening given that B has occurred.
See here about Bayes Classifier:
https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c
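If it helps to see the shape of it, here’s a stripped-down sketch (with a tiny made-up training set, not the one from the script) of what NaiveBayesClassifier.train() expects: a list of (feature_dict, label) pairs, where the same kind of feature dict is later handed to classify().

from nltk.classify import NaiveBayesClassifier

# each training example pairs a feature dictionary with a label
train_set = [
    ({'excellent': True}, 'pos'),
    ({'boring': True}, 'neg'),
    ({'ok': True}, 'neu'),
]
classifier = NaiveBayesClassifier.train(train_set)

# the classifier predicts a label from a feature dictionary
print(classifier.classify({'excellent': True}))  # -> 'pos'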

You then create two running totals, one for negative words and one for positive words. A for loop walks through the cleaned-up (lowercased and split) sentence, asks the classifier for a label for each word, and adds a point to the matching total; the script then prints each total divided by the number of words as the score.
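Concretely, the printed numbers are just those running totals divided by the word count. For the six-word sentence in the original post, two words came back labelled 'pos' and four 'neg', so:

Positive = pos / len(words) = 2 / 6 ≈ 0.33
Negative = neg / len(words) = 4 / 6 ≈ 0.67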

The final score for the sentence is .66 negative and .33 positive. I mean, it’s not really a glowing review of Adam Sandler: the movie was great even with Sandler in it. I like Adam Sandler, so my review would be different. :slight_smile:

Hiya @lisalisaj

Thank you soo much for your response.

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names
import sys

# insert the training data here
positive_vocab = ['excellent', 'amazing', 'enjoyed', 'oscar', 'outstanding', 'brilliant', 'good']
negative_vocab = ['batman and superman', 'boring', 'adam sandler', 'poor', 'lame', 'asleep']
neutral_vocab = ['ok', 'watchable', 'reasonable']

# insert the 'features' function here
def word_feats(words):
    return dict([(word, True) for word in words])

positive_features = [(word_feats(pos), 'pos') for pos in positive_vocab]
negative_features = [(word_feats(neg), 'neg') for neg in negative_vocab]
neutral_features = [(word_feats(neu), 'neu') for neu in neutral_vocab]

# train the model here
train_set = negative_features + positive_features + neutral_features
classifier = NaiveBayesClassifier.train(train_set)

# the following code does the prediction:
neg = 0
pos = 0
neu = 0
sentence = "Here is my outstanding review with some more text"
sentence = sentence.lower()
words = sentence.split()
for word in words:
    classResult = classifier.classify(word_feats(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1
    if classResult == 'neu':
        neu = neu + 1

print('Positive: ' + str(float(pos)/len(words)))
print('Negative: ' + str(float(neg)/len(words)))
print('Neutral: ' + str(float(neu)/len(words)))

With this code I get:

Positive: 0.4444444444444444
Negative: 0.3333333333333333
Neutral: 0.2222222222222222

The sentence is “Here is my outstanding review with some more text”

The only positive word I see here is “outstanding”; the rest seem neutral to me. So why is there a negative result of 0.33?

Similarly, with this code below, I get the results:

Positive: 0.4
Negative: 0.6

The sentence is “Here is my outstanding review”, so I don’t understand the results at all!

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names
import sys

# insert the training data here
positive_vocab = ['excellent', 'amazing', 'enjoyed', 'oscar', 'outstanding']
negative_vocab = ['batman and superman', 'boring', 'adam sandler']
neutral_vocab = ['ok', 'watchable']

# insert the 'features' function here
def word_feats(words):
    return dict([(word, True) for word in words])

positive_features = [(word_feats(pos), 'pos') for pos in positive_vocab]
negative_features = [(word_feats(neg), 'neg') for neg in negative_vocab]
neutral_features = [(word_feats(neu), 'neu') for neu in neutral_vocab]

# train the model here
train_set = negative_features + positive_features + neutral_features
classifier = NaiveBayesClassifier.train(train_set)

# the following code does the prediction:
neg = 0
pos = 0
sentence = "Here is my outstanding review"
sentence = sentence.lower()
words = sentence.split()
for word in words:
    classResult = classifier.classify(word_feats(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1

print('Positive: ' + str(float(pos)/len(words)))
print('Negative: ' + str(float(neg)/len(words)))

Again, any help here to understand what’s going on would be grand :slight_smile:

Mel x

At the moment your classifier is operating by letter/character instead of by word. Is that intentional? Be careful with nested iteration.
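To make that concrete, here’s a quick check using the word_feats function from the scripts above. A plain string is an iterable of its characters, so both the training vocabulary and the words being classified end up as letter features:

def word_feats(words):
    return dict([(word, True) for word in words])

# a string iterates character by character, so this builds letter features
print(word_feats('outstanding'))
# {'o': True, 'u': True, 't': True, 's': True, 'a': True, 'n': True, 'd': True, 'i': True, 'g': True}

# wrapping the word in a list gives the intended word feature
print(word_feats(['outstanding']))
# {'outstanding': True}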

It’s worth having a look at what your classifier is working with so consider printing the feature sets you’ve created and perhaps have a wee look at classifier.most_informative_features() which might help you work out how your classifier is operating with the given feature set.
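For example (assuming the variables from your script are still in scope), something like this shows both what the training examples actually contain and which features the classifier leans on most:

# peek at a couple of training examples - note they are character-level
print(positive_features[:2])
print(negative_features[:2])

# list the features the classifier weights most heavily
print(classifier.most_informative_features(10))

# or the pretty-printed version of the same thing
classifier.show_most_informative_features(10)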

Hiya,

It should be by word.
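For anyone who finds this thread later, here’s a minimal sketch of one way to make it word-based, keeping the same training data as the last script above: split each vocabulary entry when building the features, and wrap each word in a list when classifying, so word_feats always receives whole words rather than letters.

from nltk.classify import NaiveBayesClassifier

positive_vocab = ['excellent', 'amazing', 'enjoyed', 'oscar', 'outstanding']
negative_vocab = ['batman and superman', 'boring', 'adam sandler']
neutral_vocab = ['ok', 'watchable']

def word_feats(words):
    # expects an iterable of words; maps each word to True
    return dict([(word, True) for word in words])

# split each entry so multi-word phrases like 'adam sandler'
# become word features rather than character features
positive_features = [(word_feats(pos.split()), 'pos') for pos in positive_vocab]
negative_features = [(word_feats(neg.split()), 'neg') for neg in negative_vocab]
neutral_features = [(word_feats(neu.split()), 'neu') for neu in neutral_vocab]

train_set = negative_features + positive_features + neutral_features
classifier = NaiveBayesClassifier.train(train_set)

sentence = "Here is my outstanding review"
words = sentence.lower().split()

pos = neg = 0
for word in words:
    # wrap the word in a list so word_feats sees one word, not its letters
    result = classifier.classify(word_feats([word]))
    if result == 'pos':
        pos += 1
    elif result == 'neg':
        neg += 1

print('Positive: ' + str(pos / len(words)))
print('Negative: ' + str(neg / len(words)))

One caveat: words the classifier has never seen (“here”, “is”, “my”, …) still get forced into one of the three labels, so with a training set this tiny the scores will stay noisy even after the fix.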