Data scientist NLP-track portfolio project (can't figure out why the accuracy is high)

Hi,

This project is an effort to integrate my machine learning and data analysis knowledge.

I’m still scratching my head trying to figure out why the following line produces 99% accuracy:

feature_list = [(document_sentiment_dict(d, most_common_words), c) for (d, c) in tweets_labeled]

The variable “tweets_labeled” is a list of pairs, each made up of an untokenized tweet and its sentiment label (i.e. positive/negative). I expected the untokenized tweets, being whole strings, to provide little sentiment information.

I tried tokenizing the tweets and got zero accuracy.
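
To illustrate the difference between untokenized and tokenized input, here is a minimal sketch. It is not the actual `document_sentiment_dict`: `sketch_features` is a hypothetical stand-in that assumes each feature is a simple membership check of a vocabulary word against the document, and the sample tweet and vocabulary are made up. Under that assumption, Python's `in` does substring matching on a string but exact-element matching on a list, so the same function can yield quite different features depending on which one it receives.

```python
def sketch_features(document, vocabulary):
    """Hypothetical stand-in for a feature function (an assumption, not the real
    document_sentiment_dict): one boolean feature per vocabulary word."""
    return {f"contains({word})": (word in document) for word in vocabulary}


# Made-up vocabulary and tweet, for illustration only.
vocabulary = ["good", "bad", "art"]
raw_tweet = "What a goodbye party".lower()

# Untokenized: `in` on a str checks for substrings,
# so "good" matches inside "goodbye" and "art" inside "party".
raw_features = sketch_features(raw_tweet, vocabulary)

# Tokenized: `in` on a list checks for whole-token equality,
# so neither "good" nor "art" matches any token.
token_features = sketch_features(raw_tweet.split(), vocabulary)

print(raw_features)
print(token_features)
```

If the real function works along these lines, comparing its output for one tweet passed as a string and as a token list might show where the two runs diverge.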

Any suggestions, whether or not they relate to the confusingly high accuracy, are highly appreciated.

Here’s the link:

Thanks and stay warm. :blush:

You-shan