How do we determine how to perform smoothing in the case of naive Bayes?

Question

Here we decide that adding 1 to the numerator and N, the number of unique words in our dataset, to the denominator is what needs to be done to accomplish “smoothing”. How is this determined?

Answer

Let’s first recall that smoothing is a response to the case where a review contains a word that does not appear in our dataset. For example, say that the review is “This crib is groovy” and we want to judge whether it is a positive review. Naive Bayes says that we should compute P(review | positive) like this:

P("This" | positive) * P("crib" | positive) * P("is" | positive) * P("groovy" | positive)

If the word “groovy” is nowhere to be found in our dataset, then P("groovy" | positive) = 0 and so P("This crib is groovy" | positive) = 0. We don’t want this. Smoothing gives such words a non-zero baseline value. What is a reasonable value? The smoothing presented here assumes that a reasonable value is 1/N, where N is the number of unique words in our dataset. You can think of this as meaning that the unmatched word could stand in for any of the known words, and so it contributes proportionally to each of the potential classifications (in our case, positive and negative).
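To see the zero-probability problem in action, here is a minimal sketch using made-up toy data. The word list and the helper `unsmoothed_likelihood` are illustrative assumptions, not part of the original lesson.

```python
from collections import Counter

# Hypothetical toy training data: every word seen in positive reviews.
positive_words = "this crib is great this crib is sturdy".split()
counts = Counter(positive_words)
total_pos = sum(counts.values())  # total word occurrences in positive reviews


def unsmoothed_likelihood(review):
    """Product of raw relative frequencies P(word | positive), no smoothing."""
    prob = 1.0
    for word in review.lower().split():
        # Counter returns 0 for a word it has never seen.
        prob *= counts[word] / total_pos
    return prob


seen = unsmoothed_likelihood("This crib is great")
unseen = unsmoothed_likelihood("This crib is groovy")
print(seen, unseen)
```

A single unseen word ("groovy") drives the whole product to zero, no matter how strongly the other words point to a positive review.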


But when a review contains a new unique word or a typo, the baseline would be 1/(N + total_pos), not 1/N. Could someone please explain this?
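To make the numbers in this question concrete, here is a minimal sketch of add-one smoothing on made-up toy data. The names `total_pos` and `N` follow the thread; the word lists and the helper `smoothed` are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy data.
positive_words = "this crib is great this crib is sturdy".split()
negative_words = "this crib is flimsy".split()

pos_counts = Counter(positive_words)
total_pos = sum(pos_counts.values())  # 8 word occurrences in positive reviews
vocab = set(positive_words) | set(negative_words)
N = len(vocab)  # 6 unique words in the whole dataset


def smoothed(word):
    """Add-one estimate of P(word | positive): (count + 1) / (total_pos + N)."""
    return Fraction(pos_counts[word] + 1, total_pos + N)


baseline = smoothed("groovy")                # word never seen in training
total = sum(smoothed(w) for w in vocab)      # sum over the N known words
print(baseline, total)
```

With these counts, an unseen word gets 1/(total_pos + N) = 1/14, which matches the baseline in the question. Adding N to the denominator is what keeps the smoothed probabilities of the N known words summing to 1 after 1 has been added to each of their counts.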


In practice (e.g. in a business or research project where we care a lot about catching every word in a text we’re trying to classify), surely we can use tools like regex to detect the correct spelling as well as common misspellings? How might such “dictionaries of misspellings” be incorporated?


And for words that already existed, P(word | positive) changed too!

Correct, but a word that already exists will have a probability (word_in_pos + 1) times that of a word that does not, so existing words carry more weight in the calculation than unseen ones.
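A quick check of that ratio, as a sketch with hypothetical numbers (`total_pos`, `N`, and `word_in_pos` below are made up for illustration):

```python
from fractions import Fraction

# Hypothetical counts, chosen only for illustration.
total_pos = 8      # total word occurrences in positive reviews
N = 6              # unique words in the dataset
word_in_pos = 2    # times a known word appears in positive reviews

raw = Fraction(word_in_pos, total_pos)               # before smoothing
seen = Fraction(word_in_pos + 1, total_pos + N)      # after smoothing
unseen = Fraction(1, total_pos + N)                  # word never seen

ratio = seen / unseen
print(raw, seen, unseen, ratio)
```

The known word's probability does shift (from 1/4 to 3/14 here), but it remains exactly word_in_pos + 1 = 3 times the probability assigned to an unseen word.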