Question
Here we decide that adding 1 to the numerator and N, the number of unique words in our dataset, to the denominator is what is needed to accomplish "smoothing". How is this determined?
Answer
Let's first recall that smoothing is a response to the issue where a review contains a word that is not contained in our dataset. For example, say the review is "This crib is groovy" and we want to judge whether it is a positive review. Naive Bayes says that we should compute P(review | positive) like this:

P("This" | positive) * P("crib" | positive) * P("is" | positive) * P("groovy" | positive)

If the word "groovy" is nowhere to be found in our dataset, then P("groovy" | positive) = 0, and so P("This crib is groovy" | positive) = 0.
We don't want this. What smoothing gives us is a baseline, non-zero default value for such words. What is a reasonable value? The smoothing presented here pretends that every one of the N unique words in our dataset was seen one extra time, which is why we add 1 to the numerator and N to the denominator; an unseen word then gets the small probability 1/(total + N) instead of 0. You can think of this as saying that the unseen word could have the meaning of any of the known words, and so it contributes proportionally to each of the potential classifications, for us positive and negative.
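The smoothed version of the earlier sketch changes only the estimate. The counts, total, and vocabulary size below are hypothetical values chosen for illustration:

```python
# Same hypothetical counts, now with add-one (Laplace) smoothing.
positive_counts = {"this": 10, "crib": 4, "is": 20}
total_positive_words = 100  # assumed total number of words in positive reviews
N = 50                      # assumed number of unique words in the dataset

def p_word_given_positive(word):
    # Add 1 to the count and N to the denominator.
    return (positive_counts.get(word, 0) + 1) / (total_positive_words + N)

review = ["this", "crib", "is", "groovy"]
p_review = 1.0
for word in review:
    p_review *= p_word_given_positive(word)

print(p_review)  # nonzero: "groovy" now contributes 1/(100 + 50) instead of 0
```

The unseen word "groovy" is assigned 1/(100 + 50) rather than 0, so it lowers the probability without annihilating it, and the seen words still dominate the comparison between classes.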