For my Capstone OK Cupid project I thought it would be interesting to see whether male sexual orientation correlates with body type and other lifestyle factors, including drinking, drugs and smoking.
But I don’t trust the results I am getting in return.
I am fairly certain I know why, but I don’t know what to do about it.
For example, my Multinomial Naive Bayes Classifier produces an accuracy score of 98%, with precision and the other scores copied and pasted below. Most are unexpectedly high, except those for bisexuals.
I doubt these results provide any real insights. Instead, they reflect the fact that 87% of the population I am trying to analyze identifies as “straight.” Roughly ten percent identify as “gay” and three percent as “bisexual.”
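As a sanity check on that point, the accuracy of a model that always predicts the largest class can be computed from the support column of the report at the end of this post. This is only a sketch: the mapping of labels 0, 1 and 2 to straight, bisexual and gay is an assumption inferred from the class sizes, and `majority_baseline_accuracy` is a hypothetical helper name.

```python
# Support counts copied from the classification report in this post.
# Assumption: 0 = straight, 1 = bisexual, 2 = gay (inferred from class sizes).
support = {"straight": 2661, "bisexual": 64, "gay": 303}


def majority_baseline_accuracy(counts):
    """Accuracy of a model that always predicts the most common class."""
    total = sum(counts.values())
    return max(counts.values()) / total


baseline = majority_baseline_accuracy(support)
print(f"Majority-class baseline accuracy: {baseline:.1%}")
```

Any accuracy figure is more informative when read against this baseline (roughly 88% here) than against zero.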
The multinomial logistic regression classifier results in 100% for all scores. When I search for K in the K Neighbors classifier, the output says the best K is zero. That means I am overfitting, further suggesting my training data is the problem. I am assuming that analyzing one category (straight) that dwarfs all the others by 72% is skewing the results.
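One common way to counter this kind of skew is to weight each class inversely to its frequency. The sketch below computes such weights from the support counts in the report at the end of this post, using the formula n_samples / (n_classes * class_count), which is the heuristic scikit-learn documents for `class_weight="balanced"`. The label mapping is again an assumption, and `balanced_class_weights` is a hypothetical helper, not something from the project code.

```python
# Support counts from the classification report in this post.
# Assumption: 0 = straight, 1 = bisexual, 2 = gay (inferred from class sizes).
counts = {"straight": 2661, "bisexual": 64, "gay": 303}


def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * n)
        for label, n in class_counts.items()
    }


weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.2f}")
```

With these weights, each class contributes the same total weight to training, so the rare classes are no longer drowned out. Estimators that accept a `class_weight` argument, such as scikit-learn's `LogisticRegression`, can apply this directly; whether it helps on this data set is something that would have to be checked on a held-out test split.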
Do you agree? What can I do about it?
Your feedback as to the problem and how to solve it would be much appreciated.
The scores from the Multinomial Naive Bayes Classifier mentioned above are pasted below.
Thanks again,
The accuracy of model on training data is: 98.0%
                  precision    recall  f1-score   support

               0       1.00      1.00      1.00      2661
               1       0.66      0.30      0.41        64
               2       0.87      0.97      0.91       303

        accuracy                           0.98      3028
       macro avg       0.84      0.75      0.77      3028
    weighted avg       0.98      0.98      0.98      3028