FAQ: Naive Bayes Classifier - Using scikit-learn

This community-built FAQ covers the “Using scikit-learn” exercise from the lesson “Naive Bayes Classifier”.

Paths and Courses
This exercise can be found in the following Codecademy content:

Data Science



What model should I use to assess the adequacy of my CV to 100 job descriptions?

I have my CV and a dataset of 100 job descriptions.
I want to know whether my CV is a good match for these job descriptions. Based on that, I hope to revise my CV and increase the number of views by employers.

If I use Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
I think that:
A - my_CV_text
B - vacancies_data
and beyond that, I am stuck.

I also thought about applying a Naive Bayes classifier, but a classifier requires labels to classify something as ‘bad’ or ‘good’, and my data can’t have labels. I don’t want to know whether my CV is ‘bad’ or ‘good’; I want to know the probability that the words in my CV occur in the dataset.

How can I do this?
Thank you.
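One label-free way to approach the question above is to skip the classifier and measure, for each word in the CV, the share of job descriptions that contain it. The following is only a minimal sketch in plain Python: the helper names (tokenize, document_frequency), the toy texts, and the 0.5 cutoff are all hypothetical stand-ins, not part of the exercise. scikit-learn's CountVectorizer could produce similar word counts.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z0-9+#]+", text.lower())

def document_frequency(documents):
    """Map each word to the fraction of documents that contain it at least once."""
    counts = Counter()
    for doc in documents:
        counts.update(set(tokenize(doc)))  # set() so each document counts a word once
    return {word: n / len(documents) for word, n in counts.items()}

# Hypothetical toy data standing in for the real CV and the 100 job descriptions.
job_descriptions = [
    "Python developer with SQL and machine learning experience",
    "Data analyst: SQL, Excel, Python",
    "Machine learning engineer, Python, statistics",
]
cv_text = "Experienced analyst skilled in Python and Excel"

df = document_frequency(job_descriptions)
cv_words = set(tokenize(cv_text))

# Share of job descriptions that mention each word used in the CV.
cv_coverage = {word: df.get(word, 0.0) for word in cv_words}

# Words that appear in at least half of the job descriptions (arbitrary cutoff)
# but are absent from the CV, most frequent first, ties broken alphabetically.
missing = sorted(
    (word for word, share in df.items() if share >= 0.5 and word not in cv_words),
    key=lambda w: (-df[w], w),
)

print(cv_coverage)
print(missing)
```

Words with low coverage may be candidates to drop from the CV, and the words in missing may be worth adding. In practice, common filler words such as "and" would likely need to be filtered out first.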

I tried to create my training_labels list with the following list comprehension:

training_labels = [0 if x <= 1000 else 1 for x in range(2000)]

However, I get an error saying that training_labels should be a list of 34079 0s followed by 34079 1s. These values don’t match the project. Any advice?

Step 2 requires a list of 1000 0s followed by 1000 1s (I don’t know why 34079 came up). But the list created by your code has 1001 0s (x = 0, …, 1000) followed by 999 1s (x = 1001, …, 1999).

You should simply multiply the list [0] by the number you want, multiply the list [1] by the same number, and concatenate the two results. We did that in the previous exercise.
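To make the off-by-one concrete, here is a small sketch comparing the original comprehension with the list-multiplication approach. The count of 1000 per class is taken from the reply above; the variable names attempt and n are only illustrative.

```python
# Original attempt: x <= 1000 is true for x = 0, ..., 1000, which is 1001 values.
attempt = [0 if x <= 1000 else 1 for x in range(2000)]
print(attempt.count(0), attempt.count(1))  # 1001 999

# List multiplication: n zeros followed by n ones.
n = 1000  # labels per class, as stated in the reply above
training_labels = [0] * n + [1] * n
print(training_labels.count(0), training_labels.count(1))  # 1000 1000
```

If the comprehension is preferred, changing the condition to x < 1000 should give the same result as the multiplied lists.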

Please let us download the pickle files so we can actually use the full dataset. I could only get the smaller dataset of 50 negative and 50 positive reviews. Training a model on only these 100 reviews produces a terrible scikit-learn model that performs worse than the pure-Python model.