This community-built FAQ covers the “Using scikit-learn” exercise from the lesson “Naive Bayes Classifier”.
FAQs on the exercise Using scikit-learn
What model should I use to assess how well my CV matches 100 job descriptions?
I have my CV and a dataset of 100 job descriptions.
I want to know whether my CV is a good fit for these job descriptions. Based on this information, I hope to revise my CV and increase the number of views by employers.
If I use Bayes’ theorem: P(A|B) = P(B|A) * P(A) / P(B)
I think that:
A - my_CV_text
B - vacancies_data
but beyond that, I am stuck.
I also thought of applying a Naive Bayes classifier, but:
A classifier requires labels to decide whether something is ‘bad’ or ‘good’, and my data doesn’t have labels. I don’t want to know whether my CV is ‘bad’ or ‘good’; I want to know the probability that the words in my CV occur in the dataset.
How can I do this?
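One possible starting point, offered only as a rough sketch and not as an answer from the course: since there are no labels, a classifier is not strictly needed to measure word overlap. The snippet below computes, for each distinct word in the CV, the fraction of job descriptions that contain it. The function names `tokenize` and `word_coverage` and the toy data are hypothetical, and the tokenizer is deliberately crude (lowercase alphabetic words only, no stop-word removal).

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of distinct alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def word_coverage(cv_text, job_descriptions):
    """Map each distinct CV word to the fraction of job descriptions containing it."""
    job_word_sets = [tokenize(job) for job in job_descriptions]
    total = len(job_word_sets)
    if total == 0:
        return {}  # nothing to compare against
    return {
        word: sum(word in words for words in job_word_sets) / total
        for word in tokenize(cv_text)
    }

# Hypothetical toy data standing in for the real CV and the 100 job descriptions.
my_cv_text = "Python developer with SQL and pandas experience"
vacancies_data = [
    "Looking for a Python developer with SQL skills",
    "Java developer needed",
    "Data analyst: Python, pandas, Excel",
    "Senior developer, Python and Django",
]

coverage = word_coverage(my_cv_text, vacancies_data)

# Print the CV words from most to least common across the job descriptions.
for word, share in sorted(coverage.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {share:.2f}")
```

Words with a share near 0 appear in the CV but rarely in the vacancies. Running the same idea in the other direction (vacancy words missing from the CV) may be more useful for deciding what to add.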
I tried to create my training_labels list with the following list comprehension:
training_labels = [0 if x <= 1000 else 1 for x in range(2000)]
however, I get an error saying that training_labels should be a list of 34079 0s followed by 34079 1s. These values don’t match the project. Any advice?
In step 2, a list that has 1000 0s followed by 1000 1s is required (I don’t know why 34079 came out). But the list created by your code would have 1001 0s (x = 0, …, 1000) followed by 999 1s (x = 1001, …, 1999).
You should simply multiply the list [0] by the number you want and the list [1] by the same number, then concatenate the two. We did that in the exercise we just did.
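To make the off-by-one concrete, here is a minimal sketch comparing the original comprehension with the list-multiplication approach. The variable names `buggy_labels`, `fixed_comprehension`, and `n` are just for illustration, and `n = 1000` is taken from the count mentioned for step 2 above, not from the exercise's checker.

```python
# Original attempt: x <= 1000 is true for x = 0..1000, which is 1001 values.
buggy_labels = [0 if x <= 1000 else 1 for x in range(2000)]

# List multiplication gives exactly n zeros followed by exactly n ones.
n = 1000  # count taken from the answer above
training_labels = [0] * n + [1] * n

# The comprehension also works once the boundary is x < n instead of x <= n.
fixed_comprehension = [0 if x < n else 1 for x in range(2 * n)]

print(buggy_labels.count(0), buggy_labels.count(1))        # 1001 999
print(training_labels.count(0), training_labels.count(1))  # 1000 1000
print(training_labels == fixed_comprehension)              # True
```

If the checker expects a different count, as with the 34079 in the error message, only `n` needs to change.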
Please let us download the pickle files so we can actually use the full dataset. I could only get the smaller dataset of 50 negative and 50 positive reviews. Training a model on only these 100 reviews leads to a terrible scikit-learn model that performs worse than the Python-only model.