I am working on the Book Recommender project in the Machine Learning Skill Path (Machine Learning II), where we use the Surprise library to build a model that predicts users' book ratings with the K-Nearest Neighbours algorithm.
I noticed a problem that occurs even in the solution code. Here is the code first, followed by an explanation of the problem:
import pandas as pd
import codecademylib3
book_ratings = pd.read_csv('goodreads_ratings.csv')
#1. Print dataset size and examine column data types
print(book_ratings.head())
print(book_ratings.info())
#2. Distribution of ratings
print(book_ratings.rating.value_counts())
#3. Filter ratings that are out of range (reviews: 1-5, so 0 has to be filtered out)
book_ratings = book_ratings[book_ratings.rating != 0]
print(book_ratings.rating.value_counts())
#4. Prepare data for Surprise: build a Surprise Reader object
from surprise import Reader
reader = Reader()
#5. Load `book_ratings` into a Surprise Dataset
from surprise import Dataset
rec_data = Dataset.load_from_df(book_ratings[['user_id',
                                              'book_id',
                                              'rating']],
                                reader)
#6. Create an 80:20 train-test split and set the random state to 7
from surprise.model_selection import train_test_split
trainset, testset = train_test_split(rec_data, test_size = 0.2, random_state = 7)
#7. Use KNNBasic from Surprise to train a collaborative filter
from surprise import KNNBasic
book_recommender = KNNBasic()
book_recommender.fit(trainset)
predictions = book_recommender.test(testset)
# Look at the first prediction on the testset:
print(predictions[0])
"""
Output:
user: 1210d65590e6916c7066139055a29116 item: 9867814 r_ui = 5.00 est = 3.83 {'was_impossible': True, 'reason': 'User and/or item is unknown.'}
"""
#8. Evaluate the recommender system on the testset
from surprise import accuracy
accuracy.rmse(predictions)
#9. Prediction for a user who gave "The Three-Body Problem" a rating of 5
print(book_recommender.predict('8842281e1d1347389f2ab93d60773d4d', '18007564').est)
# Output: 3.8250739644970415
print(book_recommender.predict(uid = "d089c9b670c0b0b339353aebbace46a1", iid = "7686667", r_ui = 3))
"""
Output:
user: d089c9b670c0b0b339353aebbace46a1 item: 7686667 r_ui = 3.00 est = 3.83 {'was_impossible': True, 'reason': 'User and/or item is unknown.'}
"""
The problem occurs in steps 7 and 9: in both outputs “was_impossible” is True and est is the same 3.83, seemingly because the test and predict functions do not recognize the user and/or item ids.
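As far as I understand, when a prediction is impossible the estimate falls back to a default value (I believe the mean of all training ratings), which would explain why est is 3.83 every time. Here is a toy sketch of that behaviour in plain Python. It is not Surprise's code: the data and the names train_ratings and toy_predict are made up for illustration.

```python
# Toy model of the fallback; hypothetical names, NOT Surprise's internals.
train_ratings = {
    ('u1', 'b1'): 5.0,
    ('u1', 'b2'): 3.0,
    ('u2', 'b1'): 4.0,
}
known_users = {uid for uid, _ in train_ratings}
known_items = {iid for _, iid in train_ratings}
global_mean = sum(train_ratings.values()) / len(train_ratings)


def toy_predict(uid, iid):
    """Return (estimate, details), loosely mimicking a Surprise prediction."""
    if uid not in known_users or iid not in known_items:
        # Unknown user or item: no neighbours to use, so fall back to the mean.
        return global_mean, {'was_impossible': True,
                             'reason': 'User and/or item is unknown.'}
    # A real KNN estimate would be computed here; the toy just reuses the
    # stored rating if there is one, else the mean.
    return train_ratings.get((uid, iid), global_mean), {'was_impossible': False}


print(toy_predict('u9', 'b1'))  # unknown user: estimate is the global mean
print(toy_predict('u1', 'b1'))  # known user and item
```

If this is right, an est equal to the overall mean together with was_impossible = True means the model never actually used any neighbours for that prediction.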
The solution code gives the same output, and I don’t understand why this hasn’t been taken care of. After some googling, I found that the cause could be the ids passed in: they should be raw ids, but they seem to be handled as inner ids here in the code. However, I can’t figure out how to fix this without changing the code too much.
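One thing I suspect but have not verified: if pandas parsed book_id as integers, then the raw item ids are ints, while I pass strings such as '18007564' to predict. The sketch below shows how such a type mismatch would make a lookup fail. The dictionary is only a stand-in for a raw-id to inner-id mapping; raw_to_inner_iid, to_inner_iid and is_known are names I made up here, not Surprise's implementation.

```python
# Toy stand-in for a raw-id -> inner-id mapping; NOT Surprise's actual code.
# Assumption: book_id was read from the CSV as int, so the raw ids are ints.
raw_to_inner_iid = {18007564: 0, 7686667: 1}


def to_inner_iid(raw_iid):
    """Return the inner id for a raw item id, or raise if it is unknown."""
    try:
        return raw_to_inner_iid[raw_iid]
    except KeyError:
        raise ValueError(f"Item {raw_iid!r} is not part of the trainset.")


def is_known(raw_iid):
    """True if the raw id can be mapped to an inner id."""
    try:
        to_inner_iid(raw_iid)
        return True
    except ValueError:
        return False


print(is_known(18007564))    # int raw id matches the mapping
print(is_known('18007564'))  # same digits as a str do not match
```

If that is what happens, checking book_ratings.dtypes and passing the ids to predict in the same type as in the DataFrame might be enough, but I am not sure. It also would not explain the first test prediction if that user or book simply never appears in the trainset.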
Can someone help?