Project: Build a Book Recommender System ML

Link to project: https://www.codecademy.com/journeys/data-scientist-ml/paths/dsmlcj-22-machine-learning-ii/tracks/dsmlcj-22-supervised-learning-ii-sv-ms-rm-nb/modules/learn-recommender-systems-df367a68-d49a-4b23-8e9e-796b838b4772/projects/build-a-book-recommender-system

So I am doing the Book Recommender project on the Machine Learning Skill Path (Machine Learning II), where we use the Surprise library to build a model that predicts users’ book ratings with the K-Nearest Neighbours algorithm.

I noticed a problem that even occurs in the solution code. Let me give you the code first and then explain the problem:

import pandas as pd
import codecademylib3

book_ratings = pd.read_csv('goodreads_ratings.csv')

#1. Print dataset size and examine column data types
print(book_ratings.head())
print(book_ratings.info())

#2. Distribution of ratings
print(book_ratings.rating.value_counts())

#3. Filter ratings that are out of range (reviews: 1-5, so 0 has to be filtered out)
book_ratings = book_ratings[book_ratings.rating != 0]
print(book_ratings.rating.value_counts())

#4. Prepare data for Surprise: build a Surprise Reader object

from surprise import Reader
reader = Reader()

#5. Load `book_ratings` into a Surprise Dataset

from surprise import Dataset
rec_data = Dataset.load_from_df(book_ratings[['user_id',
                                              'book_id',
                                              'rating']],
                                reader)

#6. Create an 80:20 train-test split and set the random state to 7
from surprise.model_selection import train_test_split

trainset, testset = train_test_split(rec_data, test_size = 0.2, random_state = 7)

#7. Use KNNBasic from Surprise to train a collaborative filter

from surprise import KNNBasic
book_recommender = KNNBasic()

book_recommender.fit(trainset)

predictions = book_recommender.test(testset)

# looking at prediction for first user in testset: 
print(predictions[0])
"""
Output:
user: 1210d65590e6916c7066139055a29116 item: 9867814    r_ui = 5.00   est = 3.83   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}
"""

#8. Evaluate the recommender system on the testset
from surprise import accuracy
accuracy.rmse(predictions) 

#9. Prediction for a user who gave "The Three-Body Problem" a rating of 5
print(book_recommender.predict('8842281e1d1347389f2ab93d60773d4d', '18007564').est)
# Output: 3.8250739644970415

print(book_recommender.predict(uid = "d089c9b670c0b0b339353aebbace46a1", iid = "7686667", r_ui = 3))
"""
Output:
user: d089c9b670c0b0b339353aebbace46a1 item: 7686667    r_ui = 3.00   est = 3.83   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}
"""

So the problem occurs in exercises 7 and 9: if we look at the output, the value of “was_impossible” is True, seemingly because the predict and test functions do not recognize the user and item ids.

The solution code gives the same output, and I don’t understand why this hasn’t been taken care of. I googled a little and found that the cause could be the id inputs: they should be raw ids, but it seems they are being treated as inner ids here in the code. But I can’t figure out how to fix this problem without changing the code too much.
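
For reference, would something like this be the right way to check whether the raw ids from exercise 9 exist in the trainset at all? As far as I can tell, to_inner_uid / to_inner_iid raise a ValueError for ids that never made it into the trainset:

# Check whether a raw user id and raw item id are known to the trainset
# (if either is unknown, Surprise can only fall back to a default estimate).
def known_in_trainset(trainset, raw_uid, raw_iid):
    try:
        trainset.to_inner_uid(raw_uid)
        user_known = True
    except ValueError:
        user_known = False
    try:
        trainset.to_inner_iid(raw_iid)
        item_known = True
    except ValueError:
        item_known = False
    return user_known, item_known

print(known_in_trainset(trainset, '8842281e1d1347389f2ab93d60773d4d', '18007564'))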

Can someone help?

I have similar issues with this project, and an explanation from an experienced coder would be appreciated.

I don’t think it’s an issue with the predict and test functions not recognising the user ids and item ids, as Surprise does map raw ids to inner ids. I’m thinking more along the lines that a KNN-based algorithm is not really suitable for this dataset.

I’ve added my thoughts below, which may help you figure things out.

  1. When making predictions for an individual user, I believe the book id should be an integer (not a string) as per the pandas dataframe.

  2. The prediction / est of 3.83 is the global mean of the trainset (trainset.global_mean), so this is probably the default return value used whenever “was_impossible” is True. A quick check of points 1 and 2 is sketched after this list.

  3. As this is based on a KNN algorithm, any user who has rated only one book, and who is the only one to have rated that book, would not have any nearest neighbours to make predictions from. This would possibly explain ‘was_impossible’: True, ‘reason’: ‘Not enough neighbors.’

  4. If a book was in the testset but not the trainset, then there would be no mapping for it, which could explain ‘was_impossible’: True, ‘reason’: ‘User and/or item is unknown.’
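
Here is that quick check of points 1 and 2, reusing the variable names from your code above:

# Point 1: book_id is stored as an integer column in the dataframe,
# so passing the id as a string will never match a raw item id.
print(book_ratings['book_id'].dtype)

# Point 2: the fallback estimate of an impossible prediction should equal
# the trainset's global mean (roughly 3.83 on this split).
print(trainset.global_mean)
print(predictions[0].est)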

Perhaps a Matrix Factorization algorithm would be more suitable.
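
For example, Surprise’s SVD could be swapped in with only a couple of lines changed (a sketch, I haven’t benchmarked it on this dataset):

from surprise import SVD, accuracy

# Matrix factorization learns latent factors for every user and item in the
# trainset instead of relying on overlapping ratings between explicit neighbours.
svd = SVD(random_state=7)
svd.fit(trainset)
svd_predictions = svd.test(testset)
accuracy.rmse(svd_predictions)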

Thank you for your thoughts.

I absolutely agree that someone professional needs to explain this issue. Sometimes I find it frustrating that there is no way to contact someone experienced directly when such issues occur, because here in the forum you often won’t get a sufficient answer.

I think your thoughts regarding the KNN issue make sense: looking at the value counts of the different user and book ids, a lot of ids appear only once, so the probability of a specific user or book having no connection with the rest of the dataset could be high.
However, if we think of a specific row in the dataframe as a point in a coordinate system, wouldn’t there always be a nearest neighbour to that datapoint somewhere?
I think my main problem is that I do not quite understand what the load_from_df function does to the pandas dataframe and how exactly KNNBasic() uses the KNN algorithm.

Can you explain this to me?

I don’t think the load_from_df function does anything special to the pandas dataframe. Whilst I can’t find any method to output it, the Surprise docs show an example which looks like a standard df.

The KNNBasic algorithm is user-user based by default, and the default minimum number of neighbours for aggregation is 1. If there are not enough neighbours (i.e. no neighbours at all), the prediction is set to the global mean - hence the 3.83 for almost all predictions.
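
Written out explicitly, those defaults look like this (as I read the Surprise docs, this should behave exactly like KNNBasic() with no arguments):

from surprise import KNNBasic

# Surprise's documented defaults, spelled out: up to 40 neighbours,
# at least 1 neighbour required for aggregation, MSD similarity,
# and user-user (rather than item-item) similarities.
book_recommender = KNNBasic(
    k=40,
    min_k=1,
    sim_options={'name': 'msd', 'user_based': True},
)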

Using KNN algorithm assumptions as I understand them, the users would be the rows & each unique book_id would become the columns or features.

The algorithm would calculate the distance between ratings for every other occurrence of the same book_id / feature in order to determine the nearest neighbours for a particular user. Therefore user-user neighbours only exist when other users have rated the same book or books as the user we are making predictions for.

Even if the algorithm finds nearest neighbours for a user, it can only predict / recommend based on what those neighbours have rated.
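
To picture that "users as rows, books as columns" structure, you could pivot the original dataframe (just a sketch, it isn’t part of the project):

# One row per user, one column per book_id, NaN wherever a user never rated
# that book - this is roughly the matrix a user-based KNN reasons over.
ratings_matrix = book_ratings.pivot_table(index='user_id',
                                          columns='book_id',
                                          values='rating')
print(ratings_matrix.shape)
print(ratings_matrix.head())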

I tested this by making predictions using the following:

user_id : 7492e21ed92c8daa5e4fa2ac651b06da & cf27371864d743514657a1c229c7c80b
book_id : 18007564 & 20256737

Both users have rated 18007564 so neighbours exist. However, the only recommendation that can be made from this is for book_id 20256737 (the only other book rated by the nearest neighbour).
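
For anyone who wants to reproduce that check, this is roughly what I ran (get_neighbors expects inner ids, so the raw id has to be converted first, and it assumes the user ended up in the trainset):

# Which users rated book 18007564? (book_id is an integer in the dataframe)
print(book_ratings[book_ratings.book_id == 18007564])

# The nearest neighbour Surprise picked for the first of those users.
inner_uid = trainset.to_inner_uid('7492e21ed92c8daa5e4fa2ac651b06da')
neighbour_ids = book_recommender.get_neighbors(inner_uid, k=1)
print([trainset.to_raw_uid(iuid) for iuid in neighbour_ids])

# The only other book rated by that neighbour, and hence the only thing
# that can really be recommended from it:
print(book_recommender.predict('7492e21ed92c8daa5e4fa2ac651b06da', 20256737))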

Looking at the base data for this project, it seems far too sparsely populated to be of any real use.

There are 2,259 users who rated only one book each - even if that book was rated by more than one user and the user is therefore someone’s nearest neighbour, you can’t recommend anything else from a user who rated only one book.
There are 2,249 books with only one user rating - therefore there are no item-item neighbours for these books.
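
Those counts can be checked directly from the dataframe:

# Users who rated exactly one book, and books rated by exactly one user.
print((book_ratings.user_id.value_counts() == 1).sum())   # 2,259 users
print((book_ratings.book_id.value_counts() == 1).sum())   # 2,249 books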

I am only a fellow student; this is just how I see things.

Thank you very much. If things work as you say, your explanation of why KNNBasic can’t find the item or user ID seems very likely.