Model-Based Collaborative Filtering (CF) for a Recommendation System Using SGD
Please refer to the GitHub repository for the complete code. In this repository, I have prepared a self-paced case study of the Netflix Movie Recommendation System.
Predicting what content a user wants today is important for many sites on the internet that want to stay competitive. Sites such as Netflix, Amazon and YouTube have developed highly sophisticated systems for recommending new and relevant content to users. These systems are among the most valued assets of these companies, as demonstrated by the Netflix-sponsored competition with a one-million-dollar prize for improving on their system. These types of systems are collectively known as recommender systems.
Recommender systems can take multiple different approaches to achieve comparable results. An important task of a recommender system is to predict how a user might like an item based on the user's and other users' past behavior. The two most common techniques deployed are content-based and collaborative filtering, and both have their pros and cons. Content-based filtering compares item attributes and makes recommendations by finding items that are similar to what a user previously liked. Collaborative filtering, on the other hand, tries to find users that have similar taste and predicts based on how these users interacted with different items.
Model-based collaborative filtering uses machine learning algorithms to create a model based on training data and then uses the model to make predictions. This blog will mainly focus on an implementation of matrix factorization using stochastic gradient descent (SGD).
The cold-start problem is often encountered with new users: since collaborative filtering depends on a user's past behavior and actions, predicting the expected rating for a user with no history is an even bigger problem.
Content-based filtering
Content-based filtering (CB) is a method for recommender systems where the primary source of information is the content known about the items in the system. This can be tags or inherent attributes such as size, length, color, name and so on. In the context of movies, content could be meta information about the movie such as director, actors, genre and release date.
CB recommender systems typically make recommendations by finding items that are similar to ones the user has liked in the past. Predictions are typically made by building a profile of the user's preferences based on the content of the items that the user previously liked, rated, or interacted with. New items are compared to the user's profile and given a relevancy score for the user. While having an accurate user profile can be very effective, this approach has limitations when the items' content does not have a sufficient number of features.
Because these systems will only recommend items that are similar to the ones already rated, a user can miss out on items that they may like even though those items are not very similar to what they have interacted with before. Items that are wildly different may still be useful together in ways not directly linked to their features. Cold start is a problem here too, but not as big as with collaborative filtering, since the system can make predictions regardless of how many users have used the system.
Matrix Factorization
Matrix factorization (MF) is a technique which computes a latent factor model of a system based on user-item interactions. These user-item interactions can be represented as a matrix with users on one axis and items on the other. In most movie recommender systems the interactions are ratings given by users to movies, but they can be other data as well, such as implicit feedback, temporal effects, and/or confidence levels. This rating matrix is typically very sparse in real-world applications, as users usually rate only a fraction of all the movies in the system. MF has been shown to make very good predictions even on very sparse matrices. The matrix factorization approach reduces the dimensions of the rating matrix r by factorizing it into a product of two latent factor matrices, p for the users and q for the movies.
Here f is the number of latent features extracted, u the number of users, and i the number of items (in our case, movies), so p is a u x f matrix and q is an i x f matrix. Each row p_u is a vector of features for a user u, and each row q_i is a vector of features for an item i. The product of these vectors gives an estimate of the original rating: r̂_ui = p_u·q_iᵀ.
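As a minimal sketch of this idea (the sizes u, i, and f below are illustrative, and the matrices are random placeholders rather than learned factors):

```python
import numpy as np

# Illustrative sizes: u users, i movies, f latent features
u, i, f = 4, 5, 2

rng = np.random.default_rng(0)
p = rng.random((u, f))   # user latent factor matrix (u x f)
q = rng.random((i, f))   # movie latent factor matrix (i x f)

# Estimated rating matrix: entry (u, i) is the dot product p_u . q_i
r_hat = p @ q.T          # shape (u, i)

# Estimate of the rating of user 0 for movie 3 (two equivalent forms)
print(r_hat[0, 3], p[0] @ q[3])
```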
There are multiple ways of factorizing a matrix into components, used in many areas of machine learning and statistics, but most methods do not work when there are missing values in the matrix. If the factorization could be done despite the missing values, not only would the observed values be estimated, but all the missing values would be predicted as well. One approach is to impute the missing values, but doing so could distort the observed data due to the sparseness of the original matrix.
Matrix factorization with bias
To improve the predictions, we add a bias for movie i, called b_i, a bias for user u, called b_u, and the global rating average µ, alongside the latent factors q and p, and model the rating r_ui as: r̂_ui = µ + b_u + b_i + p_u·q_iᵀ.
Here the rating is broken into components where the movie and user biases represent the deviations of the movie and the user from the global average. Thus p_u.dot(qT_i) captures only the interaction between the movie and the user, e.g. things the movie and user have in common (the collaborative part).
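A minimal sketch of this biased prediction, assuming µ, the bias arrays, and the latent factors have already been estimated (all values below are illustrative placeholders):

```python
import numpy as np

def predict_rating(u, i, mu, b_u, b_i, p, q):
    # Global average + user deviation + movie deviation + collaborative interaction
    return mu + b_u[u] + b_i[i] + p[u] @ q[i]

# Illustrative values
rng = np.random.default_rng(0)
mu = 3.6                              # global rating average
b_u = np.array([0.2, -0.1])           # user biases
b_i = np.array([0.3, 0.0, -0.4])      # movie biases
p = rng.random((2, 5))                # user latent factors
q = rng.random((3, 5))                # movie latent factors

print(predict_rating(0, 2, mu, b_u, b_i, p, q))
```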
SGD formula
Here, in this case, we will ignore the user intercept. We will use the standard SGD update formula to learn the bias terms, which are then used to compute the predicted ratings. We can then find the mean squared error for every epoch.
NOTE: ‘w’ corresponds to a bias term here.
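In its general form, the SGD update for a weight w with loss L and learning rate η is: w_new = w_old - η·∂L/∂w. With a squared-error loss and a regularization strength λ (both η and λ are hyperparameters we choose), the per-rating update for a bias term w works out to: w ← w + η·(e_ui - λ·w), where e_ui = r_ui - r̂_ui is the prediction error.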
In this blog, we will decompose the rating matrix with randomized singular value decomposition into three factors, U, Sigma, and VT, and then apply SGD to calculate the bias values in an iterative manner.
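A minimal sketch of these two steps, assuming the ratings are available as (user, movie, rating) triples; the column names, the sample values, and the n_components setting below are illustrative:

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.utils.extmath import randomized_svd

# Illustrative ratings data: (user, movie, rating) triples
ratings = pd.DataFrame({
    "user":   [0, 0, 1, 2, 2],
    "movie":  [0, 2, 1, 0, 2],
    "rating": [4, 5, 3, 2, 4],
})

# Sparse user-movie rating matrix; unrated cells stay implicit zeros
sparse_matrix = csr_matrix(
    (ratings["rating"], (ratings["user"], ratings["movie"]))
)

# Randomized SVD: factor the matrix into U, Sigma, and VT
U, Sigma, VT = randomized_svd(sparse_matrix, n_components=2, random_state=0)
print(U.shape, Sigma.shape, VT.shape)
```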
In the code above, we created the sparse rating matrix from the dataset and then factored it with randomized SVD.
Algorithm to calculate predicted ratings
We calculate the predicted ratings for each movie and then the MSE value. This runs for every epoch, which will be helpful for plotting.
We iterate over each data point to predict the ratings, as sketched below.
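Since the complete code cannot be shared, here is a minimal sketch of this loop under stated assumptions: U, Sigma, and VT come from the randomized SVD above, mu is the mean of the observed ratings, and the learning rate, regularization, and epoch count are illustrative placeholders. The sketch updates both bias terms; drop the user-bias update if the user intercept is ignored, as noted earlier.

```python
import numpy as np

def train_biases_sgd(rows, cols, vals, U, Sigma, VT, mu,
                     epochs=50, lr=0.01, reg=0.1):
    """SGD over observed ratings; learns bias terms and tracks MSE per epoch."""
    b_u = np.zeros(U.shape[0])               # user bias, one per user
    b_i = np.zeros(VT.shape[1])              # movie bias, one per movie
    item_factors = (np.diag(Sigma) @ VT).T   # latent vector per movie

    mse_per_epoch = []
    for _ in range(epochs):
        sq_err = 0.0
        # One SGD step per observed (user, movie, rating) triple
        for u, i, r in zip(rows, cols, vals):
            pred = mu + b_u[u] + b_i[i] + U[u] @ item_factors[i]
            err = r - pred
            b_u[u] += lr * (err - reg * b_u[u])   # user-bias update
            b_i[i] += lr * (err - reg * b_i[i])   # movie-bias update
            sq_err += err ** 2
        mse_per_epoch.append(sq_err / len(vals))
    return b_u, b_i, mse_per_epoch
```

Here rows, cols, and vals can be obtained from the sparse matrix, e.g. coo = sparse_matrix.tocoo(), then rows, cols, vals = coo.row, coo.col, coo.data, with mu = vals.mean().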
Why are we using epochs?
Whenever we optimize with gradient descent, we focus on reaching convergence, i.e. a local minimum. Since GD and SGD have a parameter called the learning rate, convergence is not guaranteed in a single iteration. Hence we reach a local minimum only after iterating over the training data again and again in batches. One full pass over the training data is referred to as an epoch.
As per the privacy policy and agreement, I am unable to share the complete code. But here is the output of the code (using the above algorithm), which is helpful for finding the local minimum (kindly refer to the plot below).
Now, using matplotlib, let us plot the MSE for each epoch.
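A minimal plotting sketch, assuming mse_per_epoch comes from the training loop sketched earlier:

```python
import matplotlib.pyplot as plt

# mse_per_epoch is assumed to come from the training loop above
plt.plot(range(1, len(mse_per_epoch) + 1), mse_per_epoch, marker="o")
plt.xlabel("Epoch")
plt.ylabel("MSE")
plt.title("MSE per epoch")
plt.show()
```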
As we can see, after 50 epochs the MSE curve levels off to a nearly flat line, which suggests the predicted ratings at epoch 50 can be taken as the ones to move forward with.
Now we can compute a confusion matrix to validate the percentage of misclassified points, which will be helpful if we need to reduce the error further with a different matrix factorization algorithm.
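One way to do this, sketched below under the assumption that predicted ratings are rounded to the nearest star so they can be treated as classes (the y_true and y_pred_raw values are illustrative):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative actual vs. predicted ratings
y_true = np.array([4, 5, 3, 2, 4])
y_pred_raw = np.array([3.8, 4.6, 3.4, 2.2, 3.1])

# Round and clip predictions to the 1-5 star scale to treat them as classes
y_pred = np.clip(np.rint(y_pred_raw), 1, 5).astype(int)

cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4, 5])
misclassified = 1 - np.trace(cm) / cm.sum()
print(cm)
print(f"Misclassified: {misclassified:.1%}")
```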
Conclusion
Consider a real-time recommendation system where we need to predict movie ratings over a huge set of movies, say 20k or more. Just imagine how many movies a user can watch in a span of 10–20 days, or even a month. At most 50?
Hence most of the cells in the user-movie rating matrix will have missing values. In this case, stochastic gradient descent is useful for minimizing the cost of the computation because it does not depend on the missing or null values: it works by looping through each rating in the training data, trying to predict the rating, and calculating a prediction error.
Thanks for your valuable time reading this article.
If you like this article, please like, share, and follow my channel, as it will encourage me to post a variety of concepts and problems on ML and DL…