Alternating Least Squares Method for Collaborative Filtering

Recommender Systems

An Informal Definition

Recommender systems are a family of methods that filter a large information space, using the user's observations so far, in order to provide recommendations in the part of the information space the user has not yet observed. Here, the information space is the set of all available items the user could choose or select, and the observation space is what the user has experienced or observed so far.

Why is it necessary?

We have more options and choices than we used to have, and the number of options will only grow in the near future. What to eat, which movie to watch, what book to read are questions we find ourselves answering all the time. If you consider just the information space that holds the answers to these questions, let alone answering the questions in the first place, you could find yourself in an immense decision domain, which may cripple your decision making.

Traditionally, these questions are answered with peer recommendations (word of mouth, forums, blog posts or reviews) or expert advice (a columnist, a librarian, or the recommendation of someone with domain expertise). Traditional methods are good but limited by the observation spaces the recommenders have to begin with. Your peers can only read so many books, visit so many restaurants, watch so many movies. Second, they are (naturally) biased towards their own preferences, whereas when you make a decision, you want to be biased towards yourself in order to maximize the outcome. Third, a person may simply not have access to these traditional methods: she may not have peers who share roughly the same taste in music and movies, and she may not have access to expert advice.

Due to these shortcomings, computer-based recommender systems provide a much better alternative for the user. Not only do they avoid the shortcomings of the traditional methods, they can also mine the user's history and demographic information, which may yield recommendations that are more accurate and finely tuned than what traditional methods could offer.

Industry Usage

In industry, the most widely known recommender system is probably Amazon's. Using purchase history, browsing history and user history, they provide recommendations to the user across a wide selection of goods and products. Currently, many companies that sell a selection of products and have access to user information employ a recommender system that promotes further purchases to the user.

Richness of the Ecosystem

Different business and product needs, together with the variety of algorithms that could be applied, have yielded a rich set of methods that can be used for recommendations.

Connection to Information Retrieval

This subset of machine learning methods also has connections with information retrieval, and the problem can actually be formulated as an information retrieval problem as well. Consider Google: the links (items) brought to the first page are chosen for the user based on query information, location, user history and so on. Therefore, most of the algorithms invented in information retrieval could be adapted to recommendation systems with minimal changes. The reverse may not hold true in general, though.

In this post, I will focus mainly on Collaborative Filtering within this set of algorithms.

Collaborative Filtering

Definition

Collaborative Filtering (CF) is a subset of algorithms that exploits other users and items, along with their ratings (selection or purchase information could also be used) and the target user's history, to recommend an item that the target user has no ratings for. The fundamental assumption behind this approach is that other users' preferences over the items can be used to recommend an item to a user who has not seen or purchased it before. CF differs from content-based methods in that the content of the user or the item itself plays no role in the recommendation; what matters is how (rating) and by which users (user) a particular item was rated. (The preferences of users are shared through a group of users who selected, purchased or rated items similarly.)

Intuition

  • Personal tastes correlate within a given domain and information space.
  • Users who agreed in the past are more likely to agree in the future, including on some particular item.
  • To approximate the rating of a user, use the users who have a somewhat similar taste.
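The intuition above can be sketched numerically. Below is a minimal, hypothetical example (not part of the pipeline in this post) that measures user agreement with cosine similarity and predicts a missing rating as a similarity-weighted average of neighbors' ratings:

```python
import numpy as np

# Toy ratings matrix: rows are users, columns are items, 0 means unrated.
# Users 0 and 1 rated the first two items similarly; user 2 did not.
Q = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine_sim(a, b):
    """Cosine similarity between two rating vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

s01 = cosine_sim(Q[0], Q[1])  # users with correlated tastes
s02 = cosine_sim(Q[0], Q[2])  # users with diverging tastes
assert s01 > s02

# Predict user 0's rating for item 2 (unrated) from users 1 and 2,
# weighting each neighbor's rating by its similarity to user 0.
sims = np.array([s01, s02])
prediction = np.dot(sims, Q[1:, 2]) / sims.sum()
# The prediction stays close to user 1's low rating for item 2,
# since user 1's taste is the closer match.
```

The ALS method later in this post replaces this neighborhood scheme with learned latent factors, but the underlying assumption is the same: correlated tastes predict unseen ratings.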

Advantages over content-based methods

  • No need to know the item content
    • The content of the item does not necessarily tell the whole story. Which features (length, category, actors/actresses) of a movie are important for a particular user? For CF, no such information is necessary.
  • The "item cold-start" problem is avoided
    • When item information is not available, we do not need it, as long as some users have purchased the item.
  • User interest may change over time.
    • Focusing solely on content does not provide any flexibility on the user side. We need to take changes in user preference into consideration as well.
  • Explainability
    • Rather than depending on the content of the item, you can say plausible things to the user such as "You will like this product because other users who bought the product you just bought also bought this particular product", which gives an explanation and support for the recommendation.
  • Can capture subtle things.
    • This is generally true for latent factor models. If most users are buying two seemingly unrelated items, then another user who shares their taste is likely to buy the second item as well, which brings serendipity to the table.

Disadvantages over content-based methods

  • Sparsity of user preferences and scalability of computation
    • The sparsity of user preferences makes it hard to find users who share a similar taste.
    • Large sparse matrices make computation hard in general.
  • Privacy Concerns
    • Some users may not want their preferences and taste to be available on the website, even in an implicit form.
  • Synonyms need to be handled properly and normalized
    • "Film" and "movie" refer to the same thing; if one user buys "comedy" and another buys "funny movies", they should be treated as the same in order to improve recommendations and not separate things that inherently belong together.
  • Gray Sheep
    • There may be people whose taste and preferences do not consistently agree with those of other users. Recommendations for these users may not make sense, as their taste is quite distinct from the others'.
  • Shilling Attack
    • People who like one brand may consistently give high scores to that brand and poor scores to its competitors, no matter what their real experience is.

I will introduce the Alternating Least Squares method (original paper (pdf)) and evaluate it using the MovieLens dataset.

Alternating Least Squares Formulation for Recommender Systems

We have users \(u\) for items \(i\) matrix as in the following:
\[ Q_{ui} = \cases{ r & \text{if user u rate item i} \cr 0 & \text{if user u did not rate item i} } \] where \(r\) is what rating values can be. If we have \(m\) users and \(n\) items, then we want to learn a matrix of factors which represent movies. That is, the factor vector for each movie and that would be how we represent the movie in the feature space. Note that, we do not have any knowledge of the category of the movie at this point. We also want to learn a factor vector for each user in a similar way how we represent the movie. Factor matrix for movies \(Y \in \mathbb{R}^{fxn}\) and factor matrix(each movie is a column vector) for users \(X \in \mathbb{R}^{mxf}\)(each user is a row vector). However, we have two unknown variables. Therefore, we will adopt an alternating least squares approach with regularization. By doing so, we first estimate \(Y\) using \(X\) and estimate \(X\) by using \(Y\). After enough number of iterations, we are aiming to reach a convergence point where either the matrices \(X\) and \(Y\) are no longer changing or the change is quite small. However, there is a small problem in the data. We have neither user full data nor full items data, (suprisingly) this is also why we are trying to build the recommendation engine in the first place. Therefore, we may want to penalize the movies that do not have ratings in the update rule. By doing so, we will depend on only the movies that have ratings from the users and do not make any assumption around the movies that are not rated in the recommendation. Let's call this weight matrix \(w_{ui}\) as such: \[ w_{ui} = \cases{ 0 &\text{if } q_{ui} = 0 \cr 1 & \text{ else} } \] Then, cost functions that we are trying to minimize is in the following: \[ J(x_u) = (q_u - x_u Y) W_u (q_u - x_u Y)^T + \lambda x_u x_u^T \] \[ J(y_i) = (q_i - X y_i) W_i (q_i - X y_i)^T + \lambda y_i y_i^T \] Note that we need regularization terms in order to avoid the overfitting the data. 
Ideally, regularization parameters need to be tuned using cross-validation in the dataset for algorithm to generalize better. In this post, I will use the whole dataset. Solutions for factor vectors are given as follows: \[ x_u = (Y W_u Y^T + \lambda I)^{-1} Y W_u q_u \] \[ y_i = (X^T Wi X + \lambda I)^{-1} X^T W_i q_i \] where \(W_u \in \mathbb{R}^{nxn}\) and \(W_u \in \mathbb{R}^{mxm}\) diagonal matrices. The algorithm is pretty much of it. In the regulaization, we may want to incorporate both factor matrices in the update rules as well if we want to be more restrictive. That may generalize better, though.
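For completeness, here is the missing step: each closed-form update follows from setting the gradient of the corresponding cost function to zero. For \(x_u\) (the derivation for \(y_i\) is symmetric):
\[ \frac{\partial J(x_u)}{\partial x_u} = -2 (q_u - x_u Y) W_u Y^T + 2 \lambda x_u = 0 \]
\[ x_u (Y W_u Y^T + \lambda I) = q_u W_u Y^T \]
Since \(W_u\) is diagonal and hence symmetric, transposing both sides recovers the stated solution, with \(x_u\) and \(q_u\) read as column vectors.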

The dataset is from MovieLens; it contains 10,000,054 ratings and 95,580 tags applied to 10,681 movies by 71,567 users.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

tag_headers = ['user_id', 'movie_id', 'tag', 'timestamp']
tags = pd.read_table('data/tags.dat', sep='::', engine='python',
                     header=None, names=tag_headers)

rating_headers = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('data/ratings.dat', sep='::', engine='python',
                        header=None, names=rating_headers)

movie_headers = ['movie_id', 'title', 'genres']
movies = pd.read_table('data/movies.dat', sep='::', engine='python',
                       header=None, names=movie_headers)
movie_titles = movies.title.tolist()
In [2]:
movies.head()
Out[2]:
movie_id title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy

5 rows × 3 columns

In [3]:
ratings.head()
Out[3]:
user_id movie_id rating timestamp
0 1 122 5 838985046
1 1 185 5 838983525
2 1 231 5 838983392
3 1 292 5 838983421
4 1 316 5 838983392

5 rows × 4 columns

In [4]:
tags.head()
Out[4]:
user_id movie_id tag timestamp
0 15 4973 excellent! 1215184630
1 20 1747 politics 1188263867
2 20 1747 satire 1188263867
3 20 2424 chick flick 212 1188263835
4 20 2424 hanks 1188263835

5 rows × 4 columns

In [5]:
df = movies.join(ratings, on=['movie_id'], rsuffix='_r').join(tags, on=['movie_id'], rsuffix='_t')
del df['movie_id_r']
del df['user_id_t']
del df['movie_id_t']
del df['timestamp_t']

Joined the tables into one dataframe and then removed the redundant columns.

In [6]:
df.head()
Out[6]:
movie_id title genres user_id rating timestamp tag
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 1 5 838983525 politics
1 2 Jumanji (1995) Adventure|Children|Fantasy 1 5 838983392 satire
2 3 Grumpier Old Men (1995) Comedy|Romance 1 5 838983421 chick flick 212
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance 1 5 838983392 hanks
4 5 Father of the Bride Part II (1995) Comedy 1 5 838983392 ryan

5 rows × 7 columns

In [7]:
rp = df.pivot_table(index='user_id', columns='movie_id', values='rating')
rp.head()
Out[7]:
movie_id 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
user_id
1 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 ...
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...

5 rows × 10681 columns

In [8]:
rp = rp.fillna(0)  # Replace NaN with 0
rp.head()
Out[8]:
movie_id 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
user_id
1 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 ...
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...

5 rows × 10681 columns

In [9]:
Q = rp.values
In [10]:
Q
Out[10]:
array([[ 5.,  5.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  3.,  4.,  4.]])
In [11]:
Q.shape
Out[11]:
(303, 10681)

We have 303 users and 10681 items(movies) in our \(Q\) matrix.

In [12]:
# 1.0 where rated, 0.0 otherwise; float64 to be consistent with our Q matrix
W = (Q > 0.5).astype(np.float64)
In [13]:
W
Out[13]:
array([[ 1.,  1.,  1., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  1.,  1.,  1.]])

Parameters

In [14]:
lambda_ = 0.1
n_factors = 100
m, n = Q.shape
n_iterations = 20
In [15]:
X = 5 * np.random.rand(m, n_factors) 
Y = 5 * np.random.rand(n_factors, n)
In [16]:
def get_error(Q, X, Y, W):
    return np.sum((W * (Q - np.dot(X, Y)))**2)
In [17]:
errors = []
for ii in range(n_iterations):
    X = np.linalg.solve(np.dot(Y, Y.T) + lambda_ * np.eye(n_factors), 
                        np.dot(Y, Q.T)).T
    Y = np.linalg.solve(np.dot(X.T, X) + lambda_ * np.eye(n_factors),
                        np.dot(X.T, Q))
    if ii % 100 == 0:
        print('{}th iteration is completed'.format(ii))
    errors.append(get_error(Q, X, Y, W))
Q_hat = np.dot(X, Y)
print('Error of rated movies: {}'.format(get_error(Q, X, Y, W)))
0th iteration is completed
Error of rated movies: 15819.8521631

In [18]:
plt.plot(errors);
plt.ylim([0, 20000]);
In [28]:
def print_recommendations(W=W, Q=Q, Q_hat=Q_hat, movie_titles=movie_titles):
    #Q_hat -= np.min(Q_hat)
    #Q_hat[Q_hat < 1] *= 5
    Q_hat -= np.min(Q_hat)
    Q_hat *= float(5) / np.max(Q_hat)
    movie_ids = np.argmax(Q_hat - 5 * W, axis=1)
    for jj, movie_id in zip(range(m), movie_ids):
        #if Q_hat[jj, movie_id] < 0.1: continue
        print('User {} liked {}\n'.format(jj + 1, ', '.join([movie_titles[ii] for ii, qq in enumerate(Q[jj]) if qq > 3])))
        print('User {} did not like {}\n'.format(jj + 1, ', '.join([movie_titles[ii] for ii, qq in enumerate(Q[jj]) if qq < 3 and qq != 0])))
        print('\n User {} recommended movie is {} - with predicted rating: {}'.format(
                    jj + 1, movie_titles[movie_id], Q_hat[jj, movie_id]))
        print('\n' + 100 *  '-' + '\n')
#print_recommendations()

This method does not produce good results for most of the users, so I did not print out the results, as you might imagine from the mean squared error graph.

Weighted Alternating Least Squares Approach

In [20]:
weighted_errors = []
for ii in range(n_iterations):
    for u, Wu in enumerate(W):
        X[u] = np.linalg.solve(np.dot(Y, np.dot(np.diag(Wu), Y.T)) + lambda_ * np.eye(n_factors),
                               np.dot(Y, np.dot(np.diag(Wu), Q[u].T))).T
    for i, Wi in enumerate(W.T):
        Y[:,i] = np.linalg.solve(np.dot(X.T, np.dot(np.diag(Wi), X)) + lambda_ * np.eye(n_factors),
                                 np.dot(X.T, np.dot(np.diag(Wi), Q[:, i])))
    weighted_errors.append(get_error(Q, X, Y, W))
    print('{}th iteration is completed'.format(ii))
weighted_Q_hat = np.dot(X,Y)
#print('Error of rated movies: {}'.format(get_error(Q, X, Y, W)))
In [26]:
plt.plot(weighted_errors);
plt.xlabel('Iteration Number');
plt.ylabel('Mean Squared Error');
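A side note on the weighted update: forming np.diag(Wu) materializes an n-by-n matrix per user, which is wasteful for 10,681 items. Since \(W_u\) is diagonal, the same normal equations can be built by broadcasting the weights over the columns of \(Y\). Here is a sketch of the user-side update under that observation (algebraically equivalent, but not the code I actually ran above):

```python
import numpy as np

def solve_user_factors(Q, Y, W, lambda_):
    """Weighted ALS update for every user's factor vector.

    Solves (Y W_u Y^T + lambda I) x_u = Y W_u q_u for each user u,
    scaling Y's columns by the weights instead of forming np.diag(Wu).
    """
    n_factors = Y.shape[0]
    X = np.empty((Q.shape[0], n_factors))
    reg = lambda_ * np.eye(n_factors)
    for u in range(Q.shape[0]):
        Yw = Y * W[u]  # broadcast: column j of Y scaled by w_uj
        X[u] = np.linalg.solve(Yw.dot(Y.T) + reg, Yw.dot(Q[u]))
    return X
```

The item-side update can be rewritten the same way by broadcasting the item weights over the rows of \(X\).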

It is important to give feedback to the users, so I printed out the results below. (I could not output the results the same way the IPython notebook displays them, so to see the bottom you will need to scroll quite a bit.)

User 2 watched Copycat, Powder, Now and Then, and It Takes Two ((Crime | Drama | Mystery), (Drama | Fantasy | Mystery), (Comedy | Drama), (Comedy | Family | Romance)), and our collaborative filter recommends Volver (Comedy | Crime | Drama). Not so bad, huh?

Note that there are some users who do not have a specific taste in movies, so the recommender seems to output random movies for those users.

Note also that, in a production system or on a website, the predicted rating (or confidence) may not be meaningful by itself. Therefore, although it is important to have confidence in the recommendations displayed to the user, the predicted scores are mostly useful for ranking the items rather than for showing to the user directly.

In [29]:
print_recommendations(Q_hat=weighted_Q_hat)
User 1 liked Toy Story (1995), Jumanji (1995), Grumpier Old Men (1995), Waiting to Exhale (1995), Father of the Bride Part II (1995), Heat (1995), Sabrina (1995), Tom and Huck (1995), Sudden Death (1995), GoldenEye (1995), American President, The (1995), Dracula: Dead and Loving It (1995), Balto (1995), Nixon (1995), Cutthroat Island (1995), Casino (1995), Sense and Sensibility (1995), Four Rooms (1995), Ace Ventura: When Nature Calls (1995), Money Train (1995), Get Shorty (1995)

User 1 did not like 


 User 1 recommended movie is Where the Sidewalk Ends (1950) - with predicted rating: 2.61178326612

----------------------------------------------------------------------------------------------------

User 2 liked Copycat (1995), Powder (1995), Now and Then (1995), It Takes Two (1995)

User 2 did not like Persuasion (1995), Babe (1995), Carrington (1995)


 User 2 recommended movie is Volver (2006) - with predicted rating: 4.56932148577

----------------------------------------------------------------------------------------------------

User 3 liked Dead Presidents (1995), Restoration (1995), Mortal Kombat (1995), To Die For (1995), How to Make an American Quilt (1995), Seven (a.k.a. Se7en) (1995), Pocahontas (1995), When Night Is Falling (1995), Guardian Angel (1994), Lamerica (1994), Big Green, The (1995), Georgia (1995), Kids of the Round Table (1995), Home for the Holidays (1995), Postman, The (Postino, Il) (1994), Confessional, The (Confessionnal, Le) (1995), Indian in the Cupboard, The (1995), Don't Be a Menace to South Central While Drinking Your Juice in the Hood (1996), Two if by Sea (1996), Lawnmower Man 2: Beyond Cyberspace (1996), Two Bits (1995), French Twist (Gazon maudit) (1995), Friday (1995), From Dusk Till Dawn (1996), Fair Game (1995), Kicking and Screaming (1995), Misérables, Les (1995), Bed of Roses (1996)

User 3 did not like Mighty Aphrodite (1995), Mr. Holland's Opus (1995)


 User 3 recommended movie is Larry the Cable Guy: Health Inspector (2006) - with predicted rating: 2.64590027641

----------------------------------------------------------------------------------------------------

User 4 liked Screamers (1995), Crossing Guard, The (1995), Juror, The (1996), White Balloon, The (Badkonake sefid) (1995), Things to Do in Denver When You're Dead (1995), Antonia's Line (Antonia) (1995), White Squall (1996), Black Sheep (1996), Nick of Time (1995), Journey of August King, The (1995), Vampire in Brooklyn (1995), Hate (Haine, La) (1995), Unforgettable (1996), Happy Gilmore (1996), Bridges of Madison County, The (1995), Nobody Loves Me (Keiner liebt mich) (1994), Muppet Treasure Island (1996), Catwalk (1996), Headless Body in Topless Bar (1995), Braveheart (1995), Taxi Driver (1976)

Hybrid Systems

If we want to get the best of both the content-based and collaborative filtering worlds, we could adopt a hybrid approach. They could be combined in the following ways:

  1. Cascade (Content + Collaborative Filtering)
    • After content based filtering, we also apply collaborative filtering to refine the recommendations even more.
  2. Mix & Match
    • If some users (gray sheep) do not necessarily agree with other users, try to use their history and the content of the items they bought rather than depending on collaborative filtering. If a user consistently agrees with other users, favor collaborative filtering for that user.
  3. Switch
    • For a given user and item pair, estimate the confidence of both approaches and choose the best one in terms of confidence.
  4. Weighting
    • Weight both approaches accordingly(based on evaluation) and then recommend the item to the user.
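As an illustration of option 4, a weighted hybrid can be as simple as a convex combination of the two score vectors. The helper below is hypothetical (not from this post's code), and in practice the weight alpha would be tuned on held-out data rather than fixed:

```python
import numpy as np

def hybrid_scores(cf_scores, content_scores, alpha=0.7):
    """Blend collaborative-filtering and content-based scores.

    alpha is the weight given to the CF side; 1 - alpha goes to content.
    """
    cf = np.asarray(cf_scores, dtype=float)
    content = np.asarray(content_scores, dtype=float)
    return alpha * cf + (1 - alpha) * content

# Two candidate items, scored independently by each approach.
cf = [4.5, 2.0]
content = [3.0, 4.0]
blended = hybrid_scores(cf, content, alpha=0.7)  # [4.05, 2.6]
best = int(np.argmax(blended))  # item 0: CF's strong score dominates
```

The switch and cascade variants replace this fixed blend with a per-user or per-stage decision, but the scoring interface stays the same.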

What types of methods do you use or know of for recommender systems?

What types of systems do you know that are better than Alternating Least Squares in terms of performance, mean squared error or other advantages? Please let me know in the comments.
