A few months ago, while skimming the Slashdot RSS feed, my attention was drawn to a post on the Netflix Prize.
Geeks will (should!) already have heard about this, but in short: the Netflix Prize is an online competition, in which each team needs to come up with a better movie rating prediction algorithm than the one Netflix is using (named CineMatch), better meaning achieving a 10% lower error rate. In case you're wondering why this is so interesting: Netflix is offering $1M to the first team able to do so.
However, the prize money wasn't the main reason for drawing my attention. At first, I figured "If it's worth $1M, it must be damn near impossible, right?". Well, maybe not...
The Slashdot post referred to an article in the New York Times, which briefly touched on the main statistical technique the top teams are mostly relying on, Singular Value Decomposition.
Singular Value Decomposition (SVD) tries to determine so called factors, in this case one factor for each user and movie. The idea is that each factor corresponds to an underlying trend in the data, which may (or may not) have an intuitive meaning, e.g. the amount of action in a movie or how much a user likes/dislikes romantic movies.
Each factor consist of a set of parameters, i.e. real values (e.g. there is 0.6 action in movie X, on a scale between 0 and 1). Combining the parameters of a user and movie, by summing the pair-wise products, should yield an accurate estimate of the user rating.
Because I've been using a strongly related technique in my Phd research, i.e. Principal Components Analysis (PCA), I figured I might actually have a small chance at coming up with a decent algorithm. So, I started looking into details.
Netflix provides a huge data set (~700MB) of user ratings for a set of movies. In total, there are just over 100 million user ratings from 480,109 different users, for 17,770 different movies.
The evaluation of the algorithm a brave soul comes up with is done by seeing how accurately they can predict user ratings for a quiz, i.e. a list of user/movie pairs with unknown corresponding user rating.
To avoid a brute force attack, each user is limited to one submission of an answer to the quiz each day, and only an overall error rate is returned.

After playing around with this for a while, I realized that using a technique like SVD or PCA is probably not the best way to tackle this.
The reason is simple: the sparsity of the data. If you think about the data set as a 480,189x17,770 matrix, then you quickly realize that the provided set of user ratings only fills in ~1.1% of the matrix elements.
So, a lot of data is missing, and SVD will only be able to come up with (very rough) estimates of the factors, based on the available user ratings.
Although the SVD technique matches the problem intuitively, and most teams using it seem to come up with algorithms that are doing pretty well, I don't think that there's enough data available for this technique to yield really good results.
In fact, the top teams are combining this sort of technique with various other methods to be able to achieve good predictions.
Although I have no faith in the SVD technique mentioned above, that doesn't mean I don't think I can come up decent algorithm... It just means I won't be using something like SVD. I'll go deeper into what I do want to rely on in a later post.
For now, I should try and find some time to actually experiment with the ideas I have for this. Right now, some of the top teams are already able to outperforms the Netflix algorithm by over 9.5%, see the Leaderboard for more details.
So it's only a matter of time until someone actually is able to really beat CineMatch, and collect the $1M prize...