The Netflix Prize: why I think SVD won't cut it
A few months ago, while skimming the Slashdot RSS feed, my attention was drawn to a post on the Netflix Prize.
Geeks will (should!) already have heard about this, but in short: the Netflix Prize is an online competition, in which each team needs to come up with a better movie rating prediction algorithm than the one Netflix is using (named CineMatch), better meaning achieving a 10% lower error rate. In case you're wondering why this is so interesting: Netflix is offering $1M to the first team able to do so.
However, the prize money wasn't the main reason for drawing my attention. At first, I figured "If it's worth $1M, it must be damn near impossible, right?". Well, maybe not...
The Slashdot post referred to an article in the New York Times, which briefly touched on the main statistical technique the top teams are mostly relying on, Singular Value Decomposition.
Singular Value Decomposition (SVD) tries to determine so called factors, in this case one factor for each user and movie. The idea is that each factor corresponds to an underlying trend in the data, which may (or may not) have an intuitive meaning, e.g. the amount of action in a movie or how much a user likes/dislikes romantic movies.
Each factor consist of a set of parameters, i.e. real values (e.g. there is 0.6 action in movie X, on a scale between 0 and 1). Combining the parameters of a user and movie, by summing the pair-wise products, should yield an accurate estimate of the user rating.
Because I've been using a strongly related technique in my Phd research, i.e. Principal Components Analysis (PCA), I figured I might actually have a small chance at coming up with a decent algorithm. So, I started looking into details.
Netflix provides a huge data set (~700MB) of user ratings for a set of movies. In total, there are just over 100 million user ratings from 480,109 different users, for 17,770 different movies.
The evaluation of the algorithm a brave soul comes up with is done by seeing how accurately they can predict user ratings for a quiz, i.e. a list of user/movie pairs with unknown corresponding user rating.
To avoid a brute force attack, each user is limited to one submission of an answer to the quiz each day, and only an overall error rate is returned.

After playing around with this for a while, I realized that using a technique like SVD or PCA is probably not the best way to tackle this.
The reason is simple: the sparsity of the data. If you think about the data set as a 480,189x17,770 matrix, then you quickly realize that the provided set of user ratings only fills in ~1.1% of the matrix elements.
So, a lot of data is missing, and SVD will only be able to come up with (very rough) estimates of the factors, based on the available user ratings.
Although the SVD technique matches the problem intuitively, and most teams using it seem to come up with algorithms that are doing pretty well, I don't think that there's enough data available for this technique to yield really good results.
In fact, the top teams are combining this sort of technique with various other methods to be able to achieve good predictions.
Although I have no faith in the SVD technique mentioned above, that doesn't mean I don't think I can come up decent algorithm... It just means I won't be using something like SVD. I'll go deeper into what I do want to rely on in a later post.
For now, I should try and find some time to actually experiment with the ideas I have for this. Right now, some of the top teams are already able to outperforms the Netflix algorithm by over 9.5%, see the Leaderboard for more details.
So it's only a matter of time until someone actually is able to really beat CineMatch, and collect the $1M prize...
Comments
The Netflix Prize: why I think SVD won't cut it
Thank you for it post because you informing us nicely so I just want to add here that the qualifying data set contains over 2.8 million triplets user, movie, date of grade, los angeles mortgage loans with grades known only to the jury. A participating team's algorithm must predict grades on the entire qualifying set, but they are only informed of the score for half of the data, the quiz set. The other half is the test set, and performance on this is used by the jury to determine potential prize winners. checking account Only the judges know which ratings are in the quiz set, and which are in the test set—this arrangement is intended to make it difficult to hill climb on the test set. Submitted predictions are scored against the true grades in terms of root mean squared error (RMSE), los angeles real estate and the goal is to reduce this error as much as possible. Note that while the actual grades are integers in the range 1 to 5, submitted predictions need not be. Thanks again :)
Although I already answered
Although I already answered your questions on the Haskell mailing list, let me answer them again here...
I was using the lastest version of GHC, i.e. 6.10.1.
Considering runtime options, I was using
+RTS -A128M -s, to lower the activity of the garbage collector, and to collect some runtime statistics (time, memory usage, ...).As I mentioned on the list, your implementation being slower is probaby a combination of things, i.e. constructing the
IntMapin a different way (usingsingletonandunion, instead ofemtpyandinsertlike I'm doing), and usingUrrinstead ofUArray.My code to read and parse the data is available at boegel.kejo.be/files/Netflix_read-and-parse_24-02-2009.hs.
Let me know if you manage to figure out what caused the difference in memory usage and overall performance.
What version of GHC have you
What version of GHC have you used?
Did you set some custom values for the RTS?
My implementation (http://haskell.mperillo.ath.cx/netflix-0.0.1.tar.gz) has severe problems with garbage collector, on GHC 6.8.2.
Post new comment