Skip to Content

netflix prize

  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.

Using Haskell to win the Netflix Prize

As mentioned in my last post, I've been thinking on competing for the Netflix Prize. To spice things up a bit, I decided to implement my algorithm in Haskell. Yes, I'm crazy.

Haskell is a functional programming language, with various cool features such as explicit purity, lazy evaluation, higher order functions and a strong static type system. If you're unsure what all those words mean, you should find out more about Haskell. Trust me, you'll fall in love with it.

The last time I did something remotely serious with Haskell was almost 4 years ago, i.e. writing a ray tracer, HRay, as a part of my Masters' thesis in Computer Science.
I missed implementing stuff in it ever since.

I decided to go with Haskell for this project for various reasons, most importantly because I wanted to freshen up my Haskell-fu and learn more about Haskell (what the hell are monads?!?).
That way, I'll at least gain something from it, just in case I don't actually win the $1M.

The last couple of days, I've been trying to figure out what the best data type would be to handle the massive amount of data, i.e. over 100M user ratings for 17,770 movies from over 480K users.
A first, very naive, attempt looked something like this:

  1. -- map of user ID to movie ID/rating pairs
  2. type Data = Map Int [(Int,Word8)]

Although this might seem like a nice way to go, it performs horribly in terms of memory usage. The reason is simple: normal lists have a huge amount of overhead for each element.
I was told by the nice helpful folks at #haskell that it takes at least 24 bytes of cruft for each list element on a 64-bit system, and that's ignoring the use of pairs as list elements. If you have over 100M of these elements to keep track of, that's a lot of cruft to drag along with you.

After careful thought, and taking into account input from various smart people, I ended up with something like this:

  1. -- map of movie IDs to pairs of UArrays,
  2. -- containing user IDs and ratings
  3. type Data = IntMap (UArray Int Int, UArray Int Word8)

This is a lot more memory efficient, for various reasons:

  • It avoids a lot of overhead coming from the map, because now I only have ~18K keys (the movie IDs), instead of over 480K keys. Also, IntMap should be more efficient than Map Int, at least performance-wise.
  • By using unboxed immutable arrays (UArray), I make sure that each user ID/rating only takes as much space as needed, i.e. 4 bytes (Int) and one byte (Word8) respectively. So, a lot less cruft to drag along.
  • Making sure everything is kept strict, by using UArray (and not Array) and applying seq in the right places, avoids large thunks in memory. Laziness is a nice feature, but it can seriously bite you in the ass in terms of memory usage.

Using this data type, I'm able to read in the entire Netflix data set (ignoring the time stamps) in 26 minutes on my old rusty PowerBook G4, using only 700M of memory. I'll do the real number crunching on our iMac though, don't worry.

Although I'm aware that the memory usage can be pushed down further, I decided to stick with this data type.
I'm glad I managed to get this far using clean Haskell code, i.e. without using any 'nasty' hacks such as allocating memory myself instead of trusting GHC do to it for me.

Next up is writing some analysis code, to figure out if what I have in mind will actually work.
Let the games begin!

The Netflix Prize: why I think SVD won't cut it

A few months ago, while skimming the Slashdot RSS feed, my attention was drawn to a post on the Netflix Prize.

Geeks will (should!) already have heard about this, but in short: the Netflix Prize is an online competition, in which each team needs to come up with a better movie rating prediction algorithm than the one Netflix is using (named CineMatch), better meaning achieving a 10% lower error rate. In case you're wondering why this is so interesting: Netflix is offering $1M to the first team able to do so.

However, the prize money wasn't the main reason for drawing my attention. At first, I figured "If it's worth $1M, it must be damn near impossible, right?". Well, maybe not...

The Slashdot post referred to an article in the New York Times, which briefly touched on the main statistical technique the top teams are mostly relying on, Singular Value Decomposition.

Singular Value Decomposition (SVD) tries to determine so called factors, in this case one factor for each user and movie. The idea is that each factor corresponds to an underlying trend in the data, which may (or may not) have an intuitive meaning, e.g. the amount of action in a movie or how much a user likes/dislikes romantic movies.
Each factor consist of a set of parameters, i.e. real values (e.g. there is 0.6 action in movie X, on a scale between 0 and 1). Combining the parameters of a user and movie, by summing the pair-wise products, should yield an accurate estimate of the user rating.

Because I've been using a strongly related technique in my Phd research, i.e. Principal Components Analysis (PCA), I figured I might actually have a small chance at coming up with a decent algorithm. So, I started looking into details.

Netflix provides a huge data set (~700MB) of user ratings for a set of movies. In total, there are just over 100 million user ratings from 480,109 different users, for 17,770 different movies.
The evaluation of the algorithm a brave soul comes up with is done by seeing how accurately they can predict user ratings for a quiz, i.e. a list of user/movie pairs with unknown corresponding user rating.
To avoid a brute force attack, each user is limited to one submission of an answer to the quiz each day, and only an overall error rate is returned.

After playing around with this for a while, I realized that using a technique like SVD or PCA is probably not the best way to tackle this.
The reason is simple: the sparsity of the data. If you think about the data set as a 480,189x17,770 matrix, then you quickly realize that the provided set of user ratings only fills in ~1.1% of the matrix elements.
So, a lot of data is missing, and SVD will only be able to come up with (very rough) estimates of the factors, based on the available user ratings.

Although the SVD technique matches the problem intuitively, and most teams using it seem to come up with algorithms that are doing pretty well, I don't think that there's enough data available for this technique to yield really good results.
In fact, the top teams are combining this sort of technique with various other methods to be able to achieve good predictions.

Although I have no faith in the SVD technique mentioned above, that doesn't mean I don't think I can come up decent algorithm... It just means I won't be using something like SVD. I'll go deeper into what I do want to rely on in a later post.

For now, I should try and find some time to actually experiment with the ideas I have for this. Right now, some of the top teams are already able to outperforms the Netflix algorithm by over 9.5%, see the Leaderboard for more details.
So it's only a matter of time until someone actually is able to really beat CineMatch, and collect the $1M prize...

Syndicate content