Using Haskell to win the Netflix Prize

As mentioned in my last post, I've been thinking on competing for the Netflix Prize. To spice things up a bit, I decided to implement my algorithm in Haskell. Yes, I'm crazy.
Haskell is a functional programming language, with various cool features such as explicit purity, lazy evaluation, higher order functions and a strong static type system. If you're unsure what all those words mean, you should find out more about Haskell. Trust me, you'll fall in love with it.
The last time I did something remotely serious with Haskell was almost 4 years ago, i.e. writing a ray tracer, HRay, as a part of my Masters' thesis in Computer Science.
I missed implementing stuff in it ever since.
I decided to go with Haskell for this project for various reasons, most importantly because I wanted to freshen up my Haskell-fu and learn more about Haskell (what the hell are monads?!?).
That way, I'll at least gain something from it, just in case I don't actually win the $1M.
The last couple of days, I've been trying to figure out what the best data type would be to handle the massive amount of data, i.e. over 100M user ratings for 17,770 movies from over 480K users.
A first, very naive, attempt looked something like this:
Although this might seem like a nice way to go, it performs horribly in terms of memory usage. The reason is simple: normal lists have a huge amount of overhead for each element.
I was told by the nice helpful folks at #haskell that it takes at least 24 bytes of cruft for each list element on a 64-bit system, and that's ignoring the use of pairs as list elements. If you have over 100M of these elements to keep track of, that's a lot of cruft to drag along with you.
After careful thought, and taking into account input from various smart people, I ended up with something like this:
-- map of movie IDs to pairs of UArrays, -- containing user IDs and ratings
This is a lot more memory efficient, for various reasons:
- It avoids a lot of overhead coming from the map, because now I only have ~18K keys (the movie IDs), instead of over 480K keys. Also,
IntMapshould be more efficient thanMap Int, at least performance-wise. - By using unboxed immutable arrays (
UArray), I make sure that each user ID/rating only takes as much space as needed, i.e. 4 bytes (Int) and one byte (Word8) respectively. So, a lot less cruft to drag along. - Making sure everything is kept strict, by using
UArray(and notArray) and applyingseqin the right places, avoids large thunks in memory. Laziness is a nice feature, but it can seriously bite you in the ass in terms of memory usage.
Using this data type, I'm able to read in the entire Netflix data set (ignoring the time stamps) in 26 minutes on my old rusty PowerBook G4, using only 700M of memory. I'll do the real number crunching on our iMac though, don't worry.
Although I'm aware that the memory usage can be pushed down further, I decided to stick with this data type.
I'm glad I managed to get this far using clean Haskell code, i.e. without using any 'nasty' hacks such as allocating memory myself instead of trusting GHC do to it for me.
Next up is writing some analysis code, to figure out if what I have in mind will actually work.
Let the games begin!
Comments
Hi there, I'm having trouble
Hi there, I'm having trouble reading your site in the Avant browser (the font is extremely tiny). I've tried raising the font size from the settings option but that didn't work either. Any tips on what I should do? (Oh ya, I'm on Windows 7) - weight loss tips for men
Post new comment