Skip to Content

efficient memory usage

  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.
  • : Function ereg() is deprecated in /customers/kejo.be/kejo.be/httpd.www/modules/logotool/logotool.module on line 327.

Using Haskell to win the Netflix Prize

As mentioned in my last post, I've been thinking on competing for the Netflix Prize. To spice things up a bit, I decided to implement my algorithm in Haskell. Yes, I'm crazy.

Haskell is a functional programming language, with various cool features such as explicit purity, lazy evaluation, higher order functions and a strong static type system. If you're unsure what all those words mean, you should find out more about Haskell. Trust me, you'll fall in love with it.

The last time I did something remotely serious with Haskell was almost 4 years ago, i.e. writing a ray tracer, HRay, as a part of my Masters' thesis in Computer Science.
I missed implementing stuff in it ever since.

I decided to go with Haskell for this project for various reasons, most importantly because I wanted to freshen up my Haskell-fu and learn more about Haskell (what the hell are monads?!?).
That way, I'll at least gain something from it, just in case I don't actually win the $1M.

The last couple of days, I've been trying to figure out what the best data type would be to handle the massive amount of data, i.e. over 100M user ratings for 17,770 movies from over 480K users.
A first, very naive, attempt looked something like this:

  1. -- map of user ID to movie ID/rating pairs
  2. type Data = Map Int [(Int,Word8)]

Although this might seem like a nice way to go, it performs horribly in terms of memory usage. The reason is simple: normal lists have a huge amount of overhead for each element.
I was told by the nice helpful folks at #haskell that it takes at least 24 bytes of cruft for each list element on a 64-bit system, and that's ignoring the use of pairs as list elements. If you have over 100M of these elements to keep track of, that's a lot of cruft to drag along with you.

After careful thought, and taking into account input from various smart people, I ended up with something like this:

  1. -- map of movie IDs to pairs of UArrays,
  2. -- containing user IDs and ratings
  3. type Data = IntMap (UArray Int Int, UArray Int Word8)

This is a lot more memory efficient, for various reasons:

  • It avoids a lot of overhead coming from the map, because now I only have ~18K keys (the movie IDs), instead of over 480K keys. Also, IntMap should be more efficient than Map Int, at least performance-wise.
  • By using unboxed immutable arrays (UArray), I make sure that each user ID/rating only takes as much space as needed, i.e. 4 bytes (Int) and one byte (Word8) respectively. So, a lot less cruft to drag along.
  • Making sure everything is kept strict, by using UArray (and not Array) and applying seq in the right places, avoids large thunks in memory. Laziness is a nice feature, but it can seriously bite you in the ass in terms of memory usage.

Using this data type, I'm able to read in the entire Netflix data set (ignoring the time stamps) in 26 minutes on my old rusty PowerBook G4, using only 700M of memory. I'll do the real number crunching on our iMac though, don't worry.

Although I'm aware that the memory usage can be pushed down further, I decided to stick with this data type.
I'm glad I managed to get this far using clean Haskell code, i.e. without using any 'nasty' hacks such as allocating memory myself instead of trusting GHC do to it for me.

Next up is writing some analysis code, to figure out if what I have in mind will actually work.
Let the games begin!

Syndicate content