A couple of days ago a Slashdot post titled "Supercomputer Advancement Slows?" caught my attention. It concerns an IEEE Spectrum article on Next-Generation Supercomputers, which is well worth the read, imho.
In short, the article lists various reasons why supercomputers won't break the exaflop barrier (10^18, or 1,000,000,000,000,000,000, floating-point operations per second) anytime soon. The major concerns are well-known: power usage, cooling, cost, physical footprint, and so on. Beyond those, the article also touches on utilization (actual vs. peak flops), the huge memory and storage requirements, and the need for fault tolerance.
First of all, I'm not sure I fully agree with the article's conclusion. Some of the problems mentioned do seem too hard to handle right now, but we've seen amazing things accomplished over the past couple of decades. Also, I kind of had a "the earth is flat" feeling when reading the article, if you know what I mean. I might be wrong, though. When YouTube started gaining momentum a couple of years ago, I felt it would never work out because people wouldn't want to put videos of themselves online for everyone to see. Boy, was I wrong...
Nevertheless, the reason I'm bringing up this Slashdot post is because I feel the author(s) of the IEEE Spectrum article missed something.
During my PhD I "wasted" a couple of centuries of computing time on the university's HPC infrastructure myself. And in the couple of months since I joined the HPC team at Ghent University, I've worked with scientists from various fields who run experiments on our (currently rather modest) HPC infrastructure. Seeing HPC systems from the end users' point of view made me realize there is another important aspect that contributes to a successful HPC infrastructure, or supercomputer (if you insist): the users. Yes, them.
Even if you have a massive beast of a system, with a state-of-the-art network and storage infrastructure, the best processors money can buy, and no budget limitations on operational cost, it's the users who will determine whether or not it all pays off. Users need to know what they are doing, how to use the system efficiently, and how to avoid doing downright useless stuff on it. You wouldn't believe how much computing time gets wasted by typos in scripts or quick-let's-submit-this-because-it's-Friday-afternoon-and-I-need-a-beer experiments.
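To make that concrete, here's a hypothetical job script of the kind I have in mind (the scheduler directives, resource sizes, and file names are all made up for illustration): a single typo is enough to reserve dozens of nodes for a job that dies the moment it starts.

```shell
#!/bin/bash
# Hypothetical PBS job script -- all names and numbers are invented.
#PBS -l nodes=64:ppn=8
#PBS -l walltime=72:00:00

cd $PBS_O_WORKDIR
# Typo: the binary is actually called ./simulation, so the job fails
# right away -- yet 64 nodes were still reserved and scheduled for it.
./simualtion --input experiment.cfg
```

Something as simple as a dry run on a single node first, or a wrapper that checks the executable and input files exist before calling qsub, catches most of these.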
Frankly, I have no idea how they handle this at really large supercomputer sites, like the ones in the Top500 list. I hope they only start the really big experiments after thorough preparation, testing smaller-scale stuff first and making damn sure they've done the best they can to optimize the experiments. Otherwise, why even bother breaking the exascale limit?