Archive for May, 2010

How to turn many-core chips into a blessing for single threaded applications?

Wednesday, May 26th, 2010

The main argument against aggressively scaling up the number of cores in modern many-core processors is the large body of non-parallelized applications that simply cannot make use of more than a few (or maybe just one…) cores. Server chip vendors spend a tremendous amount of effort on inventing novel ways to execute these applications faster without increasing core frequency or busting the power budget: scouting threads, on-chip DRAM and out-of-order execution are all geared towards executing single-threaded applications faster (besides also addressing the memory wall to a certain extent).

There’s one family of techniques that has been explored – and abandoned – in the context of speeding up the execution of single threaded applications. Speculative execution, value and branch prediction and run-ahead execution all promised great benefits but ended up achieving very little. After lots of trial and error, the computer architecture community largely abandoned the idea and returned to more “down-to-earth” approaches – even though studies have shown that there is indeed a wealth of parallelism left unexploited in single threaded applications, even within a window of a few hundred instructions.

As I argued in a previous post, the fundamental underlying reason for this failure is the yawning gap between the wealth of semantic information in the programmer’s head and what the hardware eventually gets to see. In many ways, computer architects were the Sherlock Holmes and Hercule Poirot of computers, trying to devise ways to figure out what the programmer might have wanted to achieve from a hopelessly sequential stream of very low level instructions (inherently so, due to the nature of the von Neumann computing architecture). Their efforts were inevitably bound to fail – and fail they did, as there’s really only a limited amount of information a computing system can extract from machine (or close to machine: C/C++/Java) code.

So, is there another way? I believe there is, and that’s the subject of research I’m working on with professor Per Stenström at Chalmers University in Gothenburg. The basic idea is not to parallelize applications, but to offer a formal way for the programmer to convey semantic information that the hardware can exploit to make the execution more efficient. We plan to do this through contracts: constraints, code snippets and hints – both deterministic and probabilistic – that can direct the hardware towards executing the software in a specific context. The hardware may deploy speculation, run-ahead execution or value speculation – but all driven by the semantic information provided by the programmer.
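To make the idea a bit more tangible, here is a purely hypothetical sketch of what such a contract might look like at the source level. The `contract` decorator, its `prob` parameter and every name below are my own illustration, not the notation we actually use; the Huffman example further down uses exactly this kind of probabilistic hint.

```python
# Purely hypothetical illustration of a source-level contract notation;
# every name here is invented for this sketch, not taken from the paper.

def contract(prob=1.0):
    """Mark a function as a programmer-supplied semantic hint.

    prob == 1.0 denotes a deterministic guarantee; prob < 1.0 marks the
    hint as probabilistic, meaning the hardware must be prepared to roll
    back speculation on the rare occasions the claim does not hold."""
    def wrap(fn):
        fn.__contract__ = {"probabilistic": prob < 1.0, "prob": prob}
        return fn
    return wrap

@contract(prob=0.95)
def code_start_within(window_bits):
    """Claim: within this window of N consecutive bits there is, with
    roughly 95% probability, at least one start of a compressed code."""
    ...
```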

In a paper due to be presented at the Workshop on Parallel Execution of Sequential Programs on Multi-core Architectures (PESPMA), co-located with ISCA 2010, we show how this method can speed up Huffman decoding by up to 8x on 64 cores, without touching the algorithm itself and while keeping the power budget constant. The more cores you can throw at it, the larger the speedup – and you can use simpler cores, as the per-core share of the problem keeps shrinking.

So, what kind of semantic information did the programmer provide? The insight is that if the depth of the Huffman tree is N, then within any N consecutive bits there will certainly be at least one start of a new compressed code (finding these positions is what makes Huffman decoding hard to parallelize). If we are happy with a probabilistic approach (say, with 95% probability there is a code start), then N is significantly reduced – we saw values hovering around 8. Thus, the hint the programmer gives is as follows: split the stream into several chunks; for each chunk, start parallel speculative execution on N consecutive bits; let me know when you’re done – I’ll provide a code snippet to choose the right result.
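As an illustration of that recipe, here is a minimal Python sketch that emulates the scheme with ordinary worker processes rather than speculative hardware. The `decode` helper – decode symbols from a given bit offset and report where the next code starts – is an assumed stand-in, and all the names are mine, not the paper’s:

```python
# Minimal sketch of chunked, speculative Huffman decoding -- an emulation
# of the scheme on ordinary cores; all names here are illustrative.
from concurrent.futures import ProcessPoolExecutor

def decode_chunk_all_offsets(bits, chunk_start, chunk_end, N, decode):
    """Speculatively decode one chunk from each of N candidate offsets.

    If N is the Huffman tree depth, at least one of any N consecutive
    bit positions is guaranteed to be the start of a code."""
    return {chunk_start + off: decode(bits, chunk_start + off, chunk_end)
            for off in range(N)}

def speculative_huffman_decode(bits, chunk_size, N, decode, workers=64):
    """Assumes decode(bits, start, stop) decodes symbols from bit offset
    `start`, stops at the first code starting at or beyond `stop`, and
    returns (symbols, next_code_start)."""
    chunks = [(s, min(s + chunk_size, len(bits)))
              for s in range(0, len(bits), chunk_size)]

    # Parallel phase: every chunk is decoded from all N candidate start
    # offsets; exactly one of these decodes will turn out to be correct.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(decode_chunk_all_offsets,
                               bits, s, e, N, decode) for s, e in chunks]
        speculative = [f.result() for f in futures]

    # Sequential stitch ("choose the right result"): the offset at which
    # the accepted decode of chunk k ended selects the correct candidate
    # for chunk k+1, since codes are at most N bits long.
    output, pos = [], 0
    for (start, end), results in zip(chunks, speculative):
        if pos in results:
            symbols, pos = results[pos]
        else:
            # A probabilistic N (e.g. 95% coverage) can occasionally miss;
            # fall back to decoding this chunk non-speculatively.
            symbols, pos = decode(bits, pos, end)
        output.extend(symbols)
    return output
```

Note that the stitch phase is sequential but only selects among results that have already been computed, which is why adding cores – up to N per chunk – translates almost directly into speedup.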

There’s still a lot of work left in this area, but the results are more than encouraging. Pushed to the limit, this may be a method that allows us to take advantage of a great many simple cores even for sequential applications, without the need to rewrite them. If we can achieve this, I believe we will have removed a major roadblock when it comes to using many-core chips.

Obviously, there may be several other ways to strengthen the semantic link between software and hardware, and I know several universities are working on such methods. However, any significant redesign of computer architecture, any significant departure from the von Neumann model, will have an enormous cost – and frankly, I don’t see the willingness in the industry to go there.

But, as always, time will tell – and it’s pretty damn difficult to make predictions, especially about the future (a quip variously attributed to Niels Bohr and Yogi Berra).

Üni, the piano virtuoso

Saturday, May 8th, 2010

At the neighbors’ little girl’s birthday party, Üni discovered the piano. She wasted no time: first she played a rollicking four-hands piece with the birthday girl’s little sister (Ida, in civilian life Üni’s kindergarten classmate), something like this:

then Üni played a solo. At the end, Ida joined in too, to speed up the tempo 😉

From iPad with love

Thursday, May 6th, 2010

I have never owned an iMac, iPhone, iPod or any other Apple gadget. In fact, I had never even touched an iMac or an iPod; I confess to having played around with an iPhone on a couple of brief occasions.
All that changed yesterday when I walked into the Apple Store on 5th Avenue in New York and, after playing around with an iPad for 5 minutes, walked away as the proud owner of a brand new 16GB Wi-Fi iPad (actually two, one for a colleague of mine).
Why did I buy it? What’s my use-case for it?
Besides appealing to the geek in me, this is a truly cool, fast and usable device. It’s a beautiful piece of industrial design that simply works, and does so flawlessly. The browser is fast and accurate, the mail client is neat, and the productivity tools – like the document editor – are perfectly usable for serious work as well. What will I use it for? As a second internet device and as a typing device when I’m travelling without access to a power outlet (to work on the many-core programming book – but more about this in another post).
Besides its design and usability, the main selling point was power efficiency. After about six hours of on-and-off usage, the battery is still 91% full – previous tests have shown it running for over 11 hours, something I have never seen on such a computer-like device and just enough to get me safely through an intercontinental flight with one or two stopovers.
For me, this is the main reason why the iPad will be hugely successful and has the potential to be a game-changer. It’s a perfect proof of concept of how low power chips have come to match the performance of traditional general purpose chips at a fraction of the cost and power – and it will not stop here: there are several start-ups working on ultra-low power chips usable in servers. One little secret of modern chips is that a lot of their logic and power is spent on bridging the gap between the speed of the processor and that of memory. With a slower processor that gap shrinks, and a smaller gap needs less bridging logic, which in turn saves even more power. Obviously, such an approach will fit applications with a sufficient amount of parallelism, not inherently sequential ones. I’ll return to the subject of these applications in a post soon.

PS. Perhaps you guessed: this post was written on the iPad. DS