Solve The Real Problem

Discussions about professional software development with a focus on building real, solid, performing, reliable, monitorable large-scale online services in asynchronous C++. You know, good solid breakfast food.

Wednesday, August 23, 2006

Multicore, Singlethreaded, Megascale

The talk recently is all about how we're getting really close to the clock speed "barrier" and the best way to scale up our software is to go multicore. This is really scary, but not because we're nearing engineering limits, and not because we're going multicore, and not because we're going to have to write software a different way. It's scary because everyone is talking about this as if it is something truly new and we as a community have no tools to address it. This set me to thinking . . .

But first, a word about scale

At the mothership, we've had the problem of having to do way more than any single computer could do for ages. To support truly tens-of-millions of online users at once requires more than just a tricked-out computer. You start taking problems and breaking them down into pieces based not only on functionality, but also based on how they scale. And different pieces of a system scale in different ways. Some pieces are true workhorses, and you can farm them off in the ideal manner, as completely independent units each of which is equally suited to being used for some processing. Other pieces need to work together in huge dynamically-hashed clusters that can each handle a certain partition of the problem space. Others can be organized around large-scale redundant databases of various forms. And still other pieces scale in other ways.

These forms of scalability are another dimension of building block that you need when building large-scale high-performance systems. You don't just decide that you need a video codec, a Session class and a user database. You decide that you need a farmed array of video codecs controlled by a farmed array of session-processors which interfaces with a dynamically-hashed cache cluster in front of a redundant database cluster or partition set. The task of design goes beyond just picking the algorithms or objects and includes consideration of how each needs to scale and how you can address those scale problems over the lifetime of the running system.

Systems do not grow linearly

Because scale needs change, you need to build so that you can add hardware and/or software (and thus capital and operating cost) to beef up one part of your system without having to force everything else to scale with it. With real systems, you find that one part (user login, perhaps) takes a bigger hit than the others. You want to build software that lets you add only the hardware (and software) you actually need. Sticking with the example, if your login piece takes 10 or 100 times the hits that each of the other pieces does, why would you design something that requires you to scale all the other pieces up 10 or 100 times just to meet the login load? You'd end up running way too much hardware and software, and that's just a waste of money and human time.

Same Problem, Smaller Datacenter

So back to the talk about the multicore "paradigm shift". I agree that it is an interesting research topic to investigate how to optimally take advantage of massively parallel hardware with shared memory and hundreds or thousands of processors. But the news reports all carry the same message about programming such a machine: "No one is ready for it." And by "no one", the computing science researchers are including themselves.

So does this mean that for the next ten or so years software speeds must freeze while this research happens and the best and brightest minds apply themselves to this newly-important research area? Definitely not. Keep in mind the lessons of scale from those who have already had to go beyond this single computer (let alone single CPU) to solve their problems. Sun's mantra of "the network is the computer" is how this kind of stuff is done. You create many semi-independent pieces of software that run together on a network to create a single large processing entity, or, really, "a computer". This meta-computer is the one you use to get the real work done. In the same way that you don't write a program using only libc, you don't design a massive system using only a normal computer. The components of this large system are the pieces of software running all over the network, and together they are your meta-program.

Through much experience and much thought, my colleagues and I have come to the simple conclusion that virtually all the programs the average person needs to write need only be single-threaded to be simple, scalable, correct, understandable, high-performing, highly-available, and maintainable1. So, we write all2 of our software as single threaded applications. If we need to take advantage of multiple CPUs, we run multiple instances. Sometimes they are on the same multi-CPU box, and sometimes they aren't. Since we're dealing with a meta-computer anyway, who cares?3

1 I'll save elaboration on this huge topic for future blog entries. It isn't really the focus of this posting, so if you must, pretend I said "multi-threaded" and step away from the guillotine. In the meantime, find me a real, large-scale piece of multi-threaded software or system that has no concurrency issues, even under extreme, crippling load. :)

2 Yes, I said "all". Including the ones that do "the math".

3 And, since we're writing proper single-threaded applications, most of them use little CPU under heavy load anyway, but again, I digress.

So, if we need to go multicore because we're stuck at a certain clock speed, let's not lose sight of the fact that the certain clock speed we're stuck at is around 4 GHz! We can write very coarse-grained large-problem software that doesn't even come close to chewing up that kind of power. Surely we're not going to go to a hundred or a thousand 33 MHz processors in these new massively parallel machines? And even if we do drop the clock speed by a factor of 2 or 5 or 10, we're still talking about pretty powerful CPUs. So the only real difference between this parallel machine and today's meta-computers is that the former shares a bunch of memory chips. My experience teaches me that just because you have a shared memory doesn't mean you need to use it all the time. It's a classic case of premature optimization being the root of all evil. Sure, we might be able to do better once we know more about the research area. But we don't know how much better, and we shouldn't rule out alternatives to proposed massively parallel programming languages that even the experts can't fully envision today, if for no other reason than that we shouldn't let the perfect stand in the way of the very good.

It is definitely the same problem in a smaller datacenter. Instead of it taking two football fields to house 10,000 CPUs, maybe it all fits in a couple of racks of massively parallel multi-core machines. But the programming problems of dividing the system up into areas of responsibility based on functionality and scale are the same. And we don't need brand-new computing science to deal with it or programming languages that inherently support uber-sophisticated implicit parallel algorithms that look just like normal code. That's the same folly that network file systems and threads suffer from: they try to make "new" higher-level problems look like old lower-level ones.

We already have the tools that let processors communicate with each other: networks. A single piece of single-threaded network software that receives, processes, and sends events is a very generic concept that can be (and is, ultimately) used to model a traditional single-processor system, threads, a network of nodes, or a parallel machine. We all know this to be true as professional programmers and computing scientists. Why do we continually fight it and try to use less general models when there's a simple one that just works, even for the machines of tomorrow? Then, when tomorrow's tomorrow has come and we've learned even more about these machines by experience, we can do as we always should do and encapsulate the fully-understood problem in an elegant, abstract solution, whatever that turns out to be.


  • At 9:21 PM, Anonymous Anonymous said…

    What is interesting is that this approach works on the small scale as well. Just today I explained how an application could be split into multiple processes to remove the "need" for multiple threads in a Windows application. The "need" came from a really bad interface to a device. By having a separate process responsible for driving the device, the main application could have a clean local TCP connection to it and send simple asynchronous commands back and forth.

