Solve The Real Problem

Discussions about professional software development with a focus on building real, solid, performing, reliable, monitorable large-scale online services in asynchronous C++. You know, good solid breakfast food.

Wednesday, August 23, 2006

Multicore, Singlethreaded, Megascale

The talk recently is all about how we're getting really close to the clock speed "barrier" and the best way to scale up our software is to go multicore. This is really scary, but not because we're nearing engineering limits, and not because we're going multicore, and not because we're going to have to write software a different way. It's scary because everyone is talking about this as if it is something truly new and we as a community have no tools to address it. This set me to thinking . . .

But first, a word about scale

At the mothership, we've had the problem of needing to do way more than any single computer could do for ages now. Supporting tens of millions of online users at once requires more than just a tricked-out computer. You start breaking problems down into pieces based not only on functionality, but also on how they scale. And different pieces of a system scale in different ways. Some pieces are true workhorses, and you can farm them off in the ideal manner: as completely independent units, each equally suited to taking on some of the processing. Other pieces need to work together in huge dynamically-hashed clusters that each handle a certain partition of the problem space. Others can be organized around large-scale redundant databases of various forms. And still other pieces scale in other ways.
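
To make the "dynamically-hashed cluster" idea concrete, here is a minimal sketch (not our production code) of how a key can be routed to the cluster member that owns its partition, using a consistent-hash ring. The node names, the hash function, and the replica count are all illustrative assumptions.

    // Sketch only: route keys to the members of a dynamically-hashed cluster
    // with a consistent-hash ring. Node names and hash choice are illustrative.
    #include <stdint.h>
    #include <iostream>
    #include <map>
    #include <string>

    // 64-bit FNV-1a; any decent hash of the key would do.
    static uint64_t fnv1a(const std::string& s)
    {
        uint64_t h = 14695981039346656037ULL;
        for (std::string::size_type i = 0; i < s.size(); ++i) {
            h ^= static_cast<unsigned char>(s[i]);
            h *= 1099511628211ULL;
        }
        return h;
    }

    class HashRing
    {
    public:
        // Each node is placed on the ring many times so load spreads evenly.
        void addNode(const std::string& node, int replicas = 64)
        {
            for (int i = 0; i < replicas; ++i)
                ring_[fnv1a(node + "#" + std::to_string(i))] = node;
        }

        void removeNode(const std::string& node, int replicas = 64)
        {
            for (int i = 0; i < replicas; ++i)
                ring_.erase(fnv1a(node + "#" + std::to_string(i)));
        }

        // Route a key (a user id, a session id, ...) to the node that owns it:
        // the first ring entry at or after the key's hash, wrapping around.
        std::string nodeFor(const std::string& key) const
        {
            std::map<uint64_t, std::string>::const_iterator it =
                ring_.lower_bound(fnv1a(key));
            if (it == ring_.end())
                it = ring_.begin();
            return it->second;
        }

    private:
        std::map<uint64_t, std::string> ring_;
    };

    int main()
    {
        HashRing cluster;
        cluster.addNode("cache-01");
        cluster.addNode("cache-02");
        cluster.addNode("cache-03");

        // Adding or removing a node moves only the keys in its slices of the
        // ring; the rest of the cluster keeps serving its partitions untouched.
        std::cout << "user:12345 lives on " << cluster.nodeFor("user:12345") << "\n";
        return 0;
    }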

These forms of scalability are another dimension of building block that you need when building large-scale high-performance systems. You don't just decide that you need a video codec, a Session class and a user database. You decide that you need a farmed array of video codecs controlled by a farmed array of session-processors which interfaces with a dynamically-hashed cache cluster in front of a redundant database cluster or partition set. The task of design goes beyond just picking the algorithms or objects and includes consideration of how each needs to scale and how you can address those scale problems over the lifetime of the running system.

Systems do not grow linearly

Because scale needs change, you need to build so that you can add hardware and/or software (and thus capital and operating cost) to beef up one part of your system without having to force everything else to scale together. With real systems, you find that one part (user login, perhaps) takes a bigger hit than others. You want to build software that lets you add only the hardware (and software) you need. Sticking with the example, if your login piece takes 10 or 100 times the hits that each of the other individual pieces does, why would you design something that requires you to scale all the other pieces up 10 or 100 times just to meet the login load? You'll end up running way too much hardware and software, and that's just a waste of money and human time.

Same Problem, Smaller Datacenter

So back to the talk about the multicore "paradigm shift". I agree that it is an interesting research topic to investigate how to optimally take advantage of massively parallel hardware with a shared memory and hundreds or thousands of processors. But news reports all carry the same message about programming such a machine: "No one is ready for it." And by "no one", the computing science researchers include themselves.

So does this mean that for the next ten or so years software speeds must freeze while this research happens and the best and brightest minds apply themselves to this newly-important research area? Definitely not. Keep in mind the lessons of scale from those who have already had to go beyond this single computer (let alone single CPU) to solve their problems. Sun's mantra of "the network is the computer" is how this kind of stuff is done. You create many semi-independent pieces of software that run together on a network to create a single large processing entity, or, really, "a computer". This meta-computer is the one you use to get the real work done. In the same way that you don't write a program using only libc, you don't design a massive system using only a normal computer. The components of this large system are the pieces of software running all over the network, and together they are your meta-program.

Through much experience and much thought, my colleagues and I have come to the simple conclusion that virtually all the programs the average person needs to write need only be single-threaded to be simple, scalable, correct, understandable, high-performing, highly-available, and maintainable1. So, we write all2 of our software as single threaded applications. If we need to take advantage of multiple CPUs, we run multiple instances. Sometimes they are on the same multi-CPU box, and sometimes they aren't. Since we're dealing with a meta-computer anyway, who cares?3

1 I'll save elaboration on this huge topic for future blog entries. This isn't really the focus of this posting, so if you must, pretend I said "multi-threaded" and step away from the guillotine. In the meantime, find me a real, large-scale piece of multi-threaded software or system that has no concurrency issues, even under extreme, crippling load. :)

2 Yes, I said "all". Including the ones that do "the math".

3 And, since we're writing proper single-threaded applications, most of them use little CPU under heavy load anyway, but again, I digress.

So, if we need to go multicore because we're stuck at a certain clock speed, let's not lose sight of the fact that the certain clock speed we're stuck at is around 4 GHz! We can write very coarse-grained large-problem software that doesn't even come close to chewing up that kind of power. Surely we're not going to go to a hundred or a thousand 33 MHz processors in these new massively parallel machines? And even if we do drop the clock speed by a factor of 2 or 5 or 10, we're still talking about pretty powerful CPUs. So the only real difference between this parallel machine and today's meta-computers is that the former shares a bunch of memory chips. My experience teaches me that just because you have a shared memory doesn't mean you need to use it all the time. It's a classic case of premature optimization being the root of all evil. Sure, we might be able to do better once we know more about the research area. But we don't know how much better, and we shouldn't exclude today's alternative approaches in favour of proposed massively parallel programming languages that even the experts can't fully envision, if for no other reason than that we shouldn't let perfect stand in the way of very good.

It is definitely the same problem in a smaller datacenter. Instead of it taking two football fields to house 10,000 CPUs, maybe it all fits in a couple of racks of massively parallel multi-core machines. But the programming problems of dividing the system up into areas of responsibility based on functionality and scale are the same. And we don't need brand-new computing science to deal with it or programming languages that inherently support uber-sophisticated implicit parallel algorithms that look just like normal code. That's the same folly that network file systems and threads suffer from: they try to make "new" higher-level problems look like old lower-level ones.

We already have the tools that let processors communicate with each other: networks. A single piece of single-threaded network software that receives, processes, and sends events is a very generic concept that can be (and is, ultimately) used to model a traditional single processor system, threads, a network of nodes, or a parallel machine. We all know this to be true as professional programmers and computing scientists. Why do we continually fight it and try to use less general models when there's a simple one that just works, even for the machines of tomorrow? Then, when tomorrow's tomorrow has come and we've learned even more about these machines by experience, we can do as we always should do and encapsulate the fully-understood problem in an elegant, abstract solution, whatever that turns out to be.
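
To make "receives, processes, and sends events" concrete, here is a minimal single-threaded sketch built on POSIX poll(). The echo "processing" and the port number are stand-ins for a real service, nothing more.

    // Minimal sketch of a single-threaded, event-driven network program: one
    // poll() loop that receives events, processes them, and sends replies.
    // The "processing" here is a trivial echo and the port is made up; a real
    // service dispatches each event to the component responsible for it.
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <poll.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstring>
    #include <vector>

    int main()
    {
        int listener = socket(AF_INET, SOCK_STREAM, 0);
        int on = 1;
        setsockopt(listener, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on);

        sockaddr_in addr;
        std::memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(9000);
        bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof addr);
        listen(listener, 128);

        std::vector<pollfd> fds;
        pollfd p = { listener, POLLIN, 0 };
        fds.push_back(p);

        for (;;) {                                 // the entire "threading model"
            poll(&fds[0], fds.size(), -1);         // sleep until the next event(s)

            for (size_t i = 0; i < fds.size(); ++i) {
                if (!(fds[i].revents & POLLIN))
                    continue;

                if (fds[i].fd == listener) {       // event: a new connection
                    int client = accept(listener, 0, 0);
                    pollfd c = { client, POLLIN, 0 };
                    fds.push_back(c);
                } else {                           // event: data from a client
                    char buf[4096];
                    ssize_t n = read(fds[i].fd, buf, sizeof buf);
                    if (n <= 0) {                  // event: the client went away
                        close(fds[i].fd);
                        fds.erase(fds.begin() + i);
                        --i;
                    } else {
                        write(fds[i].fd, buf, n);  // "process", then send
                    }
                }
            }
        }
    }

If we need more than one CPU, we run more than one of these processes; nothing inside the program has to change.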

Monday, August 14, 2006

Desirable System Qualities

This is a brief discussion of general desirable system qualities that we seek when designing systems. These ideas took this solid form in September 2003, when I was aiming to document how I think about design as part of an effort to spread new ideas within the mothership. Reading it over today, it's interesting to note that if you're in something for the long haul and you want to keep people interested past the hype, you need quality. For online services (especially, but not exclusively), the following are definitely the kinds of qualities you need to aim for and achieve to earn long-term success.

Systems have different jobs to do and different requirements. However, there is a set of qualities that all systems can seek to offer.

Scalability

Because it is often impossible to know the number of clients or eventual load requirements of a system, we must design in a way that allows the system to grow to meet future needs.

Different pieces of any system scale differently. Some areas of responsibility are more costly than others or are used more often, and these will need to "scale more" than other pieces. Key to dealing with this is a system design that allows for this growth by decoupling the system's components via well-described interfaces across various hardware platforms and instances. Said another way, well-defined division of responsibility is essential to offering systemic scalability.

Performance

All systems have performance requirements, and many systems need to perform in realtime or near realtime. Meeting performance requirements often translates to a need for scalability: by distributing responsibility, more computing power can be applied to a given task. Meeting realtime performance requirements means distributing areas of responsibility in order to control the time needed to perform those responsibilities.

Performance can be thought of as being achieved via algorithmic and systemic means. Components need to choose good algorithms to perform work within their areas of responsibility. Systems need to connect the right components using the right interfaces to offer the right information at the right time.

High Availability

All systems have components that fail; designing systems that do not fail as a consequence of a few component failures is key to offering high availability. This provides another reason for clear divisions and distributions of responsibility, as any responsibility that is not distributed, and thus resides in one component, leaves the system susceptible to failure of that component.

When parts of a system fail, replacements must be pressed into use manually or automatically. Manual replacement is acceptable in some circumstances, but it must be supported by good detection and reporting of problems within contexts that quickly and easily identify failed components. Automatic replacement is something that needs to be part of the design: components must expect other components to fail, and they must deal with that failure realistically.

Some systems require near zero information loss. Other systems can lose some information provided they "cut over" to backup components. Real systems require both to some degree within various areas of responsibility. Knowing when loss of availability is acceptable is just as important as knowing when it is not. Knowing which information is "system-critical" and which information can be lost is also important. Systems need to preserve the information that matters and they need to recover from failure to continue to provide service as much as is necessary.
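
As a purely illustrative sketch of "components must expect other components to fail, and they must deal with that failure realistically", here is one shape the cut-over can take in code. The class, the peer list, and the stubbed transport are assumptions made for the example.

    // Illustrative sketch, not a complete client: a caller that expects its
    // peer to fail and cuts over to a replacement automatically.
    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    class ReplicaSet
    {
    public:
        explicit ReplicaSet(const std::vector<std::string>& peers)
            : peers_(peers), current_(0) {}

        // Try the current peer; on failure, report it and cut over to the next.
        bool send(const std::string& request)
        {
            for (std::size_t attempts = 0; attempts < peers_.size(); ++attempts) {
                if (sendRequest(peers_[current_], request))
                    return true;                        // served, no cut-over needed
                std::cerr << "peer " << peers_[current_]
                          << " failed; cutting over\n"; // detect and report failures
                current_ = (current_ + 1) % peers_.size();
            }
            return false;  // every replica failed: a system-level event, not a hiccup
        }

    private:
        // Stand-in for the real transport; a real one makes a network call and
        // treats timeouts and errors as failure.
        bool sendRequest(const std::string& peer, const std::string& request)
        {
            (void)peer; (void)request;
            return true;
        }

        std::vector<std::string> peers_;
        std::size_t current_;
    };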

Levels of Abstraction

To actually carry out its duties, the system must know the intricate details of the systems with which it interfaces and the tasks it performs in its lowest-level components. To be useful and successful, it must know how to connect components to meet the system's overall goals and carry out high-level tasks (and implement high-level interfaces).

Through well-defined interfaces, components must offer to other components functionality that falls into their area of responsibility. By taking responsibility for an area, the component agrees to do "real work" in that area, thereby simplifying it for the rest of the system via an interface. A component must not expose unnecessary detail to other components; doing so would limit the system's ability to have these qualities. A component must communicate in terms of its area of responsibility as seen from the outside. Well-defined interfaces designed to match the external view of that area of responsibility support this requirement.
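
As a hedged illustration of what such an interface can look like in code, here is a sketch of a component that owns a "user directory" area of responsibility. The names and stubbed bodies are invented for this example; only the shape matters.

    // Sketch only: a component that owns the "user directory" area of
    // responsibility and speaks in that vocabulary, not in terms of its
    // internal database, cache, or wire protocol. All names are invented.
    #include <string>

    struct UserRecord
    {
        std::string id;
        std::string displayName;
    };

    // The well-defined interface: the only thing the rest of the system sees.
    class UserDirectory
    {
    public:
        virtual ~UserDirectory() {}
        virtual bool lookup(const std::string& userId, UserRecord& out) = 0;
        virtual void rename(const std::string& userId,
                            const std::string& newName) = 0;
    };

    // One possible implementation. That it happens to sit on a hashed cache
    // cluster in front of a redundant database is its own business; no
    // connection handles, queries, or cache keys leak through the interface.
    class ClusteredUserDirectory : public UserDirectory
    {
    public:
        virtual bool lookup(const std::string& userId, UserRecord& out)
        {
            // ...consult the cache partition for userId, fall back to the
            // database on a miss... (stubbed for the sketch)
            out.id = userId;
            out.displayName = "stubbed";
            return true;
        }
        virtual void rename(const std::string& userId,
                            const std::string& newName)
        {
            // ...write through to the database, invalidate the cache entry...
            (void)userId;
            (void)newName;
        }
    };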

Aggregates of components may form subsystems (as seen from the bottom) and super-components (as seen from the top). The role of any particular entity within the system is a matter of perspective. That is, all entities are both components and systems at the same time.

This highlights that there is no real difference between the algorithmic and systemic approaches to meeting performance needs. Systemic approaches in a subsystem are merely algorithmic approaches in a super-component, and vice versa.

Simplicity

Systems must battle complexity. No single component of the system should be so big that it can not be understood by its builders or diagnosed by its operators. Said another way, each component should have a clear area of responsibility made concrete by the well-defined interfaces it implements. Maintaining clear areas of responsibility through well-defined interfaces and levels of abstraction not only supports the other system qualities described here, it also limits the complexity of the system, and in turn, its design and implementation. One can drill down from systems to components, then view those components as systems and drill down again.

If the system offers the other qualities described here, simplicity may not be automatic, but it will be within reach.

Thursday, August 10, 2006

On Named Architectures

Ok, let me begin by saying that as a software architect, I don't pay much heed to buzz on things like Service-oriented Architecture (SOA) and the precise definitions of which pieces you need before you get to apply the peel-off sticker to your system. But I happened to be reading The Register this morning and I came across this snippet from a report on the Gartner Hype Cycle:

Event-driven Architecture (EDA) is an architectural style for distributed applications, in which certain discrete functions are packaged into modular, encapsulated, shareable components, some of which are triggered by the arrival of one or more event objects.

That caught my eye, partly because I've never heard of the term "Event-driven Architecture" in capitalized form and partly because it was buried in a report talking about things as nebulous as Web 2.0 and Collective Intelligence. But mostly, my reaction was one of "hey, that's what I've been doing for ages", followed by the quick realization of "how could you build a significantly large system that wasn't 'event driven'?"

So what is it?

Now, I needed to know exactly what was meant by this phrase, because surely an EDA can't just be "a system with events". So off to Google it is. Here are some quotes from what I found; these are all just "top ten" search results.

Event-driven architecture is more a set of guidelines than a product. The active ingredient: a small piece of software called an agent that can sit on a particular machine and watch for something to happen.
...
Can't any application do that?
Probably, but most application techniques rely on a request-reply process in which they send a message to other applications asking for some action or data, then have to wait for the reply before doing anything. Rather than wait for one activity, agents can launch several responses, each of which can be completed independently, often with no further action from the agent. Agents can also be told to watch for a range of events that may or may not happen, and may happen in an order that's hard to predict. "You could do this before, but it took a heroic effort," says Ray Schulte, vice president and research team leader for application integration at Gartner Inc. "So that was limited to a few high-payback applications like trading."

Emphasis mine; www.baselinemag.com

Effective organizations use (usually implicit) contracts that specify what should be communicated and when. The VP is responsible for telling the CEO that a plant has burned; the CEO is not responsible for extracting that information from the VP. Pharmaceutical companies are responsible for telling the FDA when drug trials result in deaths; the FDA is not required to poll the pharmaceuticals to obtain this information. The shared model—the shared expectation—about what is to be communicated is an implicit subscription.
...
In SOAs, the person who needs information is responsible for asking the person who has the information: The client invokes a service on a server, and the server has no responsibility other than to respond to service invocations. In EDA, the client who needs information is responsible for updating subscriptions (models, forecasts, and plans) at the server with the information, and the server is responsible for continuously propagating relevant information. Three aspects of EDA contracts make EDA more powerful than SOA:
...
Some might argue that EDA and the Event Web are nothing but well-known push technology. These skeptics miss the key point: Traditional push technologies and pub-sub systems exploit only a tiny fraction of the power of EDA for four reasons:
...

Emphasis mine; www.developer.com

After reading some of these and other "articles" on EDA, I realized that this is a typical bad reasoning pattern at work. Virtually all of the documents I found take great pains to point out how SOA is just a simple "publish-subscribe" model and it's not as powerful as EDA because of x, y and z, so you should be using EDA. This reasoning is flawed because it presumes that your only choices are these rigid categories of SOA and EDA. Why are seemingly reasonable people thinking like this, I wondered. There are a few points in these articles that need to be addressed.

There's more to life than HTTP

Then, it hit me. These SOA and EDA technologies are all in the all-too-prevalent webbish category, where the biggest (only?) tool in the toolbox is an HTTP request. So, when people are talking about SOA requests, they're really talking about HTTP requests. And they came to the same realization as everyone else I've ever met who used HTTP as their default protocol within their architecture: HTTP doesn't cut it because it is fundamentally request/reply with strict limitations on who is the server and who is the client on a given actual connection. If you want events to flow both ways, you have to do silly things like make requests that just wait until the next event or make HTTP connections in both directions. (I'll bet that the next thing we see [if it doesn't exist already] is that EDA will start adopting SIP as its protocol of choice, since, in some ways, SIP is just "two-way" HTTP.)

The way I would look at such a problem is to ask "why did we pick HTTP if it doesn't meet the needs of our application?" And usually, especially for interfaces where I influence all participants, I don't. And is it "a heroic effort", as claimed in one of the articles I read? No, I just use a communication mechanism appropriate to the application domain.

It's the application, not the architecture

The other major point of confusion for me revolves around the use cases exemplified by the VP needing to tell the CEO of the burning factory or the need to tell stock traders when a stock price has changed. Clearly these are requirements of the application domain, whatever that might be. And since we choose our implementation approach based on our application domain requirements, why are we even looking at a rigid architectural structure like SOA or whatever if it doesn't meet our requirements? That's just foolish.

Even so, the requirements in question here are not architectural requirements; they're application requirements. You want to know when things are on fire without asking. You want to use rules based analysis to trigger stock alerts for ten million clients. Sure. Take those requirements, and put together the interfaces and systems you need to achieve that.

But sometimes we don't know all of the requirements up front. Maybe we already have our SOA-ish system and we want to add something to it that the architectural model doesn't support. Now what do we do? Well, the first step is to fire your architect for choosing such a ridiculously rigid approach that makes such basic first-year assumptions as "everything is request-reply" :). But don't hire the next guy who tells you that "everything is shared-model EDA" or whatever, either. As someone who has built and continues to build large, constantly evolving systems with hundreds of parts, I look at these models as complete academic or marketing bollocks.

Build (or buy, if you must) yourself a set of tools so you can create interfaces of all sorts. Then, when it comes to each interface, you can judge whether (today) it has any or all of request-reply, subscribe-publish, unsolicited-events, store-and-forward queuing, bulk data transfer, and so on. Decide on areas of responsibility, make well-defined interfaces between them using the most appropriate protocols and transports, and build your system. Don't just always use HTTP because it's all you know; learn some more communications techniques. Would you hire a carpenter who only knew how to use one tool? Fill your toolbox and reap the rewards in terms of productivity, quality and performance.
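
As a sketch of the point (and only a sketch, with invented message kinds), here is what it looks like when the interface, rather than the architecture label, decides which of those patterns a connection carries:

    // A sketch of the point, nothing more: if the transport just carries typed
    // messages in both directions, then request-reply, publish-subscribe, and
    // unsolicited events are per-interface decisions, not properties the
    // architecture label forces on you. The message kinds below are invented.
    #include <string>

    enum MessageKind
    {
        Request,        // "do this and tell me the result"
        Reply,          // the matching answer
        Subscribe,      // "keep me updated about this topic"
        Event           // unsolicited: "this just happened", no request involved
    };

    struct Message
    {
        MessageKind kind;
        unsigned    correlationId;  // ties a Reply to its Request; 0 otherwise
        std::string topic;          // what the message is about
        std::string body;           // payload, in whatever encoding the interface defines
    };

    // Either side of a connection implements this; neither side is "the client".
    class Peer
    {
    public:
        virtual ~Peer() {}
        virtual void send(const Message& m) = 0;
        virtual void onMessage(const Message& m) = 0;  // delivered by the event loop
    };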

Monday, August 07, 2006

Common Approaches versus Creativity

As a software architect, I need to help a number of teams of developers build systems towards a common goal and future. A few years back, I tried to capture what I was aiming for in terms of a balance between commonality and creativity. I think it still holds true, so here it is:

Having a common approach is not meant to discourage individuality. Individuality and creativity are both actively encouraged and sought out. After all, it is from such creativity that our current infrastructure arose. It is important to realize that the current infrastructure is the result of factoring out the common needs of what were originally separate approaches. Effort was then applied to these factored-out needs to create high-quality reusable components that meet "the common need".

Therefore, when a solution has a need, we analyze it to see if it is met by the common infrastructure. When it is, we conform and use the existing reusable components consistently across the team. When it isn't, we solve it in the specific case, but make note of it. When future solutions arise with similar needs, we may factor out the solutions' specific needs to see what they can contribute to the common infrastructure. Sometimes they can, and we get new high-quality reusable components, and sometimes they're solution-specific (at the moment) and they stay that way.

The philosophy behind this is that the code is owned by the group and not by the individual, because it must be maintained by the group. Therefore it should meet the needs of the group, and therefore conform where appropriate. However, to remain agile, the group must adapt to new and different and conflicting needs. Those needs are best met by harnessing individual (and small-group) creativity, which feeds back into the common code base.

Evolution of the common approach benefits the larger group. We try to guide the creativity of the individual to support the evolution of the larger group. Evolution of the individual without respect for the group threatens to undermine the group and the common approach. Recognized evolution of the individual in support of the group leads to a better group and expands and updates the common approach.

We encourage justified creativity and individuality in the context of building a common infrastructure. Simply put, the trick is to encourage creativity but to not do things in a different way for no other reason than to be different. A common infrastructure and a common approach both have benefits that should not be ignored.

What goes into a good toolbox?

Let's start with this:

make HOST=i586-mingw32msvc

That's my kind of cross-platform portability. It's got two things going for it:

  1. It's cross-compilation, so you don't need to keep multiple development environments or a single complicated development environment that somehow supports multiple system types.
  2. It implies that your code base "just compiles" and "just works" on different platforms.

My development environment of choice is Linux, which makes sense given my development focus. But nowadays, it's an even more natural choice because you can create cross toolchains for dang-near everything, including Windows. This means I can use emacs, bash, GNU make, perl, and everything in their native environments to target any (C++) environment. There's a real benefit from this in terms of maintaining the toolchain and build system because even though the code you're building could be targeted to run wherever, you know that the build is running on Linux. So perl is there. And the slashes lean left. And you can unlink files while they're open. And so on. So all the scripty-makefiley-configury stuff you bake into your build environment doesn't have to be cross-platform. Just the code does.

But what about the code?

The code is the trickier part, of course. So, as with any problem of this sort, you encapsulate the concept that varies: in this case, the operating system. You could go grab one of those Portable Runtime environments like NSPR or APR, but they always seem to me to look like some other obscure, deceivingly familiar operating system API, and the baggage level tends to be high.

It turns out that every operating system I would ever dream of developing for is close enough to POSIX as to make no odds. Because seriously, you've got POSIX (UNIX, including Mac) and then you've got Windows, which seems to be slowly becoming POSIX. So for my money, making things look as POSIX-y as possible is Plenty Good. For Dave and me, this is made easier by the fact that we never intend to use threads, and we fully intend to write our own asynchronous full-performance networking subsystems (for example). Over time, we've refined a suite of libraries full of objects, functions, and other types that wrap up the OS-level APIs, where necessary, as thinly as possible (but not too thinly).

For example, we register a signal handler via the Resource Acquisition Is Initialization idiom to make it simpler and safer to use. That doesn't add any real cost and offers oodles of benefit. An example of wrapping things more would be making a real class to represent IPv4:port address pairs instead of using raw struct ::sockaddr_in types. That's a bit harder to get right, but if you make the right choices about responsibility delineation, you get there over time.
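
For the signal-handler example, a sketch of the shape of the idea (not our actual class) looks something like this:

    // Sketch of the shape of the idea, not our actual class: register a signal
    // handler for a scope and restore the previous handler on the way out.
    #include <signal.h>

    class ScopedSignalHandler
    {
    public:
        ScopedSignalHandler(int signum, void (*handler)(int))
            : signum_(signum)
        {
            struct sigaction sa;
            sa.sa_handler = handler;
            sa.sa_flags = 0;
            sigemptyset(&sa.sa_mask);
            sigaction(signum_, &sa, &previous_);   // install, remembering the old one
        }

        ~ScopedSignalHandler()
        {
            sigaction(signum_, &previous_, 0);     // restore on destruction, always
        }

    private:
        // non-copyable: exactly one owner of the registration
        ScopedSignalHandler(const ScopedSignalHandler&);
        ScopedSignalHandler& operator=(const ScopedSignalHandler&);

        int signum_;
        struct sigaction previous_;
    };

A caller just declares one of these wherever the handler should be active (for instance, ScopedSignalHandler ignorePipe(SIGPIPE, SIG_IGN);) and the destructor puts things back no matter how the scope is left.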

A much larger example would be wrapping up sockets. Ahh, yes. Every third-year C++ student taking a networking course has written a socket class. You can download or buy dozens, if not hundreds or thousands, of such things, some written by professional programmers. And virtually all of them suck, at least in one way that you really care about. It's because they were probably written, like most such software, to solve some problem quickly so that someone can get on to the real work they wanted to do. Then, once they are written, they are put in the "glad I don't have to think about that again" bucket and forgotten. And now, we are at the nub of the toolbox issue.

Building a software toolbox is its own "real work".

By that, I mean that it is a task unto itself to be savoured and enjoyed. We must see that building the software toolbox is valuable (and fun) because if done properly it can achieve that desired end state where we "don't have to think about that again (for a while)". If you're going to write an asynchronous, single-threaded C++ socket library that supports TLS (or other extra transport layers), full-speed write queue management (with memory relinquishment), asynchronous DNS, and a serialization interface, you have to know some of the issues up front. Namely, the networking ones.

Nobody starts out wanting all those things. And I wouldn't argue that (without serious previous experience building them) you should aim to end up with them in any short order. I do argue that a properly designed small library will easily allow such features to be added with minimal API disruption and no performance loss, most of the time. It's a simple balance between two software development rules:

  1. You can't know everything up front.
  2. The more you know up front, the better decisions you can make.

These statements are obviously true. If I know all of the issues surrounding plain sockets, TLS, DNS, and write buffering (for example), I can probably make some good design decisions out of the gate and implement in a fairly straight line. But in order to know all of those issues, I've probably had to have done it or something like it before. So, I should accept that usually, I can't hit all the targets all at once, and I should start out by aiming for something smaller. Most programmers I've worked with reach this conclusion, but I encourage us all to go one step further and embrace the principle of Solve the whole problem.

To solve the whole problem, we must know the whole problem. So when we make the responsible decision to, say, not try to implement all of our complex socket features at once, and instead just focus on getting "the basics" working, let's make sure we really understand the basics. And let's make sure we really think about what kind of interface we want to offer to our users (the first three of whom will be us, us and us as we implement the other features on our TODO list). In this case, that means cracking the Stevens books and really getting to know all about sockets and POSIX blocking primitives. Then, use that knowledge and understanding to offer a powerful interface that keeps easy things easy and makes hard things less hard (or full-blown easy if you can manage it). Learn it all so your library's users don't have to. Deal with all the nasty issues so your users don't have to. Own the problem--the whole problem--and solve it. That's how you multiply productivity.
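
To give a feel for "easy things easy, hard things less hard", here is a sketch of what such an interface might look like. Every class and method name below is invented for illustration; it is not a real library's API.

    // A sketch, not our actual library: the shape of an asynchronous,
    // single-threaded socket interface where the easy case stays easy.
    #include <string>

    // What the user of the library writes: plain callbacks, with no threads,
    // no poll() bookkeeping, and no partial-write handling in sight.
    class ConnectionHandler
    {
    public:
        virtual ~ConnectionHandler() {}
        virtual void onConnected() = 0;
        virtual void onData(const std::string& bytes) = 0;
        virtual void onClosed(const std::string& reason) = 0;
    };

    // What the library owns: the socket, the write queue (with memory
    // relinquishment when it drains), optional TLS, asynchronous DNS.
    class AsyncConnection
    {
    public:
        virtual ~AsyncConnection() {}

        // Name resolution, connection set-up, and buffering are the library's
        // problem, not the caller's.
        virtual void connect(const std::string& host, unsigned short port,
                             ConnectionHandler& handler) = 0;

        // Never blocks; the library queues and flushes as the socket allows.
        virtual void send(const std::string& bytes) = 0;

        virtual void close() = 0;
    };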

Some Principles

I decided that before I post further thoughts, it's important to talk about four of the important principles Dave and I use to guide our thinking. I've reproduced them here from a post on pliantalliance.org. You may want to visit there to see some of the discussion. The text below was written in July 2003.

Solve the real problem.

Whenever possible, solve the real problem. It's hard to explain it better than the following. [Yes, I'm going to quote a "capital-A" Agile guy. It's the dogma that bothers us, not all of the ideas. I'm not entirely sure where this quote of his idea is originally from.]

Managing Technical Debt

Ward Cunningham sometimes compares cleaning up the design with paying off debts. Going further, he discusses managing the technical debt on the project.

Making hasty additions to the system corresponds to borrowing against the future, taking on debt. Cleaning up the design corresponds to paying off debt.

Sometimes, he points out, it is appropriate to take on debt and make hasty changes in order to take advantage of opportunity. Just as debt accumulates interest and grows over time, though, so does the cost to the project of not cleaning up those hasty design changes.

Cut corners in the design, he suggests, when you are willing to take on the debt, and clean up the design to pay off the debt before interest grows too high.

Solve the whole problem.

This is the corollary to solve the real problem. If you are building a function, a class, a library, an application, or a system, by solving the real problem you make sure you clearly delineate the areas of responsibility of yourself and others using well-defined interfaces. By solving the whole problem, you ensure that the well-defined interface best represents the most appropriate division of responsibility. This idea can be represented by the notion that a system component should make the work it does significantly easier for clients than doing it themselves would be.

The implementation is more flexible than the interface.

If you have well-defined interfaces dividing areas of responsibility, then you “write to the interface”. Doing this properly and taking advantage of the abstraction requires designing the interface to match the division of responsibility (i.e. solving the whole problem) and then writing the code to implement as much of that interface as is necessary (maybe all of it). The wrong thing to do is to write an interface that matches (read “exposes”) your current implementation out of convenience and then be jailed to that implementation forever.

Never let perfect stand in the way of very good.

It may be tempting to wait until you have the time, the knowledge, or the inclination to design the “perfect” solution, especially if you aim to solve the real problem. However, you must temper this instinct with practicality. While it may look like you are holding out for “the perfect solution” (that solves the whole, real problem), what you are really doing is preventing “the very good solution” that can be applied in the interim. We must recognize that perfection can only be approached asymptotically through evolution of design and implementation: in essence by refining our deployed very good solutions to make them more perfect. Solving the real problem should be applied where scope allows and it should guide our path to tell us where perfect is, but we are allowed to get there in more than one step. Very good solutions have value, and value delayed is value lost.

Sunday, August 06, 2006

Solve The Real Problem

First off, thanks to Tim Beck for being so generous as to grant Dave Benoit and me co-founder credit in the Pliant Alliance. If you've ever groaned at "capital-A" Agile software development, you'll want to head over and read the Pliant Alliance blog for some relief. And don't worry, you're not in for a Yet Another Way To Do Everything lesson, or at least not a very long one. For example, this is my favourite entry from the FAQ:

How do I do Pliant Software Development?

Pliant Software Development is easy to implement. All you have to do is be willing to change the way you are doing software development if what you are doing now is not working for you.

Basically, the concept is so simple as to be boring: "If what you're doing now works, keep doing it. If not, do something else." Your ancestors (at least as a whole population) used this principle quite successfully to come up with you. Now you can use it to build software. Huzzah.

I'll be posting musings about software development as *gasp* I actually develop some serious software. I mainly write professional single-threaded asynchronous C++ on POSIX (usually Linux) to do (soft) real-time servers, clients, networking, databases and whatever else needs to happen to get things done. So, expect to see commentary on such software based on my experience building real million-scale highly-available systems, much of which has also been with Dave Benoit (who will join me here soon).

We're working on some good stuff these days, building up the kind of toolbox we've used previously to launch dozens of real, successful online services with the mothership. We're doing a lot of the "if I did this again, I'd do it this way" kind of design without falling into the classic "Version 2.0" pit of despair. We follow an evolutionary software development model where we own the same designs and code for years (and hopefully decades soon). We keep them up to date by growing and pruning the software with care over many years, so what we're building now is really "Iteration Step N", which means it comes with "all my lessons learned".