Solve The Real Problem: Desirable System Qualities

This is a brief discussion of general desirable system qualities that we seek when designing systems. These ideas took this solid form in September 2003 when I was aiming to document how I think about design as part of an effort to spread new ideas within the mothership. Reading over it today, it's interesting to note how if you're in something for the long haul and you want to keep people interested past the hype, you need quality. For online services (especially, but not exclusively), the following are definitely the kinds of qualities you need to aim for and achieve to earn long-term success.

Systems have different jobs to do and different requirements. However, there is a set of qualities that all systems can seek to offer.

Scalability

Because it is often impossible to know the number of clients or eventual load requirements of a system, we must design in a way that allows the system to grow to meet future needs.

Different pieces of any system scale differently. Some areas of responsibility are more costly than others or are used more often, and these will need to "scale more" than other pieces. Key to dealing with this is a system design that allows for this growth by decoupling the system's components via well-described interfaces across various hardware platforms and instances. Said another way, well-defined division of responsibility is essential to offering systemic scalability.
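To make this concrete, here is a minimal Python sketch. Every name in it (KeyValueStore, InMemoryStore, ShardedStore, record_login) is hypothetical and invented for illustration; none of it comes from a real system. It assumes the area of responsibility is simple key/value storage, and shows that a caller written against an interface does not change when that responsibility is spread over more instances.

```python
import zlib
from typing import Dict, List, Optional, Protocol


class KeyValueStore(Protocol):
    """Hypothetical interface: the only thing callers depend on."""

    def put(self, key: str, value: str) -> None: ...
    def get(self, key: str) -> Optional[str]: ...


class InMemoryStore:
    """A single instance that handles every key."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class ShardedStore:
    """Same interface, but keys are spread across several instances."""

    def __init__(self, shards: List[KeyValueStore]) -> None:
        if not shards:
            raise ValueError("at least one shard is required")
        self._shards = shards

    def _shard_for(self, key: str) -> KeyValueStore:
        # crc32 is stable across runs, unlike the built-in hash() for str.
        index = zlib.crc32(key.encode("utf-8")) % len(self._shards)
        return self._shards[index]

    def put(self, key: str, value: str) -> None:
        self._shard_for(key).put(key, value)

    def get(self, key: str) -> Optional[str]:
        return self._shard_for(key).get(key)


def record_login(store: KeyValueStore, user: str) -> None:
    """Caller code: unaware of how many instances sit behind the interface."""
    store.put(f"last_login:{user}", "now")
```

Swapping InMemoryStore() for ShardedStore([...]) requires no change to record_login. That is the sense in which the interface, not the caller, absorbs the growth.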

Performance

All systems have performance requirements, and many systems need to perform in realtime or near realtime. Meeting performance requirements often translates to a need for scalability: by distributing responsibility, more computing power can be applied to a given task. Meeting realtime performance requirements means distributing areas of responsibility in order to control the time needed to perform those responsibilities.

Performance can be thought of as being achieved via algorithmic and systemic means. Components need to choose good algorithms to perform work within their areas of responsibility. Systems need to connect the right components using the right interfaces to offer the right information at the right time.

High Availability

All systems have components that fail; designing systems that do not fail as a consequence of a few component failures is key to offering high availability. This provides another reason for clear divisions and distributions of responsibility, as any responsibility that is not distributed, and thus resides in one component, leaves the system susceptible to failure of that component.

When parts of a system fail, replacements must be pressed into use manually or automatically. Manual replacement is acceptable in some circumstances, but it must be supported by good detection and reporting of problems within contexts that quickly and easily identify failed components. Automatic replacement is something that needs to be part of the design: components must expect other components to fail, and they must deal with that failure realistically.
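A small Python sketch of the automatic case follows. It is only an illustration under stated assumptions: a replica is modeled as a plain callable, and call_with_failover, AllReplicasFailed, and the report callback are hypothetical names, not part of any real library. The caller expects replicas to fail, reports each failure with enough context to identify the failed component, and moves on to the next one.

```python
from typing import Callable, List, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def call_with_failover(
    replicas: Sequence[Callable[[str], str]],
    request: str,
    report: Callable[[int, Exception], None],
) -> str:
    """Try each replica in order, report every failure, return the first success."""
    for index, replica in enumerate(replicas):
        try:
            return replica(request)
        except Exception as exc:  # a real system would catch narrower errors
            report(index, exc)
    raise AllReplicasFailed(f"no replica could handle {request!r}")


def broken(request: str) -> str:
    # Stand-in for a failed component (assumption for the example).
    raise ConnectionError("primary is down")


def healthy(request: str) -> str:
    # Stand-in for a working backup component.
    return request.upper()


failures: List[str] = []
result = call_with_failover(
    [broken, healthy],
    "ping",
    report=lambda i, exc: failures.append(f"replica {i}: {exc}"),
)
```

Here the request is served by the second replica, and the failures list records which component failed and why. That record is the detection and reporting that a manual replacement would also rely on.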

Some systems require near zero information loss. Other systems can lose some information provided they "cut over" to backup components. Real systems require both to some degree within various areas of responsibility. Knowing when loss of availability is acceptable is just as important as knowing when it is not. Knowing which information is "system-critical" and which information can be lost is also important. Systems need to preserve the information that matters and they need to recover from failure to continue to provide service as much as is necessary.

Levels of Abstraction

To actually carry out its duties, the system must know the intricate details of the systems with which it interfaces and the tasks it performs in its lowest-level components. To be useful and successful, it must know how to connect components to meet the system's overall goals and carry out high-level tasks (and implement high-level interfaces).

Through well-defined interfaces, components must offer to other components functionality that falls into their area of responsibility. By taking responsibility for an area, the component agrees to do "real work" in that area, thereby simplifying it for the rest of the system via an interface. A component must not expose unnecessary detail to other components; doing so would limit the system's ability to have these qualities. A component must communicate in terms of its area of responsibility as seen from the outside. Well-defined interfaces designed to match the external view of that area of responsibility support this requirement.

Aggregates of components may form subsystems (as seen from the bottom) and super-components (as seen from the top). The role of any particular entity within the system is a matter of perspective. That is, all entities are both components and systems at the same time.
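One way to picture this in code is an aggregate that implements the same interface as its parts. The sketch below is an assumption-laden illustration: Component, Pipeline, Strip, Lowercase, and Exclaim are hypothetical names chosen for the example, and message handling is reduced to string transformation.

```python
from typing import List, Protocol


class Component(Protocol):
    """Hypothetical interface shared by parts and aggregates alike."""

    def handle(self, message: str) -> str: ...


class Strip:
    def handle(self, message: str) -> str:
        return message.strip()


class Lowercase:
    def handle(self, message: str) -> str:
        return message.lower()


class Exclaim:
    def handle(self, message: str) -> str:
        return message + "!"


class Pipeline:
    """An aggregate of components that is itself a component."""

    def __init__(self, parts: List[Component]) -> None:
        self._parts = parts

    def handle(self, message: str) -> str:
        # Pass the message through each part in order.
        for part in self._parts:
            message = part.handle(message)
        return message


# Seen from the bottom, normalize is a subsystem of two parts.
normalize = Pipeline([Strip(), Lowercase()])
# Seen from the top, it is just one more component inside a larger system.
system = Pipeline([normalize, Exclaim()])
```

Nothing in system needs to know whether normalize is a single component or an aggregate, which is the matter of perspective described above.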

This highlights that there is no real difference between the algorithmic and systemic approaches to meeting performance needs. Systemic approaches in a subsystem are merely algorithmic approaches in a super-component, and vice versa.

Simplicity

Systems must battle complexity. No single component of the system should be so big that it cannot be understood by its builders or diagnosed by its operators. Said another way, each component should have a clear area of responsibility made concrete by the well-defined interfaces it implements. Maintaining clear areas of responsibility through well-defined interfaces and levels of abstraction not only supports the other system qualities described here, it also limits the complexity of the system and, in turn, of its design and implementation. One can drill down from systems to components, then view those components as systems and drill down again.

If the system offers the other qualities described here, simplicity may not be automatic, but it will be within reach.


Monday, August 14, 2006
