Solve The Real Problem

Discussions about professional software development with a focus on building real, solid, performing, reliable, monitorable large-scale online services in asynchronous C++. You know, good solid breakfast food.

Wednesday, February 17, 2010

More Giants' Shoulders

The 2010 issue of Core arrived the other day, and a statement by Steve Blank, one of the contributors, caught my eye:

Each generation assumes it is inventing the future, with no recollection that it's already been done.

To me, that's a reminder that there are very few problems that are truly new. Look for lessons from those who have gone before and see how they can apply to what you're doing now.

For example, I've also been reading Classic Feynman, which truly is "all the adventures of a curious character", as the subtitle says. The stories Feynman tells about his life are fascinating, and the tone and frankness are fun to read. Near the end of the book, there's a reproduction of his (well-known) appendix to the report from the Presidential Commission that looked into the 1986 Challenger disaster. There are two paragraphs that stood out to me. They appear together in the book, but I've commented on each separately below.

The usual way that such engines are designed (for military or civilian aircraft) may be called the component system, or bottom-up design. First it is necessary to thoroughly understand the properties and limitations of the materials to be used (for turbine blades, for example), and tests are begun in experimental rigs to determine those. With this knowledge larger component parts (such as bearings) are designed and tested individually. As deficiencies and design errors are noted they are corrected and verified with further testing. Since one tests only parts at a time, these tests and modifications are not overly expensive. Finally one works up to the final design of the entire engine, to the necessary specifications. There is a good chance, by this time, that the engine will generally succeed, or that any failures are easily isolated and analyzed because the failure modes, limitations of materials, etc., are so well understood. There is a very good chance that the modifications to the engine to get around the final difficulties are not very hard to make, for most of the serious problems have already been discovered and dealt with in the earlier, less expensive, stages of design.

Replace engines with software systems, turbines and fan blades with modularized libraries of components, test rigs with automated unit tests, and you have the basic recipe for successful software development. Start with a good foundation of components that you understand (in terms of both capabilities and limitations), and build on it in a structured, methodical manner to produce more and more capabilities without being swamped by complexity, and thus bugs. This is the philosophy that I and the other founders of Starscale share, and we've seen it succeed in several environments.
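To make that recipe a little more concrete, here is a minimal sketch in C++. The names (RingBuffer, MovingAverage, and the two test functions) are made up for illustration and are not taken from any real codebase. The idea is only this: a small component whose limitation is stated up front and tested on its own, then a slightly larger component built on top of it and tested in turn.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical low-level component: a fixed-capacity ring buffer.
// Known limitation, stated up front: it holds at most N items and
// overwrites the oldest item once it is full.
template <typename T, std::size_t N>
class RingBuffer {
public:
    static_assert(N > 0, "capacity must be positive");

    void push(const T& value) {
        items_[next_] = value;
        next_ = (next_ + 1) % N;
        if (size_ < N) ++size_;
    }

    std::size_t size() const { return size_; }

    // Index 0 is the oldest item still retained.
    const T& at(std::size_t i) const {
        assert(i < size_);
        const std::size_t oldest = (next_ + N - size_) % N;
        return items_[(oldest + i) % N];
    }

private:
    std::array<T, N> items_{};
    std::size_t next_ = 0;  // slot the next push will write to
    std::size_t size_ = 0;  // number of valid items, never more than N
};

// Hypothetical higher-level component built on the tested part above:
// the average of the last N samples.
template <std::size_t N>
class MovingAverage {
public:
    void add(double sample) { window_.push(sample); }

    // Returns 0.0 when no samples have been added yet.
    double value() const {
        if (window_.size() == 0) return 0.0;
        double sum = 0.0;
        for (std::size_t i = 0; i < window_.size(); ++i) sum += window_.at(i);
        return sum / static_cast<double>(window_.size());
    }

private:
    RingBuffer<double, N> window_;
};

// The "test rig" for the small part: exercise its limit directly.
bool test_ring_buffer() {
    RingBuffer<int, 3> rb;
    if (rb.size() != 0) return false;
    rb.push(1);
    rb.push(2);
    rb.push(3);
    if (rb.size() != 3 || rb.at(0) != 1 || rb.at(2) != 3) return false;
    rb.push(4);  // buffer is full, so this overwrites the oldest item (1)
    return rb.size() == 3 && rb.at(0) == 2 && rb.at(2) == 4;
}

// The test for the larger part can lean on behavior already verified above.
bool test_moving_average() {
    MovingAverage<2> avg;
    if (avg.value() != 0.0) return false;
    avg.add(2.0);
    avg.add(4.0);
    if (avg.value() != 3.0) return false;
    avg.add(6.0);  // window is now {4, 6}
    return avg.value() == 5.0;
}
```

Calling test_ring_buffer() and test_moving_average() from whatever test runner you use should return true for both. If the average ever comes out wrong, the buffer's overwrite behavior has already been checked in isolation, so the search starts in a much smaller place.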

The subsequent paragraph is as follows.

The Space Shuttle Main Engine was handled in a different manner—top down, we might say. The engine was designed and put together all at once with relatively little detailed preliminary study of the material and components. Then when troubles are found in the bearings, turbine blades, coolant pipes, etc., it is more expensive and difficult to discover the causes and make changes. For example, cracks have been found in the turbine blades of the high pressure oxygen turbopump. Are they caused by flaws in the material, the effect of the oxygen atmosphere on the properties of the material, the thermal stresses of startup or shutdown, the vibration and stresses of steady running, or mainly at some resonance at certain speeds, etc.? How long can we run from crack initiation to crack failure, and how does this depend on power level? Using the completed engine as a test bed to resolve such questions is extremely expensive. One does not wish to lose an entire engine in order to find out where and how failure occurs. Yet, an accurate knowledge of this information is essential to acquire a confidence in the engine reliability in use. Without detailed understanding, confidence cannot be attained.

This so accurately describes the bad software systems I've seen (and in some cases, replaced) that I was taken aback. Think of the over-featured, under-tested, mis-designed, over-generalized, poorly-implemented, bug-ridden monsters you've worked on, and what happens when someone finds a bug that uncovers a fundamental flaw in the design of some subsystem or the entire thing. It's a case of uninformed design, where classic errors have been made, such as presuming everything will work out, and the details can be dealt with later and can't possibly influence the overall system's shape. How can you design something if you don't know what it's even made of?

With software, we can make free copies when testing it as a whole (unlike the engine), but instead we pay the cost of seeing through the complexity to actually diagnose a failure. Most often, poorly designed systems have poorly designed diagnostics, and tracking down a failure is a whole lot harder when you don't have something as obvious as a cracked fan blade to start with. Plus, the big risk is the same: that what we've built is no good. In both cases, throwing away the design is far more expensive than throwing away a copy.

If we design with components that we understand, we can attain confidence that the system will work, and that it will work well.