October 2003 is the month that marks the retirement of Concorde, a wonderful technical (and commercial) achievement. It was a nostalgic moment for me because in 1969 I worked on the (abandoned) Boeing supersonic aeroplane, the SST. There was a lot of misunderstanding in the Concorde versus SST debate; the SST was a huge plane, much bigger than a 747, and the two were complementary rather than competitive. If the SST had succeeded there would probably have been a lot of Concordes flying as well.
As a control engineer I was personally working on the design of digital feedback systems for the pitch-axis control, but I was privileged to communicate with other leading-edge developers, some of whom were working on fault-tolerant systems. The basic technique in an aeroplane is to have three of everything, continually monitored and compared; if any component differs from its partners it is taken out of service until it can be repaired. This technology didn't work at the time because the electronics needed to perform the comparisons in 1969 introduced more unreliability than they removed. Today it is different: the technology is available and well developed.
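To make the idea concrete, here is a minimal sketch of that triple-redundancy comparison. It is purely illustrative: the Channel class, the tolerance value and the voting logic are my own inventions for this column, not the scheme used in any real avionics system.

```python
# Minimal sketch of triple-redundancy voting: three channels are compared,
# and a channel that disagrees with both of its partners is taken out of
# service. Illustrative only; not a real avionics implementation.

from dataclasses import dataclass

@dataclass
class Channel:
    name: str
    in_service: bool = True

def vote(readings: dict[str, float], channels: dict[str, Channel],
         tolerance: float = 0.01) -> float | None:
    """Compare the readings of the in-service channels and return the agreed value."""
    active = [n for n, c in channels.items() if c.in_service]
    if len(active) < 2:
        return None  # not enough healthy channels left to compare

    values = {n: readings[n] for n in active}
    for name in active:
        others = [values[n] for n in active if n != name]
        # A channel is considered faulty if it differs from every remaining partner.
        if all(abs(values[name] - v) > tolerance for v in others):
            channels[name].in_service = False          # take it out of service
            return sum(others) / len(others)

    # All channels agree within tolerance: use the average.
    return sum(values[n] for n in active) / len(active)

# Usage: channel C disagrees with A and B, so it is taken out of service.
channels = {n: Channel(n) for n in ("A", "B", "C")}
print(vote({"A": 5.01, "B": 5.00, "C": 9.73}, channels))   # 5.005
print(channels["C"].in_service)                            # False
```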
Correcting an error once it has occurred requires a lot of redundancy. That is essential in something like an aeroplane or a nuclear power station, but it would be valuable in many other systems too. Thus over the past 20 years there have been many advances in related techniques for predicting potential failures before they happen. Some systems are extremely complex, so the large volume of data generated by monitoring the sub-systems is very difficult to analyse, but today's technology is capable of predicting and finding faults buried deep inside a system.
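As a rough illustration of the principle, a monitoring system can flag a sub-system whose readings drift outside their normal statistical band long before it fails outright. The sketch below is a deliberately simple version of that idea; the window size and the sigma threshold are arbitrary choices of mine, not figures from any real plant-monitoring product.

```python
# Illustrative sketch of predictive monitoring: warn when a reading strays
# well outside the recent "healthy" band, before the component fails outright.

from collections import deque
from statistics import mean, stdev
import random

class DriftDetector:
    def __init__(self, window: int = 100, sigmas: float = 4.0):
        self.history = deque(maxlen=window)   # recent healthy readings
        self.sigmas = sigmas

    def update(self, reading: float) -> bool:
        """Return True if the reading looks anomalous (a potential failure)."""
        if len(self.history) >= 10:
            mu, sd = mean(self.history), stdev(self.history)
            if sd > 0 and abs(reading - mu) > self.sigmas * sd:
                return True                   # warn: the sub-system is drifting
        self.history.append(reading)
        return False

# Usage: a sensor behaves normally, then starts to drift after sample 150.
det = DriftDetector()
for i in range(200):
    r = 20.0 + random.gauss(0, 0.1) + (0.5 * (i - 150) if i > 150 else 0.0)
    if det.update(r):
        print(f"sample {i}: possible impending failure (reading {r:.2f})")
```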
These techniques are now applied to mechanical structures, factory production lines, chemical plants and the like, all characterised by high capital value. But there are endless lower-cost systems that could also benefit, such as motor cars, heating systems, refrigerators and other mass-produced products. There are plenty of opportunities in the IT field!
Fault tolerance is not new to IT, but it can be taken a lot further yet. Redundant components, in particular memory, were developed for mainframe computers and are now readily available on departmental servers, even to some extent on PCs. RAID disc arrays, in various combinations, provide fault-tolerant disc sub-systems. Redundant power supplies are options on servers. There are also techniques to help with correction, such as "hot swap" components. Fault-tolerant computers were a mandatory requirement for mission-critical transaction systems, particularly those supporting ATMs. Tandem and Stratus were the leading lights, developing some splendid technology. Tandem, for instance, initiated two instances of a procedure on two machines, only one of which executed (no wasted processing); when the procedure completed it sent a message to the other. If, however, the dormant procedure didn't receive that message within a set time, something had gone wrong and it then executed its own copy.
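That Tandem pattern is easy to sketch: a primary and a dormant backup, a completion message, and a timeout that triggers the backup's copy if the message never arrives. The code below is my own illustration of the idea, using threads and a queue to stand in for two machines and their message link; it is not Tandem's actual implementation.

```python
# Sketch of the primary/backup pattern: the backup holds a dormant copy of
# the work and only runs it if no completion message arrives in time.

import queue
import threading

def run_pair(work, deadline_seconds: float = 2.0):
    done = queue.Queue()                # stands in for the inter-machine message link

    def primary():
        result = work()                 # only the primary actually executes
        done.put(result)                # "I have completed" message to the backup

    def backup():
        try:
            return done.get(timeout=deadline_seconds)
        except queue.Empty:
            # No message within the deadline: assume the primary has failed
            # and execute the dormant copy.
            return work()

    threading.Thread(target=primary, daemon=True).start()
    return backup()

# Example: the primary completes normally, so the backup never runs its copy.
print(run_pair(lambda: "transaction committed"))
```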
Network systems and modern mainframes have gone further in using prediction techniques. Network systems monitor traffic and dynamically re-route packets from congested paths to others, optimising throughput. IBM and Cisco have jointly announced an initiative to develop open software targeted at the next generation of "self-healing" systems. Quite how open it will be remains to be seen, but it is a natural progression from my 1969 experience to apply derivatives of the technology to improving the overall operation of complex computer systems. Watch also for software from SmartSignal Corp, a spin-off from the University of Chicago.
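The re-routing idea can be shown with a toy example: weight each link by its monitored utilisation so that congested links become expensive, and recompute the cheapest route as the figures change. The topology and utilisation numbers below are invented for illustration; real routing protocols are considerably more sophisticated.

```python
# Toy illustration of congestion-aware re-routing: Dijkstra over link costs
# that grow with monitored utilisation, so traffic is steered away from
# heavily loaded links. Topology and figures are invented for the example.

import heapq

def cheapest_route(graph: dict[str, dict[str, float]], src: str, dst: str):
    """Return (cost, path) for the cheapest route from src to dst."""
    frontier = [(0.0, src, [src])]
    seen = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, util in graph[node].items():
            # A link at 90% utilisation costs far more than one at 20%.
            link_cost = 1.0 / max(1e-6, 1.0 - util)
            heapq.heappush(frontier, (cost + link_cost, nxt, path + [nxt]))
    return float("inf"), []

net = {
    "A": {"B": 0.9, "C": 0.2},   # A->B is 90% utilised, A->C only 20%
    "B": {"D": 0.1},
    "C": {"D": 0.3},
    "D": {},
}
print(cheapest_route(net, "A", "D"))   # routes around the congested A->B link
```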
Martin Healey, pioneer of the development of Intel-based computers and client/server architecture, is a director of a number of specialist IT companies and an Emeritus Professor of the University of Wales.