verification

The following is taken from

Peter H. Roosen-Runge. Software Verification Tools. 2000.

Why do we need software verification tools?

Software errors are expensive. It has been estimated that in North America perhaps $3 billion a year is lost in “crashes”, with the average cost of a major outage running at $330,000. The airlines estimate that every minute a reservation system is down costs $70,000. Yet the error rate in programming resists any significant reduction: programmers are estimated to make about 50 errors per 1000 lines of code (regardless of programming language), with fully tested market versions running at about 3 errors per 1000 lines. (Programs written in high-level languages are much shorter than equivalent low-level code, so a high-level language is a much safer programming tool than a low-level language would be; but, at the same time, the use of high-level languages makes feasible the development of much larger, more complex programs than could be attempted with low-level languages, so the error rate per program does not necessarily decrease.) We present a few examples of major software errors to illustrate the very serious consequences which can arise from the smallest of errors in program logic, and the value to be attached to verification tools if they can reduce the number of such failures:

The Bank of New York disaster

The following article describes a famous computer “foul-up” triggered by a simple programming error: the use of a 2-byte counter for the number of messages waiting. The counter had a maximum value of +32,767; adding 1 to this maximum produced a value which the software interpreted as 0, indicating that no messages (in this case, orders for bond trades) were waiting! Note how the original software error became compounded into a further, more disastrous error, and the complex way the error affected the environment of interest rates and bond sales.
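To make the mechanism concrete, here is a minimal sketch in C (purely illustrative; the bank's actual types and logic are not given in the article) of how a 16-bit signed counter behaves when it is pushed past +32,767:

  /* A minimal sketch (not the bank's actual code) of how a 16-bit
   * message counter behaves when incremented past its maximum.
   * Counter width and types here are illustrative assumptions. */
  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      int16_t waiting = INT16_MAX;          /* 32,767 messages queued */
      printf("before: %d\n", waiting);

      /* The increment promotes to int (32,768); converting back to a
       * 16-bit signed type is implementation-defined in C and wraps to
       * -32,768 on ordinary two's-complement machines.  Software that
       * only checks "waiting <= 0" would then conclude that no orders
       * were pending, which is how the article describes the failure. */
      waiting = (int16_t)(waiting + 1);
      printf("after:  %d\n", waiting);

      if (waiting <= 0)
          printf("queue reported empty - orders silently dropped\n");
      return 0;
  }

The wrapped counter is the “simple programming error”; the compounded disaster came later, when the corrupted database was copied over its own backup.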

Quoted from an article on the risks mailing list, now run as the comp.risks Usenet newsgroup:

“Date: Fri 29 Nov 85 00:43:47-EST
….
From the Wall Street Journal, Monday 25 November 1985 [quoted without permission],
A Computer Snafu Snarls the Handling of Treasury Issues, by Phillip L. Zweig and Allanna Sullivan, Staff Reporters of the Wall Street Journal. NEW YORK - A computer malfunction at Bank of New York brought the Treasury bond market's deliveries and payments systems to a near-standstill for almost 28 hours Thursday and Friday. … The foul-up temporarily prevented the bank, the nation's largest clearer of government securities, from delivering securities to buyers and making payments to sellers. … The malfunction was cleared up at 12:30 p.m. EST Friday, and an hour later the bank resumed delivery of securities.

But Thursday the bank, a unit of Bank of New York Co., had to borrow a record $20 billion from the Federal Reserve Bank of New York so it could pay for securities received. The borrowing is said to be the largest discount window borrowing ever from the Federal Reserve System. Bank of New York repaid the loan Friday, Martha Dinnerstein, a senior vice president, said. … Bank of New York said that it had paid for the cost of carrying the securities so its customers wouldn't lose any interest. Bank of New York's inability to accept payments temporarily left other banks with $20 billion on their hands. This diminished the need of many banks to borrow from others in the federal funds market. Banks use the market for federal funds, which are reserves that banks lend each other, for short-term funding of certain operations. The cash glut caused the federal funds rate to plummet to 5.5% from 8.375% early Thursday.

… According to Wall Street sources, the malfunction occurred at 10 a.m. Thursday as Bank of New York was preparing to change software in a computer system and begin the day's operations. Until Friday afternoon, Bank of New York received billions of dollars in securities that it couldn't deliver to buyers. The Fed settlement system, which officially closes at 2:30 p.m., remained open until 1:30 a.m. Friday in the expectation that technicians would be able to solve the problem.

… it seems that the primary error occurred in a messaging system which buffered messages going in and out of the bank. The actual error was an overflow in a counter which was only 16 bits wide, instead of the usual 32. This caused a message database to become corrupted. The programmers and operators, working under tremendous pressure to solve the problem quickly, accidentally copied the corrupt copy of the database over the backup, instead of the other way around.”

The all-time most expensive single-character error?

Quoted from comp.risks:

“Date: Sat 10 Sep 88 15:22:49-PDT
From: Gary Kremen
Subject: Soviets See Little Hope of Controlling Spacecraft
According to today's (Saturday, September 10, 1988) New York Times, the Soviets lost their Phobos I spacecraft after it tumbled in orbit and the solar cells lost power. The tumbling was caused when a ground controller gave it an improper command. This has to be one of the most expensive system mistakes ever. [estimated at $100,000,000]

Date: Tue, 13 Sep 88 15:20:18 PDT
From: Peter G. Neumann
Subject: Soviet Mars Probe
The Soviet Mars probe was mistakenly ordered to ‘commit suicide’ when ground control beamed up a 20 to 30 page message in which a single character was inadvertently omitted. The change in program was required because the Phobos 1 control had been transferred from a command center in the Crimea to a new facility near Moscow. ‘The [changes] would not have been required if the controller had been working the computer in Crimea.’ The commands caused the spacecraft's solar panels to point the wrong way, which would prevent the batteries from staying charged, ultimately causing the spacecraft to run out of power.”

The Ariane 5 failure

Quoted from the report of the Ariane Inquiry Board:

“On 4 June 1996, the maiden flight of the Ariane 5 launcher ended in a failure. Only about 40 seconds after initiation of the flight sequence, at an altitude of about 3700 m, the launcher veered off its flight path, broke up and exploded. … The reason why the active SRI 2 did not send correct attitude data was that the unit had declared a failure due to a software exception. … The data conversion instructions (in Ada code) were not protected from causing an Operand Error, although other conversions of comparable variables in the same place in the code were protected. …

An underlying theme in the development of Ariane 5 is the bias towards the mitigation of random failure. … The exception was detected, but inappropriately handled because the view had been taken that software should be considered correct until it is shown to be at fault. The Board has reason to believe that this view is also accepted in other areas of Ariane 5 software design. The Board is in favour of the opposite view, that software should be assumed to be faulty until applying the currently accepted best practice methods can demonstrate that it is correct.

… Identify all implicit assumptions made by the code and its justification documents on the values of quantities provided by the equipment.

… Verify the range of values taken by any internal or communication variables in the software.”
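The conversion failure the Board describes can be sketched in a few lines. The following is an illustrative C fragment (the flight software was in Ada, and the variable name and limits here are assumptions, not taken from the report) contrasting an unprotected narrowing conversion with one that checks its operand first:

  /* A minimal sketch, in C rather than the Ada of the SRI code, of the
   * difference between an unprotected and a protected narrowing
   * conversion.  The variable name and value are illustrative
   * assumptions, not taken from the flight software. */
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Protected conversion: range-check first, report failure to the caller. */
  static bool to_int16_checked(double value, int16_t *out)
  {
      if (value < INT16_MIN || value > INT16_MAX)
          return false;              /* analogous to handling the exception */
      *out = (int16_t)value;
      return true;
  }

  int main(void)
  {
      double horizontal_bias = 64000.0; /* exceeds what 16 bits can hold */
      int16_t converted;

      /* Unprotected: (int16_t)horizontal_bias would be undefined behaviour
       * in C for an out-of-range value - the counterpart of the Ada
       * Operand Error that the report says was left unhandled. */

      if (!to_int16_checked(horizontal_bias, &converted))
          printf("conversion refused: value %.1f out of range\n", horizontal_bias);
      else
          printf("converted value: %d\n", converted);
      return 0;
  }

A checked conversion turns a silent out-of-range result into an explicit failure the caller must handle; the report's recommendation to verify the range of values taken by internal and communication variables amounts to making such checks, or proofs that they are unnecessary, systematic.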

Millions inconvenienced, millions paid in compensation

The following report shows how obscure the source of a major software crash can be, and how unconvincing the inevitable claims are that everything possible is being done to prevent the recurrence of such errors. Dijkstra’s famous dictum says it all: “Testing can be a very effective way to show the presence of bugs, but it is hopelessly inadequate for showing their absence”. And Weinberg’s mottoes sharpen the point: “There's always one more bug, even after that one is removed. … An ounce of prevention is worth a pound of cure, but management won't pay a penny for it.”

Quoted from comp.dcom.telecom:

“Subject: AT&T Crash Statement: The Official Report
Technical background on AT&T's network slowdown, January 15, 1990
At approximately 2:30 p.m. EST on Monday, January 15, one of AT&T's 4ESS toll switching systems in New York City experienced a minor hardware problem. … This required the switch to briefly suspend new call processing until it completed its fault recovery action – a four-to-six second procedure.

… messages were automatically sent to connecting 4ESS switches requesting that no new calls be sent to this New York switch during this … interval. The switches receiving this message made a notation to show that the switch was temporarily out of service.

When the New York switch in question was ready to resume call processing a few seconds later, it sent out call attempts (known as IAMs - Initial Address Messages) to its connecting switches. When these switches started seeing call attempts from New York, they started making adjustments to their programs to recognize that New York was once again up and running, and therefore able to receive new calls.

A processor in the 4ESS switch which links that switch to the CCS7 network holds the status information mentioned above. When this processor (called a Direct Link Node, or DLN) in a connecting switch received the first call attempt (IAM) from the previously out-of-service New York switch, it initiated a process to update its status map. As the result of a software flaw, this DLN processor was left vulnerable to disruption for several seconds. During this vulnerable time, the receipt of two call attempts from the New York switch – within an interval of 1/100th of a second – caused some data to become damaged. The DLN processor was then taken out of service to be reinitialized.

Since the DLN processor is duplicated, its mate took over the traffic load. However, a second couplet of closely spaced new call messages … hit the mate processor during the vulnerable period, causing it to be removed from service and temporarily isolating the switch from the CCS7 signaling network. The effect cascaded through the network as … other switches similarly went out of service. The unstable condition continued because of the random nature of the failures and the constant pressure of the traffic load in the network providing the call message triggers.

The software flaw was inadvertently introduced into all the 4ESS switches … as part of a mid-December software update. This update was intended to significantly improve the network's performance by making it possible for switching systems to access a backup signaling network more quickly in case of problems with the main CCS7 signaling network. While the software had been rigorously tested in laboratory environments before it was introduced, the unique combination of events that led to this problem couldn't be predicted. …

On Tuesday, we took the faulty program update out … and temporarily switched back to the previous program. … We have since corrected the flaw, tested the change and restored the backup signaling links. We believe the software design, development and testing processes we use are based on solid, quality foundations. All future releases of software will continue to be rigorously tested. We will use the experience we've gained through this problem to further improve our procedures.

… Although nothing can be guaranteed 100% of the time, what happened Monday was a series of events that had never occurred before. With ongoing improvements to our design and delivery processes, we will continue to drive the probability of this type of incident occurring towards zero.”
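The statement does not show the faulty code, but the class of bug it describes (a handler that is briefly unsafe to interrupt while it rewrites shared status data) can be sketched as follows; everything here is an illustrative assumption, not the 4ESS implementation:

  /* A single-threaded illustration of the class of bug described above:
   * a message handler that is briefly unsafe to interrupt while it
   * rewrites a shared status record.  Names and structure are assumed. */
  #include <stdio.h>
  #include <string.h>

  static char shared_status[32];  /* status-map entry shared by all messages */
  static int  updating = 0;       /* nonzero while an update is in progress  */

  static void handle_iam(const char *from, int simulate_second_arrival)
  {
      if (updating)
          printf("  (second message arrived inside the vulnerable window)\n");
      updating = 1;

      /* Step 1: begin rewriting the shared record for this switch. */
      strncpy(shared_status, from, sizeof shared_status - 1);
      shared_status[sizeof shared_status - 1] = '\0';

      /* Simulate a second, closely spaced IAM arriving before step 2. */
      if (simulate_second_arrival)
          handle_iam("switch-B", 0);

      /* Step 2: finish the update, assuming the record still holds the
       * data written in step 1 - after the interruption, it does not. */
      printf("update for %s completed; record now reads \"%s\"\n",
             from, shared_status);
      updating = 0;
  }

  int main(void)
  {
      handle_iam("switch-A", 1);  /* two call attempts ~10 ms apart */
      return 0;
  }

In the real network the second message came from another switch within 1/100th of a second rather than from a nested call, but the effect is the same: data associated with the first update is damaged, and the processor has to be taken out of service and reinitialized.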

The AT&T system failure was triggered by a system upgrade. This is a very common hazard. It is quite difficult to establish through prior testing that an upgrade to a complex system will not have an unexpected and possibly disastrous effect—as in a recent incident at the Canadian Bank of Commerce, which for two days could not record deposits or payments through its automatic teller machines because, as a newspaper report had it, “attempts by the bank to improve one of their computer systems caused problems with another system, throwing everything out of whack.”

Testing is not proof

The AT&T report implicitly connects confidence in the correct functioning of its software with the quality of its testing procedures. For some, testing is even confused with proving correctness, as illustrated by the following posting to comp.risks:

“Date: Fri, 18 Jul 1997 14:10:53 -0400 (EDT)
From: “Daniel P. B. Smith”
Subject: Unique definition of “proof of correctness”
*Computer Design*, July 1997, p. 38, describes the IDT-C6, a Pentium-compatible microprocessor designed by Centaur Technology. “IDT claims to have tested the C6 with most modern PC operating systems, including Windows 95, Windows 3.1, NetWare, and Solaris for Intel. _The complexity and pervasiveness of the Windows operating system generally are considered to make up an exacting proof of correctness…_” [emphasis supplied]

Give up, folks… the marketers have won. Once they redefine a term, the engineers _never_ win it back…”

Programmer error in air traffic control software

In the following example, failure to allow for negative values could have had fatal consequences. (Points east of 0° longitude—which is, by definition, the meridian passing through the Royal Observatory at Greenwich, England—can be specified by negative longitudes. In this case, the program apparently ignored the sign of the longitude value.) The program in question was used for air traffic control in the London (England) area by the Civil Aviation Authority (CAA):

“One of the more startling problems concerned the program's handling of the Greenwich meridian. The National Airspace Package, designed by IBM's Federal Systems division, contains a model of the airspace it controls, that is, a map of the airlanes and beacons in the area. But, because the program was designed for air traffic control centres in the US, the designers had taken no account of a zero longitude; the deficiency caused the computer to fold its map of Britain in two at the Greenwich meridian, plonking Norwich on top of Birmingham. …

“It’s axiomatic that software has bugs in it,” said Stan Price. He bought the software for the CAA but is critical of the CAA’s failure to use the latest techniques for producing software. Techniques that are lacking, notes Price, include the use of formal methodologies for designing and writing software.”
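The folding effect is easy to reproduce in miniature. The sketch below is illustrative C, not the CAA system: the coordinates are rough, and it uses an east-positive sign convention for concreteness, but the folding effect is the same whichever side of the meridian is treated as negative.

  /* A minimal sketch of the "folded map" effect described above: if a
   * program discards the sign of the longitude, points east and west of
   * the Greenwich meridian collapse onto one another.  The coordinates
   * are rough illustrative values, not the CAA's data. */
  #include <math.h>
  #include <stdio.h>

  struct fix { const char *name; double lat, lon; };  /* degrees; west < 0 */

  /* Buggy normalisation: keeps only the magnitude of the longitude. */
  static double buggy_lon(double lon) { return fabs(lon); }

  int main(void)
  {
      struct fix birmingham = { "Birmingham", 52.5, -1.9 };  /* west of 0 */
      struct fix norwich    = { "Norwich",    52.6,  1.3 };  /* east of 0 */

      printf("%-10s stored at lon %5.1f (true %5.1f)\n",
             birmingham.name, buggy_lon(birmingham.lon), birmingham.lon);
      printf("%-10s stored at lon %5.1f (true %5.1f)\n",
             norwich.name, buggy_lon(norwich.lon), norwich.lon);
      /* Both fixes now lie on the same side of the meridian, so traffic
       * near Birmingham can be plotted on top of traffic near Norwich. */
      return 0;
  }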

Ambulance dispatch system failure

In November 1992, the computerized dispatch system for the London (England) Ambulance Service broke down suddenly and “locked up altogether”, leaving calls for assistance which had already been stored in the system inaccessible, and forcing a taxing and cumbersome return to a manual, paper-based dispatch system. Luckily, the breakdown occurred in the early morning hours, so there were no severe consequences for patients.

The cause of the breakdown? Quoting from the subsequent inquiry report:

“The Inquiry Team has concluded that the system crash was caused by a minor programming error. In carrying out some work on the system some three weeks previously the SO programmer had inadvertently left in memory a piece of program code that caused a small amount of memory within the file server to be used up and not released every time a vehicle mobilization was generated by the system. Over a three week period these activities had gradually used up all available memory thus causing the system to crash. This programming error should not have occurred and was caused by carelessness and lack of quality assurance of program code changes. Given the nature of the fault it is unlikely that it would have been detected through conventional programmer or user testing.”
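The fault described is a classic slow memory leak. A minimal C sketch of the pattern (the structure, names and sizes are illustrative assumptions, not the LAS code) looks like this:

  /* A minimal sketch of the class of fault the inquiry describes: a small
   * allocation made for every event and never released, so free memory
   * shrinks slowly over weeks rather than failing at once. */
  #include <stdlib.h>
  #include <string.h>

  struct mobilisation {
      char vehicle[8];
      char incident[32];
  };

  /* Called once per vehicle mobilisation. */
  static void record_mobilisation(const char *vehicle, const char *incident)
  {
      struct mobilisation *m = malloc(sizeof *m);
      if (m == NULL)
          return;                        /* eventually this starts happening */
      strncpy(m->vehicle, vehicle, sizeof m->vehicle - 1);
      m->vehicle[sizeof m->vehicle - 1] = '\0';
      strncpy(m->incident, incident, sizeof m->incident - 1);
      m->incident[sizeof m->incident - 1] = '\0';

      /* ... use the record ... */

      /* Bug: the record is never freed and never stored anywhere that
       * would free it later.  Each call leaks a few dozen bytes; over
       * three weeks of dispatching, the file server runs out of memory. */
  }

  int main(void)
  {
      for (int i = 0; i < 1000; i++)     /* stands in for weeks of calls */
          record_mobilisation("A123", "RTA, Marylebone Road");
      return 0;
  }

As the inquiry notes, a leak of this kind is almost invisible to conventional programmer or user testing: each test run is too short for the loss to matter, and the system only fails after weeks of continuous operation.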

Software that kills

It is remarkable that even the largest and most sophisticated corporations, such as AT&T, which stand to lose so much financially from defective software, have invested so little in effective verification tools. It is even more remarkable that software on which people depend for their lives (such as the software used to control the latest “fly-by-wire” models of passenger planes) is often written using low-level programming techniques known to be prone to error, as well as being very difficult to debug, while at the same time more and more reliance is placed on software-based safety mechanisms rather than on simpler but more reliable hardware guards and interlocks. As Robert Baber has remarked, “developing in the traditional way software upon which human life depends is nothing other than high tech Russian roulette.” Moser and Melliar-Smith have commented: “For systems that must be safe, there appears to be no alternative to verification. Testing, fault tolerance, and fault-tree analysis can make important contributions to safety, but they alone are unlikely to achieve the high levels of safety that are necessary for safety-critical applications.”

A clear example of the risks of poor programming and verification techniques is the tragic story of the Therac-25—one in a series of radiation therapy machines developed and sold over a number of years by Atomic Energy of Canada Limited (AECL). As a direct result of inadequate programming and verification techniques, at least six patients received massive radiation overdoses which caused great pain and suffering and from which three died. The story is a complicated one; it is presented in comprehensive detail by Leveson. A few points are noteworthy:

The Therac-25 accidents occurred because of coding errors, not, as is more common for software-caused accidents, because of errors in the requirements or specifications. The software itself was never tested or formally analyzed at the module level; indeed, the initial safety analysis ignored the software altogether, even though key hardware interlocks in earlier models had been replaced by software. Even when problems were noticed, the investigators focused on the hardware, apparently preferring to attempt to “fix” the hardware rather than tackle the daunting task of debugging very complex, poorly documented and poorly structured code.

Leveson and Turner warn against simply blaming tragedies such as the Therac-25 accidents on software errors alone—they argue for a deeper analysis of “organizational, managerial, technical” and even “political, or sociological factors”, on the grounds that, as we have already observed, there will always be another bug; focusing only on eliminating individual software errors may blind us to the larger environment which tolerates and unintentionally furthers the occurrence of such errors, with their drastic consequences.

True enough, but one of the underlying causes of unsafe software may be the general reluctance to apply formal methods and verification techniques, usually on the grounds that they are too difficult to automate and hence cannot be applied to “real code”. One purpose of this text is to argue by example against the validity of this excuse—to show that through the use of appropriate tools, portions of code can be verified automatically or semi-automatically, or can be generated automatically from logical specifications. Just as we would expect any author who uses a word-processor to use a spelling checker to check for at least those spelling errors which the dictionary can catch, so we should — in the future, if not today — expect programmers to routinely check or generate their code with software verification tools to catch and eliminate at least those errors which the tools can prevent. Even then programmers will not succeed in stamping out that vexatious and stubborn “last bug”, but at least its available hiding places will grow smaller, and both money and lives may be saved.
