verification
... bits wide, instead of the usual 32. This caused a message database to become corrupted. The programmers and operators, working under tremendous pressure to solve the problem quickly, accidentally copied the corrupt copy of the database over the backup, instead of the other way around.”
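The report does not say how the too-narrow field was handled internally, but the corruption mechanism is easy to picture. The sketch below is purely illustrative (none of the names come from the incident): writing a 32-bit value into a field that keeps only 24 bits silently discards the high-order byte, so the stored record no longer matches the original.

<code python>
# Purely illustrative sketch; the incident report gives no implementation details.
# Squeezing a 32-bit value into a 24-bit field silently discards the high byte.

def store_in_24_bit_field(value: int) -> int:
    """Keep only the low 24 bits, as a too-narrow record field would."""
    return value & 0xFFFFFF

original = 0x01000042          # needs 25 bits to represent
stored = store_in_24_bit_field(original)

print(hex(stored))             # 0x42 -- the high-order byte is gone
assert stored != original      # the stored record is now corrupt
</code>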
==== The all-time most expensive single-character error? ====
"Date: Sat 10 Sep 88 15: | "Date: Sat 10 Sep 88 15: | ||
- | From: Gary Kremen | + | From: Gary Kremen\\ |
+ | Subject: Soviets See Little Hope of Controlling Spacecraft\\ | ||
According to today' | According to today' | ||
caused when a ground controller gave it an improper command. This has to one of the most expensive system mistakes ever. [estimated at $100, | caused when a ground controller gave it an improper command. This has to one of the most expensive system mistakes ever. [estimated at $100, | ||
Date: Tue, 13 Sep 88 15:20:18 PDT\\ | Date: Tue, 13 Sep 88 15:20:18 PDT\\ | ||
- | From: Peter G. Neumann | + | From: Peter G. Neumann\\ |
Subject: Soviet Mars Probe\\ | Subject: Soviet Mars Probe\\ | ||
The Soviet Mars probe was mistakenly ordered to ‘commit suicide’ when ground control beamed up a 20 to 30 page message in which a single character was inadvertently omitted. The change in program was required because the Phobos 1 control had been transferred from a command center in the Crimea to a new facility near Moscow. ‘The [changes] would not have been required if the controller had been working the computer in Crimea.’ The commands caused the spacecraft' | The Soviet Mars probe was mistakenly ordered to ‘commit suicide’ when ground control beamed up a 20 to 30 page message in which a single character was inadvertently omitted. The change in program was required because the Phobos 1 control had been transferred from a command center in the Crimea to a new facility near Moscow. ‘The [changes] would not have been required if the controller had been working the computer in Crimea.’ The commands caused the spacecraft' | ||
==== The Ariane 5 failure ====

Quoted from the report of the Ariane Inquiry Board:
“On 4 June 1996, the maiden flight of the Ariane 5 launcher ended in a failure. Only about 40 seconds after initiation of the flight sequence, at an altitude of about 3700 m, the launcher veered off its flight path, broke up and exploded.
... Verify the range of values taken by any internal or communication variables in the software.”

Of the many recommendations made by the Ariane Inquiry Board, two are directly relevant to the issue of verification. The Phobos and Ariane failures demonstrate two important lessons:

  * code cannot be trusted just because it worked in a different context;
  * there is no connection between the scope of the error and the cost of the resulting failure.
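The recommendation to verify the range of values taken by internal and communication variables can be made concrete even at the level of a single conversion. The sketch below is only an illustration in Python, not the Ariane code (which was written in Ada): it shows a narrowing conversion guarded by an explicit range check, so that an out-of-range value triggers a defined response instead of being converted blindly.

<code python>
# Illustrative only: a range-checked narrowing conversion, in the spirit of the
# Inquiry Board's recommendation. Names and values are invented for the example.

INT16_MIN, INT16_MAX = -2**15, 2**15 - 1

def to_int16_checked(x: float) -> int:
    """Narrow a wide value to a signed 16-bit field, verifying its range first."""
    value = int(x)
    if not INT16_MIN <= value <= INT16_MAX:
        # In flight software this branch would invoke an analysed recovery
        # action; here we simply raise.
        raise OverflowError(f"{value} does not fit in a signed 16-bit field")
    return value

print(to_int16_checked(1234.5))        # 1234 -- well within range

try:
    to_int16_checked(87654.0)          # out of range: caught, not silently wrapped
except OverflowError as err:
    print("range violation detected:", err)
</code>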
==== Millions inconvenienced, ... ====
The following report shows how obscure the source of a major software crash can be, and how unconvincing the inevitable claims are that everything possible is being done to prevent the recurrence of such errors. Dijkstra’s famous dictum says it all: "Testing can be a very effective way to show the presence of bugs, but it is hopelessly inadequate for showing their absence". And Weinberg’s mottoes sharpen the point: ...

Quoted from comp.dcom.telecom:
... recent incident at the Canadian Bank of Commerce, which for two days could not record deposits or payments through its automatic teller machines because, as a newspaper report had it, "..."

==== Testing is not proof ====

The AT&T report implicitly connects confidence in the correct functioning of its software with the quality of its testing procedures. For some, testing is even confused with proving correctness, as illustrated by the following posting to comp.risks:

"Date: Fri, 18 Jul 1997 14:10:53 -0400 (EDT)\\
From: "..."\\
Subject: Unique definition of "proof of correctness"\\
*Computer Design*, July 1997, p. 38, describes the IDT-C6, a Pentium-compatible microprocessor designed by Centaur Technology. "IDT claims to have tested the C6 with most modern PC operating systems, including Windows 95, Windows 3.1, NetWare, and Solaris for Intel. _The complexity and pervasiveness of the Windows operating system generally are considered to make up an exacting proof of correctness..._"

Give up, folks... the marketers have won. Once they redefine a term, the engineers _never_ win it back..."
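A toy example (unrelated to the IDT-C6, and invented for this page) shows why a passing test suite is not a proof: every test below succeeds, yet the function is wrong for an input the tests simply never exercise.

<code python>
# Invented example: a plausible test suite passes, yet the function is incorrect.

def is_leap_year(year: int) -> bool:
    """Buggy: forgets that years divisible by 400 ARE leap years."""
    return year % 4 == 0 and year % 100 != 0

# The "proof by testing": every assertion holds.
assert is_leap_year(1996) is True
assert is_leap_year(2023) is False
assert is_leap_year(1900) is False

# The bug only appears on an input the tests never tried.
print(is_leap_year(2000))   # prints False, although 2000 was a leap year
</code>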
==== Programmer error in air traffic control software ====

In the following example, failure to allow for negative values could have had fatal consequences. (Points east of 0° longitude, which is by definition the meridian passing through the Royal Observatory at Greenwich, England, can be specified by negative longitudes. In this case, the program apparently ignored the sign of the longitude value.) The program in question was used for air traffic control in the London (England) area by the Civil Aviation Authority (CAA):

"One of the more startling problems concerned the program's ... [The program] contains a model of the airspace it controls, that is, a map of the airlanes and beacons in the area. But, because the program was designed for air traffic control centres in the US, the designers had taken no account of a zero longitude; the deficiency caused the computer to fold its map of Britain in two at the Greenwich meridian, plonking Norwich on top of Birmingham. ...

"Techniques that are lacking, notes Price, include the use of formal methodologies for designing and writing software."
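The folding effect is easy to reproduce in a few lines. This is a sketch invented for illustration, not the CAA code; following the convention used in the text, longitudes are measured in degrees west of Greenwich, so points east of the meridian are negative, and ignoring the sign mirrors them onto the western side.

<code python>
# Invented sketch of the reported defect: the sign of the longitude is ignored,
# so the map is folded in two at the Greenwich meridian.
# Convention (as in the text): positive = west of Greenwich, negative = east of it.

def plotted_longitude_buggy(lon_deg_west: float) -> float:
    """Drops the sign, plotting every point as if it were west of Greenwich."""
    return abs(lon_deg_west)

norwich_lon = -1.3      # roughly 1.3 degrees east of Greenwich (approximate)
birmingham_lon = 1.9    # roughly 1.9 degrees west of Greenwich (approximate)

print(plotted_longitude_buggy(norwich_lon))     # 1.3 -- Norwich is drawn west of Greenwich,
print(plotted_longitude_buggy(birmingham_lon))  # 1.9    close to where Birmingham really is
</code>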
==== Ambulance dispatch system failure ====

In November 1992, the computerized dispatch system for the London (England) Ambulance Service broke down suddenly and “locked up altogether”, ...

The cause of the breakdown? Quoting from the subsequent inquiry report:

"The Inquiry Team has concluded that the system crash was caused by a minor programming error. In carrying out some work on the system some three weeks previously the SO programmer had inadvertently left in memory a piece of program code that caused a small amount of memory within the file server to be used up and not released every time a vehicle mobilization was generated by the system. Over a three week period these activities had gradually used up all available memory thus causing the system to crash. This programming error should not have occurred and was caused by carelessness and lack of quality assurance of program code changes. Given the nature of the fault it is unlikely that it would have been detected through conventional programmer or user testing."
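The failure mode the inquiry describes, a small allocation retained on every mobilization until memory runs out weeks later, can be sketched in a few lines. Everything below is invented for illustration; the real dispatch system's code is not shown in this excerpt.

<code python>
# Invented sketch of the failure mode: each mobilization retains a small piece
# of memory that is never released, so the server degrades only after weeks.

_leaked_records = []   # left-over debugging store; nothing ever removes entries

def mobilise_vehicle(incident_id: int) -> None:
    record = {"incident": incident_id, "trace": bytearray(1024)}
    _leaked_records.append(record)   # retained forever -- the "minor" error
    # ... dispatch the vehicle ...

# Roughly 1 KiB leaks per call; no short test run lasts long enough to notice,
# but thousands of mobilizations per day eventually exhaust the server.
for incident in range(10_000):
    mobilise_vehicle(incident)

print(f"unreleased records: {len(_leaked_records)}")
</code>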
==== Software that kills ====

It is remarkable that even the largest and most sophisticated corporations, ... interlocks. As Robert Baber has remarked, “developing in the traditional way software upon which human life depends is nothing other than high tech Russian roulette.” Moser and Melliar-Smith have commented: “For systems that must be safe, there appears to be no alternative to verification. Testing, fault tolerance, and fault-tree analysis can make important contributions to safety, but they alone are unlikely to achieve the high levels of safety that are necessary for safety-critical applications.”

A clear example of the risks of poor programming and verification techniques is the tragic story of the Therac-25, one in a series of radiation therapy machines developed and sold over a number of years by Atomic Energy of Canada Limited (AECL). As a direct result of inadequate programming and verification techniques, at least six patients received massive radiation overdoses which caused great pain and suffering and from which three died. The story is a complicated one; it is presented in comprehensive detail by Leveson. A few points are noteworthy:

The Therac-25 accidents occurred because of coding errors, not, as is more common for software-caused accidents, because of errors in the requirements or specifications. The software itself was never tested or formally analyzed at the module level; indeed, the initial safety analysis ignored the software altogether, even though key hardware interlocks in earlier models had been replaced by software. Even when problems were noticed, the investigators focused on the hardware, apparently preferring to attempt to “fix” the hardware rather than tackle the daunting task of debugging very complex, poorly documented and poorly structured code.

Leveson and Turner warn against simply blaming tragedies such as the Therac-25 accidents on software errors alone; they argue for a deeper analysis of “organizational, ...”

True enough, but one of the underlying causes of unsafe software may be the general reluctance to apply formal methods and verification techniques, usually on the grounds that they are too difficult to automate and hence cannot be applied to “real code”. One purpose of this text is to argue by example against the validity of this excuse: to show that through the use of appropriate tools, portions of code can be verified automatically or semi-automatically. Just as we expect writers to use a spelling checker to check for at least those spelling errors which the dictionary can catch, so we should, in the future if not today, expect programmers to routinely check or generate their code with software verification tools to catch and eliminate at least those errors which the tools can prevent. Even then programmers will not succeed in stamping out that vexatious and stubborn “last bug”, but at least its available hiding places will grow smaller, and both money and lives may be saved.