Position Paper for Information Survivability Workshop 1998
Charles C. Howell
Chief Engineer, Mitretek Systems Technology Center
0. Introduction
From the Call for Participation:
"The primary goal of the workshop is to foster cooperation and collaboration between domain experts and the survivability research community to improve the survivability of critical, real-world systems. Another important goal is to continue to identify and highlight new survivability research ideas that can contribute to the protection of critical infrastructures and critical applications. ... The position paper should clearly indicate how the background or interests of the author(s) would contribute to the goals of the workshop."
There are several issues I'd like to discuss that I believe are consistent with the goals of this Workshop.
- I'd argue that error handling is a crucial and often overlooked aspect of robust critical system design and assessment;
- I have had an opportunity to analyze the error handling design and implementation of a variety of critical systems, and have some conclusions about design and assessment issues for increasing critical infrastructure robustness as a result;
- One of my interests is facilitating the collaboration between domain experts and software engineering experts, and I believe that refining and calibrating error handling is one important opportunity for such collaboration.
1. Error Handling
"In our experience, software exhibits weak-link behavior; failures in even the unimportant parts of the code can have unexpected repercussions elsewhere."
[Parnas et al., "Evaluation of Safety-Critical Software", CACM June 1990]
"Any program, no matter how innocuous it seems, can harbor security holes. Who would have guessed that on some machines integer divide exceptions could lead to system penetrations?"
[Cheswick and Bellovin, Firewalls and Internet Security, Addison-Wesley 1994]
Published analyses of the causes of software defects frequently single out error handling as a significant problem. For example,
- an analysis of software defects in Hewlett-Packard's Scientific Instruments Division identified "error checking" as the third most frequent cause of defects;
- a case study of a fault-tolerant switching system showed 2/3 of the system failures due to design problems were in the error handling portion of the system;
- many of the safety-critical level software failures detected in the final system tests for space shuttle avionics were associated with "exception handling and redundancy management";
- concentration on the nominal or fault-free case has been identified as a major source of software rework costs.
The issues associated with error handling range from the architecture of "systems of systems" and how the error detection, reporting, and recovery protocols will be orchestrated all the way to coding decisions about exception handling and error code checking. A problem with the "choreography" of an error recovery protocol was a key part of the partial collapse of the AT&T Long Lines network in the early 1990s, and a problem with the use of Ada exceptions was a key part of the loss of the first Ariane-5 (and also a brilliant example of the weak link effect Parnas mentions).
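The coding-level end of that range can be made concrete with a small sketch. It is not taken from any of the systems mentioned here; the names `to_int16_checked`, `to_int16_saturating`, and `ConversionOutOfRange` are hypothetical, and clamping is only one possible recovery policy. The point is that a narrowing conversion either reports failure to a caller that must decide what to do, or applies an explicit local policy and says that it did so.

```python
import math

INT16_MIN, INT16_MAX = -32768, 32767


class ConversionOutOfRange(Exception):
    """Raised when a value cannot be represented as a signed 16-bit integer."""


def to_int16_checked(value):
    """Detection only: convert, or raise so the caller must choose a recovery."""
    if not math.isfinite(value):
        raise ConversionOutOfRange(f"{value!r} is not a finite number")
    truncated = int(value)  # truncates toward zero
    if not (INT16_MIN <= truncated <= INT16_MAX):
        raise ConversionOutOfRange(f"{value!r} does not fit in 16 bits")
    return truncated


def to_int16_saturating(value):
    """Detection plus a local recovery policy (an assumption of this sketch).

    Returns (result, clamped) so the reaction is reported rather than hidden.
    """
    try:
        return to_int16_checked(value), False
    except ConversionOutOfRange:
        if math.isnan(value):
            raise  # no sensible clamp exists for NaN; leave it to the caller
        return (INT16_MAX if value > 0 else INT16_MIN), True


if __name__ == "__main__":
    print(to_int16_saturating(12.7))
    print(to_int16_saturating(40000.0))
```

Which of the two functions is appropriate is a design decision about where responsibility for recovery is allocated, not a property of the conversion itself.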
It is precisely when a critical infrastructure component is facing a threat to its ability to continue to deliver services (e.g., because of a deliberate attack, a logic error, or a hardware failure) that the error handling aspect of the design is brought into play. In this sense it is part of what some writers have compared to a software "immune system" (e.g., Stephanie Forrest et al., "Computer Immunology", CACM October 1997). This seems to be a useful analogy, and the concept of auto-immune diseases is a useful extension of it. For example, I have seen fragile error handling mechanisms that misdiagnosed the real threat, reacted incorrectly, and caused cascading failures. Equally important, the error handling mechanism of a critical infrastructure component may itself become the target of attack (to trigger defensive reactions such as load shedding and degraded mode operation).
As I'll describe in the next section, I have a background in assessing the error handling mechanisms of a variety of critical systems (you thought I was kidding about proof by repeated assertion?), and given half a chance I'd like to discuss design and assessment issues related to error handling and the relevance to critical infrastructure protection (e.g., architectural issues of error handling granularity and allocation of responsibility, assessment and testing strategies, etc.).
2. Parse and Tell: Critical Systems I Have Known and Loved
"They constantly try to escape... by dreaming of systems so perfect that no one will need to be good"
T. S. Eliot, Choruses from "The Rock", VI
As Principal Investigator of a research project on error handling analysis several years ago, I had an opportunity to participate directly in the design and assessment of error handling mechanisms in several critical systems. Since the research project ended I have continued to pursue error handling analysis work for a range of clients. Systems I have analyzed include
- U.S. and foreign air traffic control systems
- Submarine combat control system
- Weapons system interlock firmware
- Flight Management System avionics
- A Ship Control system
- The Energy Management System for a metropolitan electrical utility
What I've seen in these various systems confirms my belief that error handling mechanisms are crucial to the robustness of critical systems, yet they often are not given adequate attention during requirements refinement, system architecture and design, implementation, and testing. I'd be interested in discussing error handling lessons common to these systems and to others that workshop participants know well. Clearly, there are substantial differences between the concerns of an air traffic control system and a power grid control system, but there are some surprising recurring issues.
3. Collaboration Between Domain Experts and Software Engineers
"Knowing is not enough, we must apply. Willing is not enough, we must do."
Goethe
I believe there are two distinct classes of assurance requirements for critical infrastructure systems with significant software components: domain-specific requirements and software implementation requirements. As an illustration: in a fly-by-wire aircraft control system, one domain-specific hazard is to be under-responsive to pilot commands. With a considerably more precise definition of what it means to be under-responsive, we can then identify an assurance requirement: the system developer must make the case that this risk has been mitigated to an acceptable degree (which is of course determined by the customer, the regulatory agency, or law). From this an assurance argument could be formulated, describing what mixture of, say, test demonstration and product analyses will be used to make the case that the system will not be under-responsive.
As an example of a software implementation assurance requirement, consider the requirement that a critical program does not enter an "unknown and unpredictable state". While it cannot be said with certainty that a mishap would result, for critical software such lack of predictability can be considered a hazard. Therefore, one derived assurance requirement would be that the program does not read any un-initialized variables (assuming that the programs are written in a language that permits reading un-initialized variables and does not define default values for such), since the behavior of such programs is unpredictable. The corresponding assurance argument would describe what combination of, say, coding standards, testing, and static analyses will be used to make the case that there are no paths where an un-initialized variable is read.
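To illustrate the static analysis part of such an argument, here is a minimal sketch of a "definite assignment" check over a toy program representation. Everything in it is an assumption made for illustration: the `Assign` and `If` statement forms, the function names, and the restriction to straight-line code and two-way branches (no loops, calls, or pointers). A real analysis for a real language is far more involved.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple, Union


@dataclass
class Assign:
    """target = f(reads): reads some variables, then assigns one."""
    target: str
    reads: Tuple[str, ...] = ()


@dataclass
class If:
    """Two-way branch whose condition reads some variables."""
    condition_reads: Tuple[str, ...]
    then_branch: List["Stmt"]
    else_branch: List["Stmt"]


Stmt = Union[Assign, If]


def check_block(block: List[Stmt], assigned: Set[str], findings: List[str]) -> Set[str]:
    """Return the variables assigned on every path through `block`.

    Any read of a variable not assigned on every path so far is appended
    to `findings` as a possibly un-initialized read.
    """
    assigned = set(assigned)  # copy, so sibling branches do not interfere
    for stmt in block:
        if isinstance(stmt, Assign):
            findings.extend(n for n in stmt.reads if n not in assigned)
            assigned.add(stmt.target)
        else:
            findings.extend(n for n in stmt.condition_reads if n not in assigned)
            after_then = check_block(stmt.then_branch, assigned, findings)
            after_else = check_block(stmt.else_branch, assigned, findings)
            # Only variables assigned on BOTH branches are definitely assigned.
            assigned = after_then & after_else
    return assigned


def possibly_uninitialized_reads(program: List[Stmt]) -> List[str]:
    findings: List[str] = []
    check_block(program, set(), findings)
    return findings


if __name__ == "__main__":
    program = [
        Assign("mode"),                        # mode = <constant>
        If(("mode",), [Assign("gain")], []),   # gain assigned on one path only
        Assign("output", reads=("gain",)),     # reads gain on every path
    ]
    print(possibly_uninitialized_reads(program))  # prints ['gain']
```

The check is deliberately conservative: it flags a read whenever some path might reach it without an assignment, which is the direction of error an assurance argument would want.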
The picture below illustrates the general need to merge assurance requirements from domain experts and experts in the structural (i.e., domain-independent) aspects of software system construction. My experience has been that identification of requirements for robust error handling is well suited for such a structured collaboration between domain and software experts. Key recurring architectural issues for error handling include decisions about allocation of responsibility among system components and the level of granularity of detection/reporting/recovery actions. Specific ways to facilitate this collaboration might be an interesting discussion for the workshop.
Charles C. Howell
Chief Engineer, Mitretek Systems Technology Center
7525 Colshire Drive, M/S Z553, McLean, VA 22102-7400
Voice: 703 610-1866 Fax: 703 610-1603 howell@mitretek.org