[32]
Jeffrey M. Voas, Gary E. McGraw, & Anup K. Ghosh
Reliable Software Technologies Corp.

1. What is Survivability?

Survivability is not an all-or-nothing phenomenon; there are varying degrees of surviving. The question is, can survivability be assessed? The need to measure survivability is not unlike the need to measure risk, quality, safety, and so on. Without some way to know how close we are to a goal, it is difficult to know whether the development steps we are taking are hindering or helping a system. Truly knowing how survivable a system is requires two things: 1) a precise definition of what survivability is, and 2) a precise way to measure progress at any intermediate point in development.

A workable definition of "survivability" has eluded us to date. In fact, survivability has remained a muddled mixture of other ideas, including reliability, fault tolerance, safety, and availability. From a measurement standpoint, that means survivability is no easier to quantify than any of the characteristics listed above, and probably harder, since it is a composite of those already ill-defined characteristics. Ultimately, survivability is a measure of strength against attack combined with an analysis of logical defectiveness.

Along these lines, the last two decades of computer science research have focused on: 1) ways of developing logically correct software (formal methods), and 2) testing as a means of demonstrating correctness. But today's distributed systems preclude both of these approaches, since real systems have few (if any) formally specified properties and are generally untestable on top of that. These deficits are a direct result of the complexity and size of today's systems. In the final analysis, developing and demonstrating survivable distributed systems remains an important and unattained research goal. The problem has never been adequately solved.
The lack of a solution leaves an enormous burden on today's users, who need confidence that their electronic transactions are private, safe, and secure. This is especially relevant in light of the upsurge in Internet usage.

The most prevalent approach to conquering complexity is "divide and conquer": decompose a large problem into subproblems and solve the smaller problems independently. Once the smaller problems are solved, the sub-solutions can be integrated into a global solution. Distributed object systems are built using a closely related design paradigm. The problem with assessing the quality or strength of distributed systems stems both from the interactions between subcomponents and from the many different entry points into the system. We can test one component in isolation until the sun burns out, but such testing will tell us little about what happens when that component is made part of a unified system.

It is our contention that the only tractable solution to the complexity we face (complexity that will only increase with time) is to assess survivability at a macroscopic, not microscopic, level. We believe that survivability can only be improved if deficiencies are measured at a high system level. Testing is used (and should still be used) to identify bugs at lower, sub-system levels, but it leaves system-level analysis unsolved. However, methods for predicting the impact of failures and attacks against subcomponents at the sub-system level can assess the survivability of large-scale distributed systems. Testing methods for developing more survivable components can be enhanced to some degree by formal methods, but methods for assessing system survivability must employ some notion of simulated failures. Fault injection approaches are best suited to this goal. It is our contention that fault injection can be applied at the whole-system level [1].
Most people think of fault injection as a lower-level approach that requires source code. But fault injection works equally well at higher levels of abstraction. In fact, applying fault injection at higher levels helps to ferret out the subcomponents that pose the greatest system-level survivability risks. Fault injection simulates failures. Given simulated failures defined at a high enough level, we can assess system survivability. Since this assessment is made with respect to the remainder of the system, it has the potential ultimately to provide a much-needed overall view of the entire system.

2. Survivability of Systems of COTS Components

To assess the survivability of systems composed of custom-developed, legacy, and COTS components, the methods we advocate examine the resilience of the interfaces between components [2]. Interfaces are the mechanisms for passing information between components. All inputs and outputs of a component define an interface, with the final interface being the mechanism by which outputs are passed to the end user. Inter-component interfaces are the glue that binds components and are often the weakest links in a complex system. The interface a designer creates for a component reflects the assumptions that designer makes about the external environment.

Distributed object systems can reduce the complexity of creating large applications through componentization. Componentization is the process of dividing a complex problem into manageable parts, each representing a different service provided to the system. Componentization can make design, development, and testing simpler by allowing resources to be applied to smaller partitions of the problem that component developers can grasp more easily. However, while componentization can reduce the complexity of a monolithic system, it correspondingly increases the complexity of interfaces.
2.1 Assessing survivability

In light of the problems of generating analytical solutions, the special constraints related to composing systems of COTS components, and the critical nature of component interfaces (all detailed in [1]), the methodology we advocate simulates failures in components by corrupting component interfaces. Perturbing component interfaces with anomalous events under simulated stress provides valuable information about how the system will react under real stress, i.e., after system release [2]. The survivability of the distributed system can be thought of as a prediction of the system's ability to tolerate component failures resulting from malicious and non-malicious anomalies within components and from external sources.

The assessment of distributed object system survivability can be partitioned into two types of analysis: (1) interface propagation analysis (IPA) and (2) survivability analysis. IPA uses fault perturbation functions to inject anomalous events into component interfaces and to observe how the corruption propagates across components [2]. Survivability analysis dynamically evaluates survivability assertions about a system subjected to fault perturbation in order to determine whether those assertions have been violated. Both analyses can be performed concurrently during dynamic execution analysis.

2.2 System-level survivability analysis

Survivability expresses a prediction that malicious or non-malicious anomalous events resulting from component behavior or external sources will not result in a loss of mission for a given system. If an incorrect output constitutes a loss of mission, then IPA can provide the survivability prediction [1]. However, if the survivability of a given system can be expressed as a set of properties relating to the software or the system, then survivability can be dynamically assessed through analysis.
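To make the two analyses concrete, the following Python sketch perturbs the output interface of one component in a toy three-stage system, records which downstream interfaces see a changed value, and evaluates a survivability assertion on the final output. Everything in it is an illustrative assumption and not the authors' tooling: the components `sensor`, `limiter`, and `controller`, the helper names `run`, `propagation`, and `survivable`, and the numeric bounds are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Component = Callable[[float], float]
Perturbation = Callable[[float], float]


# Hypothetical components; each return value stands for an output interface.
def sensor(x: float) -> float:
    return x * 2.0


def limiter(x: float) -> float:
    # Clamps its input, so it can mask some upstream corruption.
    return max(0.0, min(x, 100.0))


def controller(x: float) -> float:
    return x / 4.0


PIPELINE: List[Tuple[str, Component]] = [
    ("sensor", sensor),
    ("limiter", limiter),
    ("controller", controller),
]


def run(value: float,
        inject_at: Optional[str] = None,
        perturb: Optional[Perturbation] = None) -> Dict[str, float]:
    """Run the pipeline, optionally corrupting one component's output.

    Returns the value observed at every component interface.
    """
    observed: Dict[str, float] = {}
    for name, component in PIPELINE:
        value = component(value)
        if name == inject_at and perturb is not None:
            value = perturb(value)  # the simulated failure
        observed[name] = value
    return observed


def propagation(value: float, inject_at: str,
                perturb: Perturbation) -> List[str]:
    """Interfaces whose values differ between a clean and a perturbed run."""
    clean = run(value)
    faulty = run(value, inject_at, perturb)
    return [name for name in clean if clean[name] != faulty[name]]


def survivable(observed: Dict[str, float]) -> bool:
    """Hypothetical survivability assertion on the final system output."""
    return 0.0 <= observed["controller"] <= 25.0


def offset(v: float) -> float:
    """Hypothetical perturbation function: a large additive corruption."""
    return v + 1000.0
```

In this toy system, `propagation(60.0, "sensor", offset)` reports only the `sensor` interface, because `limiter` clamps both the clean and the corrupted value to the same bound and so masks the fault. Injecting the same corruption at `limiter` instead reaches `controller` and violates the assertion in `survivable`, which is the kind of subcomponent ranking the approach is meant to expose.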
The survivability properties that are expressed as a set of assertions or predicates can be assessed in two distinct environments: the component level and the system level. At the component level, survivability analysis will be able to capture inputs and outputs from a component under analysis. For example, if the difference between outputs of two subsequent cycles of a sampling control system is greater than a safe threshold amount, the predicate might be coded as:
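A minimal Python sketch of a predicate of this kind follows. The names `SAFE_THRESHOLD`, `threshold_violations`, and `survivability_predicate`, and the threshold value itself, are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator

# Hypothetical safe bound on the change between two successive outputs.
SAFE_THRESHOLD = 5.0


def threshold_violations(outputs: Iterable[float],
                         threshold: float = SAFE_THRESHOLD) -> Iterator[int]:
    """Yield the cycle index of every output that differs from the
    previous cycle's output by more than `threshold`."""
    previous = None
    for cycle, current in enumerate(outputs):
        if previous is not None and abs(current - previous) > threshold:
            yield cycle
        previous = current


def survivability_predicate(outputs: Iterable[float],
                            threshold: float = SAFE_THRESHOLD) -> bool:
    """True as long as no pair of successive outputs exceeds the threshold."""
    return next(threshold_violations(outputs, threshold), None) is None
```

For the sampled outputs `[10.0, 11.5, 12.0, 19.0, 19.5]`, the jump from 12.0 to 19.0 exceeds the assumed threshold, so `threshold_violations` yields cycle 3 and `survivability_predicate` returns `False`.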
This predicate computes a time-series function on successive outputs from a component to determine whether a safe threshold has been exceeded.

The second environment in which survivability properties can be assessed is the system level. System-level survivability analysis uses information collected from component-level analysis to determine whether the system-wide survivability properties have been violated. For example, if a survivable system computes system functions in a redundant voting architecture, a system-level survivability condition would require a majority of the system objects to be operating. Using this information, system-wide assertions about survivability readiness can be evaluated to determine whether the system as a whole is still in a survivable posture.

3. Conclusions

This abstract touches on assessing the survivability of a distributed object system in the presence of component failures. Our position is that behavioral assessment techniques such as fault injection are the best existing approaches for assessing the survivability of complex systems, especially when such systems are not formally specified and are generally untestable. A technical report expanding on these ideas can be found on the Web at http://www.rstcorp.com/papers.html.

Acknowledgment

The authors are currently being funded by DARPA/ITO in Information Survivability under contract F30602-95-C-0282. THE VIEWS AND CONCLUSIONS CONTAINED IN THIS DOCUMENT ARE THOSE OF THE AUTHORS AND SHOULD NOT BE INTERPRETED AS REPRESENTING THE OFFICIAL POLICIES, EITHER EXPRESSED OR IMPLIED, OF THE DEFENSE ADVANCED RESEARCH PROJECTS AGENCY OR THE U.S. GOVERNMENT.

References