CERT
 

SAF

Survivability Analysis Framework

Principal Investigators: Robert Ellison and Carol Woody

Problem Addressed
Large systems and particularly systems of systems raise the importance of complexity management. The complexity is an aggregate of technology, scale, scope, and operational and organizational issues. While small system security may have been implemented by a set of point solutions that mitigated specific threats, the mitigation of threats of the magnitude and diversity of those associated with large distributed systems of systems (SoS) requires foundational support.

Separation of concerns is a powerful tactic for managing complexity during design and development. A software architecture may try to maintain separation among security, performance, reliability, and other system quality attributes. However, it is the visibility of these qualities within the operational context as the technology is used to address an organizational need that is of most interest. We frequently have maintained separation among system operations, systems development, and business operations, but that separation was often reflected by the expression “toss it over the wall.” This approach worked well as long as all requirements could be effectively established in advance and evaluated prior to implementation. Business integration requirements and the appearance of technologies such as web services to support that integration for distributed systems challenge these traditional separations. Even organizations with well-established processes are finding the complexity overwhelming. A vast range of legacy technology and processes are being hooked together through bridges of software and people without a thorough consideration of how these connections function under stress and failure. Development is primarily looking at the individual pieces of new functionality, operations is focusing on the infrastructure, and the gray area of business process connectivity is largely ignored, thereby exposing organizations to increased risk of operational failure.

We define survivability as the capability of a system to fulfill its mission, in a timely manner, in the presence of attacks, failures, or accidents. Survivability concentrates initially on the availability aspects of security but also incorporates confidentiality, integrity, and reliability considerations. Availability must be focused on the specific functions and services needed to satisfy a specific organizational mission, which is increasingly dependent on multiple systems. Survivability concentrates on the organizational activity supported by software systems rather than on the individual systems.

This research initially focused on developing assurance analysis methods that are applicable to systems of systems to address the challenge of increased demands for interoperability, integration, and survivability. Having shown the value of the mission focus for analyzing organizational and technology dependencies, this research effort has expanded to address the need for analytical capability for services, such as components of a service-oriented architecture (SOA), and the integration of these shared services with the organizational mission. In addition, we are considering quality assurance and exploring ways in which an integrated view of mission and technology can support the development of a quality assurance case.

Research Approach
Essential work processes increasingly span multiple systems that are geographically distributed and independently managed. The individual systems are useful in their own right, addressing a selected subset of organizational needs. The business demands for adaptability and integration result in a mix of systems and work processes that are constantly changing. Development is evolutionary as functions and purposes are added, removed, and modified with experience. Completion of each individual system activity is no longer sufficient to meet organizational needs, and the measures for success must focus on the complete organizational mission, which extends beyond component systems.

Consider Figure 1, where each circle represents a geographically distributed system and the blue and black lines are business processes that use those systems. The right side of the figure expands one of those systems. For a military example, a circle might be a specific Service system, whereas the work process might be a joint activity that requires coordination across the Services. The specific Service system receives both joint and Service-specific requests. A joint Service activity would likely generate a sequence of actions similar to the actions generated for a Service-specific request.

Figure 1. System of Systems Resource Contention

We need to take two perspectives in analyzing that diagram: the end-to-end work process and the individual systems. The is-used-by relationship is critical for the system participants. A work process, especially in an SoS environment, could create usage patterns that were not anticipated in the design of a specific system and hence could adversely affect the operation of that system. An individual system may need to take a defensive posture with respect to external requests to protect local resources. In addition, failure of one piece will have an impact on the organizational mission that cannot be evaluated within the context of the individual component.
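The two perspectives can be illustrated with a small sketch. It assumes the depends-on relationship is recorded as a mapping from each work process to the systems it touches; the is-used-by view is then the inverse mapping, and any system used by more than one work process is a candidate for the resource contention shown in Figure 1. All process and system names below are hypothetical and are not taken from the SAF pilots.

```python
from collections import defaultdict

# Hypothetical example data: work process -> systems it depends on.
depends_on = {
    "joint_request": ["service_a", "service_b", "logistics"],
    "service_a_request": ["service_a"],
    "service_b_request": ["service_b", "logistics"],
    "service_c_request": ["archive"],
}


def invert(depends_on):
    """Derive is-used-by (system -> work processes) from depends-on."""
    used_by = defaultdict(set)
    for process, systems in depends_on.items():
        for system in systems:
            used_by[system].add(process)
    return dict(used_by)


def shared_systems(used_by):
    """Systems used by more than one work process (contention candidates)."""
    return sorted(s for s, procs in used_by.items() if len(procs) > 1)


used_by = invert(depends_on)
print(shared_systems(used_by))  # ['logistics', 'service_a', 'service_b']
```

A shared system is only a candidate: whether contention actually occurs depends on the usage patterns each work process generates, which this sketch does not model.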

The success of the end-to-end work process depends on the successful composition of the individual process steps and an acceptable completion. The key relationship for the work process is depends-on. We would like to assure the end-to-end behavior of a work process, but the interoperability capabilities and failure conditions of each component could drastically affect the outcome if the step that the component supports is critical to mission success and its internal quality choices do not match mission quality needs. The work process thread will need to be analyzed end to end and step by step to identify gaps that could lead to survivability loss. Doing so requires the following detailed process thread information: a description of work process success; expected work process quality attributes such as performance and reliability; and scenarios of both expected and unacceptable behavior, which include the kinds of things that may go wrong and what will happen should they occur. In addition, each work process to be analyzed must be decomposed into required steps with the following types of information about each step: roles in the process, preconditions, functions, postconditions, constraints, and dependencies. Each step may be composed of multiple components (human, software, system, and/or hardware) acting independently or in a coordinated manner.
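One way to picture the thread and step information listed above is as a pair of record types. The sketch below is only an illustration: the class names, the helper functions, and the example thread are hypothetical, and the rule that a precondition must match a postcondition of an earlier step is a simplifying assumption, not part of the SAF description.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Step:
    """One step of a work process, with the per-step information types."""
    name: str
    roles: list = field(default_factory=list)
    preconditions: list = field(default_factory=list)
    functions: list = field(default_factory=list)
    postconditions: list = field(default_factory=list)
    constraints: list = field(default_factory=list)
    dependencies: list = field(default_factory=list)


@dataclass
class WorkProcessThread:
    """Thread-level information: success, quality attributes, scenarios."""
    name: str
    success_description: str
    quality_attributes: dict
    scenarios: list
    steps: list


def undocumented(step):
    """Names of the information types that are still empty for a step."""
    return [f.name for f in fields(step)
            if f.name != "name" and not getattr(step, f.name)]


def unmet_preconditions(thread, initial=()):
    """(step, precondition) pairs that no earlier postcondition establishes.

    Assumption: conditions are matched by exact string equality.
    """
    established = set(initial)
    gaps = []
    for step in thread.steps:
        gaps.extend((step.name, p) for p in step.preconditions
                    if p not in established)
        established.update(step.postconditions)
    return gaps


# Hypothetical example thread.
thread = WorkProcessThread(
    name="joint request",
    success_description="request fulfilled within the agreed time",
    quality_attributes={"performance": "timely", "reliability": "high"},
    scenarios=["approver unreachable during peak load"],
    steps=[
        Step("submit", roles=["requester"], functions=["create request"],
             postconditions=["request created"]),
        Step("approve", roles=["approver"],
             preconditions=["request created", "approver available"],
             functions=["review request"],
             postconditions=["request approved"]),
    ],
)

print(unmet_preconditions(thread))    # [('approve', 'approver available')]
print(undocumented(thread.steps[0]))  # ['preconditions', 'constraints', 'dependencies']
```

Both helpers only flag places where the thread description is incomplete or inconsistent; judging whether a flagged gap threatens survivability remains an analyst's decision.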

Systems and systems of systems can create failure states that are difficult to solve. Historically, system failure analysis has sought to identify a single root cause, but for software-intensive systems that involve human interactions a failure may be the result of multiple software, hardware, or human errors. Each error when considered individually would be perceived as minor. Other failures may arise because of emergent behavior. Each system behaves as specified, but the collective behavior is unacceptable. For example, feedback among systems might generate unexpected resource contention. At this stage, our research considers the stresses that might be induced by a work process thread. We initially focus on the interactions among the systems that participate in that thread and the stresses that might be induced by those interactions on the supporting systems. The stress types include

  • Interaction (data): missing, inconsistent, incorrect, unexpected, incomplete, unintelligible, out of date, duplicate
  • Resource: insufficient, unavailable, excessive, latency, inappropriate, interrupted
  • People: information overload, analysis paralysis, fog of war, distraction (rubbernecking), selective focus (only looking for information for positive reinforcement), diffusion of responsibility, spurious correlations

The scenarios of potential problems, especially those with anticipated high impact, will be used to narrow each stress type to a subset of issues of high interest to the mission thread stakeholders. For each type of stress, the analysis framework will be applied to identify what is currently in place, what should be in place, and the expected step and/or component behavior should survivability be affected. The analysis framework will be applied at a specific point in time to a selected example mission thread. To analyze the change in risk over time, the existing work process must first be assessed to establish a baseline of current risk.
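The stress types and the three questions asked for each of them (what is in place, what should be in place, and the expected behavior) could be recorded as follows. This is a minimal sketch under stated assumptions: the record layout, the set-difference notion of a gap, and the example entry are all hypothetical; only the stress categories and their values come from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class StressType(Enum):
    INTERACTION = "interaction (data)"
    RESOURCE = "resource"
    PEOPLE = "people"


# Stress values per type, as enumerated in the list above.
STRESSES = {
    StressType.INTERACTION: [
        "missing", "inconsistent", "incorrect", "unexpected",
        "incomplete", "unintelligible", "out of date", "duplicate"],
    StressType.RESOURCE: [
        "insufficient", "unavailable", "excessive", "latency",
        "inappropriate", "interrupted"],
    StressType.PEOPLE: [
        "information overload", "analysis paralysis", "fog of war",
        "distraction", "selective focus", "diffusion of responsibility",
        "spurious correlations"],
}


@dataclass
class StressAssessment:
    """Hypothetical record of one stress applied to one step."""
    step: str
    stress_type: StressType
    stress: str
    in_place: set            # what is currently in place
    should_be_in_place: set  # what should be in place
    expected_behavior: str   # behavior should survivability be affected

    def __post_init__(self):
        if self.stress not in STRESSES[self.stress_type]:
            raise ValueError(
                f"{self.stress!r} is not a {self.stress_type.value} stress")

    def gap(self):
        """Assumed gap measure: needed measures that are not yet in place."""
        return self.should_be_in_place - self.in_place


# Hypothetical example entry.
a = StressAssessment(
    step="approve request",
    stress_type=StressType.INTERACTION,
    stress="out of date",
    in_place={"timestamp on each message"},
    should_be_in_place={"timestamp on each message",
                        "reject messages older than limit"},
    expected_behavior="a stale request is approved without notice",
)
print(sorted(a.gap()))  # ['reject messages older than limit']
```

Recording the same entries at a later point in time and comparing the gaps is one simple way to express the change in risk against the baseline.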

Survivability concentrates on what can go wrong. The issues considered by the SAF analysis are shown in Figure 2.

Figure 2. SAF Analysis

Expected Benefits
The expansion of the scope and scale of systems induces new stresses. An objective of the initial phase of this project is to identify indicators of stress that may lead to system failures. Indicators that are appropriate to the early life-cycle phases of software development help to change current practice, whereas software failure analysis typically concentrates only on the errors that are derived from testing. The goal is to generate a sufficient number of examples so that patterns emerge. A pattern, for example, may represent ways to reduce complexity by consolidating risk mitigations.
The Survivability Analysis Framework (SAF), with its emphasis on business process threads, also enables better traceability between technology risks and business work processes. It may also enable better traceability of the design decisions to the requirements of multiple organizational levels.

2007 Accomplishments
The SAF was applied to two additional pilot applications beyond the initial work in 2006. One pilot was within the DoD and the second in a large, non-DoD federal agency. The DoD project considered the challenges of information assurance (IA) across a mission thread, looking at ways to appropriately characterize the impact of IA decisions on the organizational mission. The non-DoD pilot evaluated the impact of technology choices made in development on existing organizational processes for alpha and beta test sites. In addition, SAF concepts were presented to researchers and practitioners at the following conferences: System and Software Technical Conference, Computer Security Institute Conference, International Conference on Commercial Off-the-Shelf (COTS)-Based Software Systems, and the Homeland Security: Research * Innovation * Transition Conference.
An example assurance case for security was developed with support from researchers knowledgeable in safety and reliability assurance.

2008 Plans
A description of SAF and the example security assurance case will be published in a technical note in the second quarter of 2008. The objective for further research is to evaluate ways in which the development of SAF information can contribute to an understanding of assurance for software, systems, and information and influence tradeoff decisions that impact mission quality early in the design and development processes. Pilot engagements will be selected that allow the consideration of organizational and technology options early in the system development life cycle to identify ways to influence the quality tradeoff choices for appropriate consideration of survivability risk and realistic usage.



Last updated May 10, 2007