SAF
Survivability Analysis Framework
Principal Investigators: Robert Ellison and
Carol Woody
Problem Addressed
Large systems, and particularly systems of systems, raise the importance of complexity
management. The complexity is an aggregate of technology, scale, scope, and operational
and organizational issues. While small system security may have been implemented by a set
of point solutions that mitigated specific threats, the mitigation of threats of the
magnitude and diversity of those associated with large distributed systems of systems
(SoS) requires foundational support.
Separation of concerns is a powerful tactic for managing complexity during design and
development. A software architecture may try to maintain separation among security,
performance, reliability, and other system quality attributes. However, it is the
visibility of these qualities within the operational context as the technology is used to
address an organizational need that is of most interest. We frequently have maintained
separation among system operations, systems development, and business operations, but
that separation was often reflected by the expression “toss it over the
wall.” This approach worked well as long as all requirements could be effectively
established in advance and evaluated prior to implementation. Business integration
requirements and the appearance of technologies such as web services to support that
integration for distributed systems challenge these traditional separations. Even
organizations with well-established processes are finding the complexity overwhelming. A
vast range of legacy technology and processes are being hooked together through bridges
of software and people without a thorough consideration of how these connections function
under stress and failure. Development is primarily looking at the individual pieces of
new functionality, operations is focusing on the infrastructure, and the gray area of
business process connectivity is largely ignored, thereby exposing organizations to
increased risk of operational failure.
We define survivability as the capability of a system to fulfill its mission, in a
timely manner, in the presence of attacks, failures, or accidents. Survivability
concentrates initially on the availability aspects of security but also incorporates
confidentiality, integrity, and reliability considerations. Availability must be focused
on the specific functions and services needed to satisfy a specific organizational
mission, which is increasingly dependent on multiple systems. Survivability concentrates
on the organizational activity supported by software systems rather than on the
individual systems.
This research initially focused on developing assurance analysis methods that are
applicable to systems of systems to address the challenge of increased demands for
interoperability, integration, and survivability. Having shown the value of the mission
focus for analyzing organizational and technology dependencies, this research effort has
expanded to address the need to analyze services, such as components of
a service-oriented architecture (SOA), and the integration of these shared services with
the organizational mission. In addition, work is under way to consider quality assurance
and to explore how an integrated view of mission and technology can support the
development of a quality assurance case.
Research Approach
Essential work processes increasingly span multiple systems that are geographically
distributed and independently managed. The individual systems are useful in their own
right, addressing a selected subset of organizational needs. The business demands for
adaptability and integration result in a mix of systems and work processes that are
constantly changing. Development is evolutionary as functions and purposes are added,
removed, and modified with experience. Completion of each individual system activity is
no longer sufficient to meet organizational needs, and the measures for success must
focus on the complete organizational mission, which extends beyond component systems.
Consider Figure 1, where each circle represents a geographically distributed system and
the blue and black lines are business processes that use those systems. The right side of
the figure expands one of those systems. For a military example, a circle might be a
specific Service system, whereas the work process might be a joint activity that requires
coordination across the Services. The specific Service system receives both joint and
Service-specific requests. A joint Service activity would likely generate a sequence of
actions similar to the actions generated for a Service-specific request.
Figure 1. System of Systems Resource Contention
We need to take two perspectives in analyzing that diagram: the end-to-end work
process and the individual systems. The is-used-by relationship is critical for the
system participants. A work process, especially in an SoS environment, could create usage
patterns that were not anticipated in the design of a specific system and hence could
adversely affect the operation of that system. An individual system may need to take a
defensive posture with respect to external requests to protect local resources. In
addition, failure of one piece will have an impact on the organizational mission that
cannot be evaluated within the context of the individual component.
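The two perspectives can be illustrated with a minimal sketch. It assumes the depends-on relationship is recorded as a mapping from each work process to the systems it uses. The process and system names, and the `invert` and `shared_systems` helpers, are hypothetical and are not part of the SAF itself. Inverting the mapping yields the is-used-by view, which shows the systems that several work processes share and where unanticipated usage patterns are most likely to appear.

```python
from collections import defaultdict

# Hypothetical depends-on data: work process -> systems it uses, in order.
depends_on = {
    "joint_request": ["service_a", "service_b", "service_c"],
    "service_a_request": ["service_a"],
    "service_b_request": ["service_b", "service_c"],
}


def invert(depends_on):
    """Derive is-used-by (system -> set of work processes) from depends-on."""
    used_by = defaultdict(set)
    for process, systems in depends_on.items():
        for system in systems:
            used_by[system].add(process)
    return dict(used_by)


def shared_systems(used_by):
    """Return systems used by more than one work process."""
    return {
        system: sorted(processes)
        for system, processes in used_by.items()
        if len(processes) > 1
    }


used_by = invert(depends_on)
contended = shared_systems(used_by)
for system, processes in sorted(contended.items()):
    print(f"{system} is used by: {', '.join(processes)}")
```

A shared system is only a candidate for contention. Whether the combined usage actually stresses that system still has to be judged against its design assumptions.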
The success of the end-to-end work process depends on the successful composition of
the individual process steps and an acceptable completion. The key relationship for the
work process is depends-on. We would like to assure the end-to-end behavior of a work
process, but the interoperability capabilities and failure conditions for each component
could drastically affect an acceptable outcome if that step is critical to mission
success and internal quality choices do not match mission quality needs. The work process
thread will need to be analyzed end to end and step by step to identify gaps that could
lead to survivability loss. Doing so requires the following detailed process thread
information: a description of work process success; expected work process quality
attributes such as performance and reliability; and scenarios of both expected and
unacceptable behavior, which includes the kinds of things that may go wrong and what will
happen should they occur. In addition, each work process to be analyzed must be
decomposed into required steps with the following types of information about each step:
roles in the process, preconditions, functions, postconditions, constraints, and
dependencies. Each step may be composed of multiple components (human, software, system,
and/or hardware) acting independently or in a coordinated manner.
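The process thread information listed above could be captured in a simple record structure. The sketch below is one possible encoding, not the SAF's own notation. The class names, field names, example data, and the `unmet_preconditions` check are all assumptions made for illustration. The check walks the thread step by step and reports any precondition that neither the starting conditions nor an earlier step's postconditions establish, which is one simple kind of gap.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Step:
    """One step of a work process (field names are illustrative)."""
    name: str
    roles: list[str]
    preconditions: set[str]
    functions: list[str]
    postconditions: set[str]
    constraints: list[str] = field(default_factory=list)
    dependencies: list[str] = field(default_factory=list)


@dataclass
class WorkProcess:
    """A work process thread and the information needed to analyze it."""
    name: str
    success_description: str
    quality_attributes: dict[str, str]
    scenarios: list[str]
    steps: list[Step]


def unmet_preconditions(process, initial_conditions=frozenset()):
    """Map each step name to preconditions that nothing earlier establishes."""
    established = set(initial_conditions)
    gaps = {}
    for step in process.steps:
        missing = step.preconditions - established
        if missing:
            gaps[step.name] = sorted(missing)
        established |= step.postconditions
    return gaps


# Hypothetical two-step thread used only to exercise the check.
thread = WorkProcess(
    name="joint request",
    success_description="request fulfilled within the agreed time",
    quality_attributes={"performance": "timely", "reliability": "high"},
    scenarios=["receiving system unavailable"],
    steps=[
        Step("submit", ["requester"], {"request drafted"},
             ["send request"], {"request sent"}),
        Step("approve", ["approver"],
             {"request sent", "requester authenticated"},
             ["review request"], {"request approved"}),
    ],
)

gaps = unmet_preconditions(thread, initial_conditions={"request drafted"})
print(gaps)
```

In this example the "approve" step assumes an authenticated requester that no earlier step provides, so it is flagged for closer review.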
Systems and systems of systems can create failure states that are difficult to solve.
Historically, system failure analysis has sought to identify a single root cause, but for
software-intensive systems that involve human interactions, a failure may be the result of
multiple software, hardware, or human errors, each of which would be perceived as minor
when considered individually. Other failures may arise because of emergent behavior: each
system behaves as specified, but the collective behavior is unacceptable. For example,
feedback among systems might generate unexpected resource contention. At this stage, our
research considers the stresses that might be induced by a work process thread. We
initially focus on the interactions among the systems that participate in that thread and
the stresses that might be induced by those interactions on the supporting systems. The
stress types include
- Interaction (data): missing, inconsistent, incorrect, unexpected, incomplete, unintelligible, out of date, duplicate
- Resource: insufficient, unavailable, excessive, latency, inappropriate, interrupted
- People: information overload, analysis paralysis, fog of war, distraction (rubbernecking), selective focus (only looking for information for positive reinforcement), diffusion of responsibility, spurious correlations
The scenarios of potential problems, especially those with anticipated high impact,
will be used to narrow each stress type to the subset of issues of greatest
interest to the mission thread stakeholders. For each type of stress, the
analysis framework will be applied to identify what is currently in place, what should be
in place, and expected step and/or component behavior should survivability be affected.
The analysis framework will be applied at a specific point in time to a selected example
mission thread. In order to analyze the change in risk over time, an assessment is needed
for the existing work process to establish a baseline of current risk.
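As a rough illustration of this per-stress analysis, the sketch below records, for one step and one stress, what is currently in place and what should be in place, and reports the difference as a baseline. The `StressType` categories follow the list above. The `StressAssessment` record, the `baseline` function, and the example mitigations are hypothetical and are not taken from the SAF.

```python
from dataclasses import dataclass
from enum import Enum


class StressType(Enum):
    INTERACTION = "interaction (data)"
    RESOURCE = "resource"
    PEOPLE = "people"


@dataclass(frozen=True)
class StressAssessment:
    """Assessment of one stress on one step at a specific point in time."""
    step: str
    stress_type: StressType
    stress: str                    # e.g. "out of date", "unavailable"
    in_place: frozenset            # mitigations that exist today
    should_be_in_place: frozenset  # mitigations the mission needs

    def gap(self):
        """Mitigations that are needed but not yet in place."""
        return self.should_be_in_place - self.in_place


def baseline(assessments):
    """Collect the open gaps, keyed by (step, stress type, stress)."""
    return {
        (a.step, a.stress_type.name, a.stress): sorted(a.gap())
        for a in assessments
        if a.gap()
    }


# Hypothetical assessments for a single step of a mission thread.
assessments = [
    StressAssessment(
        "approve", StressType.INTERACTION, "out of date",
        frozenset({"timestamp on request"}),
        frozenset({"timestamp on request", "staleness check"}),
    ),
    StressAssessment(
        "approve", StressType.RESOURCE, "unavailable",
        frozenset({"backup approver"}),
        frozenset({"backup approver"}),
    ),
]

current_risk = baseline(assessments)
print(current_risk)
```

Repeating the same assessment later and comparing the two results would give a crude view of how the gaps change over time.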
Survivability concentrates on what can go wrong. The issues considered by the SAF
analysis are shown in Figure 2.
Figure 2. SAF Analysis
Expected Benefits
The expansion of the scope and scale of systems induces new stresses. An objective of
the initial phase of this project is to identify indicators of stress that may lead to
system failures. Indicators that are appropriate to the early life-cycle phases of
software development help to change current practice, whereas software failure analysis
typically concentrates only on the errors that are derived from testing. The goal is to
generate a sufficient number of examples so that patterns emerge. A pattern, for example,
may represent ways to reduce complexity by consolidating risk mitigations.
The Survivability Analysis Framework (SAF), with its emphasis on business process
threads, also enables better traceability between technology risks and business work
processes. It may also enable better traceability of the design decisions to the
requirements of multiple organizational levels.
2007 Accomplishments
The SAF was applied to two additional pilot applications beyond the initial work in
2006. One pilot was within the DoD and the second in a large, non-DoD federal agency. The
DoD project considered the challenges of information assurance (IA) across a mission
thread, looking at ways to appropriately characterize the impact of IA decisions on the
organizational mission. The non-DoD pilot evaluated the impact of technology choices made
in development on existing organizational processes for alpha and beta test sites. In
addition, SAF concepts were presented to researchers and practitioners at the following
conferences: System and Software Technical Conference, Computer Security Institute
Conference, International Conference on Commercial Off-the-Shelf (COTS)-Based Software
Systems, and the Homeland Security: Research * Innovation * Transition Conference.
An example assurance case for security was developed with support from researchers
knowledgeable about safety and reliability assurance.
2008 Plans
A description of SAF and the example security assurance case will be published in a
technical note in the second quarter of 2008. The objective for further research is to
evaluate ways in which the development of SAF information can contribute to an
understanding of assurance for software, systems, and information and influence tradeoff
decisions that impact mission quality early in the design and development processes.
Pilot engagements will be selected that allow the consideration of organizational and
technology options early in the system development life cycle to identify ways to
influence the quality tradeoff choices for appropriate consideration of survivability
risk and realistic usage.
Last updated May 10, 2007