CERT
Survivable Network Systems: An Emerging Discipline


Abstract. Society is growing increasingly dependent upon large-scale, highly distributed systems that operate in unbounded network environments. Unbounded networks, such as the Internet, have no central administrative control and no unified security policy. Furthermore, the number and nature of the nodes connected to such networks cannot be fully known. Despite the best efforts of security practitioners, no amount of system hardening can assure that a system that is connected to an unbounded network will be invulnerable to attack. The discipline of survivability can help ensure that such systems can deliver essential services and maintain essential properties such as integrity, confidentiality, and performance, despite the presence of intrusions. Unlike the traditional security measures that require central control or administration, survivability is intended to address unbounded network environments. This report describes the survivability approach to helping assure that a system that must operate in an unbounded network is robust in the presence of attack and will survive attacks that result in successful intrusions. Included are discussions of survivability as an integrated engineering framework, the current state of survivability practice, the specification of survivability requirements, strategies for achieving survivability, and techniques and processes for analyzing survivability.

1. Survivability in Network Systems

Contemporary large-scale networked systems that are highly distributed improve the efficiency and effectiveness of organizations by permitting whole new levels of organizational integration. However, such integration is accompanied by elevated risks of intrusion and compromise. These risks can be mitigated by incorporating survivability capabilities into an organization’s systems. As an emerging discipline, survivability builds on related fields of study (e.g., security, fault tolerance, safety, reliability, reuse, performance, verification, and testing) and introduces new concepts and principles. Survivability focuses on preserving essential services in unbounded environments, even when systems in such environments are penetrated and compromised [Anderson 97].

1.1 The New Network Paradigm: Organizational Integration

From their modest beginnings some 20 years ago, computer networks have become a critical element of modern society. These networks not only have global reach, they also have an impact on virtually every aspect of human endeavor. Network systems are principal enabling agents in business, industry, government, and defense. Major economic sectors, including defense, energy, transportation, telecommunications, manufacturing, financial services, health care, and education, all depend on a vast array of networks operating on local, national, and global scales. This pervasive societal dependency on networks magnifies the consequences of intrusions, accidents, and failures, and amplifies the critical importance of ensuring network survivability.

As organizations seek to improve efficiency and competitiveness, a new network paradigm is emerging. Networks are being used to achieve radical new levels of organizational integration. This integration obliterates traditional organizational boundaries and transforms local operations into components of comprehensive, network-resident business processes. For example, commercial organizations are integrating operations with business units, suppliers, and customers through large-scale networks that enhance communication and services. These networks combine previously fragmented operations into coherent processes open to many organizational participants. This new paradigm represents a shift from bounded networks with central control to unbounded networks. Unbounded networks are characterized by distributed administrative control without central authority, limited visibility beyond the boundaries of local administration, and lack of complete information about the network. At the same time, organizational dependencies on networks are increasing and risks and consequences of intrusions and compromises are amplified.

1.2 The Definition of Survivability

We define survivability as the capability of a system to fulfill its mission, in a timely manner, in the presence of attacks, failures, or accidents. We use the term system in the broadest possible sense, including networks and large-scale systems of systems.

The term mission refers to a set of very high-level (i.e., abstract) requirements or goals. Missions are not limited to military settings since any successful organization or project must have a vision of its objectives whether expressed implicitly or as a formal mission statement. Judgements as to whether or not a mission has been successfully fulfilled are typically made in the context of external conditions that may affect the achievement of that mission. For example, assume that a financial system shuts down for 12 hours during a period of widespread power outages caused by a hurricane. If the system preserves the integrity and confidentiality of its data and resumes its essential services after the period of environmental stress is over, the system can reasonably be judged to have fulfilled its mission. However, if the same system shuts down unexpectedly for 12 hours under normal conditions (or under relatively minor environmental stress) and deprives its users of essential financial services, the system can reasonably be judged to have failed its mission, even if data integrity and confidentiality are preserved.

Timeliness is a critical factor that is typically included in (or implied by) the very high-level requirements that define a mission. However, timeliness is such an important factor that we included it explicitly in the definition of survivability.

The terms attack, failure, and accident are meant to include all potentially damaging events; but these terms do not partition these events into mutually exclusive or even distinguishable sets. It is often difficult to determine if a particular detrimental event is the result of a malicious attack, a failure of a component, or an accident. Even if the cause is eventually determined, the critical immediate response cannot depend on such speculative future knowledge.

Attacks are potentially damaging events orchestrated by an intelligent adversary. Attacks include intrusions, probes, and denial of service. Moreover, the threat of an attack may have as severe an impact on a system as an actual occurrence. A system that assumes a defensive position because of the threat of an attack may reduce its functionality and divert additional resources to monitoring the environment and protecting system assets.

We include failures and accidents as part of survivability. Failures are potentially damaging events caused by deficiencies in the system or in an external element on which the system depends. Failures may be due to software design errors, hardware degradation, human errors, or corrupted data. Accidents describe the broad range of randomly occurring and potentially damaging events such as natural disasters. We tend to think of accidents as externally generated events (i.e., outside the system) and failures as internally generated events.

With respect to system survivability, a distinction between a failure and an accident is less important than the impact of the event. Nor is it often possible to distinguish between intelligently orchestrated attacks and unintentional or randomly occurring detrimental events. Our approach concentrates on the effect of a potentially damaging event. Typically, for a system to survive, it must react to (and recover from) a damaging effect (e.g., the integrity of a database is compromised) long before the underlying cause is identified. In fact, the reaction and recovery must be successful whether or not the cause is ever determined.

It is important to note that our primary focus in this report is on helping systems survive the acts of intelligent adversaries. This bias is based on the nature of the organization to which the authors belong. Our Survivable Network Technology Team is an outgrowth of the CERT Coordination Center, which has been helping users respond to and recover from computer security incidents since 1988.

Finally, it is important to recognize that it is the mission fulfillment that must survive, not any particular subsystem or system component. Central to the notion of survivability is the capability of a system to fulfill its mission, even if significant portions of the system are damaged or destroyed. We will sometimes use the term survivable system as a less than perfectly precise shorthand for a system with the capability to fulfill a specified mission in the face of attacks, failures, or accidents. Again, it is the mission, not a particular portion of the system, that must survive.

1.3 The Domain of Survivability: Unbounded Networks

The success of a survivable system depends on the computing environment in which the survivable system operates. The trend in networked computing environments is towards largely unbounded network infrastructures. A bounded system is one in which all of the system’s parts are controlled by a unified administration and can be completely characterized and controlled. At least in theory, the behavior of a bounded system can be understood and all of its various parts identified. In an unbounded system there is no unified administrative control over its parts. We use the term administrative control in the strictest sense, which includes the power to impose and enforce sanctions and not simply to recommend an appropriate security policy. In an unbounded system, each participant has an incomplete view of the whole, must depend on and trust information supplied by its neighbors, and cannot exercise control outside its local domain.

An unbounded system can be composed of bounded and unbounded systems connected together in a network. Figure 1 illustrates an unbounded domain consisting of a collection of bounded systems in which each bounded system is under separate administrative control. Although the security policy of an individual bounded system cannot be fully enforced outside of the boundaries of its administrative control, the policy can be used as a yardstick to evaluate the security state of that bounded system. Of course, the security policy can be advertised outside of the bounded system, but administrators are severely limited in their ability to compel or persuade outside individuals or entities to follow it. This limitation is particularly true when an unbounded domain spans jurisdictional boundaries, making legal sanctions difficult or impossible to impose.

Figure 1: An Unbounded Domain Viewed as a Collection of Bounded Systems

 

When an application or software-intensive system is exposed to an environment consisting of multiple, unpredictable administrative domains with no measurable bounds, the system has an unbounded environment. An unbounded environment exhibits the following properties:

  • multiple administrative domains with no central authority
  • an absence of global visibility (i.e., the number and nature of the nodes in the network cannot be fully known)
  • interoperability between administrative domains determined by convention
  • widely distributed and interoperable systems
  • users and attackers that can be peers in the environment
  • an environment that cannot be partitioned into a finite number of bounded environments

The Internet is an example of an unbounded environment with many client-server network applications. A public Web server and its clients may exist within many different administrative domains on the Internet, yet there exists no central authority that requires all clients to be configured in a way expected by the Web server. In particular, a Web server can never rely on a set of client plug-ins to be present or absent for any function that the server may want to provide.

For a Web server providing a financial transaction (e.g., for a Web-based purchase), the Web server may require that the user install a plug-in on the client to support a secure transaction. However, due to the unbounded nature of the environment, previously installed plug-ins from a competitor may be present on the client that may corrupt, subvert, or damage the Web server during the transaction. For the Web server to be survivable, there must be built-in protection from malicious client interactions and these protections must make no assumptions about the configuration or features of the remote client.
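The requirement that protections make no assumptions about the remote client can be illustrated with a minimal Python sketch of server-side validation. The item catalog, the quantity limit, and the function name `validate_order` are hypothetical and chosen only for illustration; the point is that every field is checked on the server and the total is recomputed from server-side data, whatever the client claims to have sent.

```python
from decimal import Decimal

# Hypothetical catalog and limit, assumed for this sketch only.
KNOWN_ITEMS = {"book-001": Decimal("12.50"), "pen-002": Decimal("1.20")}
MAX_QUANTITY = 100


def validate_order(form: dict) -> tuple[str, int, Decimal]:
    """Validate an untrusted purchase request entirely on the server.

    No client-supplied value is trusted: the item must be known, the
    quantity must be a sane integer, and any price sent by the client
    is ignored in favor of the server's own catalog.
    """
    item = form.get("item")
    if not isinstance(item, str) or item not in KNOWN_ITEMS:
        raise ValueError("unknown item")
    try:
        quantity = int(form.get("quantity", ""))
    except (TypeError, ValueError):
        raise ValueError("quantity is not an integer") from None
    if not 1 <= quantity <= MAX_QUANTITY:
        raise ValueError("quantity out of range")
    total = KNOWN_ITEMS[item] * quantity  # recomputed, never taken from the client
    return item, quantity, total
```

A request that carries a forged price field, for example `{"item": "book-001", "quantity": "2", "price": "0.01"}`, is still charged the catalog price because the forged field is never read.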

In this example, the Web server and its clients make up the system. The multiple administrative domains are the variety of site domains on the Internet. Many of these domains have legitimate users. Other sites are used for intrusions in an anonymous setting. These latter sites cannot be distinguished by their administrative domain, but only by client behavior. The interoperability between the server and its clients is defined by HTTP (Hypertext Transfer Protocol), a convention agreed upon between the server and clients. The system, composed of Web servers and clients, is widely distributed both geographically and logically throughout the Internet. Legitimate users and attackers are peers in the environment and there is no method to isolate legitimate users from the attackers. In other words, there is no way to bound the environment to legitimate users using only a common administrative policy.

Unbounded systems are a significant component of today's computing environment and will play an even larger role in the future. The Internet--a non-hierarchical network of systems, each under local administrative control only--is a primary example of an unbounded system. While conventions exist that allow the parts of the Internet to work together, there is no global administrative control to assure that these parts behave according to these conventions. Therefore, security problems abound. Unfortunately, the security problems associated with unbounded systems are typically underestimated.

1.4 Characteristics of Survivable Systems

A key characteristic of survivable systems is their capability to deliver essential services in the face of attack, failure, or accident.

Central to the delivery of essential services is the capability of a system to maintain essential properties (i.e., specified levels of integrity, confidentiality, performance, and other quality attributes) in the presence of attack, failure, or accident. Thus, it is important to define minimum levels of quality attributes that must be associated with essential services. For example, a launch of a missile by a defensive system is no longer effective if the system performance is slowed to the point that the target is out of range before the system can launch.

These quality attributes are so important that definitions of survivability are often expressed in terms of maintaining a balance among multiple quality attributes such as performance, security, reliability, availability, fault-tolerance, modifiability, and affordability. The Attribute Tradeoff Analysis project at the Software Engineering Institute is using this attribute-balancing (i.e., tradeoff) view of survivability to evaluate and synthesize survivable systems [Kazman 97]. Quality attributes represent broad categories of related requirements, so a quality attribute may contain other quality attributes. For example, the security attribute traditionally includes the three attributes: availability, integrity, and confidentiality.

The capability to deliver essential services (and maintain the associated essential properties) must be sustained even if a significant portion of the system is incapacitated. Furthermore, this capability should not be dependent upon the survival of a specific information resource, computation, or communication link. In a military setting, essential services might be those required to maintain an overwhelming technical superiority, and essential properties may include integrity, confidentiality, and a level of performance sufficient to deliver results in less than one decision cycle of the enemy. In the public sector, a survivable financial system is one that maintains the integrity, confidentiality, and availability of essential information and financial services, even if particular nodes or communication links are incapacitated through intrusion or accident, and that recovers compromised information and services in a timely manner. The financial system’s survivability might be judged by using a composite measure of the disruption of stock trades or bank transactions (i.e., a measure of the disruption of essential services).

Key to the concept of survivability, then, is identifying the essential services (and the essential properties that support them) within an operational system. Essential services are defined as the functions of the system that must be maintained when the environment is hostile or failures or accidents are detected that threaten the system. There are typically many services that can be temporarily suspended when a system is dealing with an attack or other extraordinary environmental condition. Such a suspension can help isolate areas affected by an intrusion and free system resources to deal with its effects. The overall function of a system should adapt to preserve essential services.
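The idea of suspending non-essential services so that essential ones keep running can be sketched in a few lines of Python. The `Service` record and the function `enter_degraded_mode` are hypothetical names introduced for illustration; a real system would also have to decide when to enter and leave this mode, which the sketch does not address.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    essential: bool
    running: bool = True


def enter_degraded_mode(services: list[Service]) -> list[str]:
    """Suspend every running non-essential service.

    Returns the names of the services that were suspended, so that
    full service can be restored later as conditions permit.
    """
    suspended = []
    for svc in services:
        if not svc.essential and svc.running:
            svc.running = False
            suspended.append(svc.name)
    return suspended
```

For a financial system, transaction processing might be marked essential while report generation is not; calling `enter_degraded_mode` then frees the resources used by reporting without touching transaction processing.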

We have linked the capability of a survivable system to fulfill its mission in a timely manner to its ability to deliver essential services in the presence of attack, accident, or failure. Ultimately, mission fulfillment must survive, not any portion or component of the system. If an essential service is lost, it can be replaced by another service that supports mission fulfillment in a different but equivalent way. However, we still believe that the identification and protection of essential services is an important part of a practical approach to building and analyzing survivable systems. As a result, we define essential services to include alternate sets of essential services (perhaps mutually exclusive) that need not be simultaneously available. For example, a set of essential services to support power delivery may include both the distribution of electricity and the operation of a natural gas pipeline.

To maintain their capabilities to deliver essential services, survivable systems must exhibit the four key properties illustrated in Table 1:

| Key Property | Description | Example |
|---|---|---|
| Resistance to attacks | strategies for repelling attacks | user authentication; stochastic diversity of programs |
| Recognition of attacks and the extent of damage | strategies for detecting attacks (including intrusions) and understanding the current state of the system, including evaluating the extent of damage | recognition of intrusion usage patterns; internal integrity checking |
| Recovery of full and essential services after attack | strategies for restoring compromised information or functionality, limiting the extent of damage, maintaining or, if necessary, restoring essential services within the time constraints of the mission, restoring full service as conditions permit | replication and reinitialization of data |
| Adaptation and evolution to reduce effectiveness of future attacks | strategies for improving system survivability based on knowledge gained from intrusions | incorporation of new patterns for intrusion recognition |

Table 1: The Key Properties of Survivable Systems
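Two of the examples in Table 1, internal integrity checking (recognition) and replication of data (recovery), can be combined in a small Python sketch. It is only an illustration under stated assumptions: the baseline digests and the replica are assumed to be kept where an intruder cannot modify them, and the function names are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 digest of a data item, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def find_damaged(store: dict[str, bytes], baseline: dict[str, str]) -> list[str]:
    """Recognition: list items that are missing or whose digest changed."""
    return sorted(
        name
        for name, expected in baseline.items()
        if name not in store or digest(store[name]) != expected
    )


def recover(store: dict[str, bytes], replica: dict[str, bytes], damaged: list[str]) -> None:
    """Recovery: reinitialize damaged items from a replica assumed to be trusted."""
    for name in damaged:
        store[name] = replica[name]
```

Note that the check reports the effect (an item no longer matches its baseline) without needing to know whether the cause was an attack, a failure, or an accident.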

1.5 Survivability as an Integrated Engineering Framework

As a broadly-based engineering paradigm, survivability is a natural framework for integrating established and emerging software engineering disciplines in the service of a common goal. These established areas of software engineering, which are related to survivability, include security, fault tolerance, safety, reliability, reuse, performance, verification, and testing. Research in survivability encompasses a wide variety of research methods, including the investigation of

  • analogs to the immunological functioning of an individual organism
  • sociological analogs to public health efforts at the community level

1.5.1 Survivability and Security

The discipline of computer security has made valuable contributions to the protection and integrity of information systems over the past three decades. However, computer security has traditionally been used as a binary term that suggests that at any moment in time a system is either safe or compromised. We believe that this use of computer security engenders viewpoints that largely ignore the aspects of recovery from the compromise of a system and aspects of maintaining services during and after an intrusion. Such an approach is inadequate to support necessary improvements in the state of the practice of protecting computer systems from attack. In contrast, the term survivable systems refers to systems whose components collectively accomplish their mission even under attack and despite active intrusions that effectively damage a significant portion of the system.

Robustness under attack is at least as important as hardness or resistance to attack. Hardness contributes to survivability, but robustness under attack (and, in particular, recoverability) is the essential characteristic that distinguishes survivability from traditional computer security. At the same time, survivability can benefit from computer security research and practice, and survivability can provide a framework for integrating security with other disciplines that can contribute to system survivability.

1.5.2 Survivability and Fault Tolerance

Survivability requires robustness under conditions of intrusion, failure, or accident. The concept of survivability includes fault tolerance, but is not equivalent to it. Fault tolerance relates to the statistical probability of an accidental fault or combination of faults, not to malicious attack. For example, an analysis of a system may determine that the simultaneous occurrence of the three statistically independent faults (f1, f2, and f3) will cause the system to fail. The probability of the three independent faults occurring simultaneously by accident may be extremely small, but an intelligent adversary with knowledge of the system's internals can orchestrate the simultaneous occurrence of these three faults and bring down the system. A fault-tolerant system most likely does not address the possibility of the three faults occurring simultaneously, if the probability of occurrence is below a threshold of concern. A survivable system requires a contingency plan to deal with such a possibility.
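The three-fault argument can be made concrete with a short Python calculation. The per-fault probabilities below are illustrative assumptions, not figures from any analysis: for statistically independent accidental faults the joint probability is the product of the individual probabilities, whereas an adversary who can trigger each fault at will is not bound by those statistics.

```python
from math import prod


def joint_probability(fault_probabilities: list[float]) -> float:
    """Probability that all faults occur together, assuming independence."""
    return prod(fault_probabilities)


# Assumed accidental probabilities for f1, f2, and f3 in some time window.
accidental = joint_probability([1e-4, 1e-4, 1e-4])

# An adversary who can reliably induce each fault makes every term 1.0.
orchestrated = joint_probability([1.0, 1.0, 1.0])

print(f"accidental: {accidental:.1e}, orchestrated: {orchestrated:.1f}")
```

Under these assumed numbers the accidental case falls far below any plausible threshold of concern, which is exactly why a purely statistical analysis would leave the combined fault unaddressed.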

Redundancy is another factor that can contribute to the survivability of systems. However, redundancy alone is insufficient since multiple identical backup systems share identical vulnerabilities. A survivable system requires each backup system to offer equivalent functionality, but significant variance in implementation. This variance thwarts attempts to compromise the primary system and all backup systems with a single attack strategy.
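A toy Python model shows why identical backups add little against a single attack strategy. Each replica is described by the set of vulnerabilities present in its implementation; the vulnerability labels and the function name are hypothetical.

```python
def surviving_replicas(replicas: dict[str, set[str]], exploit: str) -> list[str]:
    """Return the replicas whose implementation lacks the exploited vulnerability."""
    return sorted(name for name, vulns in replicas.items() if exploit not in vulns)


# Identical implementations share the same vulnerability.
identical = {"primary": {"bug-A"}, "backup1": {"bug-A"}, "backup2": {"bug-A"}}

# Diverse implementations offer equivalent functionality with different flaws.
diverse = {"primary": {"bug-A"}, "backup1": {"bug-B"}, "backup2": {"bug-C"}}

print(surviving_replicas(identical, "bug-A"))  # no replica survives
print(surviving_replicas(diverse, "bug-A"))    # both backups survive
```

In the identical case one exploit removes the primary and every backup at once; in the diverse case the attacker would need a separate strategy for each implementation.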

1.6 The Current State of Practice in Survivable Systems

Much of today's research and practice in computer-systems survivability takes a perilously narrow, security-based view of defense against computer intrusions. This narrow view is dangerously incomplete because it focuses almost exclusively on hardening a system (e.g., using firewall technology or an Orange Book approach to host protection) to prevent a break-in or other malicious attack. This view says little about how to detect an intrusion or what to do once an intrusion has occurred or is under way. This view is also accompanied by evaluation techniques that limit their focus to the relative hardness of a system, as opposed to a system's robustness under attack and ability to recover compromised capabilities.

The architecture of secure bounded systems is built upon the existence of a security policy and its enforcement, which is imposed by the exercise of administrative control. In contrast, an unbounded system has no administrative control with which to impose a global security policy. For instance, on the Internet today the backbone architecture exists independent of security policy considerations because there is no global administrative control.

Affordability is always a significant factor in the design, implementation, and maintenance of systems that support the national infrastructure (e.g., the power grid, the public switched communications networks, and the financial networks) and our national defense. In fact, the trend toward increased sharing of common infrastructure components in the interest of economy virtually ensures that the civilian networked information infrastructure, and its vulnerabilities, will always be an inseparable part of our national defense.

Practical, affordable systems are almost never 100% customized, but rather are constructed from commonly available off-the-shelf components with internal structures that are well known. The trend toward developing systems through integration and reuse instead of customized design and coding efforts is a cornerstone of modern software engineering. Unfortunately, the intellectual complexity associated with software design, coding, and testing virtually ensures that exploitable bugs can and will be discovered in commercial and public domain products with internal structures that are available for analysis. When these products are incorporated as components of larger systems, those systems become vulnerable to attack strategies based on the exploitable bugs. Popular commercial and public-domain components offer attackers a ubiquitous set of targets with well-known and typically unvarying internal structures. This lack of variability among components translates into a lack of variability among systems. These systems potentially allow a single attack strategy to have a wide-ranging and devastating impact.

The natural escalation of offensive threats versus defensive countermeasures has demonstrated time and again that no practical systems can be built that are invulnerable to attack. Despite best efforts, there can be no assurance that systems will not be breached. Thus, the traditional view of information systems security must be expanded to encompass the specification and design of system behavior that helps the system survive in spite of active intrusions. Only then can systems be created that are robust in the presence of attack and are able to survive attacks that cannot be completely repelled.

In short, the nature of contemporary system development dictates that even hardened systems can and will be broken. Therefore, survivability must be designed into systems to help avoid the potentially devastating effects of system compromise and failure due to intrusion.

1.6.1 Incident Handling Has Enhanced Survivability

Although applying the term survivability to computer systems is relatively new, the practice of survivability is not. Much of the survivability practice to date has been in the realm of incident response (IR) teams. In fact, the CERT Coordination Center (CERT/CC) has, throughout its history, enhanced system survivability in the Internet community. The CERT/CC provides incident response services (helping organizations respond to and recover from incidents) and publishes and distributes vulnerability advisories (akin to public health notices). Traditionally, the CERT/CC has been concerned about survivability and has been successful in helping sites with risk mitigation and recovery.

The experience of the CERT Coordination Center has shown that how organizations respond to and recover from computer intrusions is at least as important as the steps they take to prevent them. We believe that widespread availability and use of survivable systems by the Internet community and throughout the Internet infrastructure will provide the best hope for the dramatic improvements necessary to transform the Internet into a survivable, networked information system of systems. Survivable systems will help make the Internet a viable medium for the conduct of commerce, defense, and government. This medium will also enable the support of major elements of the national infrastructure (e.g., power grid, public switched network, and air traffic control).

1.6.2 Firewalls Embody the Current State of Practice

Currently, little of the basic technology in security engineering and system integration applies to unbounded systems. Instead, current practice assumes that the capability exists to identify, define, and characterize the extent of administrative control over a system, all access points to that system, and all signals that may appear at those access points. In unbounded systems, such as the current Internet and the future National Information Infrastructure, these boundary conditions cannot be fully determined.

The current state of practice in survivability and security evaluation tends to treat systems and their environments as static and unchanging. However, the survivability and security of systems in fact degrade over time as changes occur in their structures, configurations, and environments, and as knowledge of their vulnerabilities spreads throughout the intruder community.

On the Internet today, the cornerstone of security is the notion of a firewall, a logically bounded system within a physically unbounded one. We assert that bounded-system thinking within unbounded domains leads to security designs and architectures that are fundamentally flawed from a survivability perspective. One notable example is the use of a firewall as the basic security component of the Internet. This approach is severely limited and can be readily circumvented by exploiting the fundamental differences between bounded and unbounded systems. Traditional firewalls are the state of the art for security architectures, but not for survivable systems, because they are passive, filter-only devices. The addition of active components, such as detection and a dynamic-response capability, will allow firewalls to play a role in survivable systems; but current firewalls do not have these capabilities.

