3. Survivability Design and Implementation Strategies
In this section we examine strategies that support the survivability of critical system functions in unbounded networks. Strategies for survivability in networked systems depend on several assumptions and constraints. Although they may seem obvious, these assumptions and constraints must be made explicit. The assumptions differ radically from the implicit assumptions traditionally made for the uniprocessor, multi-processor, and bounded network systems on which most previous research and development has been based.
For unbounded networks, we assume that
- any individual node of the network can be compromised
- survivability does not require that any particular physical component of the network be preserved
- only the essential services of the network as a whole must survive
- for reasons of reliability, design error, user error, and intentional compromise, the trustworthiness of a network node or any node with which it can communicate cannot be guaranteed
In this report, we primarily discuss unbounded networks. The term unbounded has a slightly different meaning depending on the purpose and situation involved. In all cases, unbounded networks share three principal characteristics: a lack of central physical or administrative control, an absence of insight or vision into all parts of the network, and no practical limit on growth in the number of nodes in the network.
These assumptions impose the following constraints on the architecture of survivable networks and on the form of feasible survivability strategies:
- There must not be a single point of failure within the network. Essential services are distributed in a manner that is not critically dependent on any particular component or node.
- Global knowledge is impossible to achieve in a distributed system [Halpern84]. There are no all-seeing global oracles. Instead, protocols define the interaction and knowledge shared between nodes.
- Each node must continuously validate the trustworthiness of itself and those with which it communicates.
- Computations within a given node of an unbounded network, whether for essential services, communication, or trust validation, must have costs that are less than proportional to the number of nodes in the network.
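The last constraint can be illustrated with a minimal sketch of push-style gossip, offered as one possible approach and not as a method prescribed by this report. The function name `gossip_rounds`, the `fanout` parameter, and the random peer selection are all assumptions made for illustration. Each informed node contacts a fixed number of peers per round, so its per-round cost does not grow with the size of the network.

```python
import random


def gossip_rounds(num_nodes, fanout=3, seed=0):
    """Simulate push gossip starting from node 0 (hypothetical model).

    Returns (rounds until every node is informed,
             largest number of messages any single node sent in one round).
    """
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    max_sent_per_round = 0
    while len(informed) < num_nodes:
        newly_informed = set()
        for node in informed:
            # Each node picks a constant number of peers, independent of
            # num_nodes. A node may pick itself; that message is wasted.
            peers = rng.sample(range(num_nodes), fanout)
            max_sent_per_round = max(max_sent_per_round, len(peers))
            newly_informed.update(peers)
        informed |= newly_informed
        rounds += 1
    return rounds, max_sent_per_round
```

In this model no node ever needs a list of all informed nodes or any other global state; it only needs a way to reach a few peers. The number of rounds grows with network size, but far more slowly than the node count itself.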
3.1 Four Aspects of Survivability Solution Strategies
As introduced in Section 2, there are four aspects of the survivability solution which can serve as a basis for survivability strategies. These four aspects are: resistance, recognition, recovery, and system adaptation and evolution. This section summarizes the approaches in each of these four areas.
There are many techniques for dealing with these four aspects. Any or all of the techniques may apply to survivable systems. We do not list all of these techniques but instead categorize them within the broader aspects. Table 2 contains the four aspects of the survivability solution and representative taxonomies of respective strategies.
| Survivability Aspect | Taxonomies of Strategies |
| Resistance | |
| Recognition | |
| Recovery | |
| Adaptation and Evolution | |
Table 2: A Taxonomy of Strategies Related to Survivability
3.2 Support of Strategies by the Current Computing Infrastructure
The rapid growth of the Web and other Internet-based applications has encouraged the growth of a computing infrastructure to support distributed applications. While the initial Web efforts concentrated on information publishing, the application domain has expanded to encompass a much wider spectrum of an organization's computing needs. The technical focus of this growth has moved from tools such as Web browsers or servers to the development of a set of Internet-compatible, commercially provided services. Examples of these services are file, print, transaction, messaging, directory, security, and object services such as CORBA (Common Object Request Broker Architecture) and DCOM (Distributed Component Object Model).
The commercially available distributed infrastructures are in the early phases of their development and do not yet directly support system survivability. Recognition is not a supported service and recovery is indirectly supported by a transaction server. Typically, an organization adopts such an infrastructure to lower costs by using a common infrastructure for intranets, extranets, and Internet applications and to simplify application development by capturing the complexity of distributed computing in the infrastructure rather than in each application.
Managing user-profile data is an example of a service that a distributed infrastructure can assume. One general requirement of system survivability is to provide user authentication and manage the authority given to that user for data and systems access. Authentication can be implemented using passwords and authorizations that are validated by access-control lists. However, in many existing systems, such as database applications, access-control lists are maintained by the application.
When system users, data, and applications are geographically distributed, the maintenance of user-profile data in an application is difficult. A shared directory service, which is part of a distributed infrastructure, can provide the data storage capability and a protocol such as LDAP (Lightweight Directory Access Protocol) for application access and replace the application-specific access-control mechanisms. These infrastructure security services can provide the mechanisms for user authentication such as a public key interface, mechanisms to describe access control, and the means to define a security policy. The use of shared services for user authentication and authorization should reduce application and overall system complexity as well as provide the means to define an organizational security policy.
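As a rough illustration of moving access control out of individual applications, the sketch below uses a hypothetical in-memory `DirectoryService` class. It is not an LDAP client and does not reflect any vendor's API; the class names, method names, and the password-hashing parameters are assumptions chosen for the example. The point it shows is that two applications consult one shared store, so a single policy change takes effect in both.

```python
import hashlib
import hmac
import os


class DirectoryService:
    """Hypothetical in-memory stand-in for a shared directory service."""

    def __init__(self):
        self._users = {}  # user -> (salt, password hash)
        self._acl = {}    # resource -> set of users allowed to access it

    @staticmethod
    def _hash(password, salt):
        # Iteration count is an illustrative assumption, not a recommendation.
        return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 10_000)

    def add_user(self, user, password):
        salt = os.urandom(16)
        self._users[user] = (salt, self._hash(password, salt))

    def grant(self, user, resource):
        self._acl.setdefault(resource, set()).add(user)

    def revoke(self, user, resource):
        self._acl.get(resource, set()).discard(user)

    def authenticate(self, user, password):
        record = self._users.get(user)
        if record is None:
            return False
        salt, expected = record
        return hmac.compare_digest(expected, self._hash(password, salt))

    def is_authorized(self, user, resource):
        return user in self._acl.get(resource, set())


class Application:
    """An application that delegates access decisions to the directory."""

    def __init__(self, name, directory):
        self.name = name
        self.directory = directory

    def open_resource(self, user, password, resource):
        return (self.directory.authenticate(user, password)
                and self.directory.is_authorized(user, resource))
```

Because neither application keeps its own access-control list, revoking a user's access in the directory removes it everywhere at once. The same property means that a compromised directory affects every application that relies on it.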
When this strategy is implemented, the system architecture is constrained by the infrastructure-supplied services and the protocols supported. For example, a survivability strategy may be to exchange a primary service with an alternate implementation of that service if the primary service has been compromised. At this stage of infrastructure deployment, some interoperability is supported among services provided by different vendors. However, there is also significant integration of services, which makes it difficult or impossible to replace a service, such as a directory service, with one from a different vendor.
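The service-exchange strategy presupposes that the primary and alternate implementations share an interface. A minimal sketch under that assumption follows; `FailoverService`, `mark_compromised`, and `ServiceUnavailable` are hypothetical names, and how compromise is detected is outside the scope of the example.

```python
class ServiceUnavailable(Exception):
    """Raised when no trusted implementation can answer a request."""


class FailoverService:
    """Route requests to the first implementation not marked compromised.

    `providers` is an ordered list of callables that accept one request
    argument; the primary implementation comes first.
    """

    def __init__(self, providers):
        self.providers = list(providers)
        self.quarantined = set()  # indexes of implementations to avoid

    def mark_compromised(self, index):
        self.quarantined.add(index)

    def call(self, request):
        for index, provider in enumerate(self.providers):
            if index in self.quarantined:
                continue
            try:
                return provider(request)
            except Exception:
                # Treat a failing implementation as unavailable for this
                # request and try the next one.
                continue
        raise ServiceUnavailable(request)
```

The sketch works only because every provider accepts the same request and returns a compatible result. Tight integration between one vendor's services removes exactly that property.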
Using shared directory services also raises general survivability issues. A widely used infrastructure should develop a robust set of services. However, their wide use develops a large and knowledgeable intruder community and a wide dissemination of information about system vulnerabilities and security solutions. A compromised or inaccessible directory can affect multiple applications and multiple sites.
An essential part of providing system survivability is establishing operational and administrative procedures for system directories so that system administrators can monitor service and provide recovery. The design tradeoff is that implementing monitoring and recovery procedures is less costly using shared components than using an application-specific architecture. Infrastructure services provide generic support for replication and maintenance of consistency across distributed sites. However, achieving overall mission survivability requires not only understanding the impact of compromised access control data and of the design of a recovery policy, but also knowledge of the system’s applications.
Commercially available infrastructure products provide general services that are independent of application domain. Some of the services listed in Figure 3, however, require application-domain knowledge. For example, recognition of an intrusion or maintenance of trust among nodes requires knowledge of expected behavior. A protocol can ensure that information is delivered, but cannot validate the appropriateness of the data. Simple recovery mechanisms can include transaction logs or file restorations, but the use of transactions, rollback strategies, and more advanced techniques requires domain expertise to identify consistent application states and the impact of compromised data. Such recovery strategies have been used successfully in application-centered products, such as relational database systems that manage relatively homogeneous data structures. Applying such techniques to general distributed-computing systems is more difficult.
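The dependence of rollback on domain knowledge can be sketched with a small undo log. The `LoggedStore` class and its `rollback_until` method are hypothetical; the important assumption is that the application, not the infrastructure, supplies the `is_consistent` predicate that defines a valid state.

```python
class LoggedStore:
    """Key-value store that records an undo entry for every write."""

    def __init__(self):
        self.data = {}
        self._undo = []  # entries of (key, key existed before, old value)

    def put(self, key, value):
        self._undo.append((key, key in self.data, self.data.get(key)))
        self.data[key] = value

    def rollback_until(self, is_consistent):
        """Undo writes, newest first, until is_consistent(data) holds.

        Returns the number of writes undone. The predicate encodes
        application-domain knowledge; the store cannot derive it.
        """
        undone = 0
        while self._undo and not is_consistent(self.data):
            key, existed, old_value = self._undo.pop()
            if existed:
                self.data[key] = old_value
            else:
                del self.data[key]
            undone += 1
        return undone
```

For example, an application holding two account balances might supply a predicate requiring that they sum to a known total. The log mechanism is generic, but without such a predicate it has no way to tell which earlier state is safe to restore.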
3.3 Survivability Design Observations
We can draw a number of observations about the questions and issues that must be addressed concerning system survivability in networked systems.
3.3.1 Survivability Requires Trust Maintenance
An open issue is how to determine the basis of trust and how an individual node of a network contributes to the survivability of the system’s essential services when
- any node can be unreliable or rogue
- there is no global view or global control
- nodes cannot completely trust themselves or their neighbors
Depending on the application, it may be possible through architectural design or dynamic action within the system to increase the reliability, visibility, and control of components or the trustworthiness of participants. The only absolute basis for trust maintenance, however, is the consistency of behavioral feedback from interactions with other nodes and independent verification of claimed actions from nodes not directly involved in the transactions.
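One way to picture trust maintenance based on behavioral feedback and independent verification is the sketch below. It is an assumption-laden illustration, not a mechanism taken from this report: the `TrustLedger` class, the exponentially weighted score, the majority rule in `verify_claim`, and all numeric parameters are hypothetical.

```python
def verify_claim(claimed_value, witness_reports):
    """Accept a claim only if a strict majority of independent witnesses
    (nodes not involved in the transaction) report the same value."""
    if not witness_reports:
        return False
    agreeing = sum(1 for report in witness_reports if report == claimed_value)
    return agreeing * 2 > len(witness_reports)


class TrustLedger:
    """Per-peer trust scores kept locally by one node."""

    def __init__(self, initial=0.5, rate=0.2, threshold=0.3):
        self.initial = initial      # score assigned to an unknown peer
        self.rate = rate            # weight given to the newest observation
        self.threshold = threshold  # minimum score to keep interacting
        self.scores = {}

    def record(self, peer, claim_verified):
        # Move the score a fraction of the way toward 1.0 or 0.0.
        old = self.scores.get(peer, self.initial)
        target = 1.0 if claim_verified else 0.0
        self.scores[peer] = old + self.rate * (target - old)

    def is_trusted(self, peer):
        return self.scores.get(peer, self.initial) >= self.threshold
```

A node would call `verify_claim` after each interaction and feed the result to `record`. Trust is never settled once and for all in this scheme; a peer whose claims stop being corroborated loses standing over successive interactions.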
A closely related point is the absence of global view and control. If unreliable and untrustworthy components are found to be present in a system, determining whether the critical functions have been compromised may be extremely difficult without global view and control. The absence of global view and control, which will be the general case, does not preclude effective survivable-network architectures. In particular, it should be possible for individual nodes to contribute to the survivability goals in general and, at worst, not to interfere with them.
Genetic algorithms, for example, achieve their effects through the collective action of the individual participants. These participants, however, cannot measure overall effectiveness or determine whether their contribution is positive. This example suggests that survivability solutions can exist among emergent algorithms that depend on continuous interaction with neighboring nodes but do not require feedback for indications of progress and success.
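A simple example of an algorithm in which a global result emerges from purely local interaction is neighbor averaging. This technique is substituted here for illustration and is not a genetic algorithm; the ring arrangement and the function name `neighbor_averaging` are assumptions. Each node sees only its two neighbors and never learns whether the network as a whole has converged.

```python
def neighbor_averaging(values, rounds):
    """Run synchronous averaging on a ring of nodes.

    In every round each node replaces its value with the mean of its own
    value and the values of its two ring neighbors.
    """
    n = len(values)
    current = list(values)
    for _ in range(rounds):
        current = [
            (current[(i - 1) % n] + current[i] + current[(i + 1) % n]) / 3
            for i in range(n)
        ]
    return current
```

After enough rounds every node holds approximately the network-wide average, although no node ever computed it directly or had access to all the values.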
3.3.2 Survivability Analysis Is Protocol-Based Not Topology-Based
Another implication for networked systems is that the important aspects of their architecture from the viewpoint of survivability relate to the conventions and rules of interaction between neighboring nodes and that the network topology is largely irrelevant. That is, network architectures must be specified, compared, and measured in terms of their interactions and not the topology of their interconnection.
As an example of this kind of analysis, consider the general issue of persistence of state data for a protocol. Should a protocol maintain state information to improve reliability or to perform additional consistency checks? What level of checking should the infrastructure support? J. H. Saltzer and his colleagues examined FTP (File Transfer Protocol) and compared approaches that check packets only at the source and destination nodes (end-to-end) with protocols that check reliability on each hop of the communications path [Saltzer90]. The conclusion was that hop-by-hop checking increased complexity and affected performance with little increase in overall reliability.
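The end-to-end approach can be sketched as follows. This is an illustrative model, not the procedure analyzed in [Saltzer90]: the hop corruption model, the function names, and the retry limit are assumptions, and the sketch assumes the source's digest reaches the destination intact by some other means.

```python
import hashlib
import random


def send_through_hops(payload, hops, corrupt_prob, rng):
    """Pass data through several hops, each of which may flip one byte."""
    data = bytearray(payload)
    for _ in range(hops):
        if data and rng.random() < corrupt_prob:
            data[rng.randrange(len(data))] ^= 0xFF
    return bytes(data)


def end_to_end_transfer(payload, hops=5, corrupt_prob=0.2,
                        max_attempts=50, seed=0):
    """Retransmit until the destination's digest matches the source's.

    No intermediate hop performs any check. Returns (data, attempts).
    """
    rng = random.Random(seed)
    source_digest = hashlib.sha256(payload).digest()  # computed once
    for attempt in range(1, max_attempts + 1):
        received = send_through_hops(payload, hops, corrupt_prob, rng)
        if hashlib.sha256(received).digest() == source_digest:
            return received, attempt
    raise RuntimeError("transfer failed after %d attempts" % max_attempts)
```

The intermediate hops carry no state and do no verification, yet the destination still detects any corruption. Per-hop checks could reduce the number of retransmissions, but they would not remove the need for the final end-to-end check.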
Kenneth P. Birman discusses such tradeoffs in a more general context [Birman96]. Properties such as reliability and survivability can be enhanced by properties that support fault tolerance or communication guarantees. However, supporting a strong property, such as uniform ordering of events, can be thousands of times more costly than supporting a weaker property that may require the application to handle nonuniform behavior.
Similar arguments can be made when you compare stateless architectures and non-replicated data to maintaining a strong application-level consistency requirement. In the case of stateless architectures and non-replicated data, the server can be restarted and the clients have the responsibility to reconnect. Survivability requires tradeoff analysis between the responsibilities of the servers and the clients and between end-to-end protocol monitoring by the application and general protocol monitoring provided by the infrastructure. For such a recovery strategy, the application level may be the appropriate level in which to analyze application-state and user behavior and select appropriate recovery actions.
3.3.3 Survivability Is Emergent and Stochastic
Survivability goals are emergent properties that are desired for the system as a whole, but do not necessarily prevail for individual nodes of the system. This approach contrasts with traditional system designs in which specialized functions or properties are assured for particular nodes and the composition of the system must ensure that those properties and functional capabilities are preserved for the system as a whole. For survivability, we must achieve system-wide properties that typically do not exist in individual nodes. A survivable system must ensure that the desired survivability properties emerge from the interactions among its components, as in the construction of reliable systems from unreliable components.
Survivability is inherently stochastic. If survivability properties are emergent, they are present only when the number of contributing component nodes of a system is sufficiently large. If the number or arrangement of nodes falls below a critical threshold, the attendant survivability property fails. An example of this type of critical survivability property is connectivity in a communications system.
You can design the architecture of the system to maximize the number of paths between any two nodes; but if enough links are compromised to partition the network, communication between arbitrary nodes will no longer succeed. Thus, survivability properties, algorithms, and architectures should be specified, viewed, and assessed to determine the probability of their success under given conditions of use and not determined as discrete quantities.
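Assessing connectivity as a probability and not as a yes-or-no property can be sketched with a Monte Carlo estimate. The model below is an assumption made for illustration: links fail independently with a fixed probability, and the function names and parameters are hypothetical.

```python
import random


def connected(num_nodes, links, src, dst):
    """Return True if dst is reachable from src over undirected links."""
    adjacency = {node: [] for node in range(num_nodes)}
    for a, b in links:
        adjacency[a].append(b)
        adjacency[b].append(a)
    seen = {src}
    stack = [src]
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return False


def survival_probability(num_nodes, links, src, dst, link_failure_prob,
                         trials=2000, seed=0):
    """Estimate the probability that src can still reach dst when each
    link fails independently with probability link_failure_prob."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        surviving = [link for link in links
                     if rng.random() >= link_failure_prob]
        if connected(num_nodes, surviving, src, dst):
            successes += 1
    return successes / trials
```

Comparing the estimate for a four-node ring with that for a simple chain between the same two endpoints shows how added paths raise the probability of communication without ever guaranteeing it.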
3.3.4 Survivability Requires a Management Component
The design of a survivable system also includes management operations and administration. Poor system administration is a frequent source of vulnerabilities at centrally administered sites. In unbounded network systems, system administration must be coordinated across multiple sites. Existing system administration procedures typically assume a bounded environment and full administrative control over the required services. The complexity of infrastructure and the use of services outside an organization's immediate control require expanding the administrative services and providing a monitoring function as part of the infrastructure.