[34]
Survivability Architectures: The Control Systems Perspective
Kevin J. Sullivan, John C. Knight, Xing Du, Steve Geist
University of Virginia Computer Science Department
Thornton Hall, University of Virginia, Charlottesville, VA 22903
E-mail: sullivan@Virginia.EDU, Tel. (804) 982-2206
I. Control-Systems-Based Survivability Architectures
The survivability of critical, computerized infrastructure systems has become a major concern of the United States Government and military, and is expected to become an increasing concern of private industry. By survivability, we mean the ability of a critical system to continue to provide service despite significant disturbances, whether natural, accidental or malicious. Defensive system architectural design is one aspect of a comprehensive approach to system survivability. For example, the physical protection of key elements of a system can mitigate a set of risks that would otherwise be intolerable. However, when proactive defensive design measures fail to prevent disturbances, detection of and reaction to the disturbance (damage assessment, damage control, recovery, adaptation) might be necessary to enable continued acceptable operation of the system. Typically, a system will be reconfigured at the level of its operating parameters, or even of its physical organization, to mitigate the effects of the disturbance. One challenge to the research community is to invent and evaluate novel system survivability architectures. In this paper we describe the approach that we are taking to the reactive aspect of system survivability: an approach based on the exploitation of a control systems theory perspective.
In a nutshell, we frame hierarchical adaptive control as an architectural style for information survivability. A control system is a mechanism that manages the behavior of a monitored system within its environment in order to maintain the acceptable operation of that system. Examples will be familiar to every engineer. An adaptive control system is one that can continue providing control of a system in the face of disruption to elements of the system and control system. For example, an adaptive control system for an avionics application can ensure that a jet remains under control even if it loses part of a wing and some sensors in an air engagement.
A hierarchical control system is one in which control actions are determined at a number of levels of a hierarchical system, with low-level control system elements influencing and being influenced by higher levels of control. Tactical decisions might be made close to a controlled system, while strategic decisions are made at a higher level that, on one hand, has information on the global system state, but that, on the other, lacks the detailed information, computational capacity, bandwidth or low latency needed to exert direct control.
In our formulation, the system that is being controlled is the information system that automates an actual infrastructure system. In freight rail transportation, for example, the physical system comprises rails, cars and locomotives. This physical infrastructure is controlled by a complex information system that performs train assembly, dispatch, braking and many other functions in order to meet the system's performance, safety and other objectives. We envision superimposing a survivability control system atop that information system. Among other things, such a control system would implement intrusion monitoring and response; system-wide fault tolerance; and the managed degradation of service under adverse operating conditions.
In order to explore, develop and evaluate the hierarchical adaptive control concept as an architectural style for survivable infrastructure information systems, we have implemented a simple example on a distributed dynamic model of the United States payment system under attack. This experimental system models the hierarchical structure of the U.S. banking system, with "branch banks" as leaves, "money-center banks" in the middle, and the "Federal Reserve" at the root of a tree. Simulated "checks" are deposited at branch banks, resulting in requests for transfers of funds among accounts. In some cases, transfers are within the same branch; in others, within the same money center bank; and in others, between money center banks. Handling of transfers occurs at the lowest level involving the source and target banks. Our "banks" run simulated intrusion detectors, which are activated by a simulated intruder.
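The rule that a transfer is handled at the lowest level involving both banks amounts to finding the lowest common ancestor of the two banks in the tree. The following is a minimal sketch of that idea only, not the testbed's actual code; the bank names and the `PARENT` table are hypothetical.

```python
# Hypothetical three-level tree; the names are illustrative, not from the testbed.
PARENT = {
    "bb11": "mcb1", "bb12": "mcb1",
    "bb21": "mcb2", "bb22": "mcb2",
    "mcb1": "frb", "mcb2": "frb",
    "frb": None,
}


def path_to_root(bank):
    """Return the chain of banks from `bank` up to the root, inclusive."""
    path = []
    while bank is not None:
        path.append(bank)
        bank = PARENT[bank]
    return path


def handling_bank(source, target):
    """Return the lowest bank whose subtree contains both source and target."""
    ancestors_of_target = set(path_to_root(target))
    for bank in path_to_root(source):
        if bank in ancestors_of_target:
            return bank
    raise ValueError("banks are not in the same tree")
```

Under these assumptions, a transfer within one branch is handled by that branch, a transfer between two branches of the same money center bank by that money center bank, and a transfer across money center banks by the root.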

II. A Simple Dynamic Modeling Testbed
The preceding figure illustrates the structure of the dynamic model and a superimposed hierarchical control system. Application nodes are white. The bbij are branch banks. The mcbi are money center banks. And frb models the Federal Reserve system. Elements of the control system are depicted as circles and ovals in gray. Successively higher levels of control appear in successively darker shades. The scope of control of each level of the control system is indicated by the nesting in the diagram.
Each bank has a local control system. Such a local control system would be responsible for enforcing any policies concerning the particular bank to which that local control node is attached, e.g., detecting and then reporting potential intrusions into the systems of that bank. Each money center bank has a control node whose scope is the money center bank and subordinate branch banks. This higher control level manages the system rooted at and including the money center bank. This higher level control node communicates with branch bank control nodes. Finally, the system has a control node whose scope is the Federal Reserve's local control node and those of the money center banks. In addition to communicating with both higher and lower level control nodes, each control node provides a user interface at the bank at that control node's level in the hierarchy. This monitoring and control interface reports status to human management, and provides for human-initiated control actions.
Our distributed dynamic model of the banking system is implemented on a distributed message-passing "object layer" running on Windows NT and using TCP for message passing. Each node implements a multithreaded, message-dispatching process, with simulation-specific code implementing banking and control functions, e.g., simulated account databases and functions for transferring value between accounts. This layer supports logical point-to-point communication between nodes, with a reflective capability: all messages to or from a node pass through an associated "shell node" of the same kind. These shells (not depicted in Figure 1) serve several purposes. First, they model the largely transparent wrappers that we anticipate being added to real infrastructure systems to provide interfaces between application elements and survivability control systems. Second, they host control system elements: e.g., intrusion detectors, which we see as complex sensors; and actuators, which, in this context, are procedures for managing parts of the distributed application as well as lower level nodes of the control system. Third, within our dynamic model, these shell nodes provide the interfaces by which the simulated attacker attacks banks. These nodes also embody other implementation details that are not relevant here.
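The role of a shell node can be illustrated with a small in-process sketch. It is a simplification under stated assumptions: there is no TCP or threading, and the class names, the message dictionary layout and the `"attack"` message type are all hypothetical rather than taken from the testbed.

```python
class BankNode:
    """Hypothetical application node: holds balances and applies transfers."""

    def __init__(self, balances):
        self.balances = dict(balances)

    def handle(self, message):
        if message["type"] == "transfer":
            self.balances[message["src"]] -= message["amount"]
            self.balances[message["dst"]] += message["amount"]
            return "ok"
        return "unknown"


class ShellNode:
    """Wrapper through which every message to the wrapped node passes."""

    def __init__(self, node):
        self.node = node
        self.alerts = []        # sensor output, to be read by the control system
        self.accepting = True   # actuator state, set by the control system

    def set_accepting(self, flag):
        """Actuator: enable or disable service at the wrapped node."""
        self.accepting = flag

    def deliver(self, message):
        """Intercept a message before it reaches the application node."""
        if message["type"] == "attack":
            # The simulated attacker enters here; the simulated detector fires.
            self.alerts.append(message)
            return "detected"
        if not self.accepting:
            return "rejected"
        return self.node.handle(message)
```

The application node never sees attack messages or control actions, which is the sense in which the wrapper is intended to be largely transparent.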
Our policy is as follows: if a given fraction of the nodes beneath a given node are attacked, that node concludes that there is a coordinated attack on its domain, and it puts itself and its subordinate controllers into an "under coordinated attack" status. In our case, this requires shutting down service (rejecting checks). This policy is a "toy," of course. Our objective is to develop and evaluate architectures and mechanisms, not particular policies for the banking system. The information flows required to implement a policy depend on that specific policy. One interesting issue is how to synthesize the necessary information interconnect from the specification of a survivability policy for a given system.
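A minimal sketch of this toy policy follows. Several details are assumptions made for illustration: "nodes beneath" is read as direct children, a child that has itself declared a coordinated attack counts as attacked, the comparison is "at least the threshold", and the default threshold of 0.5 and all names are hypothetical.

```python
class ControlNode:
    """Hypothetical control node for the toy coordinated-attack policy."""

    def __init__(self, name, threshold=0.5):
        self.name = name
        self.threshold = threshold   # fraction of children that triggers the policy
        self.children = []
        self.attacked = False        # set by this node's simulated intrusion detector
        self.under_coordinated_attack = False

    def add_child(self, child):
        self.children.append(child)
        return child

    def declare_coordinated_attack(self):
        """Put this node and all subordinate controllers into the status."""
        self.under_coordinated_attack = True
        for child in self.children:
            child.declare_coordinated_attack()

    def evaluate(self):
        """Apply the policy bottom-up over the subtree rooted at this node."""
        for child in self.children:
            child.evaluate()
        if not self.children:
            return
        hit = sum(1 for c in self.children
                  if c.attacked or c.under_coordinated_attack)
        if hit / len(self.children) >= self.threshold:
            self.declare_coordinated_attack()

    def accepts_checks(self):
        """Service is shut down (checks rejected) while under coordinated attack."""
        return not self.under_coordinated_attack
```

With four branches under one money center bank and a threshold of 0.5, one attacked branch leaves service running, while two attacked branches shut down the money center bank and all four branches.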
At this time, our control nodes are specified as simple finite state machines. We define abstract-data-type-like interfaces by which "events" are passed into and out of these machines. By restricting the architecture to one based on finite state machines, we hope to preserve a degree of analyzability that would be lost were we to go to a richer computational model. However, should it be necessary to move to a richer model, then the encapsulation of policies behind ADT interfaces will make such a transition trivial.
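As an illustration of a control node specified as a finite state machine behind an abstract-data-type-like interface, consider the sketch below. The state names, event names and transition table are hypothetical; the point is only that callers see `post` and `state`, and that a table-driven machine can be analyzed mechanically, here by enumerating its reachable states.

```python
class ControlFSM:
    """Hypothetical table-driven control node; events go in, events come out."""

    # (state, input event) -> (next state, output event or None)
    TRANSITIONS = {
        ("normal", "attack_threshold_reached"): ("under_attack", "stop_service"),
        ("under_attack", "all_clear"): ("normal", "resume_service"),
    }

    def __init__(self):
        self._state = "normal"

    @property
    def state(self):
        return self._state

    def post(self, event):
        """Feed one input event; return the output event, or None."""
        next_state, output = self.TRANSITIONS.get(
            (self._state, event), (self._state, None))
        self._state = next_state
        return output


def reachable_states(start="normal"):
    """Enumerate states reachable from `start` by following the table."""
    seen, frontier = {start}, [start]
    while frontier:
        current = frontier.pop()
        for (src, _event), (dst, _output) in ControlFSM.TRANSITIONS.items():
            if src == current and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen
```

Because callers depend only on the event interface, the table could in principle be replaced by a richer computational model without changing them.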
III. Why Hierarchical and Adaptive?
The hierarchical structure of our control system naturally supports local control, passing of detailed status information up the control hierarchy, aggregation of detailed information at higher levels, and passing of such aggregated information back down the control hierarchy. We anticipate that this kind of information flow will be necessary to implement system-wide survivability policies with reasonable use of computation and communication resources. The structure is intended to enable local control nodes to implement policies that depend on detailed local status information and rough (aggregated) knowledge of the global state of the system in which the local application node is embedded.
Our structure also uses hierarchy for abstraction and complexity control. Details of local application nodes are abstracted by local control nodes. Higher level control nodes are specified and implemented in terms of the observable and controllable aspects of control nodes at the next level down the control hierarchy.
A third reason for the use of hierarchy in general and for the use of data abstraction in particular as a means of specifying interfaces to and between control system nodes is to foster evolvability of the policies that the control system implements. The evolvability of survivability policies will be critical to effective "learning" by a system over time. A disciplined approach to the modular design of the control system will be critical as well in building adaptive control systems that can tolerate the loss of some control and controlled nodes.
The adaptive aspect is one that we have only begun to explore. The control theoretic notion of multiple-model adaptive control, in which the control system views the controlled system as being in one of a number of possible distinct operating regimes in which distinct control rules apply, offers an especially intriguing metaphor for survivability architectures. At an implementation level, our middleware provides a heartbeat monitoring capability that we plan to use to connect together control nodes in such a way as to ensure that nodes within the control system have a model of both the controlled system and the control system. We will implement policies that manage contingencies involving the loss of parent and child control nodes.
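The heartbeat idea can be sketched as follows. This is not the middleware's interface, which is not described here; the class, its method names and the timeout value are assumptions, and timestamps are passed in explicitly so that the example is deterministic.

```python
class HeartbeatMonitor:
    """Hypothetical monitor: tracks when each neighbouring node was last heard."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence after which a node is presumed lost
        self.last_seen = {}

    def beat(self, node, now):
        """Record a heartbeat from `node` received at time `now`."""
        self.last_seen[node] = now

    def lost(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.timeout)
```

A control node could consult `lost` for its parent and children and, when one is missing, switch to a contingency policy for that case.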
IV. Early Insights Emerging from Exploratory Work
One of the first (obvious) things that we discovered when taking the control systems perspective was that the information that would have to be passed within the control system depends on the particular system-wide survivability policy to be enforced. For example, a policy declaring a bank holiday if any bank is attacked requires the propagation only of a boolean value indicating whether any bank is under attack. A more subtle policy would require richer information flows. Passing of information from higher-level to lower-level nodes is required for lower-level nodes to take actions informed by partial knowledge of the global system state. As a speculative aside, attribute grammars might provide a useful notation in which to specify required information flows within these tree structures.
A second conclusion (one we haven't yet explored in detail) is that control system dynamics have to be sufficiently faster than the controlled system dynamics to permit time-sensitive survivability policies to be enforced. Such a policy might require that a subtree be spliced out of the network before a disturbance within that subtree can propagate to other parts of the application system. Ultimately, the dynamics at this level have to be related to the dynamics of the underlying physical architecture. Disturbances travel faster through the electric power grid than they do through the rail system, for example. The relative dynamics issue remains to be explored, and is quite interesting.
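The bank-holiday example and the attribute-grammar aside can be made concrete with a small sketch in which one boolean flows up the tree as a synthesized attribute and the resulting decision flows back down as an inherited attribute. The `Node` class and function names are hypothetical, and this is one possible reading of the analogy rather than a worked-out notation.

```python
class Node:
    """Hypothetical tree node carrying the two attributes of the toy policy."""

    def __init__(self, name, children=(), attacked=False):
        self.name = name
        self.children = list(children)
        self.attacked = attacked   # local detector output
        self.holiday = False       # decision received from above


def any_attacked(node):
    """Synthesized attribute: computed bottom-up from the children."""
    return node.attacked or any(any_attacked(c) for c in node.children)


def push_holiday(node, holiday):
    """Inherited attribute: passed top-down from the parent."""
    node.holiday = holiday
    for child in node.children:
        push_holiday(child, holiday)


def enforce_bank_holiday(root):
    """Declare a bank holiday everywhere if any bank in the tree is attacked."""
    push_holiday(root, any_attacked(root))
```

Only a single boolean crosses each edge in each direction, which matches the observation that this policy needs very little information flow; a subtler policy would need richer attributes.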
V. Relationship to Traditional Control Theory
Control theory is a discipline of critical importance with a rich and beautiful mathematical structure that is at the heart of traditional engineered system design. Traditional control theory models controlled systems as sets of differential equations. A control system monitors the system, compares its state (suitably processed) with a specification of the desired behavior, decides on an optimal course of action in light of any possible stochastic aspect of the system and control mechanisms, and feeds information back into the system by adjusting its controllable parameters to ensure that the system continues to operate properly.
Control theory provides a highly structured way of thinking about and designing information flows and feedbacks so as to maintain systems in desired states over time. For traditional engineered systems, control theory provides a rich and beautiful set of engineering modeling and analysis techniques based on a sound foundation of advanced mathematical analysis. At present we have in traditional control theory little more than a suggestive metaphor for software engineering. The application of control theory to survivability architectures described in this paper is one thrust in a broader research program (of Sullivan) that seeks to develop the suggestive metaphor as far as possible into new insights and new scientific and mathematical foundations for what today are ad hoc and idiosyncratic concepts and principles in software engineering.
To give a taste of the kind of thinking that can emerge from a basic understanding of control theory, let us characterize our control architecture in traditional control theoretic terms. Our control system reveals survivability as a problem of non-stochastic optimization: the best course of action now is determined by available information without consideration of an uncertain future. More complicated survivability policies will require optimization in which bets are made on future outcomes. The key concept from control theory in this dimension is stochastic optimal control, in which the control system has to select actions today in the face of uncertainty. In the survivability context, uncertainty will involve such things as the demand for services, future behaviors of adversaries, and uncertain consequences of control actions.
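The distinction can be illustrated with a deliberately tiny, single-stage example: choosing an action from the currently believed state versus choosing the action with the lowest expected cost over uncertain scenarios. This is only the simplest instance of deciding under uncertainty, not multi-stage stochastic optimal control, and the actions, scenarios and cost figures are all invented for illustration.

```python
# COSTS[action][scenario]; every number here is made up for illustration.
COSTS = {
    "keep_serving": {"isolated": 1.0, "coordinated": 100.0},
    "shut_down": {"isolated": 20.0, "coordinated": 20.0},
}


def best_action_deterministic(believed_scenario):
    """Non-stochastic choice: act as if the believed scenario were certain."""
    return min(COSTS, key=lambda action: COSTS[action][believed_scenario])


def best_action_expected(probabilities):
    """Stochastic choice: minimize expected cost over the scenarios."""
    def expected_cost(action):
        return sum(p * COSTS[action][scenario]
                   for scenario, p in probabilities.items())
    return min(COSTS, key=expected_cost)
```

With these invented costs, a controller that believes the attack is isolated keeps serving, whereas one that assigns a 30% chance to a coordinated attack prefers to shut down, since the expected cost of serving (0.7 × 1 + 0.3 × 100 = 30.7) exceeds the cost of shutting down (20).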
In related work we are pursuing links between ad hoc and idiosyncratic concepts of software design and evolution and the clean but not always directly applicable concepts of control theory. In one project, we are applying concepts from options theory (which in non-arbitrage settings is based on stochastic optimal control, and on optimal stopping in particular) to reason about the timing of design decisions, transitions between investment phases in projects, and the value of modularity [1]. In a second thrust, we appeal to the concept of economic optimization under uncertainty to reason about the nature of software evolvability [2]. These other projects are based on an economic view that monetary value can be taken as the objective. The economic view is particularly important, but not the only one that makes sense within a broader context of transporting structures, insights and techniques from control theory for use in developing the foundations of software engineering. Our options-theoretic work is funded by a new grant from the NSF (CCR-9804078).
Bibliography
- [1] K. Sullivan, S. Jha and P. Chalasani, "Software Design Decisions as Real Options," University of Virginia Technical Report, submitted for publication.
- [2] K. Sullivan, "The Phenomenology of Software Evolution," International Software Evolution Workshop, Kyoto, Japan, 1998.





