CERT



Title: PNNI Global Routing Infrastructure Protection
Authors: Livio Ricciulli, Sabrina De Capitani di Vimercati, Patrick Lincoln, Pierangela Samarati
Institution: SRI International, 333 Ravenswood Ave, Menlo Park, CA 94025
Funded by: Sprint Communications Co. under contract No. CK5005116JMD

Introduction

    Global routing networks require novel security mechanisms to protect control information spanning multiple untrusted administrative domains. The global data structures on which large distributed networks depend must be properly protected because (1) compromises of the routing information can have catastrophic effects on the operation of the network, which makes the routing information an attractive target for denial-of-service attacks, and (2) user-level security measures can be effective only if the underlying data transport layers cannot be compromised.
    Several designs based on cryptography have been proposed to secure routing infrastructures, but such designs all rely on a key management infrastructure that must be as large as the routing network itself and must itself be resilient to faults. The key management problem has not been completely solved even for relatively small networks; performing global fault tolerant key management is therefore a substantial obstacle. Another shortcoming of cryptographically secure routing protection designs is that they address only protection against malicious faults and do not directly address spontaneous accidental failures. In fact, most designs have no direct handling of nonmalicious faults, and thus leave these simpler fault tolerance issues to other network protocols. These limitations result in possible conflicts between the built-in mechanisms that reconfigure the system upon failure and the security mechanisms that overlay the routing protocols. For these reasons, other methods of achieving global network integrity are required.
    Our design overcomes the above-mentioned limitations by protecting the routing infrastructure against both malicious and nonmalicious faults by replicating network resources and using standard Byzantine fault tolerant protocols. We show that replicating both processing and communication resources according to rigorous and well-understood rules derived from fault tolerant distributed system theory enables a system to tolerate large classes of fault or attack scenarios, either intentionally or unintentionally provoked, without relying on cryptography.
    Our design merges ideas from intrusion detection, network management, fault tolerance, and database security into a distributed fault tolerant system that maps extremely well to modern routing systems such as the Private Network-Network Interface (PNNI) or OSPF. We detail our design and show how it can be naturally integrated into the PNNI ATM routing infrastructure without requiring significant modifications to the standard.
    The architecture of our PNNI Global Routing Infrastructure Protection (PGRIP) system is composed of four main components:
(1) An anomaly detection module monitors accesses to the routing database on a carefully chosen subset of the routers. Based on heuristic rules (which may be customized at each router), the anomaly detection module can generate alarms that characterize anomalous changes to the database.
(2) The alarm propagation engine receives alarms from either the local node or remote nodes and can further propagate the alarm to other remote nodes or its local diagnosis module.
(3) The diagnosis module cooperates with other diagnosis modules in reaching consensus on what action to take. The most basic form of action is to simply log a diagnosis for further review by an operator; other possible actions are to further propagate the alarm through the alarm propagation module or to feed the diagnosis to the resolver module.
(4) The resolver always logs results coming from the diagnosis module and in some extreme cases initiates a specialized protocol to remedy the diagnosed fault.

Anomaly Detection

    The anomaly detection module generates an alarm whenever, because of an event that triggers changes to the PNNI database, the state of the database becomes anomalous. Upon any change to the local database, this module interprets a set of heuristic rules that try to catch anomalies that could lead to a compromise of the routing infrastructure's integrity. In addition to supervising the consistency of the database state, this module exploits statistical knowledge accumulated during the operation of the routers to detect database operations that are unexpected.
    The rules that describe when a database change is to be considered suspicious determine a set of conditions with respect to (1) the type of operation, (2) the state of the database and the evaluation of the database history, and (3) a logical expression that combines all the conditions.
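    The three-part rule structure described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Rule` class, the representation of the database state as a dictionary of links, the history as a list of past operation names, and the example thresholds are hypothetical and are not part of PNNI or of the PGRIP design.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical representations (not defined by PNNI or PGRIP):
# the database state maps a link name to a metric, and the history
# is the list of operation names applied so far.
State = Dict[str, int]
History = List[str]
Condition = Callable[[str, State, History], bool]


@dataclass
class Rule:
    name: str
    conditions: List[Condition]            # (1) operation type, (2) state and history
    combine: Callable[[List[bool]], bool]  # (3) logical expression over the conditions

    def is_suspicious(self, op: str, state: State, history: History) -> bool:
        results = [cond(op, state, history) for cond in self.conditions]
        return self.combine(results)


# Illustrative rule: a deletion is suspicious when few links remain
# AND at least two of the last three operations were deletions.
rule = Rule(
    name="delete-storm",
    conditions=[
        lambda op, state, hist: op == "delete",                  # operation type
        lambda op, state, hist: len(state) <= 2,                 # database state
        lambda op, state, hist: hist[-3:].count("delete") >= 2,  # database history
    ],
    combine=all,  # logical AND of all conditions
)

state = {"A-B": 1, "B-C": 1}
history = ["add", "delete", "delete"]
print(rule.is_suspicious("delete", state, history))  # True
print(rule.is_suspicious("add", state, history))     # False
```

    Keeping the combining expression separate from the individual conditions lets each router customize how conditions are combined without rewriting the conditions themselves.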

Alarm Propagation

    As a result of applying anomaly detection heuristics, routers can generate alarms that are propagated throughout the PNNI routing infrastructure. The PNNI architecture is hierarchical. The routing nodes are arranged in peer groups whose members share a common view of the state of the network (link states and reachability information) by constantly flooding the peer group with messages, called PNNI Topology State Packets (PTSPs), that synchronize the different nodes' databases.
    The PNNI flooding protocol is such that a router receiving a packet that changes any information in its own routing database automatically relays the change to all the other routers directly connected to it (minus the router from which the change was received or routers residing in other peer groups). PGRIP's anomaly detection and diagnosis requirements map very naturally to this mechanism and can fully exploit this inherent feature. The only requirement for PGRIP's proper operation is to keep the peer group fully connected at all times so that messages flooding the peer group reach every node in the group, even during multiple failures. To meet this requirement, we formulate the first condition under which PGRIP must operate.
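    The relay behavior described above can be sketched as a small simulation. This is a simplified model under stated assumptions, not the PNNI flooding procedure itself: the topology is a hypothetical adjacency map restricted to one peer group, an update is relayed only the first time it is seen, and a faulty router is modeled as silently dropping the update.

```python
from collections import deque
from typing import AbstractSet, Dict, Set

Topology = Dict[str, Set[str]]  # router -> directly connected peers in the same peer group


def flood(topology: Topology, origin: str,
          faulty: AbstractSet[str] = frozenset()) -> Set[str]:
    """Return the routers that receive an update originated at `origin`."""
    received = {origin}
    queue = deque([(origin, None)])  # (router, neighbor the update came from)
    while queue:
        router, sender = queue.popleft()
        for peer in topology[router]:
            if peer == sender or peer in received:
                continue  # never echo back; a duplicate changes nothing, so it is not relayed
            if peer in faulty:
                continue  # assumption: a faulty router neither stores nor relays
            received.add(peer)
            queue.append((peer, router))
    return received


# Hypothetical four-router ring and three-router chain.
ring = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C", "A"}}
chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}

print(sorted(flood(ring, "A", faulty={"B"})))   # ['A', 'C', 'D']
print(sorted(flood(chain, "A", faulty={"B"})))  # ['A']
```

    In the ring, the update still reaches every nonfaulty router because a second path exists around the faulty one; in the chain, a single fault cuts the peer group in two.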
    Given a maximum of m simultaneous faults (either malicious or nonmalicious), the entire peer group must be (m+1)-connected. "N-connected" means that any two routers must have N disjoint paths to communicate with each other. That is, the N paths do not share any nodes in common other than the two routers in question. This condition guarantees that even if the m faults block m routes there exists at least one other route through which the flooding operations can take place. Notice that the (m+1)-connected requirement does not impose changes to the flooding algorithm but simply imposes a precise and rigorous amount of redundancy in the communication links that can be easily introduced at the planning stage of the network topology.
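    A planner could check this condition on a candidate topology with a sketch like the following. It relies on Menger's theorem: a graph with at least m+2 nodes has m+1 node-disjoint paths between every pair of nodes exactly when removing any m or fewer nodes leaves it connected. The brute-force search is exponential in m and is meant only for small examples; the function names and the adjacency-map representation are assumptions made for illustration.

```python
from itertools import combinations
from typing import AbstractSet, Dict, Set

Topology = Dict[str, Set[str]]  # router -> directly connected peers


def is_connected(topology: Topology, removed: AbstractSet[str]) -> bool:
    """True if the routers not in `removed` can all reach one another."""
    alive = [node for node in topology if node not in removed]
    if not alive:
        return True
    seen = {alive[0]}
    stack = [alive[0]]
    while stack:
        node = stack.pop()
        for peer in topology[node]:
            if peer not in removed and peer not in seen:
                seen.add(peer)
                stack.append(peer)
    return len(seen) == len(alive)


def tolerates_faults(topology: Topology, m: int) -> bool:
    """True if the peer group is (m+1)-connected."""
    if len(topology) < m + 2:
        return False  # too few routers to be (m+1)-connected
    return all(
        is_connected(topology, set(removed))
        for k in range(m + 1)                     # every fault count from 0 to m
        for removed in combinations(topology, k)  # every choice of k faulty routers
    )


# Hypothetical four-router ring: 2-connected, so it tolerates one fault but not two.
ring = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C", "A"}}
print(tolerates_faults(ring, 1))  # True
print(tolerates_faults(ring, 2))  # False
```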
    Because of the above condition, all routers in the peer group will eventually receive anomalous database updates; therefore, we can freely choose a subset of the nodes in the peer group to contain PGRIP. Routers equipped with PGRIP are responsible for collecting anomalous routing database updates, filtering them, and analyzing them to diagnose the anomaly. Let us call this subset of routers the peer group core group (PGCG) and assume that the PNNI Peer Group Leader (PGL) is one of the nodes in the PGCG. After a distributed diagnosis phase executed among the PGCG nodes, if the PGL is believed to be nonfaulty and is therefore not preempted (PGRIP also handles the case in which the PGL is found to be faulty), the PGL takes one or more of three actions: (1) log the diagnosis locally and take no further action, (2) pass the diagnosis to the fault resolver module, or (3) use higher-level binding information to flood the alarm in its higher-level PNNI group.

Diagnosis

    A fault diagnosis system (1) should be able to correctly interpret anomaly reports (alarms) so that appropriate action can be taken in case of significant failures (in particular, it should avoid false positive and false negative diagnoses) and (2) should be itself resilient to faults.
    Researchers in the field of distributed network management have long been investigating techniques for performing diagnosis of network malfunctions through alarm correlation. The rationale is that given a set of symptoms represented by a variety of distinct alarm messages, an expert system should be able to correlate the symptoms and diagnose the underlying problem. Although the limitation of fault diagnosis in a Byzantine environment has been long recognized, recent work has demonstrated that, under reasonable fault modeling assumptions, by recording the history of anomalous events, one can construct an algorithm that converges to satisfactorily high levels of accuracy of Byzantine fault diagnosis.
    An integrity protection mechanism is useful only if it can protect itself from faults. PGRIP's approach to making the diagnosis system resilient to faults is based on replication. Securing the diagnosis system through cryptographic means would require a key management hierarchy and nonportable cryptographic algorithms, and would still not protect the system from nonmalicious faults. Instead, by judiciously picking a subset of the routers to cooperate in the fault diagnosis, PGRIP can tolerate a failure in any node in the system, including one or more of the nodes performing the diagnosis. The theoretical results arising from the formulation of Byzantine agreement algorithms can be directly applied to the formulation of a diagnosis system that is able to tolerate m simultaneous faults (either malicious or nonmalicious).
To guarantee a correct diagnosis in the presence of m malicious faults, at least 3m+1 routers in the system must independently perform the diagnosis.
    The PGCG routers must have additional redundancy in the connectivity among them. In fact, another well-known result from Lamport's Byzantine General Agreement problem translates to the following:
If the 3m+1 PGCG nodes are connected through point-to-point connections (as in a PNNI network), the topology of the 3m+1 nodes must be at least 3m-connected.
    This condition prevents malicious PGCG routers from affecting the consensus by intercepting and changing messages while the routers perform the diagnosis agreement algorithm.
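    The two sizing conditions stated above (at least 3m+1 diagnosing routers, connected by a topology that is at least 3m-connected) reduce to simple arithmetic. The helper names below are hypothetical and the sketch only restates the bounds given in the text; it does not implement the agreement algorithm itself.

```python
from typing import Dict


def pgcg_requirements(m: int) -> Dict[str, int]:
    """Minimum PGCG size and connectivity needed to tolerate m simultaneous faults."""
    if m < 0:
        raise ValueError("m must be nonnegative")
    return {"min_nodes": 3 * m + 1, "min_connectivity": 3 * m}


def max_tolerated_faults(n: int) -> int:
    """Largest m with 3m+1 <= n, based on the node count alone."""
    return max((n - 1) // 3, 0)


for m in (1, 2, 3):
    print(m, pgcg_requirements(m))
# 1 {'min_nodes': 4, 'min_connectivity': 3}
# 2 {'min_nodes': 7, 'min_connectivity': 6}
# 3 {'min_nodes': 10, 'min_connectivity': 9}

print(max_tolerated_faults(6))  # 1
print(max_tolerated_faults(7))  # 2
```

    Note that with exactly 3m+1 nodes, a 3m-connected topology is a full mesh, since every node needs 3m distinct neighbors. The node-count bound is also only one of the two conditions: a PGCG of 7 routers tolerates 2 faults only if its topology also meets the connectivity requirement.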

Resolver

    The resolver is activated in the PGL node after the PGCG routers agree on a diagnosis. The resolver module should answer, with very specific countermeasures, only those threats that are particularly severe. The resolver module should be used carefully or not used at all because it can affect the network's operation. If misused, the resolver can introduce instability or side effects that may be worse than the original fault.
    We see the resolver module as being capable of offering additional protocols to the PNNI standard so that (1) routing information can be verified, thus exploiting the redundancy and replication of resources in the PNNI hierarchy and (2) the PGCG can preempt some nodes by cutting them out of the routing hierarchy until an operator can assess and remedy potential integrity compromises.

Conclusion

    We have presented a novel approach for securing global routing infrastructures and have instantiated our ideas in the design of PGRIP. The PGRIP design can be used to augment the current PNNI standard and to offer a high level of integrity protection without requiring significant changes to the standard and without relying on cryptography. PGRIP handles both malicious and nonmalicious faults in a unified manner and can therefore be used as an additional level of assurance for the proper operation of large communication networks. Sprint Communications has funded our efforts and is interested in further developing the ideas contained in PGRIP and integrating them into the network management plane of their ATM routing infrastructure. PGRIP's effectiveness is intimately tied to its ability to properly detect and process anomalies; therefore, as the next development phase of our effort, we will reproduce numerous fault scenarios and accumulate heuristic experience for the specification of anomaly detection and fault diagnosis rules.


