CERT



A Survivability Metric for Telecommunications: Insights and Shortcomings

Andrew P. Snow
Georgia State University
Department of Computer Information Systems
asnow@gsu.edu

Abstract


The need for an empirical survivability measure is introduced. ANSI developed a metric, called an outage-index, which provides insights into telecommunication survivability. Examples of insights gained by applying this metric to telecommunication switch outages are given. However, this metric has significant disadvantages and shortcomings, which are discussed. Lastly, the reasons why this metric is deemed too "Carrier-centric" are explained.

1.        Introduction
In order to know how effective designers and operators are in delivering and operating survivable systems, a measure is needed. One way to assess survivability is to study episodes in which the system did not survive. By investigating non-survivable episodes we gain insights into how one might improve survivability, by looking at episode trends and causes. If the episodes are rare, we learn less but might surmise the system enjoys some attractive level of survivability; if the episodes occur with some frequency we learn more, but might surmise that system survivability suffers. However, the latter case offers the chance to improve the survivability of the system, if we can somehow measure survivability. Through empirical measurement of non-survivable episodes, we move from the realm of ‘what if’ to that of ‘what happened’.
In telecommunications, non-survivable episodes are called outages. Each outage is induced by an event, often a failure event. McDonald was the first to propose a survivability metric based upon the magnitude and duration of an outage. This metric, called the ULE (shorthand for user-lost-erlangs), is a logarithmic measure like the Richter scale. McDonald hoped the Richter analogy would give people a ‘feel’ for the impact of an outage. The metric is defined by ULE = log10(M × D), where M is the magnitude of the outage (number of customers affected) and D is its duration [4]. However, telecommunication outages are often complex and not amenable to such a simple measure. An outage can be more completely defined in terms of several important random variables: (1) the arrival time of the event that induced the outage, (2) the duration of the outage, (3) the number of subscribers affected (magnitude), and (4) the type of service affected. The latter three attributes are referred to as a service ‘outage-triple’. Some telecommunication outages are severe enough to affect multiple services, and in these instances each service might have a unique duration-magnitude profile [8].
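As a small illustration of the ULE definition above, the following sketch computes the metric for two made-up outages. The text does not state the unit of D, so the duration unit here is an assumption, and the function name `ule` and the example figures are hypothetical.

```python
import math


def ule(magnitude: float, duration: float) -> float:
    """Return ULE = log10(M * D).

    magnitude: number of customers affected (M).
    duration:  outage duration (D); the unit is an assumption,
               since the text does not specify one.
    """
    if magnitude <= 0 or duration <= 0:
        raise ValueError("magnitude and duration must be positive")
    return math.log10(magnitude * duration)


# Hypothetical outages, chosen only to show the logarithmic scale.
small_outage = ule(1_000, 10)      # M * D = 10**4
large_outage = ule(100_000, 100)   # M * D = 10**7

print(small_outage, large_outage)
```

Because the scale is logarithmic, an outage whose magnitude-duration product is 1,000 times larger rates only three units higher, which is the Richter-like behavior McDonald intended.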

2.        Outage-Index Metric
In response to regulatory and legislative interest in the large-scale telecommunication outages of the early 1990s, ANSI T1A1.2 developed a voice communications survivability metric for industry. The outage-index was generalized for voiceband services in wireline, wireless, PCS, CATV, and satellite networks [1], and was also defined for FCC-reportable outages affecting at least 30,000 customers for at least 30 minutes [2]. The outage-triple approach was adopted, allowing an outage episode to be characterized by a single metric that incorporates service, duration, and magnitude weights, as follows:

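\[ I(O) = \sum_{j=1}^{N} W_S(j) \, W_D(j) \, W_M(j) \]

where j = 1, ..., N indexes the services impacted by the outage, W_S(j) is the service weight, W_D(j) is the duration weight, and W_M(j) is the magnitude weight for service j.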
For traditional voice in the wire-line Public Switched Telephone Network (PSTN), the service weights are discrete importance factors that range from 1 to 3, depending upon whether local, toll, or emergency services are impacted by the outage: IntraLATA Intraoffice = 1, IntraLATA Interoffice = 2, InterLATA = 2, and E911 = 3. The magnitude weight ranges from 0.0 to 16.67, and a defined function maps the number of users impacted by the outage to a value in that range. In addition, the magnitude weight is multiplied by a Time Factor (TF), where TF = 1.0 for daytime, 0.3 for evening, 0.2 for weekends, and 0.1 for late night. The duration weight ranges from 0.0 to 2.5, and a defined function likewise maps the duration of the outage to a value. Both magnitude and duration weights approach their respective maximums asymptotically.
To determine the outage impact for a particular time period (e.g. day, month, quarter, or year), we simply sum the outage-indices of all the outages occurring during that period. An industry group, the Alliance for Telecommunications Industry Solutions (ATIS), uses the outage-index metric to assess the impact of large-scale telecommunications outages that are reportable to the FCC. In addition to being summed into a quarterly metric, the quarterly outage-index time series is tested for trend; an example can be seen in [6].
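The calculation described in this section can be sketched as follows, assuming that the three weights are multiplied for each affected service and the products are summed over services. The service weights, time factors, and weight maxima are taken from the text. The ANSI weight functions themselves are not reproduced here, so `magnitude_weight` and `duration_weight` are hypothetical saturating placeholders (with made-up `scale` parameters) that only share the stated ranges and asymptotic behavior.

```python
import math

# Values stated in the text for wire-line PSTN voice.
SERVICE_WEIGHT = {
    "intralata_intraoffice": 1,
    "intralata_interoffice": 2,
    "interlata": 2,
    "e911": 3,
}
TIME_FACTOR = {"daytime": 1.0, "evening": 0.3, "weekend": 0.2, "late_night": 0.1}
MAX_MAGNITUDE_WEIGHT = 16.67
MAX_DURATION_WEIGHT = 2.5


def magnitude_weight(customers: float, scale: float = 500_000.0) -> float:
    """HYPOTHETICAL placeholder curve: rises from 0 toward 16.67."""
    return MAX_MAGNITUDE_WEIGHT * (1.0 - math.exp(-customers / scale))


def duration_weight(minutes: float, scale: float = 240.0) -> float:
    """HYPOTHETICAL placeholder curve: rises from 0 toward 2.5."""
    return MAX_DURATION_WEIGHT * (1.0 - math.exp(-minutes / scale))


def outage_index(services, period: str) -> float:
    """Sum service * duration * (time factor * magnitude) weights.

    services: iterable of (service, customers, minutes) outage-triples.
    period:   key into TIME_FACTOR for when the outage occurred.
    """
    tf = TIME_FACTOR[period]
    return sum(
        SERVICE_WEIGHT[service] * duration_weight(minutes) * tf * magnitude_weight(customers)
        for service, customers, minutes in services
    )


def period_index(outages) -> float:
    """Sum the outage-index over all outages in a reporting period."""
    return sum(outage_index(services, period) for services, period in outages)


# Hypothetical quarter with two outages.
quarter = [
    ([("interlata", 50_000, 90), ("e911", 50_000, 45)], "daytime"),
    ([("intralata_intraoffice", 120_000, 200)], "late_night"),
]
print(round(period_index(quarter), 3))
```

With the real ANSI curves substituted for the two placeholder functions, the same structure would yield the per-outage and quarterly figures discussed in the text.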

3.        Outage-Index Applied to Local Telecommunication Switch Failures
Local switch failures are also reported to the FCC by telecommunications carriers on a yearly basis. All switch outage incidents lasting two minutes or more must be reported, along with the number of access lines affected and the reason for the failure. Here we present the important results of applying the outage-index to 6,121 switch failure-induced outages reported during the period October 1, 1992 through December 31, 1996. A reliability analysis of these events, which segregates switch failures into failure cause categories, is reported in [7]. Using the same failure cause categories, we may assess the impact of switch outages using the outage-index. For each switch outage, an outage-index is calculated and then summed within each category. From a survivability perspective, we find that:


From these empirical findings, we may conclude:

These results show the usefulness of empirical survivability in assessing the impact of telecommunication outages due to local switch failures. Such results should be useful to telecommunication service providers, vendors and policymakers.

4.        Outage-Index Critique
On a positive note, the outage-index does provide a metric that implements the outage-triple, an important concept. Through the outage-triple, a measure of impact is generated and reduced to a single number for each outage episode. In addition, as shown in Section 3, applying the outage-index to local switch failures provides interesting and possibly valuable survivability insights. However, as defined by ANSI, the outage-index does have several weaknesses.
First, the discrete service weights are overly simplistic in that they dictate uniform importance across all subscribers and geographic regions. The importance of service to a residential subscriber might not equal its importance to a business user. Likewise, the importance of a call in an urban area may not equal that of a call in a rural area. Perhaps the importance factors could be better determined through surveys in different locales, since each region might have a different mix of business and residential users. The uniform and static importance factors used are unrealistic.
Second, the asymptotic weights for magnitude and duration discount the impact of large and/or long-duration outages. These factors undoubtedly decrease variance and may make the outage-index a better-behaved statistical process, but at what price? It would seem that the outlier cases (long duration, large magnitude) are what is important and what should receive the most focus. However, as the weights are presently defined, the impact of the most significant outage events is minimized.
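The discounting effect can be made concrete with a short sketch. It compares a weight that saturates at the 16.67 maximum with an unbounded alternative. Both curves and the `scale` parameter are hypothetical stand-ins, not the ANSI functions; only the maximum comes from the text.

```python
import math

MAX_MAGNITUDE_WEIGHT = 16.67  # upper bound stated in the text


def saturating_weight(customers: float, scale: float = 500_000.0) -> float:
    """HYPOTHETICAL curve that approaches the maximum asymptotically."""
    return MAX_MAGNITUDE_WEIGHT * (1.0 - math.exp(-customers / scale))


def unbounded_weight(customers: float, scale: float = 500_000.0) -> float:
    """HYPOTHETICAL linear alternative with the same initial slope."""
    return MAX_MAGNITUDE_WEIGHT * customers / scale


# Compare how each weight responds as the outage grows tenfold.
for customers in (100_000, 1_000_000, 10_000_000):
    print(
        customers,
        round(saturating_weight(customers), 2),
        round(unbounded_weight(customers), 2),
    )
```

Under the saturating curve, a tenfold larger outage near the cap earns only a marginally larger weight, whereas the unbounded curve scales with the number of customers affected.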
Third, the outage-index has poor statistical prediction capabilities as applied to FCC-reportable large-scale outages. A study by the National Communications System of the outage indices for a two-and-a-half-year period showed very poor power of the test in detecting trends in the quarterly outage-index. In fact, the probability of detecting a 50% increase in the outage-index was found to be only 0.42 [5].
Fourth, the maximum values of the magnitude weight (16.67) and the duration weight (2.5) make the magnitude of an outage almost seven times more important than its duration. Although this might add incentive for Carriers not to concentrate traffic on individual elements of the telecommunications infrastructure, it does not add incentive to recover from outages quickly. The outage-index is fairly insensitive to large variations in duration, especially at the magnitudes of most large-scale outages reported.
Fifth, the magnitude and duration weights should grow without bound rather than approach a finite asymptote. Also, the concave flattening of these weights as the asymptotic limit is approached is perhaps the opposite of what society might perceive.
Lastly, normalization, which allows the outage-index to be adjusted to account for overall network growth, is not a perspective relevant to users impacted by an outage. In the PSTN, for instance, the yearly outage-index is compared to the baseline year, July 1992 through June 1993. The first part of normalization is an adjustment to the magnitude weight, based upon network growth in the number of access lines and call volume in the network. To the industry this is justifiable: "Since the network is growing… the magnitude impact is expected to decrease for the same number of customers affected." As an example of magnitude normalization, ANSI says that if there was 10% growth in the network since the baseline year, a 77,000 customer impact is really a 77,000/1.1, or 70,000, customer impact. The magnitude weight is then 1.31 instead of the 1.52 obtained with 77,000. In addition, the normalization methodology allows for an upward adjustment of reporting thresholds, which drops outage events from the analysis. For instance, in the PSTN the magnitude reporting threshold is 30,000 customers affected. Using the same 10% network growth argument, an outage of magnitude 30,000 could be dropped from the analysis, because the normalized threshold is 30,000 × 1.1 = 33,000 [2].
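The normalization arithmetic above can be sketched in a few lines. The function names are hypothetical, and the sketch assumes growth is applied as a simple factor of (1 + growth): the customer count is divided by the factor and the reporting threshold is multiplied by it (30,000 × 1.1 = 33,000).

```python
def normalized_magnitude(customers: float, growth: float) -> float:
    """Scale the customer count down by network growth since the baseline year."""
    return customers / (1.0 + growth)


def normalized_threshold(threshold: float, growth: float) -> float:
    """Scale the reporting threshold up by network growth since the baseline year."""
    return threshold * (1.0 + growth)


def is_retained(customers: float, threshold: float, growth: float) -> bool:
    """True if the outage still meets the growth-adjusted reporting threshold."""
    return customers >= normalized_threshold(threshold, growth)


GROWTH = 0.10          # 10% network growth, as in the ANSI example
THRESHOLD = 30_000     # PSTN magnitude reporting threshold

print(round(normalized_magnitude(77_000, GROWTH)))   # customers after normalization
print(round(normalized_threshold(THRESHOLD, GROWTH)))  # adjusted threshold
print(is_retained(30_000, THRESHOLD, GROWTH))        # 30,000-customer outage
```

From the point of view of the 30,000 affected customers nothing has changed, yet the event no longer counts toward the index once the threshold is adjusted.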

5.        Conclusions
The outage-index is better than no survivability metric, and as defined by ANSI, allows survivability insights to be gained. However, the metric is deemed Carrier-centric, because of the following outage-index attributes:

The outage-index is a good start, but the Carrier-centric aspects of this metric should be reconsidered, and balanced against competing user-centric viewpoints.

6.        References

[1] ANSI T1A1.2 Working Group on Network Survivability Performance, "A Technical Report on Network Survivability", Report No. 24A, Alliance for Telecommunications Industry Solutions, 1997.

[2] ANSI T1A1.2 Working Group on Network Survivability Performance, "A Technical Report on Enhanced Analysis of FCC-Reportable Service Outage Data", Report No. 42, Alliance for Telecommunications Industry Solutions, 1995.

[3] McCarthy, S., "Reliability keeps up with network growth", Telephony, Vol. 228, No. 23, p. 14, June 1995.

[4] McDonald, J., "Public Network Integrity", IEEE Journal On Selected Areas In Communications, Vol. 12, No. 1, January 1994.

[5] National Communications System, Office of the Manager, "NCS Final Service Outage Assessment Report", July 31, 1995.

[6] Network Reliability Steering Committee, "Annual Report 1997: Prepared For The Change", Alliance for Telecommunications Industry Solutions.

[7] Snow, A., "A Reliability Analysis of Local Telecommunication Switches", 6th International Conference on Telecommunications Modeling and Analysis, Nashville TN, 1998.

[8] Zolfaghari, A. and Kaudel, F., "Framework for network survivability performance", IEEE Journal On Selected Areas In Communications, Vol. 12, No. 1, January 1994.


