
The Uncleanliness Vector

Histories of Hostile Activity

Problem Addressed

In an operational network security analysis environment, there is little time for gathering context information on attackers. Analysts need tools that present background information in an easily digestible way.

The Uncleanliness Vector (UV) project attempts to give security analysts context information on external networks: whether the network in question has engaged in other malicious activity and, if so, what kind.

Technical Approach

The hypothesis behind the UV approach is that an attacker will take over as many machines as possible to serve as hosts for further attacks. Therefore, the likelihood that a particular IP address is used to host an attack (such as a scan or distributed denial-of-service attack) is a function of the security administration of the network on which that address resides. A network with a strong security policy and rigorous maintenance will identify and remove compromised hosts, while one that does not demonstrate this vigilance will remove those hosts more slowly and see the same hosts repeatedly compromised.

The UV describes the likelihood of attack from a block of IP addresses using an uncleanliness score of 0 to 100; a network with a value of 0 is considered clean, while a network with a rating of 100 is considered totally unclean.

The UV is composed of several distinct uncleanliness elements, each of which was assigned an uncleanliness value. We characterized this set of numbers as a vector, and the aggregate score as a weighted average of the individual elements. The weights indicate a prioritization of the uncleanliness elements by their relative threat level to the monitored network.

The vector elements and their relative weights are as follows.

Spyware site hosting (weight 6). Spyware is software installed on a client machine, generally without the user’s knowledge, that gathers information and reports that information back to the spyware site. The information gathered can range from host configuration data, to browsing history, to personally identifiable information (PII). A spyware site host is generally controlled by the entity that created the spyware.

Phishing site hosting (weight 5). Phishing attacks attempt to gather login credentials, financial information, or PII for the purpose of financial fraud or identity theft. Phishing sites are generally hosted on compromised machines, and the gathered information is reported back to email drops or other data gathering points.

Botnet command-and-control server hosting (weight 4). Botnets are networks of remotely controlled programs that allow an attacker to perform various actions such as denial-of-service attacks, spamming, and anonymous proxying. Botnet command-and-control servers are sites that the bots “call home” in order to receive commands. These traditionally are Internet Relay Chat (IRC) servers, but can include almost any type of service—one common alternative is web servers, where the bot regularly polls a URL to see if any commands have been issued. Botnet command-and-control server networks are not usually owned by the attacker, and thus either represent a compromised host or a service piggybacking on a legitimate server (such as an IRC server in both legitimate and illegitimate use).

Botnet members installed (weight 3). Hosts on which botnet members (the “bots” themselves) are installed are most likely compromised hosts.

Open proxy hosting (weight 3). Anonymous proxy networks can be hosted (1) deliberately by the network owner, (2) unknowingly as the result of a compromised host on which a proxy application was installed, or (3) unknowingly through a misconfigured web server or internal proxy. Attacker-installed proxies could be “bot” programs, or simply specialized proxy software.

Spamming (weight 2). Hosts sending spam are usually compromised hosts, because the success of blacklists and other anti-spam measures has meant that dedicated spam hosting is no longer efficient.

Scanning (weight 1). Scanning is reconnaissance activity by attackers attempting to determine what services are available on a host. This involves probing ports on the host to determine which are open. Such probing could be purely investigatory or could be combined with an attack. Scanning usually originates from compromised hosts.

Figure 1: Number of Networks Showing a Particular Number of Unique Scanning IP Addresses (log-log scale)

2006 Accomplishments

Early in our analysis, we established that malicious activity is indeed distributed heterogeneously across the network. If malicious activity were distributed homogeneously, then the amount of activity would depend only on the number of hosts on a network. The most complete data we had available for any of the UV elements was scanning data, so we analyzed how scanning was distributed across the Internet. We found that scanning is concentrated, coming from only about 1% of networks, and that the amount of scanning does not depend strongly on the size or population of the network.

Figure 1 shows the distribution of scanners, rendered as a log-log scale plot. The x-axis indicates the number of IP addresses in a particular /24 network (class C network) that were observed scanning. The y-axis indicates the number of /24 networks that showed the number of unique scanning IP addresses given on the x-axis. Note that the majority of networks have only one host scanning. The chance of having two or more hosts scanning is nearly 10 times lower. The chance of having 10 or more hosts scanning is nearly 10,000 times lower. This type of distribution is called a “power law” distribution, and its presence here implies that scanning is strongly inhomogeneous: a few networks are heavy sources of scanning, while most networks tend to be clean.
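The counts behind a plot like Figure 1 can be derived from a plain list of observed scanner addresses. The following Python sketch is illustrative only and is not the project's tooling: the function names are hypothetical, and the sample addresses come from reserved documentation ranges rather than from real scanning data.

```python
from collections import Counter, defaultdict
import ipaddress


def scanners_per_slash24(scanner_ips):
    """Count the unique scanning addresses observed in each /24 network."""
    hosts = defaultdict(set)
    for ip in scanner_ips:
        # strict=False masks the host bits, giving the enclosing /24.
        network = ipaddress.ip_network(f"{ip}/24", strict=False)
        hosts[network].add(ipaddress.ip_address(ip))
    return {network: len(addrs) for network, addrs in hosts.items()}


def network_histogram(counts_by_network):
    """Map each unique-scanner count to the number of /24s showing it."""
    return Counter(counts_by_network.values())


# Hypothetical observations; repeated addresses are counted once.
sample = ["192.0.2.1", "192.0.2.1", "192.0.2.7", "198.51.100.4", "203.0.113.9"]
per_net = scanners_per_slash24(sample)
hist = network_histogram(per_net)
```

Plotting the keys of `hist` against its values on log-log axes would give the same kind of view as Figure 1, with the x-axis holding the unique scanner count and the y-axis the number of /24 networks.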

Different types of malicious activity are related. For example, when we compared data for bots and scanning, networks where bots were reported also showed significant increases in scanning of the monitored networks. Figure 2 shows the time behavior of scanning observed from reported botnets (red line) versus all networks (green line). The observed scanning from botnet networks was much higher than the expected scanning (dotted line). Interestingly, scanning started several weeks before the botnets were reported, implying that the bots went undetected for some time.

We needed to translate the data on scanning, spamming, and other activity into some measure of the “uncleanliness” of a network, expressed as a score on a scale of 0 to 100. During our investigations, it became clear that this measure should not be related to the volume of malicious activity observed. In some cases we did not have access to data on volume of activity. In other cases it was clear that such data was not very meaningful.

The quantity that instead appeared to provide the most useful measure of the uncleanliness of a network was the number of unique hosts (i.e., hosts with unique IP addresses) that engaged in the hostile activity. This is because in most cases, the vector elements were really measuring the likelihood that a host on that network could be compromised. If a network has a large number of compromised hosts, then the network is more likely to be in use by an attacker.

Figure 2: Comparison of Scanning by Bots (red line) and Overall Scanning (green line) Over Time

Figure 1 and similar figures for the other vector elements show that malicious activity is distributed sparsely, and that there are relatively few networks where there are many malicious hosts. This observation tended to support our plan of using unique host counts as the measure of uncleanliness.

For each vector element, we chose two thresholds (s0, s1) on the axis of unique host counts (x). The uncleanliness value was then chosen according to the equation:

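$$
U(x) =
\begin{cases}
0, & x = 0 \\
10, & 0 < x < s_0 \\
10 + 90\,\dfrac{x - s_0}{s_1 - s_0}, & s_0 \le x < s_1 \\
100, & x \ge s_1
\end{cases}
\tag{1}
$$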

In other words, uncleanliness is 0 if no hosts are unclean, 10 for any number of hosts greater than zero but below the low threshold, linearly interpolated from 10 to 100 between the thresholds, and 100 if the number of unclean hosts meets or exceeds the high threshold.
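The piecewise rule just described can be written as a short function. This is a minimal Python sketch based only on the prose description; the function name is hypothetical, and the thresholds in the example (5 and 50) are made-up values for illustration, not the thresholds of Table 1.

```python
def uncleanliness(x, s0, s1):
    """Score one vector element from its unique host count x.

    s0 and s1 are the low and high thresholds for that element.
    """
    if x < 0:
        raise ValueError("host count must be non-negative")
    if not 0 < s0 < s1:
        raise ValueError("thresholds must satisfy 0 < s0 < s1")
    if x == 0:
        return 0.0            # no unclean hosts observed
    if x < s0:
        return 10.0           # some unclean hosts, below the low threshold
    if x >= s1:
        return 100.0          # at or above the high threshold
    # Linear interpolation from 10 (at s0) to 100 (at s1).
    return 10.0 + 90.0 * (x - s0) / (s1 - s0)


# Hypothetical thresholds, chosen only for illustration.
example_score = uncleanliness(14, s0=5, s1=50)
```

Note that the score jumps from 0 to 10 as soon as a single unclean host appears, so any observed hostile activity registers before the interpolation range is reached.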

Based on our analysis, we found the thresholds shown in Table 1 to be appropriate for our input data (these values can of course be changed to suit the patterns observed in the input data actually used).

Table 1: Thresholds Used for Calculating Uncleanliness Scores

The individual elements are then combined into an overall uncleanliness score as a weighted average, that is,

$$
U = \frac{\sum_i \beta_i U_i}{\sum_i \beta_i}
\tag{2}
$$

where $U_i$ is the value of uncleanliness vector element $i$, and $\beta_i$ is the weight of that vector element, as given in the listing of vector elements above.
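As a worked illustration of the weighted average, the Python sketch below combines per-element scores using the weights from the listing of vector elements. The dictionary keys and the function name are hypothetical labels, and treating an unreported element as a score of 0 is an assumption of this sketch, not something the text specifies.

```python
# Weights from the listing of vector elements; key names are illustrative.
WEIGHTS = {
    "spyware": 6,
    "phishing": 5,
    "botnet_cc": 4,
    "bots": 3,
    "open_proxy": 3,
    "spam": 2,
    "scanning": 1,
}


def aggregate_score(element_scores):
    """Weighted average of element scores, each on the 0 to 100 scale."""
    unknown = set(element_scores) - set(WEIGHTS)
    if unknown:
        raise ValueError(f"unknown vector elements: {sorted(unknown)}")
    # Assumption: an element with no data contributes a score of 0.
    weighted_sum = sum(
        weight * element_scores.get(name, 0.0)
        for name, weight in WEIGHTS.items()
    )
    return weighted_sum / sum(WEIGHTS.values())


# A network seen only scanning (score 100) and spamming (score 10).
overall = aggregate_score({"scanning": 100, "spam": 10})
```

Because scanning and spamming carry the lowest weights, this network's overall score stays low (120 / 24 = 5), whereas a maximal spyware score alone would contribute 600 / 24 = 25.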

The data sources differed for each vector element. In some cases (proxies and the initial version of the spamming element), we found that publicly available blacklist information was useful. In other cases (scanning and later versions of spamming), we had access to large volumes of actual traffic and were able to identify the malicious activity. In still other cases (botnets and botnet C&C), we processed incident report information. Of particular interest from a research point of view are the methods we have developed for identifying scanners and spammers. Scanner identification was performed using the methodology discussed in the 2005 CERT Research Report and explained in detail in a CMU/SEI technical report published this year [1].

The spammer identification method was developed in the course of this research. We observed, using network flow data [2], that spam traffic had certain characteristics that differed significantly from the traffic observed from known good email servers. In particular, spam was not accompanied by Domain Name System queries, and the set of hosts with which spammers communicated was highly dynamic. We were able to construct a spam identification process using these heuristics, which now feeds our spam uncleanliness measurements.

While constructing measures of uncleanliness of the public Internet, it was natural to think about the reflected image of uncleanliness: how much has the external uncleanliness affected the internal network, leading to what one might call “corruption”? This led us to define a corruption vector as a complement to the uncleanliness vector.

For certain of the vector elements, communicating with unclean networks is not necessarily a good indicator of corruption (e.g., external hosts are constantly scanning; this does not automatically lead to corruption of the scanned hosts). For other vector elements, communication with an external unclean host is in fact an indicator of internal corruption. The latter is true for proxies (internal hosts attempting to anonymize activity), botnet command-and-control (possible bots on the internal network), and spyware (probable spyware installed on an internal host).

The measure of corruption was (similarly to uncleanliness) based on the number of unique hosts seen communicating with unclean external hosts. However, significant challenges in counting unique hosts were encountered, due to the presence of internal gateways (Network Address Translation devices or proxies). When a spyware-infected host, for example, is located behind a gateway, the gateway’s IP appears to be infected with spyware, and it may be one of the few active IP addresses (or is perhaps the only active IP) on that network block. The result is a very high corruption score, when in reality the activity only represents a small number of the hosts behind the gateway.

The output of this research included software tools, primarily a daemon for querying a running uncleanliness database. The daemon makes it easy to integrate the uncleanliness vector into an existing operational environment. A UV server need only be set up somewhere on the network, and the daemon can be queried by the analyst interface server to retrieve information about a particular IP address. Since the granularity of the data is at the /24 level, a query for any IP in that range of addresses will yield the same result.
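The effect of /24 granularity on lookups can be shown with a small example. The Python sketch below is an in-memory stand-in, not the daemon or its query protocol, which this report does not describe; the class, its methods, and the stored score are all hypothetical.

```python
import ipaddress


def slash24_key(ip):
    """Return the /24 network that contains the given IPv4 address."""
    return ipaddress.ip_network(f"{ip}/24", strict=False)


class UncleanlinessStore:
    """Hypothetical in-memory table of scores keyed by /24 network."""

    def __init__(self):
        self._scores = {}

    def put(self, network, score):
        net = ipaddress.ip_network(network)
        if net.prefixlen != 24:
            raise ValueError("scores are stored per /24 network")
        self._scores[net] = score

    def query(self, ip):
        """Look up the score for the /24 containing ip, or None if absent."""
        return self._scores.get(slash24_key(ip))


store = UncleanlinessStore()
store.put("192.0.2.0/24", 42.5)       # made-up score for illustration
first = store.query("192.0.2.1")
last = store.query("192.0.2.254")     # same /24, so the same result
missing = store.query("198.51.100.1")
```

Keying on the network rather than the address means the table needs at most one entry per /24, and every address in the block resolves to that single entry.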

Additional tools make it simple for an operational environment to supplement UV with additional context data, including, for example, internal incident data. UV is a framework for simplifying large amounts of diverse data into a straightforward measure for analysts, and therefore we endeavored to make it as easy as possible to extend and include other data sources.

2007 Plans

We are investigating methods for better measuring the corruption of networks behind Network Address Translation (NAT) gateways and proxies. These will likely be based on traffic models that will estimate the number of hosts (and the number of infected hosts) based on traffic volumes of various types. For a sufficiently large number of NATed hosts, it should be possible to apply models of the average user based on observed web browsing and email activity, for example. Since the essential measure is actually the percentage of infected hosts, it may be possible to extract that information without actually estimating the number of hosts.

While examining various independent phishing data sources, we observed small overlaps in the data. We are in communication with the various data providers to determine the degree of independence of these data sources, and are attempting to construct statistical models (based on capture-recapture models used for measuring wild animal populations) to estimate the total population of phishing sites, both detected and undetected. Complications in constructing the model include that reports of phishing sites tend to lead to the sites being shut down (referred to as the “trap death” problem in capture-recapture modeling), that some phishing data is not truly independent, and that some reporting biases exist in the data under study. Nevertheless, we are confident that such an estimate is possible and desirable.
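To make the capture-recapture idea concrete, the Python sketch below applies the simplest two-source estimators, Lincoln-Petersen and its Chapman correction, to two lists of reported sites. The text does not say which capture-recapture model is planned, so this choice is an assumption made for illustration, and the site names are invented. Both estimators assume a closed population and independent sources, which are exactly the conditions the paragraph above identifies as problematic.

```python
def lincoln_petersen(source_a, source_b):
    """Estimate total population size from two overlapping samples."""
    a, b = set(source_a), set(source_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("estimator is undefined without any overlap")
    return len(a) * len(b) / overlap


def chapman(source_a, source_b):
    """Chapman's variant, which is less biased when the overlap is small."""
    a, b = set(source_a), set(source_b)
    overlap = len(a & b)
    return (len(a) + 1) * (len(b) + 1) / (overlap + 1) - 1


# Invented data: 40 sites in one feed, 30 in another, 6 in both.
feed_a = {f"site{i}" for i in range(40)}
feed_b = {f"site{i}" for i in range(34, 64)}
estimate = lincoln_petersen(feed_a, feed_b)
corrected = chapman(feed_a, feed_b)
```

With these invented feeds, 64 distinct sites are observed but the Lincoln-Petersen estimate is 200, which shows how a small overlap between sources suggests a large undetected population.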

References

[1] Gates, C.; McNutt, J.; Kadane, J.; & Kellner, M. Detecting Scans at the ISP Level (CMU/SEI-2006-TR-005). Pittsburgh, PA: Software Engineering Institute, Carnegie Mellon University, 2006. http://www.sei.cmu.edu/publications/documents/06.reports/06tr005.html.

[2] Gates, C.; Collins, M.; Duggan, M.; Kompanek, A.; & Thomas, M. “More Netflow Tools: For Performance and Security.” Proceedings of LISA ’04: Eighteenth Systems Administration Conference. Atlanta, GA, Nov. 2004. Berkeley, CA: USENIX Association, 2004. http://www.cert.org/netsa/publications/Lisa2004-gates-netflow.pdf.



Last updated May 10, 2007