Towards a Dependable Self Healing Internet

Raj Reddy

Input to March 8, 2000 congressional testimony

rr@cmu.edu www.rr.cs.cmu.edu

3/6/00 revised 3/9/00

Abstract

By now we understand the sources of highly publicized Internet crashes such as malicious hacker attacks and "legal" users overloading popular web sites. Many of the remedies require straightforward implementation of known solutions. However, there is a need to create a dependable Internet infrastructure in which at least 99% of the users can access at least 99% of the sites over 99% of the time. This appears to be a solvable problem, but one that will require both new research ideas and the uniform application of known and new ideas across the Internet infrastructure. It requires the creation of a self-healing network: a self-monitoring, self-diagnosing and self-repairing national terabit testbed that is demonstrably reliable, available, secure and scalable. The creation and demonstration of such a "reference network" will enable commercial operators to gracefully evolve their current networks rather than undertaking expensive patchwork upgrades to the current system.

Introduction

Recent hacker attacks on popular web sites illustrate the fragility of the current Internet sites and infrastructure. The President's Information Technology Advisory Committee (PITAC) in its report to the President last year observed that: "the Internet is growing well beyond the intent of its original designers and our ability to extend its use has created enormous challenges. As the size, capability, and complexity of the Internet grows, it is imperative that we do the necessary research to learn how to build and use large, complex, highly-reliable, and secure systems ... It is therefore important that the Federal government...undertake research on topics ranging from network reliability and bandwidth, to … robust, reliable, secure ways to deliver and to protect critical information."

An analysis of the various highly visible disruptions to Internet access reveals a wide range of causes: denial of service attacks from malicious hackers using insecure hosts infected with "zombie" programs (Yahoo), software bugs (Ameritrade), insecure configurations (Schwab), change management errors (Etrade), and security loopholes (Hotmail, Melissa).

Of all these attacks, the Melissa virus perhaps had the nastiest impact: the number of infected systems was staggering and the effect devastating, with many companies shutting down their email for a day or two. These security loopholes and denial of service attacks have also exposed the fact that webmasters are cutting corners. In general, web sites are not being designed to be secure.

Many possible administrative and legal remedies exist which need to be enforced firmly by businesses and society. However, history has shown us that compliance failures will occur, either unintentionally or maliciously. Rather than leaving the Internet vulnerable because a few persons or organizations are careless or reckless, we should develop an information infrastructure that is not dependent on voluntary compliance of security practices and policies.

Creating a new Internet infrastructure that is highly reliable, secure, and able to scale to billions of users and devices requires new research initiatives, including industry-wide benchmarking and sharing of best practices.

In this paper, we discuss the financial implications of Internet downtime, present some known remedies that would lead to a more secure and dependable Internet, and introduce the concept of a self-healing national terabit network testbed: a self-monitoring, self-diagnosing and self-repairing network that is demonstrably reliable, available, secure and scalable.

The Impact of Internet Downtime on Businesses and Society

The cost of denial of service and overloading can be substantial. The Yankee Group estimates that the online industry may have lost $1.2 billion in revenue from the Web site attacks earlier this month (WSJ, Feb 24, 2000). A Gartner Group study showed that the average cost of downtime in brokerage operations is about $6.5 million per hour! MCI paid out $29 million in refunds to customers affected by the 10-day outage of its frame relay network in August 1999, an outage that affected three thousand companies (Online News, 10/28/99). eBay paid $3.9 million in credits to its customers for the service outage that halted bidding completely at its popular service for an unprecedented 22 hours in June 1999. Distributed network sites can lose $20,000 to $80,000 per hour as a result of network downtime (Computer Reseller News, 1998). At a cost of $80,000 per hour, the average company will lose $7.1 million per year in centralized network downtime.
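As a quick back-of-the-envelope check (the calculation below is an illustration added here, not part of any cited report), those last two figures together imply roughly 90 hours, or nearly four days, of downtime per year for the average company:

    # Back-of-the-envelope check on the downtime figures quoted above.
    # The per-hour cost and the annual loss are the cited numbers; the implied
    # hours of downtime per year are derived here for illustration only.

    cost_per_hour = 80_000          # dollars lost per hour of downtime (cited)
    annual_loss = 7_100_000         # dollars lost per year (cited average)

    implied_hours = annual_loss / cost_per_hour
    print(f"Implied downtime: about {implied_hours:.0f} hours per year "
          f"({implied_hours / 24:.1f} days)")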

These costs are expected to increase as companies incur indirect costs in the form of lawsuits, regulatory scrutiny, impact on brand name and public image, loss of customer base, lower employee morale and productivity, and higher employee stress.

The impact on businesses of system outage can be even more devastating. In an April’99 survey of consumers, research firm Jupiter Communications found that 46 percent leave a preferred site if they experience technical or performance problems. Statistics from McGladrey and Pullen show that for every five organizations affected by a disaster, two will be unable to maintain their critical business functions and make a recovery. Of the remaining three, one will not survive the next two years. In fact, a company that experiences a computer outage lasting more than 10 days will never fully recover financially ("Disaster Recovery Planning: Managing Risk and Catastrophe in Information Systems" by Jon Toigo).

According to the Cahners In-Stat Group (http://www.instat.com/abstracts/ia/1999/is9906sp_abs.htm), Internet downtime hits businesses financially, affecting direct revenue and customer base, compensatory payments, inventory cost, and depreciation of capital. It also affects business in ways not seen on the balance sheet, such as lost market capitalization, employee downtime, and delays to market, items that may prove more financially damaging than the explicit losses associated with an outage. The report "Data Failure: the financial impact on Internet business" quantifies the real-cost damages for site outages based on SEC filings and publicly released information. The report compares two e-commerce business models and illustrates how much is at stake in the event of data failure.

Steps towards secure and dependable Websites

Many of the problems of Internet access can be avoided by taking some simple common sense precautions. For example:

Online Businesses can:

Government can:

Industry can:

Many of the common sense measures listed above depend on the voluntary compliance of over a hundred million Internet users and the organizations that provide Internet service. As noted earlier, however, compliance failures will occur, either unintentionally or maliciously, and the Internet should not be left vulnerable because a few persons or organizations are careless or reckless.

Towards a dependable Internet

The phrase "Internet Security" is often used in a broader sense than computer security within an organization. Most organizations use passwords, firewalls, antivirus software, security audits, cyber hygiene, etc. to provide security and protection within an organization. However denial-of-service attacks occur outside the effective control of the impacted organizations and do so by overloading Internet fabric. The operative question is not "security" as interpreted narrowly in the research circles but rather "how to create a dependable Internet Infrastructure?" in which at least 99% of the users can access at least 99% of the sites well in excess of 99% of the time. We will use the phrase "dependable Internet" to specifically include attributes such as reliability, availability and scalability in addition to security.

By dependable, we mean a system ("as if my life depended on it") that is:

- Reliable: it continues to function correctly in the presence of failures and attacks;
- Available: most users can reach most sites most of the time;
- Secure: it resists malicious misuse without compromising personal privacy; and
- Scalable: it keeps working as the Internet grows to billions of users and devices.

Denial of service occurs when the network fabric is overloaded with too many requests, whether intentionally or unintentionally (by "legal" users). This is analogous to a large number of people calling California after a report of an earthquake, or to a computer dialing a phone number continuously, thereby blocking anyone else from getting through in an emergency. The research challenge is how to create a dependable Internet infrastructure in which at least 99% of the users can access at least 99% of the sites in excess of 99% of the time.
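To make the "99% of the time" target concrete, the short sketch below (an illustration added here, not taken from any cited source) converts an availability percentage into the downtime it still permits each year; even 99% availability leaves room for more than three and a half days of outage annually, which is why access well in excess of 99% is the real goal:

    # Convert an availability target into the annual downtime it still permits.
    # 99% availability allows roughly 87.6 hours (about 3.65 days) of downtime
    # per year.

    HOURS_PER_YEAR = 365 * 24

    def downtime_hours_per_year(availability):
        """Hours per year a site may be unreachable at a given availability."""
        return (1.0 - availability) * HOURS_PER_YEAR

    for availability in (0.99, 0.999, 0.9999):
        print(f"{availability:.2%} availability -> "
              f"{downtime_hours_per_year(availability):6.1f} hours down per year")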

Creating a dependable end-to-end Internet infrastructure requires coordination among user premises equipment, web servers, and the core routers in the Internet fabric. Even if the users and web servers are not cooperating, an Internet infrastructure capable of detecting and preventing abnormal traffic patterns can ensure that most users are able to access information over the Internet most of the time.

I submit that it is possible to create a system that is demonstrably reliable, available, secure and scalable while ensuring absolute protection of personal privacy and without major reductions in networking speed. Indeed, rapid advances in computing power and networking speed should make the new security systems all but invisible to users. However, we will not see the development or universal use of networks capable of meeting our goals for dependability and privacy without a concerted research investment supported by both government and industry.

One strategy would be to support a "reference network" testbed designed with the specific goal of evaluating innovative strategies for network protection. The purpose of the testbed is to try out new approaches without disrupting the crucial production infrastructure; it is an R&D vehicle. As was the case with the ARPAnet, such a testbed would be able to provide useful networking services and at the same time let research organizations evaluate advanced dependable networking concepts. In the following section, we put forward a proposal for creating a highly dependable Internet.

Towards a Self Healing Network

A self healing network is one which continuously monitors all the traffic within the system (every packet entering the system is validated before it can proceed) with a view to detecting and disabling abnormal traffic patterns. The goal of a self healing network is to detect unauthorized use of networking equipment, to track inappropriate uses, and to identify individuals using networks with malicious intent, without compromising individual rights to privacy and security on the network. At first blush, this requirement appears to be impractical, as the Internet is expected to handle trillions of packets every day and would require expensive retrofitting of the existing commercial ISPs. I submit that such a transition is not only essential to the future economic growth and security of the nation, but also practical given the expected exponential advances in technologies.

A variety of technical solutions are available for creating a self healing network. It is possible, for example, to develop "software agents" capable of self-monitoring, self-diagnosis and self-repair, much as the human immune system uses (distributed) antibodies to disable antigens and restore balance in the human body. Just as in human society, where a few people may get sick some of the time but society as a whole continues to function, we may accept an occasional localized denial of service as long as most users are able to access most of the web sites without any degradation of service.

Self monitoring within the Internet core fabric requires agents capable of continuous and autonomous monitoring of Internet traffic by each of the computers (or routers) which constitute the core fabric of the Internet. Each of these routers could have a specialized coprocessor responsible for examining the address label (or "packet header") which helps the routers guide messages through the Internet. The coprocessor is also responsible for autonomous data mining of the packet header information and for creating statistical signatures that enable the "self diagnosis agents" to analyze traffic and alert systems and system operators to abnormal and potentially dangerous traffic patterns. The "self repair agents" undertake a set of autonomous corrective actions against the offending source generating the unusual traffic. These actions could include dropping some of the packets or limiting the traffic from the source to a "fair share" of the total number of packets entering the fabric.
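A minimal sketch of this monitor/diagnose/repair loop is given below. The window size, the "many times the average" test, and the fair-share cap are illustrative assumptions standing in for the statistical signatures a real router coprocessor would compute; nothing here is meant as the actual design.

    # Illustrative sketch of the monitor / diagnose / repair loop described above.
    # It counts packets per source over a window (self-monitoring), flags sources
    # whose traffic is far above the average (self-diagnosis), and caps flagged
    # sources to a "fair share" of the fabric (self-repair). The window length,
    # the abnormality factor, and the fair-share rule are assumptions made for
    # this illustration, not parameters taken from the paper.

    from collections import Counter
    from statistics import mean

    class SelfHealingSketch:
        def __init__(self, window_packets=10_000, abnormal_factor=10.0):
            self.window_packets = window_packets    # packets per monitoring window
            self.abnormal_factor = abnormal_factor  # multiple of the average counted as abnormal
            self.counts = Counter()                 # packets seen per source address
            self.seen = 0
            self.throttled = set()                  # sources currently capped

        def admit(self, src_addr):
            """Return False if this packet should be dropped (source over its cap)."""
            self.counts[src_addr] += 1              # self-monitoring
            self.seen += 1
            if self.seen >= self.window_packets:
                self._diagnose_and_repair()
            return src_addr not in self.throttled

        def _diagnose_and_repair(self):
            """At the end of each window, flag and throttle abnormal sources."""
            average = mean(self.counts.values())
            fair_share = self.window_packets / len(self.counts)
            for src, n in self.counts.items():
                if n > self.abnormal_factor * average:          # self-diagnosis
                    self.throttled.add(src)                     # self-repair
                    print(f"throttling {src}: {n} packets vs. fair share {fair_share:.0f}")
            self.counts.clear()
            self.seen = 0

A router's coprocessor would call admit() on every packet header; the crude multiple-of-the-average test here merely stands in for the statistical signatures and data mining the text envisions.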

In the near term, while we are waiting for the research in autonomous monitoring, diagnosis, and repair agents to be completed, we can use human experts such as those in CERT (the Computer Emergency Response Team) at Carnegie Mellon University to perform these tasks using a Network Monitoring and Management Center analogous to the ones used by telecom companies to monitor and manually reroute telephony traffic. The main difficulty here is that, with more than seven national Internet service providers and many more internationally, getting a unified view of the entire Internet and sharing information across all the ISPs requires an architecture that is currently missing.

The work of the autonomous agents, and of the humans tracking network security, could be helped if the new generation of routers added information to packets that makes it easier to detect malicious patterns of use and trace attacks to their source. If each organization can ensure that every packet exiting its local area network carries an address consistent with the valid set of addresses for that organization, such a measure would disable masquerading and spoofing attacks by hackers using other people's resources and would facilitate tracing attacks to their original source.
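The per-organization check described here is essentially what is now called egress filtering. The sketch below illustrates the core test under made-up assumptions: the 192.0.2.0/24 prefix and the sample addresses are documentation examples, not any organization's real assignment.

    # Illustrative egress-filtering check: a packet is allowed to leave the
    # organization's network only if its source address lies within the
    # organization's own address block, which defeats simple source-address
    # spoofing and makes attacks easier to trace. The prefix and addresses
    # below are documentation examples, not real assignments.

    from ipaddress import ip_address, ip_network

    ORG_PREFIX = ip_network("192.0.2.0/24")   # the organization's valid source addresses

    def may_exit(source_addr):
        """Allow the packet out only if its source address is one of ours."""
        return ip_address(source_addr) in ORG_PREFIX

    print(may_exit("192.0.2.17"))    # True  - a legitimate local source
    print(may_exit("203.0.113.5"))   # False - a spoofed or misconfigured source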

We need to do all this without paying an undue penalty in networking speed or costs. The situation will get worse as the number of users on the Internet grows from a hundred million to a billion, with tens of billions of Internet-attached devices. Furthermore, the proposed self healing network will add to the packet handling overhead at each router in the fabric and has the potential to make the system slower and to waste bandwidth. The resulting increase in the number of packets and in computational overhead can be expected to grow a million fold over the next 10 to 15 years.

This million fold increase in packet handling overhead will be ameliorated by the exponential improvements in processor speed (predicted by Moore's law), memory, and bandwidth technologies. These advances will provide a thousand fold improvement in performance over the same 10 to 15 year period. The additional thousand fold improvement needed will have to be harnessed through the use of efficient algorithms, distributed computation, and the increasing locality of Internet traffic patterns ("the Internet is global - the traffic is local").
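As a rough illustration of this arithmetic (the 18-month doubling period is the conventional Moore's-law assumption and is not a figure from this paper), hardware improvement alone over 15 years yields roughly a thousand-fold gain, leaving the remaining thousand-fold to come from algorithms, distribution, and locality:

    # Rough arithmetic behind the million-fold argument above. The 18-month
    # doubling period is the conventional Moore's-law assumption; neither it
    # nor the exact split of gains comes from the paper.

    years = 15
    doubling_period = 1.5                                 # years per doubling
    hardware_gain = 2 ** (years / doubling_period)        # about 2**10, roughly 1,000x

    required_gain = 1_000_000                             # projected growth in overhead
    remaining_gain = required_gain / hardware_gain        # must come from algorithms, etc.

    print(f"Hardware alone: about {hardware_gain:,.0f}x")
    print(f"Still needed from algorithms, distribution, locality: about {remaining_gain:,.0f}x")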

In addition to the research needed to develop terabit networks, faster routers, efficient algorithms, and distributed computation techniques, research will also be needed on data warehousing of the meta-data contained in packet headers (a trillion packets can generate as much as a petabyte of data every day, which will have to be purged on a regular basis!), on data mining of this data to establish statistical parameters that can be used to classify normal and abnormal traffic requests, and on repair strategies, including the generation of a signal to sites making abnormal requests without prior arrangement for surge capacity (analogous to the busy signal used in voice telephony).
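To see where a petabyte per day can come from, the sketch below works the storage arithmetic under an assumed record size; the roughly one kilobyte of warehoused metadata per packet is an assumption made for this illustration, not a figure from the paper:

    # Working the storage arithmetic for the figure quoted above. The assumed
    # one kilobyte of warehoused metadata per packet (header fields plus
    # timestamps, routes, and derived statistics) is illustrative; leaner
    # records shrink the daily volume proportionally.

    packets_per_day = 1_000_000_000_000      # a trillion packets per day
    bytes_per_record = 1_000                 # assumed metadata stored per packet

    petabytes_per_day = packets_per_day * bytes_per_record / 1e15
    print(f"About {petabytes_per_day:.1f} PB of packet metadata per day")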

In the past, we have found ways to balance privacy and security in traditional commerce. Applying these precedents to the new networked world will require combining the skills of technologists with those of people familiar with the law and regulations, and of people who will take care to ensure that individual rights are protected. Clearly this is an enormous challenge, but it is one of the most critical research challenges facing the nation today and deserves an appropriate response.

Conclusion

In conclusion, creating a dependable Internet infrastructure, one that is at least as dependable as telephone service, is essential to the future economic growth and security of the nation. It appears possible to create a system that achieves dependability and security while ensuring protection of personal privacy and without degrading network performance. Indeed, rapid advances in computing power and optical networks should make the new security systems nearly invisible to users.

The main challenge is to build a dependable Internet infrastructure without compromising either the ease of use by non-experts or the privacy of the individuals connected to it. To accomplish this will require both new research ideas and the uniform application of known and new ideas across the Internet infrastructure.

One strategy would be to support a self-healing national terabit network testbed designed with the specific goal of evaluating innovative strategies for network protection, including commercial concepts. Such a testbed would provide useful networking services and at the same time let research organizations evaluate advanced networking concepts around reliability, availability, security, and scalability. The testbed by itself will not enable us to make progress unless it is accompanied by funding for the research that creates the new approaches to be tried on it; the testbed and the research results are synergistic, and both will only happen with a concerted research investment supported by government and industry.

It is estimated that the market capitalization of Internet-based industries created since 1990 is over a trillion dollars, resulting in capital gains taxes of over $200 billion to the nation. Investing a small fraction of this national income in research towards creating a self healing Internet will ensure the continuation of this engine of growth!

 

Acknowledgements

This paper has benefited from the comments and suggestions of several PITAC members: Jim Gray, Irving Wladawsky-Berger, Vint Cerf, Bob Kahn, Les Vadasz, Susan Graham and Joe Thompson, and from other colleagues: Anish Arora, V.S. Arunachalam, Kay Howell, Henry Kelly, Ed Lazowska, and Rich Pethia. Please send comments to rr@cmu.edu.