Web Site Availability

Best practice (ISO 17799) advocates the development of information security policies to ensure the confidentiality, integrity and availability of information.

It is becoming increasingly clear, however, that most organizations are concentrating on the first two with little regard for the third.

In this context availability is defined as "ensuring that information and vital services are available to users when required."

Threats to availability can be due to a number of factors:

  • The failure of hardware components within critical servers or devices such as routers and firewalls that allow users to communicate with the site.
  • The inability of the web server application (or components of it) to process all requests during 'flash' periods - times when an unexpectedly high number of users are accessing the site. This manifests itself as pages being served slowly or not at all, causing the user to see "Page cannot be displayed" or "Connection timeout" errors.
  • Devices such as firewalls not being able to forward traffic to and from the site quickly enough to provide a suitable response time to the user. The exact number of connections that a single firewall can support varies significantly from product to product and is influenced by the hardware that it is running on. However, in my experience it is not possible for a single firewall to pass more than 75,000 simultaneous connections to the web site - a value much lower than the total number of connections capable of being supported by a series of web servers in a cluster.
  • Security is a key concern when the site has a high profile. Some software vulnerabilities, if exploited, cause the server to stop processing requests or to respond slowly. Examples include buffer overflows and other denial-of-service (DoS) attacks. These vulnerabilities can often be exploited remotely without being detected by the firewall, as security administrators are often unable to configure their software to inspect the true intent of each connection without compromising performance.
  • In certain cases a site with a high profile can become a target before it even goes online, as remote attackers compete for the kudos of compromising the site or causing it to crash. These attacks can be co-ordinated to involve simultaneous floods of activity in the hope that the site, or any of its components, is susceptible to a distributed denial-of-service (DDoS) attack.

Other serious security threats include the defacement of the home page (a popular achievement amongst hacking groups - see https://defaced.alldas.de for examples) or the exploitation of vulnerabilities that permit access to data files on the servers. The latter can cause inappropriate disclosure of personal data (in contravention of the U.K. Data Protection Act and similar legislation elsewhere) such as financial records, medical records or other personal information.

Individual components making up the site can affect availability, and the following need to be considered:

  • The Internet route taken by users to reach the web servers. This can often be shared with other, larger organizations as the connections converge onto the infrastructure provided by an Internet service provider (ISP). Without controlling the bandwidth at the point that connections arrive at the server, performance will undoubtedly fluctuate, reducing the quality of service that a user expects.
  • The Internet connection route that services the web site may fail, causing the entire site to become unavailable.
  • The technical infrastructure of a web site includes many components that simply route traffic to its final destination. Routers and switches need to be duplicated so that the site does not become unavailable should one of them fail.
  • Security devices such as firewalls and intrusion detection systems should not adversely affect the performance of the site by inefficiently examining traffic for suspicious or malicious activity. Nor should any such device, a firewall for instance, present a single point of failure within the design.
  • When multiple servers are required to host a site it is necessary to control the traffic to each using a load-balancing device that can provide every member of the cluster with an appropriate share of connections. These devices should also be duplicated, as every connection passes through them and a failure would prevent connections from reaching the servers.
  • An appropriate number of servers should be used to support the content of the site. It is important to ensure that sufficient capacity is available in both the hardware and software that make up each member of the cluster.

In addition, companies need to be aware that when designing a solution for a web site it is important to undertake significant testing before users are permitted to access it. The following tests are important:

  • Application code. This ensures that the user does not unexpectedly receive errors or cause the server to respond with improper information. Ineffective testing of application code is the root of many exploits. Tests can be automated, although the developers of the code will have the greatest insight into potential areas of concern.
  • Web server software. To ensure that all known and published vulnerabilities have been controlled using appropriate updates and modifications. Again, this is a popular route taken by malicious users to compromise a site. Using commercially available tools, and those published by hacking communities on the Internet, it is possible to quickly identify recognized faults and take the remedial action.
  • Firewall security policies. To examine the authorized connections that are permitted to reach the site from the Internet. In certain conditions it may be possible to communicate with a web site or components such as database and transaction servers via the firewall following the compromise of an integrated device. The firewall policy should explicitly deny all improper connections between devices.
  • Load testing. Using specialist tools it is possible to simulate significant periods of demand by flooding the site and its components with many thousands of simultaneous connections and measuring the response times presented to each virtual user (a minimal sketch of the idea appears after this list). As most users admit that they avoid slow sites or those that present errors, it is critical to ensure that simulated activity reveals problems before real customers or users of the site encounter them.
  • Security tests. Finding vulnerabilities on the servers within the site must be complemented by ensuring that they cannot be exploited in a manner that makes the system susceptible to a DoS attack. Simulated attacks can cause pre-production servers to fail, allowing fixes to be applied before going live to a community in which a proportion will be intent on attempting the very same attacks maliciously.
  • Disaster recovery. Depending upon the nature of the site it may be necessary to ensure that a simulated failure of any active component causes an alternative route to be established. This could be within the infrastructure of the same site, or via an alternative site containing replica systems that take over when critical components stop servicing requests at the primary location.
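
The sketch below illustrates the kind of load test referred to above: a number of simulated users issue simultaneous requests against a pre-production site while response times and error counts are recorded. The staging URL, virtual-user count and request counts are illustrative assumptions rather than figures from this article, and commercial load-testing tools provide far richer reporting.

```python
# Minimal load-testing sketch. The staging URL, virtual-user count and
# request count below are illustrative assumptions, not values from the text.
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

TARGET_URL = "https://staging.example.com/"   # hypothetical pre-production site
VIRTUAL_USERS = 200                           # simultaneous simulated users
REQUESTS_PER_USER = 10

def fetch_once(url: str) -> tuple[bool, float]:
    """Issue one request and return (success, elapsed seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            resp.read()
            ok = resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        ok = False
    return ok, time.monotonic() - start

def virtual_user(url: str, count: int) -> list[tuple[bool, float]]:
    """One simulated user issuing a series of requests."""
    return [fetch_once(url) for _ in range(count)]

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=VIRTUAL_USERS) as pool:
        futures = [pool.submit(virtual_user, TARGET_URL, REQUESTS_PER_USER)
                   for _ in range(VIRTUAL_USERS)]
        results = [r for f in futures for r in f.result()]

    timings = sorted(t for _, t in results)
    failures = sum(1 for ok, _ in results if not ok)
    print(f"requests: {len(results)}, failures: {failures}")
    print(f"mean response: {mean(timings):.3f}s, "
          f"95th percentile: {timings[int(len(timings) * 0.95)]:.3f}s")
```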

Controls that can be put into place to reduce the risk to a site's availability include:

  • Dual Internet connections with the ability to automatically use an alternative route (sometimes via a completely different ISP) when a failure occurs.
  • Clustered firewalls to ensure that the necessary checks of incoming traffic do not adversely affect performance.
  • Dual load-balancing switches that can automatically forward traffic to appropriate servers based on their ability to respond to requests correctly. Load balancers can be configured to look beyond the number of connections that a server is processing, by checking that it is always possible to serve particular documents or process transactions, and routing traffic away from a server that fails these tests (a sketch of such a health check appears after this list).
  • Multiple web content and supporting database servers to allow the load-balancing device to attempt an even distribution of traffic across the cluster.
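
The following is a minimal sketch of the health check a load balancer performs, confirming that each cluster member can still serve a particular document before traffic is forwarded to it. The server addresses, test document and expected content are hypothetical, and in practice this logic runs inside the load-balancing device itself.

```python
# Minimal health-check sketch of the kind a load balancer performs.
# The server addresses, test document and expected content are hypothetical.
import urllib.error
import urllib.request

# Cluster members and the document each must be able to serve.
SERVERS = ["http://10.0.0.11", "http://10.0.0.12", "http://10.0.0.13"]
HEALTH_CHECK_PATH = "/healthcheck.html"
EXPECTED_TOKEN = b"OK"        # content the page must contain to pass

def server_is_healthy(base_url: str) -> bool:
    """Pass only if the server returns the test document with the expected content."""
    try:
        with urllib.request.urlopen(base_url + HEALTH_CHECK_PATH, timeout=3) as resp:
            return resp.status == 200 and EXPECTED_TOKEN in resp.read()
    except (urllib.error.URLError, TimeoutError):
        return False

def servers_in_rotation() -> list[str]:
    """Return only the members that traffic should continue to be forwarded to."""
    return [s for s in SERVERS if server_is_healthy(s)]

if __name__ == "__main__":
    healthy = servers_in_rotation()
    print(f"{len(healthy)}/{len(SERVERS)} servers in rotation: {healthy}")
```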

If the site contains secure content using the popular https protocol it is advisable to offload the overhead of encrypting and decrypting the information from the servers to a specialist device, increasing performance significantly. In some cases a performance improvement of 5:1 can allow more users to process transactions online simultaneously.

Bandwidth management and caching solutions are available to ensure that the bandwidth of any Internet connection is used effectively, particularly when it is shared with other organizations or applications. By allocating minimum and maximum performance controls and dynamically adjusting them to suit demand, it is possible to ensure a consistent quality of service and avoid the risk that users faced with poor performance do not return.
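
As a rough illustration of the maximum-rate side of such bandwidth management, the sketch below implements a simple token bucket: traffic is forwarded only while tokens are available, capping the sustained rate while permitting short bursts. The rates and packet sizes are illustrative assumptions; real solutions enforce these limits per application or per customer within the network itself.

```python
# Minimal token-bucket sketch of a maximum-rate control.
# The rates and packet sizes below are illustrative assumptions.
import time

class TokenBucket:
    """Allow at most `rate` bytes per second, with bursts up to `capacity` bytes."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self, nbytes: int) -> bool:
        """Refill the bucket for the elapsed time, then spend tokens if available."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False    # caller should queue or drop the traffic

if __name__ == "__main__":
    # Cap one application at roughly 128 kB/s with a 32 kB burst allowance.
    bucket = TokenBucket(rate=128 * 1024, capacity=32 * 1024)
    for packet in range(5):
        print(f"packet {packet}: {'forward' if bucket.allow(8 * 1024) else 'queue'}")
```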

Ian Emery works for Your Communications (www.yourcommunications.co.uk) and is responsible for developing e-security products and services.
