The current perception of IT management and users alike is that downtime of critical servers and their associated applications is just an unfortunate fact of life.
This perception simply reflects the way things have always been. A diligent review of software-based solutions for eliminating downtime would have found only two real choices: clustering and data replication. Clustering technology is expensive, complex to implement and complex to manage. Data replication is cheaper where the objective is access to a real-time copy of the data, but it does little to keep the user connected to a working application, irrespective of the cause of failure and without the need for human intervention.
It was therefore clear that there was a significant opportunity to deliver a "cluster class" level of application availability at the cost associated with data replication solutions. Easy to understand, but challenging to deliver.
Make it simple
The first issue is to remove the layers of complexity from the planning, implementation and ongoing management processes. For complexity, read consulting costs, extensive hardware costs and dedicated, highly trained internal staff to carry out the ongoing management.
Keeping the solution simple makes it possible to remove the pre-sales and planning costs whilst delivering a "shared nothing" platform. Rather than invest in very expensive fault-tolerant hardware, SANs and all the associated networking infrastructure, why not simply add a second, dissimilar server and connect the two together, preferably by a direct link that avoids the network and so reduces data traffic? This delivers a low-cost "shared nothing" platform that is extremely easy to manage.
Cloning
Through innovation, the next step is to clone the primary server to create a true pair of servers, one that can be seen by the network (active server) and one that is hidden (passive server). The most important aspect of cloning is that it enables the automated or manual switching between the pair with "cluster class" performance.
However, the downside of cloning is that the reliability of the original primary server becomes of paramount importance. All too often, high availability solutions are purchased to compensate for an unstable primary server and simply act as a safety net. Overcoming this challenge requires a tool that can carry out a detailed and comprehensive investigation of the hardware, drivers, software, performance monitoring and network (whether LAN or WAN), and then deliver a comprehensive guide to resolving the reliability issues. (It makes sense to do this anyway, but it is particularly important in a cloned environment.)
Application protection
Having addressed the shared nothing and reliability issues, the next step is to address the challenge of protecting and restarting the software applications. There is more to this than you might initially think. For example, to deliver a fully operational Microsoft Exchange Server it is essential that auxiliary applications such as anti-virus software, backup and anti-spam are also protected. Every critical aspect of the software environment required to deliver the fully operational application has to be understood and protected.
The common approach to this challenge has been bespoke scripts, written on site by a consultant for the specific server. More often than not there is no documentation, nor any means of understanding the implications of future upgrades to the operating system, the software applications or even the high availability application itself.
The solution is to enable the very rapid development of protected application modules, packaged as products, that can be deployed over and over again at multiple sites and yet work effectively with each other in any permutation. Just as important, these application modules will very rarely be affected by upgrades. This approach enables the development of a library of protected applications that are fully documented, fully proven and can be delivered at low cost, reflecting the "write once, sell many times" approach.
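To make the idea of reusable, combinable application modules more concrete, here is a minimal sketch in Python. It is an illustration only and not the vendor's implementation: the class ProtectedModule, the function services_to_protect and all of the service names are hypothetical. It assumes a module can be described as plain data (the services an application needs), so that any permutation of modules can be merged into a single list of services to protect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtectedModule:
    """Hypothetical description of one protected application."""
    name: str
    services: tuple  # names of the services the application depends on


def services_to_protect(modules):
    """Merge modules into one ordered list of services, dropping duplicates."""
    merged = []
    for module in modules:
        for service in module.services:
            if service not in merged:
                merged.append(service)
    return merged


# Illustrative names only; these are not real service identifiers.
exchange = ProtectedModule("exchange", ("event-log", "mail-store", "mail-transport"))
antivirus = ProtectedModule("anti-virus", ("event-log", "av-scanner"))
backup = ProtectedModule("backup", ("backup-agent",))

print(services_to_protect([exchange, antivirus, backup]))
```

Because each module is only data, the same definitions can be reused from site to site, and a service shared by two modules is protected once rather than twice.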
This approach offers the ability to build in intelligence in the protection of each application / operating system. In an ideal world, the active system does not fail, so why not take remedial action as the first step in the process, rather than just failover / switch over to the passive server? As a proactive approach it enables processes to be automatically monitored and where appropriate stopped and / or restarted as the first step. Analysis demonstrates that this approach removes as much as 30% of critical server failures.
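The "remedial action first" idea can be sketched as a small decision routine. This is a simplified Python illustration under stated assumptions, not the product's actual logic: handle_fault, FakeService and the max_restarts parameter are hypothetical names, and the health check, restart and failover actions are passed in as plain callables so the example runs without any real servers.

```python
def handle_fault(is_healthy, restart, failover, max_restarts=1):
    """Try to restart the failed process on the active server first.

    Fail over to the passive server only if the process is still
    unhealthy after max_restarts attempts (an assumed policy).
    """
    for _ in range(max_restarts):
        restart()
        if is_healthy():
            return "restarted"
    failover()
    return "failed over"


class FakeService:
    """Stand-in for a monitored process, used only for this demonstration."""

    def __init__(self, recovers_on_restart):
        self.recovers_on_restart = recovers_on_restart
        self.healthy = False
        self.events = []

    def restart(self):
        self.events.append("restart")
        self.healthy = self.recovers_on_restart

    def is_healthy(self):
        return self.healthy

    def failover(self):
        self.events.append("failover")


transient = FakeService(recovers_on_restart=True)
print(handle_fault(transient.is_healthy, transient.restart, transient.failover))

persistent = FakeService(recovers_on_restart=False)
print(handle_fault(persistent.is_healthy, persistent.restart, persistent.failover))
```

In the first case the restart clears the fault and no switch-over takes place; in the second the restart does not help, so the routine hands over to the passive server.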
Automation
The final step is pulling all of this together.
Reliability has to be the first step of high availability, unless the only objective is to deliver a safety net for the data.
The cloning implementation process can be highly automated, partly thanks to the groundwork done by the reliability process, and can be carried out with minimal downtime.
Post-implementation management is very straightforward, as the entire solution is product based, running on standard hardware and operating systems within the standard network environment.
Cost
Irrespective of the nature of the failure, a "cluster class" high availability solution allows the user to remain connected to a working application automatically. What users need, therefore, is a solution that guarantees the uptime of critical systems at a reasonable price. This approach can be delivered at a price that is affordable to organisations of all sizes.
Choice?
The failure of critical application servers, such as Microsoft Exchange, file servers or SQL-based applications (for example sales automation or help desk), can run up considerable costs. These costs derive from lost staff productivity and from the diversion of IT resources to repairing and restoring the failed servers. They are difficult to quantify, but having these critical servers down for multiple hours would be very expensive.
Downtime is now optional for almost every organisation, however large or small.
Neil Robertson is CEO of the Neverfail Group