In Search of Continuous NT Computing
Implementing data mirroring with failover might be a better high-availability solution for your enterprise than clustering. Here's how to tell whether data mirroring with failover is right for you.
July 31, 1998
Clustering isn't the only way to increase NT's availability
Maybe it's because I live in Colorado, but when I hear vendors talk about their Windows NT clustering and high-availability products, I think of the medicine shows that once roamed the West. Remember the ones from the classic Westerns? The shows traveled from town to town, selling snake oil and promising it would cure everyone's ills. But what really happens when you sip the magic elixir--or attempt to implement a high-availability solution in your environment?
Clustering isn't a cure-all for your NT availability woes. No amount of faith or sweat will make an incorrect solution solve a problem it wasn't designed to solve. Don't misunderstand me--I've implemented both hardware and software solutions to increase NT's availability, and those solutions work. However, the key to succeeding in your quest for continuous NT computing lies in understanding your needs and each solution's capabilities, and in prototyping to make sure a solution will scale to solve your business problem.
Software and hardware solutions that increase NT's availability are part of, not replacements for, good systems-management practices. These solutions increase system availability and reduce your business' exposure to computer downtime by providing redundancy to your computing environment in the same way RAID or multiple power supplies in a system provide redundancy. NT clusters and related solutions only let you recover from or mask failures that cause system outages--you must continue to execute proper backup and disaster-recovery strategies.
In this article, I'll give you my refined definitions of clustering terminology and review availability classifications that help categorize vendor solutions. Then, I'll identify features that can help you select and implement a data mirroring and failover solution that meets your business requirements. Along the way, I'll point out some differences between data mirroring and failover solutions and Microsoft Cluster Server (MSCS).
Defining Clustering
Although many vendors offer NT clustering products, trade publications and product data sheets sometimes apply the word cluster or clustering like a branding iron to products with varying feature sets. These products actually span a continuum, from high-availability solutions to fault-tolerance products. To categorize these offerings, I propose this definition of clustering: A cluster is a group of servers that independently execute their operating system (OS) and applications to let clients access resources that are available to all the servers in the group. If a cluster member experiences system failure, resource access is available through another cluster member and does not require operator intervention or system restart. Collectively, clustered systems provide higher availability, increased manageability, and greater scalability than each system can provide independently.
My definition isn't much different from the definitions I've heard in my discussions with Microsoft staff. Mark Wood, Windows NT Server, Enterprise Edition's first product manager, defines a cluster as a group of independent systems linked together for higher availability, easier manageability, and greater scalability. Jim Gray, senior researcher in Microsoft's Scaleable Servers Research Group and manager of Microsoft's Bay Area Research Center, describes a cluster as "a collection of independent computers that is as easy to use as a single computer." He further describes clusters as solutions that not only provide failover capabilities but also disperse data and computation among a cluster's members. (To learn more about Jim Gray's clustering vision, see his sidebar "Commodity Cluster Requirements," June 1998.)
My definition of clustering narrows the focus to exclude from the clustering category data-mirroring-with-failover solutions, which do not provide access to common storage resources or support the automatic reentry of a recovered system into a cluster. Access to common storage and the seamless addition and removal of systems in a cluster are key to the distinction I make between NT clustering solutions such as MSCS--which provide seamless interaction between systems--and data mirroring with failover products--which do not. This distinction might appear minor today, but it will become increasingly important as Microsoft expands MSCS beyond its current two-node support.
While I'm defining terms, let's look at the terms active/standby and active/active. These terms apply at the system level to refer to the nodes in a cluster that actively perform work or wait in standby mode to assume the load of a failed cluster node. Active/standby means one node is working and the other is waiting. In active/active implementations, both nodes actively perform independent work.
Active/standby and active/active apply to the system level, but they can also apply to applications. For example, both nodes in an MSCS solution can actively run and offer services; this capability makes MSCS an active/active system-level implementation. At the application level, MSCS supports both active/active and active/standby solutions. For example, SQL Server 6.5 Enterprise Edition and Internet Information Server (IIS) support active/active configurations on MSCS, whereas Exchange 5.5, Enterprise Edition runs only in an active/standby configuration.
A problem arises when you apply any definition of clustering to current NT availability solutions: Most of these solutions address only the increased-availability portion of the clustering triad, the other two elements of which are manageability and scalability. Mark Smith pointed out this shortcoming in "Clusters for Everyone," June 1997, and it's still true a year later. Few increased-availability products for NT offer continued (automated failover and back) access to resources without operator intervention or, worse, system restart. Even fewer products have addressed the manageability and scalability legs of the triad that mini and mainframe clusters have targeted for over a decade. In fact, the real advances in multinode scalability have been limited to database and Internet-related solutions. Does that shortcoming mean increased-availability solutions are bad? Certainly not. However, this situation means you'll probably have to take advantage of each product's strength, work around its limitations, and use a combination of products to meet your NT availability needs.
Availability Classifications
The keys to selecting and implementing the right high-availability solution are identifying applications that need increased availability, defining the outage duration and type your business can tolerate, and determining how much your business is willing to spend for the redundancy necessary to meet your expectations. Vendors such as Digital Equipment, HP, and NCR place the single-host availability of NT Server running on Pentium Pro systems at the 99 percent uptime level. For systems that must operate 24 hours a day, 7 days a week year-round, 99 percent availability translates to about 87 hours of planned and unplanned downtime per year. Adding RAID data protection to such a system lets it survive some level of disk failure and raises availability to 99.5 percent, or 44 hours of downtime per year.
Fifty-two planned outages lasting 50 minutes each (44 hours distributed over 52 weeks) is manageable for most sites. For other sites, though, even a few minutes of planned downtime, let alone the threat of outages lasting for days, justify moving beyond the usual commercial availability of a single NT system. These sites are where high-availability (data mirroring with failover) and fault-resilient clustering (data-sharing solutions such as MSCS) products that take NT systems to 99.9 percent (8.8 hours of downtime per year) and 99.99 percent (53 minutes of downtime per year) availability come into play.
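The downtime figures above follow directly from the availability percentages. A quick sketch of the arithmetic (the function name is mine, not from any vendor's literature) reproduces the levels the article discusses:

```python
# Convert an availability percentage into expected downtime per year
# for a system that must run 24 hours a day, year-round.
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def downtime_hours(availability_pct):
    """Hours of downtime per year at a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.5, 99.9, 99.99):
    hrs = downtime_hours(pct)
    if hrs >= 1:
        print(f"{pct}% availability -> about {hrs:.1f} hours of downtime/year")
    else:
        print(f"{pct}% availability -> about {hrs * 60:.0f} minutes of downtime/year")
```

Running this yields roughly 87.6, 43.8, and 8.8 hours, and 53 minutes, matching the figures quoted for each availability class.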
High-availability and clustering solutions provide system redundancy and support some level of application restart or resource failover among member systems. These features increase system availability by facilitating the transfer of resource responsibilities to surviving systems. Although the resources remain highly available, the transfer, or failover, takes time, from seconds (for a few file shares) to minutes (5 minutes to 10 minutes for the restart of an application such as Exchange). Some client/server applications, by fluke or design, can survive these momentary transitions. Other applications cannot tolerate any identifiable transfer time. For a more detailed view of system availability, see Chapter 3 of Transaction Processing: Concepts and Techniques, by Jim Gray and Andreas Reuter (Morgan Kaufmann Publishers, 1992).
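Conceptually, the failover these products perform starts with heartbeat monitoring: a surviving node counts missed heartbeats from its peer and, past a threshold, assumes the failed node's resources. The following is a minimal sketch of that logic only; the class and threshold names are illustrative, not taken from any product in this article:

```python
# Illustrative heartbeat logic behind system-level failover. A real
# product would, at the point marked below, assume the failed peer's
# NetBIOS name and TCP/IP address and restart its services.
MISSED_LIMIT = 3  # consecutive missed heartbeats before declaring failure

class FailoverMonitor:
    def __init__(self):
        self.missed = 0
        self.failed_over = False

    def heartbeat(self, peer_alive):
        """Record one polling interval; trigger failover past the limit."""
        if peer_alive:
            self.missed = 0  # peer answered, reset the count
        else:
            self.missed += 1
            if self.missed >= MISSED_LIMIT and not self.failed_over:
                self.failed_over = True  # <- assume peer identity here

monitor = FailoverMonitor()
for alive in (True, True, False, False, False):  # peer goes silent
    monitor.heartbeat(alive)
print(monitor.failed_over)  # -> True
```

The threshold is what creates the seconds-to-minutes transfer window the paragraph above describes: a node must miss several polling intervals before failover begins, and service restart adds more time on top of that.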
Hardware fault-tolerant products take NT to the 99.99 percent availability level in a different way from fault-resilient clusters. Hardware fault-tolerant solutions (such as Marathon Technologies' Endurance 4000) involve total system redundancy in which all components perform actively during normal operation. This configuration allows continuous processing, or compute-through capability, for hardware-related failures. Unlike high-availability and fault-resilient cluster solutions, this configuration requires no application restart. Thus, no loss of application state or client connectivity to the hardware fault-tolerant system occurs.
As you move up the scale from 99.9 percent to 99.99 percent availability, successive solutions result in more than incremental increases in cost. This fact is why it's imperative that you identify in dollars what availability is worth to your enterprise.
Data Mirroring with Failover
At the low end of the price and complexity spectrum ($1895 to $3999 for two nodes), realtime data mirroring between servers lets you increase NT's availability and provides features that let each server assume the other's identity in case of failure. Data mirroring with failover between NT systems isn't new. In fact, it's a high-availability solution I ran across in 1995 and first used at one of those buried-in-a-mountain government installations. The Air Force colonel responsible for the system was concerned about having failover capability between the system's two new NT servers. He was used to the fault-resilient capability of the VMS-based clusters he had been relying on, and he wasn't going to give it up completely when he moved to NT.
What's new in data mirroring is an increase in the number of realtime data-mirroring products. The abundance of solutions has increased competition and driven improved product functionality. From a failover standpoint, product functionality improvement has meant moving away from older active/standby system-level implementations that, in some cases, require a reboot for the standby system to assume the identity of the failed system. Today's solutions can retain their identity while assuming the identities (including NetBIOS names and TCP/IP addresses) of multiple failed servers.
Table 1, page 132, lists some prominent current solutions that combine data mirroring with failover capability. (Many data-mirroring products do not include failover capabilities, and others, such as NT 5.0's IntelliMirror, do not offer realtime capabilities.) Although the advanced features of the solutions in Table 1 are targeted toward NT 4.0, NSI Software's Double-Take for Windows NT and the Qualix Group's OctopusHA+ support NT 3.51 but have reduced failover capabilities with NT 3.51.
The solutions in Table 1 meet the independent OS and application-execution criteria of my definition of clustering. Where they fall short--and why I don't classify them as true clusters--is in their inability to access shared storage. (To learn more about these solutions' features and functionality from a standard clustering perspective, see Jonathan Cragle, "Clustering Software for Your Network," July 1998.) Each node accesses data only on its locally attached storage. This limitation doesn't render these solutions useless in your search for higher NT availability; in fact, the beauty of these solutions is that their hardware and software requirements are not (unlike those of MSCS and other clustering solutions) stringent. For the most part, with these solutions you can use the systems and (with some elbow grease) applications you already have to build a system-level high-availability solution between two or more systems that can run NT. At most, because you are duplicating your data, you must add disk space. Depending on the traffic your systems support and the data mirroring and failover product you choose, you might also need to add network cards to create a private network between the systems you target for data mirroring and failover. Figure 1 illustrates a typical two-node data-mirroring configuration with internal storage.
An advantage of network-based data replication is that it opens the door for wide-area data replication. Wide-area data replication can give you offsite copies of your data, which you can use for centralized backup and which provide some level of disaster tolerance. Some products' architectures accommodate wide-area data replication better than others. Therefore, investigate product architecture as you evaluate each solution, and prototype your implementation. Extend your product investigation to the granularity at which a product lets you select data to mirror. Some products let you select specific files to mirror, and other products mirror at the volume level. An additional benefit of network-based data replication is the flexibility that some products' many-to-one mirroring capabilities provide. This flexibility lets you mirror many systems to one failover server, an approach that differs from the paired-server implementation MSCS offers.
The advantage of mirroring data can also be a disadvantage. Because data is mirrored, every data change results in two or more writes to disk. The "or more" comes into play if a log file holds changes before the changes are written to the target system. Depending on the implementation, the use of log files can mean that one write on a nonmirrored system results in four writes in a mirrored solution: one write for the data, one to the source transaction log, one to the target transaction log, and one to the mirrored target data. The additional overhead these writes cause can make mirroring solutions less desirable in write-intensive environments. Another disadvantage of data mirroring is that recovering from a failure requires the remirroring of data. For long outages affecting large volumes of protected data, remirroring can cause lengthy recovery times. Data mirroring solutions try to reduce recovery time by maintaining logs of the changes that have occurred and mirroring only the changes to the recovered system.
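The write amplification described above is easy to put numbers on. This back-of-the-envelope model (the function and its parameters are my own illustration, not any product's accounting) shows why write-intensive workloads feel the overhead:

```python
# Rough model of mirroring write amplification: one logical write can
# become up to four physical writes when source and target both log
# changes before the data reaches the mirrored target.
def physical_writes(logical_writes, mirrored, uses_logs):
    """Estimated physical disk writes for a batch of logical writes."""
    if not mirrored:
        return logical_writes          # data write only
    # data + source log + target log + mirrored target data
    per_write = 4 if uses_logs else 2  # without logs: data + mirror copy
    return logical_writes * per_write

print(physical_writes(1000, mirrored=False, uses_logs=False))  # 1000
print(physical_writes(1000, mirrored=True, uses_logs=True))    # 4000
```

A workload of 1,000 writes thus generates up to 4,000 physical writes under a logged mirroring implementation, which is why prototyping under realistic write loads matters.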
Another factor to investigate when you evaluate data mirroring and failover solutions is how a solution supports application failover. The products listed in Table 1 handle the failover of file shares, and OctopusHA+ and Co-StandbyServer handle print shares without a reboot as well. Table 1 also identifies some basic failover features these solutions provide, and it shows that, unlike MSCS, these solutions limit out-of-the-box failover support to the system level. That is, these solutions detect only system failures and fail over all applications, rather than individual applications.
Each solution addresses the intricacies of failing over applications differently. Remember that, unlike MSCS, these products are attempting to provide failover support to applications such as Exchange 5.0, which were not written with failover in mind. Beyond file and print services, all these solutions support only active/standby configurations for most server applications. Application failover might be version- and feature-specific and require operator intervention and a reboot at the time of failure. Application failover support is an area in which prototyping can stop you from taking the wrong high-availability path. The last three columns in Table 1 identify some failover features data mirroring and failover solutions provide for major BackOffice applications.
Exchange, with its reliance on primary computer name and account and its use of Registry keys, is a particularly difficult application to support. Initially, data mirroring failover solutions for Exchange were little more than workarounds that involved prestaging a failover system for Exchange disaster recovery. The introduction of MSCS, with its support for active/active server configurations with active/standby failover support for Exchange 5.5 Enterprise Edition, and competition among the data mirroring vendors have driven the refinement of data mirroring with failover solutions. All the vendors in Table 1 now automate the entire Exchange failover process (changing the primary computer name, stopping and starting services, patching the Information Store to fix globally unique identifiers--GUIDs--and in some cases removing the computer from or adding it to the domain). Microsoft's planned support for MSCS and Exchange Enterprise Edition versions that support active/active configurations, and perhaps even service partitioning, should drive the data mirroring vendors to even higher ground. Some vendors are already investigating the feasibility of adding many-to-one Exchange failover capabilities.
Taking Stock
I hope you now understand why clustering isn't the only path to increased NT availability. Although data mirroring doesn't fit my definition of clustering, nor is it a fault-tolerant solution, it offers features that clustering can't offer, and it might better meet your NT availability and business requirements. If you want to use your existing equipment, can tolerate system-level failover and restart of your existing applications, or are in search of an increased level of disaster tolerance, the recent improvements in data mirroring with failover products more than justify a download from the Internet and a day or so of evaluation.