LifeKeeper 1.0 for EnVista Servers

LifeKeeper 1.0 for EnVista Servers is a bundled solution for NT clustering.

Joel Sloss

May 31, 1997



Amdahl is one of many server companies that offer their systems bundled with someone else's cluster software. When you purchase Amdahl hardware, you can choose from VERITAS FirstWatch, NCR LifeKeeper, and soon Microsoft Wolfpack software--all on Windows NT. With these choices, if you already have cluster software running on your network but on a different operating system (such as FirstWatch or LifeKeeper on UNIX), you can choose the same setup for NT and not have to learn and integrate a whole new system. If you're looking for fault tolerance without sacrificing performance, easy administration, and high availability, consider the Amdahl and LifeKeeper solution, which we reviewed in the Windows NT Magazine Lab. For now, let's forget about price and focus only on what Amdahl's enterprise cluster solution will do for you.

Technology Overview
You have to look at this solution in two parts--the hardware and the software. As with Microsoft and Wolfpack, NCR will support LifeKeeper--and allow it to be sold--only on certified hardware, such as NCR's WorldMark server or Amdahl's EnVista Frontline Server (FS). You can't buy just the LifeKeeper software and set it up on any system.

The hardware can be simple or complex, depending on the performance you want and the money you can spend. Amdahl set up the Lab with a serious contingent of machinery: two quad Pentium Pro EnVista FSs, each with 512MB of RAM, and an LVS 4500 Ultra Wide SCSI-3 disk array with twenty 4.3GB drives. Figure 1 shows the configuration and interconnects. Either server alone can support 1000 or more users; together, with proper load balancing, the servers can support twice that number.

Each component in the cluster solution is fault tolerant on multiple levels: The servers have dual power supplies (available with three modules), dual SCSI controllers, Error-Correcting Code (ECC) memory, and hot-swap drive bays. The disk array has hot-swap drive bays, five power supplies with battery backup, ECC cache memory, and dual RAID controllers (availability managers). With such redundancy, Amdahl has eliminated many--but not all--points of failure: Disk drives, power supplies, and disk controllers can still fail. One qualification is that if one of the SCSI controllers fails, you have to manually switch control of its drives to the other RAID controller in the LVS array. If you don't, the SCSI controller failure initiates a server failover. This setup, however, provides better disk I/O throughput because you use more than one SCSI card. You will have to employ other means (special drivers and software, such as Adaptec Duralink) to use multiple network controllers for redundancy and load balancing.

"Clustering Solutions Feature Summary," page 58, shows that LifeKeeper does almosteverything you might want. The fault-tolerance features of the EnVista servers and LVS 4500 diskarray strengthen the system so that you would have to inflict heavy damage on both systems to makethem go down. I'll point out the few exceptions in this review. LifeKeeper is fully compliant withthe Wolfpack APIs, so upgrading or interoperating with Wolfpack in the future won't be a problem.

How It Works
LifeKeeper uses hierarchies of resources to define cluster groups; an entire hierarchy is what fails over from one system to the other. For example, a hierarchy can include a LAN Manager resource name, a disk volume, and an IP address. A Microsoft SQL Server failover hierarchy might include the disk volume that the data resides on, a named-pipe NetBIOS name (the SQL Server alias that appears on the network, such as accounting), the specific database, and an IP address. Each hierarchy has dependencies of objects; for example, the disk volume must come online before the database can. The group can contain as many objects as needed to protect a given resource. You could have, say, 10 disk volumes, a SQL Server object, Exchange Server, LAN Manager names, and IP addresses all fail over at once under the accounting hierarchy.
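
To make the dependency ordering concrete, here is a minimal sketch--in Python, purely for illustration and not LifeKeeper's actual interface (you configure LifeKeeper through its administration tools and recovery kits)--of how a hierarchy like the accounting example might be modeled, with each resource brought in service only after the resources it depends on:

  # Hypothetical model of a LifeKeeper-style failover hierarchy; the names,
  # addresses, and methods below are invented for illustration only.
  class Resource:
      def __init__(self, name, depends_on=None):
          self.name = name
          self.depends_on = depends_on or []   # resources that must be in service first

      def bring_in_service(self, already_up):
          # Bring dependencies online first (e.g., the disk volume before the database).
          for dep in self.depends_on:
              if dep.name not in already_up:
                  dep.bring_in_service(already_up)
          print("bringing %s in service" % self.name)
          already_up.add(self.name)

  # The "accounting" hierarchy: everything below fails over as one unit.
  volume_e   = Resource("disk volume E:")
  lanman     = Resource("LAN Manager name ACCOUNTING", [volume_e])
  ip_address = Resource("IP address 192.168.1.31")        # hypothetical switchable address
  database   = Resource("SQL Server database accounting", [volume_e, lanman, ip_address])

  # Failing over the hierarchy means bringing its resources in service,
  # in dependency order, on the surviving node.
  database.bring_in_service(set())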

Several heartbeats travel at once between the two nodes, so that any single failure--such as someone tripping over a LAN cable--won't trigger an unexpected failover. By default, you have interconnects via a direct network crossover, a LAN connection, and a serial link. All these interconnects must fail before LifeKeeper shifts services from the primary to the secondary node. You can also dedicate a small (1MB) partition on one of the shared array drives, and the nodes can communicate through it. Each heartbeat runs at a different priority level, and you can configure the polling frequency to control how long the secondary node waits before assuming control over shared resources. But the polling frequency also affects system performance, because these heartbeats are interrupt-driven and consume processing time.
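
The following sketch--again hypothetical Python rather than anything shipped with LifeKeeper--shows the kind of poll-and-count logic the heartbeat scheme implies: the secondary node assumes control only after every interconnect has missed its allowed number of polls, and shorter polling intervals detect a failure sooner at the cost of more processing overhead.

  import time

  # Hypothetical heartbeat monitor; interconnect names, intervals, and retry
  # counts are stand-ins for LifeKeeper's configurable polling settings.
  INTERCONNECTS = ["crossover", "lan", "serial"]
  POLL_INTERVAL_SECONDS = 5
  MAX_MISSED_POLLS = 3

  def peer_responds(link):
      """Placeholder: a real monitor would probe the peer node over 'link'."""
      return True

  def monitor():
      missed = {link: 0 for link in INTERCONNECTS}
      while True:
          for link in INTERCONNECTS:
              missed[link] = 0 if peer_responds(link) else missed[link] + 1
          # Only when *all* interconnects have failed is the peer assumed dead.
          if all(count >= MAX_MISSED_POLLS for count in missed.values()):
              print("all interconnects down -- taking over shared resources")
              return
          time.sleep(POLL_INTERVAL_SECONDS)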

LifeKeeper can protect anything, from single disks to whole database systems and messaging platforms, as long as you have the appropriate recovery kit (a set of services and DLLs) for the applications you want to cluster. The basic system comes with failover capabilities for IP addresses, disk volumes, LAN Manager (NetBEUI names), and SQL Server.

You use three management tools to maintain your cluster: Hardware Status, Hierarchy Status, and Application Status. In the Hardware Administration window, shown in Screen 1, you can see whether the nodes are up or down, view the status of heartbeat interconnects, and create or modify interconnects. In Screen 1, the primary interconnect is down. In the Hierarchy Administration window you see in Screen 2, you can view all running hierarchies and their status (green means in service; yellow, out of service; red, dead; gray, unknown), and you can create, modify, and delete hierarchy objects, resources, and properties. The Application Status window you see in Screen 3 is like the Hierarchy Administration window, but it shows you the primary server for each application protected by the cluster, associated hardware resources, and so forth.

Setup and Configuration
Setting up and configuring LifeKeeper is both simple and complex. The process is straightforward if you know what you're doing and have the appropriate information up front; but setup is complex and time-consuming. For the Lab's test, the Amdahl engineers and I spent more than two full days from plugging in the systems to having a fully functional cluster. Read the documentation first--this step is extremely important--because following the proper configuration sequence will save you a great deal of time. Upgrading an existing system to a cluster configuration can be difficult--you have to rebuild your system from the ground up--so don't do it on a live system.

On the hardware side, setting up the Amdahl servers and LVS 4500 disk array takes some effort, because you have to install all the SCSI and network controllers, configure the disk array, and install NT. However, for a fee, Amdahl will set it up for you on site.

Similarly, on the software side, Amdahl contracts with a service provider to get you up and running with LifeKeeper. The service engineer installs the software; you install the service on one system and reboot, then do the other system. But before you start, you need quite a bit of information up front:

  • Static IP addresses for each NIC attached to your LAN--you can't use Dynamic Host Configuration Protocol (DHCP)

  • An IP address for each NIC running the heartbeat interconnect--these IP addresses can be anything you want if you use a direct crossover link; but if you use your LAN, they come out of your available address pool

  • A placeholder IP address for each system--this address becomes active when a secondary node takes over a service from the primary node

  • Three IP addresses (one switchable address and one placeholder for each node) for each of the services you want LifeKeeper to protect (SQL, Exchange, disk shares, etc.); these addresses are not mandatory if you aren't using TCP/IP Sockets

You have to enter all addresses assigned to NICs (including system placeholders) from the Control Panel, Network applet; you enter applications and their placeholder addresses through the LifeKeeper software.
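
As an illustration of how much address information that adds up to (every address below is hypothetical, not from the review configuration), a two-node cluster's worksheet might look something like this before you begin:

  # Hypothetical pre-installation address worksheet for a two-node cluster.
  ADDRESS_PLAN = {
      "envista1": {
          "lan_nic":       "192.168.1.11",   # static; DHCP is not allowed
          "heartbeat_nic": "10.1.1.1",       # direct crossover link, any address works
          "placeholder":   "192.168.1.21",   # activates if this node takes over a service
      },
      "envista2": {
          "lan_nic":       "192.168.1.12",
          "heartbeat_nic": "10.1.1.2",
          "placeholder":   "192.168.1.22",
      },
      "protected_services": {
          # One switchable address plus one placeholder per node for each service.
          "sql_accounting": ["192.168.1.31", "192.168.1.32", "192.168.1.33"],
      },
  }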

Next is disk setup. You need to fully configure your RAID array before you install LifeKeeper. A configured RAID array simplifies assigning drives and volumes to primary servers and installing application software that you want the cluster to protect.

When you have the drives online and your IP addresses in hand, you can install the administration software and LifeKeeper NT service. On one server at a time, install the software and then reboot, so that the servers won't fight for ownership of the SCSI bus.

How you install SQL Server, Exchange, or any other cluster application depends on whether you are using any load balancing and how you want to optimize for the best performance. An active/active configuration lets you use both systems for meaningful work all the time--for example, you can put your accounting database on one system and your order entry on the other. With an active/standby configuration, you can leave your database running on one server and use the other as a messaging platform. Each approach has performance and administration advantages and disadvantages, which you must analyze before your deployment.

We set up the Lab's cluster in an active/active configuration, with SQL Server and file services running simultaneously on both systems. Amdahl installed both the LAN Manager and SQL Server recovery kits. In an active/active cluster, you place the MasterDB and TempDB devices on the server's internal system drives, and only the data devices go on the shared array. To avoid Registry conflicts and keep LifeKeeper from confusing the databases, the directories for Master and Temp must be different. In an active/standby configuration, all database devices go on the shared array. Finally, SQL Server's services need to be on manual startup, rather than automatic, so that LifeKeeper can control them.
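
A hypothetical device layout for the active/active case (the paths and names below are mine, not Amdahl's) might look like the following, with Master and Temp on each server's internal drive in per-server directories and only the data devices on the shared array:

  # Hypothetical SQL Server device layout for an active/active LifeKeeper cluster.
  DEVICE_LAYOUT = {
      "envista1": {                              # internal system drive
          "master": r"C:\MSSQL\DATA1\MASTER.DAT",
          "temp":   r"C:\MSSQL\DATA1\TEMPDB.DAT",
      },
      "envista2": {                              # different directory to avoid conflicts
          "master": r"C:\MSSQL\DATA2\MASTER.DAT",
          "temp":   r"C:\MSSQL\DATA2\TEMPDB.DAT",
      },
      "shared_array": {                          # only the data devices fail over
          "accounting_data": r"E:\SQLDATA\ACCOUNTING.DAT",
          "orders_data":     r"G:\SQLDATA\ORDERS.DAT",
      },
  }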

Before you can set up your databases, each server needs to have primary control over the appropriate disk volumes; therefore, you create a LAN Manager hierarchy that contains the desired volume on each server. For example, on our test system, we had made four RAID 5 volumes--E, F, G, and H--out of the 20 available drives. EnVista 1 owned E and F, and EnVista 2 had G and H. If you create hierarchies on each system, LifeKeeper can protect the volumes, fail them over when necessary, and keep track of their status.

Now, you can create your data devices and place your databases in them. After you have fully configured each SQL Server, you can set up the SQL hierarchies (all of which you can do from one system--LifeKeeper will keep it straight). Selecting the specific database automatically includes a dependency for the disk volume housing the data. Create an alias for the database (for named-pipe connections that users will make), name the hierarchy, and you're ready to go.

Among the other factors you can configure are how many times LifeKeeper retries its heartbeat before initiating a failover, and whether you want the resource to automatically fail back to the primary server when the primary server is back online. Automatic failback could be a problem if the server keeps crashing or if you are performing system maintenance.
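
In the same illustrative vein, the failback choice boils down to a single policy setting; the names below are invented, but the trade-off is the one just described.

  # Hypothetical failback policy: automatic failback returns a hierarchy to its
  # primary node as soon as that node recovers, which can ping-pong services if
  # the primary keeps crashing or is undergoing maintenance.
  AUTOMATIC_FAILBACK = False

  def on_primary_recovered(hierarchy):
      if AUTOMATIC_FAILBACK:
          print("moving %s back to the primary node" % hierarchy)
      else:
          print("%s stays on the secondary node until an operator moves it" % hierarchy)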

Testing LifeKeeper
On the whole, LifeKeeper worked as advertised--I could swing one database back and forth between servers while the other ran along by itself. However, the databases weren't always completely reliable. After the Amdahl engineers initially configured the cluster, the databases and hierarchies became corrupted (databases came up suspect--the stored procedures for repairing them didn't work; hierarchies stayed red), so I had to start over with new databases. I may have caused this problem by improperly changing the network configuration (domain, IP addresses, etc.) of the two systems. However, deleting and re-creating the corrupt databases and hierarchies and rebooting both servers solved the problem. The situation was made worse because the main documentation wasn't as helpful as I would have liked, and the manual specific to the SQL Recovery Kit wasn't detailed enough or very clearly written.

After I re-created the data devices, reloaded the test data sets for Dynameasure, and reassigned the hierarchies, the cluster functioned normally. The only hang-up at this point was that about one out of five failovers I performed came up with an application error, and LifeKeeper couldn't bring the database back online. I had to reboot both systems to recover.

On the plus side, the failover for both disk volumes and the databases worked the way it was supposed to--complete failover took about 30 seconds. In either a hard crash or a manual hierarchy failure, LifeKeeper stops SQL Server on the primary node, moves the disk volume to the secondary node, and brings the database back online on the secondary server. LifeKeeper does this without disturbing the work on the secondary server, although the extra user workload has obvious performance effects. In the active/active configuration, SQL Server never stops on the secondary node, so the takeover server doesn't drop user connections as it takes on the new services.

However, the system does not maintain user connections from the primary (failed) system when the service comes up on the secondary node; the failover killed all current connections and processes. Thus, when I ran the Dynameasure tests, all the simulated users stopped working when the primary node died, and they did not pick back up when the database shifted to the other node. However, I could restart the test without changing the name of the target server (I remained pointed to the same test alias), and the simulated users spun right up.

This handling of user connections is a result of how the cluster works, how SQL Server works with user connections, and how NT maintains security tokens. First, when transactions are not fully committed before a crash, the system doesn't write them to the transaction log; consequently, the system doesn't know the transactions occurred. When the system brings SQL back online, SQL can rebuild to the last known ready state according to the transaction log. Data is never corrupted (it doesn't come up in that interminable recovering mode), but SQL doesn't know about any processes that were running. Second, NT and SQL Server can't move security access tokens from one machine to the other (security for NT and SQL needs to be the same on both nodes). The saving grace here is that if you are using an NT domain, the users don't need to log on again. And LifeKeeper migrates database security from the primary node to the secondary node. However, if you are using SQL security, users must log back on to the server through their client applications.

You must have application logic built into your client programs for the programs to be cluster aware. Program cluster awareness is the only way to automate reconnecting to the server when it is available again. Thus, a multiphase commit cycle for an order entry system might come up with a resource error as the system crashes. But when the database is ready on the secondary node, you just hit the submit button again.
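
A minimal sketch of that kind of cluster-aware client logic, assuming a generic database client (the connect and execute calls are placeholders, not a particular SQL Server library), would retry against the same alias until the database comes back:

  import time

  SERVER_ALIAS = "accounting"    # the named-pipe alias that LifeKeeper fails over
  RETRY_DELAY_SECONDS = 10
  MAX_ATTEMPTS = 6               # comfortably covers the roughly 30-second failover

  def submit_order(order, connect, execute):
      """Resubmit a unit of work until it succeeds or the retries run out."""
      for attempt in range(MAX_ATTEMPTS):
          try:
              conn = connect(SERVER_ALIAS)
              execute(conn, order)       # resubmit the whole unit of work
              return True
          except ConnectionError:
              # Uncommitted work from the failed node was rolled back from the
              # transaction log, so it is safe to retry from the top.
              time.sleep(RETRY_DELAY_SECONDS)
      return False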

What happens if your system crashes in the middle of a large overnight decision support run? Well, tough tomatoes. Unless you build specific logic into your application to handle this eventuality, you'll just have to run the program again.

What It All Means
You need to note the difference between high availability (99 percent), which this type of solution provides, and 24*7 fault tolerance and availability (100 percent). At 99 percent, you can still see roughly 88 hours of down time over a year. Even with the kind of configuration reviewed here, you can still have down time--although it is minimal--and the solution is not foolproof.

Some negative features are the system's inherent single points of failure, such as the SCSI controllers and the disk array; loss of user connections during failover; and the system's cost. For the best results, consider your environment's design and load, the frequency of your transactions, and the frequency with which your client applications connect and reconnect to the server application. You may have to redesign client software to make it fully cluster aware.

But I don't want to sound too negative here. Of the solutions the Lab tested, I found LifeKeeper to be the best to use once I adjusted to the interface and learned the proper administration methods--but it wasn't the most reliable. Although I wasn't fond of the color-coded icons (yellow vs. green meant that my red-green color-blind eyes couldn't tell the difference between online and offline services!), I found LifeKeeper's interface logical and the tools easy to use.

For more functionality, such as dynamic load balancing, you could design custom middleware applications using NCR's TOP END transaction monitor. Load balancing could let you use three servers for one database. And with the software development kit (SDK) available for LifeKeeper, you can build custom recovery kits for proprietary applications or other major applications that LifeKeeper doesn't support yet. Between Amdahl's excellent hardware and LifeKeeper's first-rate functionality, I can definitely recommend that any large IS shop look into this combination for an immediate solution.

LifeKeeper 1.0 for EnVista Servers

Amdahl
800-223-2215
Web: http://www.amdahl.com
Price: $2000 per server for LifeKeeper software, $750 per application recovery kit, $30,720 EnVista FS (as configured), $93,737 LVS 4500 (as configured)
