Attaining Availability—Avoiding Failure
Suddenly losing your mission-critical server can devastate your business, but few companies plan methods to avoid disaster. Windows NT Server offers many availability options to help keep your company up and running.
If you don't consider disaster planning and availability part of your network management strategy, consider Stratus Computer's findings from a recent survey of Fortune 1000 companies. In 1992 (the last year such research data was available), computer downtime cost US businesses more than $3.8 billion in lost revenue and worker productivity. This downtime equals an average hourly revenue loss of $78,000 and approximately 38 million worker hours annually, or $444 million in wages.
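To put the survey figures to work, here is a minimal Python sketch that turns them into a rough per-outage estimate. The $78,000 hourly revenue loss, the 38 million worker hours, and the $444 million in wages come from the survey; the function name, its parameters, and the idea of simply adding idle wages to lost revenue are assumptions made for illustration, not part of the Stratus study.

```python
# Figures quoted from the survey; everything else here is an assumption.
AVG_HOURLY_REVENUE_LOSS = 78_000        # dollars per hour of downtime
ANNUAL_LOST_WORKER_HOURS = 38_000_000
ANNUAL_LOST_WAGES = 444_000_000         # dollars

# Implied average wage per lost worker hour (about $11.68).
IMPLIED_HOURLY_WAGE = ANNUAL_LOST_WAGES / ANNUAL_LOST_WORKER_HOURS


def estimate_outage_cost(hours_down, idle_workers=0,
                         hourly_revenue_loss=AVG_HOURLY_REVENUE_LOSS,
                         hourly_wage=IMPLIED_HOURLY_WAGE):
    """Rough cost of one outage: lost revenue plus wages paid to idle staff."""
    lost_revenue = hours_down * hourly_revenue_loss
    idle_wages = hours_down * idle_workers * hourly_wage
    return lost_revenue + idle_wages


if __name__ == "__main__":
    print(f"Implied wage per lost hour: ${IMPLIED_HOURLY_WAGE:.2f}")
    print(f"4-hour outage, 50 idle workers: ${estimate_outage_cost(4, 50):,.0f}")
```

Substitute your own hourly revenue and wage figures for the survey averages to get a number that reflects your company rather than the Fortune 1000.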
A sudden loss of a mission-critical server can be financially disastrous. In most companies, just the downtime before recovery can be too costly. Still not convinced? According to "Down But Not Out" (HP Professional, September 1994), "The average company loses two to three percent of its gross sales within 10 days after losing its data processing, and critical business functions cannot continue for more than 4.8 days without a recovery plan in progress. Half of the companies that do not restore their data center to operation within 10 business days never fully recover. Ninety-three percent of the companies lacking a recovery plan are out of business within five years of a major disaster."
Despite these claims, few companies plan ways to prevent or mitigate losses. To protect the bottom line, companies need to evaluate potential losses and implement an appropriate availability scheme for their network.
A good starting point is to review the availability mechanisms that Windows NT Server supports. These mechanisms include data backup, uninterruptible power supplies (UPSs), and redundant systems. With an understanding of the options, companies can make informed decisions about implementing the appropriate levels of protection for their LAN and WAN and be better prepared for the next level of availability--ensuring server availability with server redundancy.
LAN Availability
Downtime can result from disasters such as fires, floods, power failures, and--let's face it--users. Users frequently (yet accidentally) delete critical files or stumble onto control-key combinations that can restructure databases and wreak havoc throughout a company. So when planning a network, you need to consider availability, backup, and disaster recovery.
Most network administrators implement availability by mirroring the primary system. This redundant system eliminates single points of failure. Fortunately, NT Server comes with support for tape backup, UPSs, and redundant systems.
Critical Data and Programs
Data backup is at the forefront of availability. The backup process copies important information onto magnetic tape or disks. Without backups, vital data, complex application and network configurations, customized setups, and user passwords and IDs are difficult and expensive--perhaps even impossible--to re-create. Backing up information is also important because of its changing nature. Compaq reports that as much as 40% of its company data changes every month.
To restore a system after a disaster, you need to back up all data and programs and determine whether certain users or groups have special backup needs. For example, an accounting group may require data backups beyond the regularly scheduled full-system backups. For information on NT-native backup programs, see Bob Chronister, "System and Enterprise-wide Backup Software," Windows NT Magazine, April 1996.
UPSs
Most systems improve OS performance by writing changes to RAM before writing them to disk (write-back caching). When a power interruption turns off or resets a computer, you can lose cached information and potentially corrupt data. Because the server processes most data on the network, any power fluctuations can adversely affect data flow to and from client workstations.
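A toy Python model shows why write-back caching and power loss mix badly. It is a sketch only: the class and method names are invented for illustration and do not correspond to anything in NT's cache manager.

```python
class WriteBackCache:
    """Toy write-back cache: writes land in RAM first and reach disk on flush."""

    def __init__(self):
        self.disk = {}    # contents that survive a power loss
        self.dirty = {}   # changes held in RAM only

    def write(self, key, value):
        self.dirty[key] = value

    def read(self, key):
        # Prefer the newest (cached) copy, fall back to what is on disk.
        return self.dirty.get(key, self.disk.get(key))

    def flush(self):
        """Orderly shutdown path: copy every cached change to disk."""
        self.disk.update(self.dirty)
        self.dirty.clear()

    def power_loss(self):
        """Abrupt power loss: RAM contents vanish before reaching disk."""
        self.dirty.clear()


cache = WriteBackCache()
cache.write("ledger", "version 1")
cache.flush()                        # version 1 is safely on disk
cache.write("ledger", "version 2")   # version 2 exists only in RAM
cache.power_loss()
print(cache.read("ledger"))          # the update to version 2 is gone
```

A UPS buys the time needed to take the `flush` path instead of the `power_loss` path.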
Most system administrators equip critical servers with UPSs in case of a power failure. But don't overlook key network connection points such as main servers and LAN/WAN peripherals (routers, bridges, hubs, and concentrators). Site-to-site and wide-area networks are susceptible at these points, so use UPSs to maintain data flow and processing stability among servers.
What about client workstations? In a peer-to-peer network, any workstation can be the server to any other workstation on the network. Peer-to-peer activity greatly increases the data flow on the network to each workstation and makes the workstations susceptible to brownouts and blackouts. So, you need UPSs at client workstations. This way, if you lose power, you have time to save active files and do an orderly shutdown. For more information, see Larry McClain, "Roundup of UPS Products for Windows NT," Windows NT Magazine, November 1995.
Redundant Systems
With availability solutions, two are always better than one. Redundancy lets a system gracefully handle a failure in any component for which a duplicate is available. Sophisticated systems use the duplicate component to balance the processing load until a failure occurs. Then, the remaining component picks up the full load with a decrease in performance but little or no interruption in service. HP's research on server failures shows server downtime most often occurs from system hangs when the server or network OS freezes or stops running, from power failures to the server, and from hard drive and memory failures.
Disk redundancy solutions are disk mirroring, disk duplexing, and disk arrays. With mirroring, two disks (or two partitions on different drives) on one controller are copies of one another. A system write operation writes the data to both disks, so they are always synchronized. If the primary disk fails, no data is lost because the secondary disk has an exact copy of the data on the primary disk.
A caveat to mirroring is that two disks don't improve performance. Usually, performance worsens because the disk controller has to write every operation twice.
To solve this problem you can use disk duplexing--disk mirroring with another adapter running the secondary drive. Duplexing provides protection for both disk and controller failure and improves disk I/O performance over mirroring. Duplexing doesn't adversely affect performance because both disk controllers perform write operations simultaneously. And using two controllers removes a potential single point of failure within a system.
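The mirroring behavior described above can be sketched in a few lines of Python. This is a simplified model, not NT's fault-tolerance driver: the `Disk` and `MirrorSet` classes are hypothetical, and the sketch leaves out controllers entirely, so it looks the same for mirroring and duplexing (duplexing would simply put each disk on its own adapter).

```python
class Disk:
    """A hypothetical disk that stores blocks until it is marked failed."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.failed = False

    def write(self, block_no, data):
        if self.failed:
            raise IOError(f"{self.name} has failed")
        self.blocks[block_no] = data

    def read(self, block_no):
        if self.failed:
            raise IOError(f"{self.name} has failed")
        return self.blocks[block_no]


class MirrorSet:
    """Every write goes to both members; reads use the first healthy member."""

    def __init__(self, primary, secondary):
        self.members = [primary, secondary]

    def write(self, block_no, data):
        healthy = [d for d in self.members if not d.failed]
        if not healthy:
            raise IOError("both mirror members have failed")
        for disk in healthy:
            disk.write(block_no, data)

    def read(self, block_no):
        for disk in self.members:
            if not disk.failed:
                return disk.read(block_no)
        raise IOError("both mirror members have failed")


primary, secondary = Disk("disk0"), Disk("disk1")
mirror = MirrorSet(primary, secondary)
mirror.write(7, b"payroll")
primary.failed = True        # simulate losing the primary disk
print(mirror.read(7))        # still served from the secondary copy
```

Note that each logical write costs two physical writes, which is the performance caveat the text mentions for a single controller.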
The third form of disk redundancy is RAID. A disk array is a group of disk drives, and each drive stores information in parallel with the others. Redundancy relies on parity, a mathematical calculation that lets the disk array reconstruct any corrupt or missing information if one disk fails.
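The parity calculation is easiest to see in code. The Python sketch below uses byte-wise XOR, the operation commonly used for parity in RAID 3, 4, and 5 arrays; the article itself doesn't name the operation, and the block contents and function name are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data disks, each holding one 4-byte block of the same stripe.
data = [b"NTSV", b"RAID", b"DISK"]
parity = xor_blocks(data)            # stored on a fourth disk

# Disk 1 fails: XOR the surviving data blocks with the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)                       # the block that was on the failed disk
```

Because XOR is its own inverse, the same function both computes the parity and rebuilds a missing block, but only when exactly one block in the stripe is missing.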
The six levels of RAID are RAID 0 through RAID 5. Each level offers a different mix of performance, reliability, and cost.
RAID 0: Disk striping (a disk array that implements striping without any drive redundancy)
RAID 1: Disk mirroring or duplexing (two drives storing identical information, mirroring each other)
RAID 2: Redundancy through Hamming code (extra check disks that detect and correct single-bit errors and detect double-bit errors)
RAID 3: Striped array plus parity (one redundant check disk for each group of drives)
RAID 4: Independent striped array plus parity (a disk array architecture optimized for transaction processing applications)
RAID 5: Independent striped array with distributed parity (devoting the equivalent of one disk to check data, but distributing that check data over the group of drives)
NT supports only RAID 0, 1, and 5. Although you can mix and match RAID 0, 1, and 5 across the disks in a system under NT Server, for availability consider only RAID 1, RAID 5, or a combination of the two, because RAID 0 doesn't provide data redundancy.
With RAID 5, if one drive fails, the array continues to function. The system reconstructs the missing or damaged information from the data and parity information on the other disks in the array. In RAID 5, the controllers write data one segment at a time and interleave parity information among the assigned disks. Table 1 on page 72 lists RAID 5 characteristics. For a glossary of RAID-related terms, see the sidebar, "RAID Tech Talk," above.
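To picture how RAID 5 interleaves parity, the following Python sketch labels each disk's role stripe by stripe. The rotation order shown (parity starting on the last disk and moving one disk to the left per stripe) is an assumption chosen for illustration; real controllers and NT's software RAID may rotate parity differently. The function names are hypothetical.

```python
def parity_disk(stripe, n_disks):
    """Index of the disk that holds parity for a stripe (assumed rotation)."""
    return (n_disks - 1 - stripe) % n_disks


def stripe_layout(stripe, n_disks):
    """Label each disk's role in one stripe: 'P' for parity, 'D0', 'D1', ... for data."""
    p = parity_disk(stripe, n_disks)
    labels, next_data = [], 0
    for disk in range(n_disks):
        if disk == p:
            labels.append("P")
        else:
            labels.append(f"D{next_data}")
            next_data += 1
    return labels


def usable_capacity(n_disks, disk_size):
    """Parity consumes the equivalent of one disk, whatever the rotation."""
    return (n_disks - 1) * disk_size


if __name__ == "__main__":
    for stripe in range(4):
        print(stripe, stripe_layout(stripe, 4))
    print("Usable capacity of four 2GB disks:", usable_capacity(4, 2), "GB")
```

Spreading parity this way means no single disk handles every parity update, which is the practical difference between RAID 5 and the dedicated check disk of RAID 3 and RAID 4.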
Availability with NT
In addition to data backup and UPS devices, RAID needs to play an important part in your availability scheme. With NT Server, you can implement RAID in either hardware or software. So your decision comes down to either buying the solution from a RAID hardware vendor or building it using NT's software features.
With RAID in hardware, the disk controller creates and regenerates the redundant information in one of two ways: the host bus-based system or the SCSI-SCSI system. In a host bus-based system, the disk controller contains a CPU and firmware for calculating parity and striping data. Most host bus solutions are on EISA and PCI bus systems. The faster the host bus, the faster the RAID subsystem.
The SCSI-SCSI RAID subsystem alternative consists of an external drive chassis and a device similar to a host-bus adapter. This external chassis connects to the host system via a standard SCSI cable and appears to the system as one or more SCSI devices. The SCSI-SCSI RAID subsystem doesn't require a device driver on the host, and you can use the subsystem on any system with a SCSI bus.
NT Server's software capabilities let you mirror a stripe on one controller to a stripe on a second controller. Mirroring across controllers removes the controller as a single point of failure. Two disadvantages of a purely software-based RAID implementation are performance and reliability.
Performance: With RAID software, the system CPU performs extra work such as calculating parity in RAID 5. With hardware-based RAID, the controller calculates parity data and duplicates disk writes, freeing the system CPUs to handle the usual processing tasks.
Reliability: Protecting a drive that contains the OS from drive failure is difficult in software-based RAID because the OS must boot before the protection is available. In contrast, the hardware-based RAID subsystem protects data as the system boots. If a drive with the OS fails, the controller reconstructs the OS at boot time.
Combining software- and hardware-based RAID under NT Server provides the best of both worlds. If a hardware-based RAID controller fails, the system is down until you replace the controller. If you install two controllers, you can create a RAID 0 stripe on each controller and then mirror one stripe to the other in software.
RAID Solution with NT Server
To set up RAID, you need NT Disk Administrator, a graphical utility that manages disk resources, including drive partitioning, volume creation and deletion, and software RAID configuration. You can make the disk subsystem more redundant with multiple disk controllers. NT Setup lets you incorporate new disk controller drivers. You use this utility before you configure drives with Disk Administrator. With SCSI drives, NT can isolate and avoid bad disk sectors. In this way, NT can recover the data in a bad sector from its redundant copy and write the information to a good sector.
Increasing Server Availability
Disk mirroring, duplexing, and RAID protect your system from disk failure only. So what happens if a CPU dies? To increase availability, the next step is server redundancy at the OS level. To meet this need, Microsoft teamed with HP, Digital Equipment, Compaq, Tandem, NCR, and Intel to implement a technique called clustering.
A cluster is a set of loosely coupled, independent computer systems that work together and behave as one system. Clusters offer high availability through component redundancy, so when a component or server fails, the cluster continues to provide service. Digital Equipment pioneered clustering in the mid-1980s on VMS.
A cluster of NT servers provides common, highly available services to PC and workstation clients. You manage an NT cluster as one secure entity. You can easily add incremental processing, I/O, and storage capacity to the cluster domain. With file services, clients access remote directories in the cluster through the File Manager, the same way they access any directory in Windows 3.x or NT. The location of the directory server is transparent to the user.
A cluster can be a simple set of standard desktop personal computers connected with Ethernet, or a sophisticated hardware structure with high-performance symmetric multiprocessing (SMP) systems interconnected with a high-performance I/O bus. You can add systems to the cluster as needed to process more or to handle more-complex client requests. And if one component in a cluster fails, the system can automatically distribute the workload of the failed component among the surviving components.
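The idea of spreading a failed component's workload over the survivors can be sketched in Python. This is a conceptual illustration only: the server names, the workload names, and the "least-loaded survivor" rule are assumptions, and nothing here reflects how Wolfpack or Digital's product actually decides where work goes.

```python
def fail_over(assignments, failed_server):
    """Return a new assignment with the failed server's work spread over survivors."""
    survivors = {name: list(work) for name, work in assignments.items()
                 if name != failed_server}
    if not survivors:
        raise RuntimeError("no surviving servers in the cluster")
    for workload in assignments.get(failed_server, []):
        # Hand each orphaned workload to the currently least-loaded survivor.
        target = min(survivors, key=lambda name: len(survivors[name]))
        survivors[target].append(workload)
    return survivors


cluster = {
    "A": ["file shares", "print queues"],
    "B": ["database"],
    "C": ["mail"],
}
after_failure = fail_over(cluster, "A")
print(after_failure)
```

Clients keep addressing the cluster as one system; only the mapping of services to physical servers changes.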
Microsoft plans to deliver clustering, under the code name Wolfpack, in two phases. For a comprehensive analysis of cluster technology and Wolfpack in particular, see Mark Smith's "Closing In on Clusters," page 51.
If you can't wait for Microsoft's clustering solution, check out Digital Equipment's Clusters for Windows NT Server (see Joel Sloss, "Digital Clusters for Windows NT," page 63). The product doesn't require hot and cold standby, proprietary hardware, interconnects, or special versions of NT, and Digital Equipment has optimized it for client/server computing.
Be Prepared
Count on this: At some point your network will go down. Whether your downtime results from the wrath of nature or user error, you will lose data, servers will crash, routers and bridges will fail, and communication lines will fall. Although you can't make an NT Server or network failproof, you can make it failure resistant. So plan for disaster, and defend against it. Your best defense is a solid offense--availability techniques. Keep in mind that availability is a means to an end and not an end in itself. You need to develop plans and procedures for recovering from a failure before you have one.
Please see the sidebar "RAID Tech Talk".
Digital Clusters for Windows NT Server
Digital Equipment * 800-344-4825
Web: http://www.windowsnt.digital.com/clusters/default.htm
Price: $995 per server
RAID Vendors
Mylex * 800-776-9539
Web: http://www.mylex.com
Micropolis * 800-395-3748 or 818-709-3325, option 4, for fax-back information
Web: http://www.micropolis.com/How_To_Buy.html (to find your local sales office)
Seagate Technology * 408-438-6550
Web: http://www.seagate.com