Sizing Your NT RAID Array
Know how to troubleshoot bottlenecks in your disk subsystem, and you can properly size a RAID array to give you breathing room for future workloads.
July 31, 1998
Obtaining maximum performance while adding storage
How do you know where to start when you need to add storage capacity to your existing Windows NT solution? You can't simply add more disk space and expect to improve performance. As you increase your storage capacity in an enterprise environment, you need to be able to detect bottlenecks in your RAID subsystems, know which RAID levels to consider, and know how to size your RAID arrays according to current and future performance requirements.
If you're new to RAID or just need to brush up on your technology, see Table 1, page 186, for a comparison of RAID levels and definitions or go to the RAID Advisory Board Web site (http://www.raid-advisory.com) for an extensive RAID review. The disk subsystem is one of the most flexible resources you can configure in NT. How well you design your disk subsystem can drastically influence NT's overall performance.
Getting the Big Picture
Before you can detect a disk subsystem bottleneck, you need to determine whether your system is suffering from other bottlenecks associated with the CPU, memory, disk, network, applications, clients, and NT resources. (For information about tuning NT to improve performance, see "The Beginner's Guide to Optimizing Windows NT Server," part 1 and part 2, June and August 1997.) If you add resources to an area of NT that isn't throttling your system's performance, you won't improve NT's overall performance. Tuning a resource or purchasing additional hardware only to find that your efforts were in vain can be frustrating. Assuming your disk subsystem is causing the only bottleneck on your NT system, you can take several steps to detect and correct the bottleneck and improve your system's disk performance.
Detecting Single-Disk Bottlenecks
Detecting a bottleneck in your disk subsystem is an important first step in helping you determine how much additional disk space and disk performance capacity you need. On NT systems with one hard disk, the disk becomes a bottleneck that throttles the system when the disk can't keep up with the requested workload. As a result, the disk's response time for processing application requests becomes unacceptable. This delay forces applications to wait on disk service.
NT's Performance Monitor is an excellent tool for detecting disk bottlenecks (for an explanation of Performance Monitor, see John Savill, "Troubleshooting NT Performance Monitoring," April 1998). To collect disk subsystem statistics for use with Performance Monitor, you must type
diskperf -ye
at the NT command prompt and reboot the server; otherwise, the performance counters will all report zero. The -y option tells NT to start the disk counters when you restart NT, and the -e option enables the disk counters you need to measure the performance of physical disks in striped disk sets. (You might not have a striped disk set now, but turning on these counters will save you from having to reboot later.)
Selecting Disk Counters
The number of disk-related counters that Performance Monitor provides can be overwhelming. A good counter to watch is %Disk Time, which is available under Performance Monitor's LogicalDisk object. The %Disk Time counter reports the percentage of elapsed time that the selected disk is busy servicing read or write requests. If %Disk Time averages 60 percent to 80 percent, the disk is not yet causing a bottleneck, but this level of activity warrants a closer look at the disk in question. When %Disk Time exceeds 80 percent, the disk is getting busy. At this level, the time the disk requires to service each request increases, and you need to closely monitor several other disk-related counters that are also available under Performance Monitor's LogicalDisk object.
The first of these additional counters is Avg. Disk Queue Length. This counter measures the average number of read and write requests that NT queued for the selected disk during a sample interval. A hard disk becomes a serious bottleneck when the Avg. Disk Queue Length exceeds 2 for a sustained period. When this condition occurs, applications are waiting to access the disk.
Another counter to watch when %Disk Time rises into or above the 60 percent to 80 percent range is Avg. Disk sec/Transfer. This counter measures the time in seconds of the average disk transfer (i.e., the time the disk needs to service each request). A disk can complete only so much work before its service begins to degrade. When disk performance begins to degrade, the Avg. Disk sec/Transfer increases dramatically. This increase affects NT's overall performance.
You will want to review the Disk Transfers/sec counter to determine the amount of work a disk is completing. This counter measures the rate of read and write operations (also known as the rate of input/output per second) on the selected disk. The amount of work a disk can support depends on the disk technology and the I/O workload the disk encounters. In my experience, an Ultra Fast/Wide SCSI 7200rpm disk encountering a mixed I/O workload (random, sequential, write, and read operations) supports approximately 50 to 100 disk transfers per second before its performance degrades. Monitoring the Avg. Disk sec/Transfer counter lets you observe this performance degradation.
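The single-disk thresholds above can be collected into a small check. The following Python sketch is only an illustration: the function name classify_disk is hypothetical, and the inputs are assumed to be averaged counter values that you have read from a Performance Monitor log yourself.

```python
def classify_disk(pct_disk_time, avg_queue_length):
    """Apply the article's rule-of-thumb thresholds to averaged counter values.

    Both arguments are assumed to be sustained averages for one disk,
    copied by hand from a Performance Monitor log.
    """
    findings = []
    if pct_disk_time > 80:
        findings.append("busy: %Disk Time above 80 percent")
    elif pct_disk_time >= 60:
        findings.append("watch: %Disk Time between 60 and 80 percent")
    # A sustained queue of more than 2 requests marks a serious bottleneck.
    if avg_queue_length > 2:
        findings.append("bottleneck: Avg. Disk Queue Length above 2")
    return findings or ["ok"]


# Hypothetical sample values, not measurements from the article.
print(classify_disk(45, 0.8))
print(classify_disk(95, 3.5))
```

A disk flagged here still deserves a look at Avg. Disk sec/Transfer and Disk Transfers/sec, because the thresholds only say where to look, not how much capacity to add.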
Detecting RAID Bottlenecks
RAID technology lets you group multiple hard disks and present them to NT as one logical disk device. To detect a RAID bottleneck, you use the single-disk bottleneck detection techniques I just described, but with a twist. The %Disk Time counter uncovers problems that are brewing in any RAID device. When you're attempting to detect a RAID bottleneck, RAID 0 (disk striping) is the easiest RAID level to work with. RAID 0 takes advantage of all the disks in the array equally. Thus, a three-disk RAID 0 array can support three times as much workload (i.e., disk requests) and three times as many outstanding disk requests (Avg. Disk Queue Length) as a one-disk configuration before becoming a bottleneck.
In a RAID 1 mirror with two disks, the array uses both disks for all write activities. To determine the workload that each disk in a RAID 1 mirror must support (i.e., the number of transfers per second), use the following equation: (disk reads/sec + [2 * disk writes/sec])/(number of disks in the RAID array). Today's RAID 1 mirrors use a two-disk configuration. Despite greater availability, RAID 1 arrays support a slightly lower workload in a write-intensive environment than systems with one hard disk. If your Avg. Disk Queue Length divided by the number of disks in the array exceeds 2, you have a serious bottleneck in a RAID 1 mirror.
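As a quick illustration of the RAID 1 equation and queue test, here is a minimal Python sketch. The function names are hypothetical, and the sample read and write rates are made-up values, not measurements from the article.

```python
def raid1_transfers_per_disk(reads_per_sec, writes_per_sec, disks=2):
    """Per-disk workload for a mirror: every write lands on both disks."""
    return (reads_per_sec + 2 * writes_per_sec) / disks


def raid1_queue_bottleneck(avg_queue_length, disks=2):
    """True when the queue per disk exceeds the article's limit of 2."""
    return avg_queue_length / disks > 2


# Hypothetical workload: 60 reads/sec and 40 writes/sec on a two-disk mirror.
print(raid1_transfers_per_disk(60, 40))
print(raid1_queue_bottleneck(5))
```

With these sample numbers each disk handles 70 transfers per second, which falls inside the 50 to 100 range quoted earlier for a 7200rpm disk.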
The RAID 5 disk stripe with parity environment is similar to a RAID 0 stripe set for read-intensive environments. A RAID 5 array with five disks supports almost five times as much workload and up to five times as many outstanding disk requests as a one-disk system before becoming a bottleneck. To calculate how many disk requests each disk in a RAID 5 array must support, use the following formula: (disk reads/sec + [4 * disk writes/sec])/(number of disks in the RAID array). A RAID 5 array's performance is different from a RAID 0 stripe set's performance because of additional disk activity associated with parity generation. In a RAID 5 array, parity information is spread across all the disks in the array for fault tolerance. To calculate this parity information, each RAID 5 write operation reads the data block, reads the parity block, exclusive ORs (XORs) the data, writes the data block, and writes the parity block. Thus, each write request in a RAID 5 array incurs four disk operations. This parity generation slows write operations in RAID 5 environments compared with RAID 0. However, this parity information lets you continue operations if one of the disks in the RAID 5 array fails. You can replace the failed disk and reconstruct the failed disk's data on the new disk using parity information from the other disks in the array.
You can use hardware-based RAID solutions to avoid the performance pitfall associated with generating this parity information. Hardware-based RAID controllers generate parity information using their own CPU, not the system's CPU. As a result, a system using a hardware-based RAID solution can handle more disk I/O operations than a software-based solution. An additional benefit of offloading parity generation to a hardware-based RAID solution is that you can recover and use processing power elsewhere on your system that might otherwise be wasted on disk I/O parity.
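The four disk operations behind a RAID 5 write can be sketched in Python. This is an illustration of the standard read-modify-write parity update (new parity = old parity XOR old data XOR new data), not a description of how NT or any particular controller implements it; all names and block contents below are hypothetical.

```python
def xor_blocks(*blocks):
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def raid5_small_write(old_data, old_parity, new_data):
    """Return the new parity block and the disk operations the write costs."""
    ops = ["read data", "read parity"]
    # Remove the old data's contribution to parity, then add the new data's.
    new_parity = xor_blocks(old_parity, old_data, new_data)
    ops += ["write data", "write parity"]
    return new_parity, ops


def raid5_transfers_per_disk(reads_per_sec, writes_per_sec, disks):
    """Per-disk workload: each write counts as four disk operations."""
    return (reads_per_sec + 4 * writes_per_sec) / disks


# A made-up stripe of three 2-byte data blocks plus parity.
d0, d1, d2 = b"\x0f\xf0", b"\x33\x55", b"\xaa\x01"
parity = xor_blocks(d0, d1, d2)
new_d1 = b"\x00\xff"
new_parity, ops = raid5_small_write(d1, parity, new_d1)
print(new_parity == xor_blocks(d0, new_d1, d2), len(ops))
```

The updated parity matches what you would get by recomputing parity over the whole stripe, yet the write touches only two disks, twice each. The same XOR property is what lets the array rebuild a failed disk from the surviving blocks.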
Sizing Additional Disk Capacity for RAID Arrays
If you evaluate your RAID array's performance and determine that the array is causing a bottleneck in your system, you can intelligently size additional storage capacity. Without the information that the Performance Monitor counters provide, you can only guess how much disk space you need to add to improve performance.
Adding a RAID-based disk subsystem to NT can improve a system's performance, availability, and manageability. However, you need to consider fault-tolerant support, cost, capacity, and performance when sizing RAID subsystems.
Determining How Many Disks to Add
How do you know how many disks you need to meet your performance requirements? The primary performance requirements for a RAID array are adequate throughput and response time. The workload you place on the array and the amount of work the RAID array can support (i.e., transfers per second) influence both requirements.
To help you know what steps you need to take when adding storage capacity to NT, let's look at an example. Imagine that you have a server with a RAID 5 array composed of three 4GB Ultra Wide SCSI 7200rpm hard disks. Having historical information to work from when adding storage capacity is helpful, so imagine that you've stress tested your NT file server using Bluecurve's Bi-Directional copy workload to simulate a file server workload (for information about Bluecurve, see Carlos Bernal, "Dynameasure Enterprise 1.5," September 1997).
From the Bluecurve stress test results, you learn that the maximum throughput that this configuration (configuration 1) provides at the 20-user level is 3.8MB per second (MBps) with a response time of 13.9 seconds. When you review the corresponding Performance Monitor log to determine what's happening inside NT during the tests, you see that the %Disk Time stays at 100 percent. As a result, I omitted this counter to ease viewing the chart you see in Screen 1, page 187. As the Disk Transfers/sec increases against the disk array, the Avg. Disk Queue Length grows to almost 16 and the average RAID array response time (i.e., Avg. Disk sec/Transfer) increases to 0.121 second, which is slow. This information indicates that this RAID array is causing a bottleneck. Now that you know a bottleneck is occurring, you can use this information to determine the most economical solution to remove the bottleneck and increase the usable disk capacity.
Estimating Required Additional RAID Performance Capacity
The Avg. Disk Queue Length for configuration 1 is 16, which exceeds the maximum recommended rating of 6 (3 disks * 2 outstanding requests each). Also, the maximum transfers per second are 139 ([126 + (4 * 73)] / 3) per disk, which exceeds the suggested workload that one disk can support. The combination of long queues and excessive numbers of transfers per second slows the Avg. Disk sec/Transfer response time to 0.121 second.
You want to limit each disk in the array to no more than two outstanding requests at a time, so you need a minimum of eight disks to remove the bottleneck. I recommend you replace the three-disk RAID 5 array with a 10-disk RAID 5 array. Adding two more disks than the system requires gives you some room for possible surges in workload and room to accommodate future requirements. This configuration removes the disk bottleneck and provides 36GB of usable storage capacity.
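The two checks for configuration 1 can be reproduced with a few lines of Python. This sketch assumes, following the RAID 5 formula given earlier, that 126 is the array's disk reads/sec and 73 is its disk writes/sec; the variable names are illustrative only.

```python
disks = 3
reads_per_sec, writes_per_sec = 126, 73  # assumed reads/writes, per the RAID 5 formula
avg_queue_length = 16

# Recommended ceiling: two outstanding requests per disk.
queue_limit = disks * 2
# Each RAID 5 write costs four disk operations.
transfers_per_disk = (reads_per_sec + 4 * writes_per_sec) / disks

print(queue_limit, avg_queue_length > queue_limit)
print(round(transfers_per_disk), transfers_per_disk > 100)
```

Both comparisons come out true: the queue is well past the limit of 6, and each disk is asked for about 139 transfers per second, above the roughly 100 that a 7200rpm disk sustains.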
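The disk-count reasoning can be written out as a small calculation. The Python sketch below is one way to express it, assuming a RAID 5 array in which one disk's worth of space holds parity; the function name size_raid5 and its parameter names are hypothetical.

```python
import math


def size_raid5(avg_queue_length, disk_gb, max_queue_per_disk=2, headroom_disks=2):
    """Return (minimum disks, recommended disks, usable GB) for a RAID 5 array."""
    # Enough disks that each holds no more than max_queue_per_disk requests.
    min_disks = math.ceil(avg_queue_length / max_queue_per_disk)
    recommended = min_disks + headroom_disks
    # RAID 5 gives up one disk's worth of capacity to parity.
    usable_gb = (recommended - 1) * disk_gb
    return min_disks, recommended, usable_gb


print(size_raid5(16, 4))  # (8, 10, 36)
```

The headroom of two disks is the article's judgment call for this example; adjust headroom_disks to match how much growth you expect.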
Graph 1 shows how the average response times of the RAID array in configuration 1 compare with those of the new configuration (configuration 2). Graph 2 shows how the throughput levels of the RAID array for configuration 1 compare with those of configuration 2. Configuration 2 lowered the average response time from 13.9 seconds to 9.2 seconds and improved the throughput from 3.8MBps to 4.9MBps at the 20-client level. The Avg. Disk Queue Length dropped from 16 to 12, and Avg. Disk sec/Transfer dropped from 0.119 second to 0.049 second. These results provide insight into the reason why the throughput and response time reported by the Bluecurve clients improved significantly. In addition, Performance Monitor reported that the RAID array provided greater than 7.34MBps of disk throughput while supporting a workload of 68 ([147 + (4 * 117)]/9) transfers per second per disk. This sizing solution provides improved performance with room to grow.
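To put the before-and-after figures on a common scale, you can express each as a percentage change. The Python sketch below uses only the numbers quoted above; the helper name pct_change is hypothetical.

```python
def pct_change(before, after):
    """Percentage change from the configuration 1 value to the configuration 2 value."""
    return (after - before) / before * 100


results = {
    "Response time (sec)": (13.9, 9.2),
    "Throughput (MBps)": (3.8, 4.9),
    "Avg. Disk Queue Length": (16, 12),
    "Avg. Disk sec/Transfer": (0.119, 0.049),
}

for name, (before, after) in results.items():
    print(f"{name}: {pct_change(before, after):+.1f}%")
```

Response time falls by roughly a third and throughput rises by a little under 30 percent, while the per-transfer service time at the array drops by more than half.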
Disk Storage Capacity vs. Disk Performance Capacity
In the example in this article, you learned how to determine the number of disks you need to add to a RAID array to remove a disk bottleneck and provide the necessary storage capacity. This example provides 36GB of usable disk storage capacity. So why did I suggest you create a RAID array using ten 4GB disks instead of five 9GB disks to provide 36GB of usable storage capacity? The answer has to do with the supported disk workload. Even though disk capacity increases from 4GB to 9GB, the workload each disk can support doesn't increase if the disks in the RAID array are from the same family (e.g., Ultra Wide SCSI 7200rpm). Regardless of the disk storage capacity, each 7200rpm disk can support only about 100 transfers per second. Thus, if you use five 9GB disks instead of ten 4GB disks, you meet the storage capacity goal of 36GB, but the RAID array is still a bottleneck. You can also use nineteen 2GB disks to provide even better performance, but this solution is economically prohibitive.
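A rough comparison of the three options makes the difference between storage capacity and performance capacity concrete. This Python sketch assumes a ceiling of 100 transfers per second per disk (the article's estimate for a 7200rpm disk) and takes the required load from the workload measured on the 10-disk array, reading 147 as reads/sec and 117 as writes/sec per the RAID 5 formula. The function name evaluate is hypothetical.

```python
PER_DISK_LIMIT = 100  # approximate transfers/sec one 7200rpm disk can sustain

# Disk operations per second the array must absorb (each write costs four).
required_ops = 147 + 4 * 117


def evaluate(disks, disk_gb, required=required_ops):
    """Return (usable GB, aggregate transfers/sec, meets the workload?)."""
    usable_gb = (disks - 1) * disk_gb      # one disk's worth of space holds parity
    aggregate = disks * PER_DISK_LIMIT     # performance scales with spindle count
    return usable_gb, aggregate, aggregate >= required


for disks, disk_gb in [(10, 4), (5, 9), (19, 2)]:
    print(f"{disks} x {disk_gb}GB:", evaluate(disks, disk_gb))
```

All three layouts deliver 36GB of usable space, but only the five-disk layout falls short of the 615 disk operations per second the workload generates. Cost is not modeled here, which is the reason the nineteen-disk option still loses out.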
Meeting Your Storage and Performance Needs
Understanding how to use and evaluate NT's built-in metrics and distinguishing between storage capacity and disk performance capacity are important. After you understand these concepts and the relationships among the counters that Performance Monitor provides, you can remove the guesswork associated with sizing your RAID array and meet your storage and performance needs. In a future article, I'll show you how you can tune your NT RAID solution for maximum performance.