Inside the Blue Screen
Find out how and why NT generates blue screens, how to interpret the cryptic data NT lists on them, and how to troubleshoot them.
November 30, 1997
Understand the clues the blue screen provides
The color blue has become synonymous with disaster in the Windows NT world. Although NT is more reliable and stable than its cousins, Windows 3.x and Windows 95, it nevertheless is subject to the frailties of third-party software, add-on peripherals and their device drivers, and Microsoft's bugs. Almost everyone who has used NT for any length of time has seen a blue screen (also known as the blue screen of death). Screen 1, page 58, displays a typical example. NT stops processing and paints one of these displays whenever it has encountered a situation in which it cannot continue, or in which continuing may lead to data corruption.
What most users and many developers don't know is what the screen's information means. If you're lucky, simply resetting the computer will get you on your way. If you're unlucky, you'll repeatedly get a blue screen every time you start NT or perform a particular operation (e.g., inserting a new floppy). Even if you've successfully moved past a blue screen with a reboot, understanding the clues it provides can help you avoid future blue screens or give you a hint about what driver or piece of hardware is causing problems.
This month, I'll talk about how NT generates blue screens, what leads to their appearance, how to interpret the cryptic data NT lists on them, and how to go about troubleshooting them. I'll tackle the topic from the perspective that NT device drivers are not your forte and that debugging a blue screen with dump analysis tools or a kernel-mode debugger is infeasible. In the process, I'll describe the inner workings of NT's kernel mode. (For a different angle on blue screens, see Mark Edmead, "The Blue Screen of Death," June 1997.)
NT Architecture Basics
To understand what leads to a blue screen, you first need to understand NT's basic architecture. NT executes in two modes, user mode and kernel mode, as shown in Figure 1, page 59. Kernel mode is a highly privileged processor mode, with direct access to all hardware and memory; user mode is a less privileged mode, with no direct access to hardware and restricted access to memory.
User mode is the mode in which applications and operating system environment subsystems execute. The operating system environments that NT supplies include POSIX, OS/2, Win16, DOS, and Win32. Applications are clients of exactly one environment subsystem and use only the APIs that subsystem exports. Thus, Win32 programs are clients of the Win32 subsystem and use only the Win32 API.
The subsystems use basic NT services that the NT Executive and the Microkernel provide. These services run in kernel mode. The Executive includes core operating system components: the Process Manager, Virtual Memory Manager, I/O Manager, Local Procedure Call (LPC) Facility, Object Manager, and Security Reference Monitor. The Executive is generally portable across processor architectures (e.g., Alpha, x86), and it relies on the Microkernel for processor-specific functions such as context-switching (scheduling) and synchronization primitives.
Beneath the Microkernel resides the Hardware Abstraction Layer (HAL), through which the Executive subsystems and the Microkernel interface with the processor. Microsoft ships different HALs for different processors and processor boards.
Device drivers are modules that interface NT and applications to specific hardware devices. A large number of device drivers for disk drives, video cards, modems, network cards, and input devices ship with NT. However, hardware vendors can include custom device drivers with their hardware, and NT dynamically adds the drivers to its kernel-mode environment.
User Mode vs. Kernel Mode
What differentiates user mode from kernel mode is the privilege level. A program executing in user mode runs in a sandbox (not unlike a Java virtual machine's sandbox) that the NT Executive and the program's operating system environment create for the program. The sandbox enforces restrictions as to what the program can do. One type of restriction relates to what parts of the computer's memory the program can reference and in what ways.
Figure 2 shows the virtual memory map that NT creates for applications. Addressable memory totals 4GB, but NT evenly divides the space between the memory assigned to a program and the memory that the kernel-mode portion of NT uses.
The lower 2GB mapping changes, depending on which program is currently running. For example, if Microsoft Word is running, NT places Word's address mapping in the lower 2GB; if Netscape Navigator runs next, its mapping replaces Word's mapping.
The upper 2GB mapping always remains that of the Executive, Microkernel, device drivers, and HAL. Thus, the split between user mode and kernel mode also shows up in NT's address space mapping. (In NT Server 4.0, Enterprise Edition, you can adjust the address split between user mode and kernel mode so that applications have 3GB of memory, with 1GB left for NT's Executive, drivers, and HAL. You will see this split only when NT is running on systems with several gigabytes of physical memory.)
The primary memory restriction placed on user-mode programs is that they cannot access any of the kernel-mode memory. User-mode programs also cannot access invalid portions of their mapping (i.e., portions not filled with data or code from the program). This arrangement contrasts with the kernel-mode portions of NT, which have free rein over the entire address map. For example, NT does not stop a device driver from writing data into Word's address map, but NT prevents Word from writing over the device driver's image.
The user-mode sandbox enforces another restriction that limits a program's ability to directly access hardware devices such as disks, the video screen, and the printer. Programs must typically go through their operating system environment (e.g., Win32) to read data from or write data to a peripheral. The operating system environment then usually calls on the services of the Executive in kernel mode, effectively forwarding the request. The Executive finally completes the request, sometimes with the aid of a device driver, but almost always with the use of functions in the HAL that interface with the computer's hardware. NT implements the transition between user mode and kernel mode as a system call gateway, through which the passage of data is precisely controlled.
Although a user-mode program can try to directly communicate with a hardware device, NT prevents it from doing so. Any kernel-mode component, however, can touch any part of the hardware. For example, a device driver implemented to interface with a disk drive can access video hardware without NT stopping it.
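As a rough illustration of the address-map rule just described, the following Python sketch models only the user/kernel boundary. It is not NT code: the function name may_access and its parameters are made up for this example, and it deliberately ignores the second restriction (invalid, unmapped portions of the map).

```python
# Toy model of the 4GB address map; names here are illustrative, not NT APIs.
USER_KERNEL_BOUNDARY = 0x80000000  # 2GB: where the kernel-mode half begins
ADDRESS_SPACE_END = 0x100000000    # 4GB: total addressable memory


def may_access(address, in_kernel_mode, boundary=USER_KERNEL_BOUNDARY):
    """Return True if code running in the given mode may reference address.

    Only the user/kernel split is modeled; whether the address is actually
    backed by valid data or code is not checked.
    """
    if not 0 <= address < ADDRESS_SPACE_END:
        raise ValueError("address outside the 4GB address space")
    if in_kernel_mode:
        return True            # kernel mode has free rein over the whole map
    return address < boundary  # user mode is confined to the lower portion
```

Passing boundary=0xC0000000 models the 3GB/1GB split mentioned for NT Server 4.0, Enterprise Edition.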
What do I mean when I say that NT stops user-mode programs from reaching outside their sandboxes to touch memory that isn't theirs or access hardware devices directly? You probably have seen the result of such an attempt, which Screen 2 shows. The infamous Dr. Watson dialog box signals that NT caught a program doing something illegal, and NT is terminating the program. The detection of such transgressions takes place in a kernel-mode subsystem such as the Process Manager or the Virtual Memory Manager. Some legal user-mode operations (e.g., referencing memory whose contents currently reside in the paging file) generate processor exceptions, but a program can also trigger exceptions when it steps outside its sandbox. A kernel-mode component must determine whether an exception is the result of a legal or an illegal operation; when a kernel-mode component catches an illegal exception, it notifies the Dr. Watson user-mode application. With the help of hardware support in the processor, the kernel-mode portions of NT keep user-mode applications constrained to acceptable activity and prevent user-mode applications from corrupting other applications or crossing the boundary between user mode and kernel mode other than through the gateway.
Kernel-Mode Rules
Thus far, my explanation implies that kernel-mode device drivers and subsystems do not execute in a sandbox and can do anything they want. Well, this implication is almost true. Portions of the memory map are undefined, and consequently, those portions are invalid regardless of what tries to access them. For example, if the space between 3GB and 4GB in the address map is not defined, a device driver accessing that portion of the map will cause a processor exception. In this example, the Virtual Memory Manager will recognize that a kernel-mode device driver has tried to touch invalid memory. For exceptions that originate in user mode, a kernel-mode subsystem handles the exception.
Kernel-mode components also have rigid rules about what they can do when the processor is in different states. I'll summarize the key ideas behind these rules in the next few paragraphs, but for details, see my November 1997 column, "Inside NT's Interrupt Handling."
Each processor in an NT system has an associated Interrupt Request Level (IRQL) that changes as the processor's interrupt controller fields various software and hardware interrupts. Although IRQLs have almost nothing to do with scheduling priorities, you can think of IRQLs as priorities in the sense that the interrupt controller blocks out interrupt requests with lower IRQLs while the processor is handling interrupts with higher IRQLs. In its design, NT attempts to keep the IRQL at Passive Level, where no interrupts are blocked out, as much as possible. The NT scheduler executes at Dispatch Level, and NT services hardware interrupts at even higher IRQLs.
Only when the IRQL is below Dispatch Level can kernel-mode components access pageable memory or cause scheduling operations. Pageable memory includes all user-mode application memory and portions of kernel-mode memory. Pageable memory gets its name from the fact that its data can be temporarily moved from the processor's physical memory to a paging file on disk and brought back when needed. When a kernel-mode component (such as a device driver) accesses part of the memory map referring to pageable memory that has data in a paging file, it triggers a processor exception (the same one that's triggered when a component references invalid memory), and the Virtual Memory Manager must retrieve the data. However, if the IRQL is Dispatch Level or higher, the Virtual Memory Manager cannot be invoked.
The scheduler's IRQL is Dispatch Level, so a device driver cannot yield control of a processor to another program or kernel-mode component if the IRQL is at Dispatch Level or higher. To do so would force the invocation of the scheduler, which would detect that it had been called at an illegal processor state.
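The IRQL rules above can be condensed into a small Python sketch. It is a toy model rather than kernel code: the function names and the outcome strings are invented for illustration, and the numeric levels are the values used on x86 systems (other platforms may number them differently).

```python
# Toy model of the IRQL rules; function names and outcome strings are
# illustrative only. Numeric levels follow the x86 convention.
PASSIVE_LEVEL = 0   # normal execution, no interrupts blocked out
APC_LEVEL = 1
DISPATCH_LEVEL = 2  # the scheduler runs here; device interrupts are higher


def can_touch_pageable_memory(irql):
    """Pageable memory may be referenced only below Dispatch Level."""
    return irql < DISPATCH_LEVEL


def can_yield_processor(irql):
    """Yielding invokes the scheduler, which is legal only below Dispatch Level."""
    return irql < DISPATCH_LEVEL


def pageable_access_outcome(irql, page_is_resident):
    """Describe what happens when kernel-mode code touches pageable memory."""
    if page_is_resident:
        return "ok"                     # data is in physical memory, no exception
    if can_touch_pageable_memory(irql):
        return "page fault serviced"    # Virtual Memory Manager retrieves the data
    return "stop"                       # VMM cannot be invoked at this IRQL
```

Note the first branch: a driver that touches pageable memory at Dispatch Level gets away with it whenever the page happens to be resident, which is one reason such bugs can surface only intermittently.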
Where Do Blue Screens Come From?
So where am I going with all this information? As I stated earlier, illegal processor exceptions that user-mode applications cause usually result in application termination and a Dr. Watson message, but the rest of the system continues.
When a kernel-mode device driver or subsystem causes an illegal exception, NT faces a difficult dilemma. It has detected that a part of the operating system with the ability to access any hardware device and any valid memory has done something it wasn't supposed to do.
NT could just ignore the exception and let the device driver or subsystem continue as if nothing had happened. The possibility exists that the error was isolated and that the component will somehow recover, letting NT limp along. What's more likely is that the detected exception resulted from deeper problems, for example, from a general corruption of memory or from a hardware device that's not functioning properly. Permitting the system to continue operating will probably result in more exceptions, and data stored on disk or other peripherals can become corrupt, a risk that's too high to take.
A device driver or subsystem also might realize that something is not quite right. For example, a subsystem might call a function in a device driver when the processor IRQL is Passive Level. If the function returns and the IRQL has changed, the device driver has somehow modified the IRQL without restoring it, which reveals a bug in the driver. As device drivers and subsystems execute, they require certain operations to succeed or return results within a valid range. For instance, if the Configuration Manager tries to read a Registry file from the disk and encounters an error, the Configuration Manager might not be able to continue processing without risking damage to the Registry.
To stop a system in the face of kernel-mode exceptions and to provide a systems administrator or developer information about what has happened, NT exports the KeBugCheck function for use by kernel-mode device drivers, subsystems, and the Microkernel. This function takes a Stop Code and four more parameters that are interpreted on a per-Stop Code basis. After KeBugCheck masks out all interrupts on all processors of the system, it switches the display into blue screen mode (80 columns by 50 lines text mode), paints a blue background, and begins to print information about the system's state.
Mapping the Blue Screen
The blue screen contains five areas of text from top to bottom: the Stop Code, system information, a list of loaded drivers, the stack trace, and an administrative message. In Screen 1, blank lines separate these areas. Some areas might be missing in a blue screen if the system state is too corrupt for NT to fill them in.
The administrative message tells you to contact your systems administrator if you have a chronic blue screen problem on your system. The most useful portion of the display is usually the Stop Code area. This area lists the Stop Code and the four additional parameters passed to KeBugCheck. In Screen 1, the Stop Code is 0x0000000A, and the additional parameters appear inside the parentheses after the Stop Code.
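If you copy the Stop Code line down by hand, a few lines of Python can split it into the Stop Code and its four parameters. This is only a sketch: parse_stop_line is a made-up helper, and the pattern assumes the line looks like "*** STOP: 0x... (0x...,0x...,0x...,0x...)"; adjust it if your screen formats the line differently.

```python
import re

# Assumed layout: "*** STOP: <code> (<p1>,<p2>,<p3>,<p4>)", all values in hex.
STOP_LINE = re.compile(r"\*\*\*\s*STOP:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")


def parse_stop_line(line):
    """Return (stop_code, [parameters]) as integers, or None if no match."""
    match = STOP_LINE.search(line)
    if match is None:
        return None
    stop_code = int(match.group(1), 16)
    parameters = [
        int(field.strip(), 16)
        for field in match.group(2).split(",")
        if field.strip()
    ]
    return stop_code, parameters
```

For example, parse_stop_line("*** STOP: 0x0000000A (0x00000000,0x0000001A,0x00000000,0x8013F2A4)") returns the Stop Code 10 (0xA) and a list of four integers. The parameter values in that sample line are invented for illustration.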
The Stop Code is a number that represents the nature of the detected problem. The bugcodes.h file in the Windows NT Device Driver Kit contains a complete list of the 150 or so Stop Codes. However, you will typically encounter only four or five of them. The text line below the Stop Code provides the text equivalent of the Stop Code numeric identifier. I'll discuss some of the common Stop Codes a little later.
Interpreting the additional Stop Code parameters rarely provides any insight into a problem for anybody other than a device driver writer (or a member of the Microsoft NT development team). Fortunately, NT does some interpretation for us. KeBugCheck scans the parameters for one that looks like it might be an address pointing to the memory image of an Executive subsystem or a device driver. When KeBugCheck finds one, it prints the parameter, the base address of the module the parameter is in, and the name of the module. This last piece of information is crucial, and I'll describe how you can use it a little later.
The system information area of the screen is below the Stop Code area, and it simply identifies the system's processor type (e.g., Pentium, x486) and NT's base build number (no Service Pack information appears). In Screen 1, the Build Number is 0xf0000565 (1381 in decimal), which is what you'll see for any NT 4.0 installation. An IRQL number also appears in this area, but a bug in KeBugCheck causes it to record the IRQL incorrectly.
Below the system information on the blue screen is the loaded driver area. Here you'll see a listing of all the registered device drivers at the time of the stop. KeBugCheck prints the name, base memory address, and date-stamp (the time a driver was built). Unless you develop device drivers, this information is useless.
Finally, just below the loaded driver area is a snapshot of the system stack at the time of the call to KeBugCheck. Each module (except the first one) in the list had invoked the module printed on the line above it and was waiting for a result. The system detected a problem while the module on the first line was executing, and often this module matches the module shown in the Stop Code area (Ntfs.SYS in Screen 1).
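You can approximate the same address-to-module lookup yourself from the loaded driver list. The Python sketch below is not how KeBugCheck is implemented; find_module is a hypothetical helper that picks the module with the highest base address at or below the parameter. Because the blue screen lists base addresses but not image sizes, the result is a best guess: an address beyond the end of the last module would still be attributed to that module.

```python
import bisect


def find_module(address, modules):
    """Guess which module contains address.

    modules is a list of (base_address, name) pairs copied from the loaded
    driver area. Returns the pair with the highest base <= address, or None
    if the address lies below every listed module.
    """
    ordered = sorted(modules)
    bases = [base for base, _name in ordered]
    index = bisect.bisect_right(bases, address) - 1
    if index < 0:
        return None
    return ordered[index]
```

With an invented driver list such as [(0x80010000, "hal.dll"), (0x80100000, "ntoskrnl.exe"), (0xF9C40000, "Ntfs.SYS")], the parameter 0x8013F2A4 resolves to ntoskrnl.exe.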
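The hexadecimal-to-decimal conversion quoted above is easy to check. In the Python sketch below, split_build_number is an illustrative helper, and the choice to treat the low 16 bits as the build number and the top 4 bits as separate flag bits is an assumption based on the sample value; the article does not say what the leading 0xf means, so the sketch returns it uninterpreted.

```python
def split_build_number(raw):
    """Split a raw Build Number into (high_bits, build).

    Assumption: the build number occupies the low 16 bits and the top
    4 bits carry unrelated flags, which are returned as-is.
    """
    high_bits = (raw >> 28) & 0xF
    build = raw & 0xFFFF
    return high_bits, build


high_bits, build = split_build_number(0xF0000565)
print(hex(high_bits), build)  # 0xf 1381
```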
Interpreting the Blue Screen Information
So, what do you do with the data the blue screen provides? Many times, all you can do is reset the system and hope that the blue screen doesn't happen again. But sometimes an important clue is lurking in the Stop Code area or stack trace that can help you take a more proactive approach to ridding the system of the blue screen.
First, the Stop Code can provide all the information you need to identify the problem. The sidebar, "Common Stop Codes," page 62, lists several Stop Codes, their causes, and some suggestions about what to do if you encounter one. The Microsoft Windows NT Workstation Resource Kit contains more information about Stop Codes.
Often, you begin seeing blue screens after you install a new software product or piece of hardware. If you've just added a driver, rebooted, and got a blue screen early in system initialization, you can reset the machine and press the space bar when instructed, to get the Last Known Good configuration. Enabling Last Known Good causes NT to revert to a copy of the Registry's device driver registration key (HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services) from the last successful boot (before you installed the driver).
If you keep getting blue screens, an obvious approach is to uninstall the things you added just before the appearance of the first blue screen. If some time has passed since you added something new or you added several things at about the same time, you need to note the names of the modules you see in both the Stop Code and stack trace areas. Note that ntoskrnl.exe refers to the image that contains all NT's core kernel-mode subsystems as well as the Microkernel.
If you recognize any of the module names as being related to something you just added (such as scsiport.sys if you put on a new drive), you've possibly found your culprit. Many device drivers have cryptic names, so one thing you can do to figure out which application or hardware device is associated with a name is to run the Regedit Registry viewing tool the next time you boot the system or on a similarly equipped machine. Search for the name of the driver under the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services key. This branch of the Registry is where NT stores registration information for every device driver in the system. If you find a match, look for a value called DisplayName. Some drivers fill in this value with a name descriptive of the device driver's purpose. For example, you might find Virus Scanner, which can implicate the antivirus software you have running.
You can also search Microsoft's online Knowledge Base (http://www.microsoft.com) for the Stop Code and the name of the suspect hardware or application. You might find information about a workaround, an update, or a Service Pack that fixes the problem you're having.
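If you would rather not browse the Registry by hand, one option is to export the Services key to a text file with Regedit and search the export. The Python sketch below makes several assumptions: the export uses Regedit's plain-text layout ([key] lines followed by "name"="value" lines), the driver's Services subkey has the same name as its file without the extension (common, but not guaranteed), and escaped characters inside values are not decoded. The helper names display_names and describe_driver are made up for this example.

```python
SERVICES_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services"
VALUE_PREFIX = '"DisplayName"="'


def display_names(export_text):
    """Map each driver's Services subkey name (lowercased) to its DisplayName."""
    prefix = (SERVICES_KEY + "\\").upper()
    names = {}
    current = None  # subkey whose values are being read, if it is a direct child
    for raw_line in export_text.splitlines():
        line = raw_line.strip()
        if line.startswith("[") and line.endswith("]"):
            key = line[1:-1]
            subkey = key[len(prefix):]
            is_direct_child = key.upper().startswith(prefix) and "\\" not in subkey
            current = subkey if is_direct_child and subkey else None
        elif current and line.startswith(VALUE_PREFIX) and line.endswith('"'):
            names[current.lower()] = line[len(VALUE_PREFIX):-1]
    return names


def describe_driver(file_name, export_text):
    """Return the DisplayName for a module such as 'vscan.sys', or None."""
    stem = file_name.rsplit(".", 1)[0].lower()
    return display_names(export_text).get(stem)
```

A result of None means only that no matching DisplayName was found; as noted above, not every driver fills in that value.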
Setting the Blue Screen Options
Instead of just halting the system with a blue screen, you can have NT log an event to the system log, send you an administrative alert, write a dump of the machine's physical memory to disk, or automatically reboot the computer. You can configure these options on the Startup/Shutdown tab of the System applet in Control Panel, as shown in Screen 3, page 64.
If you want to track how often a computer runs into problems, select the option to record the event in the system log, which you can view with the Event Viewer administrative tool. In general, you won't want the machine's memory written to disk unless you have a chronic problem that a particular hardware vendor or Microsoft will help you debug. In this case, be prepared to copy a file as large as the computer's memory (e.g., 128MB for a 128MB machine) to send for debugging. Contact the hardware vendor or Microsoft for instructions about where and on what medium to send the dump.
Finally, automatic rebooting is an option you want to enable if your machine is performing a task for which you want to minimize downtime. If you have a Web server that configures itself automatically when NT starts, automatic rebooting after a stop will keep your site offline for as little time as possible.
At Wit's End
Unfortunately, you can't run a magical program to identify the exact cause of blue screens or make them go away. Even with extensive knowledge of NT internals and device drivers, you'll still find that reading a blue screen and trying to figure out what happened is a little like fumbling around in a dark room. However, the next time you're unpleasantly surprised with a blue display, you might find some solace knowing what's going on behind the scenes: a subsystem or driver made a call to KeBugCheck to provide the information in the different areas of the screen.