Inside Memory Management, Part 2
Look closely at the internal data structures NT uses to keep track of the state of memory.
August 31, 1998
Learn how NT keeps tabs on memory
Last month, I began this two-part series about memory management in Windows NT by introducing the concept of virtual memory. I reviewed the way x86 and Alpha processors use a two-level virtual address-to-physical address translation scheme. I discussed paging and introduced two powerful features of the Memory Manager: memory-mapped files and copy-on-write memory.
This month, I'll go into more detail about the internal data structures the Memory Manager uses to track the state of memory. I'll discuss working sets and the Page Frame Number (PFN) Database. I'll wrap up with a tour inside the additional data structures the Memory Manager uses to track memory shared by two or more applications, and I'll discuss Section Objects, the data structures the PFN Database uses to implement memory-mapped files.
Working Sets
The biggest effect the Memory Manager has on individual applications'performance and on the system is in its allocation of physical memory to each active process. The amount of memory the Memory Manager assigns to a process is called the working set. Every process has a working set. A special working set, called the system working set, is physical memory that belongs to parts of the NT Executive, device drivers, and the Cache Manager. If a process' working set is too small, the process will incur a high number of page faults as it accesses page table entries that describe data not present in memory but located either in the process' executable image on disk or in a paging file. For each such access, the Memory Manager must intervene and perform disk I/O to retrieve the data. If a process' working set is too large, the process will not incur page faults, but its physical memory might be holding data that the process will not access for some time, and that other processes might require. Thus, the Memory Manager must try to achieve a balance for all processes according to their memory usage patterns.
The choice NT makes for two memory-management policies--the page fetch policy and the page replacement policy--affects the way NT manages working sets. The page fetch policy is the method NT uses to bring in a process' data. NT uses the most common page fetch policy, demand paging. In demand paging, a process' data is paged-in as a process accesses it. A different page fetch policy, prefetching, requires the Memory Manager to aggressively bring in a process' data before the process asks for it. Prefetching can improve the performance of an application at the expense of other applications, because the Memory Manager's prediction of what data a process will ask for is likely to be imperfect and can waste physical memory in storing unused data. NT therefore implements a slight variation on demand paging: clustered demand paging. Instead of bringing into a process' working set only the page that a process accesses, the Memory Manager will attempt to bring in pages that surround the requested page. The idea behind clustered demand paging is that if a process accesses a page, it will likely also access the memory just preceding or following that page. The pages the Memory Manager thus brings in are known as a cluster. In NT, clusters can vary between 0 pages and 7 pages, depending on the amount of physical memory present on the system and whether the system is accessing code or data.
The page replacement policy affects the way the Memory Manager decideswhich page to remove from a working set. The Memory Manager must remove pageswhenever physical memory is full of in-use pages and space is necessary toaccommodate a page of data a process is accessing. Two characterizations forreplacement policies are global and local. In a globalreplacement policy, the Memory Manager considers all pages of physical memory asreplacement candidates, without regard to which working set the pages belong to.In a local replacement policy, the Memory Manager considers as replacementcandidates only pages in the working set of the process that is accessing thepage to be brought in. The disadvantage of global policy is that a rogue processcan, by accessing large amounts of memory very quickly, adversely affect allother processes in the system by forcing their data out of memory. As with itspage fetch policy, NT implements a variation of the replacement policy. Thisvariation is local, because it first considers the pages in the working set ofthe process that is accessing the data. But the variation is also global in thatit removes pages from the working sets of other processes if their memoryrequirements are low.
Replacement policies are further characterized by the method with whichthey choose a page to replace out of the set of local or global candidates. Onepolicy, first in, first out (FIFO), removes pages in the order in whichthey were added to the working set (i.e., the first page added is the first pageremoved). An algorithm that has a more positive effect on the performance ofapplications is least recently used (LRU). The LRU algorithm requiresthe aid of the MMU to give the Memory Manager an idea of how recently a processaccessed a particular page. LRU replaces first those pages that processes havenot accessed for the longest period of time. On uniprocessors, NT uses the clockalgorithm, a simple form of LRU replacement that is based on the limited accesstracking x86 and Alpha MMUs provide. Whenever a process references a page, theseMMUs set the Accessed flag bit in the page table entries (PTEs) of the page. Ifa page's Accessed flag is not set, the Memory Manager knows that no processeshave accessed the page since the last time the Memory Manager examined thepage's PTE.
The working set of every process, including the system working set, has aminimum size, a current size, and a maximum size. On a typical NT system with64MB of physical memory, the default minimum size of a working set is 200KB, andthe default maximum size is 1.4MB. Processes with the PROCESS_SET_QUOTAprivilege can override these values using a Win32 API. When a process starts, itgenerates a large number of page faults as it references data in its virtualmemory map (most often in its executable image) that it must bring into physicalmemory. The Memory Manager lets a process' current working set size grow asneeded to its working set maximum. When a working set reaches its maximum size,the Memory Manager will allow additional pages into the working set only ifplenty of free (unused) pages are available. If free pages are not available,the Memory Manager uses the replacement policies I just described to determinewhich page in the working set to replace. The pages in the working set arelinked in a list that the Memory Manager scans. On a uniprocessor, if the MemoryManager finds a page with its Accessed flag set, the Memory Manager clears theflag and proceeds to the following pages, selecting for replacement the nextpage it finds with a cleared Accessed flag. Thus, the Memory Manager avoidspages that the process has accessed since the previous scan, because the MemoryManager assumes that the process will access those pages again. If the MemoryManager finds no pages to select for replacement, it restarts the scan andalmost certainly finds a page whose Accessed flag it cleared on the previouspass and the process has not accessed since the first scan.
On a multiprocessor system, the Memory Manager does not clear Accessedflags during scans, because any modification to a PTE on a multiprocessor systeminvalidates Translation Look-aside Buffer (TLB) entries (which I described lastmonth) on all processors. If the Memory Manager invalidates an address' cachedtranslation by clearing its Accessed flag, the address must again undergo theslow three-step virtual memory translation process the next time a processreferences it. The result is an effectively random page-replacement algorithm onmultiprocessors. Microsoft is developing more effective replacement algorithmsfor the multiprocessor version of NT 5.0.
The Balance Set Manager
The Memory Manager adjusts the sizes of working sets once every second, in response to page-in operations or when free memory drops below a certainthreshold. The Memory Manager also examines all working sets to determinewhether the Balance Set Manager thread needs to trim them. If free memory isplentiful, the Balance Set Manager removes pages from only the working sets of processes whose current size is above minimum that have not recently incurred many page faults. However, if the Memory Manager awakens the Balance Set Manager thread because free memory is scarce, the Balance Set Manager can trim pages from any process until it creates an adequate number of free pages--even to the point of dropping working sets below their minimum size.
In addition to trimming working sets, once every 4 seconds the Balance SetManager can swap out pages that belong to the call stacks of threads that havebeen sleeping for more than 7 seconds (3 seconds on systems with less than 19MBof memory). The call stack is a special type of memory that records the functioninvocations a thread makes. If a thread has been asleep (e.g., while waiting fora user to enter a keystroke) for a long time, the Memory Manager assumes thethread will sleep for a longer time. If the Balance Set Manager swaps out thestacks of all the threads in a process, it trims the process' working set tonothing, and the system assumes that the process is inactive. Swapping out thestacks of all the threads in a process minimizes the footprints of processesthat don't require memory because they are waiting for infrequent events tooccur before they become active.
The Page Frame Database
Last month, I described how NT organizes physical memory as a list of pageframes, each of which is one page in size (4096 bytes on x86 processors, 8192 bytes on Alphas). The Memory Manager mirrors this list of page frames with its own list--the PFN Database. Each entry in the PFN Database corresponds to a physical page of memory (e.g., entry 5 in the PFN Database corresponds to page frame 5 in physical memory). Entries in the PFN Database record information about the state of data stored in corresponding physical pages.
A physical page can exist in one of eight states, which Table 1, page 64, describes. Figure 1 illustrates how the Memory Manager adds to a list all thePFN Database entries that belong to pages in the same state (except pages in thevalid or transition states). For example, all modified pages go on the modifiedpage list. A given page undergoes several status changes during its lifetime;therefore, its PFN entry moves among different lists. Figure 1 shows how workingsets are associated with valid pages in the PFN Database.
Let's look at some of these state changes, beginning with a valid page.Figure 2 depicts the state changes I'll discuss. A page is in the valid state ifit is currently part of a process working set. A valid page can be dirtyor clean--a dirty page is one a process has written to, so that thepage's content in physical memory might be different from its content in abacking file (either a paging file or a memory-mapped file). The MMU sets a bitin a page's PTE when a process writes to that page.
When the Balance Set Manager trims a valid page from a working set, thepage moves to one of three lists. If the page is clean, it will move to thestandby list. If the page is dirty, it almost always moves to the modified pagelist. In some cases, as when a file system accesses pages backed by files thefile system creates to represent its metadata (such as the FAT on FAT filesystems, or the MFT--Master File Table--on NTFS file systems), a dirty pagemoves to the modified no-write page list instead. NTFS takes advantage of thistype of marking so that it can ensure that pages belonging to NTFS logging fileswrite to disk before file metadata writes to disk.
A page on the modified page list moves to the standby page list after theMemory Manager's modified page writer writes the page's data to disk. However,while the modified page writer is writing its data to disk, the page makes astop in the transition state (which has no list). A page on the modifiedno-write page list moves to the standby page list when the file system instructsthe Memory Manager to move the page. This instruction comes after a log file is safely written to disk, for example. When the number of pages on the modified page list exceeds a threshold, or when free memory drops below athreshold (based on the amount of physical memory on the system and determined during the boot), one of two modified page writer threads wakes up to perform disk I/O that sends the data to a paging file or a data file.
A page on the standby page list moves to the free page list when theprocess that had the page in its working set either frees the virtual memoryassociated with the page, or the process terminates (which is another way offreeing the virtual memory). An interesting optimization based on the lists I'vealready discussed is possible. If a page moves from the working set of a processto the standby page, modified page, or modified no-write page lists, and if thatprocess subsequently accesses the page before the Memory Manager assigns thepage to a different process, the Memory Manager can place the page back in theprocess' working set. This operation is known as soft-faulting (or soft-pagefaulting) memory back to a process. Soft-faulting is a way the MemoryManager can tentatively remove memory from a process but give it back quickly ifthe process reaccesses the memory in a short amount of time.
Pages on the standby page list move to the zeroed page list after a specialthread, called the zero-page thread, clears their content. The zero-page threadexecutes in the background at priority 0. It runs only if no other thread canrun, and its job is to move pages from the free page list to the zeroed pagelist as it clears their content.
When a process accesses a page and the Memory Manager decides that theprocess' working set can expand, the Memory Manager checks the zeroed page listto find a physical page to assign to the process. If the Memory Manager finds apage in the zeroed page list, it can immediately use the page. If the zeroedpage list is empty, the Memory Manager checks the free page list. If the freepage list is also empty, the Memory Manager checks the standby page list.Finally, if the standby page list is empty, the Memory Manager's thread waitswhile the Balance Set Manager trims a page from a working set, or until a pageshows up on the standby page, free page, or zeroed page lists (e.g., a pagemight be in transition, and after it writes to disk it appears on the standbylist).
The necessity of zeroing a page before assigning it to the working set of adifferent process is a C2 security requirement. (For more information about NT'sC2 security rating, see "Windows NT Security, Part 1," May 1998.) Theoperating system (OS) must reinitialize all OS resources (e.g., memory, diskspace, objects) before reassigning them to prevent the creation of securityholes in which one process can see another process' potentially sensitive data.In some cases, the process does not require zero-filled memory, as when theMemory Manager allocates a page to store data read from a memory-mapped file. Inthese cases, the Memory Manager checks the free list for an available pagebefore it checks the zeroed list.
The final list is the bad page list. As its name implies, the bad page listis an off-limits holding area in which the Memory Manager, with the support ofmemory parity error detection, places pages it has detected as faulty.
The number of pages that are on the standby page, free page, and zeroedpage lists defines the free memory that various memory-tracking tools (e.g., theTask Manager) report. The commit total is the amount of currentlyallocated memory that paging file space backs. If the Memory Manager pages thedata in commit total memory out of physical memory, the Memory Manager storesthat data in a paging file. The commit limit is the amount of committotal memory the Memory Manager can allocate without expanding the sizes ofexisting paging files.
Shared Memory and Prototype PTEs
PFN Database entries contain varying information, depending on which statecorresponding pages are in. In most cases, a PFN Database entry contains apointer to a PTE that references a page. However, if two or more processes sharethe same page, multiple PTEs reference the page: one PTE in the virtual addressmap of each process sharing the page. Instead of pointing the PFN Database entryat one of these PTEs, the Memory Manager points the PFN Database entry at a datastructure the Memory Manager allocates, called a Prototype PTE (PPTE), asFigure 3 shows. I'll describe the way the Memory Manager uses PPTEs to manage shared and mapped memory.
I explained last month that to share memory, a process must create aSection Object. Section Objects hold information about the name of a file, its size, and what portions of it are mapped. Section Object creation defines the shareable data, and each additional process that wants to participate in the sharing must map a portion of the data into its address space. This mapping is known as mapping a view of a section, because processes might map only a portion of the data that a Section Object defines.
When a process allocates a Section Object, the Memory Manager allocatesanother data structure called a Segment. Segments contain storage to hold enough PPTEs to describe all the pages in the Section Object. Usually, when a page moves from a process' working set to the standby page, modified page, or modified no-write page lists, the Memory Manager marks the page's PTE invalid and sets a bit in the page to indicate that the page can be soft-page faulted, which means the PFN of the page stays in the PTE. PTEs that the Memory Manager marks as invalid for shared pages do not continue to store PFNs; rather, the Memory Manager updates them to point at the shared page's PPTE. This trick makes it easy for the Memory Manager to update the PFN of a shared page withoutmanually updating the PTEs that refer to the page in the address spaces of allthe processes that reference the page.
Consider an example in which two processes share a page. When the page'sdata is in memory, each process has a valid PTE that stores the PFN where thepage's data resides in physical memory. If the Memory Manager removes the PTEsfrom the working sets of both processes and sends the page's data to a pagingfile, the PTEs are both invalid and contain pointers to the PPTE. When theMemory Manager brings the page's data back into physical memory, the MemoryManager updates the PPTE to reflect the page's new PFN. When one of theprocesses tries to access the page, it generates a page fault. Then the MemoryManager looks at the PTE, finds the new PFN in the PPTE the PTE points to, marksthe PTE as valid, and updates its PFN. When the second process accesses thepage, the Memory Manager updates that process' PTE similarly. Without thisoptimization, the Memory Manager needs to track down both PTEs (or more, if agreater number of processes shared the page) and update them when it brings thepage back into memory--an expensive and potentially wasteful operation.
A reference count in the PFN Database tracks the number of processes thataccess a shared page, and the Memory Manager knows that when the count becomeszero, the page is not marked valid in any working set. At that point, the MemoryManager moves the page to the standby page or modified page lists.
Memory-Mapped Files
Memory-mapped files are a special form of shared memory. When a process maps a file into its address space, the Memory Manager creates several support data structures that aid the process' interaction with file systems. Figure 4 shows how these data structures are related. A File Object represents the file on disk, and the File Object is the target of all I/O the Memory Manager performs on the file. The Section Object points at its Segment, which contains the PPTEs for the Section Object. The Segment points to a control area that the File Object also points at. This control area is the nerve center for a mapped file, so that even if a process creates more than one Section Object for a file, there is still only one shared con-trol area.
When a process references a virtual address that the process indicatesshould be backed by a file, the Memory Manager examines the Virtual AddressDescriptor (VAD) that describes the memory range, then locates the control area.The Memory Manager allocates a page of physical memory to bring the requesteddata into and updates the PTE for the virtual address map. Because the MemoryManager finds the File Object through the control area, the Memory Manager caninitiate an I/O operation to the file system that owns the disk on which thefile resides. The operation reads the page from the file on disk (the page is inthe transition state while the I/O is in progress). After the operation readsthe page, the Memory Manager marks the PTE as valid and lets the processcontinue its attempt to access the data, which is now present.
A caveat with regard to memory-mapped files exists: Files can map as datafiles or as images. Files map as images when the NT Process Manager loads themfor execution. The same file can map as a data file and an image, andmaintaining separate control areas for data files and images lets NT ensure the consistency of the different mappings that different processes make.
More on Memory
Last month, I claimed that memory management is one of the most complextasks an OS faces. In this two-part series on memory management, I've provided only an overview of the policies and mechanisms NT implements to provide applications with memory resources appropriate to their needs and the needs of other concurrently running programs. If you want to learn more about the Memory Manager, I recommend Inside Windows NT, Second Edition, by David A. Solomon (Microsoft Press).
Next month, I'll cover a subsystem that's closely tied to the MemoryManager--the Cache Manager. I'll discuss how the Memory Manager sizes systemworking sets (including the file system cache) differently from process working sets.
About the Author
You May Also Like