Validate File Extensions

Orphaned file-extension mappings might point to viruses, malware, or other file-deletion problems

Dave Roth

March 20, 2007



Program files frequently vanish from Windows machines. Although essential to a program or tool, .exe, .dll, and other files commonly disappear. Users installing and uninstalling applications are often the culprit. They might upgrade an application by installing a new version on top of their existing version or try to remove the existing version by simply deleting its files. Viruses and their ilk can also delete files—as can hard-drive corruption and even system-cleanup utilities.

Whatever the cause, when a file disappears, its file-extension mapping is often broken. A file-extension mapping tells Windows what program to run when you double-click a file that has a particular file extension. Double-clicking a .doc file launches Microsoft Word, for example, and double-clicking a .pdf file runs Adobe Acrobat. Few things drive users crazier than trying to open a file whose file extension has become unmapped.

One way to gauge a machine’s health is to monitor for broken file-extension mappings. Ideally, all file extensions are mapped to some application, and if the application is missing, the file extension is unmapped. This month, I show you a Perl script that examines file mappings to determine which are broken, or orphaned. With this information, you will know which software you should reinstall or properly remove to fix orphaned file-extension mappings.

Understanding File-Extension Mapping
Windows stores file-extension maps in the registry under the Software\Classes key; you’ll find user-specific file-extension maps in the HKEY_CURRENT_USER hive and systemwide file-extension maps in the HKEY_LOCAL_MACHINE hive. Windows conveniently aggregates the information from both hives and exposes it under the virtual HKEY_CLASSES_ROOT hive.

Every file extension has a registry key identified by its extension name, and this key contains a default value that describes its file type. For example, a .pdf file has a registry key of HKEY_CLASSES_ROOT\.pdf, and the .pdf extension is mapped to the AcroExch.Document file type. You can validate this extension mapping by opening a command-line window and running

assoc .pdf

The assoc command tells you what file type is associated with a given file extension.

When you know the file type associated with a given file extension, you can look up the type in the registry. To continue with our .pdf example, the registry will contain a key called HKEY_CLASSES_ROOT\AcroExch.Document, which contains all the information Windows needs to manage this file type. However, the huge amount of possible information in this key can quickly overwhelm you.

The file-type registry key (HKEY_CLASSES_ROOT\AcroExch.Document) contains a default value that typically describes the file type. In our example, the value is Adobe Acrobat 7.0 Document, which is self-explanatory. The file-type registry key might also contain several other values and subkeys, including Shell, CLSID (for class ID), CurVer (for current version), and Protocol. Some subkeys (such as Shell) contain even more subkeys, such as Open\Command. Each subkey has a special purpose, and discussing each one is beyond the scope of this article. However, the Perl script I built validates that the computer you’re analyzing has code that maps to the relevant file type.

Validating Subkeys
To validate that a file type is correctly mapped to an application, the script checks only the most common subkeys: Shell, ShellEx, CLSID, and DefaultIcon. The goal is to examine each of these keys to determine the path to a file, which the script then checks to make sure the file still exists on the computer. If the file’s there, you’re assured that the file mapping is intact. If the file is missing from the computer, the file mapping is broken, or orphaned. Let’s quickly review the subkeys that the script validates.

The Shell key. The Shell key contains subkeys that describe actions the Windows shell (aka Windows Explorer) can perform on a file, such as Open, Print, Edit, and New. Each action has its own subkey under Shell, and each action subkey contains a subkey called Command. The Command subkey’s default value is the path to the program that runs when the Windows shell executes that action. For example, a .pdf file maps to the AcroExch.Document file type, and that file type has the key

HKEY_CLASSES_ROOT\AcroExch.Document\Shell\Open\Command

That key, in turn, has a default value of

"C:\Program Files\Adobe\Acrobat 7.0\Reader\AcroRd32.exe" "%1"

This value specifies that when you double-click a .pdf file, the shell runs the AcroRd32.exe application, passing in the path to the .pdf file, as the %1 value indicates.

You can run the FType command from the command line to display the value for the shell’s Open command for the specified file type. For example, to validate the Open command for the .pdf file type, you can run

Ftype AcroExch.Document

However, some file types might be mapped to applications only for printing, sending in email, or creating, but not for opening. And FType displays information only for the Open command, not for other shell commands such as Edit and Print. Thus, the script must look for any legitimate shell command that maps to a valid application on the computer.

Note that the Windows Explorer context menu for the file type shows the Shell key’s subkeys. To see this list, just select a file of the particular file type you’re interested in, then right-click.

The ShellEx key. Similar to the Shell key, ShellEx is an “extended” shell key that typically contains CLSIDs to COM objects. The script also checks this key.

The CLSID key. Not all file types have a CLSID key, which identifies the COM CLSID associated with a file type. A CLSID, also known as a globally unique identifier (GUID) or a universally unique ID (UUID), relates to a COM component. For example, a COM component manages Microsoft Excel’s charting capabilities, letting you embed Excel charts in other applications such as Word or Microsoft Outlook.

Windows determines which COM component to use for a particular task by looking in the file type’s registry key for a CLSID. The CLSID looks something like {B801CA65-A1FC-11D0-85AD-444553540000} and refers to another location in the registry: HKEY_CLASSES_ROOT\CLSID\{B801CA65-A1FC-11D0-85AD-444553540000}. This location can contain another set of registry keys, each with a story that’s beyond the scope of this article. Suffice it to say that the script checks several of these keys for either file paths or other CLSIDs that it might need to recursively look up and try to resolve to a file path.

The DefaultIcon key. The DefaultIcon key specifies the path to a file that contains an icon image, which Windows Explorer uses to display a file type’s icon. Oddly, some file types serve only as placeholders for icons. For example, system drivers aren’t file types that users can open, print, or create. However, Windows still displays an icon for them in an Explorer window. Even though no applications or COM components map to the icon file type, Windows still needs the file extension for this file type.

Running the Script
You run the VerifyFileExtensions.pl script on the machine where you want to examine and validate file extensions. The script doesn’t validate file extensions on remote machines. Because the script scrutinizes file extensions for both the machine and the current user, it effectively analyzes all extensions to which that user has access. For another user to analyze his or her file-extension mappings, that user would need to log on and run the script. (Note that if you run a 32-bit version of Perl.exe on an x64 machine, the script will see only the 32-bit part of the registry and its output might be a bit misleading. See the sidebar “The 64-Bit Wildcard” for more information about this caveat.)

You run the script by simply calling it as follows:

Perl VerifyFileExtensions.pl

The script collects all the information it needs, then displays two lists. The first list contains all orphaned file extensions that don’t map to any existing file on the computer. You should consider either removing or reinstalling software related to these extensions. The second list shows all file extensions that don’t map to applications but that have valid icons mapped to them.

You can also run the script by passing in specific file extensions, such as

Perl VerifyFileExtensions.pl .txt .doc .pdf

With this approach, the script examines and displays results for only the extensions you specify.

Note that you should consider the script’s results advisory, using the output only as a guide to investigate file extensions that seem to be orphaned. You don’t have to take any action on the list of possible orphaned file extensions if you don’t want to. For example, you might have removed software whose uninstall function didn’t correctly clean up file extension mappings. Or the script might have flagged an extension as orphaned when it really isn’t. This anomaly is usually a result of a new or special way software or the OS manages specific file extensions.

Walking Through the Script
Listing 1 shows the VerifyFileExtensions.pl script, which uses the Win32::TieRegistry extension. Typically, I prefer to use Win32::Registry for manipulating the registry, but Win32::TieRegistry abstracts much of the registry API and exposes it as a simple tied hash, which makes querying it easy. The Win32::TieRegistry extension comes as part of ActivePerl, ActiveState’s version of Win32 Perl, available at http://www.activestate.com. However, if you use an older version of Perl (such as pre-5.5), you might need to download and install Win32::TieRegistry by using ActivePerl’s Perl Package Manager (PPM) utility as follows:

ppm install Win32-TieRegistry

The code at callout A in Listing 1 loads Win32::TieRegistry and sets up important variables that the script uses later. Unfortunately, the Win32::TieRegistry module defaults to accessing the registry by requesting all possible permissions. For most non-administrator users, this behavior often results in read failures. The script addresses the problem by requesting only permissions the user is permitted to have. This code appears in a BEGIN block to ensure that it executes before any other code in the script runs:

BEGIN
{
    use Win32::TieRegistry( Delimiter => "\\", ArrayValues => 0 );
    $Registry = $Registry->Open( '', { Access => 0x2000000 } );
}

The rest of the code at callout A sets up several variables worth noting. The @APP_COMMANDS and @APPEX_COMMANDS arrays are the list of shell and extended shell commands, respectively, that the script checks. A file type is valid if any of these commands have a default value associated with them. The $AppExtensions variable comes from the Windows PATHEXT environment variable, which defines which file extensions the OS considers executable (e.g., .exe, .com, .bat, .cmd). I modified the variable a little so that I can use it in regular expressions, which the script uses to determine whether a binary file path is a valid application. And the %COMPILED_REGEX hash precompiles regular expressions that the script uses. Because the script precompiles these regular expressions only once and not for each file extension it checks, it runs much faster.
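As a rough illustration of how a PATHEXT-style list can become a regular expression, here is a minimal sketch. It is not the code from Listing 1: the build_ext_regex() name and the fallback extension list are assumptions made for this example.

```perl
use strict;
use warnings;

# Turn a PATHEXT-style list (".COM;.EXE;.BAT") into one precompiled,
# case-insensitive regex that matches a path ending in a listed extension.
# build_ext_regex() is a hypothetical helper, not a name from the script.
sub build_ext_regex {
    my ($pathext) = @_;
    my @exts        = grep { length } split /;/, $pathext;
    my $alternation = join '|', map { quotemeta } @exts;
    return qr/(?:$alternation)$/i;
}

# Assumed fallback list for machines where PATHEXT is not set.
my $app_regex = build_ext_regex( $ENV{PATHEXT} || '.COM;.EXE;.BAT;.CMD' );

for my $path ( 'C:\\Windows\\notepad.exe', 'C:\\Temp\\readme.txt' ) {
    printf "%-28s %s\n", $path,
        $path =~ $app_regex ? 'application' : 'not an application';
}
```

Because qr// compiles the pattern once, the same regex object can be reused for every extension the script examines.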

Last, the code at callout A sets Perl’s output-autoflush variable ($|) to TRUE (1) so that every print statement writes directly to the screen instead of waiting in the buffer until a newline character is printed. Later, the script displays its progress. Because this status display doesn’t use newline characters, the display buffer isn’t flushed automatically, so the script sets $| to TRUE to force such flushing.
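The effect of $| is easiest to see with a status line that never prints a newline. The following is only a sketch of that idea; the progress_line() helper, the message text, and the sample extension list are invented for this example.

```perl
use strict;
use warnings;

$| = 1;    # autoflush the currently selected handle (STDOUT by default)

# Build a status line that starts with a carriage return, so each update
# overwrites the previous one. progress_line() is a hypothetical helper.
sub progress_line {
    my ( $done, $total ) = @_;
    return sprintf "\rChecked %d of %d extensions", $done, $total;
}

my @extensions = qw( .doc .pdf .txt );    # sample data only
for my $i ( 1 .. @extensions ) {
    print progress_line( $i, scalar @extensions );
}
print "\n";
```

Without the $| = 1 line, the updates would typically sit in the buffer and appear all at once when the final newline is printed.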

The code at callout B is the script’s main loop. It starts by either collecting the extensions the user passed in on the command line or gathering all file extensions listed in the registry by enumerating all keys under HKEY_CLASSES_ROOT ($Registry->{Classes}) that begin with a dot (.) character, such as .doc and .pdf.

The script generates this list of keys by using the grep command, as follows:

grep { "." eq substr( $_, 0, 1 ) && "\\" eq substr( $_, -1 ) } keys( %{$Registry->{Classes}} );

Note that grep produces a list of extension names from all registry entries that not only begin with a dot but also have a backslash (\) as the last character. This point is important because Win32::TieRegistry signifies key names by ending them with a backslash, letting the script determine whether it’s examining a registry key or a registry key’s value name.
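To see the filter in isolation, you can run the same test against an ordinary list instead of the registry. The names below are made up for illustration, and extension_keys() is a hypothetical wrapper, not part of the script.

```perl
use strict;
use warnings;

# Sample names only: entries that end with a backslash stand for subkeys,
# and the entry without one stands for something that is not a key.
my @names = ( '.doc\\', '.pdf\\', 'AcroExch.Document\\', '.notakey', 'CLSID\\' );

# Keep names that start with a dot AND end with a backslash.
sub extension_keys {
    return grep { '.' eq substr( $_, 0, 1 ) && '\\' eq substr( $_, -1 ) } @_;
}

print join( ', ', extension_keys(@names) ), "\n";    # prints ".doc\, .pdf\"
```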

For each file extension, the script calls the LookupExtension() subroutine, which discovers various information about the file extension. When the code has processed all file extensions, it displays the list of orphaned file extensions—that is, file extensions that didn’t map to a file type, whose file type is missing configuration information, or whose file paths aren’t valid. The code at callout B then prints the list of file extensions that appear to be orphaned but still have icons defined, which is important because sometimes file types map only to icons.

Exploring the LookupExtension() Subroutine
Callout C shows the LookupExtension() subroutine. It first tries to determine which file type is associated with a file extension (the $ApplicationName variable) by querying the default value ($Registry->{Classes}->{$Extension}->{"\\"}) of the file extension’s registry key ($Registry->{Classes}->{$Extension}). It then creates the $Data hash reference, which contains information about the extension. The script uses this information later if it identifies the extension as orphaned.

Next, the subroutine removes any trailing backslashes that Win32::TieRegistry sometimes leaves attached to the file extension. If an application name exists, the code queries the full application name ($Registry->{Classes}->{"$ApplicationName\\"}->{"\\"}) and calls the GetAppPath() subroutine to investigate various file paths that relate to the application. If the call to GetAppPath() fails, meaning the code couldn’t locate the file type’s application, the script calls GetIcon() to determine whether the file extension has a valid icon defined.
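The lookup chain (extension, then file type, then description) can be sketched against a plain nested hash that stands in for the tied registry hash. This is an assumption-laden illustration rather than the LookupExtension() code itself: lookup_extension(), the field names in the returned hash, and the .xyz and Gone.Type entries are all invented here.

```perl
use strict;
use warnings;

# A plain hash shaped like the tied registry hash: subkey names end with a
# backslash, and the key "\" holds a key's default value. Sample data only.
my %classes = (
    '.pdf\\'              => { '\\' => 'AcroExch.Document' },
    'AcroExch.Document\\' => { '\\' => 'Adobe Acrobat 7.0 Document' },
    '.xyz\\'              => { '\\' => 'Gone.Type' },    # type key is missing
);

# Hypothetical helper: resolve an extension to its file type and description.
sub lookup_extension {
    my ( $classes, $ext ) = @_;
    $ext =~ s/\\+$//;                               # drop trailing backslashes
    my $ext_key = $classes->{"$ext\\"} or return;   # unknown extension
    my $type    = $ext_key->{'\\'};                 # default value = file type
    my %data    = ( extension => $ext, type => $type );
    if ( defined $type && length $type ) {
        my $type_key = $classes->{"$type\\"};
        $data{description} = $type_key->{'\\'} if $type_key;
    }
    return \%data;
}

for my $ext ( '.pdf\\', '.xyz' ) {
    my $info = lookup_extension( \%classes, $ext );
    printf "%s -> %s (%s)\n", $info->{extension}, $info->{type},
        $info->{description} // 'no file-type key found';
}
```

An extension such as the made-up .xyz, whose file type has no key of its own, is the kind of entry the script would go on to treat as a candidate orphan.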

Callout D shows the GetAppPath() subroutine, which examines multiple registry keys to determine whether a file extension is orphaned. If the subroutine determines that the file extension is valid, it returns a TRUE value (1); otherwise, it returns FALSE (undef).

The subroutine first checks whether the file extension and file type’s application keys are valid. It then checks the file extension for specific values such as Generic and PerceivedType, which the OS uses for built-in file extensions such as .cpl, .drv, and .sys. The subroutine next checks for the application’s FriendlyTypeName value, which the system typically uses. If GetAppPath() finds such a value, it extracts the application path from the value by using a precompiled regular expression and validates the path.

GetAppPath() also examines COM CLSIDs to see whether the file extension or file type is mapped to a COM class. If so, instead of causing a separate program to run, the file extension requests that the COM server process the file. The subroutine examines a variety of well-known COM-related keys such as PersistentHandler and CLSID. Finally, the subroutine determines whether the file type’s Shell and ShellEx subkeys exist and, if so, checks each valid Shell and ShellEx command (@APP_COMMANDS and @APPEX_COMMANDS variables) to see whether a valid shell command file is defined. If any of these file paths are present, the script considers the extension valid and not an orphan.
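The final Shell check amounts to "the first command that resolves wins." A minimal sketch of that loop follows, again using a plain hash in place of the registry. The contents of @APP_COMMANDS shown here are assumed (the article doesn’t list them), and has_valid_shell_command() and its callback argument are hypothetical.

```perl
use strict;
use warnings;

my @APP_COMMANDS = qw( Open Edit Print );    # assumed contents, for illustration

# Return 1 if any listed shell command has a default value that the supplied
# $exists callback accepts; otherwise return 0.
sub has_valid_shell_command {
    my ( $type_key, $exists ) = @_;
    my $shell = $type_key->{'Shell\\'} or return 0;
    for my $verb (@APP_COMMANDS) {
        my $verb_key = $shell->{"$verb\\"}        or next;
        my $cmd_key  = $verb_key->{'Command\\'}   or next;
        my $command  = $cmd_key->{'\\'};          # default value of Command
        next unless defined $command && length $command;
        return 1 if $exists->($command);
    }
    return 0;
}

# Sample file type that defines only a Print command.
my %file_type = (
    'Shell\\' => {
        'Print\\' => { 'Command\\' => { '\\' => 'C:\\app\\print.exe /p %1' } },
    },
);

# A stub callback; a real check would extract the path and test it on disk.
print has_valid_shell_command( \%file_type, sub { 1 } ) ? "valid\n" : "orphaned\n";
```

Passing the existence test in as a callback keeps the loop separate from the path-parsing and file-system work that the other subroutines handle.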

Callout E shows the LocatePath() subroutine, which accepts a file path and tries to find the related file. Because the registry might contain file names that have no path information, Perl might not be able to use the standard -f test to determine whether the file exists. This subroutine constructs a valid file path by expanding any embedded environment variables and using the Win32::GetLongPathName() function. If that produces any valid paths, the subroutine returns them; otherwise, it repeats the same process for each directory in the PATH environment variable, essentially walking the PATH in search of the specified file.
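The expand-then-walk approach can be approximated in portable Perl. The sketch below is a simplified stand-in for LocatePath(): it returns only the first match, it omits the Win32::GetLongPathName() step so that it runs on any platform, and the expand_env() and locate_path() names are invented for this example.

```perl
use strict;
use warnings;
use File::Spec;

# Expand %NAME% references from the environment; unknown names are left as-is.
sub expand_env {
    my ($text) = @_;
    $text =~ s/%([^%]+)%/ defined $ENV{$1} ? $ENV{$1} : "%$1%" /ge;
    return $text;
}

# Return the first existing file for $file, trying it as given and then
# under each PATH directory; return nothing if no candidate exists.
sub locate_path {
    my ($file) = @_;
    my $path = expand_env($file);
    return $path if -f $path;                 # resolvable as given
    for my $dir ( File::Spec->path ) {        # directories listed in PATH
        my $candidate = File::Spec->catfile( $dir, $path );
        return $candidate if -f $candidate;
    }
    return;                                   # not found anywhere
}

my $found = locate_path('notepad.exe');
print defined $found ? "Found: $found\n" : "Not found\n";
```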

Callout F shows the GetPathFromCommand() subroutine, which the script uses to process Shell and ShellEx command lines. The specified path might be complicated by specified parameters, so this subroutine uses a precompiled regular expression and various other processing to determine the application path from such command lines.
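To give a feel for the parsing involved, here is one possible way to pull the program path out of a command line. It is a sketch, not the regular expression from Listing 1; path_from_command() and the hard-coded list of executable extensions are assumptions.

```perl
use strict;
use warnings;

# Extract the program path from a shell command line such as
#   "C:\dir\app.exe" "%1"     or     C:\dir\app.exe /p %1
sub path_from_command {
    my ($command) = @_;
    return unless defined $command;

    # Quoted program path: take everything inside the first pair of quotes.
    return $1 if $command =~ /^\s*"([^"]+)"/;

    # Unquoted: shortest prefix that ends in an executable extension
    # (assumed list), so embedded spaces in the path survive.
    return $1 if $command =~ /^\s*(.+?\.(?:exe|com|bat|cmd))(?=\s|$)/i;

    # Otherwise fall back to the first whitespace-delimited token.
    return $1 if $command =~ /^\s*(\S+)/;
    return;
}

print path_from_command(
    '"C:\Program Files\Adobe\Acrobat 7.0\Reader\AcroRd32.exe" "%1"'), "\n";
```

The resulting path still needs an existence check, because the command might name a program that has since been removed.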

Callout G shows the GetPathForClassID() subroutine, which examines a passed-in COM CLSID and fetches its file path. The code first checks for the CLSID key in the registry ($Registry->{Classes}->{"Clsid\\"}->{"$ClassID\\"}), then checks a variety of subkeys for a file path. In some cases, the result is another CLSID, for which GetPathForClassID() is recursively called.
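The recursion is simple to sketch with a plain hash in place of the registry. The article doesn’t say which subkeys the script inspects, so the TreatAs, InprocServer32, and LocalServer32 choices below are assumptions, as are the path_for_class_id() name, the sample CLSIDs, and the cycle guard.

```perl
use strict;
use warnings;

# Matches a brace-wrapped GUID such as {B801CA65-A1FC-11D0-85AD-444553540000}.
my $CLSID_RE = qr/^\{[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}\}$/;

# Resolve a CLSID to a file path, following values that are themselves CLSIDs.
sub path_for_class_id {
    my ( $clsid_root, $class_id, $seen ) = @_;
    $seen ||= {};
    return if $seen->{ uc $class_id }++;        # stop if CLSIDs form a loop
    my $key = $clsid_root->{"$class_id\\"} or return;
    for my $sub ( 'TreatAs\\', 'InprocServer32\\', 'LocalServer32\\' ) {
        my $sub_key = $key->{$sub} or next;
        my $value   = $sub_key->{'\\'};         # default value of the subkey
        next unless defined $value && length $value;
        return path_for_class_id( $clsid_root, $value, $seen )
            if $value =~ $CLSID_RE;             # another CLSID: recurse
        return $value;                          # otherwise treat as a path
    }
    return;
}

# Sample data: the first CLSID redirects to the second, which names a DLL.
my %clsid = (
    '{11111111-1111-1111-1111-111111111111}\\' =>
        { 'TreatAs\\' => { '\\' => '{22222222-2222-2222-2222-222222222222}' } },
    '{22222222-2222-2222-2222-222222222222}\\' =>
        { 'InprocServer32\\' => { '\\' => 'C:\\Windows\\System32\\example.dll' } },
);

my $dll = path_for_class_id( \%clsid, '{11111111-1111-1111-1111-111111111111}' );
print defined $dll ? "$dll\n" : "unresolved\n";
```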

Last, the GetIcon() subroutine, at callout H, looks for a DefaultIcon key. If the key exists and has a default value, the subroutine extracts the path by using a precompiled regular expression. The script needs to use a regular expression to extract the path because icon values consist of a string that represents the .dll or .exe file that contains the icon, followed by an index number indicating which icon in the file to use (for example, C:\Program Files\Movie Maker\wmm2res.dll,27). Alternatively, if the DefaultIcon value is %1, GetIcon() considers the value to be a valid icon path.
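Splitting a DefaultIcon value into its file path and index might look like the following. This is a sketch under stated assumptions, not the script’s own pattern: icon_path() is a hypothetical name, and the expression deliberately gives up on paths that themselves contain commas.

```perl
use strict;
use warnings;

# Return the file part of a DefaultIcon value such as
#   C:\dir\icons.dll,27    or    "C:\dir\app.exe",-3
# A bare %1 is returned unchanged, because it means "use the file itself."
sub icon_path {
    my ($value) = @_;
    return unless defined $value && length $value;
    return '%1' if $value =~ /^\s*"?%1"?\s*$/;

    # Optional quotes around the path, then an optional ",index" suffix.
    my ($path) = $value =~ /^\s*"?([^",]+?)"?\s*(?:,\s*-?\d+)?\s*$/;
    return $path;
}

print icon_path('C:\Program Files\Movie Maker\wmm2res.dll,27'), "\n";
```

As with shell commands, the extracted path is only useful once you confirm that the file still exists.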

Fixing Orphaned File Extensions
Orphaned file extensions often point to other problems. They can indicate that a virus or malware is deleting files from your system or that users are incorrectly removing or reconfiguring applications. Orphans can also be a sign of random application corruption, letting you know to reinstall software packages. By occasionally running this script, you can determine what you need to reinstall or remove from a system. You can also use it to help explain to users why sometimes nothing happens when they repeatedly double-click a file, which can drive any user crazy.
