Detect Directory Differences

Perl script makes comparing files easy

Dave Roth

April 9, 2006

I recently had to set up a new Windows Server 2003 machine. The setup process was straightforward until I installed Perl. You can call me old-fashioned, but I prefer not to run a setup application or use an .msi file to install Perl. Instead, I chose to copy Perl from another machine: I simply copied the Perl directory from one machine to the other.

Several days later, I discovered a problem. The copy process was successful, but I had copied the Perl directory from the wrong machine. As a result, a couple of files were missing and a couple of others were slightly different. Usually, it wouldn't be too much of a problem to hunt down such discrepancies. However, the Perl directory consists of more than 6,000 files, so I needed a tool that could compare directories and tell me exactly which files differed between the machines. I decided to write a Perl script to do the job. To write this script, I had to determine which algorithm to use, then implement that algorithm.

Determining the Algorithm
You can use different algorithms in a script to compare directories. For example, a script might use a comparison algorithm that takes into account the directories' tree structure, file count, filenames, file sizes, and file timestamps. The ideal algorithm would also analyze each file's contents to verify that each file contains the same data. It's possible that two different files could have the same relative file paths and identical file sizes yet have different contents. For example, you could have the C:\Dir1\readme.txt and C:\Dir2\readme.txt files. Both files have the same relative path (relative to C:\Dir1 or C:\Dir2), the same filename (readme.txt), and the same file size (21 bytes). However, one file might contain the text "This is a readme file" and the other file might contain the text "I think I smell food!"

I didn't need such an elaborate comparison between the Perl directories. I just needed a simple algorithm that would take into account the directories' partial tree structure, filenames, and file sizes. So, I decided to create an algorithm that compared the files' paths and byte sizes.

To create this algorithm, I determined that I could use a single hash in which each path is stored as a hash key. Each hash key's associated value is a subhash whose keys indicate the analyzed directory (1 or 2) and whose values specify the file's size. The resulting hash might look like the one in Figure 1. In this sample hash, the file named FileNumber1.txt exists in both directories (1 and 2) and both files are the same size (1234 bytes). However, only directory 1 contains a file named FileNumber2.txt, which has a size of only 32 bytes.
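In Perl, a minimal sketch of that structure, using the illustrative names and sizes from the example above (in DirDiff.pl itself the keys are relative file paths), looks like this:

my %FileList = (
   'FileNumber1.txt' => { 1 => 1234, 2 => 1234 },  # in both directories, same size
   'FileNumber2.txt' => { 1 => 32 },               # in directory 1 only
);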

With this algorithm, reporting the results would be simple. The script would simply need to walk through each key in the %FileList hash. If the %FileList hash key's value contains only one subhash key called {1} or {2}, the file exists in only one directory, so the script would print that file's path on screen. If both {1} and {2} subhash keys exist, the script would compare their values. If these values weren't identical, the files have different sizes, so the script would print each file's path and size on screen.
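A hedged sketch of that walk, written directly against the %FileList structure rather than taken from the script's PrintReport() code, might look like this:

foreach my $Path ( sort keys %FileList ) {
   my $Entry = $FileList{$Path};
   if ( exists $Entry->{1} && exists $Entry->{2} ) {
      # The file is in both directories; report it only if the sizes differ.
      print "$Path: $Entry->{1} bytes vs. $Entry->{2} bytes\n"
         if $Entry->{1} != $Entry->{2};
   }
   else {
      # The file is in only one directory.
      my $Dir = exists $Entry->{1} ? 1 : 2;
      print "$Path: exists only in directory $Dir\n";
   }
}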

Implementing the Algorithm
The DirDiff.pl script demonstrates how I implemented the algorithm. Listing 1 shows an excerpt from that script. (You can download the entire script from the Windows Scripting Solutions Web site. See page 1 for download information.) The code at callout A in Listing 1 declares some of the variables that the script uses. The use vars line declares the global variables %Config and $gFileCount. These variables need to be accessible from all of the script's subroutines, so they aren't lexically declared with the my keyword.

The %FileList, %File, and %Size variables are declared at the beginning of the script because they're used later by the write command in the PrintReport() subroutine. For write to properly print the values in these variables, the variables have to be visible in that scope, and because the script uses strict (as all good Perl scripts should), they must be declared before they're used. Because you can't apply local scoping to lexical variables (i.e., those created with my), the script lexically declares these variables at the beginning of the script. Perl variables aren't typically declared this way; the only reason the script does so is its use of the write command.
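In sketch form, the declarations that callout A describes amount to something like the following (the variable names are the ones discussed above):

use strict;

# Globals shared by all subroutines; deliberately not scoped with my.
use vars qw( %Config $gFileCount );

# File-scoped lexicals, declared up front so the format that the write
# command uses in PrintReport() can see them.
my ( %FileList, %File, %Size );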

Callout B highlights the main block of DirDiff.pl. This block of code calls the CollectFileList() subroutine for the directories being compared, then enumerates through the directories using a foreach loop. It would have been just as easy to hardcode two calls to CollectFileList() and pass in the two directories, but less fun to script. Finally, the block of code calls the PrintReport() subroutine.
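In outline, that flow looks something like the following sketch; the $Dir1 and $Dir2 names for the two directory paths are assumptions, not necessarily the names DirDiff.pl uses:

my $Context = 0;
foreach my $Dir ( $Dir1, $Dir2 ) {
   print STDERR "Scanning $Dir...\n";
   CollectFileList( $Dir, \%FileList, ++$Context, '' );
}
PrintReport();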

In callout B, note the print statement that specifies the STDERR file handle. You'll find such print statements throughout DirDiff.pl. I included these statements for those users who want to redirect the script's output to a file instead of the screen. Because of these statements, all data printed to STDOUT (the default print file handle) will be redirected, but the STDERR output will continue to display on screen. That way, the script's progress information doesn't clutter the output redirected to the file.
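For example, the two kinds of output differ only in the file handle they target:

print STDERR "Scanning directory 1...\n";  # progress: stays on screen when STDOUT is redirected
print "some report line\n";                # report data: follows STDOUT to the redirected file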

The CollectFileList() subroutine, which callout C shows, creates a list of the files in a directory. The subroutine accepts four parameters. The first parameter ($Path) specifies the path to the directory to be examined.

The second parameter ($FileList) is a reference to the %FileList hash. I used a reference because this hash will be modified and I want these changes to persist across multiple calls to the subroutine. Alternatively, you could pass in the hash instead of a hash reference, then return the modified hash. However, the size of this hash will undoubtedly grow quite large for bigger directories. Passing such large hashes in and out of subroutines impacts the script's performance in terms of both memory usage and speed.

The last two parameters specify the directory being examined ($Context) and the relative path to that directory ($RelativePath). Using the relative path rather than the full path is important because the script must examine file paths relative to these two directories. Although the full paths will never match each other, the relative paths should.
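Inside the subroutine, the four parameters are unpacked in the usual way; this sketch just names them, with comments summarizing the roles described above:

sub CollectFileList {
   my ( $Path, $FileList, $Context, $RelativePath ) = @_;
   # $Path         - full path of the directory being examined
   # $FileList     - reference to the shared %FileList hash
   # $Context      - 1 or 2, identifying which compared directory this is
   # $RelativePath - path relative to the top of that directory tree
   # ...body as described in the text...
}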

When the script can open the directory specified by $Path, each object in that directory is enumerated. The script constructs a string that ensures the object's file path won't wrap on screen and assigns that string to the $PrettyPath variable. The script prevents wrapped file paths by calling the AbbreviatePath() subroutine, which replaces enough characters in the string with an ellipsis (...) to ensure that the string is short enough to fit on a printed line. The $PrettyPath string is then printed to STDERR to indicate the script's progress. Knowing the progress is important when you use the script to analyze large numbers of files.
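The abbreviation itself is straightforward. The following is a minimal sketch of the idea, not the author's AbbreviatePath() code; it replaces the middle of an overlong path with an ellipsis:

sub AbbreviatePath {
   my ( $Path, $MaxLength ) = @_;
   return $Path if length( $Path ) <= $MaxLength;
   my $Keep = int( ( $MaxLength - 3 ) / 2 );    # characters kept on each side of the '...'
   return substr( $Path, 0, $Keep ) . '...' . substr( $Path, -$Keep );
}

print AbbreviatePath( 'C:\Very\Long\Path\To\Some\File.txt', 25 ), "\n";  # C:\Very\Lon...me\File.txt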

Next, DirDiff.pl checks to see whether each object is a directory. If an object is a directory, the script adds it to the @DirList array for later processing. If an object is a file, the script retrieves the file's size (unless the user includes the -l option, one of the command-line options that I'll discuss shortly) and stores the file information in the %FileList hash. If the user wants directory recursion (which the user specifies with the -s command-line option), the script enumerates each directory in the @DirList array and recursively calls CollectFileList() for it.
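Continuing inside CollectFileList() after the parameters are unpacked, a simplified sketch of that enumeration follows; it omits the progress output, and the $Config{s} and $Config{l} keys are assumptions about how the script stores its command-line switches:

opendir my $Dir, $Path or return;
my @DirList;
foreach my $Object ( readdir $Dir ) {
   next if $Object eq '.' || $Object eq '..';
   my $Full = "$Path\\$Object";
   if ( -d $Full ) {
      push @DirList, $Object;                  # save subdirectories for later
   }
   else {
      my $Size = $Config{l} ? 0 : -s $Full;    # -l skips the size lookup
      $FileList->{"$RelativePath\\$Object"}{$Context} = $Size;
   }
}
closedir $Dir;

# With -s, recurse into each subdirectory collected above.
if ( $Config{s} ) {
   CollectFileList( "$Path\\$_", $FileList, $Context, "$RelativePath\\$_" )
      foreach @DirList;
}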

Although not evident by its name, the PrintReport() subroutine does more than just display the script's results. This subroutine also performs the directory comparison. As callout D shows, this subroutine starts by resetting $gFileCount so that it can be used to determine the difference in the number of files in the directories. After writing the report's header, the subroutine sets the format for the file data. Then, the subroutine prepares for the directory comparison by enumerating the full list of file paths in the %FileList hash. For each path, the subroutine determines whether the file was present in directory 1 or directory 2 and retrieves the file's size.

The code at callout E is where the PrintReport() subroutine compares the files in the two directories. The subroutine first checks to see whether each enumerated file is present in both directories. When a file is in only one directory, the subroutine reports this discrepancy. When a file is in both directories, the subroutine checks to see whether the file in directory 1 is the same size as the file in directory 2. If there's a discrepancy (i.e., the sizes differ), the %File hash entries' strings are modified to include the file size. Finally, the subroutine uses the write command to print the discrepancies on screen.

Using the Script
To run DirDiff.pl, you need to pass in the paths to the two directories you want to compare. For example, to display the differences between the C:\Dir1 and C:\Dir2 directories, you run the command

Perl DirDiff.pl C:\Dir1 C:\Dir2 -s

The analysis will include all subdirectories. If you don't want recursion into the subdirectories, simply omit the -s option.

Because enumerating thousands of files and retrieving their file sizes can take a considerable amount of time (especially if any of the directories are on a remote network share), it's useful to be able to disable the file-size analysis by including the -l option. For example, to compare only the names of the files on \\server_a\share_1 and \\server_b\share_1, you run the command

Perl DirDiff.pl \\server_a\share_1 \\server_b\share_1 -s -l

I wrote and tested DirDiff.pl on machines running Windows Server 2003, Windows XP, and Windows XP 64-Bit Edition.

A Simple But Valuable Tool
Since I wrote DirDiff.pl, I've found that I use it far more often than I expected. Although the script's logic is simple, its value is high.
