Divide and Conquer Mega-Sized Text and Log Files
Have split files rather than a splitting headache
February 22, 2009
Several times a year, various department heads give me text files and ask me to perform data analyses and create summary reports. Often these files are massive directory dumps or application data dumps that can be as large as 700MB and contain more than 9.5 million lines of text. I'm also occasionally asked to extract contents from extremely large text-based cluster logs, Web logs, and event logs for technicians who need to send log samples to our security department or to analysts to diagnose problems. In addition, there are times when I need to look at the data in an enormous text file so that I know how to work with its contents in a script.
Sometimes these mega-sized text and log files are too large for Notepad to open. Other times, Notepad is sluggish when I try to scroll through their contents. Having smaller files not only makes it easier to do data analyses but also dramatically speeds up code development and testing.
After many months of working around this problem with a mixed bag of tactics, such as exporting data to Microsoft Access or trying to open the files with some other application, I finally decided to write the Log Splitter utility. This HTML Application (HTA) splits large text files into smaller files that I can easily open with Notepad and easily work with when writing a script.
The Log Splitter utility offers simple but adequate functionality. After you download the utility (click the Download the Code Here button at the top of the page) and copy it to your computer, double-click it. In the UI (see Figure 1), enter the pathname of the text or log file you want to split or use the Browse button to locate it.
You can split up the file by the number of lines or by the number of files. To find out how many lines are in the file, click the GetLineCount button. Knowing the total number of lines can help you decide whether to split the file by line count or by number of files.
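To give a feel for what a line count involves on a file this size, here is a minimal Python sketch. It is not the HTA's code; the function name `count_lines` and the 1MB chunk size are assumptions chosen for illustration. It reads the file in fixed-size chunks so that memory use stays flat even for a file of several hundred megabytes.

```python
def count_lines(path, chunk_size=1024 * 1024):
    """Count the lines in a text file without loading it all into memory.

    Hypothetical helper for illustration; not part of the Log Splitter HTA.
    """
    lines = 0
    last_byte = b""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Every newline byte ends one line.
            lines += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # A final line with no trailing newline still counts as a line.
    if last_byte and last_byte != b"\n":
        lines += 1
    return lines
```

Counting newline bytes in binary mode avoids decoding the text at all, which tends to matter more than anything else when the input runs to millions of lines.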
If you want to split the large file into a specific number of smaller files, select the Split into number of files option, then specify that number in the Enter Line Count or Number of Files field. If you want to split the large file into smaller files that contain a certain number of lines, select the Split by Line count option, then specify the maximum number of lines you want in the smaller files. You must enter a value of 100,000 or higher. I found that lower values tend to produce too many files, particularly if you're splitting a file that's several hundred megabytes.
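The two options come down to the same number: how many lines go into each smaller file. The Python sketch below shows that arithmetic under stated assumptions. The function name, the mode strings `"files"` and `"lines"`, and the decision to apply the 100,000 minimum only to the line-count option are my illustration, not the utility's actual logic.

```python
import math

# Minimum the article gives for the line-count option.
MIN_LINES = 100_000


def lines_per_output(total_lines, mode, value):
    """Return how many lines each smaller file should hold.

    mode "files": value is the number of smaller files wanted.
    mode "lines": value is the maximum number of lines per smaller file.
    Hypothetical helper; the mode names are assumptions for this sketch.
    """
    if value < 1:
        raise ValueError("value must be a positive whole number")
    if mode == "files":
        # Round up so that no lines are left over after the last file.
        return math.ceil(total_lines / value)
    if mode == "lines":
        if value < MIN_LINES:
            raise ValueError("line count must be %d or higher" % MIN_LINES)
        return value
    raise ValueError("mode must be 'files' or 'lines'")
```

For example, asking for five files from a 9.5 million line dump gives 1.9 million lines per file. Because of the rounding up, the last file is usually shorter than the others.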
All that's left to do is to click the RunScript button and click OK. Before the utility starts splitting the large file, it checks for possible problems. If it finds a problem, it displays a message. For example, the utility checks to see whether the specified file is a text or log file. If you try to split another type of file, you'll receive the message "This script only works with '.log' and '.txt' files."
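A pre-flight check of this kind might look like the following Python sketch. Only the extension message comes from the utility as described above; the function name `check_input` and the "File not found" message are hypothetical additions for illustration.

```python
import os

ALLOWED_EXTENSIONS = {".log", ".txt"}


def check_input(path):
    """Return a problem message, or None if the file looks safe to split.

    Hypothetical helper; only the extension message is quoted from the article.
    """
    extension = os.path.splitext(path)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        return "This script only works with '.log' and '.txt' files."
    if not os.path.isfile(path):
        # Assumed wording; the article does not list the utility's other messages.
        return "File not found: " + path
    return None
```

Returning a message instead of raising an error keeps the check easy to wire into a UI that simply displays whatever text comes back.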
After splitting the large file, the utility saves and names each smaller file. For example, if you're splitting C:\data\massive.log into three smaller files, the smaller files will be named C:\data\massive~1.log, C:\data\massive~2.log, and C:\data\massive~3.log. If these smaller files already exist, they'll be overwritten.
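The naming scheme and the split itself can be sketched in a few lines of Python. This is an illustration of the behavior described above, not the HTA's implementation; `part_name` and `split_by_lines` are hypothetical names.

```python
import os


def part_name(path, index):
    """Build the name of a smaller file, e.g. massive.log -> massive~1.log."""
    root, extension = os.path.splitext(path)
    return "%s~%d%s" % (root, index, extension)


def split_by_lines(path, lines_per_file):
    """Copy path into numbered smaller files and return their names.

    Hypothetical helper. Existing smaller files are overwritten, which
    matches the behavior the article describes.
    """
    if lines_per_file < 1:
        raise ValueError("lines_per_file must be a positive whole number")
    written = []
    out = None
    try:
        with open(path, "rb") as source:
            # Stream one line at a time so memory use stays small.
            for number, line in enumerate(source):
                if number % lines_per_file == 0:
                    if out is not None:
                        out.close()
                    target = part_name(path, len(written) + 1)
                    out = open(target, "wb")  # "wb" replaces any existing file
                    written.append(target)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

Working in binary mode copies each line byte for byte, so the smaller files keep the original line endings and encoding.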
Depending on the size of the file you're splitting, the process could take a long time to finish (e.g., about five minutes to split a 100MB file into five files), so the utility's UI is hidden while the process runs. The UI reappears when the process completes.
When running the Log Splitter utility, you might get the following Microsoft Internet Explorer (IE) message: "A script on this page is causing Internet Explorer to run slowly. If it continues to run, your computer may become unresponsive. Do you want to abort the script?" If you do, abort the script and see the Microsoft article "How to set time-out period for script." This article tells you how to add a new registry entry named MaxScriptStatements to alleviate the problem. It's a relatively simple modification, but as with all registry changes, you need to use extreme caution. After I received this message, I set the MaxScriptStatements value to 100000000 (100 million). That value works well for me, but you could try a smaller value and see how it works on your computer.