How To Configure PowerShell to Process Data in Batches (Demo Script)
Batch processing in PowerShell is an effective technique for handling large datasets. Discover how batch processing works and how to incorporate it into your scripts.
May 10, 2024
One aspect of PowerShell that I have always liked is its flexibility. It’s possible to create a PowerShell object containing data from nearly any source. Even so, despite this versatility, PowerShell can behave erratically when an object becomes overly large.
When you ingest excessive amounts of data into PowerShell, you can rapidly deplete the available memory. This can, at best, lead to diminished performance and, at worst, cause scripts that normally run smoothly to produce a variety of errors.
There are many ways to prevent a PowerShell object from becoming too large. A method that I often use is known as batch processing.
Batch processing involves reading data in small, manageable batches rather than attempting to read the entire dataset all at once. Each batch is processed independently before moving on to the next batch of data. Eventually, all data within the dataset is processed, but only a fraction of the total dataset is loaded into an object at any given time.
How To Set Up Batch Processing
The approach to setting up batch processing can vary depending on the data source. For example, when dealing with SQL Server data, you might structure your SQL SELECT statements so that only a small portion of the data is retrieved at a time.
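As a sketch of what that might look like, the snippet below pages through a SQL Server table using an OFFSET/FETCH clause. The server, database, table, and key column names are all hypothetical, and Invoke-Sqlcmd requires the SqlServer module:

```powershell
# Sketch only: page through a SQL Server table in batches using OFFSET/FETCH.
# Server, database, table, and column names below are placeholders.
$BatchSize = 500
$Offset    = 0

do {
    $Query = @"
SELECT * FROM dbo.Customers
ORDER BY CustomerID
OFFSET $Offset ROWS FETCH NEXT $BatchSize ROWS ONLY
"@
    $Batch = Invoke-Sqlcmd -ServerInstance "SQLSERVER01" -Database "SalesDB" -Query $Query

    # Process $Batch here, then advance to the next page
    $Offset += $BatchSize
} while ($Batch.Count -eq $BatchSize)
```

Because each query returns at most $BatchSize rows, the database does the heavy lifting and only one page of results is ever held in PowerShell at a time.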
In most cases when I find myself working with a large dataset in PowerShell, the data is being ingested from a CSV file. As such, I would like to demonstrate how to set up batch processing for CSV data.
Here is the demonstration script that I have created:
# Specify the path to your CSV file
$CSVFilePath = "C:\scripts\uszips.csv"

# Count the data rows (subtract 1 to exclude the header line)
$DataRows = (Get-Content -Path $CSVFilePath | Measure-Object -Line).Lines - 1

# Define the batch size
$BatchSize = 500

# Initialize a counter to track processed rows
$RowCounter = 0

# Calculate the total number of batches
$TotalBatches = [Math]::Ceiling($DataRows / $BatchSize)

# Loop through the rows in batches
for ($BatchIndex = 1; $BatchIndex -le $TotalBatches; $BatchIndex++) {
    # Calculate the range of rows for the current batch
    $StartRow = $RowCounter + 1
    $EndRow = [Math]::Min($RowCounter + $BatchSize, $DataRows)

    # Read the next batch of rows from the CSV file
    # (Select-Object -Index is zero-based, so shift the 1-based range down by one)
    $Batch = Import-Csv -Path $CSVFilePath | Select-Object -Index (($StartRow - 1)..($EndRow - 1))

    # Process the batch (here, simply display the City field)
    $Batch.City

    Write-Host "Processed batch $BatchIndex of $TotalBatches (Rows $StartRow to $EndRow)" -ForegroundColor Green

    # Increment the row counter
    $RowCounter = $EndRow
}
How the Script Works
Now that I have provided you with the source code, let’s break down how this script works.
The first line of code sets up a variable named $CSVFilePath, pointing to a CSV file that will act as the data source. Since I needed a large CSV file, I used the zip code database that I had originally downloaded for my series on building PowerShell applications. The CSV file contains tens of thousands of rows of data, which makes it perfect for demonstrating batch processing.
Next, I created a variable named $DataRows to store the total number of rows of data that exist within the CSV file. Notice that I am allowing PowerShell to determine the number of rows rather than hardcoding a number.
From there, I established a variable named $BatchSize to define the number of rows within each batch. For this demonstration, I set the batch size to 500 rows of data.
Additionally, I created a $RowCounter variable that keeps track of the next row that needs to be read. The variable named $TotalBatches determines the total number of batches that will need to be processed based on the data rows and batch size.
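To make the batch math concrete, here is a quick illustration using made-up numbers (the file's actual row count is not stated in this article):

```powershell
# Illustrative numbers only: 10,250 data rows, batches of 500 rows
[Math]::Ceiling(10250 / 500)   # 21 -- the final batch holds just 250 rows
```

Rounding up with [Math]::Ceiling ensures that a partial batch at the end of the file still gets processed.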
The script then uses a For loop based on a value named $BatchIndex. The batch index is initially set to 1 and is incremented by 1 with each loop cycle. The loop continues so long as the batch index is less than or equal to the total number of batches. To put it another way, the loop executes once for each batch.

The first thing that happens inside of the loop is that I determine the starting and ending rows. The starting row is simply the $RowCounter variable plus 1. Since the row counter is initially set to 0, the first iteration of the loop sets the start row to 1. The ending row is equal to the row counter plus the batch size, or the total number of rows, whichever is smaller. In the first iteration, the row counter is 0 and the batch size is 500, so the ending row is either 500 or the total number of rows in the file, whichever is smaller. Since the file contains tens of thousands of rows, 500 is smaller, so the end row is set to 500 for this iteration of the loop.
The next line of code creates a variable named $Batch. This variable contains the contents of the CSV file from the starting line to the ending line. Note that these lines change with each loop iteration so that all the lines of data are eventually processed.
The $Batch.City line exists solely to demonstrate data processing. Since the CSV file is a collection of postal data, the City field indicates the associated city for each zip code. Here, I am simply displaying the city name. If I were doing any actual data processing, this is where the code for that processing would need to be placed.
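To give a sense of what real per-batch processing could look like, here is a hypothetical replacement for that line. The state_id column name and the output path are my own assumptions for illustration, not something the demo script defines:

```powershell
# Hypothetical processing step: keep only the rows for one state and
# append them to an output CSV (the state_id column name is assumed)
$Batch | Where-Object { $_.state_id -eq 'TX' } |
    Export-Csv -Path "C:\scripts\tx_zips.csv" -NoTypeInformation -Append
```

Because each batch is appended as it is processed, the output file grows incrementally without the full dataset ever being held in memory.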
Finally, the script displays that the current batch has been processed. The $RowCounter is set to be equal to the current ending row, and the loop proceeds to its next iteration. The cycle continues until all rows are processed.
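One caveat worth noting: because Import-Csv re-reads the entire file on every pass, this demo limits how much data lands in a single object at once rather than how much is read overall. For very large files, a streaming variant along these lines keeps only one chunk of lines in memory at a time. This is a sketch that assumes the header row contains no quoted commas:

```powershell
# Streaming sketch: -ReadCount emits the file in arrays of $BatchSize lines,
# so no more than one chunk is ever held in memory at once.
$CSVFilePath = "C:\scripts\uszips.csv"
$BatchSize   = 500
$Header      = Get-Content -Path $CSVFilePath -TotalCount 1
$Columns     = $Header -split ','   # naive split; assumes no quoted commas
$BatchIndex  = 0

Get-Content -Path $CSVFilePath -ReadCount $BatchSize | ForEach-Object {
    # $_ is an array of up to $BatchSize raw lines; drop the header line
    $Lines = $_ | Where-Object { $_ -cne $Header }
    if ($Lines) {
        $BatchIndex++
        $Batch = $Lines | ConvertFrom-Csv -Header $Columns
        $Batch.City   # placeholder processing, as in the main script
        Write-Host "Processed batch $BatchIndex" -ForegroundColor Green
    }
}
```

The tradeoff is that ConvertFrom-Csv must be told the column names explicitly, since the header line is stripped out before each chunk is converted.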
The figure below shows what happens when the script is run.
Figure 1. This is what happens when the batch processing script is run.