BINARY_CHECKSUM Query Better for Big Table Sampling

Selecting a random sampling of rows from a large table in SQL Server 2000 used to take a long time and use a lot of disk I/O. The BINARY_CHECKSUM query changes all that.


May 22, 2007


If you use SQL Server 2000, you've likely run into this problem: You want to select a random sampling of rows from a large table, but you're not sure how to do so. Having a random sampling of rows can be useful when you want to make a smaller version of the table or when you want to troubleshoot a problem by seeing what kinds of rows are in the table.

To get a random sampling, you might be tempted to select the top n rows from the table. However, this sample isn't random, and the first n rows aren't necessarily representative of the whole table. Other solutions exist that involve adding columns to the tables, but adding columns isn't always possible or practical.

The standard way to grab random rows from a small table is to use a query such as

SELECT TOP 10 PERCENT *
FROM Table1
ORDER BY NEWID()

The key here is the NEWID function, which generates a globally unique identifier (GUID) in memory for each row. By definition, the GUID is unique and fairly random, so when you sort by that GUID with the ORDER BY clause, you get a random ordering of the rows in the table. Taking the top 10 percent (or whatever percentage you want) will give you a random sampling of the rows in the table.
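If you need a fixed number of rows rather than a percentage, the same idea works with a plain TOP n; the row count and table name here are placeholders, not values from the original query:

-- 1000 is an arbitrary sample size; Table1 is a placeholder table name
SELECT TOP 1000 *
FROM Table1
ORDER BY NEWID()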

The NEWID query is often suggested in discussion groups when someone asks how to select random rows. It's simple and works very well for small tables. However, the NEWID query has a big drawback when you use it on large tables. The ORDER BY clause causes all the rows in the table to be copied into the tempdb database, where they're sorted. This causes two problems (a quick way to observe them follows the list):

  1. The sorting operation usually has a high cost associated with it. Sorting can use a lot of disk I/O and can run for a long time.

  2. In the worst-case scenario, tempdb can run out of space. In the best-case scenario, tempdb can take up a large amount of disk space that will never be reclaimed without a manual shrink command.
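If you want to see this cost on your own tables, SQL Server's statistics settings report the reads and elapsed time for a query; this is only an observation sketch, and Table1 is a placeholder name:

-- Show I/O and timing information for the queries that follow
SET STATISTICS IO ON
SET STATISTICS TIME ON

-- The plan for this query includes a Sort step; on a large table the
-- reported reads and elapsed time grow quickly as the table grows.
SELECT TOP 10 PERCENT *
FROM Table1
ORDER BY NEWID()

SET STATISTICS IO OFF
SET STATISTICS TIME OFF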

What you need is a way to randomly select rows that won't use tempdb and won't get much slower as the table gets larger. Here's a new idea on how to do that:

SELECT * FROM Table1
WHERE (ABS(CAST(
  (BINARY_CHECKSUM(*) *
  RAND()) as int)) % 100) < 10

The basic idea behind this query is to generate a random number between 0 and 99 for each row in the table, then choose all those rows whose random number is less than the value of the specified percent. In this example, we want about 10 percent of the rows selected randomly, so we choose all rows whose random number is less than 10.
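If you'd like the sample size to be adjustable, the threshold can come from a variable; @SamplePercent and Table1 below are illustrative names and aren't part of the original query:

-- Roughly @SamplePercent percent of the rows will satisfy the WHERE clause
DECLARE @SamplePercent int
SET @SamplePercent = 10

SELECT * FROM Table1
WHERE (ABS(CAST(
  (BINARY_CHECKSUM(*) *
  RAND()) as int)) % 100) < @SamplePercent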

Let's take a closer look at how the ABS(CAST((BINARY_CHECKSUM(*) * RAND()) as int)) portion of this query works. The BINARY_CHECKSUM function generates a checksum value based on the values of the columns you specify. If two rows are different, they'll typically generate different checksum numbers. The BINARY_CHECKSUM function is generally used to verify whether any of the columns in a row in a table have changed. However, for our purposes, it generates a number that looks like a random number for each row.
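To get a feel for the values BINARY_CHECKSUM produces, you can return it alongside each row; keycol1 and Table1 are placeholder names:

-- Each row's checksum is computed from all of its column values.
-- Unmodified rows return the same checksum on every run.
SELECT keycol1, BINARY_CHECKSUM(*) AS row_checksum
FROM Table1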

The shortcoming of using the BINARY_CHECKSUM function for our purpose is that every time it's used on a row that hasn't been modified, it returns the same checksum number. Thus, when used by itself, subsequent runs of the query return the same "random" set of rows, which obviously isn't desirable.

To fix this shortcoming, we added the RAND function to the BINARY_CHECKSUM query. The RAND function scrambles the numbers returned by the BINARY_CHECKSUM function. Thus, we get a different set of rows each time the query is run, making it truly random. The ABS and CAST functions are used because BINARY_CHECKSUM(*) * RAND() returns a float that can be a negative number.
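To see what each piece contributes, you can project the intermediate values as extra columns; keycol1 and Table1 are placeholders, and the numbers change on every run because of RAND:

SELECT keycol1,
  BINARY_CHECKSUM(*) AS checksum_value,  -- can be negative
  ABS(CAST((BINARY_CHECKSUM(*) * RAND()) as int)) % 100
    AS random_bucket                     -- always 0 through 99
FROM Table1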

The asterisk (*) in BINARY_CHECKSUM(*) tells the function to use all the columns in the row in its calculations. Alternatively, you can specify a subset of the columns in place of the asterisk. Because this function is CPU-intensive, specifying the minimum number of columns or minimum number of bytes will give you the best performance. The best candidates are the columns in a unique index. If you decide to use specific columns instead of all the columns, you can add NEWID() as a column in the BINARY_CHECKSUM function so that the BINARY_CHECKSUM query returns a random number each time. Thus, you don't need to use RAND in the query, which simplifies it slightly:

SELECT * FROM Table1
WHERE (ABS(CAST(
  (BINARY_CHECKSUM(keycol1, NEWID())) as int))
  % 100) < 10

Because there's no sorting involved in the BINARY_CHECKSUM query, only a single pass through the table is required to choose n percent of the rows. The time and I/O grow only in proportion to the size of the table.

To test the BINARY_CHECKSUM query against the NEWID query, we set up three large tables containing 1 million rows (435MB), 7 million rows (3GB), and 14 million rows (5.4GB) on an HP ProLiant DL580 G2 server with 1GB of memory, four 2.2GHz Intel processors, and eight 36GB disks in a RAID 1+0 configuration. As Table 1 shows, the BINARY_CHECKSUM query saves a lot of time and I/O compared with the NEWID query.

The Microsoft SQL Server team realized that not being able to easily take random samples of rows was a common problem in SQL Server 2000, so the team addressed the problem in SQL Server 2005 by introducing the TABLESAMPLE clause.

This clause selects a subset of rows by choosing random data pages and returning all the rows on those pages. However, for those of us who still have products running on SQL Server 2000, who need backward compatibility, or who need truly row-level randomness, the BINARY_CHECKSUM query is a very effective option.
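For readers on SQL Server 2005, here's roughly what the TABLESAMPLE alternative looks like; because it samples whole data pages rather than individual rows, the number of rows returned is only approximate:

-- SQL Server 2005 and later; samples pages, not individual rows
SELECT *
FROM Table1 TABLESAMPLE (10 PERCENT)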
—Marcelo De Barros and Kenton Gidewall
