Programming SQL in a Set-Based Way

Pull yourself out of your comfort zone and think in a new way

Kurt Survance

June 23, 2010


As T-SQL programmers, we always hear that the SQL language is optimized for set-based solutions rather than procedural solutions, but we seldom see examples from that perspective. Consequently, many beginning SQL programmers don’t have a clear understanding of what set-based means in terms of the code they need to write to solve a specific problem.

Even for those who understand the concept, there are many programming problems for which a set-based solution seems impossible. Sometimes that's true. It's not always possible to find a set-based solution, but most of the time we can find one by using a little creative thinking. A good SQL programmer must develop the mental discipline to explore set-based possibilities thoroughly before falling back on the intuitive procedural solution.

In this article, I provide a relatively simple example that illustrates how to think in a set-based way about a common type of problem that also has an intuitive procedural solution.

The Business Case

When you visit the doctor’s office, the first thing the nurse does is put you on a scale, record your weight, and check your height. Checking your weight makes sense from a medical point of view, but have you ever wondered why the nurse records your height each time? Unless you're very young, your height hasn’t changed since your last visit and isn't likely to change again.

The reason the nurse checks your height is to guard against identity theft. Health care providers want to make sure that the services they provide are going to the person who gets the bill—not to an imposter with a forged identity card.

This kind of identity theft happens more frequently than you might think. HIPAA regulations now require an audit of changes in permanent physical characteristics in a patient’s history that might suggest identity theft.

Querying this kind of information provides a good example for comparing procedural thinking and set-based thinking when programming in SQL.

The Problem Statement

The generic programming problem is that the solution depends on the order of rows and requires the comparison of current row values with values in previous rows. This is a type of problem in which the procedural solution is intuitive, but the set-based solution isn't so obvious.

In this particular problem, we're looking for rows where a previous visit for the same patient has a height value that's different from the height on the current record. We want to return the patient’s unique medical record number, the date the change occurred, what the height was changed from, and what the height was changed to. We don't want to return any records that don't mark a change in height.

Listing 1 gives you the code to create and populate the tables in this example, if you'd like to run the example yourself.

Listing 1: Creating and populating the tables

We use the AdventureWorks sample database to create the tables for our test, but you can use another database by changing the USE statements in all three listings.

USE AdventureWorks;
SET NOCOUNT ON;

CREATE TABLE Dates
(
    ID int,
    VisitDate datetime
);

-- populate table with 20 visit dates, one week apart
DECLARE @i int, @startdate datetime;
SET @i = 1;
SET @startdate = GETDATE();
WHILE @i <= 20
BEGIN
    INSERT Dates (ID, VisitDate)
    VALUES (@i, @startdate);

    SET @startdate = DATEADD(dd, 7, @startdate);
    SET @i = @i + 1;
END

CREATE TABLE PatientHeight
(
    PatientID int NOT NULL,
    Height int
);

-- populate table with 10,000 patient IDs with heights between 59 and 74 inches
SET @i = 1;
WHILE @i <= 10000
BEGIN
    INSERT PatientHeight (PatientID, Height)
    VALUES (@i, @i % 16 + 59);

    SET @i = @i + 1;
END

ALTER TABLE PatientHeight ADD CONSTRAINT PK_PatientHeight
    PRIMARY KEY (PatientID);

-- cartesian join produces 200,000 PatientVisit records
SELECT
    ISNULL(PatientID, -1) AS PatientID,
    ISNULL(VisitDate, '19000101') AS VisitDate,
    Height
INTO PatientVisit
FROM PatientHeight
CROSS JOIN Dates;

ALTER TABLE PatientVisit ADD CONSTRAINT PK_PatientVisit
    PRIMARY KEY (PatientID, VisitDate);

-- create changes of height
SET @i = 3;
WHILE @i < 10000
BEGIN
    UPDATE pv
    SET Height = Height + 2
    FROM PatientVisit pv
    WHERE PatientID = @i
    AND pv.VisitDate =
        (SELECT TOP 1 VisitDate
         FROM Dates
         WHERE ID = ABS(CHECKSUM(@i)) % 19);

    SET @i = @i + 7;
END

/*
-- return AdventureWorks to its previous state when you are finished
-- with this example.
DROP TABLE Dates;
DROP TABLE PatientHeight;
DROP TABLE PatientVisit;
*/

A Procedural Approach

The intuitive, procedural way to attack this problem is to order the records by patient and visit date, then loop through the records for each patient one row at a time. We query the first record for the patient and save the patient’s original height in a variable. Then, we loop through subsequent records for the patient, comparing height values. If we find that the height is different on a subsequent record, we write an audit record, update the height variable with the current value, and continue looping through the rows. Then we move to the next patient.

Listing 2 contains the code for the cursor-based solution. The cursor method works, but it's very inefficient. It could pose a serious performance problem when working with a large number of rows. How can we do this in a set-based and presumably more efficient way?

Listing 2: The cursor-based solution (USE AdventureWorks)

CREATE TABLE #Changes
(
    PatientID int,
    VisitDate datetime,
    BeginHeight smallint,
    CurrentHeight smallint
);

DECLARE @PatientID int,
    @CurrentID int,
    @BeginHeight smallint,
    @CurrentHeight smallint,
    @VisitDate datetime;

SET @PatientID = 0;

DECLARE Patient_cur CURSOR FAST_FORWARD FOR
SELECT PatientID, VisitDate, Height
FROM PatientVisit
ORDER BY PatientID, VisitDate;

OPEN Patient_cur;

FETCH NEXT FROM Patient_cur INTO @CurrentID, @VisitDate, @CurrentHeight;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- first record for this patient: remember the starting height
    IF @PatientID <> @CurrentID
    BEGIN
        SET @PatientID = @CurrentID;
        SET @BeginHeight = @CurrentHeight;
    END

    -- height differs from the last recorded value: write an audit row
    IF @BeginHeight <> @CurrentHeight
    BEGIN
        INSERT #Changes (PatientID, VisitDate, BeginHeight, CurrentHeight)
        VALUES (@PatientID, @VisitDate, @BeginHeight, @CurrentHeight);

        SET @BeginHeight = @CurrentHeight;
    END

    FETCH NEXT FROM Patient_cur INTO @CurrentID, @VisitDate, @CurrentHeight;
END

CLOSE Patient_cur;
DEALLOCATE Patient_cur;

SELECT * FROM #Changes;
DROP TABLE #Changes;

A Set-Based Approach

The difference between a procedural and set-based solution boils down to the way you define the problem. Stated in its simplest form, the change we're interested in involves only two records: two consecutive visits by the same patient. Everything else is irrelevant.

We start by ordering the data by the patient’s ID number and then by visit date. In that way, the records of consecutive visits by the same patient are adjacent to each other. The problem is then reduced to finding a way to join consecutive records from this set.

When we understand the problem in that way, the solution isn't so difficult to discover. We need to create a sequence number for the sorted rows that can be used to join one record with the next in a self-join.

We can create a common table expression (CTE) populated with patient data sorted by PatientID and VisitDate, adding a sequential ID using the ROW_NUMBER() function.

We can then self-join the CTE like this:

…
from CTE t1
join CTE t2 on t2.ROWID = t1.ROWID + 1
…

This produces a set of records that represents every possible opportunity for the patient’s height to change: each row of the result combines the data from two consecutive rows of the sorted data set.
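To make this intermediate set concrete, here is a minimal sketch that runs the same numbering and self-join against a tiny table variable. The @Visits table, its six rows, and the output column names are made up for illustration and are not part of the article's test data.

```sql
-- Hypothetical sample data: two patients, three visits each.
DECLARE @Visits TABLE
(
    PatientID int      NOT NULL,
    VisitDate datetime NOT NULL,
    Height    int      NOT NULL
);

INSERT @Visits (PatientID, VisitDate, Height) VALUES (1, '20100601', 70);
INSERT @Visits (PatientID, VisitDate, Height) VALUES (1, '20100608', 70);
INSERT @Visits (PatientID, VisitDate, Height) VALUES (1, '20100615', 72);
INSERT @Visits (PatientID, VisitDate, Height) VALUES (2, '20100601', 65);
INSERT @Visits (PatientID, VisitDate, Height) VALUES (2, '20100608', 65);
INSERT @Visits (PatientID, VisitDate, Height) VALUES (2, '20100615', 65);

-- Number the sorted rows, then pair each row with the row that follows it.
WITH Numbered AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY PatientID, VisitDate) AS ROWID,
           PatientID, VisitDate, Height
    FROM @Visits
)
SELECT t1.ROWID     AS PrevRow,
       t1.PatientID AS PrevPatient,
       t1.Height    AS PrevHeight,
       t2.ROWID     AS NextRow,
       t2.PatientID AS NextPatient,
       t2.Height    AS NextHeight
FROM Numbered t1
JOIN Numbered t2 ON t2.ROWID = t1.ROWID + 1
ORDER BY t1.ROWID;
```

The six sample rows yield five pairs. One of them (rows 3 and 4) pairs the last visit of patient 1 with the first visit of patient 2, which is why the pairs still have to be filtered on matching PatientID values before the heights are compared.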

At this point, filtering out the records that don't represent a change is trivial. We simply review our statement of the problem: To qualify as a record of interest, the patient must be the same in consecutive visits but the two heights must be different. Listing 3 contains the code that implements this set-based method.

Listing 3: The set-based solution (USE AdventureWorks)

WITH PV_RN AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY PatientID, VisitDate) AS ROWID, *
    FROM PatientVisit
)
SELECT t1.PatientID,
    t2.VisitDate AS DateChanged,
    t1.Height AS HeightChangedFrom,
    t2.Height AS HeightChangedTo
FROM PV_RN t1
JOIN PV_RN t2 ON t2.ROWID = t1.ROWID + 1
WHERE t1.PatientID = t2.PatientID
AND t1.Height <> t2.Height
ORDER BY t1.PatientID, t2.VisitDate;
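On SQL Server 2012 and later, the same comparison with the previous row can also be written without a self-join by using the LAG window function. This is a different technique from Listing 3, not the method measured in this article, so treat it as a sketch; it assumes the PatientVisit table from Listing 1, and the CTE name PV_Prev and the column PrevHeight are made up for illustration.

```sql
-- Alternative sketch: assumes SQL Server 2012 or later
-- and the PatientVisit table created in Listing 1.
WITH PV_Prev AS
(
    SELECT PatientID,
           VisitDate,
           Height,
           -- Height at the same patient's previous visit; NULL for the first visit.
           LAG(Height) OVER (PARTITION BY PatientID ORDER BY VisitDate) AS PrevHeight
    FROM PatientVisit
)
SELECT PatientID,
       VisitDate  AS DateChanged,
       PrevHeight AS HeightChangedFrom,
       Height     AS HeightChangedTo
FROM PV_Prev
WHERE Height <> PrevHeight   -- a NULL PrevHeight never satisfies <>, so first visits drop out
ORDER BY PatientID, VisitDate;
```

Because the window is partitioned by PatientID, a row is never compared with a row from another patient, so no separate filter on matching patient IDs is needed.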

Relative Performance of the Two Methods

In Listing 1, we created the PatientVisit table and populated it with 200,000 records containing the PatientID, the VisitDate, and the Height recorded for that visit. The table contains about 2,600 records that represent a change in height for a patient.

We used SQL Profiler to capture execution statistics for the two methods. First, we flushed the buffers to get the cold execution statistics, then we reran the query to get hot execution statistics after the data was in cache. The cursor and the set-based code returned identical results. Table 1 shows the execution statistics for each. Notice the huge difference in logical reads. This difference of roughly 160:1 can be a showstopper in many situations. CPU and Duration are roughly eight times as high in the cursor solution.

Method      Execution   Duration   Reads     CPU
Set-Based   Cold        503        1298      515
Cursor      Cold        4090       203646    3931
Set-Based   Hot         476        1248      484
Cursor      Hot         3958       203728    3713

Table 1: Execution Statistics
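If you don't have SQL Profiler at hand, a rough way to reproduce the cold and hot measurements is to clear the buffer cache and turn on the session statistics options. The following is a sketch under that assumption, not the procedure used to produce Table 1, and it should only be run on a test server because DBCC DROPCLEANBUFFERS empties the cache for the whole instance.

```sql
-- Sketch only: run on a test server, never in production.
USE AdventureWorks;

-- Cold run: write dirty pages to disk, then empty the buffer cache.
CHECKPOINT;
DBCC DROPCLEANBUFFERS;

-- Report logical reads and CPU/elapsed time with the query's messages.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Query under test (here, the set-based query from Listing 3).
WITH PV_RN AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY PatientID, VisitDate) AS ROWID, *
    FROM PatientVisit
)
SELECT t1.PatientID,
       t2.VisitDate AS DateChanged,
       t1.Height    AS HeightChangedFrom,
       t2.Height    AS HeightChangedTo
FROM PV_RN t1
JOIN PV_RN t2 ON t2.ROWID = t1.ROWID + 1
WHERE t1.PatientID = t2.PatientID
AND t1.Height <> t2.Height
ORDER BY t1.PatientID, t2.VisitDate;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

For the hot run, execute the query again without the CHECKPOINT and DBCC DROPCLEANBUFFERS steps. For the cursor version, these options report statistics statement by statement, so a trace is likely the easier way to get totals for the whole batch.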

The auditing requirements for a large healthcare provider can easily generate a million rows per day in the audit table. So, even if you run your audit reports for only a single day’s data, you'll have a lot of rows to process—far too many for a cursor or other looping mechanism to handle efficiently.

Set-Based Thinking

Note that the more efficient solution operates on whole sets of data, not on the individual rows. Compare this with the cursor solution, in which operations are repeated for each row in a set.

Nothing in this simple example is rocket science. You'll encounter SQL problems that are much more difficult to solve in a set-based way and some that are impossible. However, even this example requires a significant mental adjustment for programmers new to SQL programming. It requires a conscious effort to pull yourself out of your comfort zone and think in a new way. Even in the most difficult situations, don’t give up on a set-based solution until you've given it a fair amount of thought.
