Programming SQL in a Set-Based Way
Pull yourself out of your comfort zone and think in a new way
June 23, 2010
As T-SQL programmers, we always hear that the SQL language is optimized for set-based solutions rather than procedural solutions, but we seldom see examples from that perspective. Consequently, many beginning SQL programmers don’t have a clear understanding of what set-based means in terms of the code they need to write to solve a specific problem.
Even for those who understand the concept, there are many programming problems for which a set-based solution seems impossible. Sometimes that's true. It's not always possible to find a set-based solution, but most of the time we can find one by using a little creative thinking. A good SQL programmer must develop the mental discipline to explore set-based possibilities thoroughly before falling back on the intuitive procedural solution.
In this article, I provide a relatively simple example that illustrates how to think in a set-based way about a common type of problem that also has an intuitive procedural solution.
The Business Case
When you visit the doctor’s office, the first thing the nurse does is put you on a scale, record your weight, and check your height. Checking your weight makes sense from a medical point of view, but have you ever wondered why the nurse records your height each time? Unless you're very young, your height hasn’t changed since your last visit and isn't likely to change again.
The reason the nurse checks your height is to guard against identity theft. Health care providers want to make sure that the services they provide are going to the person who gets the bill—not to an imposter with a forged identity card.
This kind of identity theft happens more frequently than you might think. HIPAA regulations now require an audit of changes in permanent physical characteristics in a patient’s history that might suggest identity theft.
Querying this kind of information provides a good example for comparing procedural thinking and set-based thinking when programming in SQL.
The Problem Statement
The generic programming problem is that the solution depends on the order of rows and requires the comparison of current row values with values in previous rows. This is a type of problem in which the procedural solution is intuitive, but the set-based solution isn't so obvious.
In this particular problem, we're looking for rows where a previous visit for the same patient has a height value that's different from the height on the current record. We want to return the patient’s unique medical record number, the date the change occurred, what the height was changed from, and what the height was changed to. We don't want to return any records that don't mark a change in height.
Listing 1 gives you the code to create and populate the tables in this example, if you'd like to run the example yourself.
Listing 1: Creating and populating the tables
We use the AdventureWorks sample database to create the tables for our test, but you can use another database by changing the USE statements in all three listings.
USE AdventureWorks;
SET NOCOUNT ON;

CREATE TABLE Dates
(
    ID int,
    VisitDate datetime
);

-- populate table with 20 visit dates, one week apart
DECLARE @i int, @startdate datetime;
SET @i = 1;
SET @startdate = GETDATE();
WHILE @i <= 20
BEGIN
    INSERT Dates (ID, VisitDate) VALUES (@i, @startdate);
    SET @startdate = DATEADD(dd, 7, @startdate);
    SET @i = @i + 1;
END

CREATE TABLE PatientHeight
(
    PatientID int not null,
    Height int
);

-- populate table with 10,000 patient IDs with heights between 59 and 74 inches
SET @i = 1;
WHILE @i <= 10000
BEGIN
    INSERT PatientHeight (PatientID, Height) VALUES (@i, @i % 16 + 59);
    SET @i = @i + 1;
END

ALTER TABLE PatientHeight ADD CONSTRAINT PK_PatientHeight PRIMARY KEY (PatientID);

-- cartesian join produces 200,000 PatientVisit records
SELECT ISNULL(PatientID, -1) AS PatientID,
       ISNULL(VisitDate, '19000101') AS VisitDate,
       Height
INTO PatientVisit
FROM PatientHeight
CROSS JOIN Dates;

ALTER TABLE PatientVisit ADD CONSTRAINT PK_PatientVisit PRIMARY KEY (PatientID, VisitDate);

-- create changes of height
SET @i = 3;
WHILE @i < 10000
BEGIN
    UPDATE pv
    SET Height = Height + 2
    FROM PatientVisit pv
    WHERE PatientID = @i
      AND pv.VisitDate = (SELECT TOP 1 VisitDate FROM Dates where id = ABS(CHECKSUM(@i)) % 19);
    SET @i = @i + 7;
END

/*
-- return AdventureWorks to its previous state when you are finished
-- with this example.
DROP TABLE Dates;
DROP TABLE PatientHeight;
DROP TABLE PatientVisit;
*/
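Before moving on, it can be worth confirming that the setup script produced the data you expect. The query below is only a sanity check and is not part of the original listings; the column aliases are made up for illustration. With 10,000 patients and 20 visit dates, the cross join should yield 200,000 rows.

```sql
-- Sanity check (illustrative, not one of the article's listings):
-- run in the database where Listing 1 created PatientVisit.
SELECT COUNT(*)                  AS VisitRows,   -- 10,000 patients x 20 dates = 200,000
       COUNT(DISTINCT PatientID) AS Patients,    -- 10,000
       COUNT(DISTINCT VisitDate) AS VisitDates   -- 20
FROM PatientVisit;
```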
A Procedural Approach
The intuitive, procedural way to attack this problem is to order the records by patient and visit date, then loop through the records for each patient one row at a time. We query the first record for the patient and save the patient’s original height in a variable. Then, we loop through subsequent records for the patient, comparing height values. If we find that the height is different on a subsequent record, we write an audit record, update the height variable with the current value, and continue looping through the rows. Then we move to the next patient.
Listing 2 contains the code for the cursor-based solution. The cursor method works, but it's very inefficient. It could pose a serious performance problem when working with a large number of rows. How can we do this in a set-based and presumably more efficient way?
Listing 2: The cursor-based solution (USE AdventureWorks)
CREATE TABLE #Changes
(
    PatientID int,
    VisitDate datetime,
    BeginHeight smallint,
    CurrentHeight smallint
);

DECLARE @PatientID int,
        @CurrentID int,
        @BeginHeight smallint,
        @CurrentHeight smallint,
        @VisitDate datetime;

SET @PatientID = 0;

DECLARE Patient_cur CURSOR FAST_FORWARD FOR
    SELECT PatientID, VisitDate, Height
    FROM PatientVisit
    ORDER BY PatientID, VisitDate;

OPEN Patient_cur;
FETCH NEXT FROM Patient_cur INTO @CurrentID, @VisitDate, @CurrentHeight;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- first record for this patient
    IF @PatientID <> @CurrentID
    BEGIN
        SET @PatientID = @CurrentID;
        SET @BeginHeight = @CurrentHeight;
    END

    IF @BeginHeight <> @CurrentHeight
    BEGIN
        INSERT #Changes (PatientID, VisitDate, BeginHeight, CurrentHeight)
        VALUES (@PatientID, @VisitDate, @BeginHeight, @CurrentHeight);
        SET @BeginHeight = @CurrentHeight;
    END

    FETCH NEXT FROM Patient_cur INTO @CurrentID, @VisitDate, @CurrentHeight;
END

CLOSE Patient_cur;
DEALLOCATE Patient_cur;

SELECT * FROM #Changes;
DROP TABLE #Changes;
A Set-Based Approach
The difference between a procedural and set-based solution boils down to the way you define the problem. Stated in its simplest form, the change we're interested in involves only two records: two consecutive visits by the same patient. Everything else is irrelevant.
We start by ordering the data by the patient’s ID number and then by visit date. In that way, the records of consecutive visits by the same patient are adjacent to each other. The problem is then reduced to finding a way to join consecutive records from this set.
When we understand the problem in that way, the solution isn't so difficult to discover. We need to create a sequence number for the sorted rows that can be used to join one record with the next in a self-join.
We can create a common table expression (CTE) populated with patient data sorted by PatientID and VisitDate, adding a sequential ID using the ROW_NUMBER() function.
We can then join this CTE to itself, like this:
… from CTE t1
join CTE t2 on t2.ROWID = t1.ROWID + 1 …
This will produce a set of records that represents every possible opportunity for the value of the patient’s height to change: each joined row contains the data from one pair of consecutive records in the original data set.
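To see the pairing in isolation, here is a minimal sketch on a five-row table variable. The sample rows, the @Visit variable, and the column aliases are hypothetical and are not taken from Listing 1; the multi-row VALUES clause assumes SQL Server 2008 or later.

```sql
-- Hypothetical sample data: patient 1 has three visits, patient 2 has two.
DECLARE @Visit TABLE (PatientID int, VisitDate datetime, Height int);

INSERT @Visit (PatientID, VisitDate, Height)
VALUES (1, '20100101', 70),
       (1, '20100108', 70),
       (1, '20100115', 72),
       (2, '20100101', 65),
       (2, '20100108', 65);

-- Number the rows in patient/date order, then pair each row with the next one.
WITH Numbered AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY PatientID, VisitDate) AS ROWID,
           PatientID, VisitDate, Height
    FROM @Visit
)
SELECT t1.ROWID     AS PrevRow,
       t2.ROWID     AS NextRow,
       t1.PatientID AS PrevPatient,
       t2.PatientID AS NextPatient,
       t1.Height    AS PrevHeight,
       t2.Height    AS NextHeight
FROM Numbered t1
JOIN Numbered t2 ON t2.ROWID = t1.ROWID + 1
ORDER BY t1.ROWID;
```

Five input rows give four pairs. One of them (rows 3 and 4) joins the last visit of patient 1 to the first visit of patient 2, which is why the final query also has to require that both rows belong to the same patient.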
At this point, filtering out the records that don't represent a change is trivial. We simply review our statement of the problem: To qualify as a record of interest, the patient must be the same in consecutive visits but the two heights must be different. Listing 3 contains the code that implements this set-based method.
Listing 3: The set-based solution (USE AdventureWorks)
WITH PV_RN AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY PatientID, VisitDate) AS ROWID, *
    FROM PatientVisit
)
select t1.PatientID,
       t2.VisitDate as DateChanged,
       t1.Height as HeightChangedFrom,
       t2.Height as HeightChangedTo
from PV_RN t1
join PV_RN t2 on t2.ROWID = t1.ROWID + 1
where t1.patientid = t2.patientid
  and t1.Height <> t2.Height
order by t1.PatientID, t2.VisitDate;
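If you are running SQL Server 2012 or later, the same "compare with the previous row" idea can also be written with the LAG window function. This is an alternative sketch, not the method used in Listing 3, and it was not part of the measurements in this article; the CTE name PV_Lag and the PrevHeight alias are made up for illustration.

```sql
-- Alternative sketch using LAG (requires SQL Server 2012 or later).
-- Run in the database where Listing 1 created PatientVisit.
WITH PV_Lag AS
(
    SELECT PatientID,
           VisitDate,
           Height,
           -- height recorded at this patient's previous visit; NULL on the first visit
           LAG(Height) OVER (PARTITION BY PatientID ORDER BY VisitDate) AS PrevHeight
    FROM PatientVisit
)
SELECT PatientID,
       VisitDate  AS DateChanged,
       PrevHeight AS HeightChangedFrom,
       Height     AS HeightChangedTo
FROM PV_Lag
WHERE PrevHeight <> Height   -- a NULL PrevHeight never satisfies <>, so first visits drop out
ORDER BY PatientID, VisitDate;
```

Because the window is partitioned by PatientID, no separate check that both rows belong to the same patient is needed.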
Relative Performance of the Two Methods
In Listing 1, we created the PatientVisit table and populated it with 200,000 records containing the PatientID, the VisitDate, and the Height recorded for that visit. The table contains about 2,600 records that represent a change in height for a patient.
We used SQL Profiler to capture execution statistics for the two methods. First, we flushed the buffers to get the cold execution statistics, then we re-ran the query to get hot execution statistics after the data was in cache. Both the cursor and the set-based code returned identical results. Table 1 shows the execution statistics for each. Notice the huge difference in logical reads. This roughly 160:1 difference can be a show stopper in many situations. CPU and Duration are roughly eight times as high in the cursor solution.
Method | Execution | Duration | Reads | CPU |
Set-Based | Cold | 503 | 1298 | 515 |
Cursor | Cold | 4090 | 203646 | 3931 |
Set-Based | Hot | 476 | 1248 | 484 |
Cursor | Hot | 3958 | 203728 | 3713 |
Table 1: Execution Statistics
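If you want to collect comparable numbers without SQL Profiler, one possible approach is sketched below. The article does not say exactly how the buffers were flushed, so the CHECKPOINT and DBCC DROPCLEANBUFFERS steps are an assumption, and SET STATISTICS IO and SET STATISTICS TIME are a substitute for the Profiler trace, not the method used for Table 1. The COUNT query is only a placeholder for the listing you want to measure.

```sql
-- Assumed approach for a cold-cache run. Use a test server only:
-- DBCC DROPCLEANBUFFERS empties the buffer pool for the whole instance.
CHECKPOINT;              -- write dirty pages to disk so they can be dropped
DBCC DROPCLEANBUFFERS;   -- remove clean pages from the buffer pool

SET STATISTICS IO ON;    -- report logical and physical reads per table
SET STATISTICS TIME ON;  -- report CPU and elapsed time per statement

-- Placeholder: replace with the body of Listing 2 or Listing 3.
SELECT COUNT(*) FROM PatientVisit;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Run the measured code a second time without the CHECKPOINT and DBCC steps to get the hot-cache figures. For the cursor in Listing 2, these settings report statistics for every statement in the loop, so the output is far more verbose than a Profiler trace of the batch.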
The auditing requirements for a large healthcare provider can easily generate a million rows per day in the audit table. So, even if you run your audit reports for only a single day’s data, you'll have a lot of rows to process—far too many for a cursor or other looping mechanism to handle efficiently.
Set-Based Thinking
Note that the more efficient solution operates on whole sets of data, not on the individual rows. Compare this with the cursor solution, in which operations are repeated for each row in a set.
Nothing in this simple example is rocket science. You'll encounter SQL problems that are much more difficult to solve in a set-based way and some that are impossible. However, even this example requires a significant mental adjustment for programmers new to SQL programming. It requires a conscious effort to pull yourself out of your comfort zone and think in a new way. Even in the most difficult situations, don’t give up on a set-based solution until you've given it a fair amount of thought.