The Last non NULL Puzzle

Two solutions to a common problem

Itzik Ben-Gan

January 8, 2015

7 Min Read
missing puzzle piece

One of my fellow SQL Server MVPs, Simon Sabin, recently introduced a puzzle in a private group. Initially, the task seemed to be straightforward; however, as it turned out, solving the puzzle efficiently requires some creativity.

I asked Simon's permission to share this puzzle with SQL Server Pro readers. The need seems to be common enough that I think readers will be interested in a discussion of the topic, as well as finding efficient solutions.

The Puzzle

The puzzle is pretty simple. (Aren't simple puzzles the best ones?) Given a table T1, with a key column called id and a NULLable value column called col1, return the last non NULL col1 value based on id order.

Run the code in Listing 1 to create the table T1 and fill it with a small set of sample data.

-- DDL for T1SET NOCOUNT ON;USE tempdb;IF OBJECT_ID(N'dbo.T1', N'U') IS NOT NULL DROP TABLE dbo.T1;GOCREATE TABLE dbo.T1(  id INT NOT NULL CONSTRAINT PK_T1 PRIMARY KEY,  col1 INT NULL);-- Small set of sample dataTRUNCATE TABLE dbo.T1;INSERT INTO dbo.T1(id, col1) VALUES  ( 2, NULL),  ( 3,   10),  ( 5,   -1),  ( 7, NULL),  (11, NULL),  (13,  -12),  (17, NULL),  (19, NULL),  (23, 1759);

Figure 1 shows the desired result with the small set of sample data.

id      col1    lastval----------- ----------- -----------2       NULL    NULL3       10      105       -1      -17       NULL    -111      NULL    -113      -12     -1217      NULL    -1219      NULL    -1223      1759    1759

You might have faced a similar need yourself—but in case you haven't and are looking for a practical scenario, consider the following.

T1 stores a sequence of events, with the id column representing their order. An event has a number of attributes, one of which is represented by col1. When an attribute value changes, the event row shows the new value; for attributes whose values are unchanged, the event row shows NULLs.

Your task is to show with every event the applicable values (last non NULL) of all attributes. This is a simplified example of what Simon had in mind when he presented the puzzle. However, the simplified case is adequate in representing the fundamental need for an efficient solution for finding the last non NULL.

As always, I urge you to try to solve the puzzle yourself before looking at my solutions. It's an excellent challenge and can help sharpen your skills in using window functions, irrespective of the practicality of this specific case.

Use the small set of sample data that Listing 1 generates to check the validity of your solution. Compare the result that you get with the desired result shown in Figure 1.

Use the code in Listing 2 to populate the table with 10,000,000 rows so that you can test the performance of your solution.

-- Definition of GetNums: http://tsql.solidq.com/GetNums.txtTRUNCATE TABLE dbo.T1;INSERT INTO dbo.T1 WITH(TABLOCK)  SELECT n AS id, CHECKSUM(NEWID()) AS col1  FROM TSQLV3.dbo.GetNums(1, 10000000) AS NumsOPTION(MAXDOP 1);

The performance numbers that I present were measured against the large set of sample data, with results discarded.

Solution 1 Using Only Window Functions

The first solution that I discuss uses two layers of the MAX window aggregate function. There are two main steps in the solution. The following code implements the first step:

SELECT id, col1, relevantid,  MAX(relevantid) OVER( ORDER BY id            ROWS UNBOUNDED PRECEDING ) AS grpFROM dbo.T1  CROSS APPLY ( VALUES( CASE WHEN col1 IS NOT NULL THEN id END ) )    AS A(relevantid);

To make it easy to understand what this step does, follow my explanations while examining the output of this step, which Figure 2 shows.

id      col1    relevantid  grp----------- ----------- ----------- -----------2       NULL    NULL    NULL3       10      3       35       -1      5       57       NULL    NULL    511      NULL    NULL    513      -12     13      1317      NULL    NULL    1319      NULL    NULL    1323      1759    23      23

The code uses a CASE expression to compute a result column called relevantid. This column shows the value of id when the value of col1 changes and a NULL when it doesn't. The critical thing about id is that by definition its values keep increasing; that's not necessarily the case with the values in col1. The MAX window aggregate function returns the maximum relevantid value so far (call the result column grp).

Observe in Figure 2 that the group value (grp) changes to the current id whenever the col1 value changes, and it remains the same as long as the col1 value doesn't change. So, the maximum col1 value within each group of events is the applicable col1 value for those events.

The second step in the solution implements the last calculation, like so:

WITH C AS(  SELECT id, col1, relevantid,    MAX(relevantid) OVER( ORDER BY id              ROWS UNBOUNDED PRECEDING ) AS grp  FROM dbo.T1    CROSS APPLY ( VALUES( CASE WHEN col1 IS NOT NULL THEN id END ) )  AS A(relevantid))SELECT *,  MAX(col1) OVER( PARTITION BY grp          ORDER BY id          ROWS UNBOUNDED PRECEDING ) AS maxcol1ingrpFROM C;

Figure 3 shows the output of this step.

id      col1    relevantid  grp     maxcol1ingrp----------- ----------- ----------- ----------- ------------2       NULL    NULL    NULL    NULL3       10      3       3       105       -1      5       5       -17       NULL    NULL    5       -111      NULL    NULL    5       -113      -12     13      13      -1217      NULL    NULL    13      -1219      NULL    NULL    13      -1223      1759    23      23      1759

The result column maxcol1ingrp is the last non NULL you were looking for. Here's the complete solution after inlining the computation of relevantid:

WITH C AS(  SELECT id, col1,    MAX( CASE WHEN col1 IS NOT NULL THEN id END )  OVER( ORDER BY id        ROWS UNBOUNDED PRECEDING ) AS grp  FROM dbo.T1)SELECT id, col1,  MAX(col1) OVER( PARTITION BY grp          ORDER BY id          ROWS UNBOUNDED PRECEDING ) AS lastvalFROM C;

Figure 4 shows the plan for this solution against the large set of sample data.

Plan for Solution 1

The first window aggregate function relies on index order. However, because the second window aggregate function uses a different window specification than the first, it requires an explicit sort. That's the most expensive part of the plan. It took this solution 66 seconds to complete on my system.

The second solution that I present eliminates the need for explicit sorting.

Solution 2 Using Concatenation and a Window Function

The second solution can also be split into two steps. The following code implements the first step:

SELECT id, col1, binstr,  MAX( binstr ) OVER( ORDER BY id          ROWS UNBOUNDED PRECEDING ) AS mxFROM dbo.T1  CROSS APPLY ( VALUES( CAST(id AS BINARY(4)) + CAST(col1 AS BINARY(4)) ) )    AS A(binstr);

Again, follow my explanations while observing the code and its output, which Figure 5 shows.

id      col1    binstr         mx----------- ----------- ------------------ ------------------2       NULL    NULL           NULL3       10      0x000000030000000A 0x000000030000000A5       -1      0x00000005FFFFFFFF 0x00000005FFFFFFFF7       NULL    NULL           0x00000005FFFFFFFF11      NULL    NULL           0x00000005FFFFFFFF13      -12     0x0000000DFFFFFFF4 0x0000000DFFFFFFF417      NULL    NULL           0x0000000DFFFFFFF419      NULL    NULL           0x0000000DFFFFFFF423      1759    0x00000017000006DF 0x00000017000006DF

This step converts the id value and the col1 value to binary strings and concatenates them. The code names the result column binstr. Observe in Figure 5 that when col1 changes its value, binstr shows the concatenated strings; when col1 doesn't change its value, binstr is NULL. The MAX window aggregate function computes the maximum binstr value so far. The code calls the result column mx.

The right four bytes within mx represent the binary form of the last non NULL value of col1. The second step in the solution extracts those right four bytes and converts them to INT to return the desired result with the proper type. Here's the code that implements step 2, which is also the complete solution:

SELECT id, col1,  CAST(    SUBSTRING(  MAX( CAST(id AS BINARY(4)) + CAST(col1 AS BINARY(4)) )    OVER( ORDER BY id      ROWS UNBOUNDED PRECEDING ),  5, 4)    AS INT) AS lastvalFROM dbo.T1;

Figure 6 shows the plan for this solution.

Plan for Solution 2

The beauty of this solution is that it involves only one window aggregate function—and that function relies on index order, as you can see in the query plan. No explicit sorting is required. This solution took 33 seconds to complete on my system.

Conclusion

The last non NULL task is a beautiful puzzle for several reasons. It's a common and simple need, but apparently there's no straightforward solution. Creating an efficient solution requires a good understanding of window functions and how they are optimized, as well as some creativity.

I demonstrated two different solutions. One solution uses only window functions but involves an explicit sort in the plan. The other solution uses a combination of concatenation and a window function and doesn't involve an explicit sort in the plan.

Sign up for the ITPro Today newsletter
Stay on top of the IT universe with commentary, news analysis, how-to's, and tips delivered to your inbox daily.

You May Also Like