Solving Gaps and Islands with Enhanced Window Functions
SQL Server 2012 provides new solutions
September 20, 2012
Gaps and islands are classic problems in SQL that involve identifying ranges of missing values and ranges of existing values in a sequence. Real-life examples of gap and island problems include nonavailability and availability reports, periods of inactivity and periods of activity reports, and so on. For extensive coverage of gap and island problems and solutions prior to SQL Server 2012, see "SQL Server MVP Deep Dives."
In this article, I cover new solutions to gap and island problems, using enhanced window functions that were introduced in SQL Server 2012. For details about the enhanced window functions in SQL Server 2012, see:
Sample Data and Testing
In my tests, I used a table called T1 with an integer column called col1 holding the sequence values. The code in Listing 1 (below) creates the table T1 and fills it with a small set of sample data -- allowing you to check the correctness of the solutions.
For performance testing, you need much larger sets of sample data. The code in Listing 2 (below) creates a helper function called GetNums that accepts two inputs called @low and @high and returns a sequence of integers in that range.
The code in Listing 3 (below) uses the GetNums function to populate T1 with values in the range 1 through 10,000,000 with only a few gaps (about 100). The code in Listing 4 populates T1 with values in the range 1 through 10,000,000 with many gaps (about 1,000,000).
When I tested the solutions covered in this article, I measured the performance of a single execution of the query, as well as a stress test with two iterations of 20 concurrent executions of the query. For the stress test I used Adam Machanic's SQLQueryStress tool. I'll provide performance results that I obtained for both the few gaps and the many gaps cases, and for both the single execution of the query and the average of the multiple executions in the stress test.
Gaps
The gaps problem involves identifying the ranges of missing values in the sequence. Figure 1 shows the desired result for the small set of sample data in T1.
Figure 1: Desired Result for Gaps Problem
First, I'll cover a solution using subqueries that's supported in all versions of SQL Server. Then, I'll cover a solution that uses features that are new in SQL Server 2012.
Solution to gaps problem, using subqueries.
The following query is a classic solution to the gaps problem, using subqueries:
SELECT col1 + 1 AS rangestart,(SELECT MIN(B.col1)FROM dbo.T1 AS BWHERE B.col1 > A.col1) - 1 AS rangeendFROM dbo.T1 AS AWHERE NOT EXISTS(SELECT *FROM dbo.T1 AS BWHERE B.col1 = A.col1 + 1)AND col1 < (SELECT MAX(col1) FROM dbo.T1);
This query filters only col1 values that appear before a gap (the value plus 1 doesn't exist). It also filters only the col1 values that are less than the maximum value in the table. Then in the SELECT clause the query matches to each value before a gap the value that appears right after the gap (the minimum value that's greater than the current value). What's left is to add 1 to the value before the gap and subtract 1 from the value after the gap to obtain the ranges of missing values.
This solution is quite efficient. I obtained the following performance results for the query:
Single query, few gaps: 3 seconds, 32577 logical reads
Single query, many gaps: 6 seconds, 3029043 logical reads
Stress test, few gaps: 13 seconds, 32577 logical reads
Stress test, many gaps: 27 seconds, 3029043 logical reads
Solution to gaps problem, using LEAD.
SQL Server 2012 provides a very simple and elegant solution to the gaps problem, using the new LEAD function. The LEAD function returns the requested element from the next row based on the indicated ordering. Here's the new solution:
WITH C AS(SELECT col1 AS cur, LEAD(col1) OVER(ORDER BY col1) AS nxtFROM dbo.T1)SELECT cur + 1 AS rangestart, nxt - 1 AS rangeendFROM CWHERE nxt - cur > 1;
The query that creates the CTE C returns for each current col1 value (aliased as cur) the next col1 value (aliased as nxt). Then the outer query against the CTE returns only pairs of cur and nxt values that are more than one apart. As you can guess, those pairs represent the values surrounding the gaps. The query then adds 1 to the cur value and subtracts 1 from the nxt value to return the actual ranges of missing values. Beautifully simple!
I obtained the following performance results for the query:
Single query, few gaps: 11 seconds, 16138 logical reads
Single query, many gaps: 9 seconds, 13718 logical reads
Stress test, few gaps: 45 seconds, 16138 logical reads
Stress test, many gaps: 45 seconds, 13718 logical reads
It's a bit disappointing to see that in both the single execution test and the stress test, the old solution was faster than the new one. However, the new solution requires significantly fewer I/O operations. In addition, it's much much simple and therefore easier to maintain.
Normal Islands
The normal islands problem involves identifying the ranges of existing values. Figure 2 shows the desired result for the small set of sample data in T1.
Figure 2: Desired Result for Normal Islands Problem
The following query is a classic solution to the normal islands problem, using features that are supported by SQL Server 2005 and later:
WITH C AS(SELECT col1, col1 - DENSE_RANK() OVER(ORDER BY col1) AS grpFROM dbo.T1)SELECT MIN(col1) AS rangestart, MAX(col1) AS rangeendFROM CGROUP BY grp;
The inner query that defines the CTE C uses an expression that uniquely identifies each island. This is achieved by using the DENSE_RANK function. Within each island, both the col1 values and the DENSE_RANK values keep incrementing by an interval of 1 integer. Therefore the difference between the two is constant within each island. Also, when you jump from one island to the next, the col1 value increases by more than 1, whereas the DENSE_RANK value increases by only 1. Therefore the difference for the next island will always be greater than the difference for the preceding island. In other words, the difference between col1 and the DENSE_RANK value is constant per island and unique per island. This means that it identifies the island. You can compute the starting and ending values per island by grouping the rows by the difference and returning the minimum and maximum col1 values per group.
This solution is very efficient and very elegant, but unfortunately it works only for the normal islands problem, in which there's a fixed interval between the entries. Any case in which one entry is greater than the previous entry by more than the fixed interval is considered a gap. To address special island problems, in which gaps of up to a certain size must be ignored, you need to find other solutions.
Special Islands
There are special island problems in which you need to ignore gaps of up to a certain size. For example, suppose that you need to identify islands in the col1 values in T1 but ignore gaps of up to a size of 2. For example, the values 7, 8, 9, and 11 are considered part of the same island because no pair has a difference of more than 2. However, the values 3 and 7 belong to different islands because the difference between the values is greater than 2. Figure 3 shows the desired result for this special case of the islands problem, for the small set of sample data in T1.
Figure 3: Desired Result for Special Islands Problem
First, I'll cover a solution that's supported by SQL Server 2005 and later. Then, I'll cover a couple of solutions that rely on new features in SQL Server 2012.
Solution to special islands problem, using subqueries.
The following query is a solution to the special islands problem. This solution is supported by SQL Server 2005 and later:
WITH S AS(SELECT col1,ROW_NUMBER() OVER(ORDER BY col1) AS nFROM dbo.T1 AS AWHERE NOT EXISTS(SELECT *FROM dbo.T1 AS BWHERE B.col1 >= A.col1 - 2AND B.col1 < A.col1)),E AS(SELECT col1,ROW_NUMBER() OVER(ORDER BY col1) AS nFROM dbo.T1 AS AWHERE NOT EXISTS(SELECT *FROM dbo.T1 AS BWHERE B.col1 > A.col1AND B.col1 <= A.col1 + 2))SELECT S.col1 AS rangestart, E.col1 AS rangeendFROM S JOIN EON S.n = E.n;
The query that defines the CTE S filters only the col1 values from T1 that start islands (a value greater than the current value minus 2 and less than the current value can't be found). The query assigns row numbers based on col1 ordering to the remaining values.
Similarly, the query that defines the CTE E filters only the col1 values from T1 that end islands (a value greater than the current value and less than the current value plus 2 can't be found). The query assigns row numbers to the remaining values based on col1 ordering. The outer query then relates to each start of an island the respective end of the island by matching the row numbers of the start and end values.
I obtained the following performance results for the query:
Single query, few gaps: 8 seconds, 66298061 logical reads
Single query, many gaps: 14 seconds, 56356747 logical reads
Stress test, few gaps: 142 seconds, 63798069 logical reads
Stress test, many gaps: 188 seconds, 56356747 logical reads
As you can see, the solution involves a significant number of logical reads.
Solution to special islands problem, using LAG and LEAD.
Recall from one of the previous solutions that the LEAD function lets you return an element from the next row based on the indicated ordering. Similarly, the LAG function lets you return an element from the previous row based on the indicated ordering. Both functions are new in SQL Server 2012. Here's a new solution to the special islands problem, using the LAG and LEAD functions:
WITH C1 AS(SELECT col1,CASE WHEN col1 - LAG(col1) OVER(ORDER BY col1) <= 2 THEN 0 ELSE 1 END AS isstart,CASE WHEN LEAD(col1) OVER(ORDER BY col1) - col1 <= 2 THEN 0 ELSE 1 END AS isendFROM dbo.T1),C2 AS(SELECT col1 AS rangestart,CASEWHEN isend = 1 THEN col1ELSE LEAD(col1, 1) OVER(ORDER BY col1)END AS rangeend, isstartFROM C1WHERE isstart = 1 OR isend = 1)SELECT rangestart, rangeendFROM C2WHERE isstart = 1;
The query that defines the CTE C1 uses the LAG and LEAD functions to compute the attributes isstart and isend. The former returns a 0 if the difference between the current and previous col1 values is less than or equal to 2; otherwise it returns a 1. The latter returns a 0 if the difference between the next and current col1 values is less than or equal to 2; otherwise it returns a 1. As you probably figured out, a 1 in the isstart flag means that the value is the start of an island, and a 1 in the isend flag means that the value is the end of an island.
The query that defines the CTE C2 filters only values that are starts or ends of islands and then uses a CASE expression to compute for each start value the respective end value. If there's only one value in the island (both isstart and isend are equal to 1), the CASE expression returns the current value as the end value; otherwise it uses the LEAD function to return the next value as the end value.
I obtained the following performance results for the query:
Single query, few gaps: 26 seconds, 16138 logical reads
Single query, many gaps: 24 seconds, 13718 logical reads
Stress test, few gaps: 117 seconds, 16138 logical reads
Stress test, many gaps: 120 seconds, 13718 logical reads
What's interesting is that the number of logical reads is significantly lower than in the previous solution. Observe that in the single query test, this solution is slower than the previous solution, but in the stress test it performs better -- likely due to less contention related to I/O resources.
Solution to special islands problem, using LAG and COUNT.
Here's another new solution to the special islands problem. This solution relies on new windowing capabilities in SQL Server 2012:
WITH C1 AS(SELECT col1,CASE WHEN col1 - LAG(col1) OVER(ORDER BY col1) <= 2 THEN NULL ELSE 1 END AS isstartFROM dbo.T1),C2 AS(SELECT col1, COUNT(isstart) OVER(ORDER BY col1 ROWS UNBOUNDED PRECEDING) AS grpFROM C1)SELECT MIN(col1) AS rangestart, MAX(col1) AS rangeendFROM C2GROUP BY grp;
The query that defines the CTE C1 uses the LAG function to generate an isstart flag similar to the one in the previous solution. When the value is the start of an island, the flag is set to 1. However, unlike in the previous solution, when the value isn't the start of an island, the flag is set to NULL.
The query that defines the CTE C2 then uses the enhanced COUNT window function to generate the attribute grp to identify each island. The COUNT function computes the running count of the isstart flag based on col1 ordering from the first row in the sequence and until the current row (ROWS UNBOUNDED PRECEDING). Because the COUNT function ignores NULLs, it generates the value 1 for all members of the first island, the value 2 for all members of the second island, and so on. Finally, the outer query groups the rows by the grp value and returns the minimum and maximum col1 values in each group as the start and end values of the range.
I obtained the following performance results for the query:
Single query, few gaps: 28 seconds, 16138 logical reads
Single query, many gaps: 31 seconds, 16186 logical reads
Stress test, few gaps: 146 seconds, 16138 logical reads
Stress test, many gaps: 154 seconds, 13718 logical reads
The run time results for this solution are a bit slower than in the previous solution. However, this solution also generates a significantly smaller number of logical reads compared with the SQL Server 2005–compatible solution.
More Options Lead to Better Solutions
Gap and island problems manifest themselves in many different forms and permutations in real-life cases. Some of them are specialized forms that require you to be creative. The more tools you have available, the more options you have to work with. The windowing improvements in SQL Server 2012 enhance your toolkit and let you create new, often improved solutions to such
problems.
Learning Path
For details about SQL Server 2012's enhanced window functions:
"How to Use Microsoft SQL Server 2012's Window Functions, Part 1"
"Microsoft SQL Server 2012: How to Write T-SQL Window Functions, Part 2"
"SQL Server 2012: How to Write T-SQL Window Functions, Part 3"
Listing 1: Small Set of Sample Data
SET NOCOUNT ON;USE tempdb;IF OBJECT_ID('dbo.T1', 'U') IS NOT NULL DROP TABLE dbo.T1;CREATE TABLE dbo.T1( col1 INT NOT NULLCONSTRAINT PK_T1 PRIMARY KEY);INSERT INTO dbo.T1(col1) VALUES(2),(3),(7),(8),(9),(11),(15),(16),(17),(28);
Listing 2: Definition of GetNums Helper Function
IF OBJECT_ID('dbo.GetNums', 'IF') IS NOT NULL DROP FUNCTION dbo.GetNums;GOCREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLEASRETURN WITHL0 AS (SELECT c FROM (VALUES(1),(1)) AS D(c)),L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),L5 AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum FROM L5) SELECT @low + rownum - 1 AS n FROM Nums ORDER BY rownum OFFSET 0 ROWS FETCH FIRST @high - @low + 1 ROWS ONLY;GO
Listing 3: Large Set of Sample Data, Few Gaps
TRUNCATE TABLE dbo.T1;INSERT INTO dbo.T1 WITH (TABLOCK) (col1) SELECT n FROM dbo.GetNums(1, 10000000) AS Nums WHERE n % 100000 > 0AND n % 200000 > 1;
Listing 4: Large Set of Sample Data, Many Gaps
TRUNCATE TABLE dbo.T1;INSERT INTO dbo.T1 WITH (TABLOCK) (col1) SELECT n FROM dbo.GetNums(1, 10000000) AS Nums WHERE n % 10 > 0AND n % 20 > 1;
About the Author
You May Also Like