Splitting Separated Lists of Integers with HIERARCHYID
Itzik tries a new technique to split separated lists of values using the HIERARCHYID data type.
March 27, 2010
Splitting strings holding separated lists of values is a common problem that has been discussed many times in the past. The best source of information that I know of on the topic is Erland Sommarskog ’s series about Arrays and Lists.
I often like to revisit old problems and try completely new approaches. Sometimes such attempts lead to new solutions. Some of the new solutions end up being faster than existing ones and some aren’t. Either way, I find the exercise and the constant search for improvement a healthy thing.
During a class I delivered last week while discussing this problem it occurred to me that when the values in the lists are integers the strings are very similar to the canonical representation of HIERARCHYID values. A string holding a separated list of integers starting with a slash (/), ending with a /, and using either slashes or dots as separators is considered a valid canonical representation of a HIERARCHYID value and hence can be converted to one. Once converted, you can use the type’s methods and properties to manipulate the value.
Before long, I came up with a solution. The results in terms of performance were disappointing, though the exercise was very interesting. The solution requires an auxiliary table of numbers. You can use the following code to create one:
SET NOCOUNT ON;
USE tempdb;
IF OBJECT_ID('dbo.Nums', 'U') IS NOT NULL DROP TABLE dbo.Nums;
CREATE TABLE dbo.Nums(n INT NOT NULL PRIMARY KEY);
DECLARE @max AS INT, @rc AS INT;
SET @max = 100000;
SET @rc = 1;
INSERT INTO Nums VALUES(1);
WHILE @rc * 2 @max
BEGIN
INSERT INTO dbo.Nums SELECT n + @rc FROM dbo.Nums;
SET @rc = @rc * 2;
END
INSERT INTO dbo.Nums
SELECT n + @rc FROM dbo.Nums WHERE n + @rc @max;
Here’s the solution showing how to split a single string:
DECLARE @s AS VARCHAR(1000) = '-4,-6,1050,-2';
WITH C AS
(
SELECT CAST('/' + REPLACE(@s, ',', '/') + '/' AS HIERARCHYID) AS hid
)
SELECT hid.GetLevel() - n + 1 AS pos,
REPLACE(
hid.GetAncestor(n-1).GetReparentedValue(
hid.GetAncestor(n),
0x).ToString(),
'/', '') AS element
FROM C
JOIN dbo.Nums
ON n hid.GetLevel();
I used a local variable, but of course you can encapsulate this logic in an inline table-valued function with @s passed as an input argument. As for the solution’s logic, it starts by defining a CTE called C that generates an HIERARCHYID value out of the string in @s. This is achieved by replacing all occurrences of commas in the string with slashes, adding slashes in the beginning and end, and converting the result to HIERARCHYID. The result attribute is named hid.
The outer query filters as many numbers from the Nums table as the number of levels in hid (obtained from the GetLevel() method). The ordinal position of each element, as well as the element itself, are calculated in the outer query’s SELECT list.
The position of the element is calculated as the number of levels in the hid value minus n plus 1. The reasoning behind this calculation is that it’s convenient to think of the numbers from Nums as right-based ordinal positions of elements so that n - 1 can be used as input to the GetAncestor method later on. But the output should show left-based ordinal positions of elements. As an example, given a string with 4 elements, n = 1 represents the rightmost element whose left-based position is 4 - 1 + 1 = 4.
As for the element itself, you can get the path leading to the current element by invoking hid.GetAncestor(n-1). With the given input string (@s = '-4,-6,1050,-2') the canonical path of the converted hid value is /-4/-6/1050/-2/. So, for example, out of the four rows you get back from the join consider the row where n = 2. You’re supposed to extract the second element from the right (third element from left) which is 1050. Using hid.GetAncestor(n-1) you get an hid value whose canonical representation is /-4/-6/1050/. Now you need to get rid of the parent prefix (/-4/-6/). You can obtain the prefix by using hid.GetAncestor(n). You can then remove this prefix by using the GetReparentedValue() method passing the prefix as first input and an empty binary string as the second. Once the prefix is removed, the ToString() method returns /1050/ as the canonical representation of the returned value. What’s left is to remove the remaining slashes before and after the value by using the REPLACE method, substituting slashes with an empty string.
You can use very similar logic to handle multiple strings stored in a table at once, like so:
IF OBJECT_ID('dbo.Arrays', 'U') IS NOT NULL DROP TABLE dbo.Arrays;
CREATE TABLE dbo.Arrays
(
arrid INT NOT NULL IDENTITY PRIMARY KEY,
arr NVARCHAR(4000) NOT NULL
);
GO
INSERT INTO dbo.Arrays(arr) VALUES
(N'20,220,25,2115,14'),
(N'30,330,28'),
(N'12,10,8,8,122,13,2,14,10,9'),
(N'-4,-6,1050,-2');
WITH C AS
(
SELECT arrid, arr,
CAST('/' + REPLACE(arr, ',', '/') + '/' AS HIERARCHYID) AS hid
FROM dbo.Arrays
)
SELECT arrid, hid.GetLevel() - n + 1 AS pos,
REPLACE(
hid.GetAncestor(n-1).GetReparentedValue(
hid.GetAncestor(n-1),
0x).ToString(),
'/', '') AS element
FROM C
JOIN dbo.Nums
ON n hid.GetLevel();
But it appears that the use of the CLR type has high overhead and tests show that this solution is slower than some of the existing ones that you will find in Erland’s papers (e.g., using a CLR UDF or a pure T-SQL solution with a numbers table). Besides not being very fast, the solution using HIERARCHYID is limited to lists of integers only and works only in SQL Server 2008.
So this attempt wasn’t so successful as far as performance is concerned. But the way I see it, what’s important is to keep trying. As you do, you develop new techniques, some of which may end up turning useful in the future. Sometimes, if you’re lucky, your new solution will be better than the commonly used existing ones. A friend once told me that the path (no pun intended) is more important than the destination…
Cheers,
BG
About the Author
You May Also Like