Solutions to Related Tables Challenge
Itzik covers solutions to last week’s T-SQL challenge – Related Tables.
June 1, 2010
Last week I provided a T-SQL Challenge involving writing a recursive query that returns tables related to an input table directly or indirectly through foreign key relationships. You can find the challenge details here. The tricky part wasn’t really to come up with any solution that works, rather to come up with a solution using a recursive query. With this constraint the challenge became much more difficult—especially coming up with a solution that performs reasonably.
I’d like to thank all those who sent solutions, including Peter Larsson (Peso), Geri Reshef and Rubén Garrigós. Special thanks go to Peso for his help in benchmarking solutions in his environment and in sending corrections to other solutions. I know that many others tried to solve the puzzle and would like to thank you for your efforts—it is a difficult challenge!
The sample data I provided last week is far too simple to be used to test the performance of the solutions. Peso tested some of the solutions in his data warehouse with 160 tables and this way we got more realistic measures. I created a new set of tables—smaller than Peso’s environment, yet more complex than last week’s—and this was sufficient to show the performance differences between the solutions. Here’s the new sample data:
USE testfk;
CREATE TABLE dbo.T2(c2 INT PRIMARY KEY);
CREATE TABLE dbo.T3(c3 INT PRIMARY KEY);
CREATE TABLE dbo.T4(c4 INT PRIMARY KEY);
CREATE TABLE dbo.T5(c5 INT PRIMARY KEY);
CREATE TABLE dbo.T1(c1 INT PRIMARY KEY,
c2 INT REFERENCES dbo.T2,
c3 INT REFERENCES dbo.T3,
c4 INT REFERENCES dbo.T4,
c5 INT REFERENCES dbo.T5);
CREATE TABLE dbo.T6(c6 INT PRIMARY KEY,
c2 INT REFERENCES dbo.T2,
c3 INT REFERENCES dbo.T3);
CREATE TABLE dbo.T7(c7 INT PRIMARY KEY,
c3 INT REFERENCES dbo.T3,
c4 INT REFERENCES dbo.T4);
CREATE TABLE dbo.T8(c8 INT PRIMARY KEY,
c4 INT REFERENCES dbo.T4,
c5 INT REFERENCES dbo.T5);
CREATE TABLE dbo.T9(c9 INT PRIMARY KEY,
c2 INT REFERENCES dbo.T2,
c5 INT REFERENCES dbo.T5);
CREATE TABLE dbo.T10(c10 INT PRIMARY KEY,
c6 INT REFERENCES dbo.T6,
c3 INT REFERENCES dbo.T3);
CREATE TABLE dbo.T11(c11 INT PRIMARY KEY,
c3 INT REFERENCES dbo.T3,
c7 INT REFERENCES dbo.T7);
CREATE TABLE dbo.T12(c12 INT PRIMARY KEY,
c7 INT REFERENCES dbo.T7,
c4 INT REFERENCES dbo.T4);
CREATE TABLE dbo.T13(c13 INT PRIMARY KEY,
c4 INT REFERENCES dbo.T4,
c8 INT REFERENCES dbo.T8);
CREATE TABLE dbo.T14(c14 INT PRIMARY KEY,
c5 INT REFERENCES dbo.T5,
c8 INT REFERENCES dbo.T8);
CREATE TABLE dbo.T15(c15 INT PRIMARY KEY,
c5 INT REFERENCES dbo.T5,
c9 INT REFERENCES dbo.T9);
CREATE TABLE dbo.T16(c16 INT PRIMARY KEY,
c2 INT REFERENCES dbo.T2,
c9 INT REFERENCES dbo.T9);
CREATE TABLE dbo.T17(c17 INT PRIMARY KEY,
c2 INT REFERENCES dbo.T2,
c6 INT REFERENCES dbo.T6);
And here’s a graphical depiction of the relationships:
These tables are created in the testfk database used last week in addition to the other tables already there (tables A, B, C, D, E, F, G).
The performance measures I’ll provide are against this data model, using warm cache, not including parse and compile time. I used the following input for the performance test:
DECLARE @table AS NVARCHAR(261) = N'dbo.T1';
The expected output is (not necessarily in this order):
obj_schema_name obj_name
---------------- ---------
dbo T15
dbo T16
dbo T17
dbo T2
dbo T3
dbo T4
dbo T5
dbo T1
dbo T6
dbo T7
dbo T8
dbo T9
dbo T10
dbo T11
dbo T12
dbo T13
dbo T14
First, I’d like to emphasize that the constraint to come up with a solution based on a recursive query was just to increase the puzzle’s level of difficulty. In practical terms, it makes sense to address such need using a loop-based solution. Loop-based solutions in this case are far simpler and more efficient than recursive solutions. Here’s the one I showed last week (call it Itzik Loop) with a minor revision to return the input table when it has no related tables (remember to include in the code the declaration and initialization of the input @table provided earlier):
DECLARE @T AS TABLE ( id INT NOT NULL PRIMARY KEY );
INSERT INTO @T(id)
SELECT OBJECT_ID(@table) AS id
WHERE OBJECT_ID(@table) IS NOT NULL
UNION
SELECT referenced_object_id
FROM sys.foreign_keys
WHERE parent_object_id = OBJECT_ID(@table)
UNION
SELECT parent_object_id
FROM sys.foreign_keys
WHERE referenced_object_id = OBJECT_ID(@table);
WHILE @@ROWCOUNT > 0
INSERT INTO @T(id)
SELECT referenced_object_id AS id
FROM sys.foreign_keys AS FK
JOIN @T
ON FK.parent_object_id = [@T].id
UNION
SELECT parent_object_id
FROM sys.foreign_keys AS FK
JOIN @T
ON FK.referenced_object_id = [@T].id
EXCEPT
SELECT id FROM @T;
SELECT OBJECT_SCHEMA_NAME(id) AS obj_schema_name, OBJECT_NAME(id) AS obj_name
FROM @T;
The statistics for this solution on my system with the new sample data are: Duration: 4 ms, Reads: 587.
Here’s another loop-based solution provided by Peso (call it Peso Loop):
DECLARE @Tables TABLE
(
TableID INT PRIMARY KEY
)
INSERT @Tables
(
TableID
)
SELECT [object_ID] AS TableID
FROM SYS.TABLES
WHERE [object_ID] = OBJECT_ID(@table)
WHILE @@ROWCOUNT > 0
INSERT @Tables
(
TableID
)
SELECT DISTINCT TableID
FROM (
SELECT fk.REFERENCED_OBJECT_ID AS TableID
FROM SYS.FOREIGN_KEYS AS fk
INNER JOIN @Tables AS t ON t.TableID = fk.PARENT_OBJECT_ID
UNION ALL
SELECT fk.PARENT_OBJECT_ID AS TableID
FROM SYS.FOREIGN_KEYS AS fk
INNER JOIN @Tables AS t ON t.TableID = fk.REFERENCED_OBJECT_ID
) AS w
WHERE NOT EXISTS(SELECT * FROM @Tables AS t WHERE t.TableID = w.TableID)
SELECT
OBJECT_SCHEMA_NAME(TableID) AS obj_schema_name,
OBJECT_NAME(TableID) AS obj_name
FROM @Tables;
The stats I got for this solution are: Duration: 4 ms, reads: 728. In short, both loop-based solutions are pretty efficient as you can see.
As for recursive solutions, I’ll start with the ones I came up with, then I’ll provide also the ones by Peso, Geri and Rubén.
My first attempt at a solution using a recursive query is the following (call it Itzik Recursive 1):
WITH C AS
(
SELECT OBJECT_ID(@table) AS obj,
'.' + CAST(OBJECT_ID(@table) AS VARCHAR(MAX)) + '.' AS objpath
WHERE OBJECT_ID(@table) IS NOT NULL
UNION ALL
SELECT D.obj, D.objpath + CAST( D.obj AS VARCHAR(MAX) ) + '.'
FROM ( SELECT
CASE WHEN C.obj = F.parent_object_id
THEN F.referenced_object_id
ELSE F.parent_object_id
END AS obj, objpath
FROM C
JOIN sys.foreign_keys AS F
ON C.obj IN (F.parent_object_id, F.referenced_object_id) ) AS D
WHERE objpath NOT LIKE '%.' + CAST(obj AS VARCHAR(10)) + '.%'
)
SELECT DISTINCT
OBJECT_SCHEMA_NAME(obj) AS obj_schema_name,
OBJECT_NAME(obj) AS obj_name
FROM C;
The anchor member returns the input table. The recursive member returns a row for each counterpart of the tables that appeared in the previous round as either a referencing or a referenced table. For each edge, the code constructs a dot separated path with all of the object ids of the objects leading to the current. In each recursive round, the code filters out nodes that already appear in their parent node’s path.
This solution is short (in terms of amount of code) and elegant, but isn’t very fast. It creates all possible paths starting with the input node and made of unique nodes. As you can imagine, the number of paths can explode quickly to a large number with additions of new related tables.
The statistics I got for this solution are: Duration: 7,812, Reads: 2,949,749.
In order for a recursive solution to be really fast you need to avoid constructing multiple different paths and visit each node only once. The two main challenges are: 1. How to apply DISTINCT logic in the recursive member without using the DISTINCT clause (not allowed in recursive member). 2. Avoid building multiple different paths when multiple nodes are counterparts in a foreign key relationship with a single node from the previous round.
Both challenges have a solution. Challenge 1 can be handled by calculating a row number partitioned by the node and then filtering only rows with row number 1. Challenge 2 can be handled by using a FOR XML PATH query to concatenate multiple nodes into one. Here’s the complete solution code (call it Itzik Recursive 2):
WITH C AS
(
SELECT
CAST('.' + STR(OBJECT_ID(@table), 11) + '.' AS VARCHAR(MAX)) AS entirepath,
CAST('~' + STR(OBJECT_ID(@table), 11) + '~' AS VARCHAR(MAX)) AS lastpath
WHERE OBJECT_ID(@table) IS NOT NULL
UNION ALL
SELECT
REPLACE(entirepath, '~', '.') + '.',
SUBSTRING(entirepath, CHARINDEX('~', entirepath), LEN(entirepath)) + '~'
FROM (
SELECT CAST(entirepath AS VARCHAR(MAX)) AS entirepath
FROM (
SELECT element AS [text()]
FROM (
SELECT
CASE n WHEN 1 THEN entirepath ELSE obj END AS element,
ROW_NUMBER() OVER(PARTITION BY CASE n WHEN 1 THEN entirepath ELSE obj END
ORDER BY (SELECT NULL)) AS rownum
FROM (
SELECT
LEFT(entirepath, LEN(entirepath) - 1) AS entirepath,
'~' + STR(obj, 11) AS obj
FROM (
SELECT
C.entirepath,
CAST(
CASE WHEN C.lastpath LIKE '%~' + STR(F.parent_object_id, 11) + '~%'
THEN F.referenced_object_id
ELSE F.parent_object_id
END AS VARCHAR(10)) AS obj
FROM C
JOIN sys.foreign_keys AS F
ON C.lastpath LIKE '%~' + STR(F.parent_object_id, 11) + '~%'
OR C.lastpath LIKE '%~' + STR(F.referenced_object_id, 11) + '~%'
) AS D1
WHERE entirepath NOT LIKE '%.' + STR(obj, 11) + '.%'
) AS D2
CROSS JOIN (SELECT 1 UNION ALL SELECT 2) AS Nums(n)
) AS D3
WHERE rownum = 1
ORDER BY element
FOR XML PATH('')
) AS D4(entirepath)
WHERE entirepath IS NOT NULL
) AS D5
),
Split AS
(
SELECT
CAST(SUBSTRING(lastpath, 2, 11) AS INT) AS obj,
STUFF(lastpath, 1, 12, '') AS lastpath
FROM C
UNION ALL
SELECT
CAST(SUBSTRING(lastpath, 2, 11) AS INT) AS obj,
STUFF(lastpath, 1, 12, '') AS lastpath
FROM Split
WHERE lastpath '~'
)
SELECT
OBJECT_SCHEMA_NAME(obj) AS obj_schema_name,
OBJECT_NAME(obj) AS obj_name
FROM Split;
Observe in the CTE C that only one path is built in each iteration of the recursive query, concatenating the path built by the previous round and the qualifying items from the current round (concatenation done with FOR XML PATH). The code uses a trick to distinguish between the part of the path representing nodes accessed prior to current round and nodes added in this round (could be more than one). The trick is to use different separators for the two sections in the layer of the code that concatenates all parts—old (. as separator) and new (~ as separator). The reason to include both sections in the same string and not as different strings is that a FOR XML PATH query can return only one string. Then a layer of code on top can extract the last part of the path (section starting with ~) giving it the attribute name lastpart, and replace all ocurrences of the character ~ with . to return the entire path, giving it the attribute name entirepath.
Finally, another recursive query called Split extracts the node ids from the lastpath attribute calculated in each round. First round (anchor member) handles first node in each lastpath, second round (already recursive query) handles second node in each last path, and so on, until no more nodes exist.
Since this solution visits each node only once it is very fast. The performance stats I got for it are: Duration: 17, Reads: 2,129.
Following are solutions using recursive queries by Peso, Geri and Ruben and their performance statistics.
Here’s Peso’s solution (Call it Peso Recursive):
WITH cteTables (ChildTableName, ParentTableName)
AS (
SELECT CASE x.Permutation
WHEN 0 THEN c.TABLE_NAME
ELSE p.TABLE_NAME
END AS ChildTableName,
CASE x.Permutation
WHEN 0 THEN p.TABLE_NAME
ELSE c.TABLE_NAME
END AS ParentTableName
FROM (
SELECT CONSTRAINT_SCHEMA + '.' + CONSTRAINT_NAME AS CONSTRAINT_NAME,
UNIQUE_CONSTRAINT_SCHEMA + '.' + UNIQUE_CONSTRAINT_NAME AS UNIQUE_CONSTRAINT_NAME
FROM INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS
) AS rc
INNER JOIN (
SELECT CONSTRAINT_SCHEMA + '.' + CONSTRAINT_NAME AS CONSTRAINT_NAME,
CAST(TABLE_SCHEMA + '.' + TABLE_NAME AS VARCHAR(MAX)) AS TABLE_NAME
FROM INFORMATION_SCHEMA.CONSTRAINT_TABLE_USAGE
) AS c ON c.CONSTRAINT_NAME = rc.CONSTRAINT_NAME
INNER JOIN (
SELECT CONSTRAINT_SCHEMA + '.' + CONSTRAINT_NAME AS CONSTRAINT_NAME,
CAST(TABLE_SCHEMA + '.' + TABLE_NAME AS VARCHAR(MAX)) AS TABLE_NAME
FROM INFORMATION_SCHEMA.CONSTRAINT_TABLE_USAGE
) AS p ON p.CONSTRAINT_NAME = rc.UNIQUE_CONSTRAINT_NAME
CROSS JOIN (
SELECT 0 UNION ALL
SELECT 1
) AS x(Permutation)
WHERE c.TABLE_NAME p.TABLE_NAME
), cteHierarchy(TableName, TablePath)
AS (
SELECT TableName,
CHAR(2) + TableName + CHAR(2) AS TablePath
FROM (
SELECT CAST(TABLE_SCHEMA + '.' + TABLE_NAME AS VARCHAR(MAX)) AS TableName
FROM INFORMATION_SCHEMA.TABLES
) AS d
WHERE TableName = @table
UNION ALL
SELECT t.ParentTableName AS TableName,
h.TablePath + t.ParentTableName + CHAR(2) AS TablePath
FROM cteHierarchy AS h
INNER JOIN cteTables AS t ON t.ChildTableName = h.TableName
WHERE h.TablePath NOT LIKE '%' + CHAR(2) + t.ParentTableName + CHAR(2) + '%'
)
SELECT TableName
FROM cteHierarchy
ORDER BY TablePath;
Statistics: Duration: 258,447, Reads: 19,714,744.
Here’s Geri’s solution (call it Geri Recursive):
With T1 As
(Select Distinct Schema_Name(P.schema_id)+'.'+P.name P,
Schema_Name(R.schema_id)+'.'+R.name R
From sys.foreign_key_columns F
Inner Join sys.objects P
On F.parent_object_id=P.object_id
Inner Join sys.objects R
On F.referenced_object_id=R.object_id),
T2 As
(Select ROW_NUMBER() Over(Order By Case When @table =P Then 1 When @table =R Then 2 Else 3 End,P,R) Mispar,
*
From T1),
T3 As
(Select Cast(0 As Int) Mispar,
P,
R,
P PR,
(-1) Idcun,
Cast(','+P+','+R+',' As Varchar(Max)) S_in
From T2
Where Mispar=1
And @table In (P,R)
Union
Select Cast(1 As Int) Mispar,
P,
R,
R PR,
(-1) Idcun,
Cast(','+P+','+R+',' As Varchar(Max)) S_in
From T2
Where Mispar=1
And @table In (P,R)
Union All
Select Cast(Mispar As Int),
P,
R,
Case When Idcun=1 Then R When Idcun=2 Then P Else '' End PR,
Case When Idcun=2 Then 1 Else Idcun End Idcun,
Cast(S_in+Case When Idcun=1 Then R+',' When Idcun=2 Then P+',' Else '' End As Varchar(Max)) S_in
From (Select T2.Mispar,
T2.P,
T2.R,
Case When T3.S_in Like '%,'+T2.P+',%' And T3.S_in Not Like '%,'+T2.R+',%' Then 1
When T3.S_in Not Like '%,'+T2.P+',%' And T3.S_in Like '%,'+T2.R+',%' Then 2
Else 0 End Idcun,
T3.S_in
From T3
Inner Join T2
On T3.Mispar+1=T2.Mispar
And T3.Mispar>0
Where T3.Idcun0) T
Union All
Select Cast(Mispar As Int),
P,
R,
Case When Idcun=1 Then R When Idcun=2 Then P Else '' End PR,
Case When Idcun=2 Then 1 Else Idcun End Idcun,
Cast(S_in+Case When Idcun=1 Then R+',' When Idcun=2 Then P+',' Else '' End As Varchar(Max)) S_in
From (Select T2.Mispar,
T2.P,
T2.R,
Case When T3.S_in Like '%,'+T2.P+',%' And T3.S_in Not Like '%,'+T2.R+',%' Then 1
When T3.S_in Not Like '%,'+T2.P+',%' And T3.S_in Like '%,'+T2.R+',%' Then 2
Else 0 End Idcun,
T3.S_in
From T3
Inner Join T2
On T3.Idcun=1
And T2.Mispar=1) T)
Select Distinct PR
From T3
Where PR''
OPTION(MAXRECURSION 0);
Statistics: Duration: 680, Reads: 156,951. Quite fast as you can see.
And here’s Ruben’s solution (call it Ruben Recursive):
WITH referenced(id,level)
AS
(
SELECT referenced_object_id AS id, (select count(*) from sys.foreign_keys) level
FROM sys.foreign_keys FK
WHERE referenced_object_id = OBJECT_ID(@table) or parent_object_id = OBJECT_ID(@table)
UNION ALL
SELECT
case
when (parent_object_id = id) then referenced_object_id
else parent_object_id
end
AS id,
referenced.level-1 level
FROM sys.foreign_keys FK
JOIN referenced ON (referenced_object_id = id or parent_object_id = id)
AND (referenced_object_idparent_object_id)
AND referenced.level >= 1
)
SELECT DISTINCT OBJECT_SCHEMA_NAME(id) AS obj_schema_name, OBJECT_NAME(id) AS obj_name
FROM referenced;
Statistics: I stopped it after it ran for 30 minutes.
Cheers,
BG
About the Author
You May Also Like