Graph Matching with T-SQL Part 2: Maximal Matching
Practical use cases for finding a maximum matching are quite obvious--for example, maximum utilization of agents. But what could be practical use cases for finding a maximal matching? Here's what you need to know.
September 18, 2017
Last month I started a series on graph matching where I introduced fundamental concepts in the space. Given a bipartite graph G, such as one representing possible relationships between applicants and jobs, matching M in G represents any subset of the edges in G such that no node is incident to more than one edge. This means that an applicant is allowed to fulfill no more than one job, and a job is allowed to be fulfilled by no more than one applicant. I used a table called G to represent the graph, with column x representing the applicant ID, column y representing the job ID, and a key defined on (x, y). I used a table called M to represent a matching in G, with one key defined on x and another on y.
I provided a challenge to write a stored procedure called MaximalMatching that populates M with a maximal matching. That is, a matching that cannot be extended with more edges from G. As a reminder, a maximal matching is not necessarily a maximum matching. The latter is a matching with the maximum possible number of edges. Figure 1 illustrates the different types of matchings.
Figure 1 - Graph matching examples
Practical use cases for finding a maximum matching are quite obvious--for example, maximum utilization of agents. But what could be practical use cases for finding a maximal matching? One such use case is as a preliminary step to finding a maximum matching. You start by finding a maximal matching—which is quite simple and cheap—and then, with a series of augmentations, convert it to a maximum matching. The larger the maximal matching that you start with, the quicker it will be to then convert it to a maximum matching. I’ll demonstrate this approach next month in Part 3 of the series.
This article’s focus is exploring various solutions to finding a maximal matching. Obviously, the faster the solution is, the better. Also, the more edges that it finds, the better. Depending on the run time difference, a slower solution that finds a larger maximal matching could be preferred. Ultimately, once you combine it with the rest of the work involved in finding a maximum matching, what counts is the overall run time.
Sample Data
When working on solutions for a T-SQL task, I usually like to use two sets of sample data—a small set to quickly check the correctness of the solution and a large set to measure its performance.
Use the following code to create a small set of sample data:
SET NOCOUNT ON;
USE tempdb;
-- Cleanup
DROP TABLE IF EXISTS dbo.M, dbo.G, dbo.X, dbo.Y;
GO
-- Set X, e.g., Applicants
CREATE TABLE dbo.X
(
x VARCHAR(10) NOT NULL CONSTRAINT PK_X PRIMARY KEY,
moreonx VARCHAR(20) NOT NULL
);
INSERT INTO dbo.X(x, moreonx) VALUES
('A', 'Applicant A'),
('B', 'Applicant B'),
('C', 'Applicant C'),
('D', 'Applicant D');
-- Set Y, e.g., Jobs
CREATE TABLE dbo.Y
(
y INT NOT NULL CONSTRAINT PK_Y PRIMARY KEY,
moreony VARCHAR(20) NOT NULL
);
INSERT INTO dbo.Y(y, moreony) VALUES
(1, 'Job 1'),
(2, 'Job 2'),
(3, 'Job 3'),
(4, 'Job 4');
-- Graph G = (X, Y, E), e.g., possible applicant-job connections
CREATE TABLE dbo.G
(
x VARCHAR(10) NOT NULL
CONSTRAINT FK_G_X FOREIGN KEY REFERENCES dbo.X,
y INT NOT NULL
CONSTRAINT FK_G_Y FOREIGN KEY REFERENCES dbo.Y,
CONSTRAINT PK_G PRIMARY KEY (x, y)
);
INSERT INTO dbo.G(x, y) VALUES
('A', 1),
('A', 2),
('B', 1),
('B', 3),
('C', 3),
('C', 4),
('D', 3);
GO
-- M is a matching in G
CREATE TABLE dbo.M
(
x VARCHAR(10) NOT NULL
CONSTRAINT UQ_M_x UNIQUE,
y INT NOT NULL
CONSTRAINT UQ_M_y UNIQUE,
CONSTRAINT PK_M PRIMARY KEY (x, y),
CONSTRAINT FK_M_G FOREIGN KEY (x, y) REFERENCES dbo.G(x, y)
);
Use the following code to create a large set of sample data:
-- Cleanup
SET NOCOUNT ON;
USE tempdb;
DROP TABLE IF EXISTS dbo.M, dbo.G, dbo.X, dbo.Y;
DROP FUNCTION IF EXISTS dbo.GetNums;
GO
-- Helper function dbo.GetNums
CREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLE
AS
RETURN
WITH
L0 AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)),
L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
FROM L5)
SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
FROM Nums
ORDER BY rownum;
GO
CREATE TABLE dbo.X
(
x VARCHAR(10) NOT NULL CONSTRAINT PK_X PRIMARY KEY,
moreonx VARCHAR(20) NOT NULL
);
CREATE TABLE dbo.Y
(
y INT NOT NULL CONSTRAINT PK_Y PRIMARY KEY,
moreony VARCHAR(20) NOT NULL
);
CREATE TABLE dbo.G
(
x VARCHAR(10) NOT NULL
CONSTRAINT FK_G_X FOREIGN KEY REFERENCES dbo.X,
y INT NOT NULL
CONSTRAINT FK_G_Y FOREIGN KEY REFERENCES dbo.Y,
CONSTRAINT PK_G PRIMARY KEY (x, y)
);
CREATE TABLE dbo.M
(
x VARCHAR(10) NOT NULL
CONSTRAINT UQ_M_x UNIQUE,
y INT NOT NULL
CONSTRAINT UQ_M_y UNIQUE,
CONSTRAINT PK_M PRIMARY KEY (x, y),
CONSTRAINT FK_M_G FOREIGN KEY (x, y) REFERENCES dbo.G(x, y)
);
GO
DECLARE @n AS INT = 10000000; -- ~ total num rows
SET @n = SQRT(@n * 2); -- num members of arithmetic sequence
INSERT INTO dbo.X(x, moreonx)
SELECT
RIGHT('000000000' + CAST(N.n AS VARCHAR(10)), 10) AS x,
'Applicant ' + RIGHT('000000000' + CAST(N.n AS VARCHAR(10)), 10) AS moreonx
FROM dbo.GetNums(1, @n) AS N;
INSERT INTO dbo.Y(y, moreony)
SELECT
N.n AS y,
'Job ' + CAST(N.n AS VARCHAR(10)) AS moreonx
FROM dbo.GetNums(1, @n) AS N;
INSERT INTO dbo.G WITH (TABLOCK) (x, y)
SELECT RIGHT('000000000' + CAST(X.n AS VARCHAR(10)), 10) AS x, Y.n AS y
FROM dbo.GetNums(1, @n) AS X
CROSS APPLY (SELECT TOP (@n - X.n + 1) n
FROM dbo.GetNums(1, @n)
ORDER BY NEWID()) AS Y;
With the large set of sample data, you initialize @n with the total number of rows you want in G. (You will get approximately that number in practice.) This number represents the sum of an arithmetic sequence. For example, if you initialize @n with 10, this number represents the sum of the arithmetic sequence 1 + 2 + 3 + 4. The assignment SET @n = SQRT(@n * 2) computes the number of members in the sequence (4 in our example). The code then populates the table with x running from 1 to @n, and matching the different x values with @n, @n-1, @n-2, …, 1 y values, selected randomly from the set 1 through @n. For example, one execution of this code with @n initialized with 10 produced the following rows in G:
x y
---------- -----------
0000000001 1
0000000001 2
0000000001 3
0000000001 4
0000000002 1
0000000002 2
0000000002 4
0000000003 1
0000000003 3
0000000004 3
Initializing @n with 10000000 populates G with 10001628 rows—a descent number to measure performance.
Solutions
I’ll present a number of solutions and provide the run time I measured on my machine against the large set of sample data (~10,000,000 rows). I’ll also provide the number of matches found. Remember, the faster the better, and the more matches the better.
To test the correctness of your solution, after running the MaximalMatching stored procedure, you can verify that there’s no edge in G with both x and y nodes free (not present in M). To achieve this, you can use the following query:
WITH C AS
(
SELECT TOP (1) 0 AS sortcol,
'Sorry, this is not a maximal matching. :('
+ ' For example, edge ('
+ x + ', ' + CAST(y AS VARCHAR(10))
+ ') from G could be added to M.' AS IsMaximalMatching
FROM dbo.G
WHERE NOT EXISTS ( SELECT * FROM dbo.M WHERE M.x = G.x )
AND NOT EXISTS ( SELECT * FROM dbo.M WHERE M.y = G.y )
UNION ALL
SELECT 1 AS sortcol,
'Congrats! This is a maximal matching. :)'
+ ' You found ' + CAST( (SELECT COUNT(*) FROM dbo.M) AS VARCHAR(10) )
+ ' matchings out of '
+ CAST( (SELECT COUNT(*) FROM dbo.G) AS VARCHAR(10) )
+ ' edges in G.' AS IsMaximalMatching
)
SELECT TOP (1) IsMaximalMatching
FROM C
ORDER BY sortcol;
For example, run this query now before executing any implementation of the MaximalMatching stored procedure. Since M is currently empty, it’s not a maximal matching in G. You will get output similar to the following (possibly with a different sample edge reported since the code populated the table used randomization, plus the TOP query isn’t ordered):
IsMaximalMatching
----------------------------------------------------------------------------------------------------------------
Sorry, this is not a maximal matching. :( For example, edge (0000000001, 1) from G could be added to M.
Solution 1: IGNORE_DUP_KEY, single INSERT statement
Before I wrote and tested Solution 1, I was certain that it would be among the fastest solutions possible. SQL Server supports the IGNORE_DUP_KEY option with unique indexes, causing INSERT statements to not completely fail when trying to insert duplicate keys, rather simply ignore the duplicates. So, I implemented the MaximalMatching stored procedure with the following steps:
1. Declare a table variable @M with columns x and y, with one unique index with the IGNORE_DUP_KEY option on x and another on y.
2. Truncate table M.
3. Insert into @M all rows from G (ignoring duplicates).
4. Insert into M all rows from @M.
Here’s the full stored procedure’s code implementing Solution 1:
CREATE OR ALTER PROC dbo.MaximalMatching
AS
SET NOCOUNT, XACT_ABORT ON;
DECLARE @M AS TABLE
(
x VARCHAR(10) NOT NULL
INDEX idx_x UNIQUE NONCLUSTERED WITH (IGNORE_DUP_KEY = ON),
y INT NOT NULL
INDEX idx_y UNIQUE NONCLUSTERED WITH (IGNORE_DUP_KEY = ON)
);
TRUNCATE TABLE dbo.M;
INSERT INTO @M(x, y) SELECT x, y FROM dbo.G;
INSERT INTO dbo.M(x, y) SELECT x, y FROM @M;
GO
Use the following code to execute the stored procedure:
EXEC dbo.MaximalMatching;
It took this stored procedure 11 seconds to complete on my machine, which is not too bad for finding a maximal matching in a graph with 10M edges. But, to my surprise, when I ran the verification query I provided earlier, it reported that what I have in M is not a maximal matching in G. In fact, M was populated with only a few hundred rows, even though with our sample data you’d expect it to be in the thousands. Apparently, the way SQL Server handles an insertion to a table with multiple indexes with the IGNORE_DUP_KEY option is not what I expected.
Say, for instance, that you try to insert to @M the following four rows:
DECLARE @M AS TABLE
(
x VARCHAR(10) NOT NULL
INDEX idx_x UNIQUE NONCLUSTERED WITH (IGNORE_DUP_KEY = ON),
y INT NOT NULL
INDEX idx_y UNIQUE NONCLUSTERED WITH (IGNORE_DUP_KEY = ON)
);
INSERT INTO @M(x, y) VALUES('A', 1),('A', 2),('B', 1),('B', 2);
SELECT * FROM @M;
I expected that at least from the perspective of the IGNORE_DUP_KEY option, the insertions would be processed in a progressive manner--that is, as four single row inserts. So, depending on the actual insertion order, which is nondeterministic, rows that are identified as having duplicates would be eliminated. With these four rows, irrespective of order, I expected to have two rows inserted. For example, if the insertion order happens to be the following:
('A', 1)
('A', 2)
('B', 1)
('B', 2)
I expected ('A', 1) and ('B', 2) to be inserted.
If the insertion order happens to be the following:
('A', 2)
('A', 1)
('B', 1)
('B', 2)
I expected ('A', 2) and ('B', 1) to be inserted. Any order that you consider, you should get two rows inserted here. However, to my surprise, the SELECT query showing the contents of @M returned just one row:
x y
---------- -----------
B 1
To figure out what happened, you need to understand how SQL Server optimizes an INSERT statement against a table with indexes with the IGNORE_DUP_KEY option. Craig Freedman explains this in a blog on the topic. Applied to our example with the small sample of four rows, SQL Server produced the following plan shown in Figure 2 (produced with SentryOne’s Plan Explorer).
The blue set of operators sorts the rows by y. Suppose that the output of the sort is the following:
x y
- -
B 1
A 1
B 2
A 2
The Left Semi Join operator computes a probe value marking dups with keys that are already present in the index on y (before the insertion). In our example, there are none since we start with an empty table @M. So, the output of this step is the following:
x y dup
- - ---
B 1 0
A 1 0
B 2 0
A 2 0
The Assert operator discards rows with a probe value indicating a duplicate, returning the following (again, no dups with already existing rows):
x y
- -
B 1
A 1
B 2
A 2
The Segment operator handles each y group separately:
x y
- -
B 1
A 1
---
B 2
A 2
The Top operator keeps only the first row in each segment. So, the blue set of operators emits the following:
x y
- -
B 1
B 2
The set of red operators then performs similar work on the remaining rows, only this time sorting by x:
x y
- -
B 1
B 2
Again, no dups with already existing keys in the index on x, so the Assert operator has nothing to remove. There’s only one segment in the above stream of rows, and the Top operator applied to this segment yields the following row:
x y
- -
B 1
As it turns out there’s a connect item opened by Steve Kass, suggesting that this behavior is a bug, but as you can read from Microsoft’s response to the post, they do not consider it a bug. So, unfortunately, Solution 1 isn’t valid and therefore we have to discard it.
Solution 2: IGNORE_DUP_KEY, multiple INSERT statements with cursor
Solution 2 starts similar to Solution 1 in that it declares a table variable called @M with one unique index with the IGNORE_DUP_KEY option on column x and another on column y. But instead of using a single multi-row INSERT statement to copy rows from G into @M, it uses a cursor to iterate through the rows in G, ordered by x, y. In each round, the code checks if the current x value is greater than the previously inserted x value, and, if so, it attempts to insert the current cursor row’s values into @M. If the y value is a duplicate, the insertion is ignored; otherwise, it succeeds. Once the code finishes processing all cursor rows, it copies the rows from @M into M. Here’s the full solution’s code:
CREATE OR ALTER PROC dbo.MaximalMatching
AS
SET NOCOUNT, XACT_ABORT ON;
DECLARE @x AS VARCHAR(10), @prevx AS VARCHAR(10) = '', @y AS INT, @C AS CURSOR;
DECLARE @M AS TABLE
(
x VARCHAR(10) NOT NULL
INDEX idx_x UNIQUE NONCLUSTERED WITH (IGNORE_DUP_KEY = ON),
y INT NOT NULL
INDEX idx_y UNIQUE NONCLUSTERED WITH (IGNORE_DUP_KEY = ON)
);
TRUNCATE TABLE dbo.M;
SET @C = CURSOR FORWARD_ONLY STATIC READ_ONLY FOR
SELECT x, y FROM dbo.G ORDER BY x, y;
OPEN @C;
FETCH NEXT FROM @C INTO @x, @y;
WHILE @@FETCH_STATUS = 0
BEGIN
IF @x > @prevx
BEGIN
INSERT INTO @M(x, y) VALUES(@x, @y);
IF @@ROWCOUNT > 0
SET @prevx = @x;
END;
FETCH NEXT FROM @C INTO @x, @y;
END;
INSERT INTO dbo.M(x, y) SELECT x, y FROM @M;
GO
This solution is deterministic since it processes the rows from G in order of x, y. It took this solution 235 seconds to finish on my machine, and found a maximal matching with 4415 matches. That’s both pretty slow, and there’s room for improvement in the cardinality of the maximal matching.
Solution 3: Cursor, NOT EXISTS
Solution 3 is similar in concept to Solution 2, but it doesn’t use an intermediary table variable with two indexes with the IGNORE_DUP_KEY option. Instead, as it iterates through the cursor rows, it uses two NOT EXISTS predicates against M to see if both vertices of the edge in G are free, in which case it adds the row to M. Similar to Solution 2, it first checks that the current row’s x value is greater than the previously inserted x value, before checking for existence in M. Here’s the complete solution’s code:
CREATE OR ALTER PROC dbo.MaximalMatching
AS
SET NOCOUNT, XACT_ABORT ON;
DECLARE @x AS VARCHAR(10), @prevx AS VARCHAR(10) = '', @y AS INT, @C AS CURSOR;
TRUNCATE TABLE dbo.M;
SET @C = CURSOR FORWARD_ONLY STATIC READ_ONLY FOR
SELECT x, y FROM dbo.G ORDER BY x, y;
OPEN @C;
FETCH NEXT FROM @C INTO @x, @y;
WHILE @@FETCH_STATUS = 0
BEGIN
IF @x > @prevx
IF NOT EXISTS ( SELECT * FROM dbo.M WHERE x = @x )
AND NOT EXISTS ( SELECT * FROM dbo.M WHERE y = @y )
BEGIN
INSERT INTO dbo.M(x, y) VALUES(@x, @y);
SET @prevx = @x;
END;
FETCH NEXT FROM @C INTO @x, @y;
END;
GO
This solution completed in 132 seconds on my machine. That’s almost half the run time of Solution 2. Since it processes the rows in the same order as Solution 2, it naturally finds the same maximal matching as Solution 2 with 4,415 matches in my case, and is also deterministic.
Solution 4: Loop, TOP with NOT EXISTS, ordered by x, y
Solution 4 is quite simplistic and highly efficient. It starts by using a TOP (1) query against G, ordered by x, y, storing the column values of the first row that it finds in local variables @x and @y. It then uses a loop to iterate once per distinct x value found, and in each iteration writes a new row into M with the values from the local variables. It then uses another TOP (1) query to look for the next row where x is greater than @x and where y doesn’t exist in M, writing its values into local variables.
Here’s the complete solution’s code:
CREATE OR ALTER PROC dbo.MaximalMatching
AS
SET NOCOUNT, XACT_ABORT ON;
DECLARE @x AS VARCHAR(10), @y AS INT;
TRUNCATE TABLE dbo.M;
SELECT TOP (1) @x = x, @y = y
FROM dbo.G
ORDER BY x, y;
WHILE @@ROWCOUNT > 0
BEGIN
INSERT INTO dbo.M(x, y) VALUES(@x, @y);
SELECT TOP (1) @x = x, @y = y
FROM dbo.G
WHERE x > @x
AND NOT EXISTS ( SELECT * FROM dbo.M WHERE M.y = G.y )
ORDER BY x, y;
END;
GO
Since this solution processes the rows in the same order as solutions 2 and 3, it finds the same maximal matching. So it’s just as deterministic as the other two. From a performance perspective, it’s much faster, though. It finished in 8 seconds on my machine! That’s not too bad, considering the fact that G has about 10,000,000 rows. There’s still room for improvement in terms of the matching’s cardinality (4,415 matches found in my case).
Note that for this solution to be as optimal as possible, between the x and y columns, you want the iteration element to be the one with the higher density (fewer distinct values). If you expect both to have similar densities, it doesn’t matter which one you choose. (In our example, we use x.) But if y had significantly higher density, you would have been better off iterating based on y, of course.
Solution 5: Loop, TOP with NOT EXISTS, ordered by cardinality
Solution 4 has great performance, but there’s still room for improvement in terms of the matching’s cardinality. Statistically speaking, without knowing anything about the nature of the data, if you process smaller x groups before larger x groups (assuming x and y have similar densities), there's a greater likelihood that you’ll find a maximal matching of a greater cardinality. That’s the main logic behind Solution 5. Note that this solution computes both aggregates and ranking, and can greatly benefit from batch mode processing. In order to enable batch processing, there must be a columnstore index present on the queried table, even if not used (for details, see the following article). Use the following code to create a dummy empty filtered columnstored index to enable batch processing:
CREATE NONCLUSTERED COLUMNSTORE INDEX idx_cs_dummy
ON dbo.G(x)
WHERE x = 'yes' AND x = 'no';
Here’s the complete stored procedure’s code implementing Solution 5:
CREATE OR ALTER PROC dbo.MaximalMatching
AS
SET NOCOUNT, XACT_ABORT ON;
DECLARE @n AS BIGINT, @x AS VARCHAR(10), @y AS INT;
TRUNCATE TABLE dbo.M;
CREATE TABLE #G
(
n BIGINT NOT NULL,
x VARCHAR(10) NOT NULL,
y INT NOT NULL,
INDEX idx_n_y CLUSTERED(n, y)
);
WITH C AS
(
SELECT x, y,
COUNT(*) OVER(PARTITION BY x) AS cnt
FROM dbo.G
)
INSERT INTO #G (n, x, y)
SELECT DENSE_RANK() OVER(ORDER BY cnt, x) AS n, x, y
FROM C;
SELECT TOP (1) @n = n, @x = x, @y = y
FROM #G
ORDER BY n, y;
WHILE @@ROWCOUNT > 0
BEGIN
INSERT INTO dbo.M(x, y) VALUES(@x, @y);
SELECT TOP (1) @n = n, @x = x, @y = y
FROM #G AS G
WHERE n > @n
AND NOT EXISTS ( SELECT * FROM dbo.M WHERE M.y = G.y )
ORDER BY n, y;
END;
DROP TABLE #G;
GO
The code queries G, counts the number of rows per x group (call the result column cnt), and assigns dense rank values to the rows ordered by cnt and x (call the result column n). The code populates a temporary table called #G with n (x cardinality), x and y, with a clustered index defined on n and y. The code then uses similar logic to Solution 4 to iterate once per distinct x value using a TOP (1) query in each round, only now processing the rows based on n, y order. This solution took 49 seconds to complete on my machine—clearly slower than Solution 4, but it found a maximal matching with a cardinality of 4,472 (compared to 4,415 found by Solution 4).
Make sure that you don’t drop the dummy columnstore index yet since it’s beneficial to Solution 6 as well.
Solution 6: Ranking based on random y order
Solution 6 uses a loop, and, in each iteration, keeps adding more edges from G into M. In each round, it uses a set-based query with two CTEs (C1 and C2) and an outer INSERT statement. C1 filters rows from G where neither x nor y are present in M (free vertices), assigns row numbers that are partitioned by x, and orders randomly within the partition (result column rnx). C2 then queries the rows from C1, filters only rows where rnx is 1 so you get one row per distinct x, and assigns row numbers that are partitioned by y with no particular order (result column rny) . The outer INSERT statement queries C2 and filters only rows from C2 where the rny is 1. This solution finds a subset of qualifying matches in each round, and adds those to M. As soon as the current round cannot find any more matches, it means that the matching in M is maximal.
Here’s the complete solution’s code:
CREATE OR ALTER PROC dbo.MaximalMatching
AS
SET NOCOUNT, XACT_ABORT ON;
TRUNCATE TABLE dbo.M;
WHILE 1 = 1
BEGIN
WITH C1 AS
(
SELECT x, y,
ROW_NUMBER() OVER(PARTITION BY x ORDER BY NEWID()) AS rnx
FROM dbo.G
WHERE NOT EXISTS ( SELECT * FROM dbo.M WHERE M.x = G.x )
AND NOT EXISTS ( SELECT * FROM dbo.M WHERE M.y = G.y )
),
C2 AS
(
SELECT x, y,
ROW_NUMBER() OVER(PARTITION BY y ORDER BY (SELECT NULL)) AS rny
FROM C1
WHERE rnx = 1
)
INSERT INTO dbo.M(x, y)
SELECT x, y
FROM C2
WHERE rny = 1;
IF @@ROWCOUNT = 0 BREAK;
END;
GO
Since this solution uses random order for the selection of first row within each x group, and then no particular order for the selection of first row within each y group, it is nondeterministic.
It took 53 seconds to complete on my machine, and produced a maximal matching with a cardinality of 4468 (vs. 4,415 with Solution 4 and 4,472 with Solution 5). Since it’s nondeterministic, each execution can find a different maximal matching of a different cardinality.
Solution 7: Quirky update
Solution 7 uses a quirky update (a multi-row UPDATE statement with variable assignment). It declares a character string variable called @s, which it initializes with a separator '.'. It then uses a quirky update against G, where per row it checks if @s doesn’t already contain x and y; if not it concatenates to @s the current x value, a ',' separator, and a '.' separator. When the update finishes, @s has one string with pairs of .x,y. values representing a maximal matching. The next, and last, step is a query using the STRING_SPLIT function that splits the string in @s into a row set with the pairs of x, y values, and an INSERT statement that inserts those rows into M.
Here’s the complete solution’s code:
CREATE OR ALTER PROC dbo.MaximalMatching
AS
SET NOCOUNT, XACT_ABORT ON;
DECLARE @s AS VARCHAR(MAX) = '.';
TRUNCATE TABLE dbo.M;
SELECT
@s +=
CASE
WHEN CHARINDEX('.' + x + ',', @s) > 0
OR CHARINDEX(',' + CAST(y AS VARCHAR(20)) + '.', @s) > 0
THEN ''
ELSE x + ',' + CAST(y AS VARCHAR(20)) + '.'
END
FROM dbo.G;
INSERT INTO dbo.M(x, y)
SELECT
LEFT(value, CHARINDEX(',', value) - 1) AS x,
CAST(RIGHT(value, LEN(value) - CHARINDEX(',', value)) AS INT) AS y
FROM STRING_SPLIT(CASE WHEN @s = '.' THEN NULL ELSE SUBSTRING(@s, 2, LEN(@s) - 2) END, '.');
GO
This solution is nondeterministic and, apparently, horribly slow. It took 17 seconds to complete with G populated with only 100K rows. Remember, all the other solutions were tested with 10M rows. It was fun to write this solution, though!
Your Next Mission
I’d say that our winners are solutions 4, 5 and 6. The first is much faster than the other two, but it also produces a maximal matching with a lower cardinality.
Your next mission, should you choose to accept it, is to write a solution that fills M with a maximum matching. Recall that a maximum matching is a maximal matching with the maximum possible cardinality. (See Figure 1 to make sure this is clear.)
As a reminder from Part 1 in the series, here’s one of the known algorithms for finding a maximum matching:
Repeatedly find an M-augmenting path in G and update M with the symmetric difference of that path and M.
Feel free, though, to use any algorithm that you wish. The faster the solution is and the more effectively it scales, the better. Good luck!
About the Author
You May Also Like