Dumping the Duplicates
Readers help Jim eliminate duplicate rows from a table's data.
January 8, 2002
Congratulations to Norman A. Armas, senior programmer analyst at Miami-based Apollo Ship Chandlers, and John McShane, DBA at Prophet 21 in Yardley, Pennsylvania. Norman won first prize of $100 for the best solution to the January Reader Challenge, "Dumping the Duplicates." John won second prize of $50. Here's a recap of the problem and the solution to the January Reader Challenge.
Problem
Jim wants to eliminate duplicate rows from a table's data, so he sets up a test table by writing the following query:
SELECT IDENTITY( int ) AS IDCol, a1.au_id, a1.au_fname, a1.au_lname,
       a1.phone, a1.address, a1.city, a1.state, a1.zip
INTO #a
FROM pubs..authors AS a1
CROSS JOIN pubs..authors AS a2
CROSS JOIN pubs..authors AS a3
Then, he creates the required indexes and a constraint on the table, as the following query shows:
CREATE CLUSTERED INDEX IX_a_name ON #a( au_lname, au_fname )

ALTER TABLE #a ADD CONSTRAINT PK_a_IDCol PRIMARY KEY NONCLUSTERED( IDCol )
Next, Jim wants to eliminate duplicate rows from this table for every combination of the au_lname and au_fname columns. He needs to keep only one row from each set of duplicates. Help him write an efficient solution that works in all SQL Server releases.
Solution
Jim knows that he can use a GROUP BY query to determine the table's duplicate rows, so he can write the following query:
SELECT a.au_lname, a.au_fname
FROM #a AS a
GROUP BY a.au_lname, a.au_fname
HAVING COUNT(*) > 1
Then, he can incorporate this grouping logic into a DELETE statement. Jim can keep one row from each set of duplicates by retaining only the row with the highest IDCol value within every group. The following DELETE statement eliminates the table's duplicate rows:
DELETE #a
WHERE IDCol NOT IN ( SELECT MAX( a.IDCol )
                     FROM #a AS a
                     GROUP BY a.au_lname, a.au_fname )
Alternatively, Jim can rewrite the DELETE query by using an EXISTS subquery or a LEFT JOIN. The performance of each variation depends on the data, the indexes, and the resources available on the server. If the number of rows Jim needs to delete is fairly large, he can execute the DELETE statement in batches to reduce locking and logging. Similarly, if the majority of rows in a large table are duplicates, re-creating the table with a SELECT INTO statement can be more efficient than deleting in place. The sketches below illustrate each approach.
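For example, the EXISTS and LEFT JOIN variations might look like the following sketches. The alias names (newer, k, KeepID) are illustrative choices, not part of the prize-winning solution; both statements keep the row with the highest IDCol value in each group, just as the NOT IN version does.

-- EXISTS variant: delete any row that has a duplicate with a higher IDCol
DELETE #a
WHERE EXISTS ( SELECT *
               FROM #a AS newer
               WHERE newer.au_lname = #a.au_lname
                 AND newer.au_fname = #a.au_fname
                 AND newer.IDCol > #a.IDCol )

-- LEFT JOIN variant: delete rows whose IDCol isn't the kept (maximum)
-- IDCol for that au_lname/au_fname combination
DELETE #a
FROM #a
LEFT JOIN ( SELECT MAX( a.IDCol ) AS KeepID
            FROM #a AS a
            GROUP BY a.au_lname, a.au_fname ) AS k
  ON #a.IDCol = k.KeepID
WHERE k.KeepID IS NULL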
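One way to run the DELETE in batches is with SET ROWCOUNT, which caps the number of rows a DELETE statement touches in releases through SQL Server 2000. This is a minimal sketch; the batch size of 10,000 is an arbitrary choice, and Jim would tune it to his data and server.

SET ROWCOUNT 10000            -- cap each DELETE at 10,000 rows (arbitrary batch size)
WHILE 1 = 1
BEGIN
    DELETE #a
    WHERE IDCol NOT IN ( SELECT MAX( a.IDCol )
                         FROM #a AS a
                         GROUP BY a.au_lname, a.au_fname )
    IF @@ROWCOUNT = 0 BREAK   -- stop when no duplicates remain
END
SET ROWCOUNT 0                -- restore unlimited row counts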
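And here is a sketch of the SELECT INTO approach: copy only the rows to keep into a new table, then drop the original. The table name #b is hypothetical, and Jim would need to re-create the clustered index and primary key constraint on the new table afterward. SELECT INTO is minimally logged under the appropriate database settings, which is what makes the rebuild cheap when most rows are duplicates.

-- Copy one row per au_lname/au_fname group into a new table
SELECT a.au_id, a.au_fname, a.au_lname, a.phone,
       a.address, a.city, a.state, a.zip
INTO #b
FROM #a AS a
WHERE a.IDCol IN ( SELECT MAX( a2.IDCol )
                   FROM #a AS a2
                   GROUP BY a2.au_lname, a2.au_fname )

DROP TABLE #a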