Dumping the Duplicates

Readers help Jim eliminate duplicate rows from a table's data.


Congratulations to Norman A. Armas, senior programmer analyst at Miami-based Apollo Ship Chandlers, and John McShane, DBA at Prophet 21 in Yardley, Pennsylvania. Norman won first prize of $100 for the best solution to the January Reader Challenge, "Dumping the Duplicates." John won second prize of $50. Here's a recap of the problem and the solution to the January Reader Challenge.

Problem


Jim wants to eliminate duplicate rows from a table's data, so he sets up a test table by writing the following query:

SELECT IDENTITY( int ) AS IDCol,
       a1.au_id, a1.au_fname, a1.au_lname,
       a1.phone, a1.address, a1.city, a1.state, a1.zip
INTO #a
FROM pubs..authors AS a1
CROSS JOIN pubs..authors AS a2
CROSS JOIN pubs..authors AS a3

Then, he creates the required indexes and a constraint on the table, as the following query shows:

CREATE CLUSTERED INDEX IX_a_name
    ON #a( au_lname, au_fname )

ALTER TABLE #a ADD CONSTRAINT PK_a_IDCol
    PRIMARY KEY NONCLUSTERED( IDCol )

Next, Jim wants to eliminate duplicate rows from this table for every combination of the au_lname and au_fname columns. He needs to keep only one row from each set of duplicates. Help him write an efficient solution that works in all SQL Server releases.

Solution


Jim knows that he can use a GROUP BY query to identify the table's duplicate rows, so he writes the following query:

SELECT a.au_lname, a.au_fname
FROM #a AS a
GROUP BY a.au_lname, a.au_fname
HAVING COUNT(*) > 1
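The GROUP BY/HAVING pattern for spotting duplicates is portable to any SQL engine. Here is a minimal sketch using Python's built-in sqlite3 module; the table layout mirrors the article's, but the sample names and data are invented for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE a (IDCol INTEGER PRIMARY KEY, au_fname TEXT, au_lname TEXT)"
)
conn.executemany(
    "INSERT INTO a (au_fname, au_lname) VALUES (?, ?)",
    [("Johnson", "White"), ("Johnson", "White"),
     ("Marjorie", "Green"), ("Cheryl", "Carson")],
)

# GROUP BY the name columns; HAVING COUNT(*) > 1 keeps only duplicated names
dupes = conn.execute(
    """SELECT au_lname, au_fname
       FROM a
       GROUP BY au_lname, au_fname
       HAVING COUNT(*) > 1"""
).fetchall()
print(dupes)  # [('White', 'Johnson')]
```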

Then, he can incorporate this query into a DELETE statement. Jim can keep one row from each set of duplicates by retaining only the row with the highest IDCol value in every group. The following DELETE statement eliminates the table's duplicate rows:

DELETE #a
WHERE IDCol NOT IN ( SELECT MAX( a.IDCol )
                     FROM #a AS a
                     GROUP BY a.au_lname, a.au_fname )
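As a quick sanity check of the keep-the-MAX(IDCol) logic, the same DELETE pattern can be run against an in-memory SQLite table from Python (the data below is illustrative, not from the pubs database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE a (IDCol INTEGER PRIMARY KEY, au_fname TEXT, au_lname TEXT)"
)
# Three copies of White, two of Green, one of Carson (IDCol 1-6)
conn.executemany(
    "INSERT INTO a (au_fname, au_lname) VALUES (?, ?)",
    [("Johnson", "White")] * 3
    + [("Marjorie", "Green")] * 2
    + [("Cheryl", "Carson")],
)

# Delete every row that is not the highest IDCol within its name group
conn.execute(
    """DELETE FROM a
       WHERE IDCol NOT IN ( SELECT MAX(IDCol)
                            FROM a
                            GROUP BY au_lname, au_fname )"""
)
survivors = conn.execute(
    "SELECT IDCol, au_fname, au_lname FROM a ORDER BY IDCol"
).fetchall()
print(survivors)  # one row per distinct name: IDCol 3, 5, 6
```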

Alternatively, Jim can rewrite the DELETE query above by using an EXISTS subquery or a LEFT JOIN. The performance of each variation depends on the data, the indexes, and the resources available on the server. If the number of rows Jim needs to delete is fairly large, he can execute the DELETE statement in batches to reduce locking and logging overhead. Similarly, if the majority of a large table's rows are duplicates, recreating the table with a SELECT INTO statement that copies only the rows to keep would be more efficient than deleting in place.
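The correlated-EXISTS variation can be sketched the same way (again via Python's sqlite3 module with invented sample data): a row is a duplicate precisely when a same-named row with a higher IDCol exists, so deleting those rows keeps one row per name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE a (IDCol INTEGER PRIMARY KEY, au_fname TEXT, au_lname TEXT)"
)
conn.executemany(
    "INSERT INTO a (au_fname, au_lname) VALUES (?, ?)",
    [("Johnson", "White"), ("Johnson", "White"), ("Marjorie", "Green")],
)

# EXISTS variant: delete any row that has a same-named twin with a larger IDCol
conn.execute(
    """DELETE FROM a
       WHERE EXISTS ( SELECT 1
                      FROM a AS b
                      WHERE b.au_lname = a.au_lname
                        AND b.au_fname = a.au_fname
                        AND b.IDCol > a.IDCol )"""
)
remaining = conn.execute("SELECT IDCol FROM a ORDER BY IDCol").fetchall()
print(remaining)  # [(2,), (3,)] -- the highest IDCol per name survives
```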
