Dumping the Duplicates
Readers help Jim eliminate duplicate rows from a table's data.
January 8, 2002
Congratulations to Norman A. Armas, senior programmer analyst at Miami-based Apollo Ship Chandlers, and John McShane, DBA at Prophet 21 in Yardley, Pennsylvania. Norman won first prize of $100 for the best solution to the January Reader Challenge, "Dumping the Duplicates." John won second prize of $50. Here's a recap of the problem and the solution to the January Reader Challenge.
Problem
Jim wants to eliminate duplicate rows from a table's data, so he sets up a test table by writing the following query:
SELECT IDENTITY( int ) AS IDCol, a1.au_id, a1.au_fname, a1.au_lname,
       a1.phone, a1.address, a1.city, a1.state, a1.zip
INTO #a
FROM pubs..authors AS a1
CROSS JOIN pubs..authors AS a2
CROSS JOIN pubs..authors AS a3
Then, he creates the required indexes and a constraint on the table, as the following query shows:
CREATE CLUSTERED INDEX IX_a_name ON #a( au_lname, au_fname )

ALTER TABLE #a ADD CONSTRAINT PK_a_IDCol PRIMARY KEY NONCLUSTERED( IDCol )
Next, Jim wants to eliminate duplicate rows from this table for every combination of the au_lname and au_fname columns. He needs to keep only one row from each set of duplicates. Help him write an efficient solution that works in all SQL Server releases.
Solution
Jim knows that he can use a GROUP BY query to determine the table's duplicate rows, so he can write the following query:
SELECT a.au_lname, a.au_fname
FROM #a AS a
GROUP BY a.au_lname, a.au_fname
HAVING COUNT(*) > 1
Then, he can incorporate this grouping logic into a DELETE statement. Jim can keep one row from each set of duplicates by retaining only the row with the highest IDCol value within every group. The following DELETE statement eliminates the table's duplicate rows:
DELETE #a
WHERE IDCol NOT IN ( SELECT MAX( a.IDCol )
                     FROM #a AS a
                     GROUP BY a.au_lname, a.au_fname )
Alternatively, Jim can rewrite the DELETE query by using an EXISTS subquery or a LEFT JOIN. The performance of each variation depends on the data, the indexes, and the resources available on the server. If the number of rows Jim needs to delete is fairly large, he can execute the DELETE statement in batches to reduce locking and logging. Similarly, if the majority of rows in a large table are duplicates, re-creating the table with a SELECT INTO statement can be more efficient than deleting in place. The sketches below illustrate each approach.
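For example, the EXISTS and LEFT JOIN variations might look like the following sketches. The alias names (newer, k, KeepID) are illustrative choices, not part of the prize-winning solution; both statements keep the row with the highest IDCol value in each group, just as the NOT IN version does.

-- EXISTS variant: delete any row that has a duplicate with a higher IDCol
DELETE #a
WHERE EXISTS ( SELECT *
               FROM #a AS newer
               WHERE newer.au_lname = #a.au_lname
                 AND newer.au_fname = #a.au_fname
                 AND newer.IDCol > #a.IDCol )

-- LEFT JOIN variant: delete rows whose IDCol isn't the kept (maximum)
-- IDCol for that au_lname/au_fname combination
DELETE #a
FROM #a
LEFT JOIN ( SELECT MAX( a.IDCol ) AS KeepID
            FROM #a AS a
            GROUP BY a.au_lname, a.au_fname ) AS k
  ON #a.IDCol = k.KeepID
WHERE k.KeepID IS NULL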
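One way to run the DELETE in batches is with SET ROWCOUNT, which caps the number of rows a DELETE statement touches in releases through SQL Server 2000. This is a minimal sketch; the batch size of 10,000 is an arbitrary choice, and Jim would tune it to his data and server.

SET ROWCOUNT 10000            -- cap each DELETE at 10,000 rows (arbitrary batch size)
WHILE 1 = 1
BEGIN
    DELETE #a
    WHERE IDCol NOT IN ( SELECT MAX( a.IDCol )
                         FROM #a AS a
                         GROUP BY a.au_lname, a.au_fname )
    IF @@ROWCOUNT = 0 BREAK   -- stop when no duplicates remain
END
SET ROWCOUNT 0                -- restore unlimited row counts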
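And here is a sketch of the SELECT INTO approach: copy only the rows to keep into a new table, then drop the original. The table name #b is hypothetical, and Jim would need to re-create the clustered index and primary key constraint on the new table afterward. SELECT INTO is minimally logged under the appropriate database settings, which is what makes the rebuild cheap when most rows are duplicates.

-- Copy one row per au_lname/au_fname group into a new table
SELECT a.au_id, a.au_fname, a.au_lname, a.phone,
       a.address, a.city, a.state, a.zip
INTO #b
FROM #a AS a
WHERE a.IDCol IN ( SELECT MAX( a2.IDCol )
                   FROM #a AS a2
                   GROUP BY a2.au_lname, a2.au_fname )

DROP TABLE #a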