Learn How to Deduplicate Data
Remove duplicate rows from a table
November 24, 2014
Suppose you have a table that contains some duplicate rows that you need to remove. As an example, run the following code to create a table called T1 and populate it with about one million rows with some duplicates:
SET NOCOUNT ON;USE tempdb;IF OBJECT_ID(N'dbo.T1', N'U') IS NOT NULL DROP TABLE dbo.T1;GOSELECT n AS col1 INTO dbo.T1 FROM dbo.GetNums(1, 1000000) AS NumsUNION ALLSELECT n AS col1 FROM dbo.GetNums(1, 3) AS NumsOPTION (MAXDOP 1);
In our simple example, there's only one column in the table; in reality, you'd probably have a few columns. A subset of the columns would typically make the logical key. Assuming there were no constraint to enforce the key, you'd end up with the duplicates.
The method I like to use to deduplicate data, especially when the number of rows that needs to be deleted is small, is the following:
WITH C AS( SELECT *, ROW_NUMBER() OVER(PARTITION BY col1 ORDER BY (SELECT NULL)) AS rownum FROM dbo.T1)DELETEFROM CWHERE rownum > 1;
You write a query that computes a row number that's partitioned by the logical key (in our case col1) and ordered by (SELECT NULL), meaning that you don't care about order. The query will assign unique row numbers to different rows that have the same col1 value. You could think of the row number as marking a duplicate number. You then define a CTE based on that query, and you issue a DELETE statement against the CTE to delete all rows with a row number that's greater than 1. After the delete, you'll be left with only one row of each distinct col1 value.
When you're done, run the following code for cleanup:
IF OBJECT_ID(N'dbo.T1', N'U') IS NOT NULL DROP TABLE dbo.T1;
About the Author
You May Also Like