Deleting Big Data: Which Option Works Best
Big data is critical to innovation, but what happens when you need to delete some of it? Here's what you need to know about large deletes.
In a previous article, I outlined how to gain hidden insights into Microsoft SQL Server transaction logs through the use of specific Dynamic Management Objects. I did not think that I'd have the opportunity to use the information provided in that article so soon to identify a course of action for one of the technology professionals I mentor.
The situation was this: Her client was logging activity used for billing into a table that had never been trimmed of unneeded, older data. This table was not used for auditing (the rules around long-term storage of audit data are quite strict compared to what we're discussing here). Rather, it was used to stage data for processing by their billing system. The table was approximately 20TB and held six years of data. Use of the system had been growing steadily along with the success of her client's company, so earlier years had fewer records. She was tasked with deleting all but the last 13 months of data in the table, with about 75% of the records targeted for deletion, and the goal was to accomplish this with as little impact on the overall system as possible. The IT pro asked me about different options for ordered deletes, which piqued my curiosity. This seemed like the perfect opportunity to put the Dynamic Management Objects I wrote about to good use, looking at two key factors: impact on the transaction log and spills to tempdb for processing.
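For context, the overall purge itself typically takes the shape of a batched, date-bounded delete, so that no single transaction bloats the log. Here is a minimal sketch of that pattern; the dbo.BillingStaging table and [LogDate] column are hypothetical stand-ins for the client's actual staging table and retention column:

DECLARE @cutoff datetime2 = DATEADD(MONTH, -13, SYSDATETIME());
DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    /* Delete one modest batch at a time to keep each transaction,
       and therefore its log reservation, small */
    DELETE TOP (50000)
    FROM dbo.BillingStaging
    WHERE LogDate < @cutoff;

    SET @rows = @@ROWCOUNT;
END;

What the tests below explore is the ordered-delete piece of that pattern: whether expressing the ordering against the table directly, or through a view, changes the impact.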
Setting up the Test
I was interested in vetting the options for “ordered” deletes, looking both at ordered deletes run directly against tables and at those same delete processes run through views built on top of those tables. To add a twist to this test, I also wanted to compare ordered deletes based on the clustering key against those based on a column that is specifically not the clustering key.
The data for this test comes directly from the Wide World Importers sample database for Microsoft SQL Server. Using the code below, I set up the test environment by migrating a copy of the temps table into the new TBL_CI database. This database represents ordered deletes run directly against a table, using the clustering key as the delete-ordering key. I then backed up and restored that database to create the three additional databases needed for the test, giving four in all:
TBL_CI: Table-Based Delete Ordered by Clustering Key
TBL_XCI: Table-Based Delete Ordered by Column Other Than Clustering Key
VW_CI: View-Based Delete Ordered by Clustering Key
VW_XCI: View-Based Delete Ordered by Column Other Than Clustering Key
I then added a clustered index on the [Temperature] column of the temps table in both TBL_CI and VW_CI, and a clustered index on the [CollectionID] column of that same table in both TBL_XCI and VW_XCI.
Finishing up the prep work, I created a view on VW_CI.dbo.temps ordered on the same column our delete will be ordered on: [Temperature]. VW_XCI.dbo.temps gets a similar view, but ordered on the [CollectionID] column instead. Finally, I backed up all four databases in case I wanted to repeat the test:
/* BUILD ENVIRONMENT */

/* 01_CREATE INITIAL DATABASE */
CREATE DATABASE [TBL_CI]
ON
( NAME = N'WorldPeace',
  FILENAME = N'C:\SQLData\WorldPeace.mdf',
  SIZE = 512MB, MAXSIZE = 1024MB, FILEGROWTH = 256MB )
LOG ON
( NAME = N'WorldPeace_log',
  FILENAME = N'C:\SQLData\WorldPeace_log.ldf',
  SIZE = 256MB, MAXSIZE = 1024MB, FILEGROWTH = 256MB )
GO

/* 02_SEED INITIAL DB WITH VALUES FROM SAMPLE DB */
SELECT *
INTO TBL_CI.dbo.temps
FROM WideWorldImporters.dbo.temps
GO

/* 03_BACKUP THE INITIAL DB AND RESTORE 3 COPIES:
   :: TBL_CI:  Table-Based Delete Ordered by Clustering Key
   :: TBL_XCI: Table-Based Delete Ordered by Column Other Than Clustering Key
   :: VW_CI:   View-Based Delete Ordered by Clustering Key
   :: VW_XCI:  View-Based Delete Ordered by Column Other Than Clustering Key */
BACKUP DATABASE TBL_CI
TO DISK = N'C:\Backup\WorldPeace_original.bak'
WITH COMPRESSION;
GO

RESTORE DATABASE [TBL_XCI]
FROM DISK = N'C:\Backup\WorldPeace_original.bak'
WITH FILE = 1,
  MOVE N'WorldPeace' TO N'C:\SQLData\TBL_XCI.mdf',
  MOVE N'WorldPeace_log' TO N'C:\SQLData\TBL_XCI_log.ldf',
  STATS = 100;
GO

RESTORE DATABASE [VW_CI]
FROM DISK = N'C:\Backup\WorldPeace_original.bak'
WITH FILE = 1,
  MOVE N'WorldPeace' TO N'C:\SQLData\VW_CI.mdf',
  MOVE N'WorldPeace_log' TO N'C:\SQLData\VW_CI_log.ldf',
  STATS = 100;
GO

RESTORE DATABASE [VW_XCI]
FROM DISK = N'C:\Backup\WorldPeace_original.bak'
WITH FILE = 1,
  MOVE N'WorldPeace' TO N'C:\SQLData\VW_XCI.mdf',
  MOVE N'WorldPeace_log' TO N'C:\SQLData\VW_XCI_log.ldf',
  STATS = 100;
GO

/* 04_TBL_CI AND VW_CI --> CLUSTERED INDEX ON SAME COLUMN AS TOP N ROW DELETE.
   NOTE: THE TEMPS TABLES WERE CREATED VIA SELECT INTO AND START AS HEAPS,
   SO DROP_EXISTING IS OFF HERE */
CREATE CLUSTERED INDEX [ixTemps] ON TBL_CI.[dbo].[Temps]
( [Temperature] ASC )
WITH (DROP_EXISTING = OFF, FILLFACTOR = 99)
ON [PRIMARY]
GO

CREATE CLUSTERED INDEX [ixTemps] ON VW_CI.[dbo].[Temps]
( [Temperature] ASC )
WITH (DROP_EXISTING = OFF, FILLFACTOR = 99)
ON [PRIMARY]
GO

/* 05_TBL_XCI AND VW_XCI --> CLUSTERED INDEX ON DIFFERENT COLUMN FROM TOP N ROW DELETE */
CREATE CLUSTERED INDEX [ixCollectionID] ON TBL_XCI.[dbo].[Temps]
( [CollectionID] ASC )
WITH (DROP_EXISTING = OFF, FILLFACTOR = 99)
ON [PRIMARY]
GO

CREATE CLUSTERED INDEX [ixCollectionID] ON VW_XCI.[dbo].[Temps]
( [CollectionID] ASC )
WITH (DROP_EXISTING = OFF, FILLFACTOR = 99)
ON [PRIMARY]
GO

/* 06_VW_CI --> CREATE VIEW ORDERED BY SAME COLUMN AS TOP N ROW DELETE */
USE VW_CI;
GO

CREATE VIEW dbo.view_temps
AS
(
SELECT TOP 50000 *
FROM dbo.temps
ORDER BY temperature
)
GO

/* 07_VW_XCI --> CREATE VIEW ORDERED BY DIFFERENT COLUMN FROM TOP N ROW DELETE,
   BUT WE NEED TO BRING IN ALL RECORDS FROM THE UNDERLYING TABLE OR ELSE WE
   CAN'T COMPARE PROPERLY, SINCE THE RECORD SETS WILL NOT MATCH */
USE VW_XCI;
GO

CREATE VIEW dbo.view_temps
AS
(
SELECT TOP 100 PERCENT /* TOP 50000 */ *
FROM dbo.temps
ORDER BY CollectionID
)
GO

/* 08_BACKUP EVERYTHING (CAN RESTORE TO RE-RUN TESTS) */
BACKUP DATABASE [TBL_CI] TO DISK = N'C:\Backup\TBL_CI.bak'
WITH NAME = N'TBL_CI-Full Database Backup', COMPRESSION, STATS = 100
GO

BACKUP DATABASE [TBL_XCI] TO DISK = N'C:\Backup\TBL_XCI.bak'
WITH NAME = N'TBL_XCI-Full Database Backup', COMPRESSION, STATS = 100
GO

BACKUP DATABASE [VW_CI] TO DISK = N'C:\Backup\VW_CI.bak'
WITH NAME = N'VW_CI-Full Database Backup', COMPRESSION, STATS = 100
GO

BACKUP DATABASE [VW_XCI] TO DISK = N'C:\Backup\VW_XCI.bak'
WITH NAME = N'VW_XCI-Full Database Backup', COMPRESSION, STATS = 100
GO
Ordered Deletes and Impact on Logging and tempdb
Each of the following four scripts was run independently against its own database, as shown. I did not commit the transactions until after I had run the DMO queries to identify the impact of each:
Ordered Deletes Directly Against a Table;
Clustering and Delete Ordering Keys are Identical:
/* DELETE USING TABLE CLUSTER KEY = TOP N */
USE TBL_CI;
GO

BEGIN TRANSACTION TCI_Ordered
  DELETE FROM dbo.Temps
  WHERE Temperature IN
    ( SELECT TOP 50000 Temperature
      FROM dbo.Temps
      ORDER BY Temperature );

--ROLLBACK TRANSACTION TCI_Ordered
--COMMIT TRANSACTION TCI_Ordered
Ordered Deletes Indirectly Against a Table - Using a View;
Clustering and Delete Ordering Keys are Identical:
/* DELETE USING VIEW ORDERED BY COLUMN = TOP N */
USE VW_CI;
GO

BEGIN TRANSACTION VW_CI_Ordered
  DELETE FROM view_temps;

--ROLLBACK TRANSACTION VW_CI_Ordered
--COMMIT TRANSACTION VW_CI_Ordered
Ordered Deletes Directly Against a Table;
Clustering and Delete Ordering Keys are Different:
/* DELETE USING TABLE CLUSTER KEY != TOP N */
USE TBL_XCI;
GO

BEGIN TRANSACTION TXCI_Ordered
  DELETE FROM dbo.Temps
  WHERE Temperature IN
    ( SELECT TOP 50000 Temperature
      FROM dbo.Temps
      ORDER BY Temperature );

--ROLLBACK TRANSACTION TXCI_Ordered
--COMMIT TRANSACTION TXCI_Ordered
Ordered Deletes Indirectly Against a Table - Using a View;
Clustering and Delete Ordering Keys are Different:
/* DELETE USING VIEW ORDERED BY COLUMN != TOP N */
USE VW_XCI;
GO

BEGIN TRANSACTION VW_XCI_Ordered
  DELETE FROM view_temps
  WHERE Temperature IN
    ( SELECT TOP 50000 Temperature
      FROM dbo.Temps
      ORDER BY Temperature );

--ROLLBACK TRANSACTION VW_XCI_Ordered
--COMMIT TRANSACTION VW_XCI_Ordered
Before identifying the least-impactful option, I want to take a minute to talk about the two options that delete against a view instead of the table. The VW_CI view was created on the TOP 50,000 records in the underlying table, selected through the TOP/ORDER BY construct, so the view is defined on exactly the subset of rows we wish to delete from the underlying table. The VW_XCI view, however, had to be created on the entire table: had we created it on the TOP 50,000 rows by CollectionID, the view-based and table-based tests would no longer target identical record sets. This means the DELETE for VW_CI needs no predicate of any kind, since the predicate was baked into the view definition; the other three tests share a similar DELETE command.
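If you want to confirm that the VW_CI view really does cover exactly the rows the table-based tests target, a quick sanity check like the following works. This is my addition rather than part of the original test harness; the EXCEPT should return zero rows (comparing the Temperature values rather than whole rows sidesteps the fact that, with ties, TOP picks an arbitrary 50,000 rows):

USE VW_CI;
GO

/* Temperature values exposed by the view, minus those the
   table-based DELETE would target; expect an empty result */
SELECT Temperature FROM dbo.view_temps
EXCEPT
SELECT Temperature
FROM ( SELECT TOP 50000 Temperature
       FROM dbo.temps
       ORDER BY Temperature ) AS T;
GO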
Surprising Results
Looking at the options side by side yields interesting conclusions. In terms of overall transaction log impact, we have a narrow winner: the view defined to match the DELETE we want to perform.
[Figure: transaction log impact for each of the four delete options]
The code for the analysis comes from the transaction log Dynamic Management Objects:
/* FOR IMPACT ON LOG SPACE */
SELECT TAT.[name] AS trx_name
  , SUM(TDT.database_transaction_log_record_count) AS trx_log_records
  , SUM(TDT.database_transaction_log_bytes_used) AS trx_log_bytes_used
  , SUM(TDT.database_transaction_log_bytes_reserved) AS trx_log_bytes_reserved
  , SUM(TDT.database_transaction_log_bytes_used_system) AS trx_log_bytes_used_system
  , SUM(TDT.database_transaction_log_bytes_reserved_system) AS trx_log_bytes_reserved_system
  , SUM(TDT.database_transaction_log_bytes_used
      + TDT.database_transaction_log_bytes_reserved
      + TDT.database_transaction_log_bytes_used_system
      + TDT.database_transaction_log_bytes_reserved_system) AS total_log_bytes
FROM sys.dm_tran_active_transactions AS TAT
  INNER JOIN sys.dm_tran_database_transactions AS TDT
    ON TAT.transaction_id = TDT.transaction_id
WHERE TAT.name LIKE '%ORDERED'
GROUP BY TAT.[name]
ORDER BY 7 DESC;
When it comes to impact on tempdb, there is only one option that doesn’t spill, and that just happens to be the same ordered view option:
[Figure: tempdb spill results for each of the four delete options]
/* FOR SHOWING IF THERE ARE SPILLS INTO TEMPDB */
SELECT TAT.[name] AS trx_name
  , TAT.transaction_id
  , DB_NAME(TDT.database_id) AS the_db
  , CASE TAT.transaction_type
      WHEN 1 THEN 'R/W'
      WHEN 2 THEN 'R/O'
      WHEN 3 THEN 'SYSTEM'
      WHEN 4 THEN 'DISTRIB'
    END AS TAT_type
  , CASE TAT.transaction_state
      WHEN 0 THEN 'STILL INITIALIZING'
      WHEN 1 THEN 'INITIALIZED, NOT STARTED'
      WHEN 2 THEN 'ACTIVE'
      WHEN 3 THEN 'ENDED'
      WHEN 4 THEN 'DIST COMMIT IN PROGRESS'
      WHEN 5 THEN 'PREPARED, AWAITING RESOLUTION'
      WHEN 6 THEN 'COMMITTED'
      WHEN 7 THEN 'BEING ROLLED BACK'
      WHEN 8 THEN 'ROLLED BACK'
    END AS active_state
  , CASE TDT.database_transaction_state
      WHEN 1 THEN 'TRX NOT INITIALIZED'
      WHEN 3 THEN 'INITIALIZED, BUT NO LOG RECORDS YET'
      WHEN 4 THEN 'TRX HAS GENERATED LOG RECORDS'
      WHEN 5 THEN 'TRX PREPARED'
      WHEN 10 THEN 'TRX COMMITTED'
      WHEN 11 THEN 'TRX ROLLED BACK'
      WHEN 12 THEN 'STILL BEING COMMITTED - LOG RECORD BEING GENERATED, BUT NOT MATERIALIZED YET'
    END AS database_state
  , SUM(TDT.database_transaction_log_bytes_used
      + TDT.database_transaction_log_bytes_reserved
      + TDT.database_transaction_log_bytes_used_system
      + TDT.database_transaction_log_bytes_reserved_system) AS total_log_bytes
FROM sys.dm_tran_active_transactions AS TAT
  INNER JOIN sys.dm_tran_database_transactions AS TDT
    ON TAT.transaction_id = TDT.transaction_id
WHERE TAT.name LIKE '%ORDERED'
GROUP BY TAT.transaction_id
  , DB_NAME(TDT.database_id)
  , TAT.[name]
  , TAT.transaction_type
  , TAT.transaction_state
  , TDT.database_transaction_state
ORDER BY 1, 2, 3;
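As a complementary check, tempdb's own space-usage DMVs can corroborate a spill while the delete transactions are still open. Sort and hash spills show up as internal object page allocations in tempdb; the following query is my addition rather than part of the original analysis, and simply surfaces any sessions allocating such pages:

/* SESSIONS WITH INTERNAL OBJECT (E.G., SPILL) ALLOCATIONS IN TEMPDB */
SELECT TSU.session_id
  , SUM(TSU.internal_objects_alloc_page_count) AS internal_pages_allocated
  , SUM(TSU.user_objects_alloc_page_count) AS user_pages_allocated
FROM sys.dm_db_task_space_usage AS TSU
GROUP BY TSU.session_id
HAVING SUM(TSU.internal_objects_alloc_page_count) > 0
ORDER BY internal_pages_allocated DESC;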
As a bonus, when looking at overall resource utilization and duration, the defined, ordered view also shows as the clear winner:
[Figure: session resource utilization (CPU, memory, reads, writes) and duration for the four test databases]
SELECT DB_NAME(S.database_id) AS [db_name]
  , S.cpu_time
  , S.memory_usage
  , S.total_scheduled_time
  , S.total_elapsed_time
  , S.reads
  , S.writes
  , S.logical_reads
FROM sys.dm_exec_sessions AS S
  INNER JOIN sys.dm_exec_connections AS C
    ON S.session_id = C.most_recent_session_id
/* database_ids 24 through 27 were the four test databases on my
   instance; adjust the range to match yours */
WHERE S.database_id >= 24
  AND S.database_id <= 27
ORDER BY S.total_scheduled_time;
Conclusion
When it comes to deleting large record sets, it appears that using a view defined on the data you wish to delete tends to perform better than deleting those records directly from the table. Enforcing the definition of the delete within the schema has its advantages. I suggest you consider this approach when you're faced with large deletes.
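To make the takeaway concrete, here is a generalized sketch of the pattern, using a hypothetical dbo.BigTable and purge key. It simply combines the ordered-view definition from the tests above with a batching loop, and is a starting point rather than a drop-in solution:

/* Define the delete as a schema object: the view exposes only the
   next batch of rows, in purge-key order */
CREATE VIEW dbo.view_purge
AS
(
SELECT TOP 50000 *
FROM dbo.BigTable
ORDER BY PurgeKey
)
GO

/* Loop until nothing is left to delete; add a retention predicate to
   the view definition if, as in the billing scenario, recent rows
   must be kept */
DECLARE @rows int = 1;
WHILE @rows > 0
BEGIN
    DELETE FROM dbo.view_purge;
    SET @rows = @@ROWCOUNT;
END;
GO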