Consolidating Data from Various Sources

Readers help Gordon write a query that will eliminate duplicates and other transformations before storing the data in the data warehouse.

Umachandar Jayachandran

October 19, 2005

4 Min Read
ITPro Today logo in a gray background | ITPro Today

Congratulations to Narcissa Ramich, a Senior Database Developer in London, England, who won first prize of $100 for the best solution to the October Reader Challenge, "Consolidating Data from Various Sources." Here’s a recap of the problem and the solution to the October Reader Challenge.

Problem:


Gordon is the systems analyst in a company that sells software products. He is currently using SQL Server 2000 Service Pack 3 (SP3) to create a data-warehouse solution that consolidates information from various data sources in the company. Some of the data is customer information that comes from an IBM DB2 database. Below is the schema for the staging table containing the customer information with some sample data:

CREATE TABLE Customers_Work (  CustID nchar(8) COLLATE SQL_Latin1_General_CP1_CS_AS NOT NULL,  CustName varchar(100) NOT NULL)INSERT INTO Customers_Work (CustID, CustName) VALUES  ( N'cust0001', 'Customer #1')INSERT INTO Customers_Work (CustID, CustName) VALUES  ( N'Cust0001', 'Customer #2')INSERT INTO Customers_Work (CustID, CustName) VALUES  ( N'cust0002', 'Customer #3')INSERT INTO Customers_Work (CustID, CustName) VALUES  ( N'cust0003', 'Customer #4')INSERT INTO Customers_Work (CustID, CustName) VALUES  ( N'CUST0003', 'Customer #4')

Gordon is working on eliminating duplicates and other transformations before storing the data in the data warehouse. Help him do the following:

  1. Modify the schema of the table to support searching the data in the CustID column in a case-insensitive manner—without modifying the original CustID column or the data it contains.

  2. Allow efficient searches or a determination of duplicates based on the CustID column.

  3. Because this schema is used by other processes, be sure schema modifications are in addition to existing columns.

Solution:


Gordon can use a computed column that contains an expression that modifies the collation of the CustID column to case-insensitive. He can add the computed column by using the ALTER TABLE statement below:

ALTER TABLE Customers_Work ADD CustID_CI AS CustID COLLATE   SQL_Latin1_General_CP1_CI_AI

To allow efficient searches of this column, Gordon can create an index on it. Creating indexes on computed columns requires the expression to be deterministic in nature. SQL Server 2000 SP2 and later allows computed columns that contain indexing of COLLATE clauses on Unicode data. Gordon can use the CREATE INDEX statement below to create an index on the computed column:

CREATE INDEX Customers_Work_CustID_CI ON Customers_Work(CustID_CI)

Gordon can now efficiently perform queries like the following:

SELECT * FROM Customers_WorkWHERE CustID_CI = N'cust0001'SELECT CustID_CI  FROM Customers_Work  GROUP BY CustID_CIHAVING COUNT(*) > 1SELECT * FROM Customers_WorkWHERE CustID_CI IN (SELECT CustID_CI  FROM Customers_Work  GROUP BY CustID_CIHAVING COUNT(*) > 1)

NOVEMBER READER CHALLENGE:


Now, test your SQL Server savvy in the November Reader Challenge, " Grouping Sequential Changes " (below). Submit your solution in an email message to [email protected] by October 27. Umachandar Jayachandran, a SQL Server Magazine technical editor, will evaluate the responses. We’ll announce the winner in an upcoming SQL Server Magazine UPDATE. The first-place winner will receive $100, and the second-place winner will receive $50.

Problem:


Steve is a systems analyst in an IT department for a midsized company. As part of a batch process, data is extracted from his production database, which logs activity into a table at periodic intervals. Each activity entry contains a flag that indicates a particular state. The batch process performs two operations in an infinite loop. As part of the logging, the process includes the time when the operation was started and a flag indicating the state. A sample schema with data is shown below.

CREATE TABLE Activity ( CreateTime smalldatetime not null, Flag bit not null );INSERT INTO Activity values( '2005-10-01 06:00', 0);INSERT INTO Activity values( '2005-10-01 06:01', 0);INSERT INTO Activity values( '2005-10-01 06:02', 0);INSERT INTO Activity values( '2005-10-01 06:03', 1);INSERT INTO Activity values( '2005-10-01 06:04', 1);INSERT INTO Activity values( '2005-10-01 06:05', 0);INSERT INTO Activity values( '2005-10-01 06:06', 0);INSERT INTO Activity values( '2005-10-01 06:07', 1);INSERT INTO Activity values( '2005-10-01 06:08', 0);INSERT INTO Activity values( '2005-10-01 06:09', 1);SELECT * FROM Activity;

In our example, let us assume that the Flag column is a bit data type and the values toggle from 0 to 1 and vice versa. The CreateTime column is a smalldatetime value that’s always increasing. Steve is trying to do the following:

  1. Group changes in the Flag column in the sequence in which they occur based on the CreateTime value

  2. Determine the start and end values of CreateTime for each group

The expected output of the query should be:

StartTime

EndTime

Flag

-------------------

-------------------

------

2005-10-01 06:00:00

2005-10-01 06:02:00

0

2005-10-01 06:03:00

2005-10-01 06:04:00

1

2005-10-01 06:05:00

2005-10-01 06:06:00

0

2005-10-01 06:07:00

2005-10-01 06:07:00

1

2005-10-01 06:08:00

2005-10-01 06:08:00

0

2005-10-01 06:09:00

2005-10-01 06:09:00

1

Help Steve write a query that will get the expected results showing the changes in the Flag column value in the order in which they occurred and the start and end times when the changes occurred.

Sign up for the ITPro Today newsletter
Stay on top of the IT universe with commentary, news analysis, how-to's, and tips delivered to your inbox daily.

You May Also Like