Developing Familiarity with BIDeveloping Familiarity with BI
Relational-database developers can use these tools to tap into SQL Server 2005’s BI capabilities
August 18, 2005
I'll admit it. I was one of those relational-database guys who didn't really care about Analysis Services. I thought an O. LAP cube was where Oliver Lap, from the accounting department, sat from 8:00 to 5:00. Dimensions were something that only physicists and devotees of the Twilight Zone need be concerned about.
That changed for me several years back when I learned how Analysis Services could benefit both me and my clients. If you've always been a traditional relational-database guy, your outlook might change when you try the SQL Server 2005 June Community Technical Preview (CTP), which you can download at http://www.microsoft.com/sql /2005/productinfo/ctp.mspx. Now, this article isn't designed to turn all of you relational folks into cubists. Instead, I want to make you aware of a few incredibly handy business intelligence (BI) features in SQL Server Integration Services (SSIS) that you can use as part of your relational applications. Two new transformation tasks will simplify your life when you're working with messy data, and a data-mining task will let you build a data model that adapts to business changes. These features don't require a large data warehouse or OLAP cube infrastructure. What they do require is SQL Server 2005 and a willingness to try something new.
Warm and Fuzzy
First, let's look at a pair of SSIS transformation tasks. In case you've been out of the loop, SSIS is the successor to Data Transformation Services (DTS). SSIS performs the same function as DTS, namely moving data from place to place. However, SSIS has more capabilities, is easier to use, requires less coding, and is extremely fast.
The transformation tasks we're going to examine first are Fuzzy Lookup and Fuzzy Grouping. As the names imply, these tasks let you perform lookup and grouping operations without requiring an exact data match, which can be helpful when importing data from legacy systems or from free-form data-entry fields.
Fuzzy Lookup. The Fuzzy Lookup task takes columns and looks up the values in an SSIS data flow reference table. The task doesn't look for an exact match. Instead, the Fuzzy Lookup task is like your favorite teacher; it grades on a curve and gives partial credit. The Fuzzy Lookup task searches for a close match between the column values and the items in the table.
Let's consider a situation in which you need to load a text file, which Figure 1 shows, from a legacy system address table. The city and state parts of the address in the target system are normalized with the city and state combinations stored in a separate lookup table. Unfortunately, the cities in this legacy system weren't validated, so matching the city and state to a record in the lookup table is going to be a challenge.
Enter SSIS and the Fuzzy Lookup transformation task. Figure 2 shows the data flow of a basic SSIS package. You create the package in Figure 2 by using drag-and-drop programming. Simply drag items from the toolbox, put them in the data flow, and set their properties. No coding is required. This package takes the data from the text file, performs a Fuzzy Lookup to find the foreign key for the city and state combination, then saves the information in the SQL Server Address table. As Figure 3 shows, the Fuzzy Lookup task did an excellent job of matching our fat-fingered text-file data with the appropriate cities in the lookup table.
The Fuzzy Lookup task creates two metrics as it works. The first is the similarity score. This number tells us how close a match was found. The task calculates the similarity score based on the edit distance between a value and its potential match. In other words, how many times do you need to insert, delete, or replace a letter in the value to end up with the word in the lookup list? The second metric is the confidence value, which indicates the amount of confidence the Fuzzy Lookup algorithm has in the match that it found. You can use this number to determine whether to accept the value the algorithm found or manually check the value.
One quirk to be aware of: The Fuzzy Lookup task assumes words in all capital letters are acronyms. Since each letter in an acronym is more significant than individual letters in a typical word, FCC and FTC are two different things. Even though these acronyms have an edit distance of one, the Fuzzy Lookup task doesn't try to match them. If you have legacy data entered in all upper case, transform it to lower case before performing the Fuzzy Lookup.
Fuzzy Grouping. The second data transformation task we're going to look at is also a real peach. (Get it? Fuzzy, peach?) The Fuzzy Grouping task works much like the Fuzzy Lookup task except it finds matches within one set of data instead of using a lookup table. Say, for example, you're importing data from a customer service call log like the one that Figure 4 shows. Customers might not give their names exactly the same way each time they call, and customer service representatives might enter names in different ways. In Figure 4, Cathy Jones, Kathryn Jones, and Kathy Jones are probably different entries for the same person, but how do you get the computer to realize that fact? The answer is, of course, the Fuzzy Grouping transformation task.
Figure 5 shows an SSIS package using a Fuzzy Grouping transformation task to find potential duplicates during the import process. Figure 6 shows the results of this import, which reveal potential duplicates. The Fuzzy Grouping task doesn't automatically remove the duplicates; you need to handle this as a separate function.
The Key_In column value that the Fuzzy Grouping task generates is a unique identifier for each record in the data flow. During the fuzzy matching process, the Key_ Out column value, along with a clean value for each column SSIS is processing, shows the groupings of potential duplicate records that the task detected. In this example, the Fuzzy Grouping task grouped the rows that had Key_In values of 2, 3, and 4 and identified row 4 as the Model Row (the row they should be combined into). Like the Fuzzy Lookup task, the Similarity/Score column shows just how close a match is based on edit distance. A similarity score of 1 indicates an exact match. A similarity score less than 1 indicates a fuzzy match.
The SSIS package in this example does a fuzzy match on the first and last name and an exact match on gender. We're assuming people know their own gender and the customer service representative can key in M or F. If you add more criteria to the Fuzzy Grouping, you increase your certainty that two rows are indeed duplicates. For example, you might add either fuzzy or exact matching on address, city, state, ZIP code, or phone number for increased faith in your de-duping process.
Digging for Gold
The Data Mining Prediction Query transformation task is another SSIS task that can help you predict values of unknown data during a data load. Some of you are saying, "Data mining: Doesn't that use complex mathematical algorithms and some weird query language?" The answer is, "Yes." However, before your eyes glaze over and you turn to the next article, be aware that you don't need to know complex mathematics to use data mining. You should also know that the Data Mining Extensions (DMX) query language is, just as it says, an extension to T-SQL. Therefore, most parts of a DMX query will look very familiar.
Data mining lets you find patterns and make predictions within a set of data. For example, a data-mining algorithm can examine your customers' buying habits and determine what goods or services a particular customer is likely to buy next or predict which of your customers might be likely to shop around for a different supplier. For our discussion here, let's use data mining to predict which products are most likely to be good sellers for a retail outlet.
Consider an online bookstore that gets a list of available books each week from a distributor. The online store pays a fee for each book that it lists on the site, so it wants to list only books that are likely to be good sellers. A sales analyst reviews sales information monthly and determines which books to list on the site. The bookstore wants to devise an automated process to identify books that are similar to the current hot sellers on national lists. Data mining and the Data Mining Prediction Query transformation task can do just that.
Note that the Data Mining Prediction Query transformation task requires a data-mining model. Because data-mining models live inside an Analysis Services database, you need to have an Analysis Services server available to you. Analysis Services is included in your SQL Server 2005 license and isn't difficult to add to an existing SQL Server 2005 installation.
First, you need to create the data-mining model that will make the predictions for the bookstore. SQL Server 2005 provides several different data-mining algorithms. This example uses the Decision Trees data-mining algorithm, which works well for the type of predictions the bookstore needs. The attributes you'll use to make the prediction will be the author, publisher, format (paperback, hardcover, or audio book), and genre (e.g., adventure, mystery, science fiction.) A predicted sales status of 1 indicates a top seller.
Once the data-mining model is defined, it needs to learn the pattern of current book sales so that it can make predictions for new best-selling books. The data-mining model uses the bookstore's sales information for the previous month as its training dataset to determine what criteria (author, publisher, format, and genre) identifies a top seller and to learn how to predict future sales.
The Mining Model Viewer in Figure 7 shows the patterns that the trained data-mining model discovered. The shading indicates the concentration of top sellers in a particular node of the tree; the darker the shading, the higher the concentration. You can see a fair concentration of top sellers published by Random House. An even larger concentration of top sellers is in mysteries from Random House. For this particular time period, you'd be wise to list additional Random House mysteries on the bookstore site. Of course, what's hot this month isn't what will be hot next month, so you'll want to retrain the data model frequently to reflect current buying trends.
You can use the trained data-mining model as part of the SSIS package that's importing the distributor book list. As Figure 8 shows, the control flow part of the process picks up the text file from the distributor's FTP site, retrains the data-mining model by using the current sales data, and finally executes the data flow that performs the actual import.
The data-flow process that Figure 9 shows, reads in the list of available books from the text file, then uses the Data Mining Prediction Query transformation task to predict the sales status of each new book. The task adds books that fit the criteria for top sellers (i.e., have a sales status of 1) to the Books table and makes them available to the online store.
Figure 10 shows the DMX query that the Data Mining Prediction Query transformation task uses. The query looks much like a SELECT statement that uses a two-table join. The difference is the use of a PREDICTION JOIN rather than the INNER or LEFT OUTER JOIN you see in typical T-SQL statements. In this case, the PREDICTION JOIN takes the criteria from the SSIS's package data flow @InputRowset and feeds it into the data-mining model Books-DecisionTrees. The JOIN condition specifies how the data table's fields match with the data-mining model inputs. The FLATTENED keyword instructs the DMX query to produce a rows and columns dataset instead of the default hierarchical dataset. The Predict() function in the field list returns the predicted value for sales status.
Figure 11 shows the result of the prediction query: this week's top sellers added to the online store. Not surprisingly, this week's additions are all Random House mysteries. Next week, however, the top sellers might fit a different set of criteria and be a completely different selection. That's the beauty of data-mining algorithms and automating the processes in SSIS: You adapt to changing situations by continually retraining your model.
Great Courage
For those of us who came up through the ranks on the relational side of SQL Server, SQL Server's BI tools might seem complex, intimidating, or simply irrelevant. However, SQL Server 2005 offers features that should excite both BI and relational developers. So, don't be afraid to explore the adaptive capabilities of the Fuzzy Lookup and Fuzzy Grouping functions as well as the Data Mining Prediction Query transformation task. These new tools let developers quickly automate complex analysis of existing data and mine new life from it. You can find a good example of how to use these tools in James MacLennan's Webcast "Technology Overview: Business Technology Data Mining" on the SQL Server 2005 Beta Resource Kit DVD. Give BI a try; download the latest CTP and use it! Your courage will be greatly rewarded.
About the Author
You May Also Like