Data-Mining Terminology

Learn some common data-mining terminology.

Kamal Hathi, JIM Yang, and 2 more

October 23, 2001


To use data mining to address the scenarios we present in the main article, and others like them, you need to understand some common terminology that we use throughout the article. Microsoft introduced this terminology in the OLE DB for Data Mining specification.

Data-mining model. A data-mining model is similar to a relational table: It contains key columns, input columns, and predictable columns. Each model is associated with a data-mining algorithm on which the model is trained. Training a mining model means finding patterns in the training data set by using the specified data-mining algorithm. During the training stage, the data-mining model stores patterns that the data-mining algorithm discovered about the data set. You can think of a data-mining model as a "truth table" containing rows for every possible combination of the distinct values for each of the model's columns. Once you've trained a model, you can use it for prediction.
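To make the training and prediction stages concrete, here is a minimal Python sketch. It is not an algorithm from the OLE DB for Data Mining specification: the frequency-count approach, the column names (gender, age_group, loan_risk), the sample cases, and the function names train and predict are all assumptions chosen for illustration. The "model" simply stores, for each combination of input states, how often each predictable state occurred.

```python
from collections import Counter, defaultdict

# Hypothetical training cases: (key, gender, age_group, loan_risk).
# The first field is the key column, the next two are input columns,
# and the last is the predictable column.
training_cases = [
    (1, "Male", "Young", "High"),
    (2, "Female", "Young", "Low"),
    (3, "Male", "Young", "High"),
    (4, "Male", "Senior", "Low"),
    (5, "Female", "Senior", "Low"),
    (6, "Male", "Young", "Low"),
]


def train(cases):
    """Count how often each predictable state occurs per input combination."""
    patterns = defaultdict(Counter)
    for _key, gender, age_group, loan_risk in cases:
        patterns[(gender, age_group)][loan_risk] += 1
    return patterns


def predict(patterns, gender, age_group):
    """Return the most frequent predictable state, or None for unseen inputs."""
    counts = patterns.get((gender, age_group))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


model = train(training_cases)
young_male_risk = predict(model, "Male", "Young")
senior_female_risk = predict(model, "Female", "Senior")
unseen_risk = predict(model, "Female", "Middle")
```

A real data-mining algorithm generalizes beyond the combinations it has seen; this toy returns None for them, which is one reason it should be read only as an illustration of "store patterns, then look them up."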

Columns. A column in a data-mining model is similar to a column in a relational table; it's also called a "variable" or "attribute" in statistical terminology. A data-mining model can have three types of columns: input columns, predictable columns, and columns that are both input and predictable. A data-mining model uses the set of input attributes to predict the output attributes. The predictable column is the target of the mining model.
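One way to picture the three column types is as two independent flags, input and predictable, that a column can carry separately or together. The Python sketch below assumes hypothetical column names and a hypothetical Role type; it isn't how the specification declares columns.

```python
from enum import Flag, auto


class Role(Flag):
    """Hypothetical roles a mining-model column can play."""
    INPUT = auto()
    PREDICTABLE = auto()


# Hypothetical model definition: column name -> role(s).
columns = {
    "Gender": Role.INPUT,
    "Age": Role.INPUT | Role.PREDICTABLE,  # both input and predictable
    "LoanRisk": Role.PREDICTABLE,
}

# Columns the model reads from, and columns it can predict.
input_columns = [name for name, role in columns.items() if Role.INPUT in role]
predictable_columns = [
    name for name, role in columns.items() if Role.PREDICTABLE in role
]
```

Here Age appears in both lists, which matches the "both input and predictable" type described above.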

States. Associated with each attribute is a set of possible values. These values are the states of the attribute. For example, the column Gender has two states: Male and Female.
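As a small illustration, the states of an attribute can be collected by gathering the distinct values that the column takes across a set of cases. The sample cases and the helper name column_states are assumptions made for this sketch.

```python
def column_states(cases, column):
    """Return the sorted distinct values (states) of one column."""
    return sorted({case[column] for case in cases})


# Hypothetical cases, one dictionary per customer.
cases = [
    {"CustomerID": 1, "Gender": "Male", "LoanRisk": "High"},
    {"CustomerID": 2, "Gender": "Female", "LoanRisk": "Low"},
    {"CustomerID": 3, "Gender": "Male", "LoanRisk": "Low"},
]

gender_states = column_states(cases, "Gender")
risk_states = column_states(cases, "LoanRisk")
```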

Cases. Data mining is about analyzing cases; a case is the basic entity of information. A case can be simple. For example, when you're using customer demographic information to analyze customer loan risk, each case is a customer. Or a case can be more complicated. For example, when you're analyzing customers' purchasing behavior based on the customers' demographic data as well as purchase history, each case is a customer together with the list of products the customer has purchased. This type of case is a nested case. Statistically speaking, the cases that make up a data set are assumed to be random and drawn from a fixed underlying distribution.

Case tables and nested tables. A case table is the table containing the case information that's related to the non-nested part of the data. A nested table is the table that contains information related to the nested part of the data. A nested table is similar to a transaction table in database terminology. Table A shows two input tables to the mining model. The table that contains information about customer demographics is a case table. The other table, which contains information about customer purchases, is a nested table.
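The relationship between a case table and a nested table can be sketched in Python by attaching each customer's purchase rows to that customer's demographic row. The table contents, the column names, and the function build_nested_cases are hypothetical and aren't taken from Table A; the OLE DB for Data Mining specification has its own mechanism for shaping nested data, which this sketch doesn't reproduce.

```python
from collections import defaultdict

# Hypothetical case table: one row per customer (the non-nested part).
customers = [
    {"CustomerID": 1, "Gender": "Male", "Age": 35},
    {"CustomerID": 2, "Gender": "Female", "Age": 42},
    {"CustomerID": 3, "Gender": "Male", "Age": 28},
]

# Hypothetical nested table: one row per purchase, like a transaction table.
purchases = [
    {"CustomerID": 1, "Product": "TV"},
    {"CustomerID": 1, "Product": "VCR"},
    {"CustomerID": 2, "Product": "Ham"},
]


def build_nested_cases(case_table, nested_table, key):
    """Attach each case's nested rows to it, producing one nested case per row."""
    rows_by_key = defaultdict(list)
    for row in nested_table:
        # Drop the key from the nested row; the parent case already carries it.
        rows_by_key[row[key]].append({k: v for k, v in row.items() if k != key})
    return [
        dict(case, Purchases=rows_by_key.get(case[key], []))
        for case in case_table
    ]


nested_cases = build_nested_cases(customers, purchases, "CustomerID")
```

Each element of nested_cases is one case: a customer together with the list of products that customer purchased. A customer with no purchases, such as customer 3 here, still forms a case with an empty list.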
