Data Mining - Case Study

Please check this page regularly for updates, corrections, and answers to frequently-asked questions!

Introduction

Data mining is the process of discovering previously unknown, actionable and profitable information from large consolidated databases and using it to support tactical and strategic business decisions.

The statistical techniques of data mining are familiar. They include linear and logistic regression, multivariate analysis, principal components analysis, decision trees and neural networks. Traditional approaches to statistical inference fail with large databases, however, because with thousands or millions of cases and hundreds or thousands of variables there will be a high level of redundancy among the variables, there will be spurious relationships, and even the weakest relationships will be highly significant by any statistical test. The objective is to build a model with significant predictive power. It is not enough just to find which relationships are statistically significant.

Consider a campaign offering a product or service for sale, directed at a given customer base. Typically, about 1% of the customer base will be "responders," customers who will purchase the product or service if it is offered to them. A mailing to 100,000 randomly-chosen customers will therefore generate about 1000 sales. Data mining techniques enable customer relationship marketing by identifying which customers are most likely to respond to the campaign. If targeting raises the response rate from 1% to, say, 1.5% of the customers contacted (a "lift" of 1.5 over random), then 1000 sales could be achieved with only about 66,700 mailings, reducing the cost of mailing by one-third.
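
As a quick check of this arithmetic, here is a minimal sketch in Python (the 1% and 1.5% figures are the illustrative rates above):

    # Mailings required to generate a target number of sales at a given response rate.
    target_sales = 1000
    base_rate = 0.01       # response rate for a random mailing
    lifted_rate = 0.015    # response rate after targeting likely responders

    mailings_random = target_sales / base_rate        # 100,000
    mailings_targeted = target_sales / lifted_rate    # about 66,667
    saving = 1 - mailings_targeted / mailings_random  # about one-third

    print(round(mailings_random), round(mailings_targeted), round(saving, 2))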

Data

This example was provided by Gary Saarenvirta, formerly of The Loyalty Group, now with IBM Canada.

Each case is one account. The account numbers have been removed.

The objective variable is a response variable indicating whether or not a consumer responded to a direct mail campaign for a specific product. "True" or "response" is 1, "False" or "non-response" is 0.

The data were extracted from a much larger set with a response rate of about 1%. All 1079 responders were used, together with 1079 randomly-chosen non-responders, for a total of 2158 cases.

There are 200 explanatory variables in the file: v137, v141 and v200 are indicators for gender "male," "female," or "unknown," respectively, and v1-v24, v138-v140 and v142-v144 are recency/frequency/monetary (RFM) data for the specific accounts; v25-v136 are census variables, and v145-v199 are demographic "taxfiler" variables. Most of the variables have been normalized.

A table with some variable descriptions is attached. Some of the product-specific variables have been blinded. "p##" means product, "rcy" means recency (number of months since the most recent transaction), "trans" means number of transactions, and "spend" means dollars spent. For example, p01rcy means product 1 recency. Note that zero recency means that the account was active for that product in the most recent month. "Never active" would be indicated by the largest possible value for recency, as determined by the first month in which the business collected data.

The census and taxfiler variables are summary statistics for the enumeration area in which the account holder's address is located. They generally give total or average numbers of individuals or families or dollars in the categories indicated. A table of taxfiler variable descriptions is attached. You may be able to guess the census variables from their names, but tables with longer descriptions of the census variables are attached: Group "a" and Group "b" are listed separately. You are welcome to contact us if you aren't sure of any.

You can get the data as an Excel 97/98 Workbook gary.xls (5.9 Mb), as an Excel 97/98 Workbook compressed into a ZIP archive gary_xls.zip (2.4 Mb), or as text files in a ZIP archive gary.zip (1.3 Mb).

There are two text files in gary.zip. The data are in a fixed-width ASCII file Sasvar.txt and the data file description is in imtrans.txt. If you choose to work with the text files you MUST use the column positions in imtrans.txt to import the data into SAS or Splus, because some columns are contiguous and cannot be separated on whitespace. Be careful with the line endings; for example, if you unzip the text files in UNIX there will be a line feed character at the end of each line that will have to be included when computing the record length.
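
The same caution applies if you import the data with another tool. As an illustration only, here is a minimal sketch of the import in Python with pandas; the column positions and variable names are placeholders and must be replaced with the actual specifications given in imtrans.txt:

    # Import the fixed-width file Sasvar.txt.
    # The (start, end) positions and names below are hypothetical placeholders;
    # substitute the real column specifications from imtrans.txt.
    import pandas as pd

    colspecs = [(0, 1), (1, 9), (9, 17)]   # placeholder column positions
    names = ["response", "v1", "v2"]       # placeholder variable names

    data = pd.read_fwf("Sasvar.txt", colspecs=colspecs, names=names)
    print(data.shape)   # should show 2158 rows once all columns are specified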

Suggestions for Analysis

We expect that you will be using Splus or SAS for the analysis; however, not all of the methods suggested here are readily available in Splus. If you have a SAS licence, the Enterprise Miner module will conveniently automate many of the analyses, and you may be able to get an evaluation copy inexpensively from SAS. IBM's Intelligent Miner is also recommended, but it is less likely to be available to you.

For all the analyses below, you should create a training set and a validation set. As the data were stratified to 50/50, you should create an unstratified validation set with the original proportion of about 1% "True" for the objective variable. You would, of course, get better validation sets if you had the complete sample of around 100,000 accounts, 99% of them non-responders, but that file is too large for us to distribute conveniently. Validation sets constructed from the 50/50 stratified sample should be adequate for the purposes of this exercise.
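
Here is a minimal sketch of one such split in Python, assuming the data have been read into a pandas data frame called data, as in the import sketch above, with the 0/1 objective in a column named response; the hold-out sizes are arbitrary illustrative choices, trading off the precision of the validation estimates against the number of cases left for training:

    # Split the 50/50 stratified file into a training set and a validation set
    # whose responder proportion is close to the original 1%.
    import pandas as pd

    responders = data[data["response"] == 1]
    nonresponders = data[data["response"] == 0]

    # Hold out 5 responders and 495 non-responders (a 1% responder rate).
    val_resp = responders.sample(n=5, random_state=2000)
    val_nonresp = nonresponders.sample(n=495, random_state=2000)
    validation = pd.concat([val_resp, val_nonresp])

    # Train on everything that was not held out.
    training = data.drop(validation.index)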

Your results should be plotted on a gains chart, either tabular or graphical. A gains chart is a plot of the % of the responders reached (ordinate) against the % of the customer base contacted (abscissa). If the campaign is directed at randomly-chosen individuals the plot will be a straight line with unit slope through the origin. If the campaign preferentially targets responders, the gains curve will lie above the diagonal except, of course, at 0% and 100% where it necessarily touches.

The performance of a predictive model is measured by looking at the % of responders reached when 10%, 20% or 30% of customers are mailed. A good model will get 1.5 to 3.5 times as many responders as a random mailing over this range; for example, mailing to 10% of the customer base would then reach 15% to 35% of the responders. Less than this means the data are not very predictive; more than this likely means that you have overfitted or that there is a strong bias in the data.
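
A gains chart can be computed directly from the validation scores. Here is a minimal sketch in Python, where y stands for the 0/1 responses of the validation cases and scores for the corresponding model scores (higher meaning more likely to respond); both come from whatever model you have fitted:

    # Gains chart: % of responders reached vs % of the customer base contacted.
    import numpy as np
    import matplotlib.pyplot as plt

    def gains_curve(y, scores):
        order = np.argsort(-scores)                     # contact highest scores first
        reached = np.cumsum(y[order]) / y.sum()         # % of responders reached
        contacted = np.arange(1, len(y) + 1) / len(y)   # % of the base contacted
        return contacted, reached

    contacted, reached = gains_curve(np.asarray(y), np.asarray(scores))
    plt.plot(contacted, reached, label="model")
    plt.plot([0, 1], [0, 1], "--", label="random")
    plt.xlabel("% of customer base contacted")
    plt.ylabel("% of responders reached")
    plt.legend()
    plt.show()

    # Lift at 10%, 20% and 30% of the base contacted.
    for p in (0.1, 0.2, 0.3):
        r = reached[int(p * len(reached)) - 1]
        print(f"{p:.0%} contacted: {r:.0%} of responders reached (lift {r / p:.1f})")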

Some things you could try with these data include the following; rough illustrative sketches of suggestions 1-7 appear after the list:

  1. Try some simple linear correlations, Spearman and Pearson, against the objective variable and reduce the number of variables. With the reduced set of variables, build logistic regression models. Don't forget to remove collinear variables.
  2. Break the variables into blocks of 10-20 and build logistic models on each of the blocks. After all the models are built, pool the variables that were left in the models, create new blocks of 10-20, and repeat until there is only one block of variables left. Don't forget to remove collinear variables.
  3. Create PCA factors from the set of variables (don't include the objective variable!). Select a reduced set of factors (using the cumulative % of variation explained) and build a model from them. Compare this result with using all the factors, noting the effect of overfitting. Don't forget to remove collinear variables.
  4. Perform a variable clustering (PROC VARCLUS in SAS) with all variables. This procedure clusters the variables into hierarchical groups using the PCA factors. Select variables from the bottom level of the hierarchical groups and build a logistic model. Don't forget to remove collinear variables.
  5. Create multiple training and test samples. Use bootstrapping to estimate the error bounds on the model coefficients and gains chart performance. Try sampling with and without replacement to see how sensitive logistic regression is to the data set configuration.
  6. Use SAS to construct a Radial Basis Function regression. Use all the above methods to reduce the variable set and compare RBF results to logistic.
  7. It is possible to implement a decision tree with SAS using the CART algorithm. Run this algorithm against all the variables. Build multiple training sets using sampling with replacement. This should improve the tree performance by a few percent.
  8. Other modeling techniques to try include neural networks and genetic algorithms.
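
The sketches below illustrate suggestions 1-7. They are minimal and illustrative only, written in Python (pandas, scikit-learn, statsmodels and scipy) rather than Splus or SAS, and they assume the training and validation data frames built above, with the 0/1 objective in a column named response and the explanatory variables in the remaining columns. All thresholds, block sizes, factor counts and cluster counts are arbitrary illustrative choices.

Suggestion 1: screen the variables by simple correlation with the objective, drop collinear survivors, and fit a logistic regression on what remains (note that scikit-learn's LogisticRegression applies an L2 penalty by default):

    # Rank variables by |Pearson| and |Spearman| correlation with the objective,
    # keep the strongest, drop collinear pairs, and fit a logistic regression.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    X = training.drop(columns="response")
    y = training["response"]

    pearson = X.corrwith(y).abs()
    spearman = X.corrwith(y, method="spearman").abs()
    strength = pd.concat([pearson, spearman], axis=1).max(axis=1)
    keep = strength.sort_values(ascending=False).head(25).index

    # Greedily drop one of any pair of survivors with |correlation| > 0.8.
    selected = []
    for v in keep:
        if all(abs(X[v].corr(X[u])) < 0.8 for u in selected):
            selected.append(v)

    model = LogisticRegression(max_iter=1000).fit(X[selected], y)
    scores = model.predict_proba(validation[selected])[:, 1]   # feed the gains chart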
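
Suggestion 2 can be sketched as a loop: fit a logistic model to each block of variables (statsmodels is used here so that p-values are available), pool the variables that look useful, and repeat until one block remains. The block size of 20 and the 0.05 cutoff are arbitrary:

    # Iterative block-wise variable selection with logistic models.
    # Collinear variables within a block may have to be removed before fitting.
    import statsmodels.api as sm

    X = training.drop(columns="response")
    y = training["response"]

    variables = list(X.columns)
    while len(variables) > 20:
        survivors = []
        for i in range(0, len(variables), 20):                 # blocks of 20
            block = variables[i:i + 20]
            fit = sm.Logit(y, sm.add_constant(X[block])).fit(disp=0)
            survivors += [v for v in block if fit.pvalues[v] < 0.05]
        if len(survivors) == len(variables):                   # no further reduction
            break
        variables = survivors

    final = sm.Logit(y, sm.add_constant(X[variables])).fit(disp=0)
    print(final.summary())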
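
Suggestion 3, sketched with scikit-learn: compute principal components of the explanatory variables only, keep enough components to explain a chosen share of the variance, and fit a logistic model on the component scores:

    # Principal components of the explanatory variables, then a logistic model
    # on a reduced set of component scores.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    X = training.drop(columns="response").values
    y = training["response"].values
    Xv = validation.drop(columns="response").values

    scaler = StandardScaler().fit(X)
    pca = PCA().fit(scaler.transform(X))

    # Smallest number of components explaining, say, 80% of the variance.
    k = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.80)) + 1

    model = LogisticRegression(max_iter=1000).fit(pca.transform(scaler.transform(X))[:, :k], y)
    scores = model.predict_proba(pca.transform(scaler.transform(Xv))[:, :k])[:, 1]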
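
SAS PROC VARCLUS itself is not reproduced here; as an analogue of suggestion 4, the variables can be clustered hierarchically on a correlation-based distance and one representative kept per cluster:

    # Cluster the variables on 1 - |correlation| and keep one representative
    # variable per cluster (an analogue of, not the same algorithm as, VARCLUS).
    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    X = training.drop(columns="response")
    y = training["response"]

    dist = 1 - X.corr().abs().values
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    cluster = fcluster(tree, t=25, criterion="maxclust")       # e.g. 25 clusters

    # Within each cluster keep the variable most correlated with the objective.
    reps = [X[X.columns[cluster == c]].corrwith(y).abs().idxmax()
            for c in np.unique(cluster)]
    # reps then feeds a logistic regression exactly as in the suggestion 1 sketch.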
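
Suggestion 5 can be sketched as a simple bootstrap over the training cases: refit the model on resampled training sets and look at the spread of the gains-chart performance (here, the fraction of responders reached in the top decile of the validation set). The variable list selected is taken from the suggestion 1 sketch:

    # Bootstrap the training set to see how stable the model and its gains are.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    Xv = validation[selected]
    yv = validation["response"].values
    n_top = len(yv) // 10                                  # top 10% of validation cases

    top_decile_gain = []
    for b in range(200):                                   # 200 bootstrap replicates
        boot = training.sample(frac=1.0, replace=True, random_state=b)
        fit = LogisticRegression(max_iter=1000).fit(boot[selected], boot["response"])
        order = np.argsort(-fit.predict_proba(Xv)[:, 1])
        top_decile_gain.append(yv[order][:n_top].sum() / yv.sum())

    print(np.percentile(top_decile_gain, [2.5, 50, 97.5]))  # rough error bounds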
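
The radial basis function regression of suggestion 6 is specific to SAS; a rough analogue is an RBF network, with basis functions placed at k-means centres and a logistic model fitted on the resulting RBF features. The number of centres and the gamma value are arbitrary:

    # RBF features from k-means centres, then a logistic model on those features.
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics.pairwise import rbf_kernel

    Xtr = training[selected].values            # selected from the suggestion 1 sketch
    ytr = training["response"].values

    centres = KMeans(n_clusters=20, n_init=10, random_state=0).fit(Xtr).cluster_centers_
    Phi = rbf_kernel(Xtr, centres, gamma=0.1)  # one RBF feature per centre
    rbf_model = LogisticRegression(max_iter=1000).fit(Phi, ytr)

    Phi_val = rbf_kernel(validation[selected].values, centres, gamma=0.1)
    rbf_scores = rbf_model.predict_proba(Phi_val)[:, 1]   # compare on the gains chart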
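
Suggestion 7, sketched with scikit-learn: a single CART-style tree on all the variables, and a bagged version built from bootstrap samples of the training set:

    # A CART-style decision tree, alone and bagged over bootstrap samples.
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    X = training.drop(columns="response")
    y = training["response"]
    Xv = validation.drop(columns="response")

    tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
    tree_scores = tree.predict_proba(Xv)[:, 1]

    bagged = BaggingClassifier(DecisionTreeClassifier(max_depth=5),
                               n_estimators=50, random_state=0).fit(X, y)
    bagged_scores = bagged.predict_proba(Xv)[:, 1]   # compare the two on the gains chart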

All model results should be analyzed for gains chart performance with the following measures:

  1. What percentage of all responders is reached, compared to random, by campaigns mailed to 10%, 20% or 30% of all customers? At these points a random mailing would reach 10%, 20% or 30% of the responders, respectively. Most campaigns are mailed to 10% to 30% of the customer base. Good models can achieve 1.5 to 3.5 times the random response rate in this range.
  2. Monotonicity and smoothness: do the response rates by quantile group form a smoothly decreasing profile? Any waviness is indicative of bias, overfitting or unmodelled effects.
  3. Ease of model explanation. It is very important for prospective clients to understand why the model is working!

Resources

You can get copies of some of Gary Saarenvirta's work online at www.db2mag.com. He has also written The Intelligent Miner for Data Applications Guide, found at www.redbooks.ibm.com.

An Internet search on "data mining" will find a number of commercial products similar to Intelligent Miner and Enterprise Miner.


Last modified 2000-11-13 17:10