The pros and cons of control charts versus data mining (November 17, 2007)
In a talk I gave in December 2006, I highlighted how control charts can augment more complex statistical tools like data mining in the analysis of adverse event data. Here's a summary of the pros and cons of each approach.
Advantages of control charts. Control charts were originally proposed by Walter Shewhart in the 1920s. That long history has provided ample experience demonstrating their usefulness and adaptability in a wide range of applications.
The long history of the control chart also makes it a tool that is familiar and comfortable to many people. While most of the applications are in industrial areas, a book published more than a decade ago,
- Measuring Quality Improvement in Healthcare: A Guide to Statistical Process Control Applications. Carey RG, Lloyd RC (1995) New York: Quality Resources.
highlights numerous applications of control charts in health care.
Finally, the control chart is easy to use. Even with some of the recent enhancements and extensions, control charts remain a relatively simple and accessible tool. You don't need a lot of state-of-the-art statistical tools like you do for a data mining project.
This means that you don't need a lot of statistical and computational expertise to use control charts. There are only a small number of people who have the qualifications and the expertise to do a good job with a data mining model. By placing control charts in the hands of a larger number of people, you increase the number of eyes that look at a problem and (in theory) increase the chances that safety problems are found early.
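To give a sense of how little machinery a basic control chart needs, here is a minimal sketch of a c-chart for counts of adverse events. The monthly counts are made up for illustration, the function names are my own, and the limits assume the counts are roughly Poisson (variance equal to the mean), which may not hold for real adverse event data.

```python
import math

def c_chart_limits(counts):
    """Return (center, lower, upper) three-sigma limits for a c-chart."""
    center = sum(counts) / len(counts)
    sigma = math.sqrt(center)  # Poisson assumption: variance equals the mean
    lower = max(0.0, center - 3 * sigma)  # a count cannot fall below zero
    upper = center + 3 * sigma
    return center, lower, upper

def out_of_control(counts, lower, upper):
    """Return the positions of counts that fall outside the control limits."""
    return [i for i, c in enumerate(counts) if c < lower or c > upper]

# Hypothetical baseline period used to set the limits.
baseline = [4, 6, 5, 3, 7, 5, 4, 6, 5, 5]
center, lower, upper = c_chart_limits(baseline)

# Hypothetical new months checked against those limits.
new_months = [6, 4, 14, 5]
flagged = out_of_control(new_months, lower, upper)
print(f"center={center:.2f}, limits=({lower:.2f}, {upper:.2f}), flagged={flagged}")
```

In this made-up example the third new month is flagged. The chart tells you that something changed, but nothing about the cause.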
Disadvantages of control charts. The control chart is an exploratory tool. If the control chart shows a point out of control, the chart won't explain to you WHY it is out of control.
The control chart won't help to identify a subgroup at greater risk if you did not have the foresight to monitor that group. It also won't identify an adverse event that was unexpected. With a control chart, you have to know what you're looking for.
While there are some adaptations of control charts for multivariate data, seasonal data, and other complexities, the basic control chart does not adapt easily to them.
Advantages of data mining models. Data mining models excel in situations where the data streams are large and complex. Some of the data mining methods are adept at handling ambiguous data and missing data. They can also detect subtle non-linearities and interactions that most other statistical methods might miss.
While the data mining methods are not as easy to use as their proponents claim (the old saw "easy to use is easy to say" certainly applies here), the researchers in this field go to great lengths to automate key components of the data mining process. Many tools incorporate techniques like cross-validation that allow you to quickly home in on a model that is neither too complex nor overly simple.
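As a rough illustration of how cross-validation picks a level of complexity, here is a small sketch that uses five-fold cross-validation to choose the number of neighbors in a nearest-neighbor smoother. The data are simulated, the candidate values are arbitrary, and the helper names are my own. Real data mining software automates this in far more sophisticated ways.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the y-values of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def cv_error(data, k, folds=5):
    """Mean squared prediction error of the smoother under cross-validation."""
    total = 0.0
    for f in range(folds):
        held_out = data[f::folds]  # every folds-th point, starting at f
        train = [p for i, p in enumerate(data) if i % folds != f]
        total += sum((y - knn_predict(train, x, k)) ** 2 for x, y in held_out)
    return total / len(data)  # each point is held out exactly once

random.seed(0)
# Simulated data: a smooth curve plus random noise.
data = [(x / 10, math.sin(x / 10) + random.gauss(0, 0.3)) for x in range(60)]

# Small k follows the noise (too complex); large k flattens the curve (too simple).
candidates = [1, 3, 5, 9, 15, 25, 40]
errors = {k: cv_error(data, k) for k in candidates}
best_k = min(errors, key=errors.get)
print(f"cross-validated choice: k={best_k}")
```

The point of the sketch is that the choice is made on data the model did not see during fitting, which is what guards against a model that is too complex.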
There is a wealth of data mining tools, each with its own particular strengths, so a sophisticated modeler can apply a variety of data mining methods to rapidly triangulate on an accurate solution.
Finally, data mining models are just a lot of fun. Or am I the only one who thinks this sort of thing is cool?
Disadvantages of data mining models. While some of the disadvantages of data mining models are highlighted above (the need for highly trained personnel and specialized software), perhaps two additional disadvantages can be summarized by a couple of personal anecdotes that I originally discussed in a January 6, 2005 weblog entry.
The first story was told to me by a doctor here at Children's Mercy, Jay Portnoy. He was describing a data mining model that was fed images of both cars and trucks (a training set, in the parlance of data mining) to see if it could develop a rule for identifying whether a future image was a car or a truck based just on mathematical properties of that image. It did a pretty good job of finding factors in the training set that distinguished between cars and trucks. But it failed miserably on the first new image it tried to classify. It was an image of a car on a snow-covered highway. The data mining algorithm said that this was almost certainly a truck. What the researchers then realized is that in the training set, anytime there was snow in the background, it was a truck that was being shown and never a car. I suppose it is the tendency of marketing to always show trucks in rugged, primitive, and/or dangerous driving conditions. So the data mining model seized on a key relationship (color of the background) that existed only accidentally in the training set, rather than focusing on those aspects, such as the shape and size of the vehicle, that most of us would use to distinguish cars from trucks.
Moral from anecdote #1. Even the most sophisticated data mining models cannot overcome deficiencies in your data.
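The car and truck story can be mimicked with a toy example. In this sketch each hypothetical image is reduced to two yes/no features, and the "model" simply keeps whichever single feature best separates cars from trucks in the training set. The data and feature names are invented for illustration and have nothing to do with the actual model in the story.

```python
# Each row: (snowy_background, long_body, label). All values are invented.
training = [
    (1, 1, "truck"), (1, 1, "truck"), (1, 0, "truck"), (1, 1, "truck"),
    (0, 0, "car"), (0, 0, "car"), (0, 1, "car"), (0, 0, "car"),
]
feature_names = ["snowy_background", "long_body"]

def predict(row, feature):
    """Rule: call it a truck when the chosen feature is present."""
    return "truck" if row[feature] else "car"

def accuracy(rows, feature):
    """Fraction of rows that the single-feature rule labels correctly."""
    return sum(predict(r, feature) == r[2] for r in rows) / len(rows)

# The "model" keeps the feature with the best training accuracy.
best_feature = max(range(len(feature_names)), key=lambda f: accuracy(training, f))

# A new image: a car on a snow-covered highway.
new_image = (1, 0, "car")
prediction = predict(new_image, best_feature)
print(feature_names[best_feature], accuracy(training, best_feature), prediction)
```

Snow separates the training set perfectly, so the rule latches onto it and then calls the car a truck. The flaw is in the data, not the algorithm.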
The second story, which I heard in a training class on data mining models taught by Richard DeVeaux, dealt with the question "so what?". He mentioned that one of the earliest findings in the data mining world (though he was uncertain whether this is a true story or an urban legend) was an unusual association seen in sales patterns at convenience stores. It seemed that people who came in to buy beer almost always ended up buying diapers during the same visit. This is the classic sort of thing that data mining models are supposed to find: unusual and unexpected associations in a very large data set. So he posed this question to a group of managers: what would you do with this information? A common response was: stock the shelves so that the beer and the diapers are close together to make the trip for the customer faster and more convenient. Another common response was: put the beer and the diapers at opposite ends of the store so that customers would have to spend more time in the store, increasing the chances for impulse purchases. Another common response was a shrug of the shoulders. In fact, we often don't know what to make of the associations found by data mining models.
Moral from anecdote #2. Significant findings from a data mining model are not guaranteed to provide appropriate clinical guidance.
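For readers who have not seen how such an association is quantified, here is a small sketch that computes the three usual measures (support, confidence, and lift) for "beer implies diapers". The ten baskets are invented, and this is only the arithmetic behind the measures, not a full association rule algorithm.

```python
# Ten invented shopping baskets.
baskets = [
    {"beer", "diapers"}, {"beer", "diapers"}, {"beer", "diapers"},
    {"beer", "diapers", "chips"}, {"beer"}, {"diapers"},
    {"milk"}, {"bread"}, {"milk", "bread"}, {"chips"},
]

n = len(baskets)
n_beer = sum("beer" in b for b in baskets)
n_diapers = sum("diapers" in b for b in baskets)
n_both = sum({"beer", "diapers"} <= b for b in baskets)

support = n_both / n                   # how often the pair appears at all
confidence = n_both / n_beer           # of the beer buyers, how many buy diapers
lift = confidence / (n_diapers / n)    # compared with diaper buying overall

print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```

A lift above 1 says the two items appear together more often than chance would suggest. It does not say what you should do about it.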
The bottom line. No one statistical tool or method is going to provide you with everything you need. The broader the range of methods that you bring to bear on a problem, the better your chances of success.
This webpage was written by Steve Simon on 2007-11-11, edited by Steve Simon, and was last modified on 2008-07-08. Send feedback to ssimon at cmh dot edu. Category: Adverse events in clinical trials