25 years of conventional evaluation of data analysis proves worthless in practice

So-called 'intelligent' computer-based methods for classifying patient samples, for example, have been evaluated with the help of two methods that have completely dominated research for 25 years. Now Swedish researchers at Uppsala University are revealing that this methodology is worthless when it comes to practical problems. The article is published in the journal Pattern Recognition Letters.

Today there is rapidly growing interest in 'intelligent' computer-based methods that use various classes of measurement signals, from different patient samples, for instance, to create a model for classifying new observations. This type of method is the basis for many technical applications, such as recognition of human speech, images, and fingerprints, and is now also beginning to attract new fields such as health care.

"Especially in applications in which faulty classification decisions can lead to catastrophic consequences, such as choosing the wrong form of therapy for treating cancer, it is extremely important to be able to make a reliable estimate of the performance of the classification model," explains Mats Gustafsson, Professor of signal processing and medical bioinformatics at Uppsala University, who co-directed the new study together with Associate Professor Anders Isaksson.

To evaluate the performance of a classification model, one normally tests it on a number of trial examples that have never been involved in the design of the model. Unfortunately there are seldom tens of thousands of test examples available for this type of evaluation. In biomedicine, for instance, it is often expensive and difficult to collect the patient samples needed, especially if one wishes to analyze a rare disease. To solve this problem, many different methods have been proposed. Since the 1980s two methods have completely dominated research, namely, cross validation and resampling/bootstrapping.
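As a rough illustration of the two dominant approaches named above, the sketch below estimates the error of a toy classifier by k-fold cross validation and by a simple out-of-bag bootstrap. It is not taken from the study: the nearest-mean classifier, the one-dimensional Gaussian data, and all function names (`make_data`, `fit_nearest_mean`, `cross_validation_error`, `bootstrap_error`) are assumptions chosen only to keep the example short.

```python
import random
from statistics import mean

# Hypothetical toy setup (not from the study): two classes, one feature each,
# drawn from Gaussians whose means differ by `separation`.
def make_data(n_per_class, separation=1.0, seed=0):
    rng = random.Random(seed)
    data = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_per_class)]
    data += [(rng.gauss(separation, 1.0), 1) for _ in range(n_per_class)]
    return data

def fit_nearest_mean(samples):
    """Train a toy classifier: store the mean feature value of each class."""
    by_class = {}
    for x, label in samples:
        by_class.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in by_class.items()}

def predict(model, x):
    """Assign x to the class whose stored mean is closest."""
    return min(model, key=lambda label: abs(x - model[label]))

def error_rate(model, samples):
    """Fraction of samples the model misclassifies."""
    return sum(predict(model, x) != label for x, label in samples) / len(samples)

def cross_validation_error(samples, k=5, seed=0):
    """k-fold cross validation: each fold is held out once as the test set.
    Assumes len(samples) >= k so that no fold is empty."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    errors = []
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        errors.append(error_rate(fit_nearest_mean(train), test))
    return mean(errors)

def bootstrap_error(samples, rounds=200, seed=0):
    """Out-of-bag bootstrap: train on a resample drawn with replacement,
    test on the examples that the resample happened to leave out."""
    rng = random.Random(seed)
    n = len(samples)
    errors = []
    for _ in range(rounds):
        idx = [rng.randrange(n) for _ in range(n)]
        chosen = set(idx)
        out_of_bag = [samples[i] for i in range(n) if i not in chosen]
        if not out_of_bag:
            continue  # nothing left to test on in this round
        model = fit_nearest_mean([samples[i] for i in idx])
        errors.append(error_rate(model, out_of_bag))
    return mean(errors)

data = make_data(n_per_class=10)
print("cross validation estimate:", cross_validation_error(data))
print("bootstrap estimate:", bootstrap_error(data))
```

Both functions reuse the same small set of examples for training and testing in turn, which is exactly why they are attractive when fresh test examples are scarce.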

"This has entailed that the performance assessment of virtually all new methods and applications reported in the scientific literature in the last 25 years has been carried out using one of these two methods," says Mats Gustafsson.

In the new study, the Uppsala researchers use both theory and convincing computer simulations to show that this methodology is worthless in practice when the total number of examples is small in relation to the natural variation that exists among different observations. What is considered a small number depends in turn on what problem is being studied. In other words, it is impossible to determine whether the number of examples is sufficient.
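The effect described here can be reproduced in miniature. The sketch below is not the researchers' simulation; it is a toy experiment under assumed conditions (two overlapping one-dimensional Gaussian classes, a nearest-mean classifier, and hypothetical helper names such as `estimate_gaps`). For many independently drawn datasets of a given size, it compares the cross validation estimate with the error the same classifier makes on a large set of fresh examples.

```python
import random
from statistics import mean, pstdev

# Hypothetical toy setup (not from the study): class 0 ~ N(0, 1),
# class 1 ~ N(separation, 1), one feature per example.
def draw(rng, n_per_class, separation=1.0):
    return ([(rng.gauss(0.0, 1.0), 0) for _ in range(n_per_class)] +
            [(rng.gauss(separation, 1.0), 1) for _ in range(n_per_class)])

def fit(samples):
    """Nearest-mean classifier: remember the mean feature value per class."""
    groups = {0: [], 1: []}
    for x, y in samples:
        groups[y].append(x)
    return {y: mean(xs) for y, xs in groups.items() if xs}

def predict(model, x):
    return min(model, key=lambda y: abs(x - model[y]))

def error_rate(model, samples):
    return sum(predict(model, x) != y for x, y in samples) / len(samples)

def cv_error(samples, k, rng):
    """Plain k-fold cross validation estimate of the error rate."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    errs = []
    for i, test in enumerate(folds):
        train = [s for j, f in enumerate(folds) if j != i for s in f]
        errs.append(error_rate(fit(train), test))
    return mean(errs)

def estimate_gaps(n_per_class, trials=100, k=5, seed=1):
    """For each trial, return (cross validation estimate) minus
    (error of the classifier trained on all the data, measured on fresh examples)."""
    rng = random.Random(seed)
    fresh = draw(rng, 1000)  # large independent test set, a stand-in for "the real world"
    gaps = []
    for _ in range(trials):
        data = draw(rng, n_per_class)
        gaps.append(cv_error(data, k, rng) - error_rate(fit(data), fresh))
    return gaps

for n in (5, 50, 500):
    gaps = estimate_gaps(n)
    print(f"n per class = {n:3d}: mean gap = {mean(gaps):+.3f}, "
          f"spread (std) = {pstdev(gaps):.3f}")
```

In this toy setting the spread of the gap shrinks as the datasets grow, so a single cross validation figure from a handful of examples says little about real-world accuracy. The dataset size at which the spread becomes acceptable depends on how strongly the classes overlap, which is not known in advance for a real problem.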

"Our main conclusion is that this methodology cannot be depended on at all, and that it therefore needs to be immediately replaced by Bayesian methods, for example, which can deliver reliable measures of the uncertainty that exists. Only then will multivariate analyses be in any position to be adopted in such critical applications as health care," says Mats Gustafsson.

September 3, 2008

Comments

September 4, 2008 by Anonymous, 1 year 10 weeks ago
Comment id: 31811

The previous comments have shown critical thinking and familiarity with the subject at hand. This is surprisingly refreshing.

I particularly agree with the response stating the conclusion has been stretched to the point of sensationalism, but it is good that the models and methods of implementing systems are being reviewed. Finding problems and patterns to avoid is one of the first steps to improving system design. Also, it certainly should not be news to any programmer (I will admit to being one) that without adequate design, concept validation, code testing, maintenance and disclosure of the limitations of the system, the end product cannot be expected to produce reliable results.

Did anyone bother to read the article?

September 4, 2008 by Anonymous, 1 year 11 weeks ago
Comment id: 31807

I'll try to spell it out for you.

The article is not saying that classification methods don't work, and isn't disparaging any particular method or combination of methods. It is not even stating that you cannot achieve arbitrarily small error. The article is ONLY saying that you cannot reliably estimate the resulting classifier accuracy (out in the real world) using cross validation or bootstrapping on your dataset.

If your data sample is large and diverse enough, you could. But you just won't ever know if that is the case or not.

Personally, I think their 'main conclusion' stretches the point a bit, probably for sensationalism's sake.

Close to the Mark

September 4, 2008 by Anonymous, 1 year 11 weeks ago
Comment id: 31806

I think the real problem is the blind application of various models without knowing when the model is applicable or not and other constraints of the model.

"All models are wrong, some are useful." --George Box

What?

September 4, 2008 by Anonymous, 1 year 11 weeks ago
Comment id: 31805

Conventional evaluation went out the window years ago dude.

Jiff
www.anonymize.us.tc

Stating the fine print

September 4, 2008 by Anonymous, 1 year 11 weeks ago
Comment id: 31795

Like the last reviewer said, everybody involved in computer science research knows this already: any model that we come up with includes fine print of positive/negative constraints and includes an error variable.

It's up to the application to decide what error is acceptable. For example, if research in the next 10 or 20 years comes up with a model which has smaller error than human error (essentially increasing the survival probability of patients over the entire world, which includes advanced diagnostics available to first world countries and simple diagnostics for the third world), then you might as well use it for third world countries.

I don't agree...

September 4, 2008 by Anonymous, 1 year 11 weeks ago
Comment id: 31794

Pattern Recognition is viable only for particular cases. However, for the cases where test sets cannot be generated (or there's no benefit to generating them), wouldn't it be better to use CART, C4.5, C5, or other similar clustering algorithms? These algorithms are widely used in both academic and commercial sectors, and I would even venture that Pattern Recognition is lower on the totem pole than these.

Yes, we know!

September 4, 2008 by Anonymous, 1 year 11 weeks ago
Comment id: 31791

A pattern recognition program is only as good as your test data set. Good data sets are hard to come by, which is why there are always errors in computer recognition. It is also why a small sample size cannot describe the entire population. These professors are just stating the obvious. Anyone who is in the pattern recognition field knows better methods for recognition, like multiple Gaussian or a combination of methods. To say that the entire methodology is worthless is a bold statement, as each method is tailored to a specific task. Prof. Gustafsson's issue is with the data set(s); there is nothing wrong with the methods. Can I have my 5 minutes back now?
