Monday, 7 December 2009

Validating your statistical methods


(image by kevindooley)

I'm in the business of creating 'clever' new statistical methods. It's what I most enjoy, research-wise. But I'm quite 'applied' about it, because I'm interested in using these 'clever' methods to do something useful.

I've recently been thinking a lot about how to identify the methods that work well and/or improve on the best existing methods. And I've been coming to the conclusion that it's very easy to do this poorly.
The problem I see time and again (and, to be fair, one I struggle to avoid myself sometimes) is that a new 'clever' method is tested on one or two standard test data-sets (real data if you're lucky, synthetic if you're not), shown to improve some chosen metric over the existing methods, and then left at that.

This is fine, as far as it goes. But it sometimes doesn't go very far. Will the method work well more generally? Are the metrics measuring anything useful? What about types of data that have different noise characteristics? And how long will it take to run if you have a data-set one hundred times bigger? This is not to say that the results as given have no merit; it's just to highlight that this first wave of testing is far from the be-all and end-all.

So, what's the solution? Validation. On new, interesting data-sets. For which someone wants or needs to know the answer. In settings where the performance will matter. This last part is crucial: if the result matters, you instantly have a good metric for how well your 'clever' method is doing, and you can be confident you're developing something that will actually be useful.
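To illustrate the idea, here's a minimal sketch in Python (using numpy and scikit-learn). Everything in it is a hypothetical stand-in: load_benchmark and load_new_experiment represent "the data the method was developed on" versus "freshly collected data with different noise characteristics", and the model and metric are just placeholders for whatever your own problem and its downstream users actually care about.

```python
# A minimal sketch of "validate on new data, with a metric that matters".
# All data and functions here are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

TRUE_COEF = np.linspace(0.5, 1.5, 10)  # unknown in practice; used only to simulate data

def load_benchmark():
    # Stand-in for the standard test data-set the method was tuned on.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = X @ TRUE_COEF + rng.normal(scale=0.5, size=200)
    return X, y

def load_new_experiment():
    # Stand-in for fresh data with different (here, heavier-tailed) noise.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 10))
    y = X @ TRUE_COEF + rng.standard_t(df=3, size=50)
    return X, y

# Develop and tune on the benchmark...
X_bench, y_bench = load_benchmark()
model = Ridge(alpha=1.0).fit(X_bench, y_bench)

# ...but judge it on data nobody had seen when the method was built,
# using the quantity a downstream decision would actually depend on
# (plain MSE here, purely for illustration).
X_new, y_new = load_new_experiment()
print("Error on new data:", mean_squared_error(y_new, model.predict(X_new)))
```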

A really good example of this (full disclosure: with which I have no experience whatsoever :-) ) is the use of statistical methods in financial trading. Here there is a well-defined metric of success (how much money you make) and it's straightforward to generate new validation data - you just use the method to do some trading, then look at how well it performs. Ambiguous result? Rinse and repeat until you've tested enough to convince even the most hardened skeptic.
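To make that loop concrete, here is a toy sketch of the same idea in Python. The signal function and the simulated price series are entirely made up (this is paper-trading arithmetic, not a serious backtest), but the point stands: the "metric" is simply the money the method would have made on data it has never seen.

```python
# Toy illustration of trading-style validation: the metric is money made.
# signal() and the price series are invented for the sketch.
import numpy as np

def signal(past_prices):
    # Hypothetical predictor: go long if the last move was up, short otherwise.
    return 1.0 if past_prices[-1] > past_prices[-2] else -1.0

def paper_trade(prices):
    pnl = 0.0
    for t in range(2, len(prices)):
        position = signal(prices[:t])                  # decide using data up to time t
        pnl += position * (prices[t] - prices[t - 1])  # then see what that decision earns
    return pnl

rng = np.random.default_rng(42)
fresh_prices = 100 + np.cumsum(rng.normal(size=250))   # stand-in for genuinely new market data
print("P&L over the new period:", paper_trade(fresh_prices))
```

Rerunning this on each new stretch of data is the "rinse and repeat" step: every pass gives another independent read on the only metric that matters here.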

My suspicion is that a lot of this is terribly obvious in other, more applied disciplines. For example, Google's search engine algorithm has huge numbers of people (both Google engineers and volunteer testers) using it in real, creative ways to find the flaws, so they can be fixed. That is a whole iterative loop that doesn't occur much in an academic context. Of course, we don't have anywhere near that level of resources, and we often don't have the luxury of a quick supply of new data (new experiments may need to be performed first, and so on). Nevertheless, pushing on beyond the first paper's worth of work (and testing), applying your method to new data-sets, and then using what you learn to make further improvements can really make a difference to just how clever your 'clever' methods are.

