Wednesday, 22 February 2012

The point of clever (statistical) methods

What's the point of clever statistical methods?

As someone who spends a significant proportion of their working life creating new statistical machine learning methods, it's a question that I dwell on.  Possibly even brood about.  And with good reason.

The subtext to this question is the suggestion that maybe for a given data analysis task, 'vanilla' methods might be perfectly adequate and that there's no reason to go to all the effort of inventing something new.  Want to find sub-groups in your data?  Why not just use k-means clustering?  Want to make some predictions?  Surely a support vector machine or a neural net would suffice.  Feature selection?  Random Forest.  Dimensionality reduction?  Principal component analysis is pretty good.

This is potentially vexing for someone (like me) who enjoys developing clever new ways of analysing data.  But it's an entirely valid point.  And for many of the tasks I find myself doing day to day when analysing scientific data, these standard methods are indeed perfectly adequate.

So that raises the question: is there actually any point to developing clever statistical methods?

Happily, I think the answer is still (sometimes) yes, for two reasons.

The first is that novel statistical methods can constitute research in their own right.  There is always merit to adding to the pool of human knowledge.

The second is much more central to the scientific side of my research.  It's not always true, but sometimes there is something that the 'vanilla' methods simply do not do.  A trick that they miss.  Some quality of the data to which they're insensitive.

I'm not talking about some incremental gain where your super-clever classifier outperforms a support vector machine by 2.7% in prediction accuracy.  Who cares, right?  I'm talking about a novel method that infers the most likely number of clusters in the data, rather than relying on user input.  Or the classifier that learns how to optimally combine information from an arbitrary number of different data sets.  Or the latent Dirichlet allocation model for data where we know there are clusters, but we also know that they overlap.

I think this second reason is the key.  For some types of data, we know more about the underlying structure than a standard method can encode.  When this is the case, by developing a more advanced statistical model we can better analyse and understand our data.  We have to be careful to identify these cases.  Sometimes, the standard existing method really is good enough.  But sometimes that extra level of complexity will pay for itself with an extra level of understanding.  And just occasionally, that's vital.


I realised that I wrote a similar post a few years ago (!).  Interesting to see how my thoughts have developed :-)
