Tuesday, 21 July 2009

Turning data into knowledge

At some level, science is about turning data into useful knowledge. When there are only a few data points (and especially when the signal is strong), just looking at the data can be enough to gain new understanding. The essence of modern science, however, is making this transformation with very large amounts of data, and this requires a particular set of approaches.

The blindingly obvious...
Sometimes you'll get lucky and the knowledge will be obvious from the available data. For example, you might have a scientific image with 10^8 pixels, but the object you're looking at is imaged at high resolution and with a huge signal-to-noise ratio. From this, you can probably learn a lot without doing anything more than looking at the image (and perhaps making a few basic measurements). But it's rare that you can't learn more by more fully exploring all those pixels, and to do that you'll need some better tools.

Vanilla methods
If you do need to do something to your data in order to extract some useful knowledge, your first port of call might be "vanilla" methods. These are the bog-standard, well-understood tools of the data analysis trade: taking an average, finding a p-value, fitting a regression line, clustering using k-means and so on. You are now into the regime where your data cannot all fit into your brain at once, so you have to start using tools to help you extract useful scientific knowledge. Vanilla methods are by definition widely used, and they tend to be easy to interpret. Your aim here is to use these tools to spot the patterns in large amounts of data. Do your genes group into distinct clusters? Are there significant sources in your astronomical image? What's the most likely underlying curve, given the noisy measurements you've made? If you can reduce a billion data-points to a hundred clusters, a thousand point sources or a curve defined by ten parameters, you have already made a lot of progress in understanding what your data are telling you.
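
As a minimal sketch of that kind of reduction, the Python below fits a straight line by ordinary least squares to ten thousand noisy points, leaving just two numbers. The synthetic data, the true slope and intercept, the noise level and the helper name fit_line are all assumptions made up for illustration, not anything from a real data-set.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: a known line (slope 2, intercept 1) plus Gaussian noise.
rng = random.Random(0)
xs = [i / 1000 for i in range(10000)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.5) for x in xs]

slope, intercept = fit_line(xs, ys)
print(f"{len(xs)} points reduced to slope={slope:.3f}, intercept={intercept:.3f}")
```

The fitted values should land close to the slope and intercept used to generate the data, which is the point: two parameters now stand in for ten thousand measurements.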

Clever methods
Of course, you can also try to be more clever than that. If you have a good idea of the sort of general structure you expect in your data then you can build a method that can target that type of structure. Perhaps you have a good physical model of what's going on? The power spectrum of the Cosmic Microwave Background (CMB) radiation is a good example - the major structures in this curve are well-defined by the underlying physics.

You can try to build clever methods that do more of the donkey work for you. Are you clustering your data? Into how many clusters should you be dividing the data? The right clever method can apply a robust principle to determine this optimally, leaving you free to consider the results in greater detail.
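
One possible principle for choosing the number of clusters is the Bayesian Information Criterion (BIC), which rewards goodness of fit and penalises extra parameters. The sketch below is only an illustration under strong assumptions: one-dimensional synthetic data, k-means with hard assignments, and a Gaussian model with a single shared variance. The function names kmeans_1d and bic are hypothetical, and this is not claimed to be the method of choice in any particular field.

```python
import math
import random


def kmeans_1d(data, k, iters=100):
    """Lloyd's algorithm in one dimension, seeded at evenly spaced quantiles."""
    pts = sorted(data)
    n = len(pts)
    centres = [pts[int((2 * j + 1) * n / (2 * k))] for j in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in pts:
            nearest = min(range(k), key=lambda j: abs(x - centres[j]))
            groups[nearest].append(x)
        # An empty cluster keeps its old centre.
        new = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
        if new == centres:
            break
        centres = new
    return centres, groups


def bic(groups, centres):
    """BIC for hard-assigned clusters with one shared Gaussian variance."""
    n = sum(len(g) for g in groups)
    k = len(centres)
    rss = sum((x - c) ** 2 for g, c in zip(groups, centres) for x in g)
    var = rss / n
    # Log-likelihood: cluster weights plus Gaussian scatter about each centre.
    log_lik = sum(len(g) * math.log(len(g) / n) for g in groups if g)
    log_lik -= 0.5 * n * (math.log(2 * math.pi * var) + 1)
    n_params = 2 * k  # k centres, k - 1 weights, 1 variance
    return -2 * log_lik + n_params * math.log(n)


# Hypothetical data: three well-separated groups of 100 points each.
rng = random.Random(1)
data = [rng.gauss(mu, 1.0) for mu in (0.0, 10.0, 20.0) for _ in range(100)]

scores = {}
for k in range(1, 7):
    centres, groups = kmeans_1d(data, k)
    scores[k] = bic(groups, centres)

best_k = min(scores, key=scores.get)  # lowest BIC wins
print("chosen number of clusters:", best_k)
```

On well-separated synthetic data like this, the lowest BIC should fall at three clusters. Real data are rarely this tidy, and other criteria (or a fully Bayesian mixture model) may well be more appropriate.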

You can also build very general clever methods that are capable of spotting a large range of different types of structure (for example, Bayesian non-parametric techniques, and splines for curve fitting). Care must be taken not to identify every noise spike as structure, but this can in principle be a great way to spot the unexpected.

Statistical inference
All of this can be viewed as statistical inference. Inference is the extension of logical deduction to include uncertainty (because probability theory extends mathematical logic to include degrees of uncertainty). Indeed, there's a view that the scientific method is all statistical inference. With very certain observations (the sun rises every morning), we are just left with logical deduction. With uncertain/noisy observations, we are left with statistics, maximum likelihood techniques, Bayesian methods, the need for repeated experiments and the like. And consider Occam's razor (often held up as an important part of the scientific method): probability theory actually provides a mathematical derivation of Occam's razor, via Bayesian model selection (if two models fit equally well, the simpler model will have a higher Evidence value, meaning it's more probable given the data, assuming the two models were equally plausible beforehand).
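
A small worked example of that Occam effect, under assumptions chosen purely for illustration: a coin is tossed ten times, and we compare a "fair coin" model (no free parameters) with a "biased coin" model whose unknown bias has a uniform prior. The Evidence for each model is the probability of the observed sequence under that model, with any free parameter integrated out; for the biased model the integral has the closed form h! t! / (h + t + 1)!. The function names are hypothetical.

```python
from math import factorial


def evidence_fair(heads, tails):
    """Probability of the observed sequence if p(heads) is fixed at 0.5."""
    return 0.5 ** (heads + tails)


def evidence_biased(heads, tails):
    """Integral of p^h (1 - p)^t over a uniform prior on p in [0, 1]."""
    return factorial(heads) * factorial(tails) / factorial(heads + tails + 1)


# Five heads, five tails: the best-fit bias is 0.5, so both models can fit
# the data equally well at their best, but the simpler one wastes no prior
# probability on biases the data rule out.
e_fair = evidence_fair(5, 5)
e_biased = evidence_biased(5, 5)
bayes_factor = e_fair / e_biased
print(f"5H/5T: fair={e_fair:.6f} biased={e_biased:.6f} ratio={bayes_factor:.2f}")

# Nine heads, one tail: now the extra flexibility pays for itself.
print("9H/1T favours biased model:", evidence_biased(9, 1) > evidence_fair(9, 1))
```

With five heads and five tails the fair-coin model comes out ahead by a factor of about 2.7, even though the flexible model can match the data just as well at its best-fit bias. With nine heads and one tail the flexible model wins, so the razor only cuts when the extra complexity isn't earning its keep.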

Science as data compression...?
While we've drifted into philosophy-of-science territory, there's another interesting idea that's highlighted by the advent of data-sets too large for a human brain:

One could consider science as a series of attempts at data compression.

Think about this for a moment.

What are we looking for, as scientists? We're looking for generalisations about the area in which we're working. We want to know how metals behave as we heat them. We want to know how the universe expands over time. We want to know how our favourite set of genes interact with one another in different conditions. We can make a vast number of observations about any one of these, but what we're after is a set of rules that tells us how these things behave and we want those rules to be as general as possible.

Once we find (and test) such a rule, we've encoded the essence of all those observations into one (often simple) rule. Think about Newton's law of gravity; it goes a long way to describing the motion of a hundred billion stars in our galaxy, but it's just an inverse-square law with a couple of masses and a gravitational constant. In terms of bits of information, that's a pretty awesome compression factor.

So what? Well, we're talking about the need for methods for converting large data-sets into something more interpretable by a human brain. These are compressions in themselves. So we're using algorithms/statistical methods to partially automate the scientific method. Which leads me to wonder how much more of it we could automate, if we really put our minds to it....
