Tuesday, 9 February 2010

The Information Hierarchy

Rands In Repose posted an interesting article that introduced me to a concept I'd not previously encountered: the Information Hierarchy (also known as the Wisdom or Knowledge Hierarchy).

The idea is this: information can be classified in a 4-level hierarchy.
  • Data - the raw material of knowledge
  • Information - data that have been organised/presented
  • Knowledge - information that has been acquired and understood
  • Wisdom - distilled and integrated knowledge and understanding
It strikes me that this gives us some insights into the process of science, especially nowadays. At the most basic scientific level, we're simply trying to gather data about the world and progress through the hierarchy to build up information, knowledge and ultimately wisdom about the world/universe/multiverse/whatever in which we live.

In the original version of this process, every stage was carried out by people. This no longer has to be the case, however. Much data gathering is now automated to at least some degree. Even if scientists are ultimately responsible for building and running the experiments/instruments, a lot of the heavy lifting is now carried out by automated or semi-automated systems, with data reduction carried out by software pipelines.

I would argue that we are also able to automate aspects of the second level of the hierarchy, the production of information. Specifically, I think one can regard statistical modeling and machine learning as doing just that. We live in an era of phenomenal scientific data production, so we now routinely use (and create) statistical methods for extracting the useful information from these giant data-sets.

So I think this raises an interesting question: how much of this process might we ultimately be able to automate, and in what ways? (And what would the implications be of automated systems capable of the Knowledge and Wisdom levels?)

Friday, 5 February 2010

Fishing for significance

I recently read a very thought-provoking article (sorry, I think a subscription to the journal 'Bioinformatics' is required for this link) by Anne-Laure Boulesteix. The paper's title is 'Over-optimism in bioinformatics research' and one of the points the author makes is that there can be an effect called 'fishing for significance' when developing (for example) new statistical methods.

This problem is a version of what happens when you make multiple hypothesis tests. Imagine that you test 100 different genes to see whether or not they're differentially expressed between two different experiments and that for each gene you compute a regular, single-test p-value. If you decide to keep the genes that have a p-value <0.05, then even if none of the genes are truly differentially expressed you would expect to keep about 5 of them, which crop up purely by chance (false positives). This leads statisticians to make p-value corrections when making multiple hypothesis tests, to avoid getting (so many) false positives.
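To make the numbers concrete, here is a minimal simulation sketch of the 100-gene example. Everything in it is an assumption for illustration: every gene is simulated with no real difference between the two experiments, each experiment has 10 replicates drawn from a unit-variance normal distribution, the test is a simple two-sided z-test, and the correction shown is Bonferroni (dividing the threshold by the number of tests), which is only one of several corrections in common use.

```python
import random
from statistics import NormalDist, fmean


def null_p_value(rng, n_samples=10):
    """Two-sided z-test p-value for one gene with NO real difference.

    Both groups are drawn from the same N(0, 1) distribution, so any
    'significant' result is a false positive by construction.
    """
    a = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    # The difference of two means of n unit-variance samples has variance 2/n.
    z = (fmean(a) - fmean(b)) / (2.0 / n_samples) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def count_false_positives(n_genes=100, alpha=0.05, n_repeats=200, seed=0):
    """Average number of genes kept per experiment, raw and Bonferroni-corrected."""
    rng = random.Random(seed)
    raw_total = 0
    bonferroni_total = 0
    for _ in range(n_repeats):
        p_values = [null_p_value(rng) for _ in range(n_genes)]
        raw_total += sum(p < alpha for p in p_values)
        # Bonferroni: compare each p-value against alpha / number of tests.
        bonferroni_total += sum(p < alpha / n_genes for p in p_values)
    return raw_total / n_repeats, bonferroni_total / n_repeats


raw_mean, bonferroni_mean = count_false_positives()
print(f"mean false positives, uncorrected: {raw_mean:.2f}")
print(f"mean false positives, Bonferroni:  {bonferroni_mean:.2f}")
```

The uncorrected count should hover around 5 per 100 genes, while the corrected count should be close to zero. The price of the correction, not shown here, is that genuinely differentially expressed genes become harder to detect.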

With that idea in mind, consider the typical development process for a new statistical method. Imagine you have a good idea for a new type of statistical method; it's clever, it should be really useful scientifically, and you're very keen to spend some time working on it. You build some software to implement your idea, analyse an example data-set and produce some results and, while it works quite well, the results give you several ideas about how to improve it.

You apply each of these ideas in turn, keeping the good ones and discarding the bad ones, and this improves the results. It also leads to even more good ideas, which you try out in a similar fashion. And eventually you have a method that's producing pretty impressive results on the test data-set. You write up the results, publish them and move on to the next project.

This is all well and good, and people certainly use this approach to produce genuinely good statistical methods. But there is an element of multiple testing in what I've just described. By trying out a range of ideas on our test data-set, then keeping the 'good' ones, we're optimising our approach to do well on the test data.

What's really happening in this process is that two forms of apparent improvement are mixed together. The first is due to our developing genuinely improved statistical models that simply work better. The second is that we're over-fitting to our test data-set, finding models that just happen to work well on these particular data. This second one is a problem because it's an illusion; it won't help us with any other data (it will generalise poorly).

So, what's the solution? Probably the most robust way is to validate any new method on independent data once the model has been finalised, measuring performance using metrics that were decided upon in advance. This isn't always easy, because suitable data aren't always easy to obtain, but I think this has to be the goal to aspire to.
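The over-fitting effect, and the way independent validation exposes it, can be sketched with a toy simulation. This is an illustration under loud assumptions, not anything taken from the paper: the labels are coin flips, the features are random, and each "variant" of the method is the hypothetical rule "predict the label to equal feature j", so no variant has any genuine skill. We try 20 variants on a 40-sample test data-set, keep the best, and then score that finalised choice on fresh data.

```python
import random


def make_data(rng, n_samples, n_features):
    """Random binary features and coin-flip labels: nothing here is learnable."""
    features = [[rng.random() < 0.5 for _ in range(n_features)]
                for _ in range(n_samples)]
    labels = [rng.random() < 0.5 for _ in range(n_samples)]
    return features, labels


def accuracy(features, labels, j):
    """Accuracy of the rule 'predict the label to equal feature j'."""
    hits = sum(row[j] == y for row, y in zip(features, labels))
    return hits / len(labels)


def run_experiment(n_variants=20, n_test=40, n_independent=500,
                   n_repeats=100, seed=1):
    """Average accuracy of the selected variant on tuning data and on fresh data."""
    rng = random.Random(seed)
    tuned_total = 0.0
    independent_total = 0.0
    for _ in range(n_repeats):
        test_x, test_y = make_data(rng, n_test, n_variants)
        # "Development": try every variant on the test set and keep the best.
        best = max(range(n_variants),
                   key=lambda j: accuracy(test_x, test_y, j))
        tuned_total += accuracy(test_x, test_y, best)
        # "Validation": score the finalised choice on independent data.
        new_x, new_y = make_data(rng, n_independent, n_variants)
        independent_total += accuracy(new_x, new_y, best)
    return tuned_total / n_repeats, independent_total / n_repeats


tuned_acc, independent_acc = run_experiment()
print(f"accuracy on the data used for tuning: {tuned_acc:.2f}")
print(f"accuracy on independent data:         {independent_acc:.2f}")
```

Because nothing is learnable, the independent accuracy should sit near 0.5, while the accuracy on the data used for tuning should look clearly better than chance. In this toy setup, all of that apparent improvement is the illusory kind.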

Interestingly, this highlights a merit in turning a clever new statistical method into a tool for people to use: this is a great way to test said method on a wide range of different data-sets. Of course, it can be substantially more effort to develop such a tool, and it can be a bit intimidating to subject your new method to such rigorous scrutiny. But maybe this is the only way to find new methods that really are improvements over the current state of the art.

How to...optimise Matlab



Matlab is a language that's used a lot in science, and for good reason. It's quick to code in, plus there are loads of built-in functions and packages available for many of the tasks that Programmer-Scientists might find themselves involved in. The main issue that a language like Matlab faces is that it's slower than languages like C++ and FORTRAN, which can be inconvenient when dealing with resource-intensive tasks like data analysis, statistical modelling or numerical simulation. But fear not! We proudly present our guide to optimising your Matlab code.