Tuesday, 9 February 2010

The Information Hierarchy

Rands In Repose posted an interesting article which included a concept called the Information Hierarchy (also known as Wisdom or Knowledge Hierarchy) which I'd not previously encountered.

The idea is this: information can be classified in a 4-level hierarchy.
  • Data - the raw material of knowledge
  • Information - data that have been organised/presented
  • Knowledge - information that has been acquired and understood
  • Wisdom - distilled and integrated knowledge and understanding
It strikes me that this gives us some insights into the process of science, especially nowadays. At the most basic scientific level, we're simply trying to gather data about the world and progress through the hierarchy to build up information, knowledge and ultimately wisdom about the world/universe/multiverse/whatever in which we live.

In the original version of this process, every stage was carried out by people. This no longer has to be the case, however. Much data gathering is now automated to at least some degree. Even if scientists are ultimately responsible for building and running the experiments/instruments, a lot of the heavy lifting is now carried out by automated or semi-automated systems, with data reduction carried out by software pipelines.

I would argue that we are also able to automate aspects of the second level of the hierarchy, the production of information. Specifically, I think one can regard statistical modeling and machine learning as doing just that. We live in an era of phenomenal scientific data production, so we now routinely use (and create) statistical methods for extracting the useful information from these giant data-sets.

So I think this raises an interesting question: how much of this process might we ultimately be able to automate, and in what ways? (And what would the implications be of automated systems capable of the Knowledge and Wisdom levels?)

Friday, 5 February 2010

Fishing for significance

I recently read a very thought-provoking article (sorry, I think a subscription to the journal 'Bioinformatics' is required for this link) by Anne-Laure Boulesteix. The paper's title is 'Over-optimism in bioinformatics research' and one of the points the author makes is that there can be an effect called 'fishing for significance' when developing (for example) new statistical methods.

This problem is a version of what happens when you make multiple hypothesis tests. Imagine that you test 100 different genes to see whether or not they're differentially expressed between two different experiments, and that for each gene you compute a regular, single-test p-value. If you keep the genes that have a p-value <0.05 then, even if none of the genes is truly differentially expressed, you would expect to keep about 5 of them purely by chance (false positives). This leads statisticians to correct p-values when making multiple hypothesis tests, to avoid getting (so many) false positives.
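The arithmetic in the 100-gene example can be sketched in a few lines of Python. This is only an illustration under my own assumptions: every gene is taken to be a true null, the tests are treated as independent, and the Bonferroni correction is used as one common choice of p-value correction (the text above doesn't name a specific one). The function names are made up for this sketch.

```python
import random

ALPHA = 0.05      # per-test significance threshold from the example
N_GENES = 100     # number of genes tested in the example


def expected_false_positives(n_tests, alpha):
    """Expected number of true-null tests that pass an uncorrected threshold."""
    return n_tests * alpha


def prob_any_false_positive(n_tests, alpha):
    """Chance of at least one false positive, assuming independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha):
    """Per-test threshold keeping the family-wise error rate at most alpha."""
    return alpha / n_tests


def simulate_null_p_values(n_tests, seed):
    """Under the null hypothesis, p-values are uniform on [0, 1)."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_tests)]


# Assumption: all 100 genes are nulls, so every "hit" is a false positive.
p_values = simulate_null_p_values(N_GENES, seed=1)
uncorrected_hits = sum(p < ALPHA for p in p_values)
corrected_hits = sum(p < bonferroni_threshold(N_GENES, ALPHA) for p in p_values)

print("expected false positives:", expected_false_positives(N_GENES, ALPHA))
print("P(at least one false positive):", prob_any_false_positive(N_GENES, ALPHA))
print("uncorrected hits in this simulated run:", uncorrected_hits)
print("hits after Bonferroni correction:", corrected_hits)
```

With these numbers, at least one false positive is almost guaranteed without a correction (the probability is above 99%), which is why the correction matters.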

With that idea in mind, consider the typical development process for a new statistical method. Imagine you have a good idea for a new type of statistical method; it's clever, it should be really useful scientifically, and you're very keen to spend some time working on it. You build some software to implement your idea, analyse an example data-set and produce some results and, while it works quite well, the results give you several ideas about how to improve it.

You apply each of these ideas in turn, keeping the good ones and discarding the bad ones, and this improves the results. It also leads to even more good ideas, which you try out in a similar fashion. And eventually you have a method that's producing pretty impressive results on the test data-set. You write up the results, publish them and move on to the next project.

This is all well and good, and people certainly use this approach to produce genuinely good statistical methods. But there is an element of multiple testing in what I've just described. By trying out a range of ideas on our test data-set, then keeping the 'good' ones, we're optimising our approach to do well on the test data.

What's really happening in this process is that two forms of apparent improvement are going on at once. The first is due to our developing genuinely improved statistical models that simply work better. The second is that we're over-fitting to our test data-set, finding models that just happen to work well on these particular data. This second one is a problem because it's an illusion; it won't help us with any other data (it will generalise poorly).
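The second, illusory kind of improvement can be shown with a toy simulation. This is a hedged sketch under my own assumptions rather than anything taken from the paper: every "variant" of the method is a pure coin-flip classifier with no real skill, we try many variants on the same test data-set, and we keep whichever one scores best. The function names and parameter values are hypothetical.

```python
import random


def coin_flip_accuracy(labels, rng):
    """Accuracy of predictions made by flipping a fair coin for each label."""
    hits = sum(1 for y in labels if rng.randint(0, 1) == y)
    return hits / len(labels)


def fish_for_significance(n_variants, n_samples, seed):
    """Try many skill-free variants and select the best one on the test data.

    Returns the selected variant's score on the test data, the same variant's
    score on independent (fresh) data, and the full list of score pairs.
    """
    rng = random.Random(seed)
    test_labels = [rng.randint(0, 1) for _ in range(n_samples)]
    fresh_labels = [rng.randint(0, 1) for _ in range(n_samples)]

    results = []
    for _ in range(n_variants):
        # A variant has no real skill, so its predictions on the test data
        # and on the fresh data are independent coin flips.
        test_score = coin_flip_accuracy(test_labels, rng)
        fresh_score = coin_flip_accuracy(fresh_labels, rng)
        results.append((test_score, fresh_score))

    # Selection uses the test score only, as in the development loop above.
    best_test, best_fresh = max(results, key=lambda pair: pair[0])
    return best_test, best_fresh, results


best_test, best_fresh, results = fish_for_significance(
    n_variants=50, n_samples=40, seed=0
)
print("selected variant, score on test data: ", best_test)
print("selected variant, score on fresh data:", best_fresh)
```

Because the selection step only ever sees the test data, the winning variant looks better than chance there even though no variant has any real ability; its score on the fresh data is just another draw around 50%.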

So, what's the solution? Probably the most robust way is to validate any new method on independent data once the model has been finalised, measuring performance using metrics that were decided upon in advance. This isn't always easy, because suitable data aren't always easy to obtain, but I think this has to be the goal to aspire to.

Interestingly, this highlights a merit in turning a clever new statistical method into a tool for people to use: this is a great way to test said method on a wide range of different data-sets. Of course, it can be substantially more effort to develop such a tool, and it can be a bit intimidating to subject your new method to such vigorous scrutiny. But maybe this is the only way to find new methods that really are improvements over the current state of the art.

Tuesday, 12 January 2010

Science and the Internet

There are some interesting points in a recent online article by Martin Rees (president of the Royal Society and a very well-regarded astrophysicist). The article as a whole is very interesting and well worth a read, but a few ideas particularly grabbed me.
  • the Internet enables wider participation in front-line science
  • it allows new styles of research (for example, mining large publicly available data-sets)
  • scientific discoveries can now be made by 'brute force' number crunching (e.g. exhaustive computational searches), as well as by the more traditional methods of experiment and insight (I would add theoretical calculation to that list)
I would also add another point to this list.
  • the Internet gives us faster access to resources, so we can get science done more quickly. For example, literature searches are much easier and faster to do when the papers are online and can be found via Google Scholar or similar.
Science achieved so much in the 20th century, but all of the above makes me really enthused about how much (more?) we're doing right now and what we'll achieve in the next couple of decades...

Monday, 11 January 2010

Deliberate Practice

There's an excellent post on Deliberate Practice/Serious Study, over at the Study Hacks blog. I've been interested for a while now in the idea that it takes humans about 10,000 hours to become an expert at something, but this post goes more into specific detail about the hows and whys.

The key idea is a thing called "deliberate practice", which means activities specifically designed to improve your performance at something. I was aware of this idea in a sporting context, so it's cool to discover that it's a more general psychological principle. The idea is that to improve at something, not only do you need to practise, but that practice has to have certain characteristics. For example, it needs to push the boundaries of what you're capable of, and it needs to provide feedback so that you know whether or not you did well in a given instance. 10,000 hours of coasting in your comfort zone won't do much for you.

It seems to me a Very Good Thing to focus on improving how well you do your research, but I suspect that most scientific researchers don't really do this (not explicitly, anyway). Maybe they should...

The Study Hacks article looks well-researched and links to a range of other materials, so it's well worth a look!

Wednesday, 6 January 2010

E. W. Dijkstra's three golden rules for scientific research

There's a very interesting post up over at the Successful Researcher blog on E. W. Dijkstra's three golden rules, based on some original text by the man himself which gives a bit more discussion of the reasoning behind them. In short:

  1. Raise your quality standards as high as you can and try to always work at the boundary of your abilities.
  2. Ideally, your work should be socially relevant and scientifically sound. If it can't be both, scientific soundness should prevail.
  3. Never tackle problems that are (or will soon be) addressed by people who are equally well or better equipped than you to do so.
These all seem very good advice to me. The first one means that not only are you always producing work of the highest possible quality (that you're capable of), you're also pushing the boundaries of what you're capable of. In other words, the way to improve is to push yourself.

The second one is nicely explained in E. W. Dijkstra's original post. Scientific rigour is all-important, because if you don't have that then social relevance (or anything else) isn't going to be useful. Being scientifically sound is a foundation.

The third one is the consideration, "If I didn't work on this, would my efforts be missed?". If the answer is no, then go and find something else to work on. Ideally, we should all be making contributions that we're uniquely well-suited to make.