21st Century Scientist: July 2009

Tuesday, 21 July 2009

Turning data into knowledge

At some level, science is about turning data into useful knowledge. When the number of data is small (and especially when the signal is strong), just looking at the data can be enough to gain new understanding. The essence of modern science, however, is making this transformation with very large amounts of data, and this requires a particular set of approaches.

The blindingly obvious...
Sometimes you'll get lucky and the knowledge will be obvious from the available data. For example, you might have a scientific image with 10^8 pixels, but the object you're looking at is imaged at high resolution and with a huge signal-to-noise ratio. From this, you can probably learn a lot without doing anything more than looking at the image (and perhaps making a few basic measurements). But it's rare that you can't learn more by more fully exploring all those pixels, and to do that you'll need some better tools.

Vanilla methods
If you do need to do something to your data in order to extract some useful knowledge, your first port of call might be "vanilla" methods. These are the bog-standard, well-understood tools of the data analysis trade: taking an average, finding a p-value, fitting a regression line, clustering using k-means and so on. You are now into the regime where your data cannot all fit into your brain at once, so you have to start using tools to help you extract useful scientific knowledge. Vanilla methods are by definition widely used, and they tend to be easy to interpret. Your aim here is to use these tools to spot the patterns in large amounts of data. Do your genes group into distinct clusters? Are there significant sources in your astronomical image? What's the most likely underlying curve, given the noisy measurements you've made? If you can reduce a billion data-points to a hundred clusters, a thousand point sources or a curve defined by ten parameters, you have already made a lot of progress in understanding what your data are telling you.
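As a rough illustration of a vanilla method at work, here is a minimal sketch in Python that reduces ten thousand noisy points to the two parameters of a straight line using ordinary least squares. The data are simulated and the "true" slope and intercept are invented purely for the example; the function name fit_line is likewise just an illustrative choice.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and of cross deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: a line with slope 2 and intercept 1, plus Gaussian noise.
rng = random.Random(42)
xs = [i / 100 for i in range(10_000)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.5) for x in xs]

slope, intercept = fit_line(xs, ys)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

The fitted values should land close to the slope and intercept used to generate the data, which is the point: twenty thousand numbers have been summarised by two.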

Clever methods
Of course, you can also try to be more clever than that. If you have a good idea of the sort of general structure you expect in your data then you can build a method that can target that type of structure. Perhaps you have a good physical model of what's going on? The power spectrum of the Cosmic Microwave Background (CMB) radiation is a good example - the major structures in this curve are well-defined by the underlying physics.

You can try to build clever methods that do more of the donkey work for you. Are you clustering your data? Into how many clusters should you be dividing the data? The right clever method can apply a robust principle to determine this optimally, leaving you free to consider the results in greater detail.

You can also build very general clever methods that are capable of spotting a large range of different types of structure (for example, Bayesian non-parametric techniques, and splines for curve fitting). Care must be taken not to simply identify every noise spike as structure, but this can in principle be a great way to spot the unexpected.

Statistical inference
All of this can be viewed as statistical inference. Inference is the extension of logical deduction to include uncertainty (because probability theory extends mathematical logic to include degrees of uncertainty). Indeed, there's a view that the scientific method is all statistical inference. With very certain observations (the sun rises every morning), we are just left with logical deduction. With uncertain/noisy observations, we are left with statistics, maximum likelihood techniques, Bayesian methods, the need for repeated experiments and the like. And consider Occam's razor (often held up as an important part of the scientific method): probability theory actually provides a mathematical derivation of Occam's razor, via Bayesian model selection (if two models fit equally well, the simpler model will have a higher Evidence value, meaning it's more likely given the data).
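To make the Occam's razor point concrete, here is a small sketch using a toy problem that is not from the original discussion: ten coin tosses giving five heads. The simple model says the coin is fair (no free parameters); the more flexible model allows any bias p, with a uniform prior assumed for p. Both models fit the data equally well at their best (p = 0.5), but the Evidence, i.e. the likelihood averaged over the prior, differs.

```python
from math import factorial

def evidence_fair(heads, tosses):
    """Evidence for the simple model: a fair coin, no free parameters."""
    return 0.5 ** tosses

def evidence_unknown_bias(heads, tosses):
    """Evidence for the flexible model: unknown bias p, uniform prior on [0, 1].

    Integrating p**heads * (1 - p)**tails over p from 0 to 1 gives
    heads! * tails! / (tosses + 1)!.
    """
    tails = tosses - heads
    return factorial(heads) * factorial(tails) / factorial(tosses + 1)

heads, tosses = 5, 10
simple = evidence_fair(heads, tosses)
flexible = evidence_unknown_bias(heads, tosses)
print(f"fair coin:    {simple:.6f}")
print(f"unknown bias: {flexible:.6f}")
print(f"Bayes factor (simple / flexible): {simple / flexible:.2f}")
```

With these numbers the fair-coin model comes out ahead by a factor of about 2.7, because the flexible model spreads its prior over many biases that the data do not support. If the data were instead nine heads in ten tosses, the same functions would favour the flexible model, so the penalty on complexity is not absolute.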

Science as data compression...?
While we've drifted into philosophy-of-science territory, there's another interesting idea that's highlighted by the advent of data-sets too large for a human brain:

One could consider science as a series of attempts at data compression.

Think about this for a moment.

What are we looking for, as scientists? We're looking for generalisations about the area in which we're working. We want to know how metals behave as we heat them. We want to know how the universe expands over time. We want to know how our favourite set of genes interact with one another in different conditions. We can make a vast number of observations about any one of these, but what we're after is a set of rules that tells us how these things behave and we want those rules to be as general as possible.

Once we find (and test) such a rule, we've encoded the essence of all those observations into one (often simple) rule. Think about Newton's law of gravity; it goes a long way to describing the motion of a hundred billion stars in our galaxy, but it's just an inverse-square law with a couple of masses and a gravitational constant. In terms of bits of information, that's a pretty awesome compression factor.

So what? Well, we're talking about the need for methods for converting large data-sets into something more interpretable by a human brain. These are compressions in themselves. So we're using algorithms/statistical methods to partially automate the scientific method. Which leads me to wonder how much more of it we could automate, if we really put our minds to it....

The basics of ... Python

Python (named after the TV series Monty Python's Flying Circus, not the snake) is a high-level programming language that aims to have a clear syntax and, ideally, one obvious way of doing something. This post will look at how it can be used for scientific computing.

The Scientist-Programmer

I also co-run a blog called Programming for Scientists. A theme that's developed over the time we've been blogging there is the idea of the Scientist-Programmer.

I think a big part of the reason for the development of this theme is that I identify myself with it. I've always been a scientist first and foremost, but since starting my research career I've come to have a real love for the craft of coding and it's always been a big part of the work I've done (for reasons you can see below in "The need to get it right!"). I also think it applies to a lot of other scientists nowadays, and our numbers are growing :-)

The nature of the Scientist-Programmer
The Scientist-Programmer is a scientist first and foremost, but one who spends a lot of time coding and (crucially) takes a professional approach to the programming part of their work. Nowadays, there are areas of science that are critically dependent on software (anything to do with simulation or analysis of large data-sets, for example), so it's an aspect that scientists must take seriously and work hard at.

The programming skill-set is as important to modern science as 'wet' laboratory skills, electrical engineering (for building those big physics machines) and maybe even mathematics. Because some areas of science rely on software, it's a huge advantage to have researchers who are expert at both the science and programming, because they understand both the problem domain (i.e. the science) and also how to go about building the required software.

The need to get it right!
Here is why programming is important: computers are inevitable at some level in modern science because of the large data-sets we need to work with. Computers imply the necessity of programming. And if we're programming to do science, our science relies critically on the fact that our code works properly!

Consider the Conference Test: Imagine you're presenting your work at a major international conference. Do you really want to be standing in front of 200 eminent scientists when the world-leader in your field spots that your graph must be off by 10% because of a numerical error in your code? The way to avoid this is to take a professional approach to the software that you need to write.
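One small, concrete piece of that professional approach is checking numerical code against a case where the answer is already known. The sketch below is only an illustration, not a prescription: the trapezoid-rule integrator and the test name are invented for the example, and the check uses the fact that the integral of sin(x) from 0 to pi is exactly 2.

```python
import math

def trapezoid(f, a, b, n):
    """Integrate f from a to b using the trapezoid rule with n intervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

def test_trapezoid_against_known_answer():
    # The integral of sin(x) from 0 to pi is exactly 2.
    result = trapezoid(math.sin, 0.0, math.pi, 1000)
    assert abs(result - 2.0) < 1e-5

# Run the check; an AssertionError here means the integrator is wrong.
test_trapezoid_against_known_answer()
```

A test like this will not catch every bug, but it is cheap to write and it fails loudly long before anyone sees the graph.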

Scientist-Programmer or Programmer-Scientist
Just as you can be a scientist who programs, you can also be a programmer who writes code for science. So, what's the difference? I think there's actually a continuum that runs all the way from "scientist who avoids programming" to "programmer who doesn't work in science" and that both the Scientist-Programmer and Programmer-Scientist are somewhere in the middle. It probably comes down to similar skill-sets but slightly different core interests.

In conclusion...
I've sometimes found it difficult to communicate to other scientists exactly what I am (professionally-speaking :-) ). I think the trouble is that while programming used to be another handy skill that a scientist might pick up in passing, it's grown in significance in recent years and is now much more important. The Scientist-Programmer is the result of this process, a scientist who spends a lot of their time doing science via a computer and strives to be (more or less) a professional-level programmer.

Monday, 6 July 2009

10,000 hours and the Scientist-Programmer

Photo by ** rosa **

The concept of 10,000 hours of effort as a benchmark for becoming an expert has recently become pretty well known. The idea is this: experts are made and not born, and the way that they're made is to accrue 10,000 hours of hard work at the subject in question. This sounds to us like something of a rule of thumb, but it's interesting how many areas of human endeavour it seems to crop up in. Sportsmen, musicians, certainly scientists and programmers (and Scientist-Programmers) and many others seem to require of order 10,000 hours of experience to reach the top of their game.

In this article, we're going to consider how the idea of 10,000 hours relates to you, the Scientist-Programmer.

What is an expert?
The circular definition is "someone who has at least 10,000 hours of experience in a given field". But obviously, that's not very helpful :-) We suspect you can define "expert" in all sorts of different ways, but for the Scientist-Programmer it is someone who routinely produces high-quality code and does so efficiently. We're not aware of any studies of programming productivity in science, but we suspect that expertise helps a lot and that, for the Scientist-Programmer, this translates directly into generating more and better science.

21st Century Scientist

Tuesday, 21 July 2009

Turning data into knowledge

The basics of ... Python

Wednesday, 15 July 2009

The Scientist-Programmer

Monday, 6 July 2009

10,000 hours and the Scientist-Programmer
