Wednesday 6 October 2010

Data-intensive Science

I've recently been noticing a number of articles and blog posts on "data science" (or "data-intensive science") and a speculated fourth paradigm for doing science (the first three being experimentation, theory and, more recently, simulation).

While there is debate about the precise terminology, there is definitely a new approach to science being used in various fields. It looks something like this:

1 - design a big, powerful experiment that will make measurements over a whole region of scientific parameter space
2 - run the experiment, reduce/process the data and make it available to people
3 - mine the data for interesting science
4 - (optionally) follow up these discoveries with new experiments

This is already the norm in big astrophysics and particle physics projects and works very well. There are also a number of medical/biological projects heading in the same direction.

So why is this happening? Short answer: because it's a good way of making new discoveries.

Longer answer: it's the result of two key drivers. One is that, in the field in question, someone has invented one or more measurement techniques capable of generating huge volumes of useful data. The second is that we have computers and machine learning/statistics algorithms capable of extracting useful information from such large data sets.
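
By way of illustration, here's a minimal sketch of the kind of data mining the second driver enables. It's Python with numpy and scikit-learn (my choice of tools, not anything from a particular project): score every object in a made-up catalogue with a standard outlier-detection algorithm and flag the strangest ones for a closer look. All the features, sizes and thresholds are invented.

    # Illustrative only: a synthetic "catalogue" of a million objects with
    # two measured features, plus a small admixture of oddballs to find.
    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(42)
    normal = rng.normal(loc=0.0, scale=1.0, size=(1_000_000, 2))
    oddballs = rng.normal(loc=6.0, scale=0.5, size=(200, 2))
    catalogue = np.vstack([normal, oddballs])

    # Score everything with an isolation forest; the lowest-scoring objects
    # are the anomaly candidates worth following up (step 4 above).
    model = IsolationForest(contamination=1e-4, random_state=0)
    model.fit(catalogue)
    scores = model.score_samples(catalogue)
    candidates = np.argsort(scores)[:200]
    print(f"{len(candidates)} objects flagged for follow-up")

The point isn't the particular algorithm; it's that once the data exist in bulk, this sort of systematic trawl becomes cheap to run and rerun.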

This approach has several advantages.

1 - it can be systematic (astrophysicists have discovered whole new classes of celestial object simply by surveying the whole sky to a given sensitivity)
2 - it can be statistically very powerful (sheer volume of data can give small error bars and good signal-to-noise; there's a tiny numerical sketch of this after the list)
3 - there's a wisdom-of-the-crowds aspect to having many scientists working on the same data set (and if the data set is rich enough, it's worth having many people working on it)
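
To put a number on point 2, here's a tiny sketch (values entirely arbitrary) of how the error on a measured mean shrinks roughly as 1/sqrt(N), which is why sheer data volume translates into small error bars.

    # Repeatedly measure a fixed quantity with Gaussian noise and watch the
    # standard error of the mean fall as the sample grows.
    import numpy as np

    rng = np.random.default_rng(0)
    true_value, noise = 1.0, 0.5

    for n in (100, 10_000, 1_000_000):
        sample = true_value + rng.normal(0.0, noise, size=n)
        print(f"N={n:>9,}  mean={sample.mean():.4f}  "
              f"expected error ~ {noise / np.sqrt(n):.5f}")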


Some interesting links on this sort of thing:
