21st Century Scientist: June 2009

Monday, 29 June 2009

The point of clever methods

The phrase "clever methods" is my label for statistical methods and/or algorithms that go beyond basic and/or standard approaches (which I think of as "vanilla methods"). Those of us whose research involves methodological work aim to write papers that detail new clever methods. Clever methods aim to go beyond the capabilities of the relevant vanilla methods in some meaningful way, as well as hopefully doing so without becoming intractably complicated or slow to run. In the same way that software engineers might be trying to craft better software for a given task, in researching clever methods we're trying to find better mathematical/statistical/algorithmic ways of doing something.

(By way of full disclosure, I should mention that I'm a fan of clever methods in that I really enjoy working on them and finding cunning and sneaky ways to make a method work better. This is great for motivation, but comes with the health warning to be careful to not make something more complicated just for the sake of it :-) )

Performance vs. complexity
Clever methods tend to be more complex than the equivalent simple methods. So, our goal in researching clever methods is often an attempt to trade off complexity for performance. The trick then becomes to minimise the increase in complexity, while maximising the improvement in performance. This is the benefit that most vanilla methods possess: they provide reasonable performance in a very uncomplicated way.

In many cases where reasonable performance is all we need, this is a very good solution. For example, if you're trying to detect local stars in an astronomical image, the signal-to-noise ratio of your image might be very high. In that case, even a basic method should be able to detect them all with little problem, meaning that such a choice will do everything you require and do so in a simple and easy-to-understand way (which is often a hidden benefit of vanilla methods).

When creating clever methods, it's very easy to ignore the complexity aspect and simply go all-out for performance (there are many, many papers for which this is true). While this can be okay if performance is so vital (relative to handling the complexity), usually this leads to methods that are so narrow in their application that they're not very useful.

Happily, there are also many cases where a little extra complexity gives you a significantly better method with which to work. And there can be other benefits; for example, if you generalise a vanilla method, the resulting clever method will probably be more complex but it may also be more reliable or allow the automation of some parts of its use. Consider a clustering method that has a well-defined, automated way for choosing the number of clusters into which to partition the data; the user no longer has to worry about doing this, thus saving them time. They probably don't care that the underlying maths is more complicated.

And of course very occasionally, you'll manage to create a clever method that's no more complex (or in extreme cases, less complex) than the simple method/s. Congratulations, you've probably discovered something genuinely important!

The 10% improvement
Often, clever methods can provide an improvement of order 10% (in whatever metric is important) over the simple method. The question then becomes, "is this worth the effort?"

The answer is "it depends". If you're trying to extract a signal from some noise, but the signal-to-noise ratio (SNR) is already 10^5, then increasing it by 10% may well be irrelevant. If on the other hand you're trying to detect signals right at the detection limit of your data, then it might be vital in uncovering that Nobel-winning new class of whatever. I've worked on astronomical source extraction where the data-set has had an effective cost of tens of millions of pounds (actually quite common when the data come from a space telescope). In this case (and assuming Gaussian noise, so that SNR grows only as the square root of the amount of data), a 10% improvement in SNR using the simple extraction methods would require roughly 20% extra data, at a cost of millions of pounds. Or you can just use the clever extraction algorithm.
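The arithmetic behind that figure can be sketched in a few lines. This assumes independent Gaussian noise, so that SNR scales as the square root of the amount of data; the function name is purely illustrative.

```python
def extra_data_fraction(snr_gain: float) -> float:
    """Fractional increase in data needed for a given fractional SNR gain.

    Assumes SNR is proportional to sqrt(amount of data), as for
    independent Gaussian noise. This is an assumption of the sketch.
    """
    return (1.0 + snr_gain) ** 2 - 1.0


if __name__ == "__main__":
    gain = 0.10  # the "10% improvement" discussed above
    print(f"SNR gain of {gain:.0%} needs {extra_data_fraction(gain):.0%} more data")
```

Under that assumption a 10% gain corresponds to about 21% more data, which is where the "roughly 20%" comes from.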

One very important consideration in all of this is that if you develop a clever method that is reasonably general, then that 10% improvement will be a benefit many, many times.

The undiscovered country
So far I've focused on the mundane benefits of clever methods. There is also another aspect to consider. If your method of analysis is too simple, you might miss something important.

Think about a very rich, complex data-set where it's not obvious how to model the data. Gene expression measurements of whole genomes are a good example. We can certainly use simple methods to analyse this and to get some useful scientific results. But what if there is structure in the data to which our choice of method is insensitive? Imagine what would happen if you only fitted your data with straight lines! You'd miss peaks, troughs, oscillations and all manner of other interesting structure in your data. Your methods need to be able to account for all the interesting structure in given data-set and if that structure is complex, a vanilla method may well miss it.
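As a toy illustration of the straight-line example, here is a minimal sketch that fits a line to synthetic data containing an oscillation and then inspects what is left over. The data, the function names and the choice of lag-1 autocorrelation as a diagnostic are all assumptions made for the sake of the example; the data are noise-free so the result is deterministic.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def lag1_autocorrelation(values):
    """Correlation between each value and its successor."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values)
    cov = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    return cov / var


# Synthetic data: a linear trend plus an oscillation the line cannot represent.
xs = [0.1 * i for i in range(200)]
ys = [0.5 * x + 2.0 * math.sin(x) for x in xs]

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# Residuals from an adequate model should look like unstructured noise.
# Here they are large and strongly correlated from point to point.
print("largest residual:", max(abs(r) for r in residuals))
print("lag-1 autocorrelation of residuals:", lag1_autocorrelation(residuals))
```

The fit itself runs without complaint. It is only by looking at the residuals that you see the method was insensitive to most of the structure in the data.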

A related point is that a good way of spotting complex patterns can be to use your eyes to look at the data. There are many examples where the best signal detection method is a person (eg. objects in an image, CAPTCHAs). But this doesn't work if your data-set is too big for a person to meaningfully do this. In this case, you need a clever method.

Clever methods as their own research discipline
There is justification for creating clever methods simply because doing so adds to the sum total of human knowledge. This is especially useful when that clever method extends an existing method and/or when it can be further built upon by you or other people. Whole new areas of methodology can be uncovered in this way, whether through being created or through making some existing ideas more widely known. And often reading a clever idea in one context can spark a thought in someone's mind about their own area of research (this is why it's very important to be well-read as a researcher).

If, like me, your research involves creating new clever methods, a burden of proof falls to you. Because there are infinitely many clever methods one could create, it's important to find the ones that are actually useful (defined as having superior performance to the vanilla methods, at the very least). And this means that you need to test your methods and compare them to other existing ones. This is actually one of the real tricks of methodological research; figuring out as many ways as you can to test a new method, to see if it's worth using. A few things I think are really important for this include:

Test in many different ways
Test on many different data-sets
Test using many different metrics
Testing on synthetic data can be good because you know the right answer
Testing on real data is very important; real data will always contain more junk than synthetic data
Real data where you know the right answer (eg. from some other source of information) are wonderful to have
Realistically simulated data can be very useful. But it takes a lot of effort to build a software/hardware simulation of most types of data
Use your methods. Do some science with them (or help other people to do so), because in the process you'll learn more about how the methods work and how to improve them
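A few of the points above (several data-sets, several metrics, synthetic data with a known answer) can be sketched as a tiny test harness. Everything here is a hypothetical stand-in: the two "methods" are just the mean and the median as estimators of a known true value, and the two synthetic data-sets are clean Gaussian data and Gaussian data contaminated with outliers.

```python
import random
import statistics

TRUE_VALUE = 0.0  # the known right answer for the synthetic data


def clean_sample(rng, n):
    """Gaussian data centred on the true value."""
    return [rng.gauss(TRUE_VALUE, 1.0) for _ in range(n)]


def contaminated_sample(rng, n, outlier_fraction=0.1):
    """Same as clean_sample, but a fraction of points are junk."""
    return [
        rng.gauss(10.0, 1.0) if rng.random() < outlier_fraction
        else rng.gauss(TRUE_VALUE, 1.0)
        for _ in range(n)
    ]


def evaluate(method, make_sample, trials=500, n=50, seed=1):
    """Run a method on many synthetic data-sets and report two metrics."""
    rng = random.Random(seed)  # fixed seed so the comparison is repeatable
    errors = [method(make_sample(rng, n)) - TRUE_VALUE for _ in range(trials)]
    bias = statistics.fmean(errors)
    rmse = statistics.fmean([e * e for e in errors]) ** 0.5
    return {"bias": bias, "rmse": rmse}


if __name__ == "__main__":
    methods = {"mean": statistics.fmean, "median": statistics.median}
    data_sets = {"clean": clean_sample, "contaminated": contaminated_sample}
    for data_name, make_sample in data_sets.items():
        for method_name, method in methods.items():
            result = evaluate(method, make_sample)
            print(f"{data_name:>12} {method_name:>6} "
                  f"bias={result['bias']:+.3f} rmse={result['rmse']:.3f}")
```

Even in a toy like this, which method looks better can depend on which data-set and which metric you pick, which is the reason for testing in as many ways as you can.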

In conclusion...
The creation of clever methods is a craft, a balancing act between performance and complexity. But the right method in the right context can be a powerful solution and even open up whole new areas of research.

Monday, 22 June 2009

Doing it for yourself - deciding whether to use someone else's code

Using someone else's code can be great. Or it can be horrible. If it does exactly what you need it to, doing so with no bugs and no ambiguity, then this is awesome. If you need to make changes to buggy, uncommented code written by someone who thinks the GOTO statement "really isn't that bad", then we don't envy you one bit.

So, here are some of the things you need to consider when deciding whether or not to use someone else's code.

Wednesday, 17 June 2009

Do you work on important problems?

I recently read the transcript of a talk given by Richard Hamming and realised many of us may have been missing a trick. He makes the point that we should aim to work on the important problems in our field. This is one of those ideas that seems obvious once you read it, but is somehow easy to overlook during the bustle of day-to-day research.

Why is it important to work on important problems?
Imagine you have a miracle year of work. You're in the zone 100% of the time, every hunch you have turns out to be right and every project you touch turns to gold. Now consider the difference in the impact of your work depending on whether you had been working on important problems or unimportant ones. In the first case, you might have done truly great, maybe Nobel-worthy work. In the second case, you've still done good work but it's not going to change the world. Which would you rather have happen?

What if you work on unimportant problems?
I'm not suggesting that you should exclusively identify important problems and only work on them. After all, you can't be 100% sure your list of "important" is complete and/or completely accurate. However, if all you work on are problems that are unimportant, by definition you limit the impact and value that your work will ever have. Unimportant problems are just that. By all means spend a bit of time tinkering with such problems if they really interest you, but don't waste your career on them.

What are the important problems in your field?
This really boils down to being able to identify what the important problems are in your field/s. This is more difficult than one might imagine, but with some time and thought you can make some headway. Take time to think about what you consider the important problems to be. Read some articles for inspiration. See if any other academics in your area have posted on this topic on the Web. And ask people! A great question over coffee or at a conference dinner is, "What do you think the important problems are in our field?".

Some of these problems will be intractable. Finding an exact O(n) solution to the travelling salesman problem would be awesome, but seems unlikely. And formulating a Grand Unified Theory of physics would be nice, but many people have tried and no-one has yet succeeded. It can be tricky to distinguish between problems that are difficult and ones that are (or are likely to be) impossible to solve. There are judgement calls to be made here as to how much time to spend on each problem.

And then there are the problems about which we're uncertain. Is it important or not? Again, this is a judgement call. If the problem is interesting to you, plus you think you can make good progress on it quite quickly, then it's probably worth working on just in case it's important.

Nowhere so far have I mentioned a couple of other considerations that I think are also important. You should probably work on problems that inspire/enthuse you. And you should work on problems to which you're suited, in terms of abilities, skills and temperament. If you're no mathematician, you shouldn't be trying to work on the Riemann hypothesis. And if you thrive on a sense of rapid progress and individuality, perhaps it's not such a great idea to work on that huge physics project that won't begin generating data for 5 more years.

In conclusion...
It's not the only consideration, but deciding on the important problems in your field and working on them is a pretty good starting point for scientific research.

Monday, 15 June 2009

Why start a research blog?

As I've just started a research blog, I thought "Why do this?" was a sensible question to contemplate. (Actually, I thought about it before I started, which seemed even more sensible...)

Sharing some thoughts...
Sometimes the best way to learn or come up with good new ideas is to talk to a fellow scientist. This is why chats over coffee and meeting up at conferences are so important (and why a few glasses of wine at a conference dinner can lead to valuable conversations). This works because of the ideas that are being shared.

There's no reason why talking should be the only medium through which to communicate in this way. And while a blog is less two-way (although please feel free to leave a comment!), it has the great advantage of being able to potentially reach a huge audience. It's difficult to have a conversation with more than a few people, and even a lecture/conference talk is unlikely to have an audience of more than a couple of hundred people (web-casts excepted). But there's nothing to stop thousands upon thousands of people reading a blog post!

Knowing your own mind
There's also a second benefit to writing one's ideas down in a blog post. It helps to clarify them. Writing something for an audience forces you to think about what you're writing, to work it into a form that will be understandable by the reader, even to challenge your own assumptions. This can develop your own ideas and trains-of-thought in ways that simply thinking about them won't manage.

Returning the favour
Another very important reason is that I've found other people's research blogs very helpful in learning more about how to be a scientist (see my blogroll for some excellent examples). Not every blog will be useful for every reader, but it strikes me that if every researcher kept a blog where they posted their thoughts, observations and experiences, collectively that would form a great body of knowledge for other people to explore.

In conclusion...
There are several great reasons why a scientific researcher should consider keeping a research blog. And there may well be others that I've not thought of yet, but that will occur as I post on more topics. Which is itself an illustration of why research blogging is a good idea :-)

Saturday, 13 June 2009

Science in the 21st century

The nature of scientific research changes with time. Each epoch has its own particular characteristics, the result of a blend of factors ranging from our current state of knowledge and available technologies to the particular needs of our society and the world at that time. The early 21st century (ie. "right now") is no different.

Big data
The sum total of human data is growing more-or-less exponentially and scientific data are no exception. A decade ago, gigabyte-scale data-sets were the sort of thing the large physics experiments were producing. Currently, scientists talk about terabytes quite happily and it's far from stupid to be talking about peta- and exa-scale data-sets. After all, we'll be able to handle that size of data routinely in the next decade or two (organisations like Google probably already do so).

The key point about big data is this: modern scientific data-sets are typically too large to fit in a human brain.

Think about that for a moment. As soon as you can't fit all your data in your brain at once, you need to start doing something new or you're going to have to start throwing data (and hence information) away. This leads to whole new areas of research into how to handle any given type of big data.

Computers
Computers can do things that people cannot. Ever tried adding a million numbers together in a fraction of a second? Exactly.

There are two ways of looking at this. The first is that we need computers because otherwise we would be unable to handle the Big Data we are now generating. The other is that computers give us possibilities that didn't exist before. For example, there are many applications of Bayesian statistical inference nowadays that were always technically possible, but were simply impractical in terms of the amount of computation required. That is often no longer a problem.

Computational science (ie. doing science using a computer) has become a whole distinct area of scientific research, which means that computing skill-sets have become valuable in a scientific context. In much the same way that scientists with specialist lab skills, mathematical skills, electrical engineering skills (eg. building the big physics experiments) and the like are vital to modern science, this is also true of scientists with specialist programming and other computer skills.

The flow of knowledge...
How rapidly scientific knowledge flows is key to the rate at which science advances. A world-changing idea will likely only do so once it's reached a substantial number of people. The Internet has become a game-changer for this. 20 years ago, a new paper would only typically become available when the hard copy of the journal reached your university library. Now I can scan the abstracts of a hundred pre-prints a day over a cup of coffee, via an RSS feed, months before they appear in the journals. A literature search that might occupy days of library time can now get under way in seconds via Google Scholar and the websites of other academics, and be completed in short order via downloaded PDFs. And even if I can't make it to a conference, there's a good chance I can access the slides online (or even see a webcast of the talk) and email the speaker if I have any questions.

All of this removes overheads from the process of learning about new scientific knowledge. And that makes a big difference to the amount of science you can get done.

Interdisciplinarity
In some sense, science has always featured what we now call "interdisciplinarity". Some of the best new ideas simply span more than one discipline. However, it seems to me that this is particularly true right now. The body of scientific knowledge has become large enough that no one person can know even a moderate proportion of it. This means that 'Eureka!' moments involving ideas from different disciplines are harder to find. So it's become very important to have people who are expert in one discipline who go and talk to people in other disciplines. This even extends to interdisciplinary centres, which have the benefit of putting people from different disciplines in the same office/meeting/seminar on a daily basis. A lot of science is driven by the conversations you have over coffee...

Advancing rapidly...
If I had to pick one thing to characterise modern science, it would be the rapidity of its advance. Driven both by the speed of communication and by the rate of improvement of underlying technologies (eg. computers, the cost-per-bit of generating useful data), we're making new discoveries at an amazing rate. And one of the most striking features is the speed with which new discoveries can be applied to, for example, new technologies - consider Moore's law for an obvious example.

The need for multiple skill-sets
Science is a large field, nowadays. Gone are the days when a gentleman scientist could be the master of all disciplines. Today there are many distinct specialisms, each of which benefits from (often requires) professional-level skill-sets. For example:

chemical/biological/physics lab skills
software engineering
electrical/mechanical engineering (eg. building the big physics machines such as telescopes, particle accelerators)

This leads to there being real value in multi-skilled scientists. For example, not just a scientist who can write a bit of code, but a scientist who is also a professional (or near-professional) level software engineer. Or not just a physicist who knows some things about electrical circuits, but one who could just as easily earn a living as an electrical engineer.

Weight of numbers....
I don't have any concrete numbers for this, but my guess is that we have more scientists now than at any point in history. There are several reasons for this intuition. Firstly, the world's population is bigger than it's ever been (and growing...). Secondly, more countries have developed economies to the point where they can afford significant programmes of scientific research. Thirdly, there are big private companies that have programmes of scientific research.

I would love to see some properly researched numbers on this. And I wonder what a graph of total-science-budget versus time would look like for the world as a whole...

In conclusion...
In the 20th century, science discovered some of the fundamental laws of nature (eg. relativity and quantum theory), developed the standard models of cosmology and particle physics, unravelled the secrets of life (DNA) and beat the majority of infectious diseases (antibiotics). We know how to turn lead into gold (in a nuclear reactor) and while eternal life is trickier, the UK's life expectancy has risen by 30 years over the course of the last century. And science is now progressing faster than it did in the last century (maybe a lot faster).

Anyone else excited by the possibilities...?

Wednesday, 10 June 2009

Stay on target! Ways to help yourself work

In a previous post we talked about how to keep your brain in tip top condition and stay in the 'zone'. We recommended some simple techniques like removing distracting email, IM or twitter traffic, but sometimes the problem isn't staying in the zone, it's getting into the zone in the first place. In this post we recommend some tools and techniques to help you get going. Everybody has different techniques so we'd also like to hear from you about how you beat procrastination and get working.

Monday, 8 June 2009

Welcome!

This is my research blog about conducting scientific research in the 21st century. I'm going to be writing on any and all topics that I think are interesting/relevant to modern scientific research. I hope you'll find some articles that interest you!

21st Century Scientist

Monday, 29 June 2009

The point of clever methods

Monday, 22 June 2009

Doing it for yourself - deciding whether to use someone else's code

Wednesday, 17 June 2009

Do you work on important problems?

Monday, 15 June 2009

Why start a research blog?

Saturday, 13 June 2009

Science in the 21st century

Wednesday, 10 June 2009

Stay on target! Ways to help yourself work

Monday, 8 June 2009

Welcome!
