Tuesday, 8 January 2013

What is Science?

I think a lot about science and the scientific method.  These are very complex topics, but the more I think about it the more I'm convinced their fundamental basis is very simple.

Science is evidence-based reasoning.

Everything else is implementation detail.

This is a very powerful thought because it's hugely general.  We have a provably unique set of mathematical rules that underpin it (probability theory), the human brain has an excellent track record at generating new scientific ideas, and we're getting better and better at generating new empirical evidence (data).

It also raises an interesting question:  If this is the essence of science, how can we improve our ability to do it effectively?

Food for thought...


Friday, 26 October 2012

Fixing modern healthcare?

(caveat: this post will be fairly UK-centric, because that's the healthcare system with which I'm familiar!)

Modern healthcare is the victim of its own success.  We've been so successful over the last few decades at developing amazing new medical treatments that healthcare systems can no longer afford to give everyone every single treatment that might help their ailments.  

Now, at some level this is a glass-is-half-empty view of the situation.  Just because we can't afford to use every single viable treatment in every single case doesn't mean we're in a bad way.  Quite the contrary.  We live in an age where people are routinely cured of diseases that were fatal a few short decades ago, and modern treatments improve the lives of millions who would previously simply have had to suffer their ongoing conditions.

However, this is no reason to not be ambitious.  We also live in an age of huge innovation, where information technologies such as computers and the internet are radically changing and improving many areas of life.  Why should healthcare be any different?

Here in the UK, the NHS has within it a body called the National Institute for Clinical Excellence (NICE).  Part of NICE's job is to assess the cost effectiveness of different medical treatments.  Because the NHS has finite resources to be shared across the UK population, this is very important - if money is wasted in one area, it's to the detriment of people elsewhere.  Getting the best overall medical return per unit cost is therefore key.  But this raises an interesting question:

What if certain medical costs dropped to zero?

There would be an immediate knock-on effect that the saved money could be used elsewhere in the NHS to treat people (in an insurance-based healthcare system, I suppose what should happen is that insurance costs come down, so more people can afford it).  There's also a more subtle consequence, depending on what the cost in question is - for example, if the cost to test for a given disease drops to zero, we can test everyone for it, as often as we like.  If the disease in question is cancer, say, early detection will also hugely improve survival rates and make the required treatments a lot cheaper (you need a lot less chemo- and radiotherapy if your tumour is small and unmetastasised to begin with).

That would be great, but why on Earth might we expect the cost of anything to drop to zero?  Well, by analogy to what's happened in other industries that have been radically changed by information technologies. The minimum costs to effectively market and distribute books, music and video have dropped essentially to zero - this is why people can self-publish books, release albums without needing a record label, or run entire TV channels over YouTube.  Scientific journals can now be run entirely online, eliminating the need to spend money on marketing or distributing hard copies, leading to new journals that can be run at greatly reduced costs.  And modern search engines let us, for free, efficiently access a reasonable approximation of the sum total of human knowledge.

In medicine, there are a couple of obvious trends that are likely to radically reduce certain costs.  The first is genome sequencing.  The first time we sequenced a human genome, it cost several billion dollars and took a decade and a half.  Currently, it costs about $1000 and takes maybe a week.  Not bad progress considering the (more or less) complete first human genome was only announced 12 years ago!  Post-genomic medicine is taking a little time to arrive, but arriving it is.  And who knows how huge the impact will be when we can routinely sequence the DNA of not only every person, but every pathogen that they have.  Immediately.  For minimal cost.

The second trend is machine learning.  Medical diagnostics and prognostics can be very expensive.  But what if we can use all the easily available information that we have on a given patient (from their history, previous medical notes, and cheap-to-administer measurements) to spot diseases early and to identify which treatments are likely to be most effective?  There may be a whole range of diseases for which we can test at essentially no cost, if we marshal the information appropriately and build effective algorithms to analyse the data.  And many diseases have a range of possible treatments available - what if we can use the information we know about a given patient to accurately predict which treatment will work best for them?  This is personalised medicine, but it could also be very cost-effective medicine.

Of course, this is all pretty speculative.  But maybe it's something on which we should be focusing.  Anything medical that has negligible cost is something from which everyone in the world can benefit, as well as freeing up resources to help elsewhere.  And this is a pretty worthwhile goal.

Tuesday, 16 October 2012

The Sage Bionetworks - DREAM Breast Cancer Prognosis Challenge

I've been competing in the  Sage Bionetworks/DREAM Breast Cancer Prognosis Challenge.   The submission deadline was a few hours ago (early start for those of us in the UK!), so I thought now was a good time to share some of my thoughts on what has been a very interesting experience.  I enjoyed it a lot and I think the folks at Sage Bionetworks are really onto something with this as a concept.

The goal of the challenge is to develop machine learning models that can predict survival in breast cancer.  We've been given access to a remotely-hosted R system on which to develop our models, and (on said system) use of molecular and clinical data from the Metabric study of breast cancer.  We run our models on this remote system and they're scored using concordance index, a nonparametric statistic for survival analysis that is sensitive to the ranking of predictions.
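To make the scoring a bit more concrete, here is a minimal sketch of a concordance index in Python.  This is not the challenge's own scoring code (that ran on the remote R system and I haven't reproduced it here); it's a simplified Harrell-style version, and the function name, argument names and the choice to skip tied survival times are all my own assumptions for illustration.  The idea is to look at every pair of patients where we know who died first, and ask how often the model gave the higher risk score to that patient.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Simplified Harrell-style concordance index (illustrative sketch).

    times       -- observed follow-up time for each patient
    events      -- 1 if death was observed, 0 if the patient was censored
    risk_scores -- model output; higher means predicted to die sooner
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # assumption: tied times are skipped in this sketch
        # a is the patient with the shorter observed time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so we can't say who died first
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # model ranked this pair correctly
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Made-up toy data: four patients, the third one censored.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.7, 0.4, 0.1]
score = concordance_index(times, events, risks)
```

Only the ordering of the risk scores matters: multiplying every score by 100 gives exactly the same result, which is what "sensitive to the ranking of predictions" means in practice.  A model that predicts the same risk for everyone scores 0.5.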

What makes this challenge a bit different is that it's both competitive and also collaborative.  Not only are we competing against one another to get the best-performing model, but once someone has submitted a model, I can download it and inspect their code to see how it works!  This is very ambitious (and certainly not without its issues), but aims to create a hybrid competition/crowdsourcing approach that can produce very strong solutions to the scientific goal of interest.

Having put a lot of hours into working on this challenge over the last few months, I have developed some opinions on it.  So, in no particular order, here are my thoughts on the challenge:


  1. Incentives for academics.  In addition to some small financial prizes along the way, the big prize on offer is co-authorship on a journal paper.  This is a very big incentive for academics (such as myself) who want to compete in a challenge like this.  I'm very enthusiastic about the whole concept and would love to join in with future challenges.  However, in order to justify spending my work time on it, there needs to be some kind of academic return.  Co-authorship fits the bill nicely!  Currently, I think the top couple of teams (?) get this prize, but I think extending this would be a good plan.  Certainly, my experience in this challenge is that there are many more than 2 academic teams who have contributed significantly to the success of the challenge.
  2. Incentives for non-academics.  Of course, it's also important to have rewards on offer for non-academics.  The real strength of such a challenge comes from having a diverse community of competitors.  I presume the small financial prizes are nice in this regard; it'd be really interesting to hear from some of the non-academic competitors what their views are on this.
  3. Sharing of code.  This has been a very innovative (and brave!) aspect of the challenge.  I don't think the organisers quite nailed every aspect of it, but I think the general approach is very powerful and certainly worth persisting with.  I wonder if the sharing should be constrained in some way - perhaps code can only be accessed 48 hours after it is submitted?
  4. Blitzing the leaderboard?  In this challenge we could make as many submissions as we liked to the leaderboard (of which I'm as guilty as anyone :-) ).  This worries me as it could lead to a lot of over-fitting.  Maybe in future challenges there should be a limit - say 5 submissions per day?
  5. Challenge length.  In total it was approx 3 months long.  2 - 3 months feels about right to me.
  6. Competitive vs. collaborative.  Another research model that's relevant here is the Polymath Project.  Essentially, one can imagine a sliding scale between competition and collaboration.  Polymath lives at one end, with things like the Netflix Prize and Kaggle competitions at the other.  This challenge lives somewhere in the middle.   I like the idea of blending the two concepts.
  7. Populations of ideas vs. monoculture.  A competition is great for generating a wide range of ideas.  Once people start sharing, I expect (as happened in this challenge) the pool of ideas tends towards more of a monoculture.  
  8. Building an ongoing community.  This challenge has been a great way of starting up a research community (a smart mob :-) ).  It would be great to harness this community on an ongoing basis.
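On the over-fitting worry in point 4, a toy simulation shows why unlimited submissions can be a problem.  This is only a sketch under made-up assumptions: the "models" here are pure coin-flippers scored by accuracy on binary labels (not concordance index on survival data), and the function name and parameters are hypothetical.  Every model has a true accuracy of 0.5, yet the best leaderboard score creeps upwards as more of them are submitted.

```python
import random


def best_of_n_leaderboard_score(n_submissions, n_test_cases=200, seed=0):
    """Best leaderboard accuracy among n purely random 'models'.

    Each model guesses binary labels by coin flip, so its true accuracy
    is 0.5; any score above that is luck on this particular test set.
    """
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_test_cases)]
    best = 0.0
    for _ in range(n_submissions):
        guesses = [rng.randint(0, 1) for _ in range(n_test_cases)]
        accuracy = sum(g == y for g, y in zip(guesses, labels)) / n_test_cases
        best = max(best, accuracy)
    return best


single = best_of_n_leaderboard_score(1)
blitz = best_of_n_leaderboard_score(500)
```

Comparing `single` with `blitz` gives a feel for how much of a leaderboard lead can be down to the number of attempts rather than the quality of the model.  A daily submission cap, or a final score on a held-out validation set, limits how far this effect can go.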

Sharing code

Sharing code means sharing ideas, and this has allowed us to benefit from each other's ideas during the challenge.  I'm sure this has led to better overall results.  However, it also has some quirks that might need tweaking.

First is the phenomenon of 'sniping'.  Someone else can spend a month developing an awesome model, but once it's been submitted to the leaderboard I can download it straight away, spend 30 minutes applying my favourite tweak and then resubmit the (possibly improved) new model, jumping above the hardworking other competitor on the leaderboard.  Of course, overall this leads to better models, which is the collective aim of the challenge.  But I think care needs to be taken to ensure that credit (and reward in general) is given where it's due.  It can be a bit dissatisfying when this happens to you!

The other consideration is that after a while of sharing models, we end up with a monoculture.  Examinations of the high-ranking models over the last couple of weeks show that almost all the models are based on those of the Attractor Team (with some chunks of my own code scattered around, I was gratified to see!).  This is probably not surprising, as the Attractor Team won both the monthly incremental prizes, but it's probably an indication that we've got about as far as we can with the challenge when this happens.  Now is probably a good time to stop :-)

So, what would I change?  I might suggest something a bit more like the following:

A possible model for future challenges

The 21st Century Scientist Speculative Future Challenge (21SFC) would look like this:

Stage 1 (initial competition) - a month long competition to top the leaderboard.  No-one can access other people's code and at the end of the month, a prize is awarded on the basis of a held-out validation set.  After the deadline, all code for Stage 1 is made available.

Stage 2 (competition/code sharing) - another month long competition to top a new leaderboard.  Everyone has access to the Stage 1 models, but Stage 2 code is either unavailable or only accessible 48 hours after it has been submitted.  At the end of the month, a prize is awarded on the basis of a held-out validation set.

(it might be worth re-randomising the training, test, validation sets for stage 2)

Stage 3 (collaboration) - A non-competitive stage.  The aim here is to work as a team to pull together everything that has been learned, produce one (or a small number of) good, well-written models and to publish a paper of the results.

The author for the paper is "21SFC collaboration", with an alphabetised list of people given.  There can be different ways to qualify for authorship:
  • Placing in the top-n in either stage 1 or 2
  • Making significant contributions in stage 3 (the criteria for this would need to be established)

This structure uses initial competition to generate a lot of good ideas, then uses a second stage of competition to combine/evolve those ideas.  It then has a final, collaborative phase where everyone who wants to can pull everything together to produce, publish and release the code for the challenge's solution to the problem.  The challenge doesn't take too long to complete and the contributors get rewarded for their efforts in various ways.

--------

This post has turned into a long one and I hope I've communicated the intended positive tone.  I enjoyed the Sage/DREAM BCC a great deal and I think this is a hugely powerful way of getting answers to scientific problems.  I'm certainly going to take a look at whatever the next challenge is (I know there are some in the pipeline) and I would certainly recommend you do the same.

Tuesday, 17 July 2012

Open access science

The UK's publicly-funded scientific research is going open access.

This is a Very Good Thing.  And I'm quite impressed that the UK government has been pretty decisive about this; I was emailed by the MRC (my main funder) a while ago, telling me that publishing in an open access way was now a condition of funding.

Leaving aside the potential practical complications in making this happen (which other people have considered more than I have),  why is this such a good thing?  There are a number of reasons.


Firstly, there's the basic consideration of who's paying.  In this case, it's the tax-paying UK public.  So it seems entirely reasonable that they should have access to the research that they've paid for.  It actually seems faintly ridiculous that this was ever not the case (although this was because pre-Web, the cost of distribution wasn't insignificant).

Secondly, and very importantly, it accelerates the pace of scientific research (which relates to this blog post).  When I write a scientific paper, I want it to be accessible as rapidly as possible to as many people as possible, and as easy to access as possible.  This means that my research can be read, assessed and acted upon as rapidly as possible, which means that people can benefit from my work and/or find ways to refine the ideas therein.  Faster is better.

Thirdly, there's a very important consideration of public engagement with science.  Science and scientific research are getting more and more complex with time, both because we've already discovered a lot of the easy stuff and also because we come up with progressively more clever ways in which to advance.  This is great, but it does mean that it's increasingly difficult for the non-specialist to understand a lot of the good science that goes on.  There are all sorts of efforts going on to try to address this, but one really good one is to try to make sure that anyone can access any piece of research.

My anecdotal view is that governments are in general pretty rubbish at having a clue about internet-related developments.  Politicians are very busy people, making it hard to keep up with developments.  I also suspect that the demographic profile of the current generation of politicians means there is a high proportion of Hapless Techno Weenies.  But in this instance, they seem to have correctly identified an important internet trend and acted on it.  Kudos for that.

Tuesday, 10 July 2012

Rapid Research Prototyping

Does research have to be this slow?

I find the pace of a lot of research projects frustrating.  Sure, some things just take time, but I have become increasingly suspicious that a lot of bottlenecks in research are problematic simply because we haven't taken the time to find a faster way of doing something.

This can be very challenging.  Running experiments and collecting data, for example, can be a lengthy and painstaking process.  Data analysis, simulations/modeling, discussions with collaborators and even writing the paper can all take many months to complete.  And of course we try to be efficient, but should we really care so much if things take a while to get done?


Yes.


And here's why.


Consider what science is.  Science is the generation of new and improved memes for describing the natural world, whose fitness is judged using empirical evidence.  It's a memetic process.  This means that we should be thinking in terms of evolution of ideas.  And one of the key ways in which we can accelerate any evolutionary process is to shorten the generation time.  Because the sooner you get new research into the public domain, the sooner other people can benefit from it and the sooner you can get feedback.


There is also a second key point here.  Scientific ideas gain most of their value from being tested by other people.  And this can't happen until they've been released into the public domain.  We should be thinking of our newly-minted science meme as no more than a prototype that needs to be poked and prodded by as many other people as possible before it can even start to be thought of as being robust.
The idea of spending years or decades on a scientific magnum opus is the wrong plan; getting your work into the public domain is everything.


This absolutely does not mean lowering our standards with regards to quality; there is so much research being produced nowadays that we need to avoid drowning each other in mediocre research.  But a single researcher or group can only make a science meme so good.  Beyond a certain point, your idea needs to be tested by other people.  At that point, faster is better.  Much, much better.


What form, then, should science take in the 21st century?  It should be about rapid research prototyping - the production and dissemination of new high-quality prototype science memes, as rapidly as possible.

Publish early, publish often.  And optimise your bottlenecks.