Monday, 31 March 2014

Big Data - in the process of growing up

A very good article was published in the FT online a few days ago. Its title is 'Big data: are we making a big mistake?', and it's a commendably well-thought-out discussion of some of the challenges of Big Data. And perhaps a bit of a warning not to get too carried away :-)

You should go and read the full article, because it's got loads of great stuff in it. A few of the concepts that particularly leapt out at me were the following.

One of the interesting things about modelling data in order to make predictions (as opposed to explaining the data) is that features which are correlated with (but not causal for) the outcome of interest are still useful. But the FT article makes the really good point that even in this case, correlations can be more fragile than genuinely causal features. While a cause is likely to remain a cause, correlations can more easily change over time (covariate drift). This doesn't mean we can't use correlations to inform predictions, but it does mean that we need to be much more careful, and aware that the correlations may change.
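
Here's a toy simulation of that fragility (my own sketch, not from the article): a feature x predicts the outcome y only because both are driven by a hidden cause z, and a model trained on x degrades as the x-z relationship drifts.

```python
# A toy simulation of covariate drift. The feature x is only useful
# because it shares a hidden cause z with the outcome y; as the x-z
# relationship drifts, a model trained on x quietly falls apart.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, drift=0.0):
    z = rng.normal(size=n)                             # hidden cause
    y = z + 0.1 * rng.normal(size=n)                   # outcome driven by z
    x = (1 - drift) * z + drift * rng.normal(size=n)   # correlated feature
    return x, y

# Fit a simple least-squares line y ~ x on pre-drift data
x_train, y_train = make_data(10_000)
slope, intercept = np.polyfit(x_train, y_train, 1)

for drift in (0.0, 0.5, 1.0):
    x_test, y_test = make_data(10_000, drift)
    mse = np.mean((y_test - (slope * x_test + intercept)) ** 2)
    print(f"drift={drift:.1f}  test MSE={mse:.2f}")
```

A model built on z itself would be immune to the drift; the one built on x fails silently, which is exactly the danger the article describes.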

The article also discusses sample variance and sample bias. This is in many ways the crux of the matter for Big Data. In principle, very large data sets offer us the chance to drive sample variance towards zero. But they have much less to offer in terms of sample bias, and indeed (as the article points out) many of the largest data sets, because of the way they're generated, are actually very vulnerable to high levels of sample bias. This is not to say that one can't have both low variance and low bias (the large particle physics and astrophysics data sets are great examples where both are addressed very seriously), but it is a warning that just because your data set is huge, that doesn't mean it is free from sample bias. Far from it.
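
And a similarly toy illustration of that point (again my own sketch, not from the article): a huge but biased sample gives a very precise estimate of the wrong answer, while a small fair sample is noisy but centred on the truth.

```python
# Size fixes variance, not bias. We estimate a population mean from
# (a) a huge sample drawn with a preference for large values (think
# self-selected online users) and (b) a small fair sample.
import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(loc=50.0, scale=10.0, size=1_000_000)

# Biased sampling: larger values are more likely to enter the data set
weights = np.exp(population / 20.0)
weights /= weights.sum()
big_biased = rng.choice(population, size=100_000, p=weights)
small_fair = rng.choice(population, size=100)

print(f"true mean:          {population.mean():.2f}")
print(f"huge biased sample: {big_biased.mean():.2f}")  # precise but wrong
print(f"small fair sample:  {small_fair.mean():.2f}")  # noisy but unbiased
```

More data shrinks the error bars, but it shrinks them around whatever your sampling process happens to point at.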

I've felt for a while that 'Big Data' (and data science) are currently going through a rapid phase of growing up, which I suppose is pretty inevitable because they're in many ways new disciplines.  They've gotten up to speed on the algorithmic/computer science side of things very rapidly, but are still very much in the process of learning many of the lessons that are well-known to the statistics community.  I think this is just a transition phase (these things take a bit of time), but it seems clear that a big part of the maturation of Big Data/data science lies in getting fully up to speed on the knowledge and skills of statistics.

Friday, 20 December 2013

Programming for Scientists

I used to co-write a blog called 'Programming for Scientists'.  All the posts from that blog can now be found here, under the tag Programming for Scientists.  Go take a look!

My co-author on that blog was my good friend Ben, who's a professional programmer, so between the two of us (programmer-interested-in-science and scientist-interested-in-programming), there should be some articles there that have stood the test of time :-)

The 'About' for that blog was this:

Just another blog about programming?

Well, not so much.  This is a blog about building better (scientific) software.  We’re making two important distinctions here.  The first is that developing a piece of software is about much more than just writing the code, important though that is.  The second is that for many people, scientists for example, building software is something they do in order to achieve their primary goal.  What we mean by this is that scientists’ primary goal is to do great science.  But to do this, they sometimes need to write some software.  So while they need as many software development skills as possible, they really need the ones that will allow them to build the tools they need and, critically, do a good job as quickly as possible so they can produce more good science.  We suspect there are lots of people for whom this is true, not just scientists.  

The aim of this blog is to explore ways of building better software, but to do so bearing in mind that the ultimate goal is to produce as much great science as possible.  This doesn’t mean coding as quickly as possible, which produces buggy, horrible software; it means drawing from the huge body of software development knowledge for minimising the amount of time it takes to build really reliable software that you can then use confidently to do your science.  We won’t be telling you the “best” way to do anything (there’s usually more than one) and we won’t be handing down stone tablets saying that you must do something without good reason.  What we will do is present a range of techniques, thoughts and considerations and try to explain why we think they’re helpful.  You should use somewhere between none and all of these, depending on what you find most useful.  

We’re not here to tell you how you should go about developing your software, we’d just like to suggest a few things that other people, ourselves included, have found useful. 

We hope you enjoy the blog and find it useful!  

Thursday, 12 September 2013

The DREAM Toxicogenetics Challenge

So, we entered a team (WarwickDataScience) into the DREAM Toxicogenetics Challenge and we've managed to win the leaderboard stage! It's a fascinating (and challenging) problem, so we're really pleased :-)

I wrote a blog post detailing our approach to the modelling, which can be found here.

Tuesday, 8 January 2013

What is Science?

I think a lot about science and the scientific method.  These are very complex topics, but the more I think about it the more I'm convinced their fundamental basis is very simple.

Science is evidence-based reasoning.

Everything else is implementation detail.

This is a very powerful thought because it's hugely general.  We have a provably unique set of mathematical rules that underpin it (probability theory), the human brain has an excellent track record at generating new scientific ideas, and we're getting better and better at generating new empirical evidence (data).
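
To unpack the parenthesis slightly: the uniqueness claim is (I assume) a reference to results like Cox's theorem, which show that any consistent system of plausible reasoning must obey the rules of probability. On that reading, evidence-based reasoning has a one-line formalisation in Bayes' theorem:

```latex
% How belief in a hypothesis H should change on seeing evidence E:
% posterior = likelihood x prior, normalised.
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}
```

Prior beliefs, updated by evidence: everything else really is implementation detail.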

It also raises an interesting question: if this is the essence of science, how can we improve our ability to do it effectively?

Food for thought...


Friday, 26 October 2012

Fixing modern healthcare?

(caveat: this post will be fairly UK-centric, because that's the healthcare system with which I'm familiar!)

Modern healthcare is the victim of its own success.  We've been so successful over the last few decades at developing amazing new medical treatments that healthcare systems can no longer afford to give everyone every single treatment that might help their ailments.  

Now, at some level this is a glass-is-half-empty view of the situation. Just because we can't afford to use every single viable treatment in every single case doesn't mean we're in a bad way. Quite the contrary. We live in an age where people are routinely cured of diseases that were fatal a few short decades ago, and modern treatments improve the lives of millions who previously would simply have had to suffer their ongoing conditions.

However, this is no reason not to be ambitious. We also live in an age of huge innovation, where information technologies such as computers and the internet are radically changing and improving many areas of life. Why should healthcare be any different?

Here in the UK, the NHS has within it a body called the National Institute for Health and Clinical Excellence (NICE). Part of NICE's job is to assess the cost-effectiveness of different medical treatments. Because the NHS has finite resources to be shared across the UK population, this is very important - if money is wasted in one area, it's to the detriment of people elsewhere. Getting the best overall medical return per unit cost is therefore key. But this raises an interesting question:

What if certain medical costs dropped to zero?

There would be an immediate knock-on effect: the saved money could be used elsewhere in the NHS to treat people (in an insurance-based healthcare system, I suppose what should happen is that insurance costs come down, so more people can afford it). There's also a more subtle consequence, depending on what the cost in question is - for example, if the cost of testing for a given disease drops to zero, we can test everyone for it, as often as we like. If the disease in question is cancer, say, early detection will also hugely improve survival rates and make the required treatments a lot cheaper (you need a lot less chemo- and radiotherapy if your tumour is small and unmetastasised to begin with).
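
To make the cost-effectiveness calculus concrete: NICE-style assessments are usually framed as cost per quality-adjusted life year (QALY), via the incremental cost-effectiveness ratio. A toy example, with entirely made-up numbers:

```latex
% Incremental cost-effectiveness ratio: extra cost per extra QALY
% gained by a new treatment over the old one (figures invented).
\mathrm{ICER} = \frac{C_{\text{new}} - C_{\text{old}}}{Q_{\text{new}} - Q_{\text{old}}}
             = \frac{\pounds 12{,}000 - \pounds 2{,}000}{1.5 - 1.0}
             = \pounds 20{,}000 \text{ per QALY}
```

Drive part of the cost in the numerator to zero, and treatments that previously looked unaffordable suddenly start to clear the threshold.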

That would be great, but why on Earth might we expect the cost of anything to drop to zero?  Well, by analogy to what's happened in other industries that have been radically changed by information technologies. The minimum costs to effectively market and distribute books, music and video have dropped essentially to zero - this is why people can self-publish books, release albums without needing a record label, or run entire TV channels over YouTube.  Scientific journals can now be run entirely online, eliminating the need to spend money on marketing or distributing hard copies, leading to new journals that can be run at greatly reduced costs.  And modern search engines let us, for free, efficiently access a reasonable approximation of the sum total of human knowledge.

In medicine, there are a couple of obvious trends that are likely to radically reduce certain costs. The first is genome sequencing. The first time we sequenced a human genome, it cost several billion dollars and took well over a decade. Currently, it costs a few thousand dollars and takes maybe a week. Not bad progress, considering the (more or less) complete first human genome was only announced 12 years ago! Post-genomic medicine is taking a little time to arrive, but arriving it is. And who knows how huge the impact will be when we can routinely sequence the DNA of not only every person, but every pathogen that they have. Immediately. For minimal cost.

The second trend is machine learning. Medical diagnostics and prognostics can be very expensive. But what if we could use all the easily available information we have on a given patient (from their history, previous medical notes, and cheap-to-administer measurements) to spot diseases early and to identify which treatments are likely to be most effective? There may be a whole range of diseases for which we can test at essentially no cost, if we marshal the information appropriately and build effective algorithms to analyse the data. And many diseases have a range of possible treatments available - what if we could use the information we know about a given patient to accurately predict which treatment will work best for them? This is personalised medicine, but it could also be very cost-effective medicine.
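
As a minimal sketch of what this might look like (entirely synthetic data, hypothetical features, and nothing like a real clinical model): train a classifier on cheap, already-collected patient information and use its probabilities as a near-zero-cost risk score.

```python
# A toy disease-risk model built only from cheap, routinely collected
# patient features. All data is synthetic and the feature set is
# purely illustrative - this is a sketch of the idea, nothing more.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 5_000

# Cheap-to-collect features: age, BMI, systolic blood pressure, smoker
X = np.column_stack([
    rng.normal(55, 15, n),
    rng.normal(26, 4, n),
    rng.normal(130, 15, n),
    rng.integers(0, 2, n),
])

# Synthetic 'disease' whose risk rises with age, blood pressure, smoking
logit = 0.04 * (X[:, 0] - 55) + 0.03 * (X[:, 2] - 130) + 1.2 * X[:, 3] - 2.0
y = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

risk = model.predict_proba(X_test)[:, 1]          # per-patient risk score
print(f"AUC: {roc_auc_score(y_test, risk):.2f}")  # how well we rank risk
```

The interesting point is that once such a model exists, scoring a new patient costs essentially nothing - the expensive part is the data marshalling and validation, not the prediction.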

Of course, this is all pretty speculative.  But maybe it's something on which we should be focusing.  Anything medical that has negligible cost is something from which everyone in the world can benefit, as well as freeing up resources to help elsewhere.  And this is a pretty worthwhile goal.