21st Century Scientist

Thursday 28 January 2016

Speeding up time-to-clinic

One thing I'm struck by, the more I work on medical projects, is how long everything takes. Now, partly this is a reflection of the tempo of academic research, and partly it's that these projects are challenging. But, frustratingly, in many cases there isn't a good reason - things could often happen much faster, if we put our minds to it.

That raises the question of how we can more rapidly translate medical research into the clinic, where it can actually help someone.

One major limiting factor is the time it takes to get the resources to get a new project off the ground. If you need grant funding, the proposal itself can take months to put together, especially if you need to generate pilot data. Then the process of your submitted proposal being assessed will take months more, with the associated chance of rejection in most cases. And by the time a funded project gets started, hires the relevant people etc, over a year will have passed since you were first in principle ready to get started. There are now funding streams which try to speed things up, but more are needed.

This couples to another problem, that medical research is often very expensive. It's hard for funders to dole out a research grant quickly if it's for millions of pounds. They have a responsibility to the public/their donors/etc to ensure the money is used wisely. It is much easier for a funder to fund compact projects costing £100k than it is to dole out multi-million pound grants. Cheaper ways to do the research would help accelerate this process.

Then there's the research project itself. This can take a lot of time for all sorts of reasons. And I'm afraid that academics are often very bad at focusing on getting stuff done promptly. This is a mistake if one is concerned with maximising the value-per-unit-time one creates. At every stage, delays creep in because people are busy etc. If it takes a month to arrange a telecon to discuss something, that's 8% of a year wasted. Such delays add up. And I think academia can lack urgency, which in medical research is another mistake. If one aspires to develop something that can save 100 lives a year, a delay of 12 months has killed 100 people...

And even once all of the above is dealt with, we may well not even have begun the process of translation. There needs to be evidence that the research outcomes will work in the clinic (as opposed to a research lab), and there needs to be a way of turning the research into a product that can then be deployed (e.g. commercialisation). Perhaps this even needs to be built into the research project from the word go, so the definition of a successful project is that something new has been deployed to the clinic. If all you've done is draw some conclusions and write a paper, maybe who cares?

Every single one of these steps is slow, ponderous, and flaky. What exactly a better system looks like isn't (yet) clear, but I'm pretty sure we want something a lot closer to an exponential organisation.

So, what should we do about all this?

Well, even a simple consideration of process optimisation would help. For example:

1. Remove all unnecessary steps. If there is some funding already in place (e.g. for pilot studies), the whole grant-application step can sometimes be avoided. If infrastructure is already in place (e.g. for data acquisition, sample collection), there is no need to have to build it.

2. Parallelise as many steps as possible. If we have a hundred candidate biomarkers for a disease, why not set up a project that can systematically test them all at the same time? And for clinical trials, multi-arm, multi-stage trials end up being hugely efficient in terms of treatments tested per unit time.

3. Make each remaining step as fast as possible.

How hard can it be...?

Tuesday 26 January 2016

Translational Medicine with Data Science

I've been thinking a lot recently about how to use all the medical data that are being generated to actually help patients. We can (and do) measure a lot of medically-relevant things about a given patient, but that doesn't automatically lead to an improvement in care or outcome.

So, the question is: What are the challenges to using data science to improve medical treatment?

Small Data

Many medical data sets end up being small. Reasons for this include the cost of collecting samples and making the measurements, and the practical difficulty of recruiting patients to the study. This misses a trick, both because small studies can be underpowered, and because of the 'unreasonable effectiveness of data' (the observation that larger training sets can lead to significant improvement in predictive ability). This is even more of an issue with new data types, many of which are very high-dimensional.
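To put a rough number on 'underpowered', here is a minimal sketch of the textbook normal-approximation sample size for comparing the means of two groups. The function name, the standardised effect sizes and the conventional defaults (5% two-sided significance, 80% power) are my assumptions for illustration, not figures from any particular study.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate patients needed per group for a two-sided, two-sample
    comparison of means, using the normal approximation.

    effect_size is the standardised difference between the groups
    (difference in means divided by the common standard deviation).
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = standard_normal.inv_cdf(power)           # desired power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


# Illustrative effect sizes: large, medium, small.
for d in (0.8, 0.5, 0.2):
    print(f"effect size {d}: about {samples_per_group(d)} patients per group")
```

The required cohort grows with the inverse square of the effect size, so subtle effects need far more patients than a typical small study can recruit.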

The Right Patient Cohort?

Does the cohort of your training data match the cohort of people who you'll actually want to help, clinically? The answer to this is often 'no', and this may seriously affect the generalisation of your model. Is it possible to use transfer learning to help offset this effect?

NHS-Quality Data?

Data from research labs is not enough! The quality can be too low, and will typically not be certified to NHS standards. Trust in data quality is vital, but also your clever data science method will not be allowed into clinical practice if the data aren't NHS-certified.

Changes in Measurement Technology

We are now in an era where medical measurement technologies are developing very rapidly. For example, if you have a method for cancer diagnosis that uses next-generation sequencing technology, will it still work with whatever platform is being used in 5 years' time? We need to learn how to future-proof, as much as possible.

Do You Trust Your Models?

If one is using a data science method to help treat someone medically, the price for the method failing could be very high (possibly even death). Therefore, do you *really* trust your model? Is the parameter fitting robust? Is the model robust to poor data (outliers etc.)?
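As a toy illustration of what 'robust to poor data' means, the sketch below adds a single corrupted reading to a handful of made-up measurements and compares how far the mean and the median move. The numbers and the helper name are hypothetical, and real clinical models need far more than this, but the same question applies to them.

```python
from statistics import mean, median

# Made-up measurements, purely for illustration.
clean_readings = [4.8, 5.1, 5.0, 4.9, 5.2]
# The same readings plus one data-entry error (500.0 instead of 5.00).
corrupted_readings = clean_readings + [500.0]


def shift_from_outlier(estimator):
    """How far a summary statistic moves when the bad reading is added."""
    return abs(estimator(corrupted_readings) - estimator(clean_readings))


print("shift in mean:  ", shift_from_outlier(mean))
print("shift in median:", shift_from_outlier(median))
```

The mean is dragged a long way by one bad value, while the median barely moves. Which of those behaviours your model has is worth knowing before it goes anywhere near a patient.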

Do You Trust Your Code?

Similarly, do you trust your code? Really? Because a bug in your code in this context could kill someone. There are other disciplines that have expertise in this kind of thing (e.g. control systems for aircraft), but this is a huge challenge for medical data science.

Data Integration

An open methodological question: how best to combine multiple sources of information, in e.g. statistically principled ways. Medicine is now routinely producing multiple data types for many patients, so we need reliable (automated) ways to get the most out of these.

Covariate Drift

Our goalposts will move over time. For example, if we're screening for the early detection of a disease, the more successful we are, the more the at-risk cohort will change over time. Can our models adapt to account for this?
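One simple way to at least notice this kind of drift is to compare the distribution of a covariate in the training cohort with the distribution in the patients currently being seen. The sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distribution functions) using only the standard library. The function name and the ages are made up for illustration, and this only detects drift; adapting the model is a separate problem.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples.

    Returns 0.0 when the distributions match exactly and 1.0 when the
    samples do not overlap at all.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    # Both empirical CDFs are step functions, so the largest gap
    # occurs at one of the observed values.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


# Hypothetical patient ages: the cohort the model was trained on, and
# the (younger) cohort seen once screening has been running for a while.
training_ages = [55, 60, 62, 65, 68, 70, 73, 78]
current_ages = [45, 50, 53, 56, 60, 62, 64, 66]

print("drift statistic:", ks_statistic(training_ages, current_ages))
```

Tracking a statistic like this over time, per covariate, gives an early warning that the model is being asked about patients it was not trained on.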

Ongoing Data Aggregation

As we use a medical data science system, it will aggregate more data (i.e. old test examples can subsequently be added to the training set). Can we use these data to improve e.g. the predictive ability of our models?

Are You Solving The Right Problems?

Are we solving problems that will actually improve clinical practice? It's all very well to develop a statistical model that can predict which cancer patients will die sooner, but if one has no treatments available which can change this, there is no value to the prediction.

There are a lot of challenges that we've barely begun to address, when it comes to the idea of getting data science methods into clinical practice. But at the same time, there are so many potential benefits that it is well worth the effort. There's much to be done...

Friday 20 March 2015

Big Data in Cancer

We recently organised a workshop on 'Big Data in Cancer', as part of a year-long programme to launch Warwick's new Data Science Institute. It was a fascinating day, with four brilliant speakers (Florian Markowetz, Andrew Teschendorff, Paul Moss, and Sean Grimmond) and covered a lot of important topics.

Florian was first to speak, and led off with some conceptual and philosophical points about Big Data, which I think are hugely important. There is, of course, a lot of hype surrounding Big Data and what it's capable of. The more 'enthusiastic' supporters even argue that it makes the scientific method obsolete. I share Florian's view that this is kind of silly; no matter how much data you have, there is still a difference between correlation and establishing a causal link, for example. And patterns in data are not the same as establishing underlying general rules. Yes, you can map out a rule using data if you have enough of it, but wouldn't you rather just have a simple formula? Rather, Big Data is a technical challenge (how do we handle it), with the pay-off being smaller variance in our estimates, more detail about the things we are studying and the like. For example, getting a much more detailed description of the heterogeneity and evolutionary history of a given tumour.

Andrew told us about some of his work in epigenomics. This is a fascinating topic that I'm trying to learn more about, but it seems that tumours come with an explosion of epigenetic modification throughout the genome, which potentially contains all sorts of information that may allow us to diagnose cancer, its type, and its likely progression. The early detection of cancer is a hugely important area of translational research, which makes this doubly exciting.

Paul gave us a whistle-stop tour of some areas of cancer research. He talked a lot about the potential for electronic health records and Hospital Episode Statistics data, something in which the Queen Elizabeth Hospital in Birmingham is really leading the way, with a serious informatics infrastructure. To my mind, this is part of a coming revolution in healthcare where all relevant medical data are stored electronically in integrated databases, which turns medical research into a software/algorithms problem. Evidence from other areas of human endeavour suggests that this will be a huge driver of the pace of innovation.

Finally, we were treated to a great talk by Sean, who's heavily involved in the International Cancer Genome Consortium and has recently moved to the UK to pursue the translational aspects of genomic medicine (note: the NHS and the UK's science base make us world-leading in this area, and this means we can attract the world's best researchers). I was really grabbed by just how rapidly genomic medicine is scaling, in terms of data volume. The cost to sequence a whole human genome has dropped from ~$100m (2001) to ~$1000 (2014), an incredible 5 orders of magnitude in 13 years (think about that for a moment...). With the UK's 100,000 Genomes Project already underway, the US recently announcing a 1,000,000 genome project, and China undertaking a similar scale project, we are just about to be awash in genomic data.
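To make 'think about that for a moment' a little more concrete, here is the arithmetic as a short sketch. It uses only the approximate cost figures quoted above and assumes, purely for illustration, a smooth exponential decline between the two dates.

```python
from math import log10, log2

# Approximate cost per genome in US dollars, as quoted above.
cost_2001 = 100e6
cost_2014 = 1e3
years = 2014 - 2001

fold_drop = cost_2001 / cost_2014
orders_of_magnitude = log10(fold_drop)

# Assuming a steady exponential decline, how often did the cost halve?
halvings = log2(fold_drop)
months_per_halving = 12 * years / halvings

print(f"{orders_of_magnitude:.0f} orders of magnitude in {years} years")
print(f"cost halved roughly every {months_per_halving:.1f} months")
```

Under that assumption the cost halved somewhat more often than once a year, sustained for over a decade.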

The limiting factor in translational cancer research is about to become our ability to effectively handle and use the flood of data that we're just seeing the first trickles of. This is a hugely exciting time to be involved in cancer research, but we all might need to buy bigger computers...

Monday 31 March 2014

Big Data - in the process of growing up

A very good article was published in the FT online a few days ago. Its title is 'Big Data: are we making a big mistake?', and it's a commendably well thought out discussion of some of the challenges of Big Data. And perhaps a bit of a warning to not get too carried away :-)

You should go and read the full article, because it's got loads of great stuff in it. A few of the concepts that leapt out at me in particular were the following.

One of the interesting things about modeling data in order to make predictions (as opposed to explain the data), is that features that are correlated to (but not causal for) the outcome of interest are still useful. But, the FT article makes the really good point that even in this case, correlations can be more fragile than genuinely causal features. This is because while a cause is likely to remain a cause, correlations can more easily change over time (covariate drift). This doesn't mean we can't use correlation as a way to inform predictions, but it does mean that we need to be much more careful and be aware that the correlations may change.

The article also discusses sample variance and sample bias. This is in many ways the crux of the matter for Big Data. In principle, very large data sets offer us the chance to drive sample variance towards zero. But they have much less to offer in terms of sample bias, and indeed (as the article points out) many of the largest data sets, because of the way they're generated, are actually very vulnerable to high levels of sample bias. This is not to say that one can't have both (the large particle physics and astrophysics data sets are great examples where both sample variance and sample bias are addressed very seriously), but it is a warning that just because your data set is huge, it doesn't mean that it is free from sample bias. Far from it.
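A small simulation makes the distinction concrete. In the sketch below, everything (the population, the 30% true rate, the sampling probabilities) is made up for illustration: a small simple random sample is compared with a sample that is roughly a hundred times larger but in which people with the trait are twice as likely to be recorded.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1 = has some trait, 0 = does not (true rate ~30%).
population = [1 if rng.random() < 0.30 else 0 for _ in range(200_000)]


def rate(sample):
    """Fraction of a sample that has the trait."""
    return sum(sample) / len(sample)


true_rate = rate(population)

# Small but unbiased: a simple random sample of 1,000 people.
small_random = rng.sample(population, 1_000)

# Huge but biased: people with the trait are twice as likely to be recorded.
big_biased = [x for x in population if rng.random() < (0.8 if x else 0.4)]

print(f"true rate:           {true_rate:.3f}")
print(f"small random sample: {rate(small_random):.3f}  (n={len(small_random)})")
print(f"big biased sample:   {rate(big_biased):.3f}  (n={len(big_biased)})")
```

The small sample is noisy but centred on the truth. The big one is very precise about the wrong answer, and collecting more data in the same way would not fix it.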

I've felt for a while that 'Big Data' (and data science) are currently going through a rapid phase of growing up, which I suppose is pretty inevitable because they're in many ways new disciplines. They've gotten up to speed on the algorithmic/computer science side of things very rapidly, but are still very much in the process of learning many of the lessons that are well-known to the statistics community. I think this is just a transition phase (these things take a bit of time), but it seems clear that a big part of the maturation of Big Data/data science lies in getting fully up to speed on the knowledge and skills of statistics.

Friday 20 December 2013

Programming for Scientists

I used to co-write a blog called 'Programming for Scientists'. All the posts from that blog can now be found here, under the tag Programming for Scientists. Go take a look!

My co-author on that blog was my good friend Ben, who's a professional programmer, so between the two of us (programmer-interested-in-science and scientist-interested-in-programming), there should be some articles there that have stood the test of time :-)

The 'About' for this blog was this:

Just another blog about programming?

Well, not so much. This is a blog about building better (scientific) software. We’re making two important distinctions here. The first is that developing a piece of software is about much more than just writing the code, important though that is. The second is that for many people, scientists for example, building software is something they do in order to achieve their primary goal. What we mean by this is that scientists’ primary goal is to do great science. But to do this, they sometimes need to write some software. So while they need as many software development skills as possible, they really need the ones that will allow them to build the tools they need and, critically, do a good job as quickly as possible so they can produce more good science. We suspect there are lots of people for whom this is true, not just scientists.

The aim of this blog is to explore ways of building better software, but to do so bearing in mind that the ultimate goal is to produce as much great science as possible. This doesn’t mean coding as quickly as possible, which produces buggy, horrible software; it means drawing from the huge body of software development knowledge for minimising the amount of time it takes to build really reliable software that you can then use confidently to do your science. We won’t be telling you the “best” way to do anything (there’s usually more than one) and we won’t be handing down stone tablets saying that you must do something without good reason. What we will do is present a range of techniques, thoughts and considerations and try to explain why we think they’re helpful. You should use somewhere between none and all of these, depending on what you find most useful.

We’re not here to tell you how you should go about developing your software, we’d just like to suggest a few things that other people, ourselves included, have found useful.

We hope you enjoy the blog and find it useful!