So, the question is: What are the challenges to using data science to improve medical treatment?
Small Data
Many medical data sets are so small that the resulting studies are underpowered. Reasons include the cost of collecting samples and making the measurements, and the practical difficulty of recruiting patients to a study. This misses a trick twice over: underpowered studies can fail to detect real effects, and small data sets cannot benefit from the 'unreasonable effectiveness of data' (the observation that larger training sets can lead to significant improvements in predictive ability). The problem is even more acute for many new data types, which are often very high-dimensional.
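To make 'underpowered' concrete, here is a minimal sketch of a sample-size calculation. It assumes a simple two-arm comparison of means, a two-sided test and the usual normal approximation; the function name and the effect sizes are illustrative choices, not values taken from any real study.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate patients needed per arm for a two-sided, two-sample comparison.

    Uses the normal approximation n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2,
    where d is the standardised effect size (difference in means / SD).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Illustrative effect sizes only; real values must come from the clinical question.
for d in (0.8, 0.5, 0.2):
    print(f"effect size {d}: about {sample_size_per_group(d)} patients per group")
```

Under these assumptions a moderate effect (d = 0.5) needs roughly 63 patients per group, and a small effect (d = 0.2) needs several hundred, which is more than many studies manage to recruit.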
The Right Patient Cohort?
Does the cohort of your training data match the cohort of people who you'll actually want to help, clinically? The answer to this is often 'no', and this may seriously affect the generalisation of your model. Is it possible to use transfer learning to help offset this effect?
NHS-Quality Data?
Data from research labs are not enough! The quality can be too low, and the data will typically not be certified to NHS standards. Trust in data quality is vital in its own right, and your clever data science method will not be allowed into clinical practice if the data aren't NHS-certified.
Changes in Measurement Technology
We are now in an era where medical measurement technologies are developing very rapidly. For example, if you have a method for cancer diagnosis that uses next-generation sequencing technology, will it still work with whatever platform is being used in five years' time? We need to learn how to future-proof our methods as much as possible.
Do You Trust Your Models?
If one is using a data science method to help treat someone medically, the price of the method failing could be very high (possibly even death). Therefore, do you *really* trust your model? Is the parameter fitting robust? Is the model robust to poor data (outliers, etc.)?
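One small, concrete piece of the robustness question is spotting gross outliers before they reach a model. The sketch below uses a robust z-score based on the median and the median absolute deviation (MAD). This is one common heuristic rather than a complete answer; the function name, the 3.5 threshold and the example readings are all illustrative assumptions.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return the values whose robust z-score exceeds `threshold`.

    The robust z-score is |x - median| / (1.4826 * MAD). The constant 1.4826
    makes the MAD comparable to a standard deviation for normally distributed data.
    """
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    if mad == 0:
        # More than half the values are identical; this heuristic cannot scale them.
        return []
    scale = 1.4826 * mad
    return [v for v in values if abs(v - centre) / scale > threshold]


# Hypothetical lab readings with one implausible measurement.
readings = [5.1, 4.9, 5.0, 5.2, 4.8, 50.0]
print(flag_outliers(readings))
```

Because the median and MAD are barely moved by a single extreme value, the implausible reading is flagged without distorting the scale used to judge the others. A mean and standard deviation computed from the same data would be pulled strongly towards the outlier.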
Do You Trust Your Code?
Similarly, do you trust your code? Really? Because a bug in your code in this context could kill someone. There are other disciplines that have expertise in this kind of thing (e.g. control systems for aircraft), but this is a huge challenge for medical data science.
Data Integration
An open methodological question: how can we best combine multiple sources of information in statistically principled ways? Medicine is now routinely producing multiple data types for many patients, so we need reliable (automated) ways to get the most out of these.
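As a deliberately simple illustration of what 'statistically principled' can mean, the sketch below combines several estimates of the same quantity by inverse-variance weighting. It assumes the estimates are independent, unbiased and come with known variances, which real multi-modal medical data rarely satisfy so neatly. The function name and the numbers are hypothetical.

```python
def combine_estimates(estimates):
    """Combine (value, variance) pairs by inverse-variance weighting.

    Returns the combined value and its variance. More precise sources
    (smaller variance) receive proportionally more weight.
    """
    estimates = list(estimates)
    if not estimates:
        raise ValueError("need at least one estimate")
    if any(var <= 0 for _, var in estimates):
        raise ValueError("variances must be positive")
    weights = [1.0 / var for _, var in estimates]
    total = sum(weights)
    combined = sum(w * value for w, (value, _) in zip(weights, estimates)) / total
    return combined, 1.0 / total


# Hypothetical: the same biomarker estimated from two different assay types.
value, variance = combine_estimates([(1.0, 1.0), (3.0, 1.0)])
print(value, variance)
```

The combined variance is never larger than that of the best single source, which is the basic reason integration is worth attempting. The open question in the text is how to achieve something similar when the sources are different data types rather than repeated estimates of one number.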
Covariate Drift
Our goalposts will move over time. For example, if we're screening for the early detection of a disease, the more successful we are, the more the at-risk cohort will change over time. Can our models adapt to account for this?
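Before a model can adapt to drift, the drift has to be noticed. One simple check, sketched below, is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical distributions of a single covariate in the training cohort and in recent patients. This only monitors one variable at a time and is offered as an assumption about how such monitoring might start, not as a full solution. The ages are made up.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs.

    0.0 means the two empirical distributions coincide; 1.0 means the
    samples do not overlap at all.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    # The largest gap between two step functions occurs at an observed data point.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


# Hypothetical ages: original training cohort vs. patients seen more recently.
training_ages = [62, 65, 70, 71, 74, 78, 80]
recent_ages = [45, 50, 52, 58, 61, 63, 66]
print(ks_statistic(training_ages, recent_ages))
```

A value that creeps upwards over successive monitoring windows would suggest the cohort is moving away from the one the model was trained on, prompting a closer look or a retraining decision.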
Ongoing Data Aggregation
As we use a medical data science system, it will aggregate more data (i.e. old test examples can subsequently be added to the training set). Can we use these data to improve e.g. the predictive ability of our models?
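A toy example of folding new data into an existing model, assuming the quantity of interest is a single event rate: a Beta-Binomial model can be updated batch by batch without refitting from scratch. The class name, the uniform prior and the outcomes below are illustrative assumptions; a real predictive model would need a far richer update scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaPosterior:
    """Beta distribution over an unknown event rate (default: uniform prior)."""

    alpha: float = 1.0
    beta: float = 1.0

    def update(self, outcomes):
        """Return a new posterior after observing a batch of True/False outcomes."""
        outcomes = list(outcomes)
        positives = sum(1 for o in outcomes if o)
        negatives = len(outcomes) - positives
        return BetaPosterior(self.alpha + positives, self.beta + negatives)

    @property
    def mean(self):
        """Posterior mean estimate of the event rate."""
        return self.alpha / (self.alpha + self.beta)


# Hypothetical: confirmed outcomes arriving in two batches after deployment.
posterior = BetaPosterior()
posterior = posterior.update([True, True, False])
posterior = posterior.update([True])
print(posterior, posterior.mean)
```

Updating in two batches gives the same posterior as seeing all four outcomes at once, which is the property one would like larger models to approximate as confirmed cases accumulate.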
Are You Solving The Right Problems?
Are we solving problems that will actually improve clinical practice? It's all very well to develop a statistical model that can predict which cancer patients will die sooner, but if one has no treatments available which can change this, there is no value to the prediction.
There are a lot of challenges that we've barely begun to address when it comes to getting data science methods into clinical practice. But at the same time, there are so many potential benefits that it is well worth the effort. There's much to be done...