Deep learning can offer huge value in furthering the understanding of a number of diseases, allowing faster diagnosis and better treatment decisions. However, there is a problem with the data: despite the huge potential of deep learning and related tools, a fundamental change is needed in the way data is collected before these tools can be used effectively.
Nikolay Oskolkov, a research scientist from Lund University, knows a lot about data. As a former theoretical physicist, who later transitioned to working in biomedicine, he is passionate about the use of machine learning, particularly deep learning, within clinical diagnostics but is also keen to communicate the existing challenges.
Although there is a lot of talk about data within the life sciences, ironically there is a lack of it. “There are so many areas within life sciences where we can run the same level of analyses that they do in data science,” says Oskolkov. “But they have millions of texts, videos and images. In biomedicine, it is a different setup. It doesn’t mean that it is impossible to do machine learning, but it is just good to remember that it is very different.”
Amongst all the buzzwords, some definitions and distinctions are helpful. In the context of clinical diagnostics, those of most relevance are Bayesian statistics, frequentist statistics, machine learning and deep learning. The decision to use one of these for statistical analysis is contingent upon the amount and nature of data collected.
Bayesian statistics, while commonly used in other fields, is rarely used within biomedicine. This was surprising for Oskolkov when he transitioned from theoretical physics. “When I moved to this area, I observed how they did analysis within medicine, which was very different from physics,” explains Oskolkov. “I noticed that they were predominantly using frequentist statistics, where the definition of probability is different from Bayesian statistics.”
Frequentist statistics, as the term suggests, is based on the frequency of success. “If you conduct 1,000 trials and you see that 100 times you get success and the rest are failures, then the probability of success will be 100 divided by 1,000,” says Oskolkov.
This is not well suited to biomedicine. “This way of defining probability does not automatically bring uncertainties into your analysis,” explains Oskolkov. “Bayesian statistics never says ‘yes’ or ‘no’, it will say ‘yes, but with some uncertainties’, unlike frequentist statistics.”
These might seem like small differences but they are important. “In healthcare, where the price of a wrong conclusion is high, it is better to stick with more conservative statistical methods, which is Bayesian statistics,” explains Oskolkov. “It is a more intelligent way of making decisions; it is more careful.”
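The contrast Oskolkov draws can be made concrete with his own example of 100 successes in 1,000 trials. The sketch below, a minimal illustration assuming a uniform Beta(1, 1) prior, shows how the frequentist approach yields a single point estimate while the Bayesian approach yields a distribution whose spread quantifies the uncertainty:

```python
import math

# Trial outcomes from the article's example: 100 successes in 1,000 trials.
successes, trials = 100, 1000

# Frequentist estimate: a single point value, with no built-in uncertainty.
freq_estimate = successes / trials  # 0.1

# Bayesian estimate: start from a uniform Beta(1, 1) prior (an assumption
# for illustration) and update it with the data, giving a full posterior
# distribution Beta(a, b) rather than a single number.
a = 1 + successes
b = 1 + (trials - successes)
posterior_mean = a / (a + b)
posterior_sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

print(f"frequentist point estimate: {freq_estimate:.3f}")
print(f"Bayesian posterior mean:    {posterior_mean:.3f} +/- {posterior_sd:.3f}")
```

The posterior standard deviation is the “yes, but with some uncertainties” Oskolkov describes: with more data it shrinks, and with less data it honestly widens instead of disappearing.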
Similarly, the distinction between deep learning and traditional machine learning is significant. The former is a subset of the latter but has additional properties, which make it particularly useful for gaining more nuanced insights into health conditions. “Deep learning is about deep neural networks,” explains Oskolkov. “The advantage of these is that the performance and the accuracy increase the more data they get. In traditional machine learning, you reach some plateau or saturation, meaning that after a certain amount of data you don’t see this improvement in accuracy.”
Understand the information
Making sense of the insights generated from deep learning is difficult, making the choice between machine learning and deep learning less clear-cut. “Deep learning uses a lot more parameters, compared with traditional machine learning, so it’s a more complex, less-transparent form,” says Oskolkov. “People complain that it is a black box. Over the past couple of years, we’ve made a lot of progress in terms of the interpretability of deep learning, but it’s still an issue. Traditional machine learning is more interpretable.”
Bringing deep learning into clinical diagnostics, while potentially of huge benefit, is not straightforward. “In medicine, it is more difficult to conduct robust analyses than in physics, which works with simplified options with ideal worlds,” explains Oskolkov. “In the biological world, how things work is much more of a mystery. We cannot conduct precise analyses. That is why a lot of decisions are made based upon intuition and experience.”
It is important to note that adopting deep learning does not mean replacing healthcare professionals. “The idea is that an algorithm, like deep learning, can be an advisor, a computational framework, which analyses a bunch of data that a human cannot analyse in a practical time frame,” explains Oskolkov. “It is just not feasible for a medical doctor to look at all the numbers, images and everything, whereas a machine can analyse all the complexity of the data very quickly and make a prediction or give advice.”
To be able to successfully integrate deep learning into healthcare, it is essential to consider not only the data-analysis process but the nature of the data itself. An important distinction is between big data and little data. Currently, the UK Biobank holds many types of data, including whole-genome genetic information, on 500,000 individuals, and a number of projects are under way to conduct whole-genome sequencing of hundreds of thousands more. However, such data is limited by a concept called ‘the curse of dimensionality’. This occurs when there is an imbalance between the number of statistical observations and the number of parameters, or the dimensionality of the data.
“If you think that one genome, one individual, is one statistical observation and you want to find genes or mutations that lead to certain human diseases, it means you have to analyse millions of mutations,” says Oskolkov. “If you have half a million individuals in the UK Biobank, and millions of mutations, it means that you place half a million individuals in million-dimensional parameter space.”
To overcome the curse of dimensionality, you would have to sequence millions of individuals, which clearly is not feasible. In the absence of that large amount of data, analysis is restricted to traditional frequentist statistics, which is inherently limiting in terms of the insights that can be generated.
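The arithmetic behind this imbalance is simple. The sketch below uses the article’s half-million-genome figure; the mutation count and the ten-observations-per-dimension rule of thumb are assumptions for illustration only:

```python
# Illustrative numbers based on the article's description of the UK Biobank:
# roughly half a million genomes, each described by millions of mutations.
# The exact mutation count is an assumption for demonstration.
n_individuals = 500_000     # statistical observations (one genome each)
n_dimensions = 5_000_000    # mutations analysed, i.e. parameters per genome

# The curse of dimensionality: far fewer observations than dimensions.
ratio = n_individuals / n_dimensions
print(f"observations per dimension: {ratio:.2f}")

# Even a modest rule of thumb of 10 observations per dimension would
# require sequencing tens of millions of people.
needed = 10 * n_dimensions
print(f"individuals needed for 10 obs/dimension: {needed:,}")
```

With a tenth of an observation per dimension, the data is far too sparse to constrain a deep model, which is why the analysis stays with more conservative methods.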
Little wonder
However, there is an alternative: small data. “Whole-genome sequencing means that you chop your DNA and then sequence the pieces of DNA,” says Oskolkov. “Rather than one whole genome, a tiny sequence from a genome can be a statistical observation. This way, you can get billions of statistical observations, and this is where you can run deep learning.”
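The reframing Oskolkov describes can be sketched in a few lines. Here each short subsequence (a k-mer) of a genome is treated as one statistical observation; the toy sequence and window length are assumptions for illustration:

```python
# A minimal sketch of the "small data" idea: instead of one genome being
# one observation, every short fragment of it becomes an observation.

def kmers(sequence, k):
    """Slide a window of length k along the sequence, yielding one
    short fragment (one statistical observation) per position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

genome_fragment = "ATGCGTACGTTAGC"   # toy sequence for illustration
observations = kmers(genome_fragment, k=6)

print(len(observations), "observations from one short sequence")
print(observations[:3])
```

Scaled up, a single three-billion-base genome yields on the order of billions of such fragments, which is the regime where deep learning’s accuracy keeps improving with data.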
Greater collaboration between natural and life sciences could help to overcome these issues, bringing together knowledge of statistics and medicine. “I want the natural and life sciences to communicate and learn from each other,” says Oskolkov. “In healthcare, there is a lot of discussion about how we are collecting data instead of how we are going to analyse it. If you don’t have a clear idea about analysis then it means you should reformulate and reconsider where you invest money.”

One of the criticisms levelled at traditional machine learning and deep learning is their potential to make errors as a result of being exposed to new data that they have not been previously trained on. However, a lesser-discussed form of deep learning, called reinforcement learning, could help to overcome this issue.
“In reinforcement learning, it is a self-learning algorithm – it is automated so it doesn’t need human interference,” explains Oskolkov. “It can understand the rules of environments itself by just observing, which opens new horizons. If an algorithm can understand what to do without you teaching it, this is a different mentality, a different world. Reinforcement learning is a way to learn about the unknowns in an environment, and this is real artificial intelligence.”
This is in stark contrast to deep learning and traditional machine learning, whereby the conclusions could potentially be reached through pen-and-paper methods. Although it is not feasible for humans to carry out this work at scale, it is possible.
Reinforcement learning means going beyond human capabilities and is already being explored in other industries. “There are already examples outside life sciences – in data science, what Google DeepMind is doing with AlphaGo, biologists or medical doctors cannot do,” says Oskolkov. “What does this have to do with biomedicine? A lot – it’s general intelligence; it’s an algorithm that can learn from the data by itself and which can discover something you would never think about.”
Oskolkov is optimistic about the future of these tools for healthcare, but it does require a different way of thinking. “It could bring major advances,” says Oskolkov. “Life sciences can learn a lot from looking at other industries.”