Artificial Intelligence Unleashed on the Brain: Using Machine Learning to Predict Alzheimer’s
This week, a novel computer algorithm tries to turn cross-sectional studies into longitudinal ones to give us new insights into neurodegenerative diseases like Alzheimer’s. The manuscript, appearing in the journal Brain, may be the birth of a new machine-learning technique that could transform the field.
But I’m not entirely convinced.
Let’s walk through it.
One of the major problems in neurodegenerative disease research is a lack of big longitudinal datasets. If resources were unlimited, we could follow blood gene expression profiles for tens of thousands of people for decades, see who develops neurodegenerative diseases, and gain a deep understanding of the longitudinal changes in gene expression that might drive the disease. That information could not only give us a new prognostic tool but could identify therapeutic targets too.
Of course, we don’t have unlimited resources. Most neurodegenerative disease datasets are cross-sectional, or nearly so: data collected at a single point in time, occasionally supplemented by post-mortem autopsy studies.
Researchers led by Yasser Iturria-Medina at McGill University took a different approach. What if the power of machine learning could be leveraged to transform cross-sectional data into longitudinal data?
It’s a complicated idea, but basically they took gene expression profiles from the blood (and, in autopsy cases, the brain) of individuals with neurodegenerative disease and of healthy elderly controls.
Feeding all this data into a computer algorithm, they asked the machine to find which gene transcripts tended to cluster together.
Now here’s where we need to introduce some machine learning jargon. The paper describes this approach as “unsupervised” — see it’s right there in the keywords.
What that means is that the gene data was presented to the algorithm without any additional information — like how severe the dementia was. The algorithm had to just figure out which genes hang together without knowing how they relate to disease. This is really important because if you use an unsupervised method to cluster gene expression, and you subsequently show that those clusters predict disease severity, you have a really strong argument that you’ve discovered something fundamental about the disease process.
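To make that concrete, here’s a toy version of “unsupervised” clustering. Everything below is invented for illustration (fake genes, fake expression values, and a naive correlation-based grouping rather than the authors’ actual algorithm), but notice what’s missing from the inputs: disease labels.

```python
# Toy sketch of unsupervised clustering of gene transcripts.
# All data and the grouping method are made up for illustration;
# the key point is that no disease labels appear anywhere.
from math import sqrt

# Rows: genes; columns: expression levels across 6 hypothetical samples.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "GENE_B": [1.1, 2.1, 2.9, 4.2, 5.1, 5.8],  # tracks GENE_A
    "GENE_C": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],  # moves opposite to GENE_A
    "GENE_D": [5.9, 5.2, 3.8, 3.1, 2.2, 0.9],  # tracks GENE_C
}

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def greedy_clusters(data, threshold=0.9):
    """Put a gene in the first cluster whose founding member it correlates
    with above `threshold`; otherwise start a new cluster."""
    clusters = []
    for gene, profile in data.items():
        for cluster in clusters:
            if pearson(data[cluster[0]], profile) >= threshold:
                cluster.append(gene)
                break
        else:
            clusters.append([gene])
    return clusters

clusters = greedy_clusters(expression)
```

Run on this toy data, the grouping pairs GENE_A with GENE_B and GENE_C with GENE_D purely from how their profiles co-vary. That is the sense in which clustering can be “unsupervised”: the structure emerges without anyone telling the algorithm who is sick.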
The authors tried to show that here. After the algorithm was trained, each patient could then be mapped in what amounts to a one-dimensional space — how close they are to the pattern seen in healthy controls vs. how close they are to patterns seen in neurodegenerative disease.
On the assumption that neurodegenerative disease progresses from health, slowly, to advanced disease, this map reflects time, or — as the researchers call it — pseudotime.
In other words, by looking at any individual’s gene expression profile, they could estimate how far along the disease pathway they have traveled. Lining up individuals by pseudotime then allows you to create a pseudolongitudinal cohort study — and maybe learn something fundamental about the disease.
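One simple way to picture the pseudotime idea is as a projection onto the line running from the average healthy profile to the average disease profile. The sketch below is my own illustrative assumption with made-up numbers, not the paper’s actual construction:

```python
# Illustrative pseudotime: project a person's expression profile onto the
# axis running from the mean healthy profile to the mean disease profile.
# 0 ~ looks like the healthy centroid, 1 ~ looks like the disease centroid.
# All vectors here are hypothetical; the paper's method is richer.
def pseudotime(sample, healthy_centroid, disease_centroid):
    axis = [d - h for h, d in zip(healthy_centroid, disease_centroid)]
    offset = [s - h for s, h in zip(sample, healthy_centroid)]
    # Scalar projection onto the health->disease axis, normalized by the
    # squared axis length, yields the fraction of the way traveled.
    return sum(a * o for a, o in zip(axis, offset)) / sum(a * a for a in axis)

healthy = [1.0, 1.0, 1.0]   # made-up 3-gene healthy centroid
disease = [3.0, 3.0, 3.0]   # made-up 3-gene disease centroid
midway = pseudotime([2.0, 2.0, 2.0], healthy, disease)  # 0.5: halfway along
```

Sorting a whole cohort by a number like this is what turns a single snapshot into a pseudolongitudinal ordering.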
And it seems like it works.
These pseudotime estimates were strongly associated with disease severity in terms of findings on PET scans and post-mortem brain pathology. They were also associated with performance on various cognitive tests, though not quite as strongly.
This is all super cool, but I am not quite willing to drink the super Kool-Aid yet.
First of all, the technique demonstrated seems to be cohort specific. In other words, they didn’t identify a universal gene expression profile that could be applied to any individual to see where they are on the path to dementia. For example, in one cohort this technique identified 845 highly influential genes. In another, 416 influential genes were identified. That means you’re unlikely to see a lab test that leverages this technique any time soon.
The other problem is more subtle. The fact that the new “pseudotime” construct correlates with disease state and progression is the real breakout finding here. But it’s only so compelling because of the authors’ claim that the machine-learning model was “unsupervised”.
But this wasn’t entirely true. Those control patients, so crucial to weed out the noise of normal aging, were labeled as such, according to correspondence I had with Dr. Iturria-Medina. The algorithm knew, from the beginning, who was a control and who was a patient. I asked Dr. Iturria-Medina if it was fair, then, to call this an unsupervised model. He wrote “Probably, depending on the perspective, semi-unsupervised could be a more correct categorization.”
But the authors don’t mention this in the paper. In fact, they go out of their way to point out that the unsupervised nature of the model is a particular strength because it “guarantees absence of…data overfitting”.
And that may be true, if the model is really unsupervised. But since it is somewhat supervised, we now have the possibility that the strong relationships observed between pseudotime and various disease outcomes are not driven by biology, but by overfitting of the training data.
This would all be easily determined by applying the model to a held-out test set. But this wasn’t done.
I want to be clear, this doesn’t invalidate the results, but it does mean we need to see some replication in other cohorts with rigorously held out test sets before we can be really confident that the algorithm is learning something about the disease, and not just the dataset it was developed in.
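For readers unfamiliar with the mechanics, the fix is simple in principle: split the cohort once, up front, and never let the training step touch the held-out half. A minimal sketch, with hypothetical patient IDs and an arbitrary split fraction:

```python
# What a held-out test set looks like in practice: shuffle once, set aside
# a fraction that the model never sees during training. Patient IDs here
# are stand-ins; any pseudotime model would be fit on `train` only, and
# correlations with disease severity reported on `test`.
import random

def train_test_split(samples, test_fraction=0.3, seed=42):
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

cohort = list(range(100))  # 100 hypothetical patients
train, test = train_test_split(cohort)
```

If pseudotime still tracks PET findings and cognitive scores in the untouched test set, overfitting becomes a much less plausible explanation.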
We’re in an amazing new world where data science promises to give new insights into disease — but it is really complex and subtle variations in study design can have big implications for interpretation. Many of us, myself included, are still learning how to interpret studies like this. Machine learning algorithms may not be as complicated as the human brain yet, but they’re complicated enough that understanding these studies is far from intuitive.
This commentary first appeared in medscape.com.