Machine learning workflow for evaluating genetic variants based on protein structures
A collaboration between the Center for Human Genetics and Genomic Medicine, University Hospital RWTH Aachen, and scieneers GmbH
Why are some genetic variants harmless, while others cause diseases?
This question is one of the greatest challenges in genomic medicine. Especially in the case of rare diseases – which mostly affect children – it is often difficult to identify the truly disease-causing variant among millions of changes in our genome.
Many of these cases involve so-called missense variants: minimal changes in the DNA that lead to the substitution of a single amino acid in the protein. Assessing the impact of these small changes on protein function is often difficult. Even modern computational prediction tools frequently give inconclusive results, so many variants end up classified as “variants of uncertain significance” (VUS), with far-reaching consequences for affected patients and treating physicians.
In our joint research project between the Center for Human Genetics and Genomic Medicine at University Hospital RWTH Aachen and scieneers GmbH, we have developed a new machine learning workflow that places the three-dimensional protein structure – a factor often neglected until now – at the center of variant evaluation.
Here’s a direct link to our paper: Utilizing protein structure graph embeddings to predict the pathogenicity of missense variants
Why protein structures are crucial
Most previous prediction models focus on features such as evolutionary conservation, population frequencies, or the amino acid sequence of the underlying protein – the “language” of proteins. However, it is often the 3D structure of a protein that ultimately determines whether and how a genetic variant affects its function, as the properties and roles of a protein arise only through the spatial arrangement of its amino acids.
So far, prediction models have made little use of this structural data: they either rely primarily on the structure of the wild-type (unmodified) protein or reduce the structure to highly simplified descriptors.
Our approach: Enhancing established models with protein structure embeddings
Genomic medicine meets machine learning:
- Using ESMFold, a state-of-the-art protein structure prediction model, we predicted the 3D structures of more than 60,000 altered and unaltered proteins.
- These structures were transformed into so-called graph embeddings using graph autoencoder networks, resulting in a compressed yet information-rich representation of complex protein structures.
- These embeddings then served as input for our classification models, which can predict whether a variant is likely to be disease-causing.
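To make the middle step more concrete, here is a minimal sketch of how a predicted structure could be turned into a graph and then into a fixed-size vector. It is not the pipeline from the paper: the 8 Å C-alpha contact threshold, the one-dimensional toy node feature, and the function names `contact_graph` and `pooled_embedding` are all assumptions made for illustration. In particular, the simple neighbor averaging with mean pooling only stands in for a trained graph autoencoder, which would learn its own compression.

```python
import math

def contact_graph(ca_coords, threshold=8.0):
    """Link residues whose C-alpha atoms lie within `threshold` angstroms.

    The 8.0 cutoff is an assumed, commonly used value, not taken from the paper.
    Returns an adjacency list keyed by residue index.
    """
    n = len(ca_coords)
    adjacency = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= threshold:
                adjacency[i].append(j)
                adjacency[j].append(i)
    return adjacency

def pooled_embedding(adjacency, node_features):
    """Hypothetical stand-in for a graph autoencoder.

    Step 1: average each residue's features with those of its neighbors.
    Step 2: average over all residues to get one fixed-size vector.
    """
    dim = len(node_features[0])
    smoothed = []
    for node, neighbors in adjacency.items():
        group = [node] + neighbors
        smoothed.append(
            [sum(node_features[m][d] for m in group) / len(group) for d in range(dim)]
        )
    return [sum(row[d] for row in smoothed) / len(smoothed) for d in range(dim)]

# Toy input: four residues on a straight line, 3.8 angstroms apart (made-up coordinates).
ca_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
# Toy per-residue feature (e.g. a made-up binary property), one value per residue.
node_features = [[1.0], [0.0], [0.0], [1.0]]

graph = contact_graph(ca_coords)
embedding = pooled_embedding(graph, node_features)
print(graph)
print(embedding)
```

Running the same two functions on the wild-type and the variant structure would give two vectors of equal length, which is what makes them usable as classifier input regardless of protein size.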
A clear added value: Improving established prediction scores
How great is the practical benefit of this structural information?
We tested our approach by supplementing the well-known CADD score – an established measure for the pathogenicity of genetic variants – with our graph embeddings. The result: The predictions became noticeably more accurate thanks to the additional structural information.
Remarkably, although the CADD score already incorporates sequence-based information (amino acid sequence) from ESM models, the direct integration of 3D structure provided real added value. This highlights that future prediction tools should ideally combine both sequence and structural data.
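Supplementing a score with embeddings can be pictured as simple feature concatenation followed by a classifier. The sketch below is an illustration under assumptions, not the model from the paper: the classifier is a plain logistic regression trained by gradient descent, the function names are hypothetical, and every number is a synthetic placeholder (the "CADD" values are made-up numbers in the range 0 to 1, not real CADD scores).

```python
import math

def combine_features(cadd_score, structure_embedding):
    """Concatenate one scalar score with a structure embedding vector."""
    return [cadd_score] + list(structure_embedding)

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(weights, bias, x):
    """Probability of the positive (pathogenic) class for one feature vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train_logistic_regression(X, y, lr=0.5, epochs=300):
    """Full-batch gradient descent on the mean logistic loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(weights, bias, xi) - yi
            for k, value in enumerate(xi):
                grad_w[k] += err * value
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Synthetic toy data: (placeholder score, placeholder 2-d embedding, label).
# Label 1 = pathogenic, 0 = benign. None of these values come from real variants.
toy_variants = [
    (0.9, [0.8, 0.1], 1), (0.8, [0.7, 0.2], 1), (0.7, [0.9, 0.1], 1),
    (0.2, [0.1, 0.9], 0), (0.3, [0.2, 0.8], 0), (0.1, [0.3, 0.7], 0),
]
X = [combine_features(score, emb) for score, emb, _ in toy_variants]
y = [label for _, _, label in toy_variants]

weights, bias = train_logistic_regression(X, y)
p_pathogenic_like = predict_proba(weights, bias, combine_features(0.85, [0.75, 0.15]))
p_benign_like = predict_proba(weights, bias, combine_features(0.15, [0.2, 0.85]))
print(p_pathogenic_like, p_benign_like)
```

Concatenation keeps the established score intact as one input feature, so the classifier can fall back on it where the structural features carry little signal.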
New approaches and perspectives
The underlying methods are scalable: databases of structures predicted by models such as AlphaFold or ESMFold are expanding rapidly and already cover a large part of the human proteome. Our approach can therefore, in principle, be applied to all coding variants in the genome.
By the way: The generated protein embeddings are not only useful for interpreting genetic variants but could also be used for other tasks, such as predicting protein functions.
Did you know?
One half of the 2024 Nobel Prize in Chemistry was awarded to Demis Hassabis and John M. Jumper for protein structure prediction with AlphaFold, another groundbreaking model for predicting protein structures; the other half went to David Baker for computational protein design. Such complex AI models require enormous computational power, and modern cloud platforms make it possible to run them efficiently and at scale. For this purpose, we used the Azure Cloud in combination with Databricks to predict the structures of over 60,000 proteins and to train our own machine learning models.
Nobel laureates David Baker, Demis Hassabis, and John M. Jumper. Illustrations: Niklas Elmehed © Nobel Prize Outreach, CC BY-NC-SA
Conclusion
- Protein structure is crucial: The explicit use of 3D structures significantly improves the interpretation of genetic variants.
- Machine learning meets genomic medicine: Graph embeddings enable models to “recognize” new connections beyond the sequence.
- Working together towards a common goal: This work highlights the strength of interdisciplinary collaboration between data scientists, engineers, and domain experts.
Through our contribution and cooperative research, we aim to play a small but important role in addressing current questions in genomic medicine. Our goal is to further improve diagnosis and therapy – especially for people with rare diseases – in the future.
Authors
Martin Danner, Data Scientist at scieneers GmbH
martin.danner@scieneers.de
Dr. Jeremias Krause, resident physician, UKA
jerkrause@ukaachen.de