Machine learning workflow for evaluating genetic variants based on protein structures

06.08.2025 – approx. 15 min. read

A collaboration between the Center for Human Genetics and Genomic Medicine, University Hospital RWTH Aachen, and scieneers GmbH

Why are some genetic variants harmless, while others cause diseases?

This question is one of the greatest challenges in genomic medicine. Especially in the case of rare diseases – which mostly affect children – it is often difficult to identify the truly disease-causing variant among millions of changes in our genome.

Many of these cases involve so-called missense variants: minimal changes in the DNA that lead to the substitution of a single amino acid in the protein. Assessing the impact of these small changes on protein function is often difficult. Even modern computational prediction tools frequently give inconclusive results, so many variants end up classified as “variants of uncertain significance” (VUS), with far-reaching consequences for affected patients and treating physicians.
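To make the idea of a missense variant concrete, here is a minimal Python sketch that swaps a single amino acid in a protein sequence. The short sequence, the variant "R5H" and the helper name `apply_missense` are invented for illustration and are not taken from our paper or from any real gene.

```python
# Illustrative only: the sequence and the variant below are made up.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard one-letter codes


def apply_missense(sequence: str, variant: str) -> str:
    """Return the sequence with exactly one residue substituted.

    `variant` uses the short form <ref><1-based position><alt>, e.g. "R5H".
    """
    ref, alt = variant[0], variant[-1]
    position = int(variant[1:-1])
    if ref not in AMINO_ACIDS or alt not in AMINO_ACIDS:
        raise ValueError(f"unknown amino acid in {variant!r}")
    if ref == alt:
        raise ValueError("a missense variant must change the amino acid")
    if not 1 <= position <= len(sequence):
        raise ValueError("position lies outside the sequence")
    if sequence[position - 1] != ref:
        raise ValueError(f"expected {ref} at position {position}")
    # Everything before the position, the new residue, everything after it.
    return sequence[: position - 1] + alt + sequence[position:]


wild_type = "MKTARLLVA"                    # hypothetical unaltered protein
mutant = apply_missense(wild_type, "R5H")  # arginine -> histidine at position 5
print(wild_type)
print(mutant)
```

The two sequences differ in a single letter, which is exactly why sequence alone often says little about the consequences: what matters is how that one residue changes the folded protein.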

In our joint research project between the Center for Human Genetics and Genomic Medicine at University Hospital RWTH Aachen and scieneers GmbH, we have developed a new machine learning workflow that places the three-dimensional protein structure – a factor often neglected until now – at the center of variant evaluation.


Here’s a direct link to our paper: Utilizing protein structure graph embeddings to predict the pathogenicity of missense variants

Why protein structures are crucial

Most previous prediction models focus on features such as evolutionary conservation, population frequencies, or the amino acid sequence of the underlying protein – the “language” of proteins. However, it is often the 3D structure of a protein that ultimately determines whether and how a genetic variant affects its function, as the properties and roles of a protein arise only through the spatial arrangement of its amino acids.

So far, prediction models have made little use of this structural data. Those that do rely primarily on the structure of the wild-type (unmodified) protein or on highly simplified structural descriptors.

Our approach: Enhancing established models with protein structure embeddings

Genomic medicine meets machine learning:

Using ESMFold, a state-of-the-art protein structure prediction model, we predicted the 3D structures of more than 60,000 altered and unaltered proteins.
These structures were transformed into so-called graph embeddings using graph autoencoder networks, resulting in a compressed yet information-rich representation of complex protein structures.
These embeddings then served as input for our classification models, which can predict whether a variant is likely to be disease-causing.
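The step from a 3D structure to a fixed-length embedding can be sketched roughly as follows. This is a simplified stand-in, not our actual pipeline: it assumes one coordinate per residue, links residues closer than an assumed 8 Å cutoff, and replaces the trained graph autoencoder with a single untrained neighbour-averaging step followed by mean pooling. The function names and the toy coordinates are hypothetical.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def contact_graph(coords, cutoff=8.0):
    """Adjacency list: residues i and j are linked if closer than `cutoff` (in Å)."""
    neighbors = [[] for _ in coords]
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors


def one_hot(residue):
    """Encode one amino acid as a 20-dimensional indicator vector."""
    return [1.0 if residue == letter else 0.0 for letter in ALPHABET]


def aggregate(features, neighbors):
    """One untrained message-passing step: average each node with its neighbours."""
    result = []
    for i, nbrs in enumerate(neighbors):
        group = [features[i]] + [features[j] for j in nbrs]
        result.append([sum(column) / len(group) for column in zip(*group)])
    return result


def graph_embedding(sequence, coords, cutoff=8.0):
    """Fixed-length vector for a structure of any size (mean over all nodes)."""
    neighbors = contact_graph(coords, cutoff)
    node_features = aggregate([one_hot(r) for r in sequence], neighbors)
    return [sum(column) / len(node_features) for column in zip(*node_features)]


# Made-up toy structure: four residues on a straight line, 3.8 Å apart.
sequence = "MKTA"
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
graph = contact_graph(coords)
embedding = graph_embedding(sequence, coords)
print(graph)
print(len(embedding))
```

In a real workflow the encoder would be learned, so that the embedding captures structural differences between the altered and the unaltered protein rather than a plain average of residue types.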

A clear added value: Improving established prediction scores

How great is the practical benefit of this structural information?

We tested our approach by supplementing the well-known CADD score – an established measure for the pathogenicity of genetic variants – with our graph embeddings. The result: The predictions became noticeably more accurate thanks to the additional structural information.

Remarkably, although the CADD score already incorporates sequence-based information (amino acid sequence) from ESM models, the direct integration of 3D structure provided real added value. This highlights that future prediction tools should ideally combine both sequence and structural data.
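Supplementing an existing score with structural embeddings essentially means concatenating both into one feature vector and training a classifier on it. The sketch below illustrates this with a small logistic regression written in plain Python. It is an assumption-laden toy: the classifier type, the scaling of the CADD value, the two-dimensional embeddings and all numbers are invented for illustration and do not reproduce the models or results from our paper.

```python
import math


def build_features(cadd_score, embedding):
    """Concatenate a (crudely rescaled) CADD score with the structure embedding."""
    return [cadd_score / 10.0] + list(embedding)  # division by 10 is an arbitrary choice


def predict_proba(weights, bias, x):
    """Probability that a variant is pathogenic under a logistic model."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weights, bias, X, y):
    """Mean binary cross-entropy over the data set."""
    eps = 1e-12
    total = 0.0
    for x, label in zip(X, y):
        p = predict_proba(weights, bias, x)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return total / len(X)


def train(X, y, lr=0.1, epochs=200):
    """Fit weights and bias with plain batch gradient descent."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, label in zip(X, y):
            error = predict_proba(weights, bias, x) - label
            grad_b += error
            for k, xi in enumerate(x):
                grad_w[k] += error * xi
        bias -= lr * grad_b / len(X)
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
    return weights, bias


# Synthetic toy data: (CADD-like score, 2-d embedding, label); 1 = pathogenic.
samples = [
    (28.0, [0.9, 0.1], 1),
    (25.0, [0.8, 0.3], 1),
    (8.0, [0.2, 0.7], 0),
    (12.0, [0.1, 0.9], 0),
]
X = [build_features(cadd, emb) for cadd, emb, _ in samples]
y = [label for _, _, label in samples]

initial_loss = log_loss([0.0] * len(X[0]), 0.0, X, y)
weights, bias = train(X, y)
final_loss = log_loss(weights, bias, X, y)
print(initial_loss, final_loss)
```

The design point is the feature vector itself: the established score stays in as one input, and the embedding dimensions are added alongside it, so the classifier can draw on both sources of information.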

New approaches and perspectives

The underlying methods are scalable: databases of structures predicted with models such as AlphaFold or ESMFold are expanding rapidly and already cover a large part of the human proteome. Our approach can therefore, in principle, be applied to all coding variants in the genome.

By the way: The generated protein embeddings are not only useful for interpreting genetic variants but could also be used for other tasks, such as predicting protein functions.

Did you know?

One half of the 2024 Nobel Prize in Chemistry was awarded for the development of AlphaFold, another groundbreaking model for predicting protein structures; the other half went to David Baker for computational protein design. Such complex AI models require enormous computational power, and modern cloud platforms make it possible to run them efficiently and at scale. For this purpose, we used the Azure Cloud in combination with Databricks to predict the structures of over 60,000 proteins and to train our own machine learning models.

Nobel laureates David Baker, Demis Hassabis, and John M. Jumper. Illustrations: Niklas Elmehed © Nobel Prize Outreach, CC BY-NC-SA

Conclusion

Protein structure is crucial: The explicit use of 3D structures significantly improves the interpretation of genetic variants.
Machine learning meets genomic medicine: Graph embeddings enable models to “recognize” new connections beyond the sequence.
Working together towards a common goal: This work highlights the strength of interdisciplinary collaboration between data scientists, engineers, and domain experts.

Through our contribution and cooperative research, we aim to play a small but important role in addressing current questions in genomic medicine. Our goal is to help improve diagnosis and therapy in the future, especially for people with rare diseases.

Authors

Martin Danner, Data Scientist at scieneers GmbH
martin.danner@scieneers.de

Dr. Jeremias Krause, resident physician, UKA
jerkrause@ukaachen.de
