Finding the Needle in the Genomic Haystack

Genomic Medicine Meets Machine Learning

17 June 2026 – approx. 7 min read

Your genome contains roughly 3.5 million genetic variants.

Most of them do absolutely nothing.

But sometimes a single letter change in DNA can cause a life‑threatening disease.

Finding that one variant among millions is one of the central challenges in genomic medicine.

When I started working at the Centre of Human Genetics and Genomic Medicine, one number completely surprised me:

350 million.

That’s how many people worldwide live with a rare disease.

In the EU, a disease is defined as rare if it affects no more than 1 in 2,000 people. Individually, each of these diseases is uncommon, but together rare diseases affect millions. (Source: European Commission)

In Germany alone, around 4 million people live with one.

🧬 Many rare diseases come down to a single genetic change

What many people don’t realize is that most rare diseases have a genetic origin, and many of them can be traced back to a single genetic variant.

One of the most common types is called a single nucleotide variant (SNV) – a change in just one letter of our DNA.

Our DNA consists of four nucleotides:

A – Adenine
T – Thymine
G – Guanine
C – Cytosine

During protein biosynthesis, this DNA sequence is transcribed into messenger RNA (mRNA) and then translated into proteins – the molecular machines that carry out almost every function in our cells.

Figure: Missense variants explained through protein biosynthesis. On the right, a single nucleotide variant is shown that results in a missense variant. Other variant types include nonsense, synonymous, and loss‑of‑function variants.

If a single nucleotide changes, the resulting protein can also change.

These variants can have different consequences, for example:

Missense variants change one amino acid in the protein
Nonsense variants introduce a premature stop signal
Synonymous variants leave the amino acid unchanged
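To make the three categories above concrete, here is a minimal Python sketch that classifies a single-letter change within one codon. It uses the standard genetic code, written for DNA codons on the coding strand (T instead of U). The function name `classify_snv` and the example codons are my own choices for illustration, and real annotation tools also take transcripts, strand, and splice sites into account.

```python
from itertools import product

# Standard genetic code for codons in the order TTT, TTC, TTA, TTG, TCT, ...
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snv(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-base change inside one codon (illustrative helper)."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    if len(ref_codon) != 3 or len(alt_codon) != 3:
        raise ValueError("codons must be exactly three bases long")
    differences = sum(r != a for r, a in zip(ref_codon, alt_codon))
    if differences != 1:
        raise ValueError("expected codons that differ in exactly one base")

    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid, protein unchanged
    if alt_aa == "*":
        return "nonsense"     # premature stop signal
    if ref_aa == "*":
        return "stop-lost"    # not covered in the text, listed for completeness
    return "missense"         # one amino acid replaced by another


print(classify_snv("GAG", "GTG"))  # missense: Glu -> Val
print(classify_snv("GAG", "TAG"))  # nonsense: Glu -> stop
print(classify_snv("GAG", "GAA"))  # synonymous: Glu -> Glu
```

Note that the same starting codon can lead to all three outcomes, depending only on which letter changes and what it changes to.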

The tricky part is that not every change is harmful.

In fact, the average person carries more than 9,000 missense variants, and the vast majority of them are completely benign. (Source: Exome sequencing and analysis of 454,787 UK Biobank participants)

So when clinicians analyze a patient’s genome, they are essentially searching for one disease‑causing variant among millions.

A true needle‑in‑the‑haystack problem.

🧠 This is where machine learning enters the picture

And the problem becomes even harder when we consider how much of the genome we still don’t fully understand.

Only about 2% of the genome directly codes for proteins. The remaining 98% – sometimes called the “dark genome” – is much harder to interpret, even though variants there can also cause disease. (Source: Shedding light on the dark genome)

To tackle this challenge, researchers increasingly rely on machine learning models that estimate whether a variant is likely pathogenic. This field is called variant effect prediction.

Tools like AlphaMissense, CADD, SIFT, REVEL, or SpliceAI help prioritize variants and narrow down the search.

One particularly interesting model among them is AlphaMissense, developed by DeepMind.

AlphaMissense builds on the ideas behind AlphaFold, the protein structure prediction system that earned Demis Hassabis and John Jumper the Nobel Prize in Chemistry in 2024. While AlphaFold predicts the three‑dimensional structure of proteins from their amino acid sequence, AlphaMissense was specifically designed to assess the impact of missense variants.

As we learned earlier, missense variants change a single amino acid in a protein. Whether that change is harmful depends largely on how it affects the structure and function of the protein.

AlphaMissense estimates exactly that. The model assigns a score between 0 and 1, reflecting the likelihood that a missense variant is pathogenic rather than benign. These predictions help clinicians and researchers prioritize variants when analyzing patient genomes, bringing us a little closer to finding the needle in the genomic haystack.
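In practice, such a score is usually turned into a coarse label. The sketch below shows one way to do that in Python. The cut-offs 0.34 and 0.564 are the ones I recall from the AlphaMissense publication and are an assumption here, so please check them against the official documentation before relying on them. The function name `score_to_class` is hypothetical.

```python
def score_to_class(score: float,
                   benign_below: float = 0.34,
                   pathogenic_above: float = 0.564) -> str:
    """Map a pathogenicity score in [0, 1] to a coarse label.

    The default cut-offs are assumptions taken from memory of the
    AlphaMissense paper, not values stated in this article.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    if score < benign_below:
        return "likely_benign"
    if score > pathogenic_above:
        return "likely_pathogenic"
    return "ambiguous"


for example_score in (0.05, 0.45, 0.93):
    print(example_score, score_to_class(example_score))
```

The middle band matters: a score in the ambiguous range is not evidence either way, and such variants still need other lines of evidence.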

When DeepMind released AlphaMissense, they also published predictions for 71 million possible missense variants across the coding region of our genome. (Source: A catalogue of genetic mutations to help pinpoint the cause of diseases)

Even though the model weights themselves were not released (i.e., the model is not fully open‑sourced), DeepMind did provide a reference implementation of AlphaMissense on GitHub for researchers who want to explore the methodology:

👉 https://github.com/google-deepmind/alphamissense
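Because the predictions are published as a precomputed table, prioritizing a patient's missense variants can be sketched as a simple lookup, without running the model at all. The following Python example uses a tiny in-memory table with made-up rows and a simplified, hypothetical column layout (`chrom`, `pos`, `ref`, `alt`, `score`). The real files are far larger and their column names may differ, so treat this as an illustration of the idea only.

```python
import csv
import io

# Made-up example rows in a simplified, hypothetical layout.
# These are NOT real AlphaMissense predictions.
PREDICTIONS_TSV = (
    "chrom\tpos\tref\talt\tscore\n"
    "chr1\t1000\tA\tG\t0.08\n"
    "chr2\t2000\tC\tT\t0.97\n"
    "chr3\t3000\tG\tA\t0.45\n"
)


def load_scores(tsv_text: str) -> dict:
    """Read the table into a dict keyed by (chrom, pos, ref, alt)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        (row["chrom"], int(row["pos"]), row["ref"], row["alt"]): float(row["score"])
        for row in reader
    }


def prioritize(patient_variants: list, scores: dict) -> list:
    """Return the variants that have a score, highest score first."""
    scored = [(variant, scores[variant]) for variant in patient_variants if variant in scores]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical patient variants; the last one has no precomputed score.
patient_variants = [
    ("chr1", 1000, "A", "G"),
    ("chr2", 2000, "C", "T"),
    ("chr9", 9000, "T", "C"),
]
ranked = prioritize(patient_variants, load_scores(PREDICTIONS_TSV))
for variant, score in ranked:
    print(variant, score)
```

Variants without an entry are dropped here for brevity. In a real workflow they should be kept and handed to other tools, since a missing score says nothing about pathogenicity.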

Of course, models like AlphaMissense also have limitations.

They are specifically designed for missense variants in protein‑coding regions. This means they cannot directly assess, for example:

variants in non‑coding regions
insertions and deletions (indels)
larger structural variants
or variants that affect gene regulation or splicing

Each of these variant types requires different computational approaches.
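A first step in many pipelines is therefore to sort variants by type before choosing a tool. The Python sketch below infers a rough type from VCF-style reference and alternate alleles and checks whether a missense predictor could apply at all. The function names and the `in_coding_region` flag are hypothetical, and the rules are deliberately simplified.

```python
def variant_type(ref: str, alt: str) -> str:
    """Infer a rough variant type from VCF-style REF and ALT alleles."""
    ref, alt = ref.upper(), alt.upper()
    if alt.startswith("<"):
        return "structural"   # symbolic allele such as <DEL> or <DUP>
    if len(ref) == 1 and len(alt) == 1:
        return "snv"          # single nucleotide variant
    if len(ref) == len(alt):
        return "mnv"          # several adjacent bases substituted
    return "insertion" if len(alt) > len(ref) else "deletion"


def candidate_for_missense_predictor(ref: str, alt: str, in_coding_region: bool) -> bool:
    """Necessary, not sufficient: a coding SNV may still be synonymous or nonsense."""
    return variant_type(ref, alt) == "snv" and in_coding_region


print(variant_type("A", "T"))       # snv
print(variant_type("A", "ATG"))     # insertion
print(variant_type("ATG", "A"))     # deletion
print(variant_type("N", "<DEL>"))   # structural
print(candidate_for_missense_predictor("A", "T", in_coding_region=False))  # False
```

Everything that falls outside the last check needs a different class of tool, for example a splicing predictor or a structural variant caller.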

🩸 A Real‑World Example: Sickle Cell Disease

But what does a real disease caused by a single nucleotide variant actually look like?

One of the most fascinating examples is sickle cell disease: a single mutation causes a severe genetic disorder but at the same time can offer some protection against malaria.

Interestingly, this was exactly the disease that came up in conversations during a workshop I attended two years ago at CSIR‑IGIB in New Delhi.
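To show how small this change really is, here is a short Python sketch that compares two sequence fragments and reports the single differing base. The fragment is the beginning of the coding sequence of the beta-globin gene (HBB) as it is commonly cited; treat it as an illustration rather than a clinical reference. In the sickle cell variant, the codon GAG (glutamate) becomes GTG (valine). The function name `find_snv` is my own.

```python
# Start of the HBB coding sequence (illustrative fragment) and the same
# fragment carrying the sickle cell variant.
HBB_REFERENCE = "ATGGTGCATCTGACTCCTGAGGAGAAGTCT"
HBB_SICKLE = "ATGGTGCATCTGACTCCTGTGGAGAAGTCT"


def find_snv(reference: str, sample: str) -> dict:
    """Locate the single base that differs between two equally long sequences."""
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length")
    differing = [i for i, (r, s) in enumerate(zip(reference, sample)) if r != s]
    if len(differing) != 1:
        raise ValueError("expected exactly one differing base")

    index = differing[0]
    codon_start = index - index % 3  # start of the codon containing the change
    return {
        "position": index + 1,            # 1-based position in the fragment
        "ref_base": reference[index],
        "alt_base": sample[index],
        "codon_number": index // 3 + 1,   # 1-based, counting the start codon
        "ref_codon": reference[codon_start:codon_start + 3],
        "alt_codon": sample[codon_start:codon_start + 3],
    }


snv = find_snv(HBB_REFERENCE, HBB_SICKLE)
print(snv)
```

One letter out of thirty differs in this fragment, and one letter out of billions in the full genome.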

Our Impact

In our joint research with the Centre of Human Genetics and Genomic Medicine at Uniklinik RWTH Aachen, we develop and apply machine learning methods to gain new insights from genomic data and support better healthcare and patient outcomes. We also contribute by maintaining and expanding the cloud infrastructure and data lakehouse architecture that enable scalable genomic data analysis.

If you are interested in this work or in potential collaborations, feel free to get in touch to learn more.

Author

Martin Danner, Data Scientist & ML Engineer at scieneers GmbH

martin.danner@scieneers.de
