Finding the Needle in the Genomic Haystack
Martin Danner, 2026-06-17
Genomic Medicine Meets Machine Learning
Your genome contains roughly 3.5 million genetic variants.
Most of them do absolutely nothing.
But sometimes a single letter change in DNA can cause a life‑threatening disease.
Finding that one variant among millions is one of the central challenges in genomic medicine.
When I started working at the Centre of Human Genetics and Genomic Medicine, one number completely surprised me:
350 million.
That’s how many people worldwide live with a rare disease.
In the EU, a disease is defined as rare if it affects fewer than 1 in 2,000 people. Individually, each of these diseases sounds uncommon, but together rare diseases affect millions. (Source: European Commission)
In Germany alone, around 4 million people live with one.
🧬 Many rare diseases come down to a single genetic change
What many people don’t realize is that most rare diseases have a genetic cause, and many of them trace back to a single genetic variant.
One of the most common types is called a single nucleotide variant (SNV) – a change in just one letter of our DNA.
Our DNA consists of four nucleotides:
- A – Adenine
- T – Thymine
- G – Guanine
- C – Cytosine
During protein biosynthesis, this DNA sequence is transcribed into messenger RNA (mRNA) and then translated into proteins – the molecular machines that carry out almost every function in our cells.
Figure: Missense variants explained through protein biosynthesis. On the right, a single nucleotide variant is shown that results in a missense variant. Other variant types include nonsense, synonymous, and loss‑of‑function variants.
If a single nucleotide changes, the resulting protein can also change.
These variants can have different consequences, for example:
- Missense variants change one amino acid in the protein
- Nonsense variants introduce a premature stop signal
- Synonymous variants leave the amino acid unchanged
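To make the three categories above concrete, here is a minimal sketch that translates a reference codon and a variant codon with the standard genetic code and labels the change. The function name `classify_codon_change` and the extra `stop_lost` label are my own choices for illustration, and real annotation tools also take transcripts, strand, and splicing into account.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Label a single-nucleotide change within one codon (illustrative helper)."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    if len(ref_codon) != 3 or len(alt_codon) != 3:
        raise ValueError("codons must be exactly three nucleotides long")
    differing = sum(r != a for r, a in zip(ref_codon, alt_codon))
    if differing != 1:
        raise ValueError("expected codons that differ at exactly one position")

    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"   # amino acid unchanged
    if alt_aa == "*":
        return "nonsense"     # premature stop signal
    if ref_aa == "*":
        return "stop_lost"    # original stop codon removed (not covered in the list above)
    return "missense"         # one amino acid replaced by another


if __name__ == "__main__":
    for ref, alt in [("GAG", "GTG"), ("GAG", "TAG"), ("GAG", "GAA")]:
        print(ref, "->", alt, ":", classify_codon_change(ref, alt))
```

The first pair, GAG to GTG, swaps glutamic acid for valine and is therefore a missense change; the second creates a stop codon, and the third leaves the amino acid untouched.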
The tricky part is that not every change is harmful.
In fact, the average person carries more than 9,000 missense variants, and the vast majority of them are completely benign. (Source: Exome sequencing and analysis of 454,787 UK Biobank participants)
So when clinicians analyze a patient’s genome, they are essentially searching for one disease‑causing variant among millions.
A true needle‑in‑the‑haystack problem.
🧠 This is where machine learning enters the picture
The problem becomes even harder when we consider how much of the genome we still don’t fully understand.
Only about 2% of the genome directly codes for proteins. The remaining 98% – sometimes called the “dark genome” – is much harder to interpret, even though variants there can also cause disease. (Source: Shedding light on the dark genome)
To tackle this challenge, researchers increasingly rely on machine learning models that estimate whether a variant is likely pathogenic. This field is called variant effect prediction.
Tools like AlphaMissense, CADD, SIFT, REVEL, or SpliceAI help prioritize variants and narrow down the search.
One particularly interesting model among them is AlphaMissense, developed by DeepMind.
AlphaMissense builds on the ideas behind AlphaFold, the protein structure prediction system that earned Demis Hassabis and John Jumper a share of the Nobel Prize in Chemistry in 2024. While AlphaFold predicts the three‑dimensional structure of proteins from their amino acid sequence, AlphaMissense was specifically designed to assess the impact of missense variants.
As we learned earlier, missense variants change a single amino acid in a protein. Whether that change is harmful depends largely on how it affects the structure and function of the protein.
AlphaMissense estimates exactly that. The model assigns a score between 0 and 1, reflecting the likelihood that a missense variant is pathogenic rather than benign. These predictions help clinicians and researchers prioritize variants when analyzing patient genomes, bringing us a little closer to finding the needle in the genomic haystack.
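As a rough illustration of how such a score can be used for prioritization, the sketch below bins scores into three classes and ranks a handful of variants. The cutoffs 0.34 and 0.564 are the thresholds I recall being reported alongside the AlphaMissense release; treat them as assumptions and check them against the official documentation. The variant labels and scores are invented placeholders, not real predictions.

```python
def score_to_class(
    score: float,
    benign_below: float = 0.34,       # assumed cutoff, verify before real use
    pathogenic_above: float = 0.564,  # assumed cutoff, verify before real use
) -> str:
    """Map a pathogenicity score in [0, 1] to a coarse class (illustrative)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    if score < benign_below:
        return "likely_benign"
    if score > pathogenic_above:
        return "likely_pathogenic"
    return "ambiguous"


def prioritize(scored_variants: dict[str, float]) -> list[tuple[str, float, str]]:
    """Sort variants from highest to lowest score and attach their class."""
    ranked = sorted(scored_variants.items(), key=lambda item: item[1], reverse=True)
    return [(name, score, score_to_class(score)) for name, score in ranked]


if __name__ == "__main__":
    # Placeholder names and scores for illustration only.
    toy_scores = {"variant_A": 0.12, "variant_B": 0.91, "variant_C": 0.45}
    for name, score, label in prioritize(toy_scores):
        print(f"{name}\t{score:.2f}\t{label}")
```

A ranking like this only narrows the search; the final interpretation still needs clinical and genetic context.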
When DeepMind released AlphaMissense, they also published predictions for 71 million possible missense variants across the coding region of our genome. (Source: A catalogue of genetic mutations to help pinpoint the cause of diseases)
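With tens of millions of rows, such a table is best streamed rather than loaded into memory at once. The following sketch filters a tab-separated table row by row. The column names `protein_variant` and `am_pathogenicity` are assumptions about the file layout, and the small in-memory table contains made-up rows, so adjust both to the actual download.

```python
import csv
import io
from typing import Iterable, Iterator


def high_scoring_variants(
    lines: Iterable[str],
    score_column: str = "am_pathogenicity",  # assumed column name
    min_score: float = 0.564,                # assumed cutoff, adjust as needed
) -> Iterator[dict[str, str]]:
    """Yield rows of a tab-separated table whose score is at least min_score."""
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        if float(row[score_column]) >= min_score:
            yield row


# Made-up toy rows for illustration; these are not real predictions.
TOY_TABLE = (
    "protein_variant\tam_pathogenicity\n"
    "A10V\t0.08\n"
    "G25R\t0.97\n"
    "L40P\t0.61\n"
)

if __name__ == "__main__":
    for row in high_scoring_variants(io.StringIO(TOY_TABLE)):
        print(row["protein_variant"], row["am_pathogenicity"])
```

For a compressed download, the same function could be fed from `gzip.open(path, "rt")`, where `path` is a placeholder for the local file; any license or comment lines ahead of the header row would need to be skipped first.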
Even though the model weights themselves were not released (i.e., the model is not fully open‑sourced), DeepMind did provide a reference implementation of AlphaMissense on GitHub for researchers who want to explore the methodology:
👉 https://github.com/google-deepmind/alphamissense
Of course, models like AlphaMissense also have limitations.
They are specifically designed for missense variants in protein‑coding regions. This means they cannot directly assess, for example:
- variants in non‑coding regions
- insertions and deletions (indels)
- larger structural variants
- or variants that affect gene regulation or splicing
Each of these variant types requires different computational approaches.
🩸 A Real‑World Example: Sickle Cell Disease
But what does a real disease caused by a single nucleotide variant actually look like?
One of the most fascinating examples is sickle cell disease, a severe genetic disorder caused by a mutation that at the same time protects against malaria.
Interestingly, this was exactly the disease that came up in conversations during a workshop I attended two years ago at CSIR‑IGIB in New Delhi.
More on that in the next post 🙂
Our Impact
In our joint research with the Centre of Human Genetics and Genomic Medicine at Uniklinik RWTH Aachen, we develop and apply machine learning methods to gain new insights from genomic data and support better healthcare and patient outcomes. We also contribute by maintaining and expanding the cloud infrastructure and data lakehouse architecture that enable scalable genomic data analysis.
If you are interested in this work or in potential collaborations, feel free to get in touch to learn more.










