GHGA Lecture Series: Gunnar Rätsch and André Kahles (virtual)
- 21 Jan 2026
Gunnar Rätsch and André Kahles from ETH Zürich talked about “Efficient and accurate search in petabase-scale sequence repositories” at the GHGA lecture series "Advances in Data-Driven Biomedicine" on January 21, 2026.
Watch this talk here.
Abstract:
Publicly available sequencing data—spanning DNA, RNA, and proteins across all domains of life—has reached petabase scale, yet much of its scientific value remains locked behind metadata-only search. In this talk, we will present MetaGraph, our framework for enabling full-text search across essentially all public sequence archives. We will discuss how MetaGraph leverages annotated de Bruijn graphs and advanced compression to index 18.8 million sequencing datasets and over 200 billion amino-acid residues while reducing ~67 petabases of raw sequence to a footprint small enough to fit on a handful of consumer drives. We will talk about how this global index supports efficient and sensitive sequence queries—from exact k-mer search to sequence-to-graph alignment—and how it enables practical retrieval of transcript expression, genetic variation, antimicrobial-resistance signatures, or circular RNA junctions at extremely low cost. Importantly, we will also discuss how human genome sequencing data and associated phenotypic characterisations can be represented within this framework, and how such unified representations enable scalable queries across population-scale human datasets while preserving structure and context. By making large-scale sequence search rapid, affordable, and comprehensive, we will show how MetaGraph opens new opportunities for discovery across genomics, metagenomics, transcriptomics, and human genetic studies.
Biography:
Data scientist Gunnar Rätsch develops and applies advanced data analysis and modeling techniques to data from deep molecular profiling, medical and health records, as well as images.
He earned his Ph.D. at the German National Laboratory for Information Technology under supervision of Klaus-Robert Müller and was a postdoc with Bob Williamson and Bernhard Schölkopf. He received the Max Planck Young and Independent Investigator award and was leading the group on Machine Learning in Genome Biology at the Friedrich Miescher Laboratory in Tübingen (2005-2011). In 2012, he joined Memorial Sloan Kettering Cancer Center as Associate Faculty. In May 2016, he and his lab moved to Zürich to join the Computer Science Department of ETH Zürich.
Data scientist André Kahles is since 2016 a member of the Biomedical Informatics group at ETH, where his main focus is research on graph representations of large sequence sets and the analysis of complex sequencing data.
He completed my undergraduate training at Friedrich Schiller University in Jena, Germany, and finished his Diplom thesis as a joint work with the Stockholm Bioinformatics Centre in Sweden. In 2009 he joined the Friedrich Miescher Laboratory of the Max Planck Society in Tübingen, Germany. He moved to the analysis of human transcriptomes when looking into large scale cancer sequencing projects in his second part of the PhD at the Memorial Sloan Kettering Cancer Center in New York City, USA. After graduating in 2014, he stayed two more years in New York, working under a fellowship of the Lucille Castori Center for Microbes, Inflammation and Cancer on efficient data structures for the representation of large collections of mixed sequences, such as whole metagenome sequencing samples.