Study combines machine learning with expanded genetic training sets
A new machine learning method called DeepSomatic can significantly improve the detection of genetic mutations potentially related to cancer, according to a study published in Nature Biotechnology.
Genetic mutations that are acquired during a person’s life, called somatic variants, are often associated with tumors. Detecting these variants is crucial for understanding how cancers develop and for guiding treatments based on an individual’s tumor mutations. Until now, identifying somatic variants was mostly limited to older, short-read sequencing, which struggles in complex or repetitive regions of the genome.
DeepSomatic represents a major advance because it is designed to work equally well with data from both short-read and long-read genetic sequencing technologies. Short-read sequencing creates a large number of short (50 to 300 base pairs) pieces of DNA to analyze, while newer long-read sequencing generates fewer, but much longer, DNA pieces (1 to 100 kilobases, or 1,000 to 100,000 base pairs). This versatility allows researchers to get more precise and complete information from the newer long-read data.
The study team, led by UC Santa Cruz Genomics Institute and Google Research, included researchers from the Translational Genomics Research Institute (TGen), part of City of Hope.
To train and test the new method, the team created and openly released a high-quality dataset called CASTLE (Cancer Standards Long-read Evaluation). This resource includes six matched tumor–normal cell line pairs that were sequenced on multiple platforms, along with reference sets of known variants for comparison.
The power of DeepSomatic was demonstrated in the benchmark data itself. The method increased the number of high-confidence benchmark somatic variants sevenfold over the previous set of somatic variants identified from a single tumor–normal cell line.
“We’ve spent more than two decades mastering short-read sequencing and identifying mutations from that data. The challenge now is that third-generation long-read sequencing produces a very different type of data,” said Floris Barthel, M.D., Ph.D., an assistant professor in TGen’s Bioinnovation and Genome Sciences Division and co-author of the study. “Statistically and mathematically, this shift completely changes the evidence, information, and models needed to detect mutations. DeepSomatic is designed to tackle that challenge by improving how we call variants from long-read sequencing data.”
DeepSomatic is unique in that it works well for identifying or “calling” single nucleotide, insertion, and deletion variants from both short- and long-read genetic sequencing.
Comparing the same stretch of DNA in tumor versus normal sequencing data can be complicated because different sequencing technologies may sample the same stretch of DNA in completely different ways. But it’s a complication that DeepSomatic handles well, Barthel said. “It has a robust statistical model that allows you to call those variants that are present in your tumor but that are not present in your matching normal sample.”
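The tumor-versus-normal comparison Barthel describes can be pictured with a toy sketch. This is not DeepSomatic's method, which is a trained machine learning model; the sketch only shows the basic idea of keeping variants seen in the tumor but absent from the matched normal sample. The `Variant` type and the `candidate_somatic` function are hypothetical names introduced here for illustration, and the sketch assumes both samples have already been reduced to clean lists of variant calls.

```python
from typing import Iterable, List, NamedTuple


class Variant(NamedTuple):
    """A hypothetical, minimal variant record (not a DeepSomatic type)."""
    chrom: str  # chromosome name, e.g. "chr1"
    pos: int    # 1-based position on the chromosome
    ref: str    # reference allele
    alt: str    # alternate allele observed in the reads


def candidate_somatic(tumor: Iterable[Variant],
                      normal: Iterable[Variant]) -> List[Variant]:
    """Return variants found in the tumor calls but not in the normal calls.

    A plain set difference: real somatic callers also weigh read depth,
    sequencing error, and tumor purity, none of which is modeled here.
    """
    normal_set = set(normal)
    return sorted(v for v in set(tumor) if v not in normal_set)


tumor_calls = [
    Variant("chr1", 100, "A", "T"),   # also in normal, so inherited
    Variant("chr2", 200, "G", "GA"),  # tumor only, so a somatic candidate
]
normal_calls = [Variant("chr1", 100, "A", "T")]

somatic = candidate_somatic(tumor_calls, normal_calls)
```

In this toy example only the chr2 insertion survives the comparison. In practice the difficulty is exactly what the article notes: different sequencing technologies sample the same stretch of DNA differently, so a simple subtraction like this is not enough on its own.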
The researchers successfully tested the DeepSomatic model on material from eight pediatric leukemia patients and one glioblastoma patient provided by TGen. The glioblastoma patient’s samples had been thoroughly sequenced using multiple platforms and included both tumor and normal samples. They conclude that the method works extremely well on patient samples, including even difficult archival samples that had been preserved in formalin.
The researchers’ analysis also “revealed quite different mutation patterns in the different cancers” they examined, including glioblastoma and leukemias, they write in the study.
Braden’s Hope for Childhood Cancer, Big Slick, Black & Veatch Foundation, Masonic Cancer Alliance, Noah’s Bandage Project, Elizabeth and Monte McDowell, Cancer Center Auxiliary, the Department of Defense (W81XWH-20-1-0358), and NIH/NCI (U01CA253405, R01HG010485, U41HG010972, U24HG01185, HT9425-23-1-0844, and OT2OD033761) funded this research.
By: Becky Ham | October 16, 2025

