Improving Structural Variation Detection Accuracy in Repetitive Regions Using Hybrid Algorithms and SFS Signatures

SVDSS combines ideas from traditional mapping-based and assembly-based SV detection algorithms with the novel mapping-free Substring-Free Sample-specific Strings (SFS) framework to improve recall by 15 percentage points over state-of-the-art methods when detecting SVs in repetitive regions of the genome.

Although a genome contains orders of magnitude fewer Structural Variants (SVs) than single-nucleotide mutations, SVs affect far more base pairs in total than all other types of genetic variants combined. SVs are known to play a major role in human evolution and disease, yet they remain the least well-characterized type of genetic variant, with many basic questions about them still unresolved despite significant bioinformatics effort [1]. This is mostly due to the complexity of SV detection from short-read sequencing data, in particular in repetitive regions of the genome such as microsatellites and telomeres. Mapping-based algorithms, which analyze read mapping signatures, and assembly-based methods, which compare assembly contigs with reference genomes, remain the standard methods [2], while new mapping-free methods that rely on sequence signatures such as k-mers are also being explored as alternatives [3]. Yet each class of methods suffers from its own shortcomings and technical barriers.

We recently introduced the concept of Substring-Free Sample-specific Strings (SFS) as a novel and optimal framework for mapping-free variant detection [4]. SFS are sequences that are unique to a sample compared to a reference genome. Each SFS therefore represents a variant, although larger variants such as SVs often result in multiple SFS. We combined the SFS concept with ideas from traditional mapping-based approaches and the Partial-Order Alignment (POA) [5] technique to develop a new hybrid method that attempts to overcome the limitations of each class of algorithms and achieve significant improvements in SV discovery performance.
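To make the definition concrete, the following brute-force Python sketch lists the shortest substrings of one read that do not occur in a reference string. It assumes that "substring-free" means no proper substring of a reported string is itself absent from the reference. The function name is hypothetical, reverse complements and sequencing errors are ignored, and the quadratic scan is for illustration only; it is not how SVDSS computes SFS on real data.

```python
def substring_free_specific_strings(read, reference):
    """Return (start, string) pairs for the shortest substrings of `read`
    that are absent from `reference` (toy, brute-force illustration)."""
    found = []
    n = len(read)
    for start in range(n):
        for end in range(start + 1, n + 1):
            candidate = read[start:end]
            if candidate not in reference:
                # `end` is the first failing end for this start, so the
                # prefix candidate[:-1] occurs in the reference. The string
                # is minimal if its suffix candidate[1:] occurs there too.
                if candidate[1:] in reference:
                    found.append((start, candidate))
                break
    return found


# A single substituted base (T -> G at position 3) yields two strings.
print(substring_free_specific_strings("ACGGACGT", "ACGTACGT"))
# [(2, 'GG'), (3, 'GA')]
```

Even this small substitution produces two overlapping strings, which is consistent with the observation that larger variants tend to give rise to several SFS.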

Our method, SVDSS, starts by computing SFS from the target sample against the reference genome. The approximate alignment of each SFS to the reference genome is inferred from the aligned reads. The SFS are then clustered based on position and length so that each cluster represents a potential haplotype allele. Finally, the clusters are assembled with POA, and the resulting assembly contigs are mapped back to the reference genome and analyzed using standard mapping-based detection methods to produce an SV call. SVDSS can currently detect deletions and insertions.
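The clustering step can be pictured with a small greedy sketch. Everything here is an assumption made for illustration: the `PlacedSFS` record, the `cluster_sfs` function, and the `max_gap` and `length_ratio` thresholds are hypothetical and are not taken from the SVDSS implementation. The sketch groups SFS that lie close together on the same chromosome and have similar lengths.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlacedSFS:
    """An SFS with its approximate reference placement (hypothetical record)."""
    chrom: str
    start: int
    length: int  # assumed to be positive


def cluster_sfs(items, max_gap=500, length_ratio=0.7):
    """Greedily group SFS that are nearby and of similar length.

    max_gap and length_ratio are illustrative thresholds, not SVDSS defaults.
    """
    clusters = []
    for item in sorted(items, key=lambda s: (s.chrom, s.start, s.length)):
        # Try the most recently opened clusters first.
        for cluster in reversed(clusters):
            last = cluster[-1]
            if last.chrom != item.chrom or item.start - last.start > max_gap:
                continue
            shorter, longer = sorted((last.length, item.length))
            if shorter / longer >= length_ratio:
                cluster.append(item)
                break
        else:
            # No compatible cluster was found, so open a new one.
            clusters.append([item])
    return clusters


example = [
    PlacedSFS("chr1", 100, 50),
    PlacedSFS("chr1", 120, 55),
    PlacedSFS("chr1", 130, 300),  # nearby, but a very different length
    PlacedSFS("chr1", 5000, 52),  # similar length, but far away
]
print([len(c) for c in cluster_sfs(example)])
# [2, 1, 1]
```

Separating nearby SFS of very different lengths is one simple way to keep two distinct alleles at the same locus in different clusters before each cluster is assembled.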

Figure: Visual overview of SVDSS's compute pipeline.

We compare the performance of this approach to several state-of-the-art methods on a ground-truth SV callset generated by running an assembly-based SV detection pipeline [6] on the assemblies of two GIAB HiFi [7] samples and the T2T assembly for CHM13 [8]. We divide the genome into Tier 1 and Tier 2 regions (based on GIAB definitions), where Tier 2 accounts for repetitive regions of the genome (telomeres, centromeres, satellites, etc.) that are traditionally difficult to genotype and Tier 1 accounts for the remainder of the genome. Each tier includes approximately half of the SVs from the ground-truth callset.

SVDSS achieves a 15-percentage-point increase in recall in Tier 2 regions compared to the best-performing alternative method in the comparison (72% versus 57%) while still achieving the highest precision in these regions. Furthermore, our analysis on the CMRG [9] callset of 250 clinically relevant SVs shows that SVDSS detects the most variants (232, compared to 228 for the second-best method), including 5 SVs that are not detected by other methods.
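As a quick check on these figures, the short Python sketch below recomputes recall from the counts quoted above and separates the absolute (percentage-point) gain from the relative gain. The function and variable names are hypothetical, and the CMRG calculation assumes that all 250 SVs in the callset form the denominator.

```python
def recall(detected, total_in_truth_set):
    """Fraction of ground-truth variants that a caller recovers."""
    return detected / total_in_truth_set


# CMRG counts quoted above, assuming a denominator of 250 SVs.
svdss_cmrg = recall(232, 250)      # 0.928
runner_up_cmrg = recall(228, 250)  # 0.912

# Tier 2 recall values quoted above.
svdss_tier2, best_other_tier2 = 0.72, 0.57

# Absolute gain in percentage points: 15.0
point_gain = round((svdss_tier2 - best_other_tier2) * 100, 1)

# Relative gain over the other method's recall: about 26.3 percent
relative_gain = round((svdss_tier2 / best_other_tier2 - 1) * 100, 1)

print(svdss_cmrg, runner_up_cmrg, point_gain, relative_gain)
```

The 15-point difference is an absolute gain in recall; expressed relative to the other method's 57%, it corresponds to roughly a quarter more true variants recovered.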

Access the full article here.


[1] Mark JP Chaisson, John Huddleston, Megan Y Dennis, Peter H Sudmant, Maika Malig, Fereydoun Hormozdiari, Francesca Antonacci, Urvashi Surti, Richard Sandstrom, Matthew Boitano, et al. Resolving the complexity of the human genome using single-molecule sequencing. Nature, 517(7536):608–611, 2015.

[2] Paul Medvedev, Monica Stanciu, and Michael Brudno. Computational methods for discovering structural variation with next-generation sequencing. Nature methods, 6(11):S13–S20, 2009.

[3] Parsoa Khorsand and Fereydoun Hormozdiari. Nebula: Ultra-efficient mapping-free structural variant genotyper. Nucleic acids research, 49(8):e47–e47, 2021.

[4] Parsoa Khorsand, Luca Denti, Human Genome Structural Variant Consortium, Paola Bonizzoni, Rayan Chikhi, and Fereydoun Hormozdiari. Comparative genome analysis using sample-specific string detection in accurate long reads. Bioinformatics Advances, 1(1):vbab005, 2021.

[5] Christopher Lee, Catherine Grasso, and Mark F Sharlow. Multiple sequence alignment using partial order graphs. Bioinformatics, 18(3):452–464, 2002.

[6] Heng Li, Jonathan M Bloom, Yossi Farjoun, Mark Fleharty, Laura Gauthier, Benjamin Neale, and Daniel MacArthur. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nature methods, 15(8):595–597, 2018.

[7] Justin M Zook, Nancy F Hansen, Nathan D Olson, Lesley Chapman, James C Mullikin, Chunlin Xiao, Stephen Sherry, Sergey Koren, Adam M Phillippy, Paul C Boutros, et al. A robust benchmark for detection of germline large deletions and insertions. Nature biotechnology, 38(11):1347–1355, 2020.

[8] Sergey Nurk, Sergey Koren, Arang Rhie, Mikko Rautiainen, Andrey V Bzikadze, Alla Mikheenko, Mitchell R Vollger, et al. The complete sequence of a human genome. Science, 376(6588):44–53, 2022.

[9] Justin Wagner, Nathan D. Olson, Lindsay Harris, Jennifer McDaniel, Haoyu Cheng, Arkarachai Fungtammasan, Yih-Chii Hwang, Richa Gupta, Aaron M. Wenger, William J. Rowell, Ziad M. Khan, Jesse Farek, Yiming Zhu, Aishwarya Pisupati, Medhat Mahmoud, Chunlin Xiao, Byunggil Yoo, Sayed Mohammad Ebrahim Sahraeian, Danny E. Miller, David Jáspez, José M. Lorenzo-Salazar, Adrián Muñoz-Barrera, Luis A. Rubio-Rodríguez, Carlos Flores, Giuseppe Narzisi, Uday Shanker Evani, Wayne E. Clarke, Joyce Lee, Christopher E. Mason, Stephen E. Lincoln, Karen H. Miga, Mark T. W. Ebbert, Alaina Shumate, Heng Li, Chen-Shan Chin, Justin M. Zook, and Fritz J. Sedlazeck. Curated variation benchmarks for challenging medically relevant autosomal genes. Nature Biotechnology, Feb 2022.
