Not only single nucleotide variants and deletions are important

The appearance of SARS-CoV-2 not only changed our entire way of living, but also allowed us to observe and investigate the evolution of an RNA virus in real time. For the first time, we have an enormous amount of genomic data that allows us to reconstruct the changes the virus acquired over almost two years of evolution and to try to predict, just by comparing viral genomes, the mutations that could lead to variants of concern (VOCs). The first notable change in SARS-CoV-2 was the D614G substitution in the spike glycoprotein (S), which took over early in the pandemic; the Alpha, Beta and now Delta variants followed. The identification of these VOCs was based on single nucleotide substitutions and short deletions. Clearly, these mutations changed the properties of the virus and could result in higher infectivity or partial antibody escape. However, a distinct type of variability, namely insertions of short sequences into the virus genome, has been largely overlooked. Inserts are left out of most analyses for two main reasons: first, they are not always called from the raw data, and second, at the stage of genomic multiple alignment, the length of the alignment is in most cases kept the same as the length of the reference Wuhan strain genome. This makes the analysis computationally more efficient, but leaves inserts out of the equation.
This is an important omission, because insertions appear to be essential for beta-coronavirus evolution. At least three insertions in S and the nucleoprotein (N) that appeared early in sarbecovirus evolution differentiate highly pathogenic beta-coronaviruses (SARS-CoV-1, SARS-CoV-2 and MERS-CoV) from non-pathogenic and mildly pathogenic strains.
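To make the alignment issue concrete, here is a minimal sketch in Python of how inserts vanish when an alignment is kept at reference length, and how they can be recovered instead. It is an illustration only, not the pipeline used in the study: the function names and the toy sequences are made up for this example, and the input is assumed to be a pairwise alignment in which gaps are written as `-`.

```python
from typing import List, Tuple


def project_to_reference(ref_aln: str, qry_aln: str) -> str:
    """Keep only the columns where the reference has a base.

    This mimics a fixed-length alignment: the result is always as long
    as the ungapped reference, so inserted bases are silently dropped.
    """
    return "".join(q for r, q in zip(ref_aln, qry_aln) if r != "-")


def extract_insertions(ref_aln: str, qry_aln: str) -> List[Tuple[int, str]]:
    """Collect insertions relative to the reference.

    Returns (ref_pos, inserted_seq) pairs, where ref_pos is the 1-based
    position of the last reference base before the insert (0 if the
    insert precedes the first reference base).
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")

    insertions: List[Tuple[int, str]] = []
    ref_pos = 0          # number of reference bases seen so far
    current: List[str] = []  # bases of the insert being accumulated

    for r, q in zip(ref_aln, qry_aln):
        if r == "-":
            # Gap in the reference: the query base here is inserted.
            if q != "-":
                current.append(q)
        else:
            # Back on a reference base: close any open insert.
            if current:
                insertions.append((ref_pos, "".join(current)))
                current = []
            ref_pos += 1

    if current:  # insert at the very end of the alignment
        insertions.append((ref_pos, "".join(current)))
    return insertions


# Toy example (hypothetical sequences, not real SARS-CoV-2 data).
ref_aln = "ATGC---GTTA"
qry_aln = "ATGCAAGGTTA"
projected = project_to_reference(ref_aln, qry_aln)
inserts = extract_insertions(ref_aln, qry_aln)
```

In this toy case the projected query is indistinguishable from the reference, while `extract_insertions` still reports the three inserted bases and where they sit.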
In our present work, we tried to address this by compiling a set of inserts from all genomes that were available by mid-June 2021. We found at least 347 distinct inserts, varying in length from 2 to 69 nucleotides. We investigated the molecular mechanisms that likely caused the inserts and showed that short inserts (<9 nt) probably occur due to polymerase slippage, whereas longer inserts are potentially associated with either duplications or polymerase jumps during subgenomic RNA synthesis. We were unable to pinpoint the sources of all the observed insertion events for two main reasons: first, most inserts are short, so it is not always possible to confidently determine their source within the SARS-CoV-2 genome, and second, some of the inserts could come from human RNA, which is hard to demonstrate convincingly.
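The difficulty with short inserts can be illustrated with a small Python sketch. It is not the method used in the study: the helper names are hypothetical, the 9 nt cutoff is simply the threshold quoted above, and the chance-match estimate assumes a random sequence with equal base frequencies, which a real genome is not. The genome length of 29,903 nt is that of the Wuhan-Hu-1 reference sequence.

```python
from typing import List

SHORT_INSERT_CUTOFF_NT = 9    # threshold quoted in the text (<9 nt = short)
REFERENCE_LENGTH_NT = 29903   # length of the Wuhan-Hu-1 reference genome


def classify_insert(insert_seq: str) -> str:
    """Label an insert by length only.

    'short' inserts are the ones attributed above to probable polymerase
    slippage; 'long' inserts are the ones potentially associated with
    duplications or polymerase jumps. Length alone does not prove either.
    """
    return "short" if len(insert_seq) < SHORT_INSERT_CUTOFF_NT else "long"


def find_exact_matches(genome: str, insert_seq: str) -> List[int]:
    """Return all 0-based start positions of exact (possibly overlapping)
    occurrences of insert_seq in genome, as candidate source sites."""
    positions: List[int] = []
    start = genome.find(insert_seq)
    while start != -1:
        positions.append(start)
        start = genome.find(insert_seq, start + 1)
    return positions


def expected_chance_matches(genome_len: int, k: int) -> float:
    """Expected number of exact matches of a given k-mer in a random
    sequence of length genome_len (assumption: uniform base frequencies)."""
    return (genome_len - k + 1) / 4 ** k


# A 6 nt insert is expected to match several places purely by chance,
# whereas a 12 nt insert is not, so only the latter can point to a source.
chance_6 = expected_chance_matches(REFERENCE_LENGTH_NT, 6)
chance_12 = expected_chance_matches(REFERENCE_LENGTH_NT, 12)
```

Under these simplifying assumptions, a 6 nt insert is expected to match the genome several times by chance alone, so an exact match says little about its origin, while a chance match for a 12 nt insert is very unlikely.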
Most importantly, we investigated in detail the inserts that occurred in S, and showed that at least three of them are located in the antibody-binding site of the N-terminal domain (NTD), whereas others are located in NTD loops and might lead to antibody escape and/or T cell evasion. Two inserts, ins214AAG and ins214TDR, belong to lineages A.2.5 and B.1.214.2, respectively, which had been gaining in frequency before the emergence of the Delta variant, possibly indicating that those lineages outcompeted other lineages circulating at the time. The appearance of the same or similar inserts in the Delta background, perhaps by recombination, might have epidemiologically relevant consequences. Recently, an insertion at site 214 of the S protein was found in the new Omicron VOC. Thus, insertions in the SARS-CoV-2 genome appear to merit monitoring, especially at a time when vaccination could select for escape variants.
The full study was published in Communications Biology: