Emu: a novel computational tool for more precise community profiles from full-length 16S sequences

Emu is a novel software tool designed to establish microbial community profiles from full-length 16S rRNA gene sequences produced by error-prone nanopore devices.

Microscopic organisms are ubiquitous. These tiny creatures are constantly interacting with each other and with their environment, both as drivers and passengers of the ecosystem in which they reside. A tremendous amount of effort has been dedicated in recent years to researching the behaviors, functions, and interactions of microscopic communities, highlighting the importance of microbial life. This research has been made possible by significant reductions in the cost of computational power and high-throughput sequencing, as well as by the development of mature analysis techniques. As research in this area continues, there is hope for a deeper understanding of currently unexplained observations in fields such as human health and environmental ecology, where microscopic organisms are hypothesized to play a powerful role. Since its development in 1977, phylogenetic classification based on the 16S ribosomal RNA (rRNA) gene has been a reliable way to characterize diversity in a community of microbes.

Microbial community profiling with the 16S rRNA gene

Microbiome research starts by identifying which microbes are present in the environment. A cost-effective standard for determining the relative abundance of each type of microbe is sequencing of the 16S rRNA gene. In essence, the process consists of targeting and amplifying all the 16S rRNA genes in a given sample, sequencing the amplified genes, classifying (or binning) each sequence as the organism it most likely came from, then counting the number of sequences assigned to each organism. The community profile is then computed: the relative abundance of each organism is the number of 16S rRNA genes counted for that organism divided by the total number of 16S rRNA genes sequenced.
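The counting step described above can be sketched in a few lines of Python. This is only an illustration, assuming each read has already been classified; the function name `relative_abundance` and the example labels are hypothetical and not part of any sequencing pipeline.

```python
from collections import Counter

def relative_abundance(read_labels):
    """Turn one organism label per classified read into a community profile."""
    counts = Counter(read_labels)       # number of reads binned to each organism
    total = sum(counts.values())        # total number of classified reads
    if total == 0:
        return {}                       # no reads, so no profile to report
    return {organism: n / total for organism, n in counts.items()}

# Hypothetical classifications for four reads.
labels = ["E. coli", "E. coli", "B. subtilis", "E. coli"]
profile = relative_abundance(labels)
print(profile)
```

Each value in the returned dictionary is a fraction of the total read count, so the values sum to one.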

Currently, high-throughput sequencing is dominated by Illumina devices, which have proven to be cost-effective and accurate, yet limited to paired-end reads of roughly 300 to 500 nucleotides. Since the 16S rRNA gene is approximately 1,500 nucleotides in length, and each amplicon requires conserved primers to sequence the 300 to 600 bp DNA fragments, scientists need to carefully select only a portion of the gene to target, which limits the amount of information given by each rRNA sequence. In short, sequences can be accurately classified at the genus-level taxonomic rank at best, and represent only a fraction of a single gene from a microbial genome, which ultimately restricts downstream analyses.

Comparison of 16S rRNA sequencing pipelines completed by Illumina short-read and Oxford Nanopore Technologies (ONT) long-read devices.

Oxford Nanopore Technologies (ONT) recently released a protocol that allows for the affordable sequencing of the entire 16S rRNA gene on their nanopore sequencers, at a similar cost to Illumina machines. Yet ONT devices come with a major obstacle: much higher error rates than short-read technologies such as Illumina. These erroneous sequences can contain nucleotides that were not present in the true 16S rRNA gene, fail to detect nucleotides that were actually present, or simply report the wrong nucleotide. Because the error rate exceeds the nucleotide difference between the 16S rRNA genes of two distinct species, directly applying Illumina computational pipelines to ONT sequences will produce inaccurate community profiles.
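To make the three error types concrete (inserted, missed, and wrong nucleotides), here is a toy Python simulation. It is a sketch under loud assumptions: the function name `add_sequencing_errors`, the 10% error rate, and the equal split between error types are all hypothetical choices for illustration and do not describe the real ONT error profile.

```python
import random

def add_sequencing_errors(seq, error_rate, rng):
    """Corrupt a DNA string with insertions, deletions, and substitutions.

    Assumption: each base is hit independently with probability `error_rate`,
    and the three error types are equally likely.
    """
    out = []
    for base in seq:
        if rng.random() >= error_rate:
            out.append(base)                      # base read correctly
            continue
        kind = rng.choice(["insertion", "deletion", "substitution"])
        if kind == "insertion":
            out.append(base)
            out.append(rng.choice("ACGT"))        # extra base not in the true gene
        elif kind == "substitution":
            out.append(rng.choice([b for b in "ACGT" if b != base]))  # wrong base
        # deletion: the true base is skipped entirely
    return "".join(out)

rng = random.Random(0)                            # seeded for repeatability
true_seq = "ACGTACGTACGTACGTACGT"
noisy_seq = add_sequencing_errors(true_seq, error_rate=0.1, rng=rng)
print(true_seq)
print(noisy_seq)
```

If two species differ at fewer positions than a read is expected to have errors, a noisy read from one species can easily look closer to the other, which is the classification problem described above.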

Several methods of achieving accurate, full-length 16S rRNA gene reads have been validated [5-7]. These synthetic long-read (SLR) and assembly-based approaches paved a path forward and highlighted the power of utilizing the full length of the 16S rRNA gene. However, while promising, these approaches come with an additional financial burden, limiting their application in large-scale survey studies and in research labs with limited budgets. Thus, we have developed software built around an iterative statistical algorithm that corrects inaccurate classifications caused by sequencing error. Our method, Emu, has been shown to generate accurate species-level community profiles in our simulated and real data sets with single-pass full-length rRNA gene reads.

Emu algorithm

Emu was given its name because it uses an expectation-maximization (EM) approach for microscopic (mu, 𝛍) organisms, yielding EM + mu, or Emu for short. The algorithm is built on the idea that an unclassified sequence is more likely to come from an organism that is expected to be in the sample at high abundance than from an organism that is detected at low abundance or not at all. Emu first produces read classifications and a direct-proportion community profile, which is expected to contain some inaccuracies due to error-prone sequences. This initial profile is then used to update the read classification likelihoods, giving more weight to higher-abundance species. The updated read classification likelihoods in turn update the community profile. This process continues until only marginal changes are made to the community profile between iterations. Finally, a clean-up step removes low-likelihood species, and the final community profile estimate is returned.
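The loop described above can be illustrated with a minimal expectation-maximization sketch in plain Python. This is not Emu's actual implementation: the function name `em_profile`, the input format (one row per read holding a hypothetical likelihood of that read under each species), the uniform starting profile, the stopping tolerance, and the clean-up threshold are all assumptions made for illustration.

```python
def em_profile(likelihoods, tol=1e-6, max_iter=1000, min_abundance=0.01):
    """Estimate species abundances from per-read likelihoods with a simple EM loop.

    likelihoods[i][j] is an assumed P(read i | species j).
    """
    n_species = len(likelihoods[0])
    # Assumption: start from a uniform profile, so the first pass reflects
    # the raw read classifications alone.
    abundance = [1.0 / n_species] * n_species
    for _ in range(max_iter):
        totals = [0.0] * n_species
        for row in likelihoods:
            # E-step: weight each read's likelihoods by the current abundances.
            weighted = [p * a for p, a in zip(row, abundance)]
            norm = sum(weighted)
            if norm == 0.0:
                continue                 # read matches nothing in the profile
            for j, w in enumerate(weighted):
                totals[j] += w / norm
        assigned = sum(totals)
        if assigned == 0.0:
            raise ValueError("no read could be assigned to any species")
        # M-step: the new profile is the average read assignment per species.
        updated = [t / assigned for t in totals]
        change = max(abs(u - a) for u, a in zip(updated, abundance))
        abundance = updated
        if change < tol:                 # only marginal change: stop iterating
            break
    # Clean-up: drop low-abundance species and renormalize the rest.
    kept = [a if a >= min_abundance else 0.0 for a in abundance]
    kept_total = sum(kept)
    if kept_total == 0.0:
        return abundance
    return [k / kept_total for k in kept]

# Hypothetical data: 8 reads that slightly favor species 0 over a close
# relative (species 1), and 2 reads that clearly belong to species 2.
reads = [[0.6, 0.4, 0.0]] * 8 + [[0.0, 0.0, 1.0]] * 2
profile = em_profile(reads)
print(profile)
```

In this toy input no single read separates species 0 from species 1 decisively, but every iteration shifts weight toward the species with more support, so the estimate should settle close to 80% species 0, 0% species 1, and 20% species 2, with species 1 removed by the clean-up step.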

Simplified schematic of the expectation-maximization error-correction method embedded in Emu.

In summary, Emu has been demonstrated to be a useful software tool for full-length 16S rRNA analyses. For the first time, it has been shown to produce species-level profiles from single-pass 16S rRNA reads. This reduces the cost and complexity of species-level microbiome analysis compared with previous strategies. Emu leverages the novel handheld ONT MinION device, giving research groups the flexibility to perform both in-house and offline sequencing. As a broader message, Emu shows how mathematical algorithms can be used to overcome limitations imposed by hardware as well as by biological and physical systems. Our hope is that Emu can support a range of microbiome research studies, giving more accurate insights into the small yet mighty organisms we encounter every day.

[5] Callahan et al., “Ultra-Accurate Microbial Amplicon Sequencing with Synthetic Long Reads.”

[6] Callahan et al., “High-Throughput Amplicon Sequencing of the Full-Length 16S rRNA Gene with Single-Nucleotide Resolution.”

[7] Karst et al., “High-Accuracy Long-Read Amplicon Sequences Using Unique Molecular Identifiers with Nanopore or PacBio Sequencing.”