In recent years, microbiome research has emerged as a groundbreaking field with the potential to revolutionize areas such as agriculture, environmental restoration, drug discovery, and human health. 'Who are they?' is one of the first and most important questions to ask in microbiome research. Fortunately, the advent of high-throughput sequencing has brought about a seismic shift in the field. Gone are the days when we had to rely on laborious and time-consuming culturing methods to investigate microbial compositions. Nowadays, we can swiftly answer the 'who are they?' question by analyzing massive amounts of sequencing data.
After obtaining this massive volume of reads, the next critical step is to decode them and quantify the microbial composition. We have two options at this stage: assemble the reads first or take an assembly-free approach. We believe that the assembly-free approach, implemented by metagenomic profilers, will eventually be more beneficial for clinical and industrial applications. This belief stems from the explosion of microbial genome information in publicly available databases, which is the foundation on which metagenomic profilers operate.
Despite their promise, current metagenomic profilers have limitations. A recent benchmarking study, CAMI2, revealed that no metagenomic profiler excelled in taxon identification and abundance estimation at the species level. This bottleneck is largely due to the profilers' reliance on universal single-copy markers or whole microbial genomes as references. Such references often cause problems: markers can be missing or indistinguishable when the reference database is constructed, and short reads can align to multiple conserved regions during read alignment.
To address these challenges, we focused on the Type IIB restriction system. It is well known that the endonucleases of Type IIB restriction-modification systems differ from all other restriction enzymes. In particular, Type IIB enzymes cleave DNA at fixed positions on both sides of their recognition site, excising the site within DNA fragments of identical length (iso-length fragments). In a previous study, we demonstrated that Type IIB restriction sites are widely and randomly distributed along microbial genomes (Sun et al., Genome Biology, 2022). We discovered that species-specific Type IIB restriction endonuclease digestion sites (or IIB fragments/tags) far outnumber universal single-copy markers and naturally avoid the multi-alignment problem. As a result, we developed MAP2B (MetAgenomic Profiler based on type IIB restriction site). This novel metagenomic profiler can effectively eliminate false positives and generate more precise and more accurate taxonomic profiles from Whole Metagenome Sequencing (WMS) data.
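To make the idea of iso-length, species-specific tags concrete, here is a minimal Python sketch of an in silico Type IIB digestion. It is not taken from the MAP2B code. The recognition pattern and flank length are illustrative parameters loosely modeled on BcgI and should be treated as assumptions, the function names `iib_tags` and `species_specific_tags` are hypothetical, and only the forward strand is scanned for simplicity.

```python
import re

# Illustrative parameters (assumptions, not from the MAP2B source):
# a recognition site of the form CGA-N6-TGC with 10 bp kept on each side.
RECOGNITION = "CGA[ACGT]{6}TGC"
FLANK = 10


def iib_tags(genome, recognition=RECOGNITION, flank=FLANK):
    """Return the set of fixed-length tags (flank + site + flank) in a genome.

    Only the forward strand is scanned; a real digestion would also
    consider the reverse complement.
    """
    # A lookahead is used so that overlapping sites are reported as well.
    pattern = re.compile(
        f"(?=([ACGT]{{{flank}}}{recognition}[ACGT]{{{flank}}}))"
    )
    return {match.group(1) for match in pattern.finditer(genome.upper())}


def species_specific_tags(genomes):
    """Keep, for each species, only the tags absent from every other species."""
    tags = {name: iib_tags(seq) for name, seq in genomes.items()}
    specific = {}
    for name, own in tags.items():
        others = set().union(*(t for n, t in tags.items() if n != name))
        specific[name] = own - others
    return specific
```

Because every tag has the same length and only tags unique to one species are kept, a read that matches a retained tag points to a single species, which is how this kind of reference sidesteps multi-alignment in principle.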
Our benchmarking on simulated datasets with varying sequencing depths and species richness showed that MAP2B outperforms existing metagenomic profilers in species identification. Further tests using real WMS data from an ATCC mock community confirmed that its precision remains superior across sequencing depths. Additionally, using WMS data from an Inflammatory Bowel Disease (IBD) cohort, we demonstrated that the taxonomic features identified by MAP2B could better discriminate IBD patients from healthy controls and better predict metabolomic profiles.
Our previous study demonstrated that the decoding of whole metagenomic sequencing data is also hindered by confusion between sequence abundance and taxonomic abundance (Sun et al., Nature Methods, 2021). Here, sequence abundance refers to the proportion of DNA, whereas taxonomic abundance refers to the proportion of individuals/cells. We presented compelling evidence that using sequence abundance and taxonomic abundance interchangeably influences both per-sample summary statistics and cross-sample comparisons. While most metagenomic profilers, such as Bracken and Kraken, report sequence abundance, taxonomic abundance may be the more clinically and ecologically relevant quantity. Hence, there is a pressing need for metagenomic profilers that can generate taxonomic abundance. Notably, MAP2B is one of the few profilers able to produce taxonomic abundance, which sets it apart from existing tools. MAP2B can be accessed at: https://github.com/sunzhengCDNM/MAP2B.
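The difference between the two abundance types can be illustrated with a small Python sketch. It uses a common simplification: dividing read counts by genome length to approximate per-cell coverage. This is an assumption for illustration, not necessarily how MAP2B computes taxonomic abundance, and it ignores ploidy, copy number, and DNA extraction bias. The function names and the example species are hypothetical.

```python
def sequence_abundance(read_counts):
    """Proportion of sequenced DNA (reads) assigned to each species."""
    total = sum(read_counts.values())
    return {name: count / total for name, count in read_counts.items()}


def taxonomic_abundance(read_counts, genome_lengths):
    """Approximate proportion of cells per species.

    Reads are normalized by genome length first, so a species with a
    large genome is not over-counted simply because it contributes
    more DNA per cell.
    """
    coverage = {
        name: read_counts[name] / genome_lengths[name] for name in read_counts
    }
    total = sum(coverage.values())
    return {name: cov / total for name, cov in coverage.items()}


# Two hypothetical species present in equal cell numbers; species_b has a
# genome three times larger and therefore yields three times as many reads.
reads = {"species_a": 250, "species_b": 750}
lengths = {"species_a": 2_000_000, "species_b": 6_000_000}

print(sequence_abundance(reads))
print(taxonomic_abundance(reads, lengths))
```

In this toy case the sequence abundances are 25% and 75%, while the taxonomic abundances are 50% each, which shows how the two measures can diverge even when the underlying community is perfectly even.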