In microbiome research, a recurrent and seemingly basic question is which microbes differ in abundance between groups of samples. For instance, we might want to ask which microbes differ in stool samples before and after treating patients with a particular drug. A plethora of tools have been developed to identify such differentially abundant (DA) microbes, and choosing between them is no simple matter. They use radically different data transformations and statistical tests, which many take as evidence that they are appropriate for different contexts. In the microbiome literature, however, they are used interchangeably to identify DA microbes, with little regard for dataset characteristics. This would be a major obstacle for reproducible research if these tools produced different results. Unfortunately, this seems to be the case, at least on simulated data, as reported by several recent papers that we highlight in our manuscript. These papers present valuable results, but provide little practical insight into how much biological conclusions differ when different DA tools are applied to the same (non-simulated) datasets.
We set out to answer this question and devised our lab's first-ever "hackathon", with the goal of testing and running commonly used DA tools on a small number of in-house 16S rRNA gene sequencing datasets (Figure 1a). While we had a pretty good idea of which tools to test, we were also interested in what the microbiome research community had to say. As such, we took to Twitter to ask for suggestions on which DA tools to test (Figure 1b).
Figure 1: Our lab hackathon to compare differential abundance methods for microbiome analysis
In the end, we identified 14 commonly used DA methods in the microbiome literature and ran them on 38 different publicly available 16S rRNA gene sequencing datasets (Figure 2). While we expected the tools to be somewhat discordant, we were unsure how large the differences would be. To our surprise, we frequently found that the number of significantly DA microbes differed by orders of magnitude. For example, LEfSe and ANCOM-II identified 10,208 and 227 microbes, respectively, as significantly DA in a freshwater sequencing dataset.
Figure 2: Our general approach to compare the differential abundance tools
These are the three key take-home messages from our work:
- Some tools (e.g., LEfSe and edgeR) had inappropriately high false-positive rates when applied to simulated data and as such should not be used for 16S rRNA gene DA testing.
- ANCOM-II and ALDEx2 were the most consistent tools in terms of calling significant hits that were also called by the majority of other DA tools. This suggests their precision may be higher than that of other methods.
- We recommend that future microbiome differential abundance analyses include a summary of the results from multiple DA tools applied to the same data. This is important to give readers a sense of the degree to which the authors' findings are robust to the choice of DA tool. Otherwise, it is all too easy for readers (and authors) to put stock in significant hits that cannot be reproduced by other tools (Figure 3).
Figure 3: Bioinformaticians identifying significant hits with traditional bioinformatics methods. Image is in the public domain (Wikipedia Link).
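To make the multi-tool recommendation concrete, here is a minimal sketch of how cross-tool support could be tallied once each tool's significant hits have been collected. The tool names match those discussed above, but the taxon labels and hit sets are entirely hypothetical, for illustration only; in practice each set would come from a tool's own output (e.g., features with a corrected P below 0.05).

```python
from collections import Counter

# Hypothetical per-tool sets of taxa called significantly DA (illustrative
# only; in a real analysis these come from each tool's output tables).
significant = {
    "ALDEx2":   {"taxon_A", "taxon_B"},
    "ANCOM-II": {"taxon_A"},
    "DESeq2":   {"taxon_A", "taxon_B", "taxon_C"},
    "LEfSe":    {"taxon_A", "taxon_B", "taxon_C", "taxon_D"},
}

# Count, for each taxon, how many tools flagged it as significant.
support = Counter(taxon for hits in significant.values() for taxon in hits)

# Summarize taxa by cross-tool support, most-supported first.
for taxon, n_tools in support.most_common():
    print(f"{taxon}: flagged by {n_tools}/{len(significant)} tools")

# One possible conservative "consensus" set: taxa flagged by a strict
# majority of the tools that were run.
consensus = {taxon for taxon, n in support.items() if n > len(significant) / 2}
```

A summary table of this kind, reported alongside the primary analysis, lets readers see at a glance which hits are tool-specific and which are robust.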