Discovering novel bacterial taxa is a rewarding and unique experience: researchers get to learn more about their novel taxa and share them with the world. However, anyone who has described and validly named novel bacteria will know the effort required to gather all the relevant information, as well as the tedious work of checking that you have compared your isolate to all validly named close relatives. This mind-numbing workload grows further if you have to do this for multiple taxa in a single manuscript. I discovered this when helping to describe 38 novel taxa in the PIBAC collection (Wylensek, Hitch et al, 2020). By the end of that process I was happy it was over, until I realised we had multiple additional collections also awaiting description and naming.
With this in mind, we set out to speed up the process of gathering the information required for writing protologues, the standardised format for describing novel taxa. In our experience, the key pieces of information required for an informative protologue are:
- Extensive confirmation of taxonomic novelty, with a clear definition of the level at which the isolate is novel (e.g. family, genus or species) and thereby the taxonomic lineage of the organism
- A description of the taxon's functional repertoire, particularly functions of importance to the environment it occurs within
- The ecological distribution of the taxon within the environment it was isolated from, but also in other environments, so as to determine its specificity
Protologger provides all this information in an easy-to-understand way, so that the user can take the pre-written sentences and integrate them into the protologue of the novel taxa they are describing. These sentences are output within the overview file to provide broad insights into the taxonomic placement, novelty, functional repertoire and ecology of the input isolate. The rationale for this was to make using and understanding the output of Protologger as user-friendly as possible.
In addition to making the output user-friendly, the tool itself had to be usable by bioinformaticians and lab microbiologists alike. This is why, along with putting all scripts on GitHub, we also developed the online Galaxy web-server. Having never done any web design or run a server like this before, I found it a struggle to get running, but it ensures that everyone has equal access to the resources required to run Protologger. Knowing that the web-server and output may be confusing to first-time users, we considered how to better explain the workflow. With the help of the AVMZ Aachen team we were able to create an instructional video that clearly explains the process from uploading your data to visualizing the output.
While we were developing Protologger, the 'Great Automatic Nomenclator', aka GAN, was published (Pallen et al, 2021). GAN uses pre-formatted tables of Latin and Greek names to generate a list of potential species and genus names along with the break-down of the names, as required by the International Committee on Systematics of Prokaryotes (ICSP) for validation. The names proposed depend on the level of novelty predicted by Protologger (genus or species), as well as the ecological and functional features of the isolate. Names at the genus level are based on the habitat distribution analysis: the environment with the highest ecological score (prevalence × mean relative abundance) is used to generate relevant names. We believe this approach will ensure that while a species may be isolated from a single sample, its name will be representative of the species as a whole, including the environment it is most commonly found in.
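To make the ecological score concrete, here is a minimal sketch of how "prevalence × mean relative abundance" could be computed per environment and used to pick the habitat a name is drawn from. It is not Protologger's actual code: the function names, the detection threshold, the example values, and the choice to average relative abundance over positive samples only are all assumptions made for illustration.

```python
from statistics import mean


def ecological_score(rel_abundances, detection_threshold=0.0):
    """Prevalence x mean relative abundance for one environment.

    rel_abundances: the taxon's relative abundance in each sample of the
    environment (0 where it was not detected).
    Assumption: the mean is taken over positive samples only.
    """
    if not rel_abundances:
        return 0.0
    positives = [a for a in rel_abundances if a > detection_threshold]
    if not positives:
        return 0.0
    prevalence = len(positives) / len(rel_abundances)
    return prevalence * mean(positives)


def top_environment(samples_by_env):
    """Return the highest-scoring environment and all scores."""
    scores = {env: ecological_score(vals) for env, vals in samples_by_env.items()}
    return max(scores, key=scores.get), scores


# Hypothetical relative abundances, invented purely for this example.
example = {
    "human gut": [0.0, 0.02, 0.04, 0.0],
    "pig gut": [0.01, 0.03, 0.02, 0.06],
    "soil": [0.0, 0.0, 0.0, 0.001],
}

best, scores = top_environment(example)
print(best, scores)
```

With these made-up numbers the isolate is found in every pig gut sample but only half of the human gut samples, so "pig gut" scores highest even though the mean abundance in positive samples is similar in both. This is the behaviour that lets a name reflect where the species is most commonly found rather than where it happened to be isolated.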
During review, one reviewer made the following comment:
... the authors are well placed to provide an online register of new names produced by this tool by inviting users to log their chosen names and associated descriptions in a Protologger database.
Having recently discussed such databases at the SeqCode workshops, we could see the role that they could play in the future. This led to discussion with our co-authors at the DSMZ about the potential to integrate Protologger output into LPSN. As LPSN is specifically for nomenclature, the integration of Protologger's output would not be suitable. However, BacDive is another database, linked to LPSN, which aims to contain as much information as possible about all deposited isolates. Therefore, it has been arranged that Protologger output (the overview file) will be hosted by the DSMZ and linked to the respective isolates. While this doesn’t yet reach the goals of databases such as the ‘Digital Protologue Database’, it is a step forward and has a secured home (Rossello-Mora et al, 2017).
Due to the continually changing nature of taxonomy, Protologger is a tool that will require continual development and improvement. As such, we ask for community feedback on which outputs you would like to see provided by Protologger, or additional pathways you wish to be included and studied. While currently designed to focus on Bacteria, we will also be expanding the analysis to Archaea in the future, but again, feedback from experts is required to understand what is useful and what is irrelevant for your analysis.
Pallen et al, 2021 = https://doi.org/10.1016/j.tim.2020.10.009
Rossello-Mora et al, 2017 = https://doi.org/10.1016/j.syapm.2017.02.001
Wylensek, Hitch et al, 2020 = https://doi.org/10.1038/s41467-020-19929-w