That research has taken on an added level of importance since the Covid-19 outbreak is an understatement, but nonetheless true. Research on coronavirus, both clinical and otherwise, is only going to intensify. But this new research stands on the metaphorical shoulders of the research that has already been conducted. In this blog, we will see what big data can tell us about the existing articles that make up those metaphorical shoulders. Specifically, we will see that research output rises sharply after an outbreak, that the existing research can be grouped into four large clusters with different profiles, and that two major organising themes arise: animal vs. human focus, and foundational vs. more applied research.
Coronavirus articles and where to find them
The first hurdle to overcome when looking at previous research on coronavirus is how to find the relevant, primary research articles. I identified just over 4000 scientific articles (up to the start of 2020) in the Digital Science Dimensions dataset (with 100m+ documents) in Google BigQuery, using a combination of keywords and heuristics.
If we plot the number of research articles by year, a striking pattern emerges: there is a noticeable peak of research being published around 2004-2005, and another one around 2014. These peaks come just after the 2002-2003 SARS epidemic and the 2012 MERS outbreak.
After the 2005 and 2014 peaks, the output seems to have fallen somewhat, in what, in retrospect, might resemble periods of “quiet before the storm”.
What are the major coronavirus themes?
An analysis of the article titles points to a division into four major themes. These themes can be described in their own right, and also contrasted and compared with each other along two major axes.
The first and most distinctive difference is found between what roughly corresponds to “Veterinary sciences” and “Biochemistry and cell biology / medical microbiology” on one side, and “Public Health and health services” and “Biochemistry and cell biology / genetics” on the other. This seems to correspond to a distinction between animal research and non-animal research. Furthermore, there appears to be a secondary distinction between “Public Health and health services” and “Biochemistry and cell biology / genetics”. This axis points to an opposition between applied and more foundational research.
Theme 1: veterinary science
The first theme is dominated by animal-related words. A closer look at the fields of research codes used for these articles, as well as the journals in which they were published, showed that veterinary-related research was prominent. Hence, this theme was labeled “veterinary sciences”.
This highlights the fact that although our current focus is a coronavirus outbreak among humans, these viruses often occur in animals, and a large research effort goes into studying how these viruses affect their animal hosts, as well as how they can be identified, how they spread, and how they transmit across species.
Theme 2: biochemistry and cell biology / genetics
As expected, there is a theme dealing with biochemistry, cell biology, and virus gene research. This is perhaps where the core of academic research is located if we weigh by sheer numbers of articles.
As the keywords indicate, the articles in this theme range from the gene and RNA level, up to virus infection effects such as acute respiratory syndrome.
The largest theme in terms of number of articles, this one is described by two fields of research. The reason is simply that the combination of two codes gives a better contrast to the next cluster.
Theme 3: biochemistry and cell biology / medical microbiology
This is the smallest theme (in terms of numbers of articles), and at first glance not too different from the previous theme, but there are some differences. In the word cloud we can see that “murine” (rats and mice) is particularly prominent. Medical microbiology comes out as being more distinctive than genetics here. Furthermore, the articles in this theme are on average slightly more recent than in the previous one. Other factors related to the scope of specific journals may also play a role.
Theme 4: Public health and health services
Finally, we come to the public health and health services theme. As we would expect, the focus here is on the effects of the virus, and in the word cloud we see terms such as “acute” or “severe” and “respiratory syndrome”, along with “case”, “detection”, “infection”, “outbreak”, “patient”, and “vaccine”.
However, a close look at the word cloud shows that there are also words related to animals (“avian”, “canine”, “porcine”), as well as terms related to genetics (“recombinant”) and cell biology (“spike protein”). Clearly there are prominent similarities across the themes, despite the fact that 78% of authors in the sample had published articles that were found in only one of them.
Although these themes can be a meaningful grouping of the coronavirus research, there are connections across them as well, which naturally leads to the question: how are they related to each other?
Research is clearly defined by much more than just frequent words in article titles. There are separate but overlapping research traditions in different fields publishing in different journals, with different aims, using different methods, and so on. Ideally, we would want to capture all of this when assessing how similar or different the themes really are. Words, on their own, are not enough.
We can get closer to this ideal by combining the keywords with information about the journals that published the papers in each theme, along with field of research codes. This combined data allows for a much richer representation that can be used to map the themes against each other, by means of machine learning. The result is displayed below.
In this chart, we can see the relative distances between the themes on a map (the coordinate values themselves are not directly interpretable). Each bubble is a theme, with a size proportional to the number of articles in it. The two axes in the plot correspond to two dimensions in the article data. On the east-west axis we see data involving animals on the left (recall the heavy presence of “murine” in the word cloud for the “Biochemistry and cell biology / medical microbiology” cluster), and the non-animal research on the right. On the north-south axis, it appears that we are looking at the distinction between public health and animal research versus more foundational research.
Summary and further steps
What this overview shows is that the coronavirus research landscape is complex and composed of different research themes, and different research communities that do not fully overlap. Anyone wanting to get a quick overview of this complex field has their work cut out. This complexity might be of less importance had the impact of coronaviruses been less devastating. Instead, when looking at the historical trends we see spikes in coronavirus research driven by specific epidemics. A new spike is building momentum, but the challenge of summarising and understanding the previous research remains, especially for anyone entering the field, or those doing multidisciplinary research. Scientific publishers such as Springer Nature, with their access to data across all the themes, are uniquely positioned to help researchers deal with this information overload, thus easing the burden for those climbing the metaphorical peaks of coronavirus research once an outbreak takes place.
For the article selection, I retrieved articles in Dimensions with the terms “coronavirus(es)” in the title or variants of “covid-19” or “sars-cov-2”. Dimensions has a broad, inclusive scope, and to ensure that these are in fact primary research articles, I took a few extra steps. First, I included only articles from a list of "core" research journals. Furthermore, I excluded articles with no or only a few references, as well as articles with words like “correction” or “book review” in their title.
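To make these selection steps concrete, here is a minimal Python sketch of the filtering logic. It is not the original BigQuery query. The record layout (`title`, `journal`, `n_references`), the journal names, and the reference threshold of 5 are all assumptions made for illustration, since the exact schema, journal list, and cut-off are not given above.

```python
import re

# Title must mention coronavirus(es) or a variant of covid-19 / sars-cov-2.
TITLE_PATTERN = re.compile(r"coronavirus(es)?|covid-?19|sars-?cov-?2", re.IGNORECASE)
# Titles that signal non-primary content.
EXCLUDE_PATTERN = re.compile(r"\b(correction|book review)\b", re.IGNORECASE)
# Assumed threshold: the text only says "no or only a few references".
MIN_REFERENCES = 5


def is_primary_research(article, core_journals):
    """Return True if a (hypothetical) article record passes every heuristic."""
    title = article["title"]
    return (
        TITLE_PATTERN.search(title) is not None
        and EXCLUDE_PATTERN.search(title) is None
        and article["journal"] in core_journals
        and article["n_references"] >= MIN_REFERENCES
    )


# Made-up records, purely to exercise each rule once.
core_journals = {"Journal A"}
articles = [
    {"title": "Feline coronavirus infection in cats", "journal": "Journal A", "n_references": 30},
    {"title": "Correction: Coronavirus spike protein structure", "journal": "Journal A", "n_references": 12},
    {"title": "SARS-CoV-2 receptor binding", "journal": "Magazine B", "n_references": 40},
    {"title": "Coronaviruses in bats", "journal": "Journal A", "n_references": 1},
    {"title": "Influenza vaccine trial", "journal": "Journal A", "n_references": 50},
]

selected = [a["title"] for a in articles if is_primary_research(a, core_journals)]
print(selected)
```

Only the first record survives: the others fail the exclusion pattern, the core journal list, the reference threshold, and the title match, in that order.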
For clustering of article titles into themes or groups, I used a technique known as K-means to sort the articles into four groups, based on the relative distances between the titles, specifically title embeddings that represent each title as a sequence of numbers. These numbers, although not directly interpretable, function as “semantic fingerprints” that preserve patterns of meaning that the K-means algorithm can pick up. The choice of four groups was based on trials testing which number of clusters best fit the data.
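The clustering step can be sketched roughly as follows, using scikit-learn. This is an illustration under assumptions, not the code behind the analysis. The embeddings here are synthetic blobs standing in for real title embeddings, and the silhouette score is just one common way to compare cluster counts, since the criterion actually used in the trials is not specified.

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def cluster_titles(embeddings, k, seed=0):
    """Assign each title embedding to one of k clusters with K-means."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed)
    return model.fit_predict(embeddings)


def pick_k(embeddings, candidates, seed=0):
    """Return the candidate k with the highest silhouette score.

    Assumption: silhouette is used here only as an example criterion.
    """
    scores = {}
    for k in candidates:
        labels = cluster_titles(embeddings, k, seed)
        scores[k] = silhouette_score(embeddings, labels)
    return max(scores, key=scores.get)


# Synthetic stand-in for title embeddings: 400 "titles" in 16 dimensions.
embeddings, _ = make_blobs(
    n_samples=400, n_features=16, centers=4, cluster_std=0.5, random_state=0
)

best_k = pick_k(embeddings, range(2, 7))
labels = cluster_titles(embeddings, best_k)
sizes = sorted(Counter(labels.tolist()).values())
print(f"chosen k = {best_k}, cluster sizes = {sizes}")
```

With real title embeddings, the `make_blobs` call would be replaced by the embedding matrix, one row per article title.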
After assigning all articles in the sample to one of the four themes, we get the following count of articles per theme or group:
Veterinary sciences: 732 articles
Biochemistry and cell biology / genetics: 1502 articles
Biochemistry and cell biology / medical microbiology: 347 articles
Public health and health services: 1457 articles
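As a quick check on these counts, the short Python snippet below adds them up and expresses each theme as a share of the sample. It uses only the four numbers listed above. The variable names are my own.

```python
theme_counts = {
    "Veterinary sciences": 732,
    "Biochemistry and cell biology / genetics": 1502,
    "Biochemistry and cell biology / medical microbiology": 347,
    "Public health and health services": 1457,
}

total = sum(theme_counts.values())
shares = {theme: count / total for theme, count in theme_counts.items()}

# List themes from largest to smallest with their share of all articles.
for theme, count in sorted(theme_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{theme}: {count} articles ({shares[theme]:.1%})")
print(f"Total: {total} articles")
```

The total comes to 4038 articles, consistent with the "just over 4000" mentioned earlier, and the two largest themes together account for roughly three quarters of the sample.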
For the dimensionality reduction in the final figure, I used truncated SVD (a form of matrix factorisation), since the input matrix was very sparse. The result can be interpreted in a similar way to PCA (principal component analysis), and the fit appears to be good: the first two dimensions explain a full 95.7% of the variation in the data.
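A minimal sketch of this kind of mapping, again with scikit-learn, might look like the following. Everything in the input is a toy assumption: the feature names mix keywords from the word clouds with placeholder journal and field of research labels, the counts are invented, and treating each theme as one row of a sparse count matrix is my guess at a plausible layout rather than a description of the actual input.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import DictVectorizer

# Toy, invented counts per theme (keywords, journals, field of research codes).
theme_features = {
    "Veterinary sciences": {
        "keyword=porcine": 5, "keyword=avian": 3,
        "journal=Hypothetical Vet Journal": 4, "for=veterinary sciences": 6,
    },
    "Biochemistry and cell biology / genetics": {
        "keyword=rna": 6, "keyword=spike protein": 4,
        "journal=Hypothetical Virology Journal": 5, "for=genetics": 5,
    },
    "Biochemistry and cell biology / medical microbiology": {
        "keyword=murine": 7, "keyword=rna": 2,
        "journal=Hypothetical Virology Journal": 3, "for=medical microbiology": 4,
    },
    "Public health and health services": {
        "keyword=outbreak": 6, "keyword=patient": 5, "keyword=vaccine": 3,
        "journal=Hypothetical Health Journal": 4, "for=public health": 6,
    },
}

themes = list(theme_features)

# One sparse row per theme, one column per distinct feature name.
vectorizer = DictVectorizer(sparse=True)
X = vectorizer.fit_transform([theme_features[t] for t in themes])

# Truncated SVD works directly on the sparse matrix without densifying it.
svd = TruncatedSVD(n_components=2, random_state=0)
coords = svd.fit_transform(X)
explained = svd.explained_variance_ratio_.sum()

for theme, (x, y) in zip(themes, coords):
    print(f"{theme}: ({x:.2f}, {y:.2f})")
print(f"Variance explained by two components: {explained:.1%}")
```

The two columns of `coords` play the role of the map coordinates in the bubble chart, and `explained_variance_ratio_` is the quantity behind a figure like the 95.7% reported above. The toy data will of course give a different value.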