An overview of literature on COVID-19, MERS and SARS: Using text mining and latent Dirichlet allocation

The unprecedented outbreak of COVID-19 is one of the most serious global threats to public health in this century. During this crisis, specialists in information science could play key roles in supporting the efforts of scientists in the health and medical community to combat COVID-19. In this article, we demonstrate that information specialists can support the health and medical community by applying text mining with latent Dirichlet allocation (LDA) to produce an overview of the large body of coronavirus literature. This overview presents the generic research themes of the coronavirus diseases COVID-19, MERS and SARS, reveals the representative literature for each main research theme and displays a network visualisation to explore the overlap, similarities and differences among these themes. The overview can help the health and medical communities to extract useful information and interrelationships from coronavirus-related studies.


Introduction
The unprecedented outbreak of coronavirus disease 2019 (COVID-19) [1], caused by a novel coronavirus named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), represents one of the most substantial global challenges in this century. The pandemic has severe consequences for public health, economics, politics and society. On 28 April 2020, about 180 countries and territories reported a combined total of 2,883,603 laboratory-confirmed cases, with 198,842 deaths globally [2]. Figure 1 presents the geographical distribution of COVID-19 confirmed cases. SARS-CoV-2, taxonomically, is currently classed as a species of SARS-related coronavirus and belongs to the genus Betacoronavirus [3]. Two other similar betacoronaviruses, SARS-CoV and MERS-CoV, have also caused epidemics around the world in the last two decades, specifically SARS in 2002-2003 and the Middle East respiratory syndrome (MERS) in 2012-2013. Several similarities and differences in the causative agents, pathogenesis and immune responses, epidemiology, diagnosis, treatment and management of COVID-19, SARS and MERS have been identified [4][5][6]. For example, Law et al. [6] discuss the current understanding of COVID-19 and compare it with the outbreak of SARS in 2003 in Hong Kong in terms of the causes, transmission, symptoms, diagnosis, treatment and prevention, to establish an effective measure to control COVID-19. In response to the COVID-19 pandemic, a large number of academic studies and case reports have already emerged in major international scientific and medical journals. Most of them addressed relevant research questions, including the virus's evolution and effects, as well as potential risk factors and clinical, laboratory and imaging findings [7]. In addition, to support the efforts of scientists in the health and medical community in combatting COVID-19, many leading research organisations created a range of free resources for scholars and the public to download and read.
For example, in support of the global efforts in diagnosis, treatment, prevention and further research on SARS-CoV-2 and COVID-19, Elsevier has established the Novel Coronavirus Information Center and made more than 24,000 related articles free to access on ScienceDirect [8]. Another example is Kaggle, which has launched the COVID-19 Open Research Dataset (CORD-19), containing over 57,000 scholarly articles, including over 45,000 with full text, on COVID-19, SARS-CoV-2 and related coronaviruses [9]. However, the huge amount of coronavirus literature from numerous information sources can be difficult for the health and medical community to keep up with. It is vital to establish how a literature review on these coronavirus studies can be performed most rapidly, and how the main research themes for COVID-19 can be classified. As the COVID-19 research efforts build on earlier research on SARS and MERS, one can expect both similarities and differences among the research themes related to COVID-19, MERS and SARS. Although it is vital for the health and medical community to understand coronavirus-related diseases, answering these research questions will be very challenging. First, it is impractical to categorise the vast quantity of disparate literature from this rapidly growing subject area through manual processes, as the time required increases linearly with the volume of literature under analysis [10]. Second, manual categorisation of the coronavirus literature into major research themes could be prone to various biases. However, with the rise of information and communication technology (ICT) in information science, recent developments in data mining technologies, particularly text mining techniques, offer potential solutions to these challenges by allowing analysis of a large number of unstructured documents through automated processes [11].
Indeed, the vast amount of coronavirus literature provides the ideal arena for specialists in information science to apply text mining techniques to find relevant answers to research questions and synergise existing research insights for the health and medical community [12].
Text mining, which comprises a range of techniques such as latent Dirichlet allocation (LDA), together with natural language processing, can be used to identify and extract information or relationships from unstructured data and has become a popular approach to literature analysis in an era of rapidly emerging research [13][14][15]. For example, Ozaydin et al. [11] performed a comprehensive literature review of mobile health services from 5644 research articles using text mining. LDA, a Bayesian probabilistic model of text documents based on the 'bag of words' representation [16], generates topics from documents by utilising a probability distribution to ensure all topics obey a Dirichlet prior distribution [17], and is widely used in literature analysis. In this article, we combine text mining with LDA to perform a literature analysis of the coronavirus literature and provide an overview of the research that has been conducted on COVID-19 and other coronavirus-related pneumonias (MERS and SARS). In detail, the main purposes of this article are as follows:
• To identify the most relevant search terms and generic research themes of three coronavirus diseases (COVID-19, MERS and SARS) by performing an automated literature analysis and synthesis based on text mining and LDA.
• To uncover the representative literature on each main research theme for coronavirus-related diseases, thereby helping the health and medical community to find the appropriate studies on target themes for these diseases.
• To build a novel visual concept network that visualises the similarities among the research themes for coronavirus diseases to reveal the key aspects of these pathogens and the extent of overlap, similarity and difference among these themes.
The first contribution of the study is to present an overview of coronavirus literature using text mining for coronavirus-related research, offer a structured morphology of the existing literature, uncover the research themes and representative literature for each theme, and reveal the overlap, similarities and differences among these themes. Our literature analysis can help the health and medical communities to combat COVID-19 by facilitating the extraction of useful information and interrelationships from the mass of coronavirus literature. The second contribution is to propose a methodological framework for science foresight analysis [18]. The framework rapidly provides a snapshot of any specific field of study, enabling scholars to evaluate possible opportunities for new research and development activities in their field.
This article is organised as follows. In section 2, we introduce the main concepts of infectious diseases caused by coronaviruses and the related research in the form of literature analysis and synthesis and present some literature on text mining. In section 3, we present the data and methods used in this research. In section 4, the results are analysed and discussed. Finally, in section 5, we summarise our conclusions and present future research directions.

Coronaviruses and related diseases
Belonging to the Coronaviridae family, coronaviruses are a group of enveloped, single-stranded RNA viruses present in various species of birds, snakes, bats and other mammals. According to their serological pattern, coronaviruses can be grouped as alpha, beta, gamma and delta [19]. Diseases caused by coronavirus infection have emerged as epidemic and pandemic outbreaks more than once in the last few decades. Outbreaks in humans have been caused by infection with various coronaviruses, including 229E, OC43, NL63, HKU1, SARS-CoV and MERS-CoV. The recent SARS-CoV-2 has proved to be the most serious coronavirus to date, as it has spread across 203 countries and territories in all five major continents. All coronavirus diseases produce similar symptoms such as rhinorrhea, mild or severe cough, tracheitis and bronchitis [6]. SARS-CoV, MERS-CoV and the recently discovered SARS-CoV-2 are all grouped as betacoronaviruses.
SARS  [20]. Genetic analysis shows that SARS-CoV has a nucleotide sequence similarity to other coronaviruses of only about 50%-60%. SARS-CoV also has a high mutation rate, and can still be cultured after residing on various surfaces for up to 24 h [21]. Bats have been found to harbour SARS-CoV and transmit it to human hosts [22]. However, the transmissibility of SARS-CoV is lower than that of SARS-CoV-2.
MERS-CoV, which originated from camels [23], was first discovered in the Middle East countries (Saudi Arabia, Oman, UAE) in 2012 when a cluster of cases of respiratory tract infection started to surface. MERS-CoV subsequently spread to 24 other countries, including Malaysia and the United States, and genetic analysis revealed some homology with SARS-CoV [24]. From September 2012 to 30 June 2018, about 2239 confirmed cases of MERS-CoV were reported by the World Health Organization (WHO). About 83% of the cases came from Saudi Arabia, and the crude fatality rate was 35.5% during this period, including 791 individuals who died due to other co-morbid illnesses, such as diabetes, renal failure and hypertension [25].

Literature analysis
Involving searching, screening and synthesising research materials from multiple sources, literature analysis is a structured methodology to evaluate a body of literature to inform research development, identify potential research gaps and highlight the boundaries of a research subject [26]. Literature analysis enhances the effectiveness of the management and planning of research and development activities [18]. The typical process flow of a literature analysis involves defining appropriate search keywords, searching the literature and completing the analysis [27]. Traditionally, literature analysis required considerable effort from domain experts. Although online library databases enable researchers to search an enormous number of available articles easily from any physical location, the high volume of articles returned presents the challenging task of reading and analysing the contents of each paper, even though only a small part of some articles may be relevant [28]. Today, new technologies such as text mining are used in literature analysis.
In biomedicine, new research heavily depends on making full use of previous scientific work, so literature analysis is a crucial tool for biomedicine. Table 1 presents a summary of several selected works from a literature analysis of biomedical articles. Four of the articles concern coronavirus-related infectious diseases: two for COVID-19 and two for SARS. The literature analytical techniques used include meta-analysis, qualitative or quantitative analysis and citation analysis. The last three articles focus on health gamification, telemedicine and cognitive computing in healthcare.

Text mining
As a particular type of data mining, text mining aims to extract useful knowledge such as relations, patterns and trends from unstructured or semi-structured data, for example, text documents [35,36]. The main process in text mining is transforming text into numerical data using statistical methods to extract textual contents into an organised document-term matrix, which encompasses the following two dimensions: the words (or terms, composed of n words) and the documents [37]. The two most common techniques developed in recent years for building knowledge using text mining are latent semantic analysis (LSA) and topic modelling. LSA is a form of natural language processing that extracts relationships between textual terms and documents by assuming that words with similar meaning will occur in similar pieces of text [38]. Topic modelling transforms the relevant words and their frequency into an organised structure, in which the documents are distributed into several topics [39]. There are many variants of those techniques: for example, the work of Lee et al. [40] presents a comparative study of four techniques in text mining, including two LSA techniques (LSA and probabilistic latent semantic analysis (PLSA)) and two topic modelling techniques (LDA and correlated topic modelling). The authors highlight that LDA is the best tool for dealing with multiple topics. This technique can determine the probability of each document belonging to each of the topics, and groups the documents into the most probably matching topics [41].
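To make the document-term matrix described above concrete, the following minimal sketch counts single-word terms in a toy corpus. The three toy sentences and the function name `build_document_term_matrix` are assumptions made for illustration; they are not data or code from this study, and a real pipeline would clean the text first.

```python
from collections import Counter


def build_document_term_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    # Naive whitespace tokenisation, for illustration only.
    tokenised = [doc.lower().split() for doc in documents]
    # A sorted vocabulary gives a stable column order.
    vocabulary = sorted({term for tokens in tokenised for term in tokens})
    matrix = []
    for tokens in tokenised:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Hypothetical toy corpus: one string per document.
toy_docs = [
    "virus infects cell",
    "patient case patient",
    "virus protein cell cell",
]
vocabulary, dtm = build_document_term_matrix(toy_docs)
```

Each row of `dtm` corresponds to a document and each column to a term in `vocabulary`, which are the two dimensions named in the text.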
Text mining is now widely applied in biomedical research, as a vast number of biomedical texts, such as electronic patient-authored texts [42] and biomedical studies [43], provide a rich source of knowledge. Text mining effectively empowers researchers to create new information by making use of existing biomedical work. In biomedical literature analysis, there is a pressing need to deploy new technology that can automatically extract knowledge from published literature in response to the recent double exponential growth rate of biomedical literature [44]. Text mining is a suitable technique for such a challenge. Table 2 presents several selected papers on text mining-based approaches for biomedical study.
As a Bayesian probabilistic model for identifying latent topics from large and unstructured text documents, LDA is one of the most widely used topic modelling tools in literature analysis. For example, Wu et al. [17] employed LDA to perform topic segmentation and topic evolution for literature on stem cell research. By proposing a topic analysis approach incorporating LDA and the three-dimensional strategic diagram, Feng et al. [16] analysed 62,340 publications in the field of medical informatics between 1991 and 2018. Following LSA [38] and PLSA [49], LDA was first proposed by Blei et al. [41] in 2003 and adopted the Dirichlet prior distribution with the assumption that all topics are uncorrelated. LDA has several advantages for literature analysis. First, LDA is highly efficient for dealing with big data as it can effectively capture text-specific dimensions and does not make any assumptions [50]. Second, LDA incorporates several steps of text analysis with little human intervention, for example, data sampling, and thus the result of topic modelling is more realistic and objective.

3. Data and methods

3.1. Data description
3.1.1. Data sampling.
We conduct text mining based on CORD-19, which contains more than 57,000 scholarly papers (43,540 full texts) about COVID-19, MERS, SARS and other coronavirus diseases [9]. This data set is updated regularly and includes peer-reviewed publications and preprint literature from PubMed Central, bioRxiv, medRxiv and others. The latest update date of the data set in this study is 24 April 2020.
To focus on the three studied coronavirus diseases, COVID-19, MERS and SARS, we search for studies with matched keywords in the titles as well as the abstracts. The keywords for COVID-19 are 'COVID-19', 'SARS-CoV-2', '2019-nCoV', 'novel coronavirus pneumonia' and 'novel coronavirus infected pneumonia'. The keywords for MERS are 'MERS' and 'Middle East respiratory syndrome', and those for SARS are 'SARS' and 'severe acute respiratory syndrome'. After keyword matching, we exclude several irrelevant studies by manual inspection. Only English-language studies are included. Finally, we have 3440 studies related to COVID-19, 1590 studies related to MERS and 2879 related to SARS, giving a total of 7909 studies. These studies are published in 1461 journals. We list the top 20 journals by publication number in Table 3.
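The keyword matching step can be sketched as follows. The keyword lists are those given above; everything else is an assumption for illustration, including the function name `match_diseases`, the case-insensitive whole-term matching rule and the choice to accept a match in either the title or the abstract. Note that under this rule a record mentioning 'SARS-CoV-2' also matches the SARS keyword 'SARS'; the source does not say how such overlaps were handled beyond manual inspection.

```python
import re

# Keyword lists as stated in the text.
DISEASE_KEYWORDS = {
    "COVID-19": [
        "COVID-19", "SARS-CoV-2", "2019-nCoV",
        "novel coronavirus pneumonia",
        "novel coronavirus infected pneumonia",
    ],
    "MERS": ["MERS", "Middle East respiratory syndrome"],
    "SARS": ["SARS", "severe acute respiratory syndrome"],
}


def _compile(keywords):
    """Build one case-insensitive pattern that matches any keyword as a whole term."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    # The lookarounds stop 'MERS' from matching inside words such as 'polymers'.
    return re.compile(
        r"(?<![A-Za-z0-9])(?:" + alternatives + r")(?![A-Za-z0-9])",
        re.IGNORECASE,
    )


PATTERNS = {disease: _compile(kws) for disease, kws in DISEASE_KEYWORDS.items()}


def match_diseases(title, abstract):
    """Return the set of diseases whose keywords occur in the title or abstract."""
    text = f"{title} {abstract}"
    return {disease for disease, pattern in PATTERNS.items() if pattern.search(text)}
```

For example, `match_diseases("Clinical features of 2019-nCoV pneumonia", "")` returns `{"COVID-19"}`, while a title about biodegradable polymers returns an empty set.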

Publication trends.
We summarise the publication trends of the literature on the three coronavirus diseases in the form of a publication number bar chart. The x-axis represents the publication time (monthly for COVID-19 and yearly for the other two diseases). The y-axis represents the number of publications.
The first case of COVID-19 was reported in Wuhan, China, in late December 2019 [47]. In our literature collections, the earliest academic study related to COVID-19 was published in January 2020. Because of the rapid growth of infected cases, the WHO declared a Public Health Emergency of International Concern on 30 January [2]. On 11 March, the WHO assessed COVID-19 as a pandemic [1]. A resulting boom in research literature after February 2020 can be identified in Figure 2. As the latest update time of the data set in this study is 24 April 2020, most of the studies were published from January to April. Some of the studies even appeared in December 2020 issues of journals (December 2020 is the publication date, not the submission date). The first confirmed case of MERS occurred in 2012. Two later outbreaks occurred in South Korea in 2015 and Saudi Arabia in 2018 [23]. As research usually requires several months to 1 or 2 years to complete, we find two publication peaks in 2016 and 2019 in Figure 3. Based on the trend for the first quarter of 2020, we can expect another publication peak this year.
The outbreak of SARS was reported in 2003 [48]. In Figure 4, we find a peak in 2004, again because research and publication take some time. After 2004, the number of publications decreased until 2016, 1 year after the outbreak of MERS. We also find an increase in 2020 because of the outbreak of COVID-19.

Proposed methods
LDA is one of the most popular topic modelling methods [49]. Three concepts are important when applying the LDA algorithm: corpus, documents and terms. We refer to the total text collection as the corpus. Every item within the corpus can be considered as a document. Words in a document are called terms. Here, we consider documents as a mixture of latent topics. Latent topics can be inferred by modelling the distribution of words. Expressed another way, topics can be seen as items composed of a group of words. Documents are then composed of topics with different weights [50]. In detail, each study is a document $W$, a sequence of $n$ words represented by $W = (\omega_1, \omega_2, \ldots, \omega_n)$, where $\omega_n$ is the $n$th word in the document; the set of $M$ documents constitutes a corpus $D$, denoted by $D = (W_1, W_2, \ldots, W_M)$. LDA assumes that the corpus $D$ contains $K$ topics, and each topic defines a multinomial distribution. Based on Blei et al. [41], the process for LDA is as follows. First, the Dirichlet-distributed variables $\theta$ and $\eta$ used in the selection process are defined: $\theta$ with parameter $\alpha$ for topic selection and $\eta$ with parameter $\beta$ for word selection. Second, the general process for each document $W$ is described in the following two steps:
1. Choose $\theta \sim \mathrm{Dir}(\alpha)$.
2. For each of the $n$ words $\omega_n$:
(a) Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$.
(b) Choose a word $\omega_n$ from $p(\omega_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$.
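The two-step generative process can be simulated directly, which may help to clarify what LDA assumes about how a document arises. The sketch below follows Blei et al. in drawing the topic proportions θ from a Dirichlet distribution with parameter α; the vocabulary, the two topic-word distributions in `beta`, the value of α and all function names are hypothetical values chosen for illustration. It simulates the generative model only and does not perform inference.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dir(alpha) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha, topic_word_probs, vocabulary, rng):
    """Generate one document following the two-step LDA generative process."""
    # Step 1: per-document topic proportions theta ~ Dir(alpha).
    theta = sample_dirichlet(alpha, rng)
    topics, words = [], []
    for _ in range(n_words):
        # Step 2(a): choose a topic z_n ~ Multinomial(theta).
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # Step 2(b): choose a word from p(w | z_n, beta).
        w = rng.choices(vocabulary, weights=topic_word_probs[z])[0]
        topics.append(z)
        words.append(w)
    return theta, topics, words


rng = random.Random(0)
vocabulary = ["patient", "case", "cell", "protein"]
# Hypothetical topic-word distributions: one row per topic, one column per word.
beta = [
    [0.5, 0.5, 0.0, 0.0],  # a "clinical" topic
    [0.0, 0.0, 0.5, 0.5],  # a "molecular" topic
]
theta, topics, words = generate_document(8, [0.5, 0.5], beta, vocabulary, rng)
```

Fitting LDA amounts to reversing this process: given only the observed words, estimate the topic proportions and topic-word distributions that most plausibly generated them.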
In this research, we use the body text of the studies to conduct the experiments. The proposed text mining methods are displayed in Figure 5. Before conducting LDA, some pre-processing tasks are required. We use two Python libraries, the Natural Language Toolkit (NLTK) and spaCy (Industrial-Strength Natural Language Processing in Python), for the data pre-processing. Data pre-processing includes the following three steps: (1) removing punctuation, unnecessary special characters and stop words; (2) tokenisation, that is, chopping the documents up into words; and (3) lemmatisation, that is, removing inflectional endings to retrieve the root or dictionary form of a word. Words of other types are then removed, so that only nouns and adjectives are left. We also include bigram [50] words in the data to extract more valuable information. A bigram is a set of two adjacent words: for example, 'machine' and 'learning' could be combined into the bigram 'machine_learning'. After pre-processing, we present the top 30 most frequent words for each of the three diseases. WordCloud, a popular Python visualisation tool, is also used to display the frequency of terms in the three disease-related literature corpora. We then use the LDA module in Gensim, a widely used topic modelling library, to extract meaningful topics from the collection of documents [35]. We also display the most relevant publications in each topic as well as the top three most frequent terms for the topic. Finally, we calculate the semantic similarity among different topics. We use NetworkX, a popular network visualisation tool, to display the semantic similarity network.
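The first two pre-processing steps and the bigram merging can be sketched without external dependencies. This is a simplified stand-in and not the NLTK/spaCy pipeline used in the study: the stop-word list is a tiny illustrative one, lemmatisation and the noun/adjective filter are omitted because they require a language model, and the rule of merging any adjacent pair seen at least `min_count` times is an assumption. All names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real pipelines use library-provided lists.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "for", "to", "on", "with"}


def tokenise(text):
    """Lower-case the text, drop punctuation and remove stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def merge_frequent_bigrams(tokenised_docs, min_count=2):
    """Join adjacent word pairs seen at least min_count times into 'w1_w2'."""
    pair_counts = Counter(
        pair for tokens in tokenised_docs for pair in zip(tokens, tokens[1:])
    )
    merged_docs = []
    for tokens in tokenised_docs:
        merged, i = [], 0
        while i < len(tokens):
            pair = tuple(tokens[i:i + 2])
            if len(pair) == 2 and pair_counts[pair] >= min_count:
                merged.append("_".join(pair))
                i += 2  # skip both words of the merged pair
            else:
                merged.append(tokens[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


# Hypothetical two-document corpus.
raw_docs = [
    "Machine learning for the screening of patients.",
    "A machine learning model, trained on CT images.",
]
processed = merge_frequent_bigrams([tokenise(d) for d in raw_docs])
```

Here 'machine' and 'learning' co-occur in both documents and are merged into 'machine_learning', mirroring the example in the text; the resulting token lists are the kind of input a topic model such as Gensim's LDA module expects.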

Results and analysis
The presentation of results is divided into three sections: topic modelling, representative studies and topic similarity networks. In section 4.1, we present the top 30 most frequent words associated with each of the three diseases. WordCloud is also used to display the most frequent terms in the literature corpora related to the diseases. We then present the topic modelling results. In section 4.2, we identify the most relevant literature for each topic and the topics' top three most frequent terms. Finally, we calculate the semantic similarity among the topics in section 4.3.

Topic modelling results
4.1.1. Most relevant terms.
First, we present the global results for the text mining of coronavirus-related disease literature. Table 4 shows the frequencies of the most relevant terms for the three coronavirus-related diseases (COVID-19, MERS and SARS). Here, we only present the top 30 most relevant terms due to limited space. From this table, we can see that the most relevant terms for research on the three coronavirus-related diseases include 'patient', 'case' and 'infection'. This indicates that there are some similar research directions for the three diseases, consistent with the fact that they are all caused by coronavirus infection. In addition to these research commonalities, there are also research differences among the three diseases. Specifically, the top three terms for COVID-19 are 'patient', 'case' and 'number'. This reveals that the current research on COVID-19 mainly focuses on the symptoms of patients or the number of infection cases. This indicates that medical specialists still have a far from sufficient understanding and knowledge of SARS-CoV-2. This is again to be expected, as the outbreak of this disease is very recent. For MERS-related research, although 'MERS', 'virus' and 'infection' are the most frequent terms, the fourth term is 'cell', which reveals that MERS research also now concentrates heavily on study at the cellular level, for example, the status of infected cells. 'Camel' is another frequent term in MERS-related research, presumably related to the region/countries in which outbreaks occur. For SARS, it is evident that infection cases are not the primary concern for current research, as the top two terms are 'cell' and 'protein', which indicates that protein and antibody-related research is more prevalent for SARS. We also present the percentage frequency of the most popular terms in Figure 6, which allows the visualisation of the results.

WordCloud for coronavirus diseases.
We use WordCloud to display the most important terms in the corpora on the three coronavirus-related diseases. The font size in the figure depends on the term frequency without lemmatisation or bigram processing. Therefore, the word frequencies in the WordCloud differ slightly from those used in the LDA model.
For COVID-19 research, 'patient' and 'case' are the two largest words in Figure 7. 'covid' is shorthand for 'COVID-19', the name of the emerging infectious disease. Finally, the words 'cases', 'number' and 'model' are related to modelling of the disease transmission.
For MERS research, we can see 'mer' and 'virus' in Figure 8, which refer to 'MERS'. It indicates that a large amount of research literature mentions both MERS and coronavirus. Meanwhile, 'patient' and 'case' are related to individual-  Figure 9. It indicates that many studies are related to the protein structure of SARS-CoV. 'patient', 'case' and 'number' are related to the infected number. 'mouse' is also included in the WordCloud, which is related to medical experiments using mice.

Relevant topics for each coronavirus disease.
A more interesting analysis is to identify the relevant topics for each coronavirus disease and explore the research trends for each topic. For this purpose, we use LDA to discover the research topics. Each study could have 1 to k topics, where k is the number of topics. Table 5 shows the relevant topics for COVID-19, where each topic is presented in a row and the three dominant terms for each topic are also given. Dominant terms are those terms that could differentiate the topic from other topics. With the help of pyLDAvis, a widely used LDA visualisation tool, we selected three dominant terms from the top 30 most frequent terms of the topic. The column labelled '#' shows the total number of studies included in each topic. 'β' is the weight of the term, which is a coefficient measuring the importance of the term in the topic. Since the original weight is very small, we give it a 10× magnification. Finally, we present the number of studies that were published throughout the analysed period (from January to April 2020). From this table, we can see that there are six research topics for the study of COVID-19. The six topics differ significantly from one another as their top three dominant terms are not the same.
Table 6 presents the six relevant topics for the study of MERS. Topic 1 mainly concentrates on the human and animal virus and includes 869 published studies. Topic 2 comprises 731 papers focusing on cellular-level research. Topic 3 is the virus and vaccine-related research, containing 944 papers. Topic 4, a protein structure-related topic, includes only 551 publications. Topic 5 concentrates on disease detection and includes 703 studies. Topic 6 focuses on the infected number and includes 1245 studies. Table 6 also reveals the research trends for each topic. From this table, we can see that there are two bursts for all four topics: the first in 2016 and the second in 2019.
Finally, for the study of SARS, seven research topics are identified by the LDA. The results are presented in Table 7.

Representative literature per topic
To select the most relevant studies, two metrics are considered, in the following order of priority: the number of different terms mentioned in each study (from one to all three of the most relevant terms, displayed for each topic) and the total number of times each of the three terms occurs, regardless of the specific topic. Table 8 shows representative publications on the six COVID-19 topics. The representative study for Topic 1 develops a new transmission model, which integrates a global network model with a local Susceptible-Exposed-Infective-Recovered (SEIR) spreading model to predict the outbreak dynamics of COVID-19 [51]. The study chosen to represent Topic 2 builds a deep learning model to help the screening of COVID-19 based on computed tomography (CT) images [52]. Topic 3's representative paper offers some recommendations to universities to help mitigate the negative effect of COVID-19 on students' mental health [53]. Topic 4's representative paper 'provides a comprehensive structural genomics and interactomics road-maps of' SARS-CoV-2 [54]. Topic 5's representative study proposes five practical steps to prevent the spreading of infectious disease in Long-Term Resident Rooms [55]. Topic 6's representative paper analyses 'the clinical characteristics and laboratory findings' of COVID-19 cases [4]. All six papers were published in 2020. Their authors are named in Table 8.
Table 9 lists the six representative studies of MERS-related research. Each study corresponds to one topic. The first representative paper investigates the factors that affect the response to infectious disease with the help of meta-analyses [57]. The second one describes the hospital outbreak of MERS in South Korea in 2015 [58]. The third representative study discusses the potential drugs and treatments for both MERS and SARS [59]. The fourth study presents the understanding of MERS as of 2014 [60]; camels are suspected to be a possible source of the virus. The fifth representative paper discusses the emergence and spread of MERS in 2012 [61]. The sixth representative paper is a review focusing 'on the origin, epidemiology and clinical manifestations of MERS-CoV' as well as 'the diagnosis and treatment of infected patients' [62].
Table 10 lists the seven representative studies of SARS-related research. Each study corresponds to one topic. The first representative paper focuses on the host cell of SARS-CoV [63]. The second studies the structural proteins of SARS-CoV [64]. The third study investigates the 'clinical, radiologic, and hematologic findings of SARS patients with pneumonia' [65]. The fourth study examines 'whether the initial chest radiograph helps predict the clinical outcome of patients' with SARS [66], and concludes that it does. The fifth one reports the pregnancy outcome of a woman who was exposed to SARS [67]. The sixth one develops a new approach to help optimise the lead inhibitor of SARS-CoV [68]. The seventh representative paper conducts a comparative analysis of the transmission and epidemiological characteristics of both SARS and COVID-19 [69].
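The two-metric selection rule stated at the start of this section can be sketched as a sort on a pair of scores. The function name `rank_representative`, the toy token lists and the three example terms are hypothetical; the source does not specify how ties on both metrics are broken.

```python
from collections import Counter


def rank_representative(docs, top_terms):
    """Rank documents by (distinct top terms present, total top-term occurrences)."""
    scored = []
    for doc_id, tokens in docs.items():
        counts = Counter(t for t in tokens if t in top_terms)
        distinct = len(counts)        # metric 1: how many of the terms appear
        total = sum(counts.values())  # metric 2: how often they appear in total
        scored.append((distinct, total, doc_id))
    # Metric 1 takes priority; metric 2 is used only to separate equal metric 1.
    scored.sort(key=lambda s: (s[0], s[1]), reverse=True)
    return [doc_id for _, _, doc_id in scored]


# Hypothetical topic terms and tokenised toy documents.
top_terms = {"model", "transmission", "epidemic"}
docs = {
    "A": ["model", "model", "model", "model", "patient"],
    "B": ["model", "transmission", "epidemic"],
    "C": ["model", "transmission", "transmission", "cell"],
}
ranking = rank_representative(docs, top_terms)
```

Document B ranks first because it mentions all three terms, even though document A contains more term occurrences in total; this is the effect of giving the first metric priority.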

Topic similarity analysis
To examine the semantic similarity and differences among topics extracted by LDA, we use the Jaccard similarity score, that is, the size of the intersection of two term sets divided by the size of their union, to measure the similarity between pairs of topics, with a range from 0 to 1 [65]. We choose the top 30 most frequent topic terms to calculate the similarity score. In Figure 10, the red nodes are COVID-19-related topics, the green nodes are MERS-related topics and the yellow nodes are SARS-related topics ('T' means 'Topic'). The edges represent the similarity scores between pairs of topics. To simplify the structure of the network, similarity scores under 0.15 are excluded. Brown dashed lines indicate similarity scores under 0.3, and black solid lines indicate scores above 0.3. Figure 10 indicates that Topic 4 of COVID-19 shares many more topic terms with other topics. It is highly correlated with six topics: Topics 1, 2 and 3 of MERS and Topics 1, 3 and 5 of SARS. Moreover, Topics 1, 2 and 3 of MERS and Topics 1, 3 and 5 of SARS are each highly correlated with another two topics, so they have the same number of highly correlated topics.
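The construction of the similarity network can be sketched as follows, using the Jaccard score (intersection size divided by union size) and the 0.15 and 0.3 thresholds stated above. The topic labels and the short term sets are hypothetical (the study uses the top 30 terms per topic), the function names are illustrative, and treating a score of exactly 0.3 as a dashed edge is an assumption, since the text does not specify that boundary. The resulting edge list could then be passed to a graph library such as NetworkX for drawing.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two term collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_similarity_edges(topic_terms, min_score=0.15, strong_score=0.3):
    """Return (topic_a, topic_b, score, style) for every pair kept in the network."""
    edges = []
    for (name_a, terms_a), (name_b, terms_b) in combinations(topic_terms.items(), 2):
        score = jaccard(terms_a, terms_b)
        if score < min_score:
            continue  # weak links are dropped to simplify the network
        style = "solid" if score > strong_score else "dashed"
        edges.append((name_a, name_b, round(score, 3), style))
    return edges


# Hypothetical topics with short term sets, for illustration only.
topic_terms = {
    "COVID-19 T4": {"protein", "cell", "virus", "sequence"},
    "SARS T1": {"protein", "cell", "virus", "host"},
    "MERS T6": {"case", "patient", "virus"},
    "SARS T6": {"inhibitor", "compound", "drug", "binding"},
}
edges = build_similarity_edges(topic_terms)
```

In this toy example the first two topics share three of five distinct terms (score 0.6, solid edge), 'MERS T6' is weakly linked to both (dashed edges) and 'SARS T6' shares no terms with any topic, so it has no edges.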

Conclusion and discussion
The outbreak of COVID-19, caused by SARS-CoV-2, represents one of the most substantial global challenges in this century. Millions of people have been infected and hundreds of thousands have died. In response to the pandemic, a large number of academic studies and case reports have already emerged in major international scientific and medical journals. However, the huge amount of coronavirus literature makes it difficult for the health and medical community to keep up. By applying text mining and LDA to conduct a literature analysis on three coronavirus diseases (COVID-19, MERS and SARS), we illustrate that information specialists can support the health and medical community using information techniques in literature analysis. We first present the most relevant terms appearing in research on coronavirus diseases and identify the main research themes. We then uncover representative studies for each main research theme as examples to guide the health community to find appropriate literature on the target themes for these diseases. Finally, we build a novel visual concept network to show the degree of overlap, similarity and difference among these themes. This study can help the health and medical community to extract useful information and interrelationships from a mass of coronavirus literature, such as finding the structured morphology of the existing literature and uncovering research themes and representative studies. Our work also provides a methodological framework for literature analysis that can rapidly present a snapshot of any specific field of study. Such a snapshot is an important requirement for many people, such as new entrants to a research field, researchers from other fields and policymakers, when evaluating possible opportunities for new research and development activities.
There are also some limitations to the study. First, although the data set is large, it is not possible to collect all related articles because of time and access-rights limitations. Second, we only use the most popular topic modelling method, the LDA model. Other methods, such as clustering, could supplement our research strategy. Third, we only use the full-text data and neglect the abstract data. Titles, abstracts and keywords can also provide useful information for topic modelling. We plan to conduct LDA topic modelling using abstracts in future work.
This study has several practical implications. By applying the proposed text mining framework, we can identify the most relevant search terms and generic research themes of different research topics. In addition, our study could help the health and medical community to find the appropriate studies on target themes for these diseases. Moreover, our visual concept network could visualise the similarities among the research themes. In the future, we plan to collect more studies and apply more advanced techniques to support the fight against the pandemic.