How to identify the roots of broad research topics and fields? The introduction of RPYS sampling using the example of climate change research

Since the introduction of the reference publication year spectroscopy (RPYS) method and the corresponding program CRExplorer, many studies have been published revealing the historical roots of topics, fields, and researchers. The application of the method was restricted up to now by the available memory of the computer used for running the CRExplorer. Thus, many users could not perform RPYS for broader research fields or topics. In this study, we present various sampling methods to solve this problem: random, systematic, and cluster sampling. We introduce the script language of the CRExplorer which can be used to draw many samples from the population dataset. Based on a large dataset of publications from climate change research, we compare RPYS results using population data with RPYS results using different sampling techniques. From our comparison with the full RPYS (population spectrogram), we conclude that the cluster sampling performs worst and the systematic sampling performs best. The random sampling also performs very well but not as well as the systematic sampling. The study therefore demonstrates the fruitfulness of the sampling approach for applying RPYS.


Introduction
introduced the CRExplorer, a program which can be used to investigate the roots of research fields and topics. For example, the program has been used by Rhaiem and Bornmann (2018) to reveal the historical roots of academic efficiency assessments, a new topic in scientometrics, and by Andy Wai Kan (2017) to identify seminal works that built the foundation for functional magnetic resonance imaging studies of taste and food. The CRExplorer facilitates the so-called reference publication year spectroscopy (RPYS) (Marx, Bornmann, Barth, & Leydesdorff, 2014). This statistical method is based on a field- or topic-specific publication set including cited references (CRs). RPYS visualizes CR counts by referenced publication years (RPYs, not to be confused with the method RPYS); years with high counts (especially early years) point to underlying cited publications which might be interpreted as historical roots or landmark papers of a field or topic.
Since the introduction of RPYS, the method has faced the problem of processing large datasets which are based on broader topics or fields. The hardware capacities of conventional computers running the CRExplorer are frequently not sufficient to process large datasets. To tackle this problem, we introduce in this paper the technique of drawing several samples from a large dataset and producing RPYS results based on these samples. The study is based on a large dataset which has been produced by Haunschild, Bornmann, and Marx (2016) and used to identify the early roots of climate change research (Marx, Haunschild, Thor, & Bornmann, 2017). As we will demonstrate in this study, some sampling methods lead to results which are very close to the results from the complete climate change dataset (the population).
By using samples to draw conclusions on populations, the study connects to the recent discussion in the Journal of Informetrics around the paper "Sampling issues in bibliometric analysis" published by Williams and Bornmann (2016). Both authors demonstrate the relevance of the sampling concept for bibliometric analyses (in the context of inference statistics). Some authors have commented on the paper by questioning the relevance of the sampling topic for the field. In this paper, however, we will demonstrate the fruitfulness of this concept for bibliometric studies.
In the following section "Dataset and Methodology", we describe the climate change dataset which we used in this study to demonstrate the various RPYS sampling methods. The three different sampling methods which are implemented in the CRExplorer are also explained in this section: random, systematic, and cluster sampling. The section "Results" starts with the RPYS based on the complete climate change dataset, i.e. the population dataset (subsection "Population analysis"). The results of the population RPYS constitute the outcome which should be reached by the sampling methods: the closer the RPYS of a sampling method is to the population RPYS, the more appropriate the method is for replacing the population RPYS. The results of the population RPYS revealing the historical roots of climate change research are explained in detail.
The subsections "Random sampling", "Systematic sampling", and "Cluster sampling" in the results section present the RPYS results based on the different sampling methods.
Each subsection in the "Results" section presenting RPYS results based on the population or sample data is followed by a corresponding subsection in which the script language of the CRExplorer is explained for performing the specific RPYS. The explanations are provided in detail so that the reader learns how to use the language.

Climate change publications
Our analyses are based on the Web of Science (WoS) custom data of our in-house database derived from the Science Citation Index Expanded (SCI-E), Social Sciences Citation Index (SSCI), and Arts and Humanities Citation Index (AHCI) produced by Clarivate Analytics (Philadelphia, USA). In this study, we used a publication set containing most of the relevant literature on climate change research. The set was compiled using a sophisticated method known as "interactive query formulation": a set of key papers was retrieved, and a reformulated search query was constructed based on a keyword analysis of these key papers (Wacholder, 2011). The search was restricted to the publication years 1980-2014 and to the document types "article" and "review". A detailed description of the search process for retrieving the relevant publications on climate change can be found in Haunschild, et al. (2016).
In total, the publication set (the population) comprises 222,060 publications and 10,932,050 CRs in 4,004,082 distinct CR variants. An earlier RPYS study by Marx, Haunschild, Thor, et al. (2017)

Sampling methods
If a dataset contains numerous CRs from many publications, the full dataset cannot be completely imported into the CRExplorer because the main memory available on many users' computers is too small. To tackle this problem, the user has the option to draw one of the following three types of samples from the full dataset. The samples are based on different methods for selecting a subset from the original set of all CRs (the population) (Levy & Lemeshow, 2008):
(1) Random sampling: The sample of CRs is randomly selected from the population such that every possible combination of n CRs from the population has the same chance of being selected. For example, if the user wants to import a sample of 100 CRs out of a population of 400 CRs, the CRExplorer randomly selects 25% of all CRs.
(2) Systematic sampling: Systematic sampling is a very popular sampling method (Levy & Lemeshow, 2008) whereby elements are selected from an ordered sampling frame. Here, a given number of CRs is used to select a sample which is uniformly distributed over the list of all CRs of the citing publications. For example, if the user wants to import 100 CRs out of 400 CRs, the CRExplorer systematically selects 25% of the list of all CRs by picking the 1st, 5th, 9th CR, and so on.
(3) Cluster sampling: In cluster sampling, the sampling frame is not based on individual units but on clusters of units. Thus, clusters of units are sampled instead of individual units. The CRExplorer randomly selects one year from the citing publication years which lie between two years set by the user of the program. Then, all CRs in the papers published in this year are selected as a sample and imported. The results of Bornmann and Mutz (2015) reveal that restricting a reference analysis to all CRs from one recent citing year leads to results very similar to those based on all CRs from several citing years.
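The three selection rules described above can be illustrated with a minimal Python sketch. It is not the CRExplorer implementation: the function names, the `offset` parameter, the wrap-around at the end of the list, and the dictionary keyed by citing year are assumptions made for illustration only.

```python
import random

def random_sample(crs, n, seed=None):
    """Simple random sample: every combination of n CRs is equally likely."""
    rng = random.Random(seed)
    return rng.sample(crs, n)

def systematic_sample(crs, n, offset=0):
    """Pick every k-th CR (k = len(crs) // n), starting at position offset.

    Wrapping around at the end of the list is an assumption of this sketch.
    """
    if n > len(crs):
        raise ValueError("sample larger than population")
    step = len(crs) // n
    return [crs[(offset + i * step) % len(crs)] for i in range(n)]

def cluster_sample(crs_by_citing_year, first_year, last_year, seed=None):
    """Pick one citing year at random and return all CRs cited in that year."""
    rng = random.Random(seed)
    years = [y for y in sorted(crs_by_citing_year) if first_year <= y <= last_year]
    year = rng.choice(years)
    return year, crs_by_citing_year[year]

crs = list(range(1, 401))               # 400 CRs labelled by their position
print(systematic_sample(crs, 100)[:3])  # [1, 5, 9]
```

With 400 CRs and a sample of 100, the systematic rule uses a step of 4 and therefore returns the 1st, 5th, 9th CR, and so on, matching the example in the text.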

Population analysis
The results of the RPYS based on the population data which are shown in Figure 1
For this study, we restrict the RPYS analysis to the RPYs from 1970 to 2010 and use the results for comparison with the RPYS results from various sampling methods. With this focus, we connect to the study by Marx, Haunschild, Thor, et al. (2017), who analyzed the very early roots of climate change research. Thus, the results of the RPYS are of interest not only for the comparison of samples and population, but also for revealing landmark publications in climate change research which have been published more recently. The RPYS in Figure 1 shows not only the NCRs (in red), but also the five-year median deviation (in blue). The blue line is the deviation of the NCRs in each year from the median of the NCRs in the two previous, the current, and the two following years. This deviation from the five-year median provides a smoother curve than the absolute numbers. Using the five-year median deviation curve, peaks in the data can be identified more easily than with the absolute numbers, since each year is compared with its adjacent years. Although we have calculated the RPYS until 2014, we show the spectrogram in Figure 1 only until 2010 to ensure a referencing window of at least three years. The spectrogram features nine more or less pronounced peaks at the following RPYs: 1974, 1976, 1982, 1984, 1987, 1993, 2000/2001, 2004, and 2007. Table 1 lists the CRs which occur most frequently within the peak RPYs. We use the spectrogram in Figure 1 and the most frequently cited publications in Table 1 to judge the reliability of the results of the different sampling methods which are presented in the following sections.
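The five-year median deviation described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the CRExplorer code: the function name and the treatment of the first and last years (the window is simply truncated to the available years) are assumptions.

```python
from statistics import median

def median_deviation(ncr_by_year, half_window=2):
    """Deviation of each year's NCR from the median of its surrounding years.

    half_window=2 gives the five-year window (two previous years, the
    current year, and two following years). At the edges of the time
    series the window is truncated, which is an assumption of this sketch.
    """
    deviation = {}
    for year in sorted(ncr_by_year):
        window = [
            ncr_by_year[y]
            for y in range(year - half_window, year + half_window + 1)
            if y in ncr_by_year
        ]
        deviation[year] = ncr_by_year[year] - median(window)
    return deviation

# Toy NCR values (not taken from the climate change dataset)
ncr = {1998: 10, 1999: 12, 2000: 30, 2001: 14, 2002: 11}
print(median_deviation(ncr)[2000])  # 18
```

In the toy data, the year 2000 has an NCR of 30 while the median of its five-year window is 12, so the deviation of 18 marks it as a peak.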

Using the script language for the population analysis
We employed the script language of the CRExplorer to produce the results in Figure 1 and Table 1. The language can be applied instead of using the menus of the graphical user interface of the CRExplorer. A separate JAR file is necessary to use the language (this file can be downloaded from http://www1.hft-leipzig.de/thor/crexplorer/CitedReferencesExplorerScript.jar). We started by analyzing the CRs in all climate change papers on a machine with 512 GB of main memory (RAM, random access memory). The CRE and CSV files which are necessary for an RPYS analysis of all CRs published between 1970 and 2014 can be produced using the script in Listing 1. The functions saveFile and exportFile allow us to save the results of our analysis in different formats: the CRE-internal file format, the list of CRs in CSV file format, and the data to produce the RPYS graph in CSV format (see Figure 1).

Random sampling
In an attempt to cover a range from a small to a large number of samples, we performed seven different random sample RPYS analyses using 10, 50, 100, 500, 1,000, 2,500, and 10,000 samples with 50,000 CRs in each sample. Figure 2 shows the results of the merged samples in comparison with the population spectrogram (full RPYS). As the merged samples are of different sizes, they had to be scaled. We used $f = \max(\mathrm{NCR}_{\mathrm{sample},\,\mathrm{RPY}}) / \max(\mathrm{NCR}_{\mathrm{full},\,\mathrm{RPY}})$ as a scaling factor. The samples do not fully reproduce the population spectrogram, but most of the relevant peaks also occur in all of the samples. It seems that a few (10 or 50) random samples are sufficient to obtain a first impression of the RPYS. The differences between the samples can be seen more clearly in Figure 3, which shows the difference between each sample and the RPYS with 10,000 samples. The random sampling seems to converge rather slowly with the number of samples, but the RPYS with 500 samples seems to be a good compromise between accuracy and computational time. Each sample needed approximately one minute of computational time on our Intel® Xeon® E5-2640 with 2.6 GHz, so that 500 samples can be calculated within a day or overnight. 10,000 samples of 50,000 CRs each needed about a week on the same PC. Due to the slow convergence of the random sampling, we present the most important references under the peaks for the results from 10,000 samples in Table 2.

The differences between the RPYS with 10,000 samples and the RPYS results with smaller numbers of samples are displayed in Figure 5. In the case of the climate change literature, the systematic sampling converges faster than the random sampling. The difference between the RPYS result of 500 samples and the results of larger numbers of samples seems to be insignificant. However, smaller numbers of samples do not seem to be sufficient to reproduce the RPYS accurately.
A comparison of Table 1 and Table 3 shows that all top papers of the population RPYS also appear as top papers in the RPYS from 500 systematic samples. Only the order of the top papers is different for the RPYs 1987 and 1993. For all other RPYs, the order of the top papers is the same as in the population RPYS. Even the NCRs agree quite well in most cases. The main exception is the reference Stuiver M, 1993, whose NCR is significantly underestimated. It seems from our results that the systematic sampling with 500 samples of 50,000 CRs each can be used to approximate the population spectrogram very well.
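The scaling factor f used for comparing merged samples with the population spectrogram can be illustrated as follows. The text defines only f itself; dividing the sample NCRs by f to bring both curves to the same maximum is an assumption of this sketch, as are the function name and the toy numbers.

```python
def scale_to_population(ncr_sample, ncr_full):
    """Rescale a sample spectrogram so that its maximum matches the population.

    f = max(NCR_sample) / max(NCR_full), with both maxima taken over all RPYs.
    Dividing by f is an assumed way of applying the factor.
    """
    f = max(ncr_sample.values()) / max(ncr_full.values())
    return {rpy: ncr / f for rpy, ncr in ncr_sample.items()}

# Toy values (not taken from the climate change dataset)
ncr_full = {1999: 180, 2000: 400}
ncr_sample = {1999: 40, 2000: 100}
print(scale_to_population(ncr_sample, ncr_full))  # {1999: 160.0, 2000: 400.0}
```

After scaling, the highest peak of the sample coincides with the highest peak of the population, so deviations at the other RPYs (here 160 versus 180 for 1999) become directly visible.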

Using the script language for random and systematic sampling
The script language can be extended using the Java programming language. Every user can expand the capabilities of the CRExplorer by writing such extensions. One CRExplorer extension, Loop.crs, is available at https://github.com/andreas-thor/cre/blob/master/crs/packages/Loop.crs.
This extension simplifies loop programming in the CRExplorer script language. The analysis via sampling procedures was made using the extension Loop.crs. In this case, ten random samples of 50,000 CRs were drawn from the population of CRs. They were clustered and merged.
Afterwards, CRs referenced only once were removed. The functions forEach and forEachUnion provide loops. The number of cycles is provided as the value of count (here 10). The functions differ in their behavior after the loops are finished: forEach performs no further action, whereas forEachUnion merges the CRE files of each cycle into a final CRE dataset. The parameter dir is optional. If dir is not provided, the system default temporary directory is used. If there is too little disk space, the CRExplorer stops with an error message.
Furthermore, if dir is provided, the temporary files of each cycle are kept and can be used later on in other CRExplorer script files. The variable index is available in the loop and runs from 0 to count-1. The importFile function contains two additional arguments compared to Listing 1.
The parameter sampling can be set to "RANDOM" (as in this example) or "SYSTEMATIC".
Two of the sampling methods can be selected this way. The argument offset: index+1 instructs the CRExplorer to skip the first index+1 CRs. This is not necessary for the random sampling, but very important for the systematic sampling. The systematic sampling uses an equidistant set of CRs from the data file. Without the offset option, all samples would contain the same CRs.
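Why the offset matters for systematic sampling can be seen in a small Python sketch that mimics a loop over count cycles with a final merge. The names take_systematic and for_each_union are hypothetical stand-ins chosen for illustration; they are not the CRExplorer functions, and the exact way the CRExplorer applies the offset may differ.

```python
from collections import Counter

def take_systematic(crs, max_cr, offset):
    """Skip the first `offset` CRs, then take every k-th CR (k = len(crs) // max_cr)."""
    step = len(crs) // max_cr
    return crs[offset::step][:max_cr]

def for_each_union(crs, count, max_cr):
    """Draw `count` systematic samples with offsets index+1 and merge their CR counts."""
    merged = Counter()
    for index in range(count):  # index runs from 0 to count-1
        merged.update(take_systematic(crs, max_cr, offset=index + 1))
    return merged

crs = [f"CR{i}" for i in range(20)]
print(take_systematic(crs, 5, offset=1))  # ['CR1', 'CR5', 'CR9', 'CR13', 'CR17']
print(take_systematic(crs, 5, offset=2))  # ['CR2', 'CR6', 'CR10', 'CR14', 'CR18']
print(len(for_each_union(crs, 2, 5)))     # 10
```

With a fixed offset, every cycle would return the same five CRs and the merged result would contain no new information; shifting the start position by index+1 makes each cycle cover a different equidistant subset.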
The argument maxCR: 50000 restricts the sample size to 50,000 CRs, which easily fit into 1 GB RAM, although about 250,000 CRs could be imported per GB from the climate change publication set. However, merging the samples needed more RAM, depending on the number of samples. As multiple samples need more memory than single samples, we deem it appropriate to restrict the sample sizes in our study consistently to 50,000 CRs per sample.
We conducted a series of merging tests to determine the number of samples we were able to merge with a certain amount of RAM. The results are shown in Table 4. However, the number of samples and the amount of RAM should be seen as guiding values, as they may differ between publication set types and sampling methods. In particular, the values obtained for the random sampling strongly depend on the random samples drawn. If the user has less than 8 GB of RAM available but still would like to merge 500 systematic samples of 50,000 CRs each, the samples can also be merged in batches, e.g., merging four batches of 125 samples each is possible with 4 GB RAM. However, the resulting CR variants might differ somewhat, as they might be determined differently in the various merging steps. In the case of cluster sampling, 2 GB were enough to analyze the publication year 2011, and it was possible to process the publication year 2014 with 4 GB RAM.
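The function removeCR in Listing 2 now contains a lower threshold than in the case of the population spectrogram. We propose to use the following rule of thumb for calculating the number of CRs to be removed:

$$\mathrm{threshold(sample)} = \left\lceil \frac{\mathrm{threshold(full)}}{\mathrm{NCR}_{\mathrm{full}} / \mathrm{NCR}_{\mathrm{sample}}} \right\rceil \qquad (1)$$

The number of CRs of each sample ($\mathrm{NCR}_{\mathrm{sample}}$) and of the population ($\mathrm{NCR}_{\mathrm{full}}$) can be determined via the function analyzeFile. The syntax of analyzeFile is analogous to the one of importFile. In our current case, this rule of thumb results in:

$$\mathrm{threshold(sample)} = \left\lceil \frac{100}{6{,}594{,}657 / 50{,}000} \right\rceil \approx \lceil 0.758 \rceil = 1 \qquad (2)$$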
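The rule of thumb can be checked with a short calculation. The sketch below assumes that the result is rounded up to the next integer; the function name and parameter names are chosen for illustration only.

```python
import math

def sample_threshold(threshold_full, ncr_full, ncr_sample):
    """Scale the population removal threshold down to the size of one sample.

    Rounding up with math.ceil is an assumption of this sketch.
    """
    return math.ceil(threshold_full / (ncr_full / ncr_sample))

# Values from the climate change dataset: population threshold 100,
# 6,594,657 CRs in the population, 50,000 CRs per sample
print(sample_threshold(100, 6_594_657, 50_000))  # 1
```

Since the population contains about 132 times as many CRs as a single sample, a population threshold of 100 corresponds to roughly 0.758 occurrences per sample, which yields a sample threshold of 1.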

Cluster sampling
For cluster sampling, the CRExplorer randomly selects one year from the given set of citing publication years. Then, all CRs from the papers in this year are selected and imported. To explore the cluster sampling, we used the publication years 2011, 2012, 2013, and 2014 and compared the corresponding spectrograms with the population spectrogram (see Figure 6). It seems from these results that the cluster sampling should not be recommended for RPYS. Future studies should explore whether the cluster sampling approach is appropriate for other publication sets. We could imagine, for instance, that this approach is feasible for research topics which emerged only a few years ago. In these cases, the CRs in the single citing years might be so uniform that the cluster sampling could work.
This meant that many users could not perform RPYS for broader research fields or topics. In this study, we present various sampling methods to solve this problem. The study therefore demonstrates the fruitfulness of the sampling approach for bibliometric studies. Some comments following the paper by Williams and Bornmann (2016) questioned the usefulness of this approach for bibliometric studies.
The statistical analysis of large datasets with the CRExplorer is becoming more prevalent, since the new program version makes it possible to import data from CrossRef (see https://www.crossref.org). The user of CrossRef gains free access to meta-data of publications which can be (1) downloaded as files and imported into the CRExplorer or (2) directly imported by using the CRExplorer search interface for CrossRef data. The search interface in particular allows fast access to comprehensive CR data from publications.
In this study, we introduce the script language of the CRExplorer which can be used to draw many samples from the population dataset (see also the handbook of the program at www.crexplorer.net). The language can be applied instead of using the menus in the program.
Script languages are standard in statistical software to automate the process of empirical analysis.
Once a script has been produced for a given dataset, the script can be used for further similar datasets. Scripts fulfill an important function in the replicability and reproducibility of empirical studies. If the script, dataset, and program for a published study are available, the results in the manuscript can be reproduced (and possible errors identified). Although replicability and reproducibility are essential components of the open science movement (Cumming & Calin-Jageman, 2016), scripts are scarcely available for popular bibliometric software, such as VOSviewer or CitNetExplorer. Thus, the user of the CRExplorer script language gains an impression of how the script language of bibliometric software could be designed.
Based on a large dataset of publications from climate change research, we compare RPYS results using population data with RPYS results using sampling data. We show RPYS results for three different sampling techniques: random sampling, systematic sampling, and cluster sampling.
From our comparison with the full RPYS (population spectrogram), we conclude that the cluster sampling performs worst and the systematic sampling performs best. The random sampling also performs very well but not as well as the systematic sampling. Merging 500 systematic samples of 50,000 CRs each reproduces the population RPYS rather accurately, and the same peak CRs are found in the sampled spectrogram as in the population spectrogram. Merging 10,000 random samples also results in the same peak CRs as obtained from the population RPYS results.
It is unknown whether our findings can be transferred to research fields other than climate change.
Studying different publication sets might make it necessary to increase the sample sizes or the number of samples drawn, or it might be possible to obtain good RPYS results with smaller sample sizes or fewer samples. We would like to encourage other studies to check which sample sizes and numbers of samples are needed to approximate the population spectrogram accurately enough.
has analyzed the RPYs before 1971. The restriction to RPYs before 1971 reduced the number of distinct CR variants to 239,887. This reduction of the number of cited references (NCR) made the RPYS analysis feasible. The CRs published between 1970 and 2014 comprise 6,594,657 CRs in 3,728,879 distinct CR variants. The main memory requirements rise with the number of unique CR variants, which makes it impossible to analyze the RPYS using the full climate change dataset on a current standard computer. Thus, the dataset is well suited to demonstrate different sampling methods in this study.
(the population spectrogram) serve as baseline for the comparison with the results based on the three sampling methods. The figure presents the NCRs for each RPY. Frequently occurring RPYs show up as distinct peaks within the RPYS spectrogram. The highest peak in Figure 1 with the most CRs is visible for RPY = 2000.

Figure 1: Annual distribution of CRs throughout the period 1970-2010 which have been cited in climate change publications (published between 1980 and 2014)

[...] : "savedrecs.txt", type: "WOS", RPY: [1970, 2014, false], PY: [...]
[...] : "savedrecs.cre")
exportFile(file: "savedrecs_CR.csv", type: "CSV_CR")
exportFile(file: "savedrecs_GRAPH.csv", type: "CSV_GRAPH")

Listing 1: CRExplorer script to analyze the CRs in the WoS file savedrecs.txt

Listing 1 imports the WoS file with the complete climate change data. Furthermore, it identifies variants of the same CR in the dataset, clusters them, and merges their occurrences (NCRs) (Thor, et al., 2016). Three export files are saved in different formats. The set function in the listing can be used to change options of the settings dialog in the CRExplorer. In this case, we set the median deviation to be calculated using two neighboring RPYs on each side, i.e. a five-year median deviation. The option n_pct_range: 0 is set here and in the following scripts for purely technical reasons. This option does not change the results presented in this study. The function importFile is needed to import WoS or Scopus files. We supply options to restrict the CRs to RPYs between 1970 and 2014 and the publication years of citing publications to the period between 1980 and 2014. The value of maxCR can be used to limit the number of imported CRs. A value of 0 means no limit. The function info prints a brief line of information to the screen. With the function cluster, we clustered the imported CRs automatically by using a similarity threshold of 0.75 and considering volume and page. The function merge merges the clustered CR variants. Consistent with Marx, Haunschild, Thor, et al. (2017), we removed all CR variants occurring less than 100 times with the removeCR function.

Figure 2: Annual distribution of random samples of the CRs throughout the period 1970-2010 which have been cited in climate change publications (published between 1980 and 2014)

Figure 3: Deviation of the randomly sampled RPYS results from the RPYS based on 10,000 samples

Figure 4: Annual distribution of systematic samples of the CRs throughout the period 1970-2010 which have been cited in climate change publications (published between 1980 and 2014)

Figure 5: Deviation of the systematically sampled RPYS results from the RPYS based on 10,000 samples

Figure 6: Annual distribution of cluster samples of the CRs throughout the period 1970-2010 which have been cited in climate change publications (published between 2011 and 2014)

Table 1: Most frequently occurring CRs, their titles, and NCR values from selected RPYs in Figure 1

No | CR | Title | NCR
CR5 | HAYS JD, 1976, SCIENCE, V194, P1121 | Variations in the Earth's orbit: pacemaker of the Ice Ages | 923
CR6 | NORTH GR, 1982, MON WEA REV, V110, P699 | Sampling errors in the estimation of empirical orthogonal functions | 676
CR7 | RASMUSSON EM, 1982, MON WEA REV, V110, P354 | Variations in tropical sea surface temperature and surface wind fields associated with the Southern Oscillation/El Nino |

orbital theory of the Ice Ages, the instability of the climate of the past, and dendrochronology in connection with climate research. Six CRs (CR3, CR6, CR7, CR11, CR12, and CR17) are concerned with meteorology. The publications mainly present measured data or modelling results with regard to the atmospheric and oceanic circulation systems. These two sets of CRs are distributed more or less equally over the selected time span. Since the year 2000, however, IPCC reports increasingly appear as the most frequently occurring CRs. Seven CRs (CR16, CR18, CR20, and CR23-CR26) are part of IPCC reports, mostly related to the scientific basis of climate change and emission scenarios of greenhouse gases. Finally, there are four CRs (CR1, CR2, CR8, and CR21) which deal with various other issues in climate change research, e.g. biological and statistical studies about effects from climate change.

Table 2: Most frequently occurring CRs from selected RPYs with their NCR values using 10,000 random samples

Table 3: Most frequently occurring CRs from selected RPYs with their NCR values using 500 systematic samples

Table 4: Amount of RAM necessary to merge a certain number of samples with 50,000 CRs each

Amount of RAM | Number of merged systematic samples | Number of merged random samples