Sequence analysis of annually normalized citation counts: an empirical analysis based on the characteristic scores and scales (CSS) method

Scientometrics, Sep 2017

In bibliometrics, only a few publications have focused on the citation histories of publications, where the citations for each citing year are assessed. In this study, therefore, annual categories of field- and time-normalized citation scores (based on the characteristic scores and scales method: 0 = poorly cited, 1 = fairly cited, 2 = remarkably cited, and 3 = outstandingly cited) are used to study the citation histories of papers. As our dataset, we used all articles published in 2000 and their annual citation scores until 2015. We generated annual sequences of citation scores (e.g., \(\left\{ {01233233221} \right\}\)) and compared the sequences of annual citation scores of six broader fields (natural sciences, engineering and technology, medical and health sciences, agricultural sciences, social sciences, and humanities). In agreement with previous studies, our results demonstrate that sequences with poorly cited (0) and fairly cited (1) elements dominate the publication set; sequences with remarkably cited (2) and outstandingly cited (3) periods are rare. The highest percentages of constantly poorly cited papers can be found in the social sciences; the lowest percentages are in the agricultural sciences and humanities. The largest group of papers with remarkably cited (2) and/or outstandingly cited (3) periods shows an increasing impact over the citing years with the following orders of sequences: \(\left\{ {0123} \right\}\) (6.01%), which is followed by \(\left\{ {123} \right\}\) (1.62%). Only 0.11% of the papers (n = 909) are constantly on the outstandingly cited level.

https://link.springer.com/content/pdf/10.1007%2Fs11192-017-2521-9.pdf

Lutz Bornmann, Adam Y. Ye, Fred Y. Ye

Affiliations:
Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing University, Nanjing 210023, China
Center for Bioinformatics, School of Life Sciences, Peking University, Beijing 100871, China
Division for Science and Innovation Studies, Administrative Headquarters of the Max Planck Society, Hofgartenstr. 8, 80539 Munich, Germany

Introduction

Bibliometrics is the backbone of scientometrics; most of the studies in scientometrics are based on publication and citation data (Vinkler 2016). Bibliometrics applies statistical methods for analyzing counts of publications and citations (University of Waterloo Working Group on Bibliometrics 2016). Since the introduction of citation analysis (Garfield 1955), citations have been seen as the basic unit of impact, which follows from "votes" of citing authors for publications (Bornmann and Marx 2014; Jha et al. 2016). "The act of citing another person's research provides the necessary linkages between people, ideas, journals and institutions to constitute an empirical field or network that can be analysed quantitatively" (Mingers and Leydesdorff 2015, p. 1).

Many publications in bibliometrics have focused on analyzing the distributions of citations. For example, Albarrán and Ruiz-Castillo (2011) investigated 3.7 million articles published in 22 scientific fields. They found that "citation distributions are highly skewed: About 70% of all articles receive citations below the mean, and articles with a remarkable or outstanding number of citations represent about 9% of the total" (p. 48). According to the results of Ponomarev et al. (2012), "a typical citation pattern has an initial period of slow citation growth lasting from 5 to 20 months… After this initial slow growth phase, the citation rates accelerate until they reach saturation plateaus, after which they decrease". However, there is a gap in the literature with respect to studies analyzing citation distributions in more detail.
In this study, therefore, annual categories of normalized citation scores ("poorly cited", "fairly cited", "remarkably cited", and "outstandingly cited") are used to study the citation histories of papers (Glänzel and Schubert 1988). As our dataset, we use all the articles published in 2000 and their annual citation scores until 2015. We compare the sequences of annual citation scores in six broader fields (natural sciences, engineering and technology, medical and health sciences, agricultural sciences, social sciences, and humanities).

Literature overview

An early study focusing on the number of citations as a function of time was published by Vlachy (1985). The aging of information in papers (measured by synchronous or diachronous methods) has been studied by Glänzel and Schubert (1995) as well as Glänzel (1997, 2004). Schubert and Glänzel (1986) introduced the so-called "response time", which reveals the speed of receiving citation impact (see also Bornmann and Daniel 2010). They found that response times differ between fields.

Only a few studies have focused on the citation histories of publications, where the citations for every year are assessed (whether they are lower or higher compared to citations which other publications received in the same year). Most of these studies have dealt with specific distributions of citations. Good examples are sleeping beauties. These are papers which generate little or no citation impact over a long time period (e.g. 10 years) before they start to generate considerable impact. According to Mir and Ausloos (2016), the phenomenon of sleeping beauties is also labeled as resisted discoveries, premature discoveries, delayed recognition, or information awakening. Overviews of studies on sleeping beauties can be found in Teixeira et al. (2016) and Min et al. (2016).

Recently, the citation histories of papers have been investigated in more detail by two studies.
Baumgartner and Leydesdorff (2014) explored the citation curves (1) of six journals in different fields as well as (2) in one entire field (virology) over 16 years. Basically, they found two typical curves: "sticky knowledge claims" continue to be cited more than 10 years after publication. "Transient knowledge claims" show a decay pattern after reaching an early peak. The other study by Colavizza and Franceschet (2016) investigated the Physical Review archive, covering 120 years of physics. They found the following three types of citation curve: "(1) Marathoners: publications which start fast or slow, reach a moderate peak and keep improving the ratio of received citations, or at least keep being relevant over prolonged amounts of time by manifesting a slow decline or a plateau. Marathoners in effect tend to age slowly, or not at all, and are also more numerous and varied than sprinters. (2) Sprinters: publications with fast, even extremely fast and high peak, and equally rapid ageing. These publications are immediately relevant for their community, and rapidly forgotten thereafter, and are fewer in number in the APS dataset. (3) Middle-of-the-roads: publications with a citation history close to the global average citation history, that is, a fast but moderately peaking curve with a gradual decay over time" (p. 1043).

Methods

Field normalization of citation impact

This study uses standard impact scores in bibliometrics, namely field- and time-normalized citation impact scores (in a dynamical variant) (Vinkler 2010). These dynamically normalized impact counts (DNIC) are defined as

\[ \mathrm{DNIC}_{ij} = \frac{C_{ij}}{E_{fj}}, \quad f = f(i) \tag{1} \]

\[ E_{fj} = \frac{1}{N_{fj}} \sum_{i \mid f = f(i)} C_{ij} \tag{2} \]

where i = 1, 2,… are publications, j = 1, 2,… are citing years, and f = 1, 2,… are fields. Here, field delineations based on disciplinary OECD minor codes are used. The OECD field definitions can be found at http://www.oecd.org/science/inno/38235147.pdf. We selected the 2-digit level scheme.
\(C_{ij}\) denotes citations received by publication i in year j, and \(E_{fj}\) denotes mean (received) citations of all publications in field f and year j (i.e. \(E_{fj}\) is the expected value). \(N_{fj}\) is the number of cited publications in field f and year j (\(N_{fj}\) is based on non-zero citations), and f = f(i) means a certain field of a given publication. The indicator follows the standard approach in bibliometrics with both field- and time-normalized citations (Waltman 2016). The difference from the standard approach in bibliometrics is that the calculation is based on annual citations, and not on the citations between publication year and a fixed time point later on.

If \(C_{ij} = 0\), then \(\mathrm{DNIC}_{ij} = 0\). If \(\mathrm{DNIC}_{ij} > 1\), the citation impact of the publication is higher than the average in the corresponding OECD disciplinary category and (cited as well as citing) publication years. If \(\mathrm{DNIC}_{ij} < 1\), the impact is lower than the average.

Classifying of publications using the CSS method

Glänzel and Schubert (1988) introduced the characteristic scores and scales (CSS) method for grouping ranked observations into rank-specific categories (see also Glänzel 2007, 2010, 2011). Consider a set of n papers. The observed citations \(X_i\) received by paper i are ranked in descending order, \(X_1^* \ge X_2^* \ge \dots \ge X_n^*\), where \(X_1^*\) and \(X_n^*\) denote the citations of the most and least frequently cited papers, respectively. Set the initial values \(b_0 = 0\) and \(v_0 = n\), where n is the number of papers. \(b_1\) is defined as the mean citations; \(v_1\) is defined by the comparison \(X^*_{v_1} \ge b_1\) and \(X^*_{v_1 + 1} < b_1\). This comparison is repeated, yielding

\[ b_k = \frac{1}{v_{k-1}} \sum_{i=1}^{v_{k-1}} X_i^*, \quad \text{with } X^*_{v_k} \ge b_k \text{ and } X^*_{v_k + 1} < b_k, \quad \text{for } k \ge 2 \tag{3} \]

Thus, we obtain the series \(b_0 \le b_1 \le \dots\) and \(v_0 \ge v_1 \ge \dots\). The kth class is defined by the pair of threshold values \([b_{k-1}, b_k]\); the number of papers belonging to this class amounts to \(v_{k-1} - v_k\).
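The annual normalization (DNIC) and the CSS thresholds described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the study: the function names, the dictionary-based input format, and the toy numbers are assumptions made for the example. A value equal to a threshold is counted as reaching it, in line with the comparison \(X^*_{v_k} \ge b_k\).

```python
from collections import defaultdict


def dnic_scores(citations, field_of):
    """Annual normalized scores for ONE citing year j.

    citations: {paper_id: citations received in that year}  (C_ij)
    field_of:  {paper_id: field label}                      (f(i))
    Both argument names are illustrative, not from the paper.
    """
    total = defaultdict(int)    # sum of C_ij per field
    n_cited = defaultdict(int)  # N_fj: papers with non-zero citations
    for paper, c in citations.items():
        if c > 0:
            total[field_of[paper]] += c
            n_cited[field_of[paper]] += 1
    scores = {}
    for paper, c in citations.items():
        f = field_of[paper]
        # E_fj = total / N_fj; a paper with C_ij = 0 gets DNIC_ij = 0.
        scores[paper] = c * n_cited[f] / total[f] if c > 0 else 0.0
    return scores


def css_thresholds(values, n_thresholds=3):
    """Thresholds b_1, b_2, ...: b_1 is the mean of all values, and each
    further threshold is the mean of the values at or above the previous one."""
    subset = list(values)
    thresholds = []
    for _ in range(n_thresholds):
        if not subset:
            break
        b = sum(subset) / len(subset)
        thresholds.append(b)
        subset = [x for x in subset if x >= b]
    return thresholds


def css_class(value, thresholds):
    """0 = poorly, 1 = fairly, 2 = remarkably, 3 = outstandingly cited."""
    return sum(value >= b for b in thresholds)


# Toy example (made-up values for one year and one field).
toy_values = [0, 0, 0, 0, 1, 1, 2, 3, 5, 8]
toy_thresholds = css_thresholds(toy_values)
toy_classes = [css_class(x, toy_thresholds) for x in toy_values]
```

In the setting of this paper, `css_thresholds` would be applied to the DNIC scores of one citing year (within a discipline or across disciplines) rather than to raw citation counts.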
The CSS method can be used to classify the papers within certain fields into four impact classes: "poorly cited", "fairly cited", "remarkably cited", and "outstandingly cited". Then, for example, the share of outstandingly cited papers can be determined for a set which includes papers from different fields (e.g. all papers published by a university). However, the method can be used to classify not only single papers, but also certain aggregates of papers. For example, Bornmann and Glänzel (2017) propose using the CSS method to classify the universities in a specific ranking (e.g. the Leiden ranking) into performance classes (e.g. based on the number of highly-cited papers). The universities can then be separated into low and high performers.

In this study, we use the CSS method for classifying the papers into four citation impact classes based on \(\mathrm{DNIC}_{ij}\). Thus, we do not use the citation counts of single papers, but the annual field- and time-normalized scores for the classification. Consider the set \(\mathrm{DNIC}_{ij}\) of n papers published in various disciplines. We used the OECD major codes to compare the results of six broad disciplines: natural sciences, engineering and technology, medical and health sciences, agricultural sciences, social sciences, and the humanities. The broad disciplines are aggregates of OECD minor codes. In each discipline and across disciplines, the \(\mathrm{DNIC}_{ij}\) scores (of paper i in a given year j) are ranked in descending order \((\mathrm{DNIC}_1^* \ge \mathrm{DNIC}_2^* \ge \dots \ge \mathrm{DNIC}_n^*)_j\). The comparison between DNIC and b is defined by

\[ b_{kj} = \frac{1}{v_{k-1}} \sum_{i=1}^{v_{k-1}} \mathrm{DNIC}^*_{ij}; \quad \mathrm{DNIC}^*_{v_k j} \ge b_{kj} \text{ and } \mathrm{DNIC}^*_{(v_k + 1) j} < b_{kj} \tag{4} \]

Then, the pair of threshold values \([b_{k-1}, b_k]\) forms the impact class. Using the CSS method, the annual categorization of papers to citation impact classes is therefore based on the annual DNIC scores. The values of the annual DNIC scores are kept with min \(k \ge 2, 3, \dots\), respectively, which means \(k \ge 2, 3, \dots\) in every year after the publication year.
Since the values k = 2 and k = 3 are usually used to identify highly cited papers (Glänzel 2011), we set \(k \ge 2\) as "fairly cited" papers, \(k \ge 3\) as "remarkably cited" papers, and \(k \ge 4\) as "outstandingly cited" papers in the long run.

Sequence analysis of annual CSS scores

In a yearly time series j = 1, 2,…, m, the annual CSS scores k of each publication form a sequence across 16 years (starting in 2000). In other words, we have a sequence of 16 scores for every publication with values between 0 = poorly cited and 3 = outstandingly cited. Two examples of sequences are shown in Fig. 1. Sequence \(\{a\}\) is \(\{01233233221\}\) and sequence \(\{b\}\) is \(\{01001000100\}\). \(\{a\}\) indicates a highly cited publication (most of the time) and \(\{b\}\) a constantly little cited or non-cited publication.

The statistical analyses of the data in the current study are based on the strategy proposed by Brzinsky-Fay et al. (2006) for the analysis of sequence data. Sequence data is analyzed in many research fields, e.g. DNA sequences in biology and life courses in social sciences. "A sequence is defined as an ordered list of elements, where an element can be a certain status (e.g., employment or marital status), a physical object (e.g., base pair of DNA, protein, or enzyme), or an event (e.g., a dance step or bird call). The positions of the elements are fixed and ordered by elapsed time or by another more or less natural order" (Brzinsky-Fay et al. 2006, p. 435).

Dataset used

The bibliometric data used in this study is from an in-house database developed and maintained by the Max Planck Digital Library (MPDL, Munich) and derived from the Science Citation Index Expanded (SCI-E), Social Sciences Citation Index (SSCI), and Arts and Humanities Citation Index (AHCI) prepared by Clarivate Analytics, formerly the IP & Science business of Thomson Reuters. The study is based on 790,698 articles published in 2000 and the corresponding citations across 16 citing years (with 2000 as the first citing year).
Since many papers have been assigned to more than one OECD minor code, 161,302 papers appear between two and six times in the dataset (435,634 papers have no duplicates). We decided to let the papers appear multiple times in the dataset, since the papers might have different citation distributions in the disciplines. Table 1 shows the number of annual CSS categories in the dataset. Since we included 790,698 articles with 16 annual citation scores each in the study, the study is based on 12,651,168 annual CSS categories.

Table 1 Annual CSS categories in the dataset

CSS category                Absolute number
Poorly cited (0)                  8,956,874
Fairly cited (1)                  2,642,053
Remarkably cited (2)                753,340
Outstandingly cited (3)             298,901
Total                            12,651,168

Results

Descriptive statistics

The sequence analyses which we describe in the "Sequence analysis" section are based on several transformations of the original raw data from the MPDL in-house database. In order to reveal the relations between the raw data and the transformed (field- and time-normalized) data, Table 2 shows annual citations, annual normalized citation scores (DNIC), and sequences of CSS scores for some example papers. Table 2 illustrates the spectrum of different citation impact histories in the dataset. Group (1) in the table consists of papers with increasing citation impact over the citing years. The citation impact of the papers in group (2) is more or less stable over the years. Decreasing and fluctuating histories are shown under groups (3) and (4) in the table, respectively. The WoS accession numbers listed can be used to inspect the paper and its citations in WoS in more detail.

The CSS method was initially proposed by Glänzel and Schubert (1988).
Since then, the method has been used in various contexts to classify single papers or aggregates of papers as "poorly cited", "fairly cited", "remarkably cited", and "outstandingly cited" (Albarrán and Ruiz-Castillo 2011; Bornmann and Glänzel 2017; Glänzel 2007, 2010, 2011; Li et al. 2013). Although the studies were based on different bibliometric datasets, the distributions seem to follow (more or less) a general distribution pattern of percentages: 70% (poorly cited), 21% (fairly cited), 7% (remarkably cited), and 2% (outstandingly cited). In addition, similar distribution patterns are reported by Chi and Glänzel (2016) in the context of usage counts.

Table 3 presents distributions of "poorly cited", "fairly cited", "remarkably cited", and "outstandingly cited" papers in the six disciplines which we considered in our study. The statistics in the table refer to CSS scores across 16 citing years (beginning in 2000). For example, the mean percentage of poorly cited papers in natural sciences is 70.57% across 16 citing years; the lowest percentage is 66.21% and the highest is 77.49%. The range between the minimum and maximum percentages is 11.28 points. The comparison of the percentages in Table 3 with the general distribution pattern of percentages (70, 21, 7, and 2%) reveals that natural sciences, engineering and technology, medical and health sciences, and agricultural sciences are more similar to the general distribution pattern than the social sciences and the humanities. However, the largest variability of the percentages over the years can be observed for the agricultural sciences (see the ranges in Table 3). Similar field-specific differences in distributions of CSS scores are also reported by Glänzel (2011) and Albarrán and Ruiz-Castillo (2011).
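As a small worked example, the overall shares of the four CSS categories can be derived from the absolute numbers reported for Table 1 and set against the general 70, 21, 7, and 2% pattern mentioned above. The sketch below is an illustration only; the variable names are assumptions, and the computed shares are derived here rather than copied from the paper's tables.

```python
# Absolute numbers of annual CSS categories as reported for Table 1.
counts = {
    "poorly cited (0)": 8_956_874,
    "fairly cited (1)": 2_642_053,
    "remarkably cited (2)": 753_340,
    "outstandingly cited (3)": 298_901,
}

# General pattern of percentages reported in the CSS literature.
general_pattern = {
    "poorly cited (0)": 70,
    "fairly cited (1)": 21,
    "remarkably cited (2)": 7,
    "outstandingly cited (3)": 2,
}

total = sum(counts.values())

# Share of each category in percent, and its deviation from the pattern
# in percentage points.
shares = {label: 100 * n / total for label, n in counts.items()}
deviation = {label: shares[label] - general_pattern[label] for label in counts}

for label in counts:
    print(f"{label:<26}{shares[label]:6.2f}%  ({deviation[label]:+.2f} points)")
```

The total also serves as a consistency check: it should equal 790,698 articles times 16 annual scores.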
[Fig. 1: two example sequences of annual CSS scores, \(\{a\}\) and \(\{b\}\). Table 2: annual citations, annual DNIC scores, and sequences of CSS scores for example papers in groups (1) to (4). Table 3: mean, minimum, maximum, and range of the annual percentages of CSS categories in the six disciplines. Table 4: most frequent sequences (absolute numbers and percentages) for all papers and the six disciplines.]

[…] sequences, the selected 17 sequences from the total set are listed for all disciplines (although other sequences might meet the threshold of 0.5% in single disciplines).
In accordance with the prevalence of skewed citation distributions in the sciences and the dominance of non-cited and little cited papers, the list of sequences in Table 4 only contains two CSS scores: 0 = poorly cited and 1 = fairly cited. Thus, in the set of all papers (and also in most of the disciplines), sequences with 2 = remarkably cited and 3 = outstandingly cited are rare (less than 0.5%).

Figure 2 shows the sequences in the dataset as sequence index plots. Whereas Table 4 focusses on the most frequent sequences, all sequences are included in Fig. 2. The plots show a horizontal line for each sequence, distinguishing the CSS scores with different colors (Brzinsky-Fay et al. 2006). Similarly to Table 4, Fig. 2 demonstrates that the group of sequences with constantly poorly cited elements is the biggest group at the top of the plots. Below this biggest group, we can observe those sequences which are commonly labeled as sleeping beauties. This is a relatively small set of papers which are poorly cited initially and remarkably or outstandingly cited in later years. Another group of papers (sequences) is also clearly visible in Fig. 2. These papers are poorly cited most of the time with a short interruption of a fairly cited period (mostly 1 year). The probability of interruption in early years is higher than in later years in all disciplines. This is especially visible for the agricultural sciences and social sciences, where a large red bar is visible in the second year after publication (see the corresponding higher percentages for these disciplines in Table 4). At the bottom of all plots, the small set of constantly outstandingly cited papers is visible.

With regard to the differences between the disciplines, Table 4 shows that the social sciences are the discipline with the highest percentage of constantly poorly cited papers (29.59%). The lowest percentages are in the agricultural sciences (18.58%) and humanities (19.59%).
Thus, there is a large difference between the social sciences and the humanities (although they are frequently treated together in bibliometrics). However, both disciplines show similar results if we look at the horizontal "Total" line in Table 4. Both disciplines have the highest percentages, which means that the sequences are more highly concentrated than those in other disciplines. This might be partly an effect of the lower number of sequences. However, agricultural sciences also have a relatively low number of sequences, but the concentration of sequences is significantly lower than in the social sciences and the humanities.

In order to obtain a better overview of the sequences in the dataset, two further analyses have been done. The analyses condense the sequences still further. The first condensation, which is shown in Table 5, treats sequences identically if they consist of the same elements. That means the sequence \(\{2112\}\) is treated the same as \(\{1222\}\) because both sequences consist of the CSS scores 2 and 1 only. The results in Table 5 refer to the complete dataset and, unlike the results in Table 4, are not restricted to the most frequent sequences.

The results in Table 5 confirm the results in Table 4 and Fig. 2. About a quarter of the sequences consist of constantly poorly cited papers \(\{0\}\). However, the largest group of sequences is \(\{01\}\), which includes poorly cited and fairly cited periods (46.85%). This group of papers is especially dominant in the humanities with 64.35%. There is a third large group of sequences (19.43%) in Table 5, \(\{012\}\), which includes poorly cited, fairly cited, and remarkably cited periods. This group contains about 20% of the papers in all disciplines except one: in the humanities, only 11.82% of the papers have these three elements.
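The first condensation described above, which groups sequences by the set of CSS scores they contain, can be illustrated with a short Python sketch. The helper names and the toy sequences are hypothetical and serve only as an example; sequences are assumed to be stored as strings of digits.

```python
from collections import Counter


def element_set(sequence):
    """Condense a sequence to its distinct CSS scores in ascending order,
    so that '2112' and '1222' both become '12'."""
    return "".join(sorted(set(sequence)))


def condensed_shares(sequences):
    """Percentage of sequences per element set."""
    counts = Counter(element_set(s) for s in sequences)
    n = len(sequences)
    return {key: 100 * c / n for key, c in counts.items()}


# Toy input (made-up sequences, not data from the study).
toy_sequences = ["0000", "0100", "2112", "1222", "0123"]
toy_shares = condensed_shares(toy_sequences)
```

Because the condensation ignores both the order and the duration of the periods, a sleeping beauty and a paper with early impact followed by decline can end up in the same group.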
[Table 5: sequences grouped by the set of CSS scores they contain (absolute numbers and percentages) for all papers, natural sciences, engineering and technology, medical and health sciences, agricultural sciences, social sciences, and humanities.]

The results in Table 5 allow a closer look at the sequences which include outstandingly cited periods (3). The largest group of papers with such a period is \(\{0123\}\) (6.01%), which is followed by \(\{123\}\) (1.62%) in the table. Only 0.11% of the papers (n = 909) are constantly on the outstandingly cited level over a period of 16 years. Most of these papers have been published in the natural sciences (n = 417) and medical and health sciences (n = 383). There is only one such paper in the humanities and 6 such papers in agricultural sciences. Constant performers on the level of fairly cited (1) or remarkably cited (2) are very rare in the dataset. In total, only 37 papers are constantly fairly cited and 3 papers constantly remarkably cited.

[Table 6: sequences grouped by the order of CSS scores (absolute numbers and percentages) for all papers and the six disciplines.]

The second condensation, which is shown in Table 6, treats identically all sequences that have the same order of CSS scores. That means the sequence \(\{2112\}\) is treated the same as \(\{211112\}\) because the CSS scores appear in the same order in both sequences (first 2, then 1, and then 2 again). The sequences which are shown in Table 6 are restricted to those with at least 0.5% of the papers in the dataset, similar to Table 4. Again, the results in Table 6 reveal that about a quarter of the papers are constantly poorly cited (with a significantly higher percentage in the social sciences). 13.9% of the papers have a sequence with initially increasing citation impact (from 0 to 1) and then decreasing impact (from 1 to 0). The \(\{010\}\) sequence order is followed in frequency by the \(\{10\}\) order (8.66% of the papers) and the \(\{1010\}\) order (5.51%). In Table 6, remarkably cited or outstandingly cited periods do not play any role. Their occurrences are too low in general.

Discussion

In recent years, a development has become apparent in bibliometrics: citation impact is no longer reduced to the times-cited information, but is analyzed more specifically.
For example, the citation context is considered in bibliometric analyses to obtain more specific information on the impact of publications and how cited publications are perceived (Small et al. 2017). Carroll (2016) takes into account "the frequency with which the paper is cited within citing publications … adding depth and value to the citation metric" (p. 1329). The results of Hu et al. (2015) show that, when a paper is cited multiple times within a citing paper, successive citations are more intentional and reasonable than first-time citations. The "Literature overview" section in this paper presents some further studies which take a closer look at citations by investigating the citation history of papers.

In this study, we used a method for the analysis of citation distributions which has never been used before in bibliometrics (to the best of our knowledge). Based on annually normalized citation scores, we generated annual sequences of CSS scores (e.g. \(\{01233233221\}\)) which we analyzed using the strategy proposed by Brzinsky-Fay et al. (2006). This strategy allows the identification of very frequent and less frequent sequences over the complete publication set and disciplinary sets.

In agreement with previous studies, our results demonstrate that sequences with poorly cited (0) and fairly cited (1) elements dominate the publication set; sequences with remarkably cited (2) and outstandingly cited (3) periods are rare. The highest percentages of constantly poorly cited papers can be found in the social sciences; the lowest percentages are in the agricultural sciences and humanities. The largest group of papers with remarkably cited (2) and/or outstandingly cited (3) periods shows an increasing impact over the citing years with the following orders of sequences: \(\{0123\}\) (6.01%), which is followed by \(\{123\}\) (1.62%). Only 0.11% of the papers (n = 909) are constantly on the outstandingly cited level.
These constantly outstandingly cited papers might be the few that significantly drive scientific progress (Rodríguez-Navarro 2016).
This study was a first attempt to use sequence analyses with bibliometric data. We think that this statistical approach can lead to interesting insights into citation histories. The application of this approach can be extended beyond the analyses in our study. For example, future research could focus on the comparison of sequences and the measurement of differences between two sequences. According to Brzinsky-Fay et al. (2006), the so-called Levenshtein distance has been used for comparisons in various fields, such as plagiarism detection and the analysis of DNA sequences. The Levenshtein distance quantifies the distance between two sequences as the minimum number of single-element insertions, deletions, and substitutions needed to transform one sequence into the other. Another topic for future research could be possible explanations of differences between sequences. Distance measures between two sequences could be included as dependent variables in regression models, which are then explained by various characteristics of the publications (e.g., their subject category, country of origin, or reputations of authors).
Acknowledgements
Open access funding provided by Max Planck Society. The bibliometric data used in this paper are from an in-house database developed and maintained by the Max Planck Digital Library (MPDL, Munich) and derived from the Science Citation Index Expanded (SCI-E), Social Sciences Citation Index (SSCI), and Arts and Humanities Citation Index (AHCI) prepared by Clarivate Analytics, formerly the IP & Science business of Thomson Reuters. Fred Y. Ye acknowledges partial financial support from the National Natural Science Foundation of China (Grant No. 71673131).
Open Access
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
Albarrán, P., & Ruiz-Castillo, J. (2011). References made and citations received by scientific articles. Journal of the American Society for Information Science and Technology, 62(1), 40-49. doi:10.1002/asi.21448.
Baumgartner, S. E., & Leydesdorff, L. (2014). Group-based trajectory modeling (GBTM) of citations in scholarly literature: Dynamic qualities of "transient" and "sticky knowledge claims". Journal of the Association for Information Science and Technology, 65(4), 797-811. doi:10.1002/asi.23009.
Bornmann, L., & Daniel, H. D. (2010). Citation speed as a measure to predict the attention an article receives: An investigation of the validity of editorial decisions at Angewandte Chemie International Edition. Journal of Informetrics, 4(1), 83-88.
Bornmann, L., & Glänzel, W. (2017). Applying the CSS method to bibliometric indicators used in (university) rankings. Scientometrics, 110(2), 1077-1079. doi:10.1007/s11192-016-2198-5.
Bornmann, L., & Marx, W. (2014). The wisdom of citing scientists. Journal of the American Society of Information Science and Technology, 65(6), 1288-1292.
Brzinsky-Fay, C., Kohler, U., & Luniak, M. (2006). Sequence analysis with Stata. The Stata Journal, 6(4), 435-460.
Carroll, C. (2016). Measuring academic research impact: Creating a citation profile using the conceptual framework for implementation fidelity as a case study. Scientometrics, 109(2), 1329-1340. doi:10.1007/s11192-016-2085-0.
Chi, P. S., & Glänzel, W. (2016). Do usage and scientific collaboration associate with citation impact? In I. Rafols, J. Molas-Gallart, E. Castro-Martínez & R. Woolley (Eds.), Proceedings of the 21th International conference on science and technology indicators-peripheries, frontiers and beyond (pp. 1223-1228). Valencia, Spain.
Colavizza, G., & Franceschet, M. (2016). Clustering citation histories in the physical review. Journal of Informetrics, 10(4), 1037-1051. doi:10.1016/j.joi.2016.07.009.
Garfield, E. (1955). Citation indexes for science-new dimension in documentation through association of ideas. Science, 122(3159), 108-111.
Glänzel, W. (1997). On the possibility and reliability of predictions based on stochastic citation processes. Scientometrics, 40(3), 481-492. doi:10.1007/Bf02459295.
Glänzel, W. (2004). Towards a model for diachronous and synchronous citation analyses. Scientometrics, 60(3), 511-522. doi:10.1023/B:SCIE.0000034391.06240.2a.



Lutz Bornmann, Adam Y. Ye, Fred Y. Ye. Sequence analysis of annually normalized citation counts: an empirical analysis based on the characteristic scores and scales (CSS) method, Scientometrics, 2017, 1-16, DOI: 10.1007/s11192-017-2521-9