The h-index as an almost-exact function of some basic statistics
Lucio Bertoli-Barsotti
Tommaso Lando
JEL Classification C 0
Mathematical Subject Classification 62P99
As is known, the h-index, h, is an exact function of the citation pattern. At the same time, and more generally, it is recognized that h is "loosely" related to the values of some basic statistics, such as the number of publications and the number of citations. In the present study we introduce a formula that expresses the h-index as an almost-exact function of four basic statistics. On the basis of an empirical study, in which we consider citation data obtained from two different lists of journals from two quite different scientific fields, we provide evidence that our ready-to-use formula is able to predict the h-index very accurately (at least for practical purposes). For comparative reasons, alternative estimators of the h-index have been considered and their performance evaluated by drawing on the same dataset. We conclude that, in addition to its own interest as an effective proxy representation of the h-index, the formula introduced may provide new insights into the "factors" determining the value of the h-index, and how they interact with each other.
h-Index; W function
Department of Management, Economics and Quantitative Methods, University of Bergamo, Via dei Caniana 2, 24127 Bergamo, Italy
Department of Finance, VŠB-TU Ostrava, Sokolská 33, 70121 Ostrava, Czech Republic
The purpose of this paper is to present a formula with which to determine (estimate) the h-index, h, under incomplete information conditions (IIC). By IIC we mean the situation in which, for whatever reason, we do not know the whole set of citation data, that is, the entire citation profile that would allow us to obtain the exact value of the h-index. This is the case, for example, when only a few "basic" citation statistics (other than the h-index) are published, or known to us.
To be concrete, we will refer to simple citation indicators (to use the words of Hirsch (2005), "single-number criteria commonly used to evaluate scientific output") such as:
1. total number of citations $C$;
2. total number of citations for the $t$ ($t \in \{1, 2, 3, \ldots\}$) most-cited publications, $C_t$; thus, $C_t = \sum_{i=1}^{t} c(i)$, where $c(i)$ represents the number of citations to publication $i$, and where publications are ranked in decreasing order of the number of citations: $c(1) \ge c(2) \ge \cdots \ge c(T)$;
3. total number of publications $T$;
4. total number of "significant" publications, that is, those with at least a predetermined number of citations $k$ each ($k \in \{1, 2, 3, \ldots\}$), $T_k$.
In this paper we focus on these indicators in their simplest versions, that is: C, C1, T and
T1. The purpose of the analysis is twofold: to estimate the h-index (when it cannot be
determined directly from the data) and hence at the same time to identify the main factors
which influence the level of the h-index. A crucial question is therefore the extent to which
the h-index can be satisfactorily predicted from knowledge of only the above basic
statistics—i.e. under IIC.
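The four basic statistics, and the exact h-index they are meant to predict, can be sketched in a few lines of Python. This is a minimal illustration only: the function name `basic_statistics` and the sample citation counts are hypothetical and do not come from the paper's datasets.

```python
from typing import Dict, Sequence


def basic_statistics(citations: Sequence[int]) -> Dict[str, int]:
    """Return C, C1, T, T1 and the exact h-index of one citation profile."""
    ranked = sorted(citations, reverse=True)   # c(1) >= c(2) >= ... >= c(T)
    stats = {
        "C": sum(ranked),                        # total number of citations
        "C1": ranked[0] if ranked else 0,        # citations of the most-cited paper
        "T": len(ranked),                        # total number of publications
        "T1": sum(1 for c in ranked if c >= 1),  # publications cited at least once
    }
    # Because the list is sorted, c(i) >= i holds exactly for ranks 1..h.
    stats["h"] = sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)
    return stats


profile = [10, 8, 5, 4, 3, 0, 0]   # hypothetical citation counts
print(basic_statistics(profile))
```

Under IIC only the first four entries of the returned dictionary would be available; the `h` entry requires the full profile.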
More formally, we are searching for a formula
$$\hat{h} = \hat{h}(S_1, \ldots, S_r), \qquad (1)$$
$1 \le r \le 4$, $S_j \in S$, $1 \le j \le r$, where $S = \{C, C_1, T, T_1\}$. Note that the formula $\hat{h}$ can be interpreted as a genuine estimator of the h-index, $h$, i.e. $\hat{h} \cong h$, because it does not depend on the values of unknown parameters.
Possible estimators under IIC of the h-index can be found in the literature:
A very simple proxy for the h-index is given by $h_H = \sqrt{C/a}$. This model, which can be traced back to Hirsch (2005), is not a genuine estimator of the h-index because $h_H$ is still a function of an unknown parameter, $a$, and the formula itself does not specify how to estimate this parameter in terms of the above basic statistics. Nevertheless, an estimator for the h-index can be obtained by substituting the unknown parameter $a$ with a fixed constant (Hirsch found "empirically" that $a$ lay between 3 and 5).
Redner (2010) found that "$\sqrt{C}$ is essentially equivalent to the h-index, up to an overall factor that is close to 2" (put otherwise, he found that the ratio $\sqrt{C}/2h$ has an empirical distribution "sharply peaked about 1"). This suggests the approximating formula
$$\hat{h} = h_R = \sqrt{C}/2$$
with $r = 1$, $S = \{C\}$, which we could then call the Redner formula: probably the simplest estimator of the h-index under IIC.
While $h_R$ is a model-free proxy for the h-index, more elaborate solutions have been attempted in the literature by assuming specific probabilistic distributions for the citation rate. For example, a formula that follows model (1), with $r = 4$, has been recently introduced by Bertoli-Barsotti and Lando (2017):
h ¼ h~ð1Þ
W ¼ log 1
where $\tilde{m}_1 = (C - C_1)/(T_1 - 1)$ is nothing but a "trimmed" version of the simple sample mean $C/T_1$, and where $W(\cdot)$ represents the so-called Lambert-W function (Corless and Jeffrey 2015). The Lambert-W function is the function $W(z)$ satisfying $z = W(z)e^{W(z)}$, and it can be computed with standard mathematical software, for example the Mathematica software package (Wolfram Research, Inc. 2014) or the R statistical computing environment (R Development Core Team 2012). The use of a "trimmed" version of the sample mean is a simple technique with which to make the sample mean more robust with respect to a single outlier, that is, a single highly-cited paper that could substantially inflate the mean, as is well known.
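Where no such software is at hand, the defining relation $z = W(z)e^{W(z)}$ can be solved numerically. The following is a minimal sketch, assuming only non-negative arguments are needed; the function name `lambert_w`, the Newton iteration and the starting value $\log(1+z)$ are choices made for this illustration, not part of the paper.

```python
import math


def lambert_w(z: float, tol: float = 1e-12, max_iter: int = 100) -> float:
    """Principal branch W(z) for z >= 0, i.e. the w >= 0 with w * exp(w) = z."""
    if z < 0:
        raise ValueError("this sketch only handles z >= 0")
    if z == 0:
        return 0.0
    # log(1 + z) is an upper bound for W(z) when z >= 0; since w * exp(w) is
    # increasing and convex there, Newton's method decreases towards the root.
    w = math.log1p(z)
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))   # Newton step for w*e^w - z = 0
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w


w = lambert_w(math.e)
print(w, w * math.exp(w))   # the second value should reproduce the input e
```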
Formula $\tilde{h}_W^{(1)}$ ($r = 4$, $S = \{C, C_1, T, T_1\}$) is based on the assumption that the citation rate of papers (cited at least once) follows a shifted-geometric distribution (SGD) with parameter $Q$ ($Q > 1$) and probability function $p(y) = Q^{-y}(Q - 1)^{y-1}$, $y = 1, 2, \ldots$. Here $p(y)$ represents the probability of observing the number of citations $y$ of a paper (cited at least once), while $Q$ represents the expectation of the SGD. Then, $\hat{n}(y) = T p(y)$ expresses the "expected"/estimated number of articles with $y$ citations.
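As a short check of the statement that $Q$ is the expectation of the SGD, one can rewrite the probability function as a geometric series. The ratio $r = (Q-1)/Q$ is a symbol introduced here only for the derivation.

```latex
\begin{align*}
p(y) &= Q^{-y}(Q-1)^{y-1} = \frac{1}{Q}\, r^{\,y-1},
      \qquad r = \frac{Q-1}{Q} \in (0,1), \quad 1 - r = \frac{1}{Q}, \\
\sum_{y \ge 1} p(y) &= \frac{1}{Q}\cdot\frac{1}{1-r} = 1, \\
\sum_{y \ge 1} y\, p(y) &= \frac{1}{Q}\cdot\frac{1}{(1-r)^{2}} = Q .
\end{align*}
```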
As an alternative approach, an important class of models is the one defined by the formula
$$h = c_0 C^{2/3} T^{-1/3}, \qquad (4)$$
where $c_0$ is a fixed and known positive constant (Schubert and Glänzel 2007). Within model (4), specific ready-to-use formulas are obtained by taking, in particular: (a) $c_0 = 4^{-1/3}$ (Iglesias and Pecharroman 2007; see also Ionescu and Chopard 2013; Panaretos and Malesios 2009; Vinkler 2009, 2013), (b) $c_0 = 0.75$ (Schubert and Glänzel 2007), (c) $c_0 = 1$ (Prathap 2010a, b). Following the notation of Bertoli-Barsotti and Lando (2017), let $h_{SG}(c_0) = c_0 C^{2/3} T^{-1/3}$. Note that these formulas are functions of the data only through two out of the four basic statistics ($r = 2$, $S = \{C, T\}$), and they are based on the assumption of a continuous-type distribution. The formula $h_{SG}(1)$ is also known as the "p-index" (Prathap 2010a, b).
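The two simplest families of estimators discussed so far, $\sqrt{C}/2$ and $c_0 C^{2/3} T^{-1/3}$, translate directly into code. The sketch below is illustrative only; the function names `h_redner` and `h_sg` and the input values are hypothetical.

```python
import math


def h_redner(C: float) -> float:
    """Redner-type proxy: sqrt(C) / 2."""
    return math.sqrt(C) / 2.0


def h_sg(C: float, T: float, c0: float) -> float:
    """Schubert-Glaenzel-type proxy: c0 * C^(2/3) * T^(-1/3)."""
    return c0 * C ** (2.0 / 3.0) * T ** (-1.0 / 3.0)


C, T = 1200, 150                       # hypothetical totals for one journal
for c0 in (4 ** (-1.0 / 3.0), 0.75, 1.0):
    print(f"h_SG({c0:.2f}) = {h_sg(C, T, c0):.2f}")
print(f"h_R = {h_redner(C):.2f}")
```

Both functions return real numbers; rounding to an integer is a separate step.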
Another approach which deserves mention for completeness, even if it does not yield a ready-to-use formula, is that proposed by Iglesias and Pecharroman (2007). Adopting a different perspective, i.e. the rank-size formulation, and starting from the assumption that the number $c(k)$ of citations of the paper of rank $k$ approximately follows a stretched exponential type PDF
$$f(k; g, b) = \frac{C}{g^{1/b}\,\Gamma(1 + b^{-1})} \exp\left(-\frac{k^{b}}{g}\right), \qquad k > 0,$$
(not to be confused with a Weibull PDF, see below), Iglesias and Pecharroman suggest deriving a formula for the h-index as the solution of the equation $c(h) = h$. Interestingly, the solution may be derived in closed form (even if the authors did not realize this) by means of the Lambert-W function. Unfortunately, this solution still depends on the value of an unknown free parameter, specifically $b$ [see their Eqs. (16) and (17)]. Hence, their formula could become a genuine estimator of the h-index, of the form $\hat{h} = \hat{h}(C, T, T_1)$, $r = 3$, only by constraining the unknown parameter $b$ to assume a fixed (but arbitrary) value $b_0$.
A new formula for the h-index under the Weibull assumption
Let $N(y)$ be the empirical citation distribution function, i.e. the function giving the number of papers which have been cited $y$ times at most. Then, in particular, $n(y) = N(y) - N(y-1)$, for $y = 1, 2, \ldots$, with $n(0) = N(0)$, is the number of papers that have been cited exactly $y$ times. We assume that the citation rate of a paper is a random variable $X$ that follows a two-parameter Weibull distribution, with CDF $F(x; a, b) = 1 - \exp(-a x^b)$ for $x > 0$, and 0 otherwise, where $a > 0$ and $b > 0$. The probability density function is then
$$f(x; a, b) = a b x^{b-1} \exp(-a x^b)$$
for $x > 0$, and 0 otherwise. The Weibull distribution is a rather flexible model: the PDF is reverse J-shaped for $b \le 1$ and bell-shaped otherwise.
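The Weibull CDF and PDF in this parametrization can be written down directly, which also makes the two shapes easy to inspect. This is a minimal sketch; the function names and the parameter values are illustrative assumptions.

```python
import math


def weibull_cdf(x: float, a: float, b: float) -> float:
    """F(x; a, b) = 1 - exp(-a * x^b) for x > 0, and 0 otherwise."""
    return 1.0 - math.exp(-a * x ** b) if x > 0 else 0.0


def weibull_pdf(x: float, a: float, b: float) -> float:
    """f(x; a, b) = a * b * x^(b-1) * exp(-a * x^b) for x > 0, and 0 otherwise."""
    return a * b * x ** (b - 1.0) * math.exp(-a * x ** b) if x > 0 else 0.0


a = 0.3   # hypothetical parameter value
# b <= 1: the density decreases monotonically (reverse J-shape).
print([round(weibull_pdf(x, a, 0.8), 4) for x in (0.5, 1.0, 2.0, 4.0)])
# b > 1: the density rises to a mode and then falls (bell shape).
print([round(weibull_pdf(x, a, 2.0), 4) for x in (0.1, 1.3, 4.0)])
```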
Since our assumption involves a continuous distribution, a suitable discretization rule is needed. In particular, for every $y$, $y = 0, 1, 2, \ldots$, let $T\exp(-a y^b)$ express the "expected" number of papers with at least $y$ citations. Then
$$\hat{n}(y) = T\int_y^{y+1} f(x; a, b)\,dx = T\,(F(y+1; a, b) - F(y; a, b))$$
represents the expected number of articles with exactly $y$ citations, and $\hat{N}(y) = T F(y+1; a, b)$ the expected number of papers which have been cited $y$ times at most. As a special case,
$$F(1; a, b) - F(0; a, b) = 1 - e^{-a}$$
can be interpreted as a model for the so-called uncitedness factor, $(T - T_1)/T = n(0)/T$ (Hsu and Huang 2012; see also Egghe 2013; Burrell 2013). A Weibull model for the h-index is then yielded by the solution of the equation
$$T\exp(-a x^b) = x.$$
Replacing $a x^b$ with $t$ in the equation, we have
$$t e^{bt} = a T^b.$$
Thus, replacing $bt$ with $s$, we obtain the equivalent equation
$$s e^{s} = a b T^b.$$
Hence, by definition of the above mentioned Lambert-W function, we find the solution $s = W(a b T^b)$ and, since $x = (s/(ab))^{1/b}$, we finally arrive at the formula
$$h = \left(\frac{W(a b T^b)}{a b}\right)^{1/b}. \qquad (12)$$
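The closed form can be checked numerically: if the model h-index is the solution $x$ of $T\exp(-a x^b) = x$, then $x = (W(abT^b)/(ab))^{1/b}$ should satisfy that equation up to rounding error. The sketch below assumes exactly this reading of the model; the helper `_lambert_w` (a plain Newton iteration for non-negative arguments), the function name `h_weibull` and the parameter values are all hypothetical.

```python
import math


def _lambert_w(z: float) -> float:
    """W(z) for z >= 0 via Newton's method on w * exp(w) = z."""
    w = math.log1p(z)                     # starts at or above the root
    for _ in range(100):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= 1e-13 * (1.0 + abs(w)):
            break
    return w


def h_weibull(T: float, a: float, b: float) -> float:
    """Solution x of T * exp(-a * x^b) = x, written with the Lambert-W function."""
    s = _lambert_w(a * b * T ** b)        # s solves s * exp(s) = a * b * T^b
    return (s / (a * b)) ** (1.0 / b)


T, a, b = 663, 0.245, 0.8                 # illustrative values only
h = h_weibull(T, a, b)
print(h, T * math.exp(-a * h ** b))       # the two numbers should agree
```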
An empirical counterpart of the above theoretical model for the h-index may now be obtained by substituting the parameters $a$ and $b$ with estimates, $a^*$ and $b^*$, that depend on the citation data only through the basic statistics $C$, $C_1$, $T$ and $T_1$. This can be done firstly by using the uncitedness factor to derive the equation $1 - e^{-a} = (T - T_1)/T$, which can be solved (under the assumption $0 < T_1 < T$) for the variable $a$ as
$$a^* = \log(T/T_1),$$
as an estimate of parameter $a$, and secondly, by using the trimmed sample citation rate as an estimate of the expectation of $X$, that is
$$E(X) = g(a, b) = a^{-1/b}\,\Gamma(1 + 1/b), \qquad a, b > 0.$$
Note that, by construction, our approximation slightly overestimates the true average number of citations, so that a correction for continuity by one-half is needed. We then find $b^*$ as the solution (method of moments) of the equation
$$g(a^*, b) = m, \qquad (15)$$
where $m$ is the trimmed sample citation rate with the continuity correction applied. The equation can be solved numerically. It should be noted that the existence and uniqueness of the solution of Eq. (15) are not always warranted a priori. Indeed, it can be proved that the necessary and sufficient condition for existence and uniqueness of the solution is $m > 1$ (see "Appendix"). We should then consider "out of range" the cases where $m \le 1$, and exclude them from the analysis.
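The two estimation steps can be sketched as follows, assuming that Eq. (15) has the form $g(a^*, b) = m$ with $g(a, b) = a^{-1/b}\Gamma(1 + 1/b)$, as the existence condition $m > 1$ suggests. The value of $m$ is taken as an input here, since its exact definition is not restated in this sketch; the function names, the bracketing interval and the bisection on a logarithmic scale are choices made for illustration.

```python
import math


def estimate_a(T: float, T1: float) -> float:
    """Solve 1 - exp(-a) = (T - T1) / T for a, i.e. a = log(T / T1)."""
    if not 0 < T1 < T:
        raise ValueError("requires 0 < T1 < T")
    return math.log(T / T1)


def log_g(a: float, b: float) -> float:
    """log of g(a, b) = a^(-1/b) * Gamma(1 + 1/b); logs avoid overflow for small b."""
    return -math.log(a) / b + math.lgamma(1.0 + 1.0 / b)


def estimate_b(a: float, m: float, lo: float = 1e-3, hi: float = 1e3) -> float:
    """Find b with g(a, b) = m by bisection; m <= 1 is treated as out of range."""
    if m <= 1:
        raise ValueError("out of range: a unique solution requires m > 1")
    target = math.log(m)
    if not (log_g(a, lo) > target > log_g(a, hi)):
        raise ValueError("root not bracketed by [lo, hi]")
    for _ in range(200):
        mid = math.sqrt(lo * hi)          # midpoint on a logarithmic scale
        if log_g(a, mid) > target:
            lo = mid                      # g is still above m: root lies to the right
        else:
            hi = mid
    return math.sqrt(lo * hi)


a_star = estimate_a(T=663, T1=519)
m = 8.0                                   # hypothetical value of m, for illustration
b_star = estimate_b(a_star, m)
print(a_star, b_star, math.exp(log_g(a_star, b_star)))   # last value should equal m
```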
With $a$ and $b$ replaced by $a^* = a^*(T, T_1)$ and $b^* = b^*(C, C_1, T)$ in formula (12) one finally obtains ($r = 4$, $S = \{C, C_1, T, T_1\}$)
$$\hat{h} = h_{WW} = \left(\frac{W(a^* b^* T^{b^*})}{a^* b^*}\right)^{1/b^*},$$
where the suffix WW is motivated by the fact that the formula is based on a Weibull distribution and on the Lambert-W function.
This section empirically investigates the effectiveness of formula $h_{WW}$ as an estimate of the actual value of the h-index, $h$. We will compare estimates derived from $h_{WW}$ with the real values of the h-index. In order to facilitate possible comparisons with other formulas (see below), we choose to use the same two datasets as in Bertoli-Barsotti and Lando (2017), where the authors present an empirical study based on citation data obtained from two different sets of journals belonging to two different scientific fields: (1) the S&MM list and (2) the EE&F list.
S&MM list. The former dataset includes the 231 journals selected from a former list of 568 journals identified as important (in the opinion of a group of experts) in the area "Statistics and Mathematical Methods" (S&MM). Overall, the S&MM dataset included 485,628 citations of 99,409 publications from these journals (for details see Bertoli-Barsotti and Lando 2017). For each journal, the actual value $h$ of the h-index was computed, on the basis of citations retrieved from the Scopus database in the last week of December 2015, as the largest number $h$ of papers published in the journal between 2010 and 2014 which obtained at least $h$ citations each, from the time of publication until December 2015. Thus, citation data referred to a 6-year citation window, 2010–2015, and a 5-year publication window, 2010–2014. The four basic statistics $C$, $C_1$, $T$ and $T_1$ were derived as well. The list of the 231 journals in the S&MM dataset is reported in Table 1.
EE&F list. The second dataset included the 100 journals (with a minimum number of 50 publications) top ranked according to the Scopus Impact per Publication (IPP; the IPP is defined as the ratio of citations in a year to papers published in the three previous years divided by the number of papers published in those same years) in 2014, within the Scopus subject area of "Economics, Econometrics and Finance" (EE&F). The citation data of all 100 journals in the EE&F list were retrieved during the last week of April 2016. The dataset obtained included 19,889 publications receiving a total of 74,096 citations. In this case, unlike the above dataset, in order to obtain citation and publication windows as similar as possible to those employed for the computation of the IPP 2014 by Scopus, the citations used were those received during 2014 by papers published within the previous 3 years, 2011–2013 (for further details see Bertoli-Barsotti and Lando 2017). For each journal the actual value $h$ of the h-index was then computed as the largest number $h$ of papers published in the journal between 2011 and 2013 which obtained at least $h$ citations each in the year 2014. The list of the journals in the EE&F dataset is reported in Table 2.
Estimation of the h-index with the formula hWW
The h-index is then estimated by the rounded-off value $\lfloor h_{WW} + 0.5 \rfloor$, where $\lfloor \cdot \rfloor$ is the floor function (recall that the floor function of $x$ gives the greatest integer less than or equal to $x$). Note that, from an operational point of view, all estimating formulas (1) generate real numbers. However, for estimation purposes, these numbers should be rounded off to the nearest integer, not only in order to produce numbers in the same range of values as the h-index but also to avoid "false precision" (Hicks et al. 2015).
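When implementing the rounding step, it is worth noting that rounding by "floor of $x + 0.5$" is not the same as Python's built-in `round`, which sends exact halves to the nearest even integer. A minimal sketch follows; the helper name `round_half_up` is hypothetical.

```python
import math


def round_half_up(x: float) -> int:
    """Round to the nearest integer as floor(x + 0.5); halves go upwards."""
    return math.floor(x + 0.5)


for value in (2.4, 2.5, 3.5, 36.16):
    # The built-in round() uses round-half-to-even, so it differs at 2.5.
    print(value, round_half_up(value), round(value))
```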
To give an example illustrating the calculation of this estimate, let us consider the case of the Journal of the American Statistical Association (ISSN 0162-1459, from the S&MM list). We have $C = 5231$, $C_1 = 156$, $T = 663$ and $T_1 = 519$. Hence
which yields the solution b
we finally conclude that
A comparative analysis of the accuracy
To verify the accuracy of formula $h_{WW}$ comparatively, we considered, among several possible ready-to-use formulas, the following ones among those defined above: $\tilde{h}_W^{(1)}$, $h_{SG}(0.63)$, $h_{SG}(0.75)$, $h_{SG}(1)$ and $h_R$. To measure the magnitude of the observed accuracy, for each of the six estimation formulas, respectively numbered as (1) $h_{WW}$, (2) $\tilde{h}_W^{(1)}$, (3) $h_{SG}(0.63)$, (4) $h_{SG}(0.75)$, (5) $h_{SG}(1)$, (6) $h_R$, we calculated the absolute relative error (ARE) of the estimator $\hat{h}_j^{(i)}$ of the actual h-index, $h_j$, for each journal $j$, $j = 1, \ldots, J$,
$$ARE_j^{(i)} = \frac{|\hat{h}_j^{(i)} - h_j|}{h_j},$$
where $\hat{h}_j^{(i)} = \lfloor h_j^{(i)} + 0.5 \rfloor$ is the rounded-off version of formula $i$, $i = 1, 2, \ldots, 6$. As a criterion with which to assess the overall quality of the formula, we computed the mean absolute relative error (MARE),
$$MARE(\hat{h}^{(i)}) = \frac{1}{J}\sum_{j=1}^{J} ARE_j^{(i)}.$$
The results are summarized in Table 3.
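The accuracy criterion can be sketched in code, assuming the usual definition of the absolute relative error, $|\hat{h}_j - h_j|/h_j$, applied to the rounded-off estimates and averaged over journals. The function names and the sample numbers below are hypothetical and are not taken from Table 3.

```python
import math
from typing import List, Sequence


def absolute_relative_errors(raw_estimates: Sequence[float],
                             actual: Sequence[int]) -> List[float]:
    """ARE per journal, computed on estimates rounded as floor(x + 0.5)."""
    return [abs(math.floor(est + 0.5) - h) / h
            for est, h in zip(raw_estimates, actual)]


def mare(raw_estimates: Sequence[float], actual: Sequence[int]) -> float:
    """Mean absolute relative error over all journals."""
    errors = absolute_relative_errors(raw_estimates, actual)
    return sum(errors) / len(errors)


raw = [9.4, 21.6, 5.2]      # hypothetical real-valued outputs of one formula
true_h = [10, 20, 5]        # hypothetical actual h-index values
print(absolute_relative_errors(raw, true_h))
print(mare(raw, true_h))    # multiply by 100 to express it as a percentage
```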
This paper has addressed the need to gain a better understanding of how simple citation metrics are related to the h-index, or rather, to a "good" proxy representation of the h-index. This also responds to the more basic requirement of "building bridges" between different types of known and available measures of impact/impact indicators under IIC. Unlike other studies (that consider the problem of defining a "model" of the h-index), our concern has not been to estimate the parameters (sometimes even considered at the unit level, i.e. single journal, or single scientist; see e.g. Petersen et al. 2011) of a parametric model for the h-index under the assumption of knowing the entire citation pattern; rather, we addressed the quite different and more practical problem of finding a proxy representation of $h$ through a universal formula that depends only on a few summary statistics of the data. The formula $h_{WW}$ is "universal" in the sense that it gives a proxy representation of $h$ that holds for any given journal and any dataset.
The issue of determining an indicator under IIC is closely related to the search for a solution of the problem of recovering and comparing impact indicators from different databases. As a simple but significant example of this issue, we may cite the specific problem of determining/estimating the IF for journals using the Google Scholar-based h-index as a predictor (Bertocchi et al. 2015).
As confirmed in our case study analysis, the h-index can be viewed as an almost-exact function of $C$, $C_1$, $T$ and $T_1$, through $h_{WW}$; that is, the basic statistics $C$, $C_1$, $T$ and $T_1$ provide salient information for the evaluation of the h-index with high precision. In practice, while computation of the h-index $h$ requires knowledge of the entire citation profile (or at least a large part of it, e.g. the so-called h-core), formula $h_{WW}$ requires knowledge of only a few elementary summary statistics, but reproduces the actual value of $h$ quite well. In truth, in our computations we found that the estimates yielded by $h_{WW}$ were slightly biased downwards for quite high values of the h-index but, as can be seen from Table 3, overall the formula $h_{WW}$ yields very accurate approximations to the empirical value of the h-index, with values of the MARE ranging around 5–6%, not too dissimilar from those obtained by formula $\tilde{h}_W^{(1)}$ (Bertoli-Barsotti and Lando 2017). Both formulas $\tilde{h}_W^{(1)}$ and $h_{WW}$ exhibit comparable levels of accuracy (the advantages of the formula $\tilde{h}_W^{(1)}$, as compared to formula $h_{WW}$, may be that: (i) it yields an explicit expression in the basic indicators $C$, $C_1$, $T$ and $T_1$, while the latter does not, and (ii) it is based on a simpler probabilistic model). Even though the Pearson correlation, $\rho$, is not an adequate measure of the accuracy of the estimation and should not be used to compare the effectiveness of the different estimators considered (and this is the reason why this concept has been banished from this study), for the sake of completeness we point out that: (1) for the S&MM dataset (230
$\rho(h, h_{WW})$
$\rho(h, \tilde{h}_W^{(1)})$
$\rho(h, h_{SG}) = 0.98$
$\rho(h, h_R) = 0.96$; (2) for the EE&F dataset we found $\rho(h, h_{WW}) = 0.97$, $\rho(h, \tilde{h}_W^{(1)})$
$\rho(h, h_{SG}) = 0.97$ and $\rho(h, h_R) = 0.90$. Ultimately, despite the differences between the
Acknowledgements Funding was provided by Czech Science Foundation (Grant No. 17-23411Y).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution,
and reproduction in any medium, provided you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license, and indicate if changes were made.
Conditions for existence and uniqueness of a solution of Eq. (15)
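For every fixed $a = a^* > 0$, $g(a^*, b) \to +\infty$ as $b \to 0$ and $g(a^*, b) \to 1$ as $b \to +\infty$. Since
$$\frac{\partial}{\partial b} g(a^*, b) = \frac{g(a^*, b)}{b^2}\left(\log a^* - \psi(1 + 1/b)\right),$$
where $\psi$ is the digamma function, i.e. the function defined by $\psi(z) = \frac{d}{dz}\log\Gamma(z)$ (see Johnson et al. 2005, pp. 8–9), we find that the inequality $\psi(1 + 1/b) > \log a^*$ holds if and only if
$$\frac{\partial}{\partial b} g(a^*, b) < 0.$$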
C0ð1Þ ffi 0:5772), at þ1.
b [ 0 if and
sign from negative to positive at $b = b_0$, for some $b_0 > 0$; hence $g(a^*, b)$ is strictly decreasing for every $0 < b < b_0$, and strictly increasing for every $b > b_0$, and the point $b_0$ is a global minimum for $g(a^*, b)$. Moreover since, as seen before, $\lim_{b \to +\infty} g(a^*, b) = 1$, then $0 < g(a^*, b_0) < 1$, and the limit at infinity is approached from below. We conclude that, in this case too, Eq. (15) has a unique solution if and only if $m > 1$; conversely, if $m \le 1$ Eq. (15) may have two solutions, or no solution at all.
In both cases (a) and (b), Eq. (15) has one and only one solution if and only if $m > 1$.
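For readers who wish to retrace the derivative used in this appendix, the following worked step assumes $g(a^*, b) = (a^*)^{-1/b}\,\Gamma(1 + 1/b)$ and differentiates its logarithm with respect to $b$:

```latex
\begin{align*}
\log g(a^*, b) &= -\frac{1}{b}\log a^* + \log\Gamma\!\left(1 + \frac{1}{b}\right), \\
\frac{\partial}{\partial b}\log g(a^*, b)
  &= \frac{1}{b^{2}}\log a^* - \frac{1}{b^{2}}\,\psi\!\left(1 + \frac{1}{b}\right), \\
\frac{\partial}{\partial b} g(a^*, b)
  &= \frac{g(a^*, b)}{b^{2}}\left(\log a^* - \psi\!\left(1 + \frac{1}{b}\right)\right).
\end{align*}
```

Since $g(a^*, b)/b^2 > 0$, the sign of the derivative is the sign of $\log a^* - \psi(1 + 1/b)$.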
Bertocchi, G., Gambardella, A., Jappelli, T., Nappi, C. A., & Peracchi, F. (2015). Bibliometric evaluation vs. informed peer review: Evidence from Italy. Research Policy, 44, 451–466.
Bertoli-Barsotti, L., & Lando, T. (2017). A theoretical model of the relationship between the h-index and other simple citation indicators. Scientometrics, 111(3), 1415–1448.
Burrell, Q. L. (2013). A stochastic approach to the relation between the impact factor and the uncitedness factor. Journal of Informetrics, 7, 676–682.
Corless, R. M., & Jeffrey, D. J. (2015). The Lambert W Function. In N. J. Higham, M. Dennis, P. Glendinning, P. Martin, F. Santosa, & J. Tanner (Eds.), The Princeton companion to applied mathematics (pp. 151–155). Princeton: Princeton University Press.
Egghe, L. (2013). The functional relation between the impact factor and the uncitedness factor revisited. Journal of Informetrics, 7, 183–189.
Glänzel, W. (2006). On the h-index - A mathematical approach to a new measure of publication activity and citation impact. Scientometrics, 67, 315–321.
Hicks, D., Wouters, P., Waltman, L., De Rijcke, S., & Rafols, I. (2015). The Leiden Manifesto for research metrics. Nature, 520(7548), 429.
Hirsch, J. E. (2005). An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences, 102, 16569–16572.
Hsu, J. W., & Huang, D. W. (2012). A scaling between impact factor and uncitedness. Physica A, 391, 2129–2134.
Iglesias, J., & Pecharroman, C. (2007). Scaling the h-index for different scientific ISI fields. Scientometrics, 73, 303–320.
Ionescu, G., & Chopard, B. (2013). An agent-based model for the bibliometric h-index. The European Physical Journal B, 86, 426.
Johnson, N. L., Kemp, A. W., & Kotz, S. (2005). Univariate discrete distributions. New York: Wiley.
Malesios, C. (2015). Some variations on the standard theoretical models for the h-index: A comparative analysis. Journal of the Association for Information Science and Technology, 66, 2384–2388.
Panaretos, J., & Malesios, C. (2009). Assessing scientific research performance and impact with single indices. Scientometrics, 81, 635–670.
Petersen, A. M., Stanley, H. E., & Succi, S. (2011). Statistical regularities in the rank-citation profile of scientists. Scientific Reports, 1, 181.
Prathap, G. (2010a). Is there a place for a mock h-index? Scientometrics, 84, 153–165.
Prathap, G. (2010b). The 100 most prolific economists using the p-index. Scientometrics, 84, 167–172.
R Development Core Team. (2012). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. http://www.R-project.org.
Redner, S. (2010). On the meaning of the h-index. Journal of Statistical Mechanics: Theory and Experiment, 2010(03), L03005.
Schreiber, M., Malesios, C. C., & Psarakis, S. (2012). Exploratory factor analysis for the Hirsch index, 17 h-type variants, and some traditional bibliometric indicators. Journal of Informetrics, 6, 347–358.
Schubert, A., & Glänzel, W. (2007). A systematic analysis of Hirsch-type indices for journals. Journal of Informetrics, 1, 179–184.
Vinkler, P. (2009). The p-index: A new indicator for assessing scientific impact. Journal of Information Science, 35, 602–612.
Vinkler, P. (2013). Quantity and impact through a single indicator. Journal of the American Society for Information Science and Technology, 64, 1084–1085.
Wolfram Research, Inc. (2014). Mathematica 10.0. Champaign, IL: Wolfram Research, Inc.