Quantifying the sensitivity of oscillation experiments to the neutrino mass ordering
Oskar Klein Centre for Cosmoparticle Physics, Department of Physics, Stockholm University
, SE-10691 Stockholm,
Center for Neutrino Physics
, Virginia Tech, Blacksburg,
VA 24061, U.S.A
Max-Planck-Institut fur Kernphysik
, Saupfercheckweg 1,
69117 Heidelberg, Germany
Department of Theoretical Physics, School of Engineering Sciences, KTH Royal Institute of Technology, AlbaNova University Center
106 91 Stockholm, Sweden
Determining the type of the neutrino mass ordering (normal versus inverted) is one of the most important open questions in neutrino physics. In this paper we clarify the statistical interpretation of sensitivity calculations for this measurement. We employ standard frequentist methods of hypothesis testing in order to precisely define terms like the median sensitivity of an experiment. We consider a test statistic T which in a certain limit will be normal distributed. We show that the median sensitivity in this limit is very close to standard sensitivities based on 2 values from a data set without statistical fluctuations, such as widely used in the literature. Furthermore, we perform an explicit Monte Carlo simulation of the INO, JUNO, LBNE, NOA, and PINGU experiments in order to verify the validity of the Gaussian limit, and provide a comparison of the expected sensitivities for those experiments.
1 Introduction 2 3 4
Terminology and statistical methods
2.1 Frequentist hypothesis testing
2.2 Application to the neutrino mass ordering
2.3 Median sensitivity or the sensitivity of an average experiment
3.1 Simple hypotheses
3.2 Composite hypotheses
Monte Carlo simulations of experimental setups
4.1 Medium-baseline reactor experiment: JUNO
4.2 Atmospheric neutrinos: PINGU and INO
4.3 Long-baseline appearance experiments: NOA and LBNE
Comparison between facilities: future prospects
Discussion and summary
A The distribution of T
B Simulation details
B.1 Medium baseline reactor experiment: JUNO
B.2 Atmospheric neutrino experiments: PINGU and INO
B.3 Long baseline beam experiments: NOA, LBNE-10 kt, LBNE-34 kt
The ordering of neutrinos masses constitutes one of the major open issues in particle
physics. The mass ordering is called normal (inverted) if m231 m23 m21 is
positive (negative). Here and in the following we use the standard parameterization for the
neutrino mass states and PMNS lepton mixing matrix . Finding out which of these
two possibilities is realized in Nature has profound implications for the flavor puzzle, as
well as phenomenological consequences for cosmology, searches for neutrino mass, and for
neutrinoless double-beta decay. Therefore, the determination of the mass ordering is one
of the experimental priorities in the field. In particular, with the discovery of a large value
of 13  an answer within a decade or so is certainly possible and first hints may be
obtained even sooner in global fits to the worlds neutrino data.
New information is expected to come from long-baseline experiments, like T2K  and
NOA [7, 8], which look for the appearance of e(e) in a beam of (). Proposals for
a more long-term time frame include LBNE , LBNO , a superbeam based on
the ESS , and eventually a neutrino factory . Matter effects  will induce
characteristic differences between the neutrino and antineutrino channels, which in turn
will allow inference of the mass ordering, see e.g., refs. [18, 19] for early references. The
fact that a comparison of neutrino and antineutrino channels is performed also implies that
the leptonic CP phase cannot be ignored and has to be included in the analysis as well.
A selective set of recent sensitivity studies for present and future proposed long baseline
oscillation experiments can be found in refs. .
Another possibility to determine the mass ordering arises from observing the energy
and zenith angle dependence of atmospheric neutrinos in the GeV range, which will also
have the mass ordering information imprinted by matter effects . The flux of
atmospheric neutrinos follows a steep power law with energy and thus the flux in the GeV range
is quite small and requires very large detectors. IceCube technology can be adapted to
neutrino energies in the GeV range by reducing the spacing of optical modules, eventually
leading to the PINGU extension  and a similar low-energy modification can also be
implemented for neutrino telescopes in the open ocean, called ORCA . Another way
to overcome the small neutrino flux is to separate neutrino and antineutrino events using
a magnetic field like in the ICal@INO experiment [40, 41] (INO for short in the following).
Mass ordering sensitivity calculations have been performed for instance in refs.  for
PINGU/ORCA and in refs.  for INO or similar setups.
Finally, the interference effects between the oscillations driven by m221 and m231
in the disappearance of e provide a third potential avenue for this measurement. In
particular, this approach has been put forward in the context of reactor neutrinos .
JUNO [61, 62] will comprise a 20 kt detector at a baseline of about 52 km of several
nuclear reactors. A similar project is also discussed within the RENO collaboration .
The possibility to use a precision measurement of the e survival probability at a nuclear
reactor to identify the neutrino mass ordering has been considered by a number of authors,
e.g., refs. [49, 62, 6474].
This impressive experimental (and phenomenological) effort has also resulted in a
renewed interest in potential issues arising from the statistical interpretation of the resulting
data [75, 76] (see also ), which can be summarized as: given that the determination of
the mass ordering is essentially a binary yes-or-no type question, are the usual techniques
relying on a Taylor expansion around a single maximum of the likelihood applicable in this
case? The goal of this paper is to answer this question within a frequentist framework
for a wide range of experimental situations, including disappearance as well as
appearance measurements. The answer we find in this paper can be stated succinctly as: the
currently accepted methods yield approximately the expected frequentist coverage for the
median experimental outcome; quantitative corrections typically lead to a (slightly)
increased sensitivity compared to the standard approach prediction. The methods applied
in the following are analogous to the ones from ref. , where similar questions have been
addressed for the discovery of 13 and CP violation. In the present work we strictly adhere
to frequentist methods; Bayesian statistics is used to address the neutrino mass ordering
question in ref. , see also refs. [75, 76] for Bayesian considerations.
The outline of our paper is as follows. We first review the principles of hypothesis
testing in a frequentist framework in section 2, apply them to the case of the mass ordering,
define the sensitivity of the median experiment and discuss the relation to the standard
sensitivity based on 2 values from the Asimov data set. In section 3 we consider the
Gaussian limit, where all relevant quantities, such as sensitivities can be expressed
analytically. Details of the derivation can be found in appendix A, as well as a discussion
of conditions under which the Gaussian approximation is expected to hold. In section 4
we present results from Monte Carlo simulations of the INO, PINGU, JUNO, NOA, and
LBNE experiments. The technical details regarding the simulations are summarized in
appendix B. We show that for most cases the Gaussian approximation is justified to good
accuracy, with the largest deviations observed for NOA. In section 5 we present a
comparison between the sensitivities expected for the different proposals, illustrating how the
sensitivities may evolve with date. We summarize in section 6, where we also provide a
table which allows to translate the traditional performance indicator for the mass
ordering (2 without statistical fluctuations) into well defined frequentist sensitivity measures
under the Gaussian approximation. We also comment briefly on how our results compare
to those in refs. [75, 76].
Terminology and statistical methods
Frequentist hypothesis testing
Let us start by reviewing the principles of frequentist hypothesis testing, see e.g., ref. .
First we consider the case of so-called simple hypotheses, where the hypothesis we want
to test, H, as well as the alternative hypothesis H0 do not depend on any free parameters.
H is conventionally called null hypothesis. In order to test whether data can reject the
null hypothesis H we have to choose a test statistic T . A test statistic is a stochastic
variable depending on the data which is chosen in such a way that the more extreme the
outcome is considered to be, the larger (or smaller) the value of the test statistic is. Once
the distribution of T is known under the assumption of H being true, we decide to reject
H at confidence level (CL) 1 if the observation is within the most extreme results,
i.e., if T > Tc, where the critical value Tc is defined by
with p(T |H) being the probability distribution function of T given that H is true. The
probability is the probability of making an error of the first kind (or type-I error
rate), i.e., rejecting H although it is true. It is custom to convert this probability into a
number of Gaussian standard deviations. In this work we will adopt the convention to use
a double-sided Gaussian test for this conversion, such that a hypothesis is rejected if the
data is more than n away (on either side) from the mean. This leads to the following
dx ex2/2 = erfc
This definition implies that we identify, for instance, 1, 2, 3 with a CL (1) of 68.27%,
95.45%, 99.73%, respectively, which is a common convention in neutrino physics. However,
note that n is sometimes defined differently, as a one-sided Gaussian limit, see e.g., eq. (1)
of ref. . This leads to a different conversion between n and , namely
which would lead to a CL of 84.14%, 97.73%, 99.87% for 1, 2, 3.
In order to quantify how powerful a given test is for rejecting H at a given CL we have
to compute the so-called power of the test or, equivalently, the probability of making an
error of the second kind (or type-II error rate). This is the probability to accept H if
it is not true:
where now p(T |H0) is the probability distribution function of T assuming that the
alternative hypothesis H0 is true. Obviously, depends on the CL (1 ) at which we want to
reject H. A small value of means that the rate for an error of the second kind is small,
i.e., the power of the test (which is defined as 1 ) is large.
The case we are interested in here (neutrino mass ordering) is slightly different, since
both hypotheses (normal and inverted) may depend on additional parameters , a situation
which is called composite hypothesis testing. This is for instance the case of long baseline
oscillation experiments, where the value of has a large impact on the sensitivities to the
neutrino mass ordering. In this case the same approach is valid while keeping a few things
we must chose
The rate of an error of the second kind will now depend on the true parameters in
the alternative hypothesis:
1Note that we are using the complementary error function erfc(x) 1 erf(x).
2Here we assume that for given data the value of the observed test statistic T is independent of the true
parameter values. This is the case for the statistic T introduced in eq. (2.10), but it will not be true for
instance for the statistic T 0 mentioned in footnote 3.
with Tc defined in eq. (2.6). It is important to note that in a frequentist framework
this cannot be averaged in any way to give some sort of mean rejection power, as
this would require an assumption about the distribution of the parameters
implemented in Nature (which is only possible in a Bayesian analysis ). Sticking to
frequentist reasoning, we can either give as a function of the parameters in the
alternative hypothesis, or quote the highest and/or lowest possible values of within
the alternative hypothesis.
Application to the neutrino mass ordering
In the search for the neutrino mass ordering, we are faced with two different mutually
exclusive hypotheses, namely HNO for normal ordering and HIO for inverted ordering. Both
hypotheses will depend on the values of the oscillation parameters (which we collectively
denote by ) within the corresponding ordering. In particular, appearance experiments
depend crucially on the CP-violating phase . Hence, we have to deal with the situation of
composite hypothesis testing as described above. Let us now select a specific test statistic
for addressing this problem.
A common test statistic is the 2 with n degrees of freedom, which describes the
deviation from the expected values of the outcome of a series of measurements xi of the
normal distributions N (i, i):
The further the observations are from the expected values, i.e., the more extreme the
outcome, the larger is the 2. If the mean values i depend on a set of p parameters
whose values have to be estimated from the data one usually considers the minimum of the
2 with respect to the parameters:
According to Wilks theorem  this quantity will follow a 2 distribution with n p
degrees of freedom, whereas 2() = 2() 2min will have a 2 distribution with p
degrees of freedom. The 2 distributions have known properties, and in physics we often
encounter situations where data can be well described by this method and the conditions
for Wilks theorem to hold are sufficiently fulfilled, even when individual data points are
not strictly normal distributed. In general, however, it is not guaranteed and the actual
distribution of those test statistics has to be verified by Monte Carlo simulations .
Coming now to the problem of identifying the neutrino mass ordering, one needs to
select a test statistic which is well suited to distinguish between the two hypotheses HNO
and HIO. Here we will focus on the following test statistic, which is based on a log-likelihood
ratio and has been used in the past in the literature:
JUNO, 4320 kt GW yr, 3% E-resol.
Figure 1. Left: distribution of the test statistic T for our default configuration of the JUNO reactor
experiment discussed in section 4.1. Histograms show the results of the MC simulation based on
105 simulated experiments and black curves correspond to the Gaussian approximation discussed in
section 3. Right: the value of as a function of the critical value Tc required for rejecting inverted
(blue) and normal (red) ordering for the JUNO reactor experiment. In the purple region both mass
orderings are rejected at the CL (1 ), in the white region both orderings are consistent with data
at the CL (1 ). The dashed lines in both panels indicate Tc for = 0.01 for both orderings.
The dotted lines indicate the crossing point TcNO = TcIO. The dot-dashed line in the right panel
shows an example (for = 0.1) in which Tc,IO < Tc,NO.
where is the set of neutrino oscillation parameters which are confined to a given mass
ordering during the minimization. Let us stress that the choice of T is not unique. In
principle one is free to chose any test statistic, although some will provide more powerful
tests than others.3
It is important to note that within a frequentist approach, rejecting one hypothesis at
a given does not automatically imply that the other hypothesis could not also be rejected
using the same data. Instead, the only statement we can make is to either reject an ordering
or not. The value of T = 0 therefore does not a priori play a crucial role in the analysis. Let
us illustrate this point at an example. In the left panel of figure 1, we show the distributions
of the test statistics T for both mass orderings obtained from the simulation of a particular
configuration of the JUNO reactor experiment. Experimental details will be discussed later
in section 4.1. In the right panel we show the corresponding critical values Tc for testing
both orderings and how they depend on the chosen confidence level 1 . The curves for
3In the case of simple hypotheses the Neyman Pearson lemma  implies that the test based on the
likelihood ratio is most powerful. For composite hypotheses in general no unique most powerful test is
known. An alternative choice for a test statistic could be for instance the statistic T 0() = 2()
min, where min is the absolute minimum including minimization over the two mass orderings, and
generically denotes the (continuous) oscillation parameters. This statistic is based on parameter estimation
and amounts to testing whether a parameter range for remains at a given CL in a given mass ordering.
We have checked by explicit Monte Carlo simulations that typically the distribution of T 0 is close to a 2
distribution with number of d.o.f. corresponding to the non-minimized parameters in the first term (the
approximation is excellent for JUNO but somewhat worse for LBL experiments). Sensitivity results for the
mass ordering based on T 0 will be reported elsewhere.
testing the different orderings cross around = 5.2%, indicated by the dotted lines. This
represents the unique confidence level for which the experiment in question will rule out
exactly one of the orderings, regardless of the experimental outcome. If, for instance, we
would choose to test whether either ordering can be rejected at a confidence level of 90%,
then there is a possibility of an experimental outcome T with Tc0,.I1O < T < Tc0,.N1O, implying
that both orderings could be rejected at the 90% CL. This situation is indicated by the
dash-dotted line in the right panel of figure 1 and applies to the purple region. Thus, in
order to claim a discovery of the mass ordering, it will not be sufficient to test one of the
orderings. If both orderings were rejected at high confidence, it would mean either having
obtained a very unlikely statistical fluctuation, underestimating the experimental errors,
or neither ordering being a good description due to some new physics. Conversely, if we
would choose = 0.01 < 0.052 (dashed line in both panels, white region in right panel),
then there is the possibility of obtaining Tc0,.N01O < T < Tc0,.I0O1, meaning that neither ordering
can be excluded at the 99% CL.
The CL corresponding to the crossing condition Tc,NO = Tc,IO provides a possible
sensitivity measure of a given experiment. We will refer to it as crossing sensitivity
below.4 If Tc,NO Tc,IO (as it is the case for the example shown in figure 1), this is
equivalent to testing the sign of T . This test has been discussed also in ref. [75, 76]. From
the definition of the sensitivity of an average experiment which we are going to give in
the next subsection it will be clear that the crossing sensitivity is rather different from
the median sensitivity, which is typically what is intended by sensitivity in the existing
literature. It should also be noted that the critical values for the different orderings, as well
as the crossing of the critical values, in general are not symmetric with respect to T = 0.
The fact that figure 1 appears to be close to symmetric is a feature of the particular
experiment as well of the test statistic T . This would not be the case for instance for the
statistic T 0 mentioned in footnote 3. Finally, note that figure 1 is only concerned with
the critical value of T and its dependence on . As such, it does not tell us anything
about the probability of actually rejecting, for instance, inverted ordering if the normal
ordering would be the true one (power of the test). As discussed above, this probability
will typically also depend on the actual parameters within the alternative ordering and can
therefore not be given a particular value. However, for the crossing point of the critical
values, the rejection power for the other ordering is at least 1 .
Median sensitivity or the sensitivity of an average experiment
Let us elaborate on how to compare such a statistical analysis to previous sensitivity
estimates massively employed in the literature, in particular in the context of long-baseline
oscillation experiments. The most common performance indicator used for the mass
order4In the case of composite hypotheses, where the distribution of T depends on the true values of some
parameters (e.g., the CP phase in the case of long-baseline experiments), we define Tc,NO and Tc,IO in
analogy to eq. (2.6), i.e., we chose the largest or smallest value of Tc(), depending on the mass ordering.
Hence, the crossing sensitivity is independent of the true values of the parameters.
ing determination is given by
T0NO(0) = min X [iNO(0) iIO()]2
IO i i2
for testing normal ordering, with an analogous definition for inverted ordering. This
quantity corresponds to the test statistic T defined in eq. (2.10) but the data xi are replaced
by the predicted observables i(0) at true parameter values 0. Since no statistical
fluctuations are included in this definition it is implicitly assumed that it is representative for
an average experiment. (This is sometimes referred to as the Asimov data set , and
T0 is sometimes denoted as 2 .) T0 is then evaluated assuming a 2 distribution
with 1 dof in order to quote a CL with which a given mass ordering can be identified.
In the following, we will refer to this as the standard method or standard sensitivity.
Note that T0 by itself is not a statistic, since it does not depend on any random data. The
interpretation of assigning a 2 distribution to it is not well defined, and is motivated by
the intuition based on nested hypothesis testing (which is not applicable for the mass
ordering question). In the following we show that actually the relevant limiting distribution
for T (but not for T0) is Gaussian, not 2.
The formalism described in section 2 allows a more precise definition of an average
experiment. One possibility is to calculate the CL (1 ) at which a false hypothesis can
be rejected with a probability of 50%, i.e., with a rate for an error of the second kind of
= 0.5. In other words, the CL (1 ) for = 0.5 is the CL at which an experiment
will reject the wrong mass ordering with a probability of 50%. We will call the probability
( = 0.5) the sensitivity of an average experiment or median sensitivity. This is the
definition we are going to use in the following for comparing our simulations of the various
experiments to the corresponding sensitivities derived from the standard method.
Let us note that the median sensitivity defined in this way is not the only relevant
quantity in order to design an experiment, since in practice one would like to be more certain
than 50% for being able to reject a wrong hypothesis. Under the Gaussian approximation
to be discussed in the next section it is easy to calculate the sensitivity for any desired
, once the median sensitivity is known.
The Gaussian case for the test statistic T
A crucial point in evaluating a statistical test is to know the distribution of the test statistic.
In general this has to be estimated by explicit Monte Carlo simulations, an exercise which
we are going to report on for a number of experiments later in this paper. However, under
certain conditions the distribution of the statistic T defined in eq. (2.10) can be derived
analytically and corresponds to a normal distribution :
T = N (T0, 2pT0) ,
where N (, ) denotes the normal distribution with mean and standard deviation and
the + () sign holds for true NO (IO).5 In general T0NO and T IO may depend on model
5Note that T0NO and T0IO are always defined to be positive according to eq. (2.11), while T can take both
signs, see eq. (2.10).
parameters . In that case the distribution of T will depend on the true parameter values
and we have to consider the rules for composite hypothesis testing as outlined in section 2.
We provide a derivation of eq. (3.1) in appendix A, where we also discuss the conditions
that need to be fulfilled for this to hold in some detail. In addition to assumptions similar
to the ones necessary for Wilks theorem to hold, eq. (3.1) applies if
we are dealing with simple hypotheses, or consider composite hypotheses at fixed
parameter values, or
if close to the respective 2 minima the two hypotheses depend on the parameters
in the same way (a precise definition is given via eq. (A.21) in the appendix), or
if T0 is large compared to the number of relevant parameters of the hypotheses.
Let us now study the properties of the hypothesis test for the mass ordering based on the
statistic T under the assumption that it indeed follows a normal distribution as in eq. (3.1).
First we consider simple hypotheses, i.e., T0 does not depend on any free parameters. As we
shall see below, this situation applies with good accuracy to the medium-baseline reactor
For definiteness we construct a test for HNO; the one for HIO is obtained analogously.
Since large values of the test statistic favor HNO over the alternative hypothesis HIO, we
swuocuhldthraetjePct(THN<OTfcor) =too simfaHllNOvaliusecsororfecTt.. SHienncceeH,wNOe npereeddictots fiTnd=aNc(rTit0iNcOal, 2vqaluTe0NOTc),
The critical values Tc as a function of T0 are shown for several values of in the upper
left panel of figure 2. The labels in the left panel of the figure in units of are based on
our default convention based on the 2-sided Gaussian, eq. (2.2).
Let us now compute the power p of the test, i.e., the probability p with which we can
reject HNO at the CL (1 ) if the alternative hypothesis HIO is true. As mentioned above,
p is related to the rate for an error of the second kind, , since p = 1 . This probability
is given by = P (T > Tc) for true IO, where Tc is given in eq. (3.2). If HIO is true we
have T = N (T0IO, 2qT0IO) and hence
where the last approximation assumes T0 T0NO T0IO, a situation we are going to
encounter for instance in the case of JUNO below. We shown p = 1 as a function of
T0 for several values of in the lower left panel of figure 2.
Figure 2. Gaussian approximation for the test statistics T . Left upper panel: critical values for
rejecting normal ordering as a function of T0, see eq. (3.2), for different values of as labeled in the
plot. Left lower panel: power of the test as a function of T0 for different values of , see eq. (3.3).
Right panel: power of the test (left vertical axis) and the rate for an error of the second kind (right
vertical axis) versus the CL (1 ) for rejecting a given mass ordering for different values of T0 as
labeled in the plot. The vertical lines indicate the number of standard deviations, where we have
used our standard convention eq. (2.2) based on a 2-sided Gaussian for the solid lines and eq. (2.3)
based on a 1-sided Gaussian limit for the dashed lines. The dash-dotted red curve indicates = ,
which follows in the Gaussian case from the condition TcNO = TcIO.
Using our standard convention eq. (2.2) to convert into standard deviations the median
sensitivity is n, with
n = 2 erfc1 1 erfc
We show n(T0) in figure 3. This curve corresponds to a section of the lower left panel
(or right panel) of figure 2 at p = 0.5. The green and yellow shaded bands indicate
the CL at which we expect being able to reject NO if IO is true with a probability of
68.27% and 95.45%, respectively. The edges of the bands are obtained from the conditions
= 1/2 0.6827/2 and = 1/2 0.9545/2, respectively. They indicate the range of
obtained rejection confidence levels which emerge from values of T within 1 and 2 from
its mean assuming true IO.
Note that if we had used the 1-sided Gaussian rule from eq. (2.3) to convert the
probability eq. (3.4) we would have obtained n = T0 for the median sensitivity. Indeed,
this corresponds exactly to the standard sensitivity as defined in section 2.3.6 We show
this case for illustration as dashed curve in figure 3. The dashed vertical lines in the right
panel of figure 2 show explicitly that using this convention we obtain = 0.5 at n exactly
for T0 = n2. Note that with our default convention we actually obtain an increase in the
sensitivity compared to T0 used in the standard method. The exponential nature of
erfc implies that the difference will not be large, in particular for large T0, see figure 3. For
instance, the values of T0 corresponding to a median sensitivity of 2, 3, 4 according to
eq. (3.5) are 2.86, 7.74, 14.7, respectively, which should be compared to the standard case
of T0 = n2, i.e., 4, 9, 16. In summary, we obtain the first important result of this paper:
6We would have obtained the result n = T0 also when using a 2-sided test to calculate from the
distribution of T combined with the 2-sided convention to convert it into standard deviations. Note,
however, that for the purpose of rejecting a hypothesis clearly a 1-sided test for T should be used, and
therefore we do not consider this possibility further.
The corresponding sensitivity is shown as red solid curve in figure 3. For this curve we
use our default convention to convert into according to eq. (2.2). If we instead had
used the 1-sided Gaussian convention from eq. (2.3) to convert the probability eq. (3.6)
we would have obtained the simple rule n = T0/2 (dashed red curve). This can be seen
also in the right panel of figure 2, where the red dash-dotted curve indicates the condition
= . For a given T0 the probability for TcNO = T IO can be read off from the section
of the corresponding blue curve with the red curve. By considering the dashed vertical
lines we observe the rule n = T0/2 from the 1-sided conversion of into n. For our
default conversion it turns out that the sensitivity from the condition TcNO = TcIO is always
more than half of the median sensitivity in units of . From the 68.27% and 95.45% bands
in figure 3 one can see that for a typical experimental outcome the sensitivity will be
significantly better than the one given by the crossing condition.
Let us now generalize the discussion to the case where T0 depends on parameters. This
will be typically the situation for long-baseline experiments, where event rates depend
significantly on the (unknown) value of the CP phase . It is straight forward to apply the
rules discussed in section 2 assuming that T = N (T0NO(), 2qT0NO()) for normal ordering
and T = N (T0IO(), 2qT0IO()) for inverted ordering.
First we must ensure that we can reject NO for all possible values of at (1 )
confidence. Hence, eq. (3.2) becomes,
i.e., we have to choose the smallest possible Tc. Considering Tc from eq. (3.2) as a
function of T0, we see that Tc has a minimum at T0 = 2[erfc1(2)]2, and the value at the
minimum is 2[erfc1(2)]2. This minimum is also visible in figure 2 (upper left panel).
Hence, we have
where TNO is the minimum of T0NO() with respect to the parameters .
The expression for the rate for an error of the second kind, eq. (3.3) will now depend
on the true values of in the alternative hypothesis:
() = 21 erfc . (3.9)
The median sensitivity is obtained by setting () = 0.5. This leads to the equation
T0IO() = (Tc)min which has to be solved for . Note that this is a recursive definition,
since which case in eq. (3.8) to be used can only be decided after is computed. However, it
turns out that in situations of interest the first case applies. In this case we have T0IO() =
2[erfc1(2)]2. Typically it also holds that TIO TNO and therefore TNO < T IO() and
0 0 0 0
TNO < 2[erfc1(2)]2 for corresponding to the median sensitivity. Hence, we obtain the
8 qTT00NNOO ++ Tq0ITO0IO 21 erfc 21 s T20
(TcNO = TcIO) ,
where the last relation holds for T0 T0NO T0IO, which again is very similar to the case
for simple hypotheses, eq. (3.6).
Monte Carlo simulations of experimental setups
Let us now apply the methods presented above to realistic experimental configurations.
We have performed Monte Carlo (MC) studies to determine the sensitivity to the neutrino
mass ordering for three different types of experiments, each of which obtains their
sensitivity through the observation of different phenomena: (a) JUNO : interference (in the
vacuum regime) between the solar and atmospheric oscillation amplitudes at a medium
baseline reactor neutrino oscillation experiment; (b) PINGU  and INO : matter
effects in atmospheric neutrino oscillations; (c) NOA  and LBNE : matter effects
in a long baseline neutrino beam experiment. In each case we have followed closely the
information given in the respective proposals or design reports, and we adopted bench mark
setups which under same assumptions reproduce standard sensitivities in the literature
reasonably well. The specific details that have been used to simulate each experiment are
summarized in appendix B.
7.3 104 (3.4) 4.3 104 (3.5) 1.0 102 (2.5) 7.5 103 (2.7)
Medium-baseline reactor experiment: JUNO
For the simulations in this paper we adopt an experimental configuration for the JUNO
reactor experiment based on refs. [61, 62, 83], following the analysis described in ref. . A
20 kt liquid scintillator detector is considered at a distance of approximately 52 km from 10
reactors with a total power of 36 GW, with an exposure of 6 years, i.e., 4320 kt GW yr. The
energy resolution is assumed to be 3%p1 MeV/E. For further details see appendix B.1.
The unique feature of this setup is that the sensitivity to the mass ordering is rather
insensitive to the true values of the oscillation parameters within their uncertainties. Being
a e disappearance experiment, the survival probability depends neither on 23 nor on the
CP phase , and all the other oscillation parameters are known (or will be known at the
time of the data analysis of the experiment) with sufficient precision such that the mass
ordering sensitivity is barely affected. Therefore we are effectively very close to the situation
of simple hypotheses for this setup. Note that although the mass ordering sensitivity is
insensitive to the true values, the 2 minimization with respect to oscillation parameters,
especially |m231|, is crucial when calculating the value of the test statistic T .
In the left panel of figure 1 we show the distribution of the test statistic T from a Monte
Carlo simulation of 105 data sets for our default JUNO configuration. For each true mass
ordering we compare those results to the normal distributions expected under the Gaussian
approximation, namely N (T0NO, 2qT0NO) for normal ordering and N (T0IO, 2qT0IO) for
inverted ordering, where T0NO and T0IO are the values of the test statistic without statistical
fluctuation (Asimov data set). For the considered setup we find T0NO = 10.1 and T0IO =
11.1, and we observe excellent agreement of the Gaussian approximation with the Monte
Carlo simulation, see also, e.g., ref. .
Hence we can apply the formalism developed in section 3 directly to evaluate the
sensitivity of the experiment in terms of T0NO and T0IO. For instance, eq. (3.4) gives for
the median sensitivity = 7.3 (4.3) 104 for testing normal (inverted) ordering, which
corresponds to 3.4 (3.5). As discussed in section 3 those numbers are rather close to
the standard sensitivity based on n = T0, which would give 3.2 (3.3). For the given
values of T0NO and T0IO we can now use figure 2 to obtain the probability to reject an
ordering if it is false (i.e., the power of the test) for any desired confidence level (1 ).
T0NO (med. sens.) T0IO (med. sens.)
Table 2. Main characteristics of our default setups for INO and PINGU. We give energy resolutions
for neutrino energy and direction reconstruction and default exposure. For PINGU we assume an
energy dependent effective detector mass. The last two columns give the value of T0 and the median
sensitivity using eq. (3.5) for the two orderings, assuming 23 = 45.
The confidence level at which exactly one mass ordering can be rejected (crossing point
TcNO = TcIO) is obtained from eq. (3.6) as = 5.2% or 1.9, see also figure 1. Those
numbers are summarized in table 1. There we give also the corresponding results for the
same setup but with a slightly worse energy resolution of 3.5%p1 MeV/E, in which case
significantly reduced sensitivities are obtained, highlighting once more the importance to
achieve excellent energy reconstruction abilities. We have checked that also in this case
the distribution of T is very close to the Gaussian approximation.
Atmospheric neutrinos: PINGU and INO
We now move to atmospheric neutrino experiments, which try to determine the mass
ordering by looking for the imprint of the matter resonance in the angular and energy
distribution of neutrino induced muons. The resonance will occur for neutrinos (antineutrinos)
in the case of normal (inverted) ordering. The INO experiment  uses a magnetized
iron calorimeter which is able to separate neutrino and antineutrino induced events with
high efficiency, which provides sensitivity to the mass ordering with an exposure of around
500 kt yr (10 year operation of a 50 kt detector). Alternatively, the PINGU  experiment,
being a low-energy extension of the IceCube detector, is not able to separate neutrino and
antineutrino induced muons on an event-by-event basis. This leads to a dilution of the
effect of changing the mass ordering, which has to be compensated by exposures exceeding
10 Mt yr, which can be achieved for a few years of running time. In both cases the ability to
reconstruct neutrino energy and direction will be crucial to determining the mass ordering.
Our simulations for the INO and PINGU experiments are based on refs.  and ,
respectively. We summarize the main characteristics of our default setups in table 2, further
technical details and references are given in appendix B.2. Let us stress that the sensitivity
of this type of experiments crucially depends on experimental parameters such a systematic
uncertainties, efficiencies, particle identification, and especially the ability to reconstruct
neutrino energy and direction. Those parameters are still not settled, in particular for
the PINGU experiment, and final sensitivities may vary by few sigmas, see for instance
refs. [38, 48]. Our setups should serve as representative examples in order to study the
statistical properties of the resulting sensitivities. While the final numerical answer will
depend strongly on to be defined experimental parameters, we do not expect that the
statistical behavior will be affected significantly.
In figures 4 and 5 we show the distributions of the test statistic T for the INO and
PINGU experiments, respectively, obtained from a sample of 104 simulated data sets for
Figure 4. Simulated distributions of the test statistic T in the INO experiment. We use our default
setup as defined in table 2 and assume 23 = 45. Solid curves show the Gaussian approximation
from eq. (3.1).
-30 -20 -10 T0 10 20 30 -30 -20 -10 T0 10 20 30 -30 -20 -10 T0 10 20 30
Figure 5. Simulated distributions of the test statistic T in the PINGU experiment with 23 =
40, 45, 50 for the left, middle, right panel, respectively. We use our default setup as defined in
table 2. Solid curves show the Gaussian approximation from eq. (3.1).
each mass ordering, using the default setups from table 2. We observe good agreement
with the Gaussian approximation (see also ref.  for a simulation in the context of
PINGU). Those results justify the use of the simple expressions from section 3 also for
INO and PINGU in order to calculate median sensitivities or rates for errors of the first
and second kind.
In figure 5 we illustrate the dependence of the distributions for PINGU on the true
value of 23. From this figure it is clear that the true value of 23 plays an important
role for the sensitivity to the mass ordering, with better sensitivity for large values of 23
(a similar dependence is holds also for INO, see, e.g., refs. [54, 58]). The dependence on
other parameters is rather weak (taking into account that, at the time of the experiment,
13 will be known even better than today). Let us discuss the 23 dependence in more
detail for the case of PINGU, where from now on we use the Gaussian approximation. The
problem arises when calculating the critical value for the test statistic T in order to reject
the null-hypothesis at a given CL. If we follow our rule for composite hypothesis, eq. (2.6),
and minimize (for NO) or maximize (for IO) Tc(23) over 23 in the range 35 to 55 we
obtain the black dashed curves in figure 6. This is equivalent to using eq. (3.10). The
chosen range for 23 corresponds roughly to the 3 range obtained from current data .
Figure 6. Median sensitivity for PINGU after 3 years data taking as a function of the true
value of 23. Left (right) panel shows a test for NO (IO), which means that the true ordering
is inverted (normal). For the thick black dashed curve we consider the range 35 < 23 < 55
for the true value of 23 when calculating the critical value for the test statistic (Tc), and the
thin dashed curves indicate the corresponding 68.27% and 95.45% probability ranges of obtained
rejection significances. For the blue solid curve and the corresponding green (68.27%) and yellow
(95.45%) probability bands we assume that 23 is known up to its octant when calculating Tc.
The dotted curves show the 68.27% and 95.45% probability ranges assuming that 23 including its
octant is known (simple hypothesis test).
However, this may be too conservative, since at the time of the experiment T2K and NOA
will provide a very accurate determination of sin2 223. Hence, 23 will be known with good
precision up to its octant, see for instance figure 5 of ref. . If we minimize (maximize)
Tc(23) only over the two discrete values 2tr3ue and 90 2tr3ue we obtain the blue solid
curves in figure 6. The green and yellow bands indicate the corresponding 68.27% and
95.45% probability ranges of expected rejection significances. The dotted curves show the
corresponding information but using only the true value of 23 when calculating Tc. This
last case corresponds to the ideal situation of perfectly knowing 23 (including its octant),
in which case NO and IO become simple hypotheses. The median sensitivity for known
23 is not shown in the figure for clarity, but it is very similar to the blue solid curves.
We obtain the pleasant result that all three methods give very similar values for the
median sensitivity, ranging from 2 at 23 ' 35 up to 5 (6) rejection of NO (IO) at
23 ' 55. Only for the NO test and 23 ' 50 we find that taking the octant degeneracy
into account leads to a larger spread of the 68.27% and 95.45% probability ranges for the
sensitivity, implying a higher risk of obtaining a rather weak rejection. Actually, this region
of parameter space (true IO and 23 > 45) is the only one where the octant degeneracy
severely affects the sensitivity to the mass ordering . Let us emphasize that the octant
degeneracy is always fully taken into account when minimizing the 2. Here we are instead
concerned with the dependence of the critical value Tc on 23.
Long-baseline neutrino beam experiments try to identify the neutrino mass ordering by
exploring the matter effect in the e appearance channel. Whether the resonance
Table 3. Main characteristics of the long baseline setups considered in this work. In both cases
the beam power is 700 kW. The NOA detector is a Totally Active Scintillator Detector (TASD),
while for LBNE a Liquid Argon (LAr) detector is considered.
occurs for neutrinos or for antineutrinos will determine the mass ordering. A crucial
feature in this case is that the appearance probability, and therefore also the event rates,
depend significantly on the unknown value of the CP phase . Most likely will remain
unknown even at the time the mass ordering measurement will be performed, and
therefore taking the dependence into account is essential. In the nomenclature of sections 2
and 3 we are dealing with composite hypothesis testing. In this work we consider three
representative experimental configurations to study the statistical properties of the mass
ordering sensitivity, namely NOA , LBNE-10 kt, and LBNE-34 kt , which provide
increasing sensitivity to the mass ordering. Tab 3 summarizes their main features, while
further details are given in appendix B.3.
Figures 7 and 8 show the probability distributions for the test statistic T defined in
eq. (2.10), for the NOA and LBNE-10 kt setups, respectively. The distributions are shown
for both mass orderings, and for different values of , as indicated in each panel. Our results
are based on a sample of 6 105 simulations for NOA and 4 105 for LBNE-10 kt per
value of , and we scan in steps of 10. As can be seen from the figures, both the
shape and mean of the distributions present large variations with the value of . From the
comparison between the two figures it is clear that the NOA experiment will achieve very
limited sensitivity to the mass ordering. On the other hand, for the LBNE-10 kt setup the
situation is much better: the overlapping region is reduced, and is only sizable for certain
combinations of values of in the two mass orderings.
We also note that for NOA there are clear deviations from the Gaussian shape for
the T distributions, while for the LBNE-10 kt experiment they are close to the Gaussian
approximation discussed in section 3, namely T = N (T0(), 2pT0()). For comparison,
in figure 8 the Gaussian approximation is overlaid on the histograms from the Monte Carlo.
Those results are in agreement with the considerations of appendix A. As discussed there,
one expects that the median of the T distribution should remain around T0, even if
corrections to the shape of the distribution are significant. We have checked that this does
indeed hold for NOA. Furthermore, assuming that there is only one relevant parameter
( in this case), eq. (A.24) implies that deviations from Gaussianity can be expected if
T0 1, which is the case for NOA, whereas for T0 1 (such as for LBNE) one expects
close to Gaussian distributions for T .
One can also notice in figures 7 and 8 that the shape of the distributions for a given
value of in one ordering is rather similar to the mirrored image of the distribution
corresponding to the other mass ordering and . The reason for this is the well-known fact
that the standard mass ordering sensitivity is symmetric between changing the true
orderFigure 7. The simulated distributions of the test statistic T in the NOA experiment for different
true values of , as indicated by the labels. The red (blue) distributions assume a true normal
Figure 9. The critical value Tc corresponding to 95% confidence level as a function of the
CPviolating phase for NOA (left panel) and LBNE-34 kt (right panel). The solid (dashed) lines
correspond to testing the normal (inverted) ordering. The red (blue) region corresponds to values
of T which would reject all parameter values in the normal (inverted) ordering and thereby reject
normal (inverted) ordering at 95% confidence level. In the white region, there are parameter values
in both orderings which are allowed, while in the purple region none of the two orderings would be
compatible with data at 95% CL.
ing and , i.e., T0NO() T0IO(), see e.g., figures 8 and 9 of ref.  and figure 4-13
of ref. .7 Furthermore, using the formalism in appendix A, in particular eq. (A.24),
one can show that also the deviations from the Gaussian distribution will obey the same
symmetry. Below we will show that despite the deviations from Gaussianity for NOA, the
final sensitivities obtained from the Monte Carlo will be surprisingly close to the Gaussian
expectation. As expected, this will be even more true for LBNE-10 kt.
Due to the strong dependence on the CP phase we need to choose the critical value
Tc such that the null hypothesis can be rejected at (1 ) CL for all possible values of ,
see discussion in sections 2 and 3.2. This is illustrated in figure 9, which is analogous to
figure 1 (right panel) for a fixed CL. The continuous (dashed) black curves in figure 9 show
the values of Tc that lead to the probability of 5% to find a smaller (larger) value of T under
the hypothesis of a true normal (inverted) ordering as a function of the true value of . The
left panel shows the result for NOA, while the right panel corresponds to LBNE-34 kt.
The number of data sets simulated for LBNE-34 kt in this case is 105 per value of , which
is again scanned in steps of 10. As discussed in section 2, a composite null hypothesis can
only be rejected if we can reject all parameter sets H. In our case, this would imply
rejecting the hypothesis for all values of . Therefore, in order to guarantee a CL equal
to (1 ), the most conservative value of Tc will have to be chosen. This automatically
defines two values Tc(NO) and Tc(IO), which are the values which guarantee that a given
hypothesis can be rejected at the 95% CL. These values will generally be different, and are
indicated in the figures by the arrows. In figure 9 we encounter the two situations already
7This can be understood by considering the expressions for the oscillation probabilities, taking into
account the fact that, if matter effects are sufficiently strong, the 2 minimum in the wrong ordering tends
to take place close to = /2.
Figure 10. Probability of accepting normal ordering if inverted ordering is true (i.e., rate for an
error of the second kind) as a function of the true in IO for the NOA (left panel) and
LBNE10 kt (right panel) experiments. The different curves correspond to tests at 1, 2, 3 confidence
level, as labeled in the plot. Furthermore the corresponding critical values Tc are given. The
horizontal dotted lines indicate the median experiment, = 0.5.
discussed in section 2 (cf. figure 1):
Tc(IO) > Tc(NO): this is the case of NOA, left panel. There is an intermediate
region (shown in white) in which none of the hypotheses would be rejected at (1
) CL. The reason why this intermediate region appears is because the experiment
is not sensitive enough to the observable we want to measure, and a measurement at
the chosen CL may not be reached.
Tc(IO) < Tc(NO): this is the case of LBNE-34 kt, right panel. There is an overlap
region (shown in purple) in which both hierarchies would be rejected at (1 ) CL.
A statistical fluctuation may bring the result of the experiment into this region,
although this would typically not be expected.
The intermediate case Tc(IO) = Tc(NO) would correspond to the crossing point
discussed in section 2, figure 1, which defines the CL at which exactly one of the hypotheses
can be excluded.
Let us now evaluate the rate for an error of the second kind corresponding to a given
value of . After the value of Tc is determined for a given hypothesis and , we can
compute the rate for an error of the second kind, , as a function of the true value of , as
discussed in section 2. We show this probability in figure 10 for the NOA and the
LBNE10 kt experiments in the left- and right-hand panels, respectively. To be explicit, we show
the probability of accepting normal ordering at 1, 2, 3 CL, i.e., = 32%, 4.55%, 0.27%,
(regardless of the value of in the NO) although the true ordering is inverted. This
probability depends on the true value of in the IO, which is shown on the horizontal axis.
By doing a cut at = 0.5 on the left-hand panel (indicated by the dotted line), we can get
an idea on the median sensitivity that will be obtained for NOA: for = 90 it will be
around 1, while for = 90 it will reach almost the 3 level. This seems to be roughly
consistent with the expected standard sensitivities usually reported in the literature, see for
instance ref. . Similarly, for LBNE-10 kt, we expect that the sensitivity for the median
experiment will be around 3 for = 90, while for other values of we expect it to be
much larger. This is also in agreement with the results from ref. , for instance.
Let us now investigate in detail how our median sensitivity compares to the standard
sensitivities widely used in the literature. In figure 11 the solid thick curves show the
results for the median sensitivity derived from full MC simulations. The shaded green and
yellow bands are analogous to those shown in figure 3, and show the range in the number
of sigmas with which we expect to be able to reject NO if IO is true in 68.27% and 95.45%
of the experiments, respectively. We also show how these results compare to the Gaussian
approximation discussed in section 3. The value of the 2 is computed without taking
statistical fluctuations into account (what is called T0 in section 2). We then use eq. (3.10)
to compute the confidence level (1 ) at which the normal ordering can be rejected with a
probability of 50% if the inverted ordering is true, as a function of the true value of in the
IO. Then, for the dot-dashed curves we use a 2-sided Gaussian to convert into number
of , i.e., eq. (2.2), the same prescription is also used for the MC result. We observe good
agreement, in particular for LBNE. This indicates that, for the high-statistics data from
LBNE, we are very close to the Gaussian limit, whereas from the smaller data sample (and
smaller values of T0) in NOA deviations are visible, but not dramatic. We also show the
results using a 1-sided Gaussian, eq. (2.3), to convert into number of sigmas, which leads
to n = T0, i.e., the standard sensitivity. This is shown by the dashed lines. As discussed
in section 2 we observe that the standard sensitivity slightly under-estimates the true
sensitivity.8 Finally, the dotted horizontal line in figure 11 corresponds to the significance
8Note that traditionally the standard sensitivity for IO denotes the case when IO is true and refers
Figure 12. The left (right) panel shows the median sensitivity in number of sigmas for rejecting the
IO (NO) if the NO (IO) is true for different facilities as a function of the date. The width of the bands
correspond to different true values of the CP phase for NOA and LBNE, different true values
of 23 between 40 and 50 for INO and PINGU, and energy resolution between 3%p1 MeV/E
and 3.5%p1 MeV/E for JUNO. For the long baseline experiments, the bands with solid (dashed)
contours correspond to a true value for 23 of 40 (50). In all cases, octant degeneracies are fully
of the crossing point TcNO = TcIO defined in section 2, i.e., the confidence level at which
exactly one hypothesis can be excluded regardless of the outcome of the experiment. The
results are independent of the value of , and guarantee that the rate for an error of the
second kind is at most equal to , unlike for the median experiment where = 0.5. The
results for the crossing point are also consistent with the Gaussian expectation eq. (3.11).
Comparison between facilities: future prospects
In this section we give a quantitative comparison between the different experiments that
have been considered in this paper. We do a careful simulation of all the facilities using
the details available in the literature from the different collaborations, see appendix B
for details. We have checked that our standard sensitivities are in good agreement with
the respective proposals or design reports. Nevertheless, we do not explore in which way
the assumptions made in the literature towards efficiencies, energy resolution, angular
resolution, systematics, etc may affect the results, with the only exception of JUNO, as we
explain below. Since we are mainly interested in the statistical method for determining the
mass ordering, such analysis is beyond the scope of this paper. Our results will be shown
as a function of the date, taking the starting points from the official statements of each
collaboration. Obviously, such projections always are subject to large uncertainties.
to the sensitivity to reject NO. In the language of the present paper we call this a test for NO. This
is also consistent with the formula in the Gaussian approximation, eq. (3.10), which contains T0IO when
considering a test for NO. This has to be taken into account when comparing e.g., figure 11 (corresponding
to a test for NO) to similar curves in the literature.
Figure 12 shows the median sensitivities for the various experiments, i.e., the number
of sigmas with which an average experiment for each facility can rejected a given mass
ordering if it is false. In some sense this is similar to the standard sensitivity of T0
commonly applied in the literature. A different question is answered in figure 13, namely:
what is the probability that the wrong mass ordering can be rejected at a confidence level
of 3? The confidence level has been chosen arbitrarily to 3, based on the convention that
this would correspond to evidence that the wrong ordering is false. Below we discuss
those plots in some detail.
In order to keep the number of MC simulations down to a feasible level, we use the
Gaussian approximation whenever it is reasonably justified. As we have shown in section 4,
this is indeed the case for PINGU, INO, and JUNO. With respect to the LBL experiments,
even though we have seen that the agreement with the Gaussian case is actually quite
good (see figure 11), there are still some deviations, in particular in the case of NOA.
Consequently, in this case we have decided to use the results from the full MC simulation
whenever possible. The results for the NOA experiment are always obtained using MC
simulations, while in the case of LBNE-10 kt the results from a full MC are used whenever
the number of simulations does not have to exceed 4 105 (per value of ). As was
mentioned in the caption of figure 11, this means that, in order to reach sensitivities
above 4 (for the median experiment), results from the full MC cannot be used. In
these cases, we will compute our results using the Gaussian approximation instead. As
mentioned in appendix A, the approximation is expected to be quite accurate precisely for
large values of T0. Finally, for LBNE-34 kt, all the results have to be computed using the
Gaussian approximation, since the median sensitivity for this experiment reaches the 4
bound already for one year of exposure only, even for the most unfavorable values of .
For each experiment, we have determined the parameter that has the largest impact on
the results, and we draw a band according to it to show the range of sensitivities that should
be expected in each case. Therefore, we want to stress that the meaning of each band may
be different, depending on the particular experiment that is considered. In the case of long
baseline experiments (NOA, LBNE-10 kt and LBNE-34 kt), the results mainly depend on
the value of the CP-violating phase . In this case, we do a composite hypothesis test as
described in sections 2 and 3.2, and we draw the edges of the band using the values of true
in the true ordering that give the worst and the best results for each setup. Nevertheless,
since for these experiments the impact due to the true value of 23 is also relevant, we show
two results, corresponding to values of 23 in the first and second octant. In all cases, the
octant degeneracy is fully searched for (see appendix B.3 for details). In the case of PINGU
and INO, the most relevant parameter is 23. We find that, depending on the combination
of true ordering and 23 the results will be very different. Therefore, in this case we also do
a composite hypothesis test, using 23 as an extra parameter. Finally, the case of JUNO
is somewhat different. In this case, the uncertainties on the oscillation parameters do not
have a big impact on the results. Instead, the energy resolution is the parameter which
is expected to have the greatest impact, see for instance ref.  for a detailed discussion.
Therefore, in this case the width of the band shows the change on the results when the
energy resolution is changed between 3%p1 MeV/E and 3.5%p1 MeV/E. For JUNO we
do a simple hypothesis test, as described in section 3.1.
The starting dates assumed for each experiment are: 2017 for INO , 2019 for
PINGU  and JUNO  and 2022 for LBNE . Note that the official running times
for PINGU and JUNO are 5 and 6 years, respectively. For illustrative purposes we extend
the time in the plots to 10 years, in order to see how sensitivities would evolve under the
adopted assumptions about systematics. For the NOA experiment, we assume that the
nominal luminosity will be achieved by 2014  and we consider 6 years of data taking
from that moment on.
From the comparison of figures 12 and 13 one can see that, even though the median
sensitivity for INO would stay below the 3 CL, there may be a sizable probability (up
to 40%) that a statistical fluctuation will bring the result up to 3. For NOA, such
probability could even go up to a 60%, depending on the combination of 23, and the true
mass ordering. In the case of LBNE, the dependence on the true value of is remarkable,
in particular for the power of the test. We clearly observe the superior performance of the
34 kt configuration over the 10 kt one. For 34 kt a 3 result can be obtained at very high
probability for all values of , and for some values of a much higher rejection significance
of the wrong ordering is achieved with high probability.
For the atmospheric neutrino experiments INO and PINGU we show the effect of
changing the true value of 23 from 40 to 50. The effect is particularly large for PINGU
and a true NO. As visible in figure 6, for NO the sensitivity changes significantly between
40 and 50, whereas for IO they happen to be similar, as reflected by the width of the
bands in figures 12 and 13. The reason for this behavior is that for true IO and 23 > 45
the mass ordering sensitivity is reduced due to the octant degeneracy . In the context
of PINGU, let us stress that the precise experimental properties (in particular the ability
to reconstruct neutrino energy and direction) are still very much under investigation .
While we consider our adopted configuration (see section 4.2 and appendix B.2 for details)
as a representative bench mark scenario, the real sensitivity may be easily different by
few standard deviations, once the actual reconstruction abilities and other experimental
parameters are identified. To lesser extent this applies also to INO.
Let us also mention that in this work we only consider the sensitivity of individual
experiments, and did not combine different setups. It has been pointed out in a number
of studies that the sensitivity can be significantly boosted in this way [48, 49, 58, 59, 85].
We also expect that in this case, if the combined T0 is sufficiently large, the Gaussian
approximation should hold. However, we stress that a detailed investigation of this question
is certainly worth pursuing in future work.
Discussion and summary
The sensitivity of a statistical test is quantified by reporting two numbers:
1. the confidence level (1 ) at which we want to reject a given hypothesis, which
corresponds to a rate for an error of the first kind, ; and
2. the probability p with which a hypothesis can be rejected at CL (1 ) if it is false
(the power of the test), which is related to the rate for an error of the second kind,
= 1 p .
In this work we have applied this standard approach to the determination of the type
of the neutrino mass ordering. With the help of those concepts it is straight forward
to quantify the sensitivity of a given experimental configuration aiming to answer this
important question in neutrino physics. We consider a test statistic T (see eq. (2.10)) in
order to perform the test, which is based on the ratio of the likelihood maxima under the
two hypotheses normal and inverted ordering. Under certain conditions, see appendix A,
the statistic T is normal distributed (Gaussian approximation) . In the limit of no
statistical fluctuations (Asimov data set) the test statistic T becomes the usual 2 (up to
a sign) massively used in the literature for sensitivity calculations. In this work we denote
this quantity by T0 (in ref.  it has been denoted by 2). The sensitivity of an average
experiment (in the frequentist sense) can be defined as the confidence level (1 ) at which
a given hypothesis can be rejected with a probability = 50% (median sensitivity). An
important result of our work is the following:
Table 4. Sensitivity measures for the neutrino mass ordering in the Gaussian approximation
assuming T0NO = T0IO. The columns show T0, the standard sensitivity n = T0, the median sensitivity
(eqs. (3.4), (3.5)), the crossing sensitivity where exactly one hypothesis is rejected (equivalent to
testing the sign of T , eq. (3.6)), the probability of accepting a mass ordering at the 3 CL although
it is false (rate for an error of the second kind, eq. (3.3)), and the range of rejection confidence levels
obtained with a probability of 68.27% and 95.45%. We convert CL into standard deviations using
a 2-sided Gaussian.
the median sensitivity is close to the n = T0 rule. The crossing sensitivity corresponds
to the CL at which exactly one of the two hypotheses can be rejected. This is similar to
testing the sign of the test statistic T , a test which has been discussed in ref.  and also
mentioned in ref. . By construction, this test gives smaller confidence levels than the
median sensitivity and is not necessarily connected to what would be expected from an
experiment. We give in the table also the probability for accepting a hypothesis at the 3
level although it is false (rate for an error of the second kind). The last two columns in
the table give the range of obtained rejection significance with a probability of 68.27% and
95.45% (assuming that the experiment would be repeated many times). Those are a few
examples of how to apply the equations from section 3. These sensitivity measures provide
different information and all serve to quantify the sensitivity of an experiment within a
frequentist framework. They can be compared to similar sensitivity measures given in
ref.  in a Bayesian context (see, e.g., their table IV).
In the second part of the paper we report on the results from Monte Carlo
simulations for several experimental setups which aim to address the neutrino mass ordering: the
medium-baseline reactor experiment JUNO, the atmospheric neutrino experiments INO
and PINGU, and the long-baseline beam experiments NOA and LBNE. In each case we
have checked by generating a large number of random data sets how well the Gaussian
approximation is satisfied. Our results indicate that the Gaussian approximation is
excellent for JUNO, INO, and PINGU. For NOA the T distributions deviate significantly from
Gaussian (strongly dependent on the true value of the CP phase ), however the Gaussian
expressions for the sensitivities still provide a fair approximation to the results of the Monte
Carlo. For LBNE the Gaussian approximation is again fulfilled reasonably well. This is
in agreement with our analytical considerations on the validity of the Gaussian
approximation given in appendix A, where we find that for experiments with T0 large compared
to the number of relevant parameters Gaussiantiy should hold. Hence, we expect that the
Gaussian approximation should hold to very good accuracy also for experiments with a
high sensitivity to the mass ordering, such as for instance a neutrino factory [14, 88, 89]
or the LBNO experiment , when explicit Monte Carlo simulations become exceedingly
unpractical due to the very large number of data sets needed in order to explore the high
In section 5 we provide a comparison of the sensitivities of the above mentioned facilities
using the statistical methods discussed in this paper. Figures 12 and 13 illustrate how the
median sensitivity and the probability to reject the wrong mass ordering at 3 CL for
the various experiments, respectively, could evolve as function of time based on official
statements of the collaborations. While this type of plots is subject to large error bars on
the time axis (typically asymmetric) as well as concerning actual experimental parameters,
our results indicate that it is likely that the wrong mass ordering will be excluded at 3
CL within the next 10 to 15 years.
We thank Walter Winter for comments on the PINGU sensitivity and Enrique
FernandezMartinez for useful discussions. This work was supported by the Goran Gustafsson
Foundation (M.B.) and by the U.S. Department of Energy under award number DE-SC0003915
(P.C. and P.H.). T.S. acknowledges partial support from the European Union FP7 ITN
INVISIBLES (Marie Curie Actions, PITN-GA-2011-289442).
The distribution of T
Consider N data points xi, and the two hypotheses, H and H0, and we want to test whether
one of them can be rejected by the data. The theoretical predictions for the observed data
under the two hypotheses are denoted by i and 0i, respectively. The prediction i (0i)
may depend on a set of P (P 0) parameters (0) which have to be estimated from the
data. For the case of the mass ordering we have P = P 0 and H and H0 depend on the
same set of parameters. However, here we want to be more general.
Under H the data xi will be distributed as N (i(0), i), where N (m, ) denotes the
normal distribution with mean m and variance 2 and 0 are the unknown true values of
the parameters. If H0 is true xi will be distributed as N (0i(00), i0). Once the experiment
has been performed one can build for each hypothesis a least-square function:
and similar for H0. Here are the parameters at the minimum, which will be different
for each hypothesis. In practice often the variances have to be estimated from the data
itself, e.g., i i0 xi. In the following we will assume i = i0. Let us note that
generalization to correlated data is straight forward. The test statistic T from eq. (2.10) is
then given by T = X2(H0) X2(H). In the following we will derive the distribution of T .
The distributions of X(H) and X(H0). Let us assume for definiteness that H is true.
First we consider X2(H), and we derive the well-known result, that X2(H) is distributed
Asymptotically the parameter values at the minimum will converge to the true values
. Therefore we assume
yi() = ni + X Bi( 0) . (A.6)
Here and in the following sums run over , = 1, . . . , P and i, j, k = 1, . . . , N if not
explicitly noted otherwise. Then the minimum condition eq. (A.4) becomes
as 2 with N P d.o.f. Let us define the variables
Under H, the yi(0) = ni are N standard normal distributed variables with N (0, 1). Then
we have X2(H) = min Pi[yi()]2. The minimum condition is
where wr Pi Vrini are N P independent variables distributed as N (0, 1). This shows
explicitly that if H is true, X2(H) is distributed as a 2 with N P d.o.f. .
and we obtain
X2(H) = X[yi()]2 = X ni2 X( 0)BiBi( 0 ) .
i i i
Now we diagonalize the symmetric P P matrix BT B = (Pi BiBi) with the orthogonal
matrix R as BT B = RT b2R with b = diag(b). Then eq. (A.7) can be written as
X bR( 0 ) = X Vini with Vi b1 X RBi
The matrix (Vi) defined in eq. (A.9) is a rectangular N P matrix which per construction
obeys the orthogonality condition Pi ViVi = . Hence, we can always complete it
by N P columns to a full orthogonal N N matrix such that Pk VikVjk = ij and
Pk VkiVkj = ij. Then we have
Let us now derive the distribution of X2(H0) under the assumption that H is true.
Again we define
however, now yi0 will not be standard normal distributed as N (0, 1), since per assumption
xi have mean i(0) (and not 0i). Nevertheless we can assume that yi0(0) can be expanded
around a fixed reference point , such that the minimum in the wrong hypothesis, 0,
converges asymptotically towards it. We write
Here ni are N (0, 1) as before, but n0i are N (mi, 1). Now the calculation proceeds as before
and we arrive at
where wr0 Pi Vr0in0i are now N P 0 independent normal variables with mean hwr0i =
Pi Vr0imi. Then X2(H0) has a so-called non-central 2 distribution with N P 0 d.o.f. and
a non-centrality parameter = PN
The distribution of the test statistic T . Let us now consider the test statistic T =
X2(H0) X2(H). Using eqs. (A.11) and (A.15) we find:
(mj + nj )
The first term in eq. (A.17) is just a constant, independent of the data. Using the definition
of mi in eq. (A.14) and comparing with eq. (A.11) one can see that this term is identical
to X2(H0) but replacing the data xi by the prediction for H at the true values:
This is nothing else than the usual 2 between the two hypotheses without statistical
fluctuations, compare eq. (2.11).
The second term in eq. (A.17) is a sum of N standard normal variables, Pi aini. This
gives a normal variable with variance P a2. It is easy to show from eq. (A.17) that the
variance is 4T0 and eq. (A.17) can be written as
with n standard normal. Hence, we find that if the term in eq. (A.18) can be neglected, T
is gaussian distributed with mean T0 and standard deviation 2T0 .
Consider now the term in eq. (A.18). Using hninj i = ij and the orthonormality of
V and V 0 we obtain that the mean value of this term is P P 0. Hence, if the number of
parameters in the two hypotheses is equal (as it is the case for the neutrino mass ordering)
the mean value of T remains T0 as in the Gaussian approximation, eq. (A.20). For testing
hypotheses with different numbers of parameters the mean value will be shifted from T0.
However, even if the mean value remains unaffected, the higher moments of the distribution
can still be modified. Under which conditions can the term in eq. (A.18) be neglected?
Obviously this term is absent if no parameters are estimated from the data, P =
P 0 = 0, i.e., for simple hypotheses. This applies in particular, if we compare the two
hypotheses for fixed parameters.
The term in eq. (A.18) will also vanish for
or V V T = V 0V 0T .
This condition has a geometrical interpretation. Consider the N dimensional space
of data. Varying P parameters the predictions i() describe a P dimensional
subspace in the N dimensional space. The operator V V T is a projection operator
into the tangential hyperplane to this subspace at the X2 minimum. This can be seen
by considering the definition of V in eq. (A.9) and of B in eq. (A.5), which show that
V is determined by the derivatives i/ at the minimum. Similar, V 0V 0T projects
into the P 0 dimensional tangential hyperplane at the minimum corresponding to H0.
Hence, the condition (A.21) means that the hyperplanes of the two hypotheses have
to be parallel at the minima. Obviously this condition can be satisfied only if the
dimensions of the hyperplanes are the same, i.e., P = P 0.
We note also that the condition (A.21) is invariant under a change of
parameterization, which amounts to B BS with S / being a P P orthogonal
matrix describing the variable transformation . Such a transformation would
just change the orthogonal matrix R, but leave the operator V V T invariant. Roughly
speaking we can say that sufficiently close to the respective minima 0 and , the two
hypotheses should depend on the parameters in the same way, where the precise
meaning is given by eq. (A.21).
Irrespective of the above conditions, we can neglect eq. (A.18) if its variance is much
smaller than the variance of the term in eq. (A.17), which is given by 4T0. Eq. (A.18)
is the difference of two 2 distributions with P and P 0 d.o.f., respectively. The 2n
distribution has a mean and variance of n and 2n, respectively. Hence, we should be
able to neglect this term if T0 P, P 0, i.e., for high sensitivity experiments.
Example with one parameter. To simplify the situation let us consider the case where
just one parameter is estimated from the data, for both H and H0. The matrix Vi defined
in eq. (A.9) becomes now just a normalized column vector
and similar for Vi0. The term in eq. (A.18) is now just the difference of the square of two
standard normal variables: n2 n02, with n = Pi Vini and n0 = Pi Vi0ni. As mentioned
above for the general case, we see that hn2 n02i = 0. The variance of this term is
h(n2 n02)2i = Xhninj nknli(ViVj Vi0Vj0)(VkVl Vk0Vl0) = 4 1
where we have used that hni i = 3. We can write (V T V 0)2 = Tr[V V T V 0V 0T ] = cos2 ,
where is the angle between the two hyperplanes (i.e., lines, in this case) for H and H0.
Hence we find that the variance is zero if |V T V 0| = 1, i.e., the lines are parallel. And we
have a measure to estimate when eq. (A.20) is valid, namely when the variance of eq. (A.18)
is small compared to the variance of the second term in eq. (A.17). In the example of one
parameter this means
1 (V T V 0)2 = sin2
Since sin2 1 we find that for T0 1 the gaussian approximation is expected to be
valid if only one parameter is estimated from data.
In the following, we describe the main details that have been used to simulate the
experimental setups considered in this work. Unless stated otherwise the true values for the
oscillation parameters have been set to the following values , and the 2 (or the test
statistic T ) has been minimized with respect to them by adding Gaussian penalty terms
to the 2 with the following 1 errors:
Unless otherwise stated, we assume the true value of 23 to be in the first octant.
Nevertheless, the region around /2 23 would not be disfavored by the penalty term
since it is added in terms of sin2 223 instead of 23. Therefore, we also look for compatible
solutions around /2 23 (the so-called octant degeneracy ) and keep the minimum
of the 2 between the two.
Medium baseline reactor experiment: JUNO
We adopt an experimental configuration for the JUNO experiment based on refs. [61, 62,
83], following the analysis described in ref. . We normalize the number of events such
that for the default exposure of 20 kt 36 GW 6 yr = 4320 kt GW yr we obtain 105
events [61, 83]. The energy resolution is assumed to be 3%p1 MeV/E. We perform a
2 analysis using 350 bins for the energy spectrum. This number is chosen sufficiently
large such that bins are smaller (or of the order of) the energy resolution. We take into
account an overall normalization uncertainty of 5% and a linear energy scale uncertainty
of 3%. Uncertainties in the oscillation parameters sin2 13 and sin2 12 are included as
pull parameters in the 2 using true values and uncertainties according to eq. (B.1), while
|m231| is left free when fitting the data. For this parameter a dense grid is computed and
the minimum is manually searched for. We have updated the analysis from ref.  by
taking into account the precise baseline distribution of 12 reactor cores as given in table 1
of ref.  (including also the Daya Bay reactors at 215 and 265 km). This reduces T0
by about 5 units compared to the idealized situation of a point-like source at 52.47 km
(the latter being the power averaged distance of the 10 reactors not including the Daya
Bay reactors). Adopting the same assumptions as in ref.  we find for a 4320 kt GW yr
exposure T0 11.8, which is in excellent agreement with their results, see red-dashed curve
in figure 2 (right) of ref. .
Our analysis ignores some possible challenges of the experiment, in particular the effect
of a non-linearity in the energy scale uncertainty , see also ref. [62, 74]. While such
issues have to be addressed in the actual analysis of the experiment, our analysis suffices
to discuss the behavior of the relevant test statistic and sensitivity measures.
Atmospheric neutrino experiments: PINGU and INO
For the simulation of the ICal@INO experiment we use the same code as in ref. , where
further technical details and references are given. Here we summarize our main
assumptions. We assume a muon threshold of 2 GeV and assume that muon charge identification
is perfect with an efficiency of 85% above that threshold. As stressed in refs. [53, 54] the
energy and direction reconstruction resolutions are crucial parameters for the sensitivity
to the mass ordering. We assume here the high resolution scenario from ref. , which
corresponds to a neutrino energy resolution of E = 0.1E and neutrino angular resolution
of = 10, independent of neutrino energy and zenith angle. More realistic resolutions
have been published in ref. . While those results are still preliminary, we take our
choice to be representative (maybe slightly optimistic), justified by the fact that we obtain
sensitivities to the mass ordering in good agreement with ref. . With our assumptions
we find 242 -like events per 50 kt yr exposure assuming no oscillations (sum of neutrino
and anti-neutrino events) in the zenith angle range 1 < cos < 0.1. We divide the
simulated data into 20 bins in reconstructed neutrino energy from 2 GeV to 10 GeV, as
well as 20 bins in reconstructed zenith angle from cos = 1 to cos = 0.1. We then
fit the two-dimensional event distribution in the 20 20 bins by using the appropriate
2definition for Poisson distributed data. Our default exposure for INO is a 50 kt detector
operated for 10 yr.
For the PINGU simulation we use the same code as in ref. , where technical details
can be found. In particular, we adopt the same effective detector mass as a function of
neutrino energy, with the threshold around 3 GeV, and the effective mass rises to about 4 Mt
at 10 GeV and 7 Mt at 35 GeV. For the reconstruction abilities we assume that neutrino
parameters are reconstructed with a resolution of E = 0.2E and = 0.5/pE /GeV.
This corresponds to about 13 (9) angular resolution at E = 5 GeV (10 GeV). We stress
that those resolutions (as well as other experimental parameters) are far from settled. With
our choice we obtain mass ordering sensitivities in good agreement with ref. , which
are somewhat more conservative than the official PINGU sensitivities from ref. . For a
3 yr exposure and 23 = 45 we obtain T0 7.5.
For both, INO and PINGU, we include the following systematic uncertainties: a 20%
uncertainty on the over-all normalization of events, and 5% on each of the
neutrino/antineutrino event ratio, the to e flux ratio, the zenith-angle dependence, and on the energy
dependence of the fluxes. Moreover, in order to make the Monte Carlo simulation feasible
we set m221 = 0, which implies that also 12 and the CP phase disappear from the
problem. The validity of this approximation and/or the expected size of -induced effects
has been studied for instance in refs. [48, 49, 58, 59]. Typically T0 varies by roughly 12
units as a function of , which is small compared to uncertainties related to experimental
parameters such as reconstruction abilities. We do not expect that and m221 related
effects will change the statistical behavior of the test statistic T significantly, as also the
results of ref.  seem to indicate.
The sensitivity of this type of experiments is largely dependent on the baseline and neutrino
energies considered, which may vary widely from one setup to another. In this work we
have studied three different setups, NOA, LBNE-10 kt, LBNE-34 kt.
The first setup considered, NOA [7, 8], has a moderate sensitivity to the mass
ordering, estimated to reach at most 3 (see for instance refs. [8, 91]). The setup consists of a
narrow band beam with neutrino energies around 2 GeV, aiming to a 13 kt Totally Active
Scintillator Detector (TASD) placed at a baseline of L = 810 km. NOA has recently
started taking data. The beam is expected to reach 700 kW by mid-2014 , and by
the end of its scheduled running time it will have accumulated a total of 3.6 1021 PoT,
equally split between + and focusing modes. The detector performance has been
simulated following refs. [8, 92]. Systematic errors are implemented as bin-to-bin correlated
normalization uncertainties over the signal and background rates. These have been set to
5% and 10% for the signal and background rates, respectively, for both appearance and
Table 5. Expected total event rates in the appearance channels for the long baseline setups
considered in this work. Efficiencies are already accounted for, and the values of the oscillation
parameters are set to the central values in eq. (B.1) and = 0.
The second setup considered in this work is the LBNE proposal [10, 11]. LBNE would
use a wide band beam with an energy around 23 GeV and a baseline of L = 1300 km. The
first phase of the project (dubbed in this work as LBNE-10 kt) consists of a 10 kt Liquid
Argon (LAr) detector placed on surface. In a second stage, dubbed in this work as
LBNE34 kt, the detector mass would be upgraded to 34 kt and placed underground. The longer
baseline and higher neutrino energies make this setup more sensitive to the mass ordering:
in its first stage is already expected to reach at least a significance between 2.5 7,
depending on the value of . The results also depend significantly on the assumptions
on systematics and the beam design, see for instance ref. . In this work, the detector
performance has been simulated according to ref. . Systematic uncertainties have been
set at the 5% level for both signal and background rates in the appearance channels, and
at the 5% (10%) for the signal (background) rates in the disappearance channels. Table 5
shows the expected total event rates in the appearance channels for each of the long baseline
setups considered in this work. It should be noted the difference in statistics between the
LBNE-10 kt and LBNE-34 kt, which is not only due to the larger detector mass but also to
a different neutrino beam design. The first stage of the project, LBNE-10 kt, is simulated
using the fluxes from the October 2012 Conceptual Design Report, ref. , while for the
upgraded version, LBNE-34 kt, we consider the fluxes from ref. . In both cases the beam
power is set to 700 kW.
The simulations for the long baseline beam experiments have been performed using
GLoBES [93, 94]. In order to generate random fluctuations in the number of events, version
1.3 of the MonteCUBES  software was used. In addition to the true values and prior
uncertainties for the oscillation parameters given in eq. (B.1), a 2% uncertainty on the
matter density is also considered.
Open Access. This article is distributed under the terms of the Creative Commons
Attribution License (CC-BY 4.0), which permits any use, distribution and reproduction in
any medium, provided the original author(s) and source are credited.
 A. Ghosh and S. Choubey, Measuring the mass hierarchy with muon and hadron events in
atmospheric neutrino experiments, JHEP 10 (2013) 174 [arXiv:1306.1423] [INSPIRE].