Likelihood inferences in animal breeding under selection: a missing-data theory view point
R.L. Fernando, D. Gianola
University of Illinois
Institut National de la Recherche Agronomique, Laboratoire de Biométrie, BP 27, 31326 Castanet-Tolosan, France
The Editorial Board here introduces a new kind of scientific report in the Journal, whereby a current field of research and debate is given emphasis, being the subject of an open discussion within these columns. As a first essay, we propose a discussion of a difficult and somewhat troublesome question in applied animal genetics: how can proper account be taken of the fact that the observed data are selected data? Several attempts have been made in the past 15 years, without any clear and unanimous solution. In the following, Im, Fernando and Gianola propose a general approach that should make it possible to deal with every problem. In addition to the interest of an original article, we hope that their own discussion and response to the comments given by Henderson and Thompson will provide the reader with a sound insight into this complex topic.
Likelihood-based methods of inference are developed in which the process that
induces the missing data, namely selection, is made explicit in the calculation
of the likelihood. The conditions under which selection can be ignored, so that
only the likelihood of the data actually collected need be considered, are discussed.
animal genetics - selection - missing data - likelihood
Data available in animal breeding often come from populations undergoing
selection. Several authors have considered methods for the proper treatment of data
subject to selection in animal breeding. Examples are Henderson et al. (1959), Curnow
(1961), Thompson (1973), Henderson (1975), Rothschild et al. (1979), Goffinet
(1983), Meyer and Thompson (1984), Fernando and Gianola (1989), and Schaeffer
(1987).
Data subject to selection can be viewed as data with missing values, selection
being the process that causes missing data. The statistical literature discusses
missing data that arise intentionally. Rubin (1976) has given a mathematically precise
treatment which encompasses frequentist approaches that are not based on
likelihoods as well as inferences from likelihoods (including maximum likelihood and
Bayesian approaches). Whether it is appropriate to ignore the process that causes
the missing data depends on the method of inference and on the process that causes
the missing values. Rubin (1976) suggested that in many practical problems,
inferences based on likelihoods are less sensitive than sampling distribution inferences to
the process that causes missing data. Goffinet (1987) gave alternative conditions to those
of Rubin (1976) for ignoring the process that causes missing data when making
sampling distribution inferences, with an application to animal breeding.
The objective of this paper is to consider inferences based on likelihoods derived
from statistical models for the data and the missing-data process, in analysis of
data from populations undergoing selection. As in Little and Rubin (1987), we
consider inferences based on likelihoods, in the sense described above, because
of their flexibility and avoidance of ad hoc methods. Assumptions underlying the
resulting methods can be displayed and evaluated, and large-sample estimates of
variances, based on second derivatives of the log-likelihood that takes the
missing-data process into account, can be obtained.
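The second-derivative calculation mentioned above can be illustrated with a minimal sketch. It is a toy case chosen for this illustration and not taken from the paper: a normal sample with unknown mean, known standard deviation and no missing data, so that the observed information can be checked against the known answer, sigma^2 / n. The function names and simulated values are hypothetical.

```python
import numpy as np

def log_likelihood(mu, y, sigma=1.0):
    """Log-likelihood of a normal sample with unknown mean mu and known sd sigma."""
    n = y.size
    return -0.5 * np.sum((y - mu) ** 2) / sigma**2 - n * np.log(sigma * np.sqrt(2.0 * np.pi))

def observed_information(loglik, theta_hat, h=1e-3):
    """Minus the second derivative of loglik at theta_hat (central finite difference)."""
    second = (loglik(theta_hat + h) - 2.0 * loglik(theta_hat) + loglik(theta_hat - h)) / h**2
    return -second

# Simulated data (illustrative values only, not from the paper).
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=200)

mu_hat = y.mean()  # maximum likelihood estimate of mu
info = observed_information(lambda m: log_likelihood(m, y), mu_hat)
var_hat = 1.0 / info  # large-sample variance estimate of mu_hat
```

With selected data the same calculation would be applied to the log-likelihood that includes the missing-data process, which is the point made in the text.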
MODELING THE MISSING-DATA PROCESS
Ideas described by Little and Rubin (1987) are employed in subsequent
developments. Let $y$, the realized value of a random vector $Y$, denote the data that would
occur in the absence of missing values, or complete data. The vector $y$ is partitioned
into observed values, $y_{obs}$, and missing values, $y_{mis}$. Let
$$f(y \mid \theta) \equiv f(y_{obs}, y_{mis} \mid \theta) \qquad (1)$$
be the probability density function of the joint distribution of $Y = (Y_{obs}, Y_{mis})$,
and $\theta$ be an unknown parameter vector. We define for each component of $Y$ an
indicator variable, $R_i$ (with realized value $r_i$), taking the value 1 if the component
is observed and 0 if it is missing. In order to illustrate the notation, 3 types of
missing data are described in Table I. Consider 2 correlated traits measured on $n$
unrelated individuals; for example, first and second lactation yields of $n$ cows. The
complete data are $y = (y_{ij})$, where $y_{ij}$ is the realized value of trait $j$ in individual $i$
($j = 1, 2$; $i = 1, \ldots, n$). Suppose that selection acts on the first trait (case (a) in Table
I). As a result, a subset of $y$, $y_{obs}$, becomes available for analysis. The pattern of the
available data is a random variable. For example, if the better of two cows ($n = 2$)
is selected to have a second lactation, the complete data would be
$$y = (y_{11}, y_{12}, y_{21}, y_{22}),$$
and the indicator vector would be $r = (1, 1, 1, 0)$ if the first cow is selected, or
$r = (1, 0, 1, 1)$ if the second cow is selected.
Thus, in analysis of selected data, the pattern of records available for analysis,
characterized by the value of r, should be considered as part of the data. If this is
not done, there will be a loss of information.
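The two-cow example can be sketched in a few lines to show that the pattern of available records is itself random. This is only an illustration under stated assumptions: "better" is taken to mean the higher first-lactation yield, the complete data are stored as a 2 x 2 array (cow by lactation), and the means, variances and function name are hypothetical.

```python
import numpy as np

def select_better_cow(y):
    """y[i, j] is trait j of cow i (complete data, 2 cows x 2 lactations).

    Returns the indicator array r of the same shape: 1 = observed, 0 = missing.
    Assumption: the cow with the higher first-lactation yield is the one selected.
    """
    r = np.ones_like(y, dtype=int)
    culled = int(np.argmin(y[:, 0]))  # cow with the lower first-lactation yield
    r[culled, 1] = 0                  # her second lactation is never recorded
    return r

# Illustrative parameter values (not from the paper).
rng = np.random.default_rng(1)
mean = np.array([6000.0, 6500.0])
cov = np.array([[1.0, 0.6], [0.6, 1.0]]) * 800.0**2

# Draw the complete data repeatedly and record which pattern is realized.
patterns = [select_better_cow(rng.multivariate_normal(mean, cov, size=2))
            for _ in range(1000)]
share_first = np.mean([p[0, 1] for p in patterns])  # fraction of replicates in which cow 1 is selected
```

Because the two cows are exchangeable in this sketch, each pattern should occur in roughly half of the replicates; which one is realized depends on the first-lactation records.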
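To treat $R = (R_i)$ as a random variable, we need to specify the conditional
probability that $R = r$, $f(r \mid y, \phi)$, given the complete data $Y = y$; the vector $\phi$
is an unknown parameter of the missing-data process. The joint density of $Y$ and $R$ is then
$$f(y, r \mid \theta, \phi) = f(y \mid \theta) \, f(r \mid y, \phi). \qquad (2)$$
The likelihood ignoring the missing-data process, or marginal density of $y_{obs}$ in
the absence of selection, is obtained by integrating out the missing data $y_{mis}$ from
equ.(1):
$$f(y_{obs} \mid \theta) = \int f(y_{obs}, y_{mis} \mid \theta) \, dy_{mis}. \qquad (3)$$
The problem with using $f(y_{obs} \mid \theta)$ as a basis for inferences is that it does not take
into account the selection process. The information about $R$, a random variable
whose value $r$ is also observed, is ignored. The actual likelihood is
$$f(y_{obs}, r \mid \theta, \phi) = \int f(y_{obs}, y_{mis} \mid \theta) \, f(r \mid y_{obs}, y_{mis}, \phi) \, dy_{mis}. \qquad (4)$$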
The question now arises as to when inferences on $\theta$ should be based on the joint
likelihood (equ.(4)), and when they can be based on equ.(3), which ignores the
missing-data process. Rubin (1976) has studied conditions under which inferences from
equ.(3) are equivalent to those obtained from equ.(4). If these hold, one can say
that the missing-data process can be ignored. The conditions given by Rubin (1976)
are: 1) the missing data are missing at random, ie, $f(r \mid y_{obs}, y_{mis}, \phi) = f(r \mid y_{obs}, \phi)$
for all $\phi$ and $y_{mis}$, evaluated at the observed values $r$ and $y_{obs}$; and 2) the parameters
$\theta$ and $\phi$ are distinct, in the sense that the joint parameter space of $(\theta, \phi)$ is the
product of the parameter space of $\theta$ and the parameter space of $\phi$. Within the
context of Bayesian inference, the missing-data process is ignorable when 1) the
missing data are missing at random, and 2) the prior density of $\theta$ and $\phi$ is the
product of the marginal prior density of $\theta$ and the marginal prior density of $\phi$.
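A small simulation can illustrate what ignorability means in practice when the missing data are missing at random. The sketch below rests on assumptions that are not part of the paper: two traits are bivariate normal, the animals with the highest first-trait records are the ones given a second record, and all parameter values are hypothetical. It compares the raw mean of the observed second records with the maximum likelihood estimate obtained from the likelihood of the observed data, ignoring selection. For this monotone pattern the latter has the well-known closed form of a regression adjustment.

```python
import numpy as np

rng = np.random.default_rng(42)
n, m = 20000, 5000                      # animals measured, animals selected
mu = np.array([0.0, 1.0])               # true means of traits 1 and 2 (illustrative)
cov = np.array([[1.0, 0.7], [0.7, 1.0]])

y = rng.multivariate_normal(mu, cov, size=n)    # complete data
selected = np.argsort(-y[:, 0])[:m]             # top m animals on trait 1 get a second record

y1_all = y[:, 0]                                # trait 1 is observed on every animal
y1_sel, y2_sel = y[selected, 0], y[selected, 1]

# Naive estimate: mean of the second records that happen to be observed.
mu2_naive = y2_sel.mean()

# ML estimate from the observed-data likelihood, ignoring the selection process:
# regress trait 2 on trait 1 among selected animals, then adjust to the mean of all first records.
beta = np.cov(y1_sel, y2_sel)[0, 1] / np.var(y1_sel, ddof=1)
mu2_ml = y2_sel.mean() + beta * (y1_all.mean() - y1_sel.mean())
```

In this setting the naive mean is expected to lie well above the true value, whereas the likelihood-based estimate should be close to it. Selection here depends only on first records that are all part of the observed data, which is the situation covered by condition 1).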
IGNORABLE OR NON-IGNORABLE SELECTION
Without loss of generality, we examine ignorability of selection when making
likelihood inferences about $\theta$ for each of the three examples given in Table I. Suppose
individuals $1, 2, \ldots, m$ ($m < n$) are selected.
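In case (a), selection is based on observations on the first trait, which are a part
of the observed data, so all the data used to make selection decisions are available.
The likelihood for the observed data, ignoring selection, is
$$f(y_{obs} \mid \theta) = \prod_{i=1}^{m} f(y_{i1}, y_{i2} \mid \theta) \prod_{i=m+1}^{n} f(y_{i1} \mid \theta).$$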
Because selection is based on the observed data only, the conditional probability
$f(r \mid y, \phi) = f$ (...truncated)