Secondary Analysis under Cohort Sampling Designs Using Conditional Likelihood
Hindawi Publishing Corporation
Journal of Probability and Statistics
Volume 2012, Article ID 931416, 37 pages
doi:10.1155/2012/931416
Research Article
Olli Saarela,1 Sangita Kulathinal,2,3 and Juha Karvanen4,5
1 Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, QC, Canada H3A 1A2
2 Indic Society for Education and Development (INSEED), Nashik, Maharashtra 422 011, India
3 Department of Vaccines, National Institute for Health and Welfare, 00271 Helsinki, Finland
4 Department of Mathematics and Statistics, University of Tampere, 33014 Tampere, Finland
5 Department of Mathematics and Statistics, University of Helsinki, 00014 Helsinki, Finland
Correspondence should be addressed to Olli Saarela,
Received 28 July 2011; Revised 29 December 2011; Accepted 24 January 2012
Academic Editor: Kari Auranen
Copyright © 2012 Olli Saarela et al. This is an open access article distributed under the Creative
Commons Attribution License, which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
Under cohort sampling designs, additional covariate data are collected on cases of a specific type
and a randomly selected subset of noncases, primarily for the purpose of studying associations
with a time-to-event response of interest. With such data available, interest may arise in reusing
them to study associations between the additional covariate data and a secondary non-time-to-event response variable, usually collected for the whole study cohort at the outset of the study.
Following earlier literature, we refer to such a situation as secondary analysis. We outline a general
conditional likelihood approach for secondary analysis under cohort sampling designs and discuss
the specific situations of case-cohort and nested case-control designs. We also review alternative
methods based on full likelihood and inverse probability weighting. We compare the alternative
methods for secondary analysis in two simulated settings and apply them in a real-data example.
1. Introduction
Cohort sampling designs are two-phase epidemiological study designs where information
on time-to-event outcomes of interest over a followup period and some basic covariate data
are collected on the whole first-phase study group, referred to as a cohort, and in the second
phase, more expensive or difficult-to-obtain additional covariate data are collected only on
a subset of the study cohort. This usually comprises the cases, that is, individuals with a
disease event of interest during the followup, and a randomly selected subset of noncases.
Examples are the case-cohort [1–3] and nested case-control [4, 5] designs. Primarily, such
designs are applied for the purpose of studying associations between the time-to-event
outcomes and the covariates collected in the second phase. However, once such data have
been collected, interest frequently arises in reusing them to study associations between the
second-phase covariates and the other available covariate data. For instance, the covariates
collected in the second phase could be genotypes, while the other covariates may be various
phenotype measurements carried out at the outset of the followup period for the whole
cohort. The interest would then be to explain a phenotypic response with the genetic covariates. Following Jiang et al. [6] and Lin and Zeng [7], we refer to such a situation as secondary
analysis. Here, we concentrate specifically on non-time-to-event secondary outcomes. Analysis of secondary time-to-event outcomes under the nested case-control design has been
considered previously by Saarela et al. [8] and Salim et al. [9].
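To make the data structure of such a design concrete, the following minimal Python sketch simulates a cohort in which a phenotype (here called BMI) is observed for everyone, while a genotype is observed only for cases and a random subcohort, as under a case-cohort design. It then compares a naive complete-case mean with a simple inverse selection probability weighted mean. All function names, parameter values, and the assumed dependence of the event probability on BMI are illustrative assumptions and are not taken from the study data or from the estimators developed in this paper.

```python
import random


def simulate_case_cohort(n=5000, subcohort_fraction=0.2, seed=1):
    """Simulate a cohort with a case-cohort second-phase selection.

    Hypothetical setup: genotype in {0, 1, 2}, BMI observed for all,
    event risk increasing with BMI, subcohort drawn by Bernoulli sampling.
    """
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        true_genotype = rng.choice([0, 1, 2])
        bmi = rng.gauss(26.0 + 0.3 * true_genotype, 4.0)
        # Assumed event model: selection will depend on the secondary outcome.
        p_event = min(1.0, max(0.0, 0.05 + 0.01 * (bmi - 26.0)))
        is_case = rng.random() < p_event
        in_subcohort = rng.random() < subcohort_fraction
        selected = is_case or in_subcohort
        cohort.append({
            "bmi": bmi,
            "is_case": is_case,
            "selected": selected,
            # Genotype is observed only in the second-phase study group.
            "genotype": true_genotype if selected else None,
        })
    return cohort


def inclusion_probability(person, subcohort_fraction):
    """Cases are always selected; noncases only via the subcohort."""
    return 1.0 if person["is_case"] else subcohort_fraction


def naive_mean_bmi(cohort, genotype):
    """Unweighted mean BMI among selected individuals with this genotype."""
    values = [p["bmi"] for p in cohort if p["genotype"] == genotype]
    return sum(values) / len(values)


def weighted_mean_bmi(cohort, subcohort_fraction, genotype):
    """Inverse selection probability weighted mean BMI for this genotype."""
    num = den = 0.0
    for p in cohort:
        if p["selected"] and p["genotype"] == genotype:
            w = 1.0 / inclusion_probability(p, subcohort_fraction)
            num += w * p["bmi"]
            den += w
    return num / den


if __name__ == "__main__":
    data = simulate_case_cohort()
    for g in (0, 1, 2):
        print(g, naive_mean_bmi(data, g), weighted_mean_bmi(data, 0.2, g))
```

Because cases are overrepresented in the second-phase group and, in this sketch, the event risk depends on BMI, the unweighted mean would be expected to differ systematically from the cohort-level mean, whereas the weighting is intended to compensate for the unequal selection probabilities.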
As our motivating example, we consider here a single cohort which was used in a
larger meta-analysis of association between the European lactase persistence genotype and
body mass index (BMI) [10], the latter being a secondary outcome in the cohort study in question. The cohort consists of 5073 men aged 55–77 years from southern and western Finland,
who originally formed the placebo group of the ATBC cancer prevention study [11]. Whole
blood samples of the participants were taken between 1992 and 1993, which is here considered as the baseline of the cohort, with followup for cardiovascular disease events and
all-cause mortality available until the end of year 1999. There is no loss to followup, so the
only censoring present is type I censoring due to the end of the followup period. This cohort is part of
the MORGAM Project, an international pooling of cardiovascular cohorts [12]. Genotype data,
including the lactase persistence SNP rs4988235, have been collected under this project
under a case-cohort design described in detail in [13] and herein in Section 4.3.1. Given such
data, our aim is to estimate the association between the lactase persistence genotype and BMI
making use of genotype data collected on both the random subcohort and cases of all-cause
mortality.
Secondary analysis of case-control data has been studied previously, using profile likelihood [14], inverse selection probability weighting methods [15–17], or retrospective likelihood [6, 7]. However, to the best of our knowledge, a systematic discussion of secondary
analysis under cohort sampling designs has been lacking. We aim to fill this gap here
by discussing alternative approaches for such an analysis under a generic two-phase study
design. We will briefly review the full likelihood approach, which utilizes all observed data
(Section 2), as well as pseudolikelihoods based on inverse selection probability weighting
(Section 3). As an alternative to these approaches, we propose a conditional likelihood-based method
(Section 4), restricted to the fully observed second-phase study group. Conditional likelihood
inference under cohort sampling designs has been studied previously for the analysis of the
primary time-to-event outcome by Langholz and Goldstein [18] and Saarela and Kulathinal
[19]; here, we extend these methods to the secondary analysis setting. The main interest is
in continuous secondary outcomes, though the approach would also be valid for categorical
responses. As special cases of the general setting, we consider case-cohort and nested case-control designs. As extensions to the basic setting, we consider treatment of missing second-phase covariate data and adjustment for left truncation in the case of incident time-to-event
outcomes (Section 5). In Section 6, we present two simulation studies, first comparing the
efficiencies of the alternative approaches and then demonstrating the potential adverse effects
of a small sampling fraction in full likelihood inference. We also carry out the analysis in
the real-data example using all three alternative methods. As the model for the continuous
secondary response variable, in addition to the customary normal distr (...truncated)