Highly conserved extended haplotypes of the major histocompatibility complex and their relationship to multiple sclerosis susceptibility
Highly conserved extended haplotypes of the major histocompatibility complex and their relationship to multiple sclerosis susceptibility
Douglas S. Goodin 0 1
Pouya Khankhanian 1
Pierre-Antoine Gourraud 0 1
Nicolas Vince 1
0 Department of Neurology, University of California , San Francisco, CA , United States of America, 2 Center for Neuro-engineering and Therapeutics, University of Pennsylvania , Philadelphia, PA , United States of America, 3 Centre de Recherche en Transplantation et Immunologie UMR 1064, INSERM, Universit eÂ de Nantes, Nantes, France, 4 Institut de Transplantation Urologie NeÂphrologie (ITUN) , CHU Nantes, Nantes , France
1 Editor: Jianming Tang, University of Alabama at Birmingham , UNITED STATES
Data Availability Statement: All relevant data are
within the paper and its Supporting Information
Funding: The authors received no specific funding
for this work.
Competing interests: The authors have declared
that no competing interests exist.
To determine the relationship between highly-conserved extended-haplotypes (CEHs) in
the major histocompatibility complex (MHC) and MS-susceptibility.
Among the ~200 MS-susceptibility regions, which are known from genome-wide analyses of
single nucleotide polymorphisms (SNPs), the MHC accounts for roughly a third of the
currently explained variance and the strongest MS-associations are for certain Class II alleles
(e.g., HLA-DRB1*15:01; HLA-DRB1*03:01; and HLA-DRB1*13:03), which frequently
reside on CEHs within the MHC.
Autosomal SNPs (441,547) from 11,376 MS cases and 18,872 controls in the WTCCC
dataset were phased. The most significant MS associated SNP haplotype was composed of 11
SNPs in the MHC Class II region surrounding the HLA-DRB1 gene. We also phased alleles
at the HLA-A, HLA-C, HLA-B, HLA-DRB1, and HLA-DQB1 loci. This data was used to probe
the relationship between CEHs and MS susceptibility.
We phased a total of 59,884 extended haplotypes (HLA-A, HLA-C, HLA-B, HLA-DRB1,
HLA-DQB1 and SNP haplotypes) from 29,942 individuals. Of these, 10,078 unique
extended haplotypes were identified. The 10 most common CEHs accounted for 22%
(13,302) of the total. By contrast, the 8,446 least common extended haplotypes also
accounted for approximately 20% (12,298) of the total. This extreme frequency-disparity
among extended haplotypes necessarily complicates interpretation of reported
diseaseassociations with specific HLA alleles. In particular, the HLA motif
HLA-DRB1*15:01~HLADQB1*06:02 is strongly associated with MS risk. Nevertheless, although this motif is almost
always found on the a1 SNP haplotype, it can rarely be found on others (e.g., a27 and a36),
and, in these cases, it seems to have no apparent disease-association (OR = 0.7; CI = 0.3±
1.3 and OR = 0.7; CI = 0.2±2.2, respectively). Furthermore, single copy carriers of the a1
SNP-haplotype without this HLA motif still have an increased disease risk (OR = 2.2; CI =
1.2±3.8). In addition, even among the set of CEHs, which carry the Class II motif of
HLADRB1*15:01~HLA-DQB1*06:02~a1, different CEHs have differing strengths in their
The MHC in diverse human populations consists, primarily, of a very small collection of very
highly-selected CEHs. Our findings suggest that the MS-association with the
HLADRB1*15:01~HLA-DQB1*06:02 haplotype may be due primarily to the combined attributes
of the CEHs on which this particular HLA-motif often resides.
The basis of genetic susceptibility to multiple sclerosis (MS) is complex [1±3]. Thus, currently,
there are over 200 MS associated common risk variants in different genomic regions that have
been identified by genome wide association screens (GWAS) comparing MS patients to
controls [4±12]. These GWAS studies typically evaluate the disease associations for ~500,000
single nucleotide polymorphisms (SNPs) scattered throughout the genome [4±12]. Despite the
large number of genetic associations defined by these increasingly available GWAS studies,
several alleles of the human leukocyte antigens (HLA), located in the major histocompatibility
complex (MHC) on the short arm of chromosome 6 (6p21.3), were identified more than four
decades ago. The most prominent of these HLA associations (by far) is with the
HLA-DRB1 15:01 allele, which typically has an odds ratio (OR) of more than three for
heterozygotes and more than six for homozygotes [9, 13±20]. Also, other alleles at the DRB1 locus
(e.g., HLA-DRB1 03:01 and HLA-DRB1 13:03) are known to be associated with an increased
risk of getting MS [
1, 11, 21
]. However, even with the large number of defined genetic
associations with MS, most of the genetic risk in MS remains unexplained. In addition, as shown in
Figure A in S3 File, the large majority of the population does not even belong to the subgroup
of individuals who are ªgenetically susceptibleº to getting MS [
]. Observations such as these
have created a so-called ªheritability gapº. Such a gap is a common finding in many complex
genetic disorders [
] and is likely due (at least in part) to the phenomenon of ªsynthetic
associationº , in which a reported association is simply tagging a genomic region rather than
identifying a causal variant. Indeed, both single SNPs and single alleles can be associated with
several haplotypes sometimes spread over a considerable genetic distance [23±34]. For
example, despite the apparently well-established association of MS susceptibility with the
HLA-DRB1 15:01 allele, this association might be due to a synthetic association [
Moreover, as demonstrated in Figure A in S3 File, even for the HLA-DRB1 15:01 allele, the large
majority of its carriers do not even belong to the subset of individuals who are ªgenetically
susceptibleº to getting MS .
Some of the haplotypes in the MHC region are highly conserved extended haplotypes
(CEHs), which span more than 2.7 megabases (mb) [23±28, 30, 32±36]. These CEHs exist even
2 / 23
though the MHC region encompasses several recombination hotspots and the region as a whole
has an average recombination rate of ~0.4 centimorgans (cM) per mb [
27, 34, 37, 38
mechanisms to account for this kind of extended linkage are: ªfrozen blocksº of DNA,
preservation of ancestral lineages, haplotype-specific suppression of recombination / mutation in parts of
the MHC region, or some form of balancing evolution, in which heterozygosity is favored [24,
39±43]. Several of these CEHs include HLA-DRB1 15:01, HLA-DRB1 03:01, HLA-DRB1 13:03,
or other alleles. For example, the haplotypes:
B 07 : 02
DRB1 15 : 01
A 25 : 01
HLA C 12 : 03
DQB1 06 : 02
B 18 : 01
DRB1 15 : 01
have been consistently observed in Caucasian populations [23±28, 32, 35, 38]. Necessarily, the
existence of such CEHs in the MHC region complicates the interpretation of the disease
association with any specific HLA allele. We recently explored a method for reducing the size of the
heritability gap by analyzing SNP haplotypes (rather than single SNPs) throughout the genome [
In addition to improving significantly the explained genetic risk, this method also provides an
opportunity to explore in greater depth the genetic associations of the MHC reported previously.
For example, using the Wellcome Trust Case Control Consortium dataset (WTCCC), we
found an 11-SNP haplotype in the MHC region, which had the greatest MS disease association
of any, and which we labeled the a1 SNP haplotype (OR [single copy] 3; p<10−300) [
This SNP haplotype represents a specific string of 11 SNPs spanning a total of 246.3 kilobases
(kb) surrounding the HLA-DRB1 gene (Fig 1) and includes the SNPs (rs2395173, rs2395174,
rs3129871, rs7192, rs3129890, rs9268832, rs532098, rs17533090, rs2187668, rs1063355, and
rs9275141). These 11 SNPs define 174 haplotypes in this region (e.g., Table 1), with each SNP
haplotype having its own Class II HLA haplotype specificity (e.g., Table 1; Fig 2). As with other
previously reported SNP ªhitsº in this genomic region [9, 13±17], the a1 SNP haplotype is
Fig 1. Location of the 11 SNPs in the haplotype surrounding the Class II DRB1 gene on chromosome 6 (6p21.3), which had the greatest disease association of any
SNP haplotype in the region (see text). The blue rectangles span the regions from the start to the stop points of the Class II genes: HLA-DRB5, HLA-DRB1,
HLADQA1, and HLA-DQB1. The centromere of Chromosome 6 lies to the right of this portion of 6p21.3.
3 / 23
tightly coupled to the MHC Class II haplotype of HLA-DRB1 15:01~HLA-DQB1 06:02. In the
present paper, we have analyzed the haplotype structure of the MHC (including both HLA
alleles and SNP haplotypes) to better understand the specific genetic relationship of this
genomic region to MS.
Highly conserved haplotypes of the MHC
Some of the CEHs in the MHC region, which are highly conserved, involve both Class I and
Class II loci [24±38]. The different combinations of alleles at three Class I loci (HLA-A,
HLAB, and HLA-C) and two Class II loci (HLA-DRB1 and HLA-DQB1) together with a specific
11-SNP haplotype represent more than 4 billion possible unique haplotypes spanning a
genomic distance of 2.7 mb. Despite this huge number of possibilities, the frequency distribution
for these extended haplotypes in the WTCCC is definitely non-Gaussian, with many very rare
haplotypes together with a small number of very common haplotypes (e.g., Fig 3; Figure B in
S3 File; S1 Table; S2 Table). Thus, there were just 10,078 unique haplotypes represented within
the 29,942 individuals of the WTCCC accounting for 59,884 total observed haplotypes. Of
these, 13,302 (22%) were accounted for by the most common 10 CEHs, 30% by the most
common 25 CEHs, 48% by the 146 CEHs with 50 or more representations in the WTCCC, and
71% by the most common 810 CEHs (S1 Table). On the other end, 6,016 (60%) of the unique
extended haplotypes were observed only once in the WTCCC dataset. An additional 1,397
(14%) had only 2 representations so that 7,413 (74%) of the unique haplotypes had two or
fewer representations. However, these 74% of the unique haplotypes accounted for only 8,810
(15%) of the total number of observed haplotypes in the WTCCC dataset. Consequently, there
exists a small set of very common CEHs, which have been strongly selected (see S2 File), and
which, nonetheless, have notably different compositions in different populations, even among
relatively nearby geographic regions (Fig 4; S1 and S2 Tables). Moreover, there also appears to
be a substantial amount of mixing between specific Class I and Class II motifs (see S1 File).
In addition, the prevalence of individuals in the WTCCC who were homozygous for these
CEHs was increased relative to expectations (expected = 269; observed = 383; z = 6.97;
p<10−11). Such an increase was found for both the cases (expected = 152; observed = 208;
z = 4.59; p<10−5) and the controls (expected = 138; observed = 175; z = 3.13; p = 0.0018).
Haplotype associations with MS susceptibility
The fact that much (possibly most) of the MHC is composed of a small group of CEHs
necessarily complicates the interpretation of any disease associations previously reported for specific
alleles such as HLA-DRB1 15:01, HLA-DRB1 03:01, and HLA-DRB1 13:03 [1, 9, 13±17, 19, 21,
29, 30]. For example, it is unclear to what extent the effect of HLA-DRB1 15:01 on disease
susceptibility can be separated from an effect of the full CEHs (comprising both the 5 HLA alleles
and the SNP haplotypes) on which this allele resides. To investigate this, we undertook two
alternative approaches. First, we examined the disease association of different CEHs, which
contained HLA-DRB1 15:01~HLA-DQB1 06:02~a1, HLA-DRB1 03:01~HLA-DQB1 02:01~a2,
HLA-DRB1 03:01~HLA-DQB1 02:01~a6, or HLA-DRB1 13:03~HLA-DQB1 03:01~a14.
Second, we examined the disease associations for haplotypes that either contained these same
Class II HLA motifs but a different SNP haplotype motif or contained the same SNP haplotype
motif but a different Class II HLA motif.
HLA-DRB1 15:01~HLA-DQB1 06:02. The HLA-DRB1 15:01~HLA-DQB1 06:02
haplotype is very closely associated with the (a1) SNP haplotype; 99% of all (a1)-carriers also carry
HLA-DRB1 1501~HLA-DQB1 0602 and the reciprocal statement is true as well (Fig 2). The
4 / 23
² The "Name" is arbitrary and indicates the order of haplotype identification in the EPIC dataset [
]. The SNP haplotype represents the haplotypes identified using
the set of 11 SNPs shown in Fig 1 and provided in text. The number ª0º indicates the presence of the major allele and the number ª1º indicates the presence of the
minor allele (in the control population) at the particular SNP location. Only 14 selected SNP-haplotypes (of the 174 present in the WTCCC) are listed. Haplotype
frequencies found in two independent datasets (EPIC and WTCCC) are shown [
]. Frequencies are provided to 2 significant digits after the decimal. Those listed as
(0.00) were less than 0.005. Each of the 174 haplotypes had very specific associations with specific Class II haplotypes. For example, each of the associations (shown in
the Table) of specific SNP-haplotypes with specific HLA haplotypes were highly significant. Almost all had of p-value (by Chi square analysis) of (p<10−300). The only
two exceptions to this were for HLA-DRB1 07:01~HLA-DQB1 02:02~a3 (p<10−151) and for HLA-DRB1 15:01~HLA-DQB1 06:02~a34 (p<10−290). Moreover, both the
EPIC and the WTCCC datasets had the same Class II HLA associations with the different SNP-haplotypes.
²² In both EPIC and the WTCCC, a3 was equally associated with four HLA haplotypes: HLA-DRB1 04:01~HLA-DQB1 03:01, HLA-DRB1 04:01~HLA-DQB1 03:02,
HLA-DRB1 04:04~HLA-DQB1 03:02, and HLA-DRB1 07:01~HLA-DQB1 02:02.
§ In both EPIC and WTCCC, a27 is associated with two haplotypes: HLA-DRB1 15:01~HLA-DQB1 06:02, and HLA-DRB1 15:01~HLA-DQB1 05:02,. In WTCCC, 58%
(28/48) were HLA-DRB1 15:01~HLA-DQB1 06:02, whereas, in EPIC, none of the five a27 SNP haplotypes were associated with this particular HLA haplotype.
§§ The single example of the a34 SNP haplotype in EPIC was associated with the HLA-DRB1 15:01~HLA-DQB1 06:02 HLA haplotype. No examples of the a36 SNP
haplotype were present in EPIC who also had HLA information.
disease associations of all CEHs containing HLA-DRB1 15:01~HLA-DQB1 06:02~a1 with 50
or more representations in the WTCCC dataset are shown in Table 2. Each of these extended
haplotypes is significantly associated with an increased disease risk (Table 2). However, for
several of them, the magnitude of the association with disease risk varied significantly
(Figure C in S3 File). Indeed, for example, the disease-association for haplotype (c2) was
significantly greater that for than both the (c3) and the (c11)) haplotypes (Figure C in S3 File). By
contrast, the haplotype (c3) had a significantly smaller disease-association than that of several
5 / 23
Fig 2. The HLA haplotype/SNP haplotype associations±both by SNP haplotype (A) and also by HLA haplotype (B)±for selected SNP haplotypes
(some of which are presented in Table 1). Other haplotypes not presented also had very specific haplotype associations [
other haplotypes (Figure C in S3 File). Especially notable, however, was haplotype (c282),
consisting of HLA-A 03:01~HLA-C 15:02~HLA-B 51:01~HLA-DRB1 15:01~HLA-DQB1 06:02~a1,
which had an extremely strong disease association (OR = 20.3; CI = 6.1− 67.3; p<10−11), and
which differed significantly from every other haplotypes with the exception of the (c173)
Fig 3. The WTCCC dataset consists of 59,884 haplotypes, of which 10,078 represent different (unique) combinations of the 5 HLA alleles and
the SNP haplotypes (see text). For the purpose of this graph, these unique haplotypes (CEHs) have been sorted according to their descending
frequency of occurrence in the WTCCC dataset. The cumulative number of unique haplotypes (beginning with the highest frequency haplotype) has
been plotted against the percentage of total number of haplotypes in the population. As can be appreciated from the graph, the large majority (~80%)
of the different CEHs have only a very low frequency, whereas 80% of the haplotypes in the population are accounted for by only small number of very
common CEHs (i.e., ~10 haplotypes).
7 / 23
Fig 4. Rank order for the 10 most common extended haplotypes for the entire WTCCC dataset (labeled: c1 to c10078; in descending order of frequency). The rank
order of the haplotypes for each participating region are shown separately (see S1 Table for definitions of those haplotypes, which have been colored in the figure based
on the overall 10 most common haplotypes in the WTCCC). Regions are ordered (from left to right) based on the descending frequency of the c2 haplotype. Only cases
are available for all regions. Nevertheless, both the complete WTCCC (Case and Control) and the EPIC (Case and Control) populations are also included for
haplotype (Figure C in S3 File). However, regardless of the fact that the magnitude of disease
association depends upon the particular CEH, on which the HLA-DRB1
15:01~HLADQB1 06:02~a1 motif resides, some disease risk seems to be attributable to the
HLADRB1 15:01~HLA-DQB1 06:02~a1 haplotype by itself because the disease risk is still
significantly increased for those individuals who both carry this complete Class II motif and, yet,
whose full CEH has only a single representation in the WTCCC (OR = 3.0; CI = 2.7−3.4;
Despite the extremely strong association of the (a1) SNP-haplotype with this particular
HLA haplotype, some HLA-DRB1 15:01~HLA-DQB1 06:02 motifs occur in association with
other SNP-haplotypes and some of these combinations seem not to have any
disease-association (Fig 5A). Thus, for example, single-copy carriers of either HLA-DRB1
15:01~HLADQB1 06:02~a27 or the HLA-DRB1 15:01~HLA-DQB1 06:02~a36 haplotypes, seem not to
have any increase in their disease-risk (OR = 0.7; CI = 0.3−1.3 and OR = 0.7; CI = 0.2−2.2,
respectively). These ORs are significantly different for both the (a27)-containing haplotype
(z = 2.5; p = 0.01) and for the (a36)-containing haplotype (z = 4.2; p<10−4) compared to the
same HLA-haplotype containing (a1). Similarly, as shown in Fig 5A, considering together all
non-(a1)-containing haplotypes carrying the HLA-DRB1 15:01~HLA-DQB1 06:02 motif these
also had significantly smaller ORs than the (a1)-containing haplotypes (z = 3.9; p<10−4). By
contrast, single copy carriers of the (a1) SNP haplotype who lack the HLA-DRB1
15:01~HLADQB1 06:02 HLA haplotype, still have a significantly increased disease risk (OR = 2.2;
CI = 1.2±3.8). Moreover, although this OR is less than that found for single copy carriers of the
HLA-DRB1 15:01~HLA-DQB1 06:02~a1 haplotype, the confidence intervals overlap and the
two ORs did not differ significantly (z = 1.2; p = 0.24).
In the WTCCC dataset, the HLA alleles were imputed [
] and, thus, it is possible that
either errors of imputation or errors in SNP identification could have influenced these
findings. We addressed these possibilities in two ways. First, we compared the HLA associations of
the different SNP haplotypes in the imputed WTCCC dataset with the HLA haplotype
associations in the Expression, Proteomics, Imaging, and Clinical (EPIC) Study dataset, which had
been determined by sequence based typing methods [
]. There was an excellent agreement in
the corresponding Class II SNP haplotype associations found in the two datasets (Table 1). In
8 / 23
²² a1 containing haplotypes with 50 representations in the WTCCC. Two additional haplotypes with fewer representations are also shown.
² Arbitrary name for haplotype (sorted in descending order of frequency) for the entire WTCCC population.
Odds ratio (OR) of disease for individuals having 1 copy of the listed haplotype compared to having no other copies of the HLA-DRB1 15:01~HLA-DQB1 06:02~a1
Class II haplotype (95% CI range in parenthesis). A Bonferroni correction for the number of haplotypes with 50 or more representations (146) would require a
significance level of p<3 E-4.
Significance of the association between having 1 copy of the specific allele and the disease (MS) compared to having no copies. The p-values are expressed in scientific
notation as powers of 10 (E). All observations with (p<0.001) still demonstrated a statistically significant effect even after adjustment for population stratification,
geographic stratification, and gender. Moreover, including each of these haplotypes in the same regression equation demonstrated that each of the listed CEHs was
independently associated with having MS.
§ These two haplotypes also differed (non-significantly) in their disease-association for having two copies of each allele compared to having no copies of the
HLADRB1 15:01~HLA-DQB1 06:02~a1 Class II haplotype. Thus, these ORs are
For c2: OR [two copies] = 5.8 (3.4±9.9)
And, for c3: OR [two copies] = 2.7 (1.3±5.5)
§§ The Class I and Class II portions of each listed haplotype were significantly associated with each other beyond the Bonferroni-adjusted level of significance. The only
exception to this rule was for the haplotype c139. In this case, the association had a p-value of: p = 4.42 E−8
addition, several of the rare HLA-DRB1 15:01~HLA-DQB1 06:02 containing SNP haplotypes
were found in both datasets (Table 1). Second, we analyzed the hamming distance between the
various HLA-DRB1 15:01~HLA-DQB1 06:02 containing SNP haplotypes to assess how close
these haplotypes were to each other (Figs 6 and 7). Presumably, if errors in SNP identification
9 / 23
Fig 5. Disease-associations for the different SNP-haplotype combinations with the Class II HLA haplotypes of: (A)
DRB1 1501~DQB1 0602 and: (B) DRB1 03:01~DQB1 02:01 & DRB1 13:03~DQB1 03:01. The odds ratios (OR) are given
comparing cases to controls with regard to carrying either one or two copies of the risk-haplotype as opposed to carrying
zero copies. In these circumstances, the disease association varied markedly, depending upon which SNP-haplotype carried
the HLA-haplotype. Such an observation indicates that the observed disease-associations were not due to these specific HLA
alleles but, rather, to something else, which was present on these SNP-haplotypes (see text). For unclear reasons, this data set
did not replicate the findings of Chao and coworkers [
] with respect to the HLA-B 08, HLA-B 13, HLA-B 27, HLA-B 32,
and HLA-B 52 haplotypes (see text). In the WTCCC data, however, vast majority (96−100%) of the haplotypes that carried
these HLA-B alleles, when they included the HLA-DRB1 15:01 allele, also carried the (a1) SNP haplotype. As a result,
because they also carried the (a1) SNP haplotype, each of these haplotypes was strongly associated with an increased MS-risk
except for the extremely rare HLA-B 52~HLA-DRB1 15:01~a1 haplotype (where OR = 1.01).
were responsible for occasionally assigning the HLA-DRB1 15:01~ HLA-DQB1 06:02
haplotype to rare SNP haplotypes, the percentage of these errors would tend to be higher for
haplotypes at short hamming distances from (a1). However, no such relationship was evident (Figs
6 and 7).
HLA-DRB1 03:01~HLA-DQB1 02:01. The haplotype HLA-DRB1 03:01~HLA
DQB1 02:01 is divided between the (a2) and the a6 SNP haplotypes (Figs 2 & 5B; Table 3).
These two haplotypes seem to have distinctive disease associations. Thus, the a2-containing
haplotype show dominance (or dose dependence), such that both the heterozygotes and
homozygotes have an increased disease risk (Fig 5B). This is the case for all the common
a2-containing extended haplotypes (Table 3). By contrast, the (a6)-containing haplotypes, for the most
part, show a recessive pattern such that heterozygotes seem not to have any increased risk (Fig
5B). Thus, the increased risk in (a2)-containing heterozygotes is significantly different from
the (a6)-containing heterozygotes (z = 5.9; p<10−8), and, in addition, the (a6)-containing
homozygotes have a substantially increased disease risk, which is significantly greater than that
found for a6-containing heterozygotes (z = 8.0; p <10−14). Again, the lack of any increased
risk for heterozygotes seems to be true for most of the (a6)-containing CEHs (Table 3).
However, this was not the case for the extended haplotype HLA-A 24:02~HLA-C
07:01~HLAB 08:01~HLA-DRB1 03:01~HLA-DQB1 02:01~a6. Thus, for this haplotype, the disease risk
for the heterozygote was both significantly increased (Table 3) and, with the exception of (c27)
and (c90), significantly greater than that for other (a6)-containing CEHs (range of z-scores:
2.2±4.6; range of p-values: 0.03±10−5).
HLA-DRB1 13:03~HLA-DQB1 03:01. The haplotype HLA-DRB1 13:03~HLA
DQB1 03:01 is essentially confined to the (a14I) SNP haplotype (Figs 2 & 5B; Table 3). This
haplotype was clearly associated with an increased disease risk in the heterozygote (Fig 5B);
roughly similar for all the most common (a14)-containing extended haplotypes (Fig 5). The
disease risk may also be increased in individuals homozygous for this haplotype although there
were too few observations to be sure (Fig 5B).
Other extended haplotypes. Several other CEHs also seemed to be associated with disease
risk (Table 3). Many of these were protective and this protective effect was evident despite the fact
that those individuals who carried the HLA-DRB1 15:01~HLA-DQB1 06:02~a1 haplotype were
removed from the analysis (Table 3). By contrast, as is also shown in Table 3, the extended
haplotype HLA-A 24:02~HLA-C 07:02~HLA-B 39:06~HLA-DRB1 08:01~HLA-DQB1 04:02~a16 was
associated with a significant increase in disease risk (OR = 3.0; CI = 1.8±5.5).
Regression analysis confirmed the significance of these observations and no significant
interactions were identified. Moreover, adjustment for population stratification, geographic
stratification and for gender did not alter these findings (Tables 2 and 3).
The EPIC cohort. The cohort of patients from the EPIC study was considerably smaller
than those in the WTCCC study and, consequently, only a limited amount of comparative
11 / 23
Fig 6. Different SNP haplotypes at distances of 1 to 4 hamming units from the a1 SNP haplotype (SNP differences highlighted in red; for SNP
definitions see text). Several of these SNP haplotypes (indicated in yellow), at times, carried the HLA-DRB1 15:01~HLA-DQB1 06:02 HLA haplotype
whereas others (indicated in blue) never did. HLA haplotypes are highlighted in green. Thus, whether or not a given SNP haplotype carried this HLA
haplotype seemed to be, not a function of the hamming distance, but rather, a property of the specific SNP haplotype involved.
information is available. For example, only 6 CEHs (c1, c2, c3, c5, c6, and c11) had 20 or more
representations in the EPIC dataset (S3 Table). Nevertheless, all four of the HLA-DRB1 15:01~
HLA-DQB1 06:02~a1 containing haplotypes (c2, c3, c6, and c11) were significantly associated
with MS and had ORs [single copy] ranging from 2.5 to 3.9, with the largest being for (c11)
and the smallest being for (c2). The haplotype (c1) had a non-significant OR [single copy] of
1.3 and the haplotype (c5) had an OR, which was significantly less than one (OR [single copy]
= 0.2). In general these findings are consistent with those reported above for the WTCCC
cohort (Tables 2 and 3; S3 Table).
In the WTCCC dataset, the MHC region seemed to be composed, largely, of a relatively small
collection of very highly-selected CEHs (see S1 File) stretching, at least, from the HLA-A locus
Fig 7. Plot of the proportion of carriers of the HLA-DRB1 15:01~HLA-DQB1 06:02 haplotype at different hamming distances from the (a1) SNP haplotype.
The magenta line represents the average of all haplotypes at a given Hamming distance. Also plotted are the subgroups of haplotypes carrying
HLADRB1 15:01~HLA-DQB1 06:02 less than 10 percent of the time (blue) and those carrying this HLA haplotype 10 or more percent of the time (orange line). Black
dots represent individual observations. Certainly, as hamming distance increased, the percentage of haplotypes carrying HLA-DRB1 15:01~HLA-DQB1 06:02
diminishes (magenta). However, even at a hamming distance of 4, some specific SNP haplotypes carry this HLA haplotype almost half of the time.
13 / 23
²² haplotypes with 50 representations in the WTCCC. All such haplotypes carrying the a2, a6, or a14 SNP haplotype are included. For each of the listed haplotypes,
the Class I and Class II portions were significantly associated with each other far beyond the Bonferroni-adjusted level of significance.
² Arbitrary name for haplotype (sorted in descending order of frequency) for the entire WTCCC population.
Odds ratio (OR) of disease for individuals having 1 copy of the listed haplotype compared to having no copies of the particular HLA-DRB1~HLA-DQB1~SNP Class II
haplotype (95% CI range in parenthesis). All haplotypes carrying the HLA-DRB1 15:01~HLA-DQB1 06:02~a1 Class II motif were excluded in this analysis. A
Bonferroni correction for the number of haplotypes with 50 or more representations (146) would require a significance level of (p<3 E-4).
Significance of the association between having 1 copy of the specific allele and the disease (MS) compared to having no copies. The p-values are expressed in scientific
notation as powers of 10 (E); ns = not significant. With exception of c23 and c46, all observations with p<0.001 still showed a statistically significant effect even after
adjustment for population stratification, geographic, stratification, and gender. Moreover, even c23 and c46 trended in this direction (p 0.10)
§ Only the c1 haplotype had enough observations to explore the disease association for having two copies of an allele compared to having no copies of the
HLADRB1 03:01~HLA-DQB1 02:01~a6 Class II haplotype. Thus, this OR was
For c1: OR [two copies] = 2.1 (1.5±2.9); p = 2.1 E-6
This effect was still statistically significant even after adjustment for population stratification (p = 3.13 E-6).
The other Class II haplotypes containing HLA-DRB1 03:01~HLA-DQB1 02:01~a6, combined, had an OR of:
OR [two copies] = 0.8 (0.1±3.4); p = ns
§§ This group of haplotypes is composed of those that also had a significant association with this disease. Most of these haplotypes seem to be protective and this
protective effect remained significant (p<0.05) even after excluding all individuals who carried the HLA-DRB1 15:01~HLA-DQB1 06:02~a1 haplotype.
to beyond the HLA-DQB1 locus (a distance spanning more than 2.7 mb of DNA). The
occurrence of homozygous CEHs was increased both in cases and controls. Such an increase might
be expected in the patient population, where the homozygotes of certain haplotypes have an
especially high disease risk [9,13±20]. However, it should not be the case for the control
population if a balancing selection (i.e., one in which some heterozygous combinations have higher
fitness than homozygous combinations) was expected [
]. Alternatively, such a finding might
be due to population stratification effects. Thus, such an increase might be expected if local
PLOS ONE | https://doi.org/10.1371/journal.pone.0190043
14 / 23
sub-populations had different CEH frequencies (e.g., like Fig 4, but with finer grained
population subdivisions) and if individuals from these sub-populations had a propensity to find
mates within their same sub-population [
Also, and as developed more fully in S2 File, when classifying the WTCCC haplotypes into
ªrareº and ªfrequentº CEHs (i.e., those found once or more than once, respectively), there is a
significant excess in the number of homozygotes observed for both ªrareº and ªfrequentº
CEHs compared to HWE expectations. For this analysis, homozygotes are considered ªrare
−rareº and ªfrequent−frequentº individuals regardless of the actual CEHs that make up the
haplotype pair. The conversion of CEHs from ªrareº to ªfrequentº or vice versa can be caused
either by biologic mechanisms (e.g., recombination or mutation) or by mistakes (e.g., typing,
imputation, or phasing errors). These errors cannot be avoided entirely due to the marked
similarity of many HLA alleles [
]. However, regardless of the underlying mechanism,
haplotype conversion, by itself, does not produce any deviation from HWE (S2 File). Also, mistakes
don't produce actual changes in CEH frequencies that accumulate over time. By contrast, over
time, actual haplotype conversions (e.g. those caused by biologic mechanisms), which are
unopposed, would reach a stable state in the population only once the net conversion rate is
zero±i.e., when the probability of frequent!rare and rare!frequent transitions are equal (S2
File). This, however, is decidedly not the state of the WTCCC, the EPIC, or other populations
here, each of which is composed predominantly of a small number of very common CEHs
(Fig 3; Figure B in S3 File). Consequently, it must be that the force of actual haplotype
conversion is being opposed by another force (i.e., selection) that both retains ªfrequentº CEHs in the
population and also perturbs HWE (S2 File). Such a selection is already strongly suggested just
based on the typical CEH composition of the different human populations (Fig 3, Figure B in
S3 File). Indeed, using the observed magnitude of the deviation from HWE, and presuming
the forces of selection and haplotype conversion balance each other, leads both to the
conclusion that the relative probability of survival for individuals with homozygous ªrareº CEHs is
less than 80% of that for individuals with homozygous ªfrequentº CEHs and also that the net
frequent ! rare haplotype conversion rate is on the order of 3−6% for the MHC region in
each generation (S2 File).
Naturally, there are possible explanations, other than selection, which could also produce a
deviation from HWE expectations. Most conspicuous and widely recognized among these is
the possibility that the WTCCC population is composed of two or more sub-populations, each
of which is in HWE but with each sub-population having different haplotype frequencies.
Such a circumstance would violate the HWE assumption of random mating and would lead to
the circumstance in which homozygotes are in excess of expectations (as we observed).
Moreover, there is no doubt that the exact CEH composition of the WTCCC varies considerably
from region to region (e.g., Fig 4; S2 Table). Nevertheless, as discussed in S2 File, there are
several reasons that even this simple mechanism seems inadequate to account for our observed
deviations from HWE, Most importantly, we examined the impact that the observed
differences in the percentage of ªrareº CEHs among the sub-populations would have had on the
HWE deviation. This analysis indicated that these differences could account for only about a
quarter of the difference in HWE that we actually observed (S2 File). Consequently, our
observations seem likely to be the result of a combination of both haplotype conversion and
haplotype selection±each representing processes that take place in every generation.
Moreover, the strong selection of CEHs implies that certain allelic combinations ªwork well
togetherº whereas other combinations do not (S2 File). Presumably, this ªworking well
togetherº, in a biological sense, means that a particular combination of these five alleles (but
likely also including other specific alleles of the many intervening genes) permit the host to
respond to a variety of abiotic and biotic threats (or opportunities) in a manner that improves
15 / 23
fitness (regardless of whether these come from the external environment, the internal
microbiome, or both). However, it is also clear from these findings that no single allelic combination
is being selected above all others. Rather, a relatively small number (in the hundreds) of
combinations are being selected simultaneously (e.g., Tables 2 and 3; S1 Table). Perhaps this is
because the nature of these abiotic and biotic threats (or opportunities) result in a very
complex ªfitness landscapeº, which is highly variable both in space and in time and, thus, in which
fitness depends upon the precise environmental context of the individual, including specific
host factors such as the exact location of their residence, their particular micro-environment,
their diet, their lifestyle, or other individual idiosyncrasies. In such a case, no single CEH may
be favored in all circumstances and, consequently, such highly variable landscape topography
might help to rationalize why so many haplotypes seem to be selected simultaneously. It might
also help to rationalize why the group composition of the selected CEHs seems to be so fluid
between separated populations (e.g., Fig 4; S1 Table). Thus, even within European populations,
the beginning of such a divergence can already be recognized (Fig 4; S2 Table) and, based on
limited data, this divergence in the group composition of the selected haplotypes in long
separated populations (i.e., Africans, AmerIndians, Asians, and Caucasians) seems to be
substantially greater (S1 File; S1 Table).
The main hypothesis of the present study was that any observed allelic disease association is
a reflection of those CEHs, which confer MS disease risk. The present study sheds considerable
light on this hypothesis. For example, although many CEHs, which include the Class II motif
HLA-DRB1 15:01~HLA-DQB1 06:02~a1, are associated with an increased disease risk
(Table 2), the actual risk varies significantly among the different extended haplotypes (Table 2;
Figure C in S3 File). Moreover, some haplotypes, which include the motif
HLA-DRB1 15:01~HLA-DQB1 06:02 but don't include the SNP-haplotype a1, seem not to
carry any risk (Fig 5A). By contrast, the (a1)-containing haplotypes, which don't include this
Class II motif, still carry substantial risk (Fig 5A). These observations suggest that the motif of
HLA-DRB1 15:01~HLA-DQB1 06:02, by itself, does not fully account for the disease risk
associated with these extended haplotypes. Regardless of this conclusion, however, some disease
risk seems to be attributable to some aspect of the HLA-DRB1 15:01~HLA-DQB1 06:02~a1
haplotype by itself. Thus, even correcting for population stratification effects, the disease risk is
still significantly increased for those individuals who both carry this Class II haplotype and,
yet, whose full extended haplotype had only a single representation in the WTCCC.
In the case of the Class II HLA motif of HLA-DRB1 03:01~HLA-DQB1 02:01, this
dependence on the extended haplotype is even more evident. Thus, most of the common extended
haplotypes, which include the Class II motif of HLA-DRB1 03:01~HLA-DQB1 02:01~a2 seem
to associate with a disease risk that is either dominant or dose dependent (Table 3; Fig 5B). By
contrast, those haplotypes, which include the motif of
HLA-DRB1 03:01~HLA-DQB1 02:01~a6, as a group, seem to associate with a disease risk that
is recessive (Fig 5B). Nevertheless, at least one of these (a6)-containing haplotypes (i.e.,
HLA-A 24:02~C 07:01~HLA-B 08:01~HLA-DRB1 03:01~HLA-DQB1 02:01~a6) is associated
with a disease risk, which is either dominant or dose dependent (Table 3).
In summary, the MHC is organized into a relatively small group of extended haplotypes
(CEHs), which seem to be under a very strong selection pressure, presumably based upon
favorable biological properties of the complete haplotype. If so, then, of necessity, this means
that disease susceptibility is probably not attributable to any specific HLA allele but rather
susceptibility is likely to be dependent upon the nature of each CEH. This conclusion seems to be
borne out by the data. Moreover, it is of note that the most highly selected of these CEHs (in
Caucasians) also seem to be the ones most likely to be associated with and increased risk of
MS. The reasons for this apparent relationship are unclear. However, it is a fact that for the
16 / 23
WTCCC population as a whole, for each of the WTCCC regions individually (Fig 4), and also
for the EPIC cohort, the three most common CEHs (and 11 of the most common 25 CEHs)
were associated with a significantly increased risk of MS (Tables 2 and 3; S3 Table). This
observation that the most highly-selected CEHs also carry the greatest MS risk presumably indicates
that there must be a net survival advantage for individuals carrying these CEHs, which
outweighs the small increased chance of getting MS±a circumstance that is also suggested by the
observation (Figure A in S3 File) that only a very small proportion of the individuals who carry
these disease-associated CEHs are even within the set of individuals who are ªgenetically
susceptibleº to getting the disease [
Materials & methods
This research has been approved by the University of California, San Francisco's Institutional
Review Board (IRB) has been conducted according to the principles expressed in the
Declaration of Helsinki.
Wellcome Trust Case Control Consortium (WTCCC). The study cohort was assembled
as a prospective multicenter, multinational, effort. This study population has been described in
detail previously [
12,14, 16, 17
]. However, in brief, this cohort included 18,872 controls and
11,376 cases with MS, although SNP haplotype data was unavailable for 380 controls and 232
cases. Of the cases, 72.9% were women, the average age-of-clinical-onset was 33.1 years, and
the mean Extended Disability Status Score (EDSS) was 3.7 . Fifteen different countries
from around the world participated (Australia, Belgium, Denmark, Finland, France, Germany,
Ireland, Italy, Poland, New Zealand, Norway, Spain, Sweden, the United Kingdom, and the
United States). The data from Australia and New Zealand were combined so that data from 14
different world regions was available. Consequently, the patients enrolled in this study (except
for a few African Americans from the United States) were of European ancestry. Although all
clinical MS subtypes were included, the large majority (89%) had a relapsing-remitting onset
]. The diagnosis of MS was made based upon internationally recognized criteria [47±49].
Control subjects were composed of a combined group, which consisted of several different
cohorts of healthy individuals with European ancestry [
]. The Ethical Committees or
Institutional Review Boards at each of the participating centers approved the protocol and informed
consent was obtained from each study participant. The WTCCC granted data access for this
Expression, Proteomics, Imaging, and Clinical (EPIC) study. An independent cohort,
for certain comparative purposes, consisted of the patients and controls enrolled in the EPIC
study of MS genetics at UCSF and this cohort, also, has been described in detail previously [
Briefly, this study included data from 964 patients with MS and 868 controls. Both patients
and controls were matched for age and gender, and all participants provided their informed
]. The cohort was drawn from the United States and, essentially, all participants
were of European ancestry. The diagnosis of MS, also, was made using internationally
recognized criteria [47±49].
Genotyping, and quality control
The genotyping methods and quality control for the WTCCC have been described in detail
]. All genotyping was performed on the Illumina Infinium platform at the
17 / 23
Wellcome Trust Sanger Institute. Case samples were genotyped using a customized
Human660-Quad chip. Common controls were genotyped on a second customized
Human1M-Duo chip (utilizing the same probes). After quality control, this provided data on
441,547 autosomal SNPs scattered throughout the genome in both MS patients and controls
. The identities of the five HLA alleles in the MHC region (A, C, B, DRB1 and DQB1) were
determined for each participant by imputation using the HIBAG method [
Genotyping and quality control methods for the EPIC cohort have also been described in
detail previously [
]. In this study, SNP genotyping was done at the Illumina facilities using the
Sentrix HumanHap550 Bead Chip. This analysis provided genotype information on 551,642
SNPs. The identities of the five HLA alleles in the MHC region (HLA-A, HLA-C, HLA-B,
HLA-DRB1 and HLA-DQB1) were determined by sequence based typing methods [
Phasing. The phasing of alleles at each of five HLA loci (HLA-A, HLA-C, HLA-B,
HLA-DRB1 and HLA-DQB1) was accomplished using a previously published probabilistic
phasing algorithm [
]. Phased SNP haplotypes were constructed using a previously
published probabilistic method [
] at sliding windows of 2 to 15 SNPs throughout the 1 mb
span surrounding the Class II region of the DRB1 gene. The SNP-window of the most
significant MS-associated SNP haplotype was carried forward as a haplotype locus, a multi-allelic
gene to be phased with the 5 classic HLA genes. As discussed earlier, this haplotype locus
consisted of 11 phased SNPs surrounding the HLA-DRB1 gene (Fig 1). The accuracy of the
phasing was confirmed by the method of SHAPEIT2 [52±54], with better than 99%
correspondence between methods.
Phasing was accomplished by determining the probability of each possible combination
and assigning the phasing to the most likely combination. At times, however, there were
several possible combinations and this method, potentially, might designate a haplotype pair in
circumstances where several compatible haplotype pairs existed and each pair had a very
similar posterior probability. Such a situation did occur, but rarely. Thus, for the HLA-A~HLA-C
~HLA-B~HLA-DRB1~HLA-DQB1 haplotypes, 98% of the designations had a posterior
probability of more than (0.5), 92% had posterior probability of more than (0.6), and 85% had a
posterior probability of more than (0.7). For the Class II haplotypes
(HLA-DRB1~HLA-DQB1~SNP), these same respective percentages were (100%, 99.997%, and
Haplotype frequencies and association testing. Disease association tests, as measured by
ORs and confidence intervals (CIs), were undertaken for each of the HLA haplotypes and
HLA plus SNP haplotypes. Because of the very strong association between the
HLADRB1 15:01~HLA-DQB1 06:02~a1 haplotype, all other associations were assessed after
excluding those individuals who carried the HLA-DRB1 15:01~HLA-DQB1 06:02~a1
haplotype. Similarly, when the association of a specific CEH carrying the HLA-DRB1
15:01~HLADQB1 06:02~a1 haplotype was assessed, all other HLA-DRB1 15:01~HLA-DQB1 06:02~a1
carriers were excluded from the analysis.
In our previous study [
] we found an association of certain Class I alleles with MS (i.e.,
HLA-A 02:01, HLA-C 05:01, HLA-B 37:01, HLA-B 38:01, and HLA-B 44:02). Consequently,
for each of the reported Class II associations (Fig 5), we undertook a regression analysis using
these Class I alleles as covariates in the regression equations. This analysis confirmed that the
reported Class II associations (Fig 5) were unaffected.
In our previous report [
] we assessed the significance of the association of each SNP
haplotype with MS and adjusted these associations for the millions of comparisons made across
18 / 23
the entire genome using the Benjamini-Hochberg method [
]. In the present manuscript, by
contrast, we analyzed the 174 distinct SNP haplotypes composed of variants at 11 SNP
locations (rs2395173, rs2395174, rs3129871, rs7192, rs3129890, rs9268832, rs532098, rs17533090,
rs2187668, rs1063355, and rs9275141). Among these haplotypes was the (a1) SNP-haplotype
(Table 1), which had the single largest disease-association with MS of any in the genome. In
the present manuscript, however, these 174 SNP haplotypes in this genomic region served
simply (and only) as an additional genetic marker to be included in the haplotype analysis with
the other 5 HLA loci and, thus, no additional statistical adjustment is necessary (or
appropriate) as a consequence of their inclusion in the analysis. Nevertheless, because only haplotypes
with 50 or more representations in the WTCCC dataset were analyzed, and because there were
146 such haplotypes, a Bonferonni correction for these multiple comparisons would require a
significance of (p < 0.05/146 = 0.0003) to be achieved.
Because of the tight linkage that exists among the Class II loci (HLA-DRB1, HLA-DQB1,
and SNP haplotype) as well as among the Class I loci (HLA-A, HLA-C, and HLA-B), the
association of the different Class I and Class II haplotype combinations (with more than 2
representations in the WTCCC dataset) was determined by the association of specific
HLA-A~HLA-C~HLA-B combinations with a specific HLA-DRB1~HLA-DQB1~SNP
haplotype combinations. The p-values for the association of different Class I with different Class II
combinations were determined using a Fisher exact test if any expected cell frequencies was 5
or less and otherwise using a Chi square test [
]. The Benjamini-Hochberg method was used
to correct for multiple testing of the different possible Class I / Class II combinations [
Significance of the difference in ORs for disease association between any two haplotypes
was determined by z-scores calculated from the difference in the natural logarithm of the ORs
for the haplotypes. Also, because of the marked predominance of the MS association with the
HLA-DRB1 15:01~HLA-DQB1 06:02~a1 haplotype, all disease association tests for other
haplotypes were assessed after persons carrying the HLA-DRB1 15:01~HLA-DQB1 06:02~a1
haplotype were excluded from the analysis. Similarly, in the case of disease association tests for
individual CEHs that carried the HLA-DRB1 15:01~HLA-DQB1 06:02~a1 Class II motif, all
other persons carrying this same Class II motif were excluded from the analysis.
Significance of disease associations were also confirmed using a regression analysis
equating phenotype (case or control) with the dose (0, 1, or 2) of each of haplotypes identified as
being disease associated. An analysis of the potential interactions between the haplotypes was
also undertaken with these regression equations.
The expected occurrence of individuals homozygous for the different CEHs (or different
CEH-types) was calculated from the measured CEH (or CEH-type) frequencies. These
individual expectations were then summed and the expected total compared to the observed total
number of homozygous individuals using a z-score.
Population stratification. We used principal components (PC) analysis excluding MHC
SNPs (Eigensoft) to correct for population stratification within the WTCCC cohort [
There was evidence of considerable population structure in the WTCCC data. An analysis of
variance test carried out between cases and controls demonstrated a significant difference for
most of the first 10 PCs (which accounted for 84% of the of the population stratification).
None of other PCs were significantly different between cases and controls (neither were PC4
or PC10). The potential impact of this population structure on our findings was assessed by
the inclusion of these 10 PCs in the final regression equation.
Geographic, gender, and age stratification. We also adjusted for geographic heterogene
ity (in addition to our adjustment for population stratification) by using dummy variable
coding for each of the different geographic regions and including these in the final regression
equation. Similarly, and adjustment for gender (male = 1; female = 0) was also included in the
19 / 23
final regression equation. Neither information about the individual chronological age nor
information about individual age-at-clinical-onset was available for either the WTCCC of
EPIC data sets. Nevertheless, because this study analyzed only DNA-based haplotypes (which
are independent of chronological age), chronological age is not a relevant factor. It is possible,
however, that the age at disease-onset could be more relevant. Certainly, some authors have
argued that ªchildhood-onsetº MS cases might somehow be different (either genetically or
environmentally, or both) from ªadult-onsetº cases. Nevertheless, within an ªadult-onsetº MS
population (e.g., the WTCCC population), there is no evidence to suggest genetic
heterogeneity with respect to age-at-clinical-onset. Also, it is worth pointing out that many patients with
ªadult-onsetº MS, can be demonstrated to have MRI evidence of disease activity that precedes,
by many years (oftentimes decades), the clinical-onset of MS. Moreover, there is no established
(or suggested) relationship between the age-at-clinical-onset and the age of disease-onset.
Consequently, any analysis, regarding the impact of the age at disease-onset based solely upon
the age observed at the clinical-onset of disease activity, would be unreliable, even if such data
S1 File. This section considers the wide-spread occurrence of high frequency CEHs in different human populations and also how the WTCCC population differs from certain other populations of the world.
S2 File. This section develops the mathematical model for understanding the dynamics of haplotype conversion and selection as they relate to the MHC. This is the model used in the
Text to estimate the values of these two parameters from the WTCCC and EPIC data.
S3 File. This section includes the data for Figures (A–C).
S1 Table. Conserved extended haplotypes (CEHs) in the WTCCC, the EPIC, and in other
populations. This includes an estimate of the overlap (in % of the total number of CEHs) of
each ethnicity with the CEHs from another ethnicity.
S2 Table. Variations in the frequency of the 25 most common CEHs in different geo
graphic regions participating in the WTCCC and in the EPIC.
S3 Table. Common extended haplotypes (CEHs) in the EPIC population.
Conceptualization: Douglas S. Goodin.
Data curation: Douglas S. Goodin.
Formal analysis: Douglas S. Goodin, Pouya Khankhanian, Nicolas Vince.
Investigation: Douglas S. Goodin.
Methodology: Douglas S. Goodin, Pouya Khankhanian, Pierre-Antoine Gourraud, Nicolas
20 / 23
Software: Douglas S. Goodin, Pouya Khankhanian.
Supervision: Douglas S. Goodin.
Visualization: Douglas S. Goodin.
Writing ± original draft: Douglas S. Goodin.
Writing ± review & editing: Pouya Khankhanian, Pierre-Antoine Gourraud, Nicolas Vince.
21 / 23
22 / 23
1. Gourraud PA , Harbo HF , Hauser SL , Baranzini SE . ( 2012 ) The genetics of multiple sclerosis: an up-todate review . Immunol Rev 248 : 87 ± 103 . https://doi.org/10.1111/j. 1600 - 065X . 2012 . 01134 . x PMID : 22725956
2. Hofker MH , Fu J , Wijmenga C. ( 2014 ) The genome revolution and its role in understanding complex diseases . Biochim Biophys Acta 1842 : 1889 ± 1895 . https://doi.org/10.1016/j.bbadis. 2014 . 05 .002 PMID: 24834846
3. Goodin DS . The nature of genetic susceptibility to multiple sclerosis: Constraining the Possibilities . BMC Neurology 2016 ; 16 : 56 . https://doi.org/10.1186/s12883-016 -0575-6 PMID: 27117889
4. GAMES, the Transatlantic Multiple Sclerosis Genetics Cooperative . ( 2003 ) A meta-analysis of whole genome linkage screens in multiple sclerosis . J Neuroimmunol 143 : 39 ± 46 . PMID: 14575912
5. de Bakker PIW , Yelensky R , Pe'er I , Gabriel SB , Daly MJ , Altshuler D. ( 2005 ) Efficiency and power in genetic association studies . Nat Genet 37 : 1217 ± 1223 . https://doi.org/10.1038/ng1669 PMID: 16244653
6. The Wellcome Trust Case Control Consortium & The Australo-Anglo-American Spondylitis Consortium. ( 2007 ) Associations can of 14,500 nonsynonymous SNPs in four diseases identifies autoimmunity variants . Nature Genet 39 : 1329 ± 1337 . https://doi.org/10.1038/ng. 2007 .17 PMID: 17952073
7. The ANZgene Consortium. ( 2009 ) Genome-wide association study identifies new multiple sclerosis susceptibility loci on chromosomes 12 and 20 . Nature Genet 41 : 824 ± 828 . https://doi.org/10.1038/ng. 396 PMID: 19525955
8. Baranzini SE , Wang J , Gibson RA , Galwey N , Naegelin Y , Barkhof F , et al. ( 2009 ) Genome-wide association analysis of susceptibility and clinical phenotype in multiple sclerosis . Hum Mol Genet . 18 : 767 ± 778 . https://doi.org/10.1093/hmg/ddn388 PMID: 19010793
9. De Jager PL , Jia X , Wang J , de Bakker PI , Ottoboni L , Aggarwal NT , et al. ( 2009 ) Meta-analysis of genome scans and replication identify CD6, IRF8 and TNFRSF1A as new multiple sclerosis susceptibility loci . Nature Genet 41 : 776 ± 782 . https://doi.org/10.1038/ng.401 PMID: 19525953
10. Sanna S. Pitzalis M , Zoledziewska M , Zara I , Sidore C , Murru R , et al. ( 2010 ) Variants within the immunoregulatory CBLB gene are associated with multiple sclerosis . Nature Genet 42 : 495 ± 497 . https://doi. org/10.1038/ng.584 PMID: 20453840
11. The International Multiple Sclerosis Genetics Consortium & the Wellcome Trust Case Control Consortium . Genetic risk and a primary role for cell-mediated immune mechanisms in multiple sclerosis . Nature 2011 ; 476 : 214 ± 219 . https://doi.org/10.1038/nature10251 PMID: 21833088
12. International Multiple Sclerosis Genetics Consortium (IMSGC). ( 2014 ) Analysis of immune-related loci identifies 48 new susceptibility variants for multiple sclerosis Nat Genet 45 : 1353 ± 60 .
13. Dyment DA , Herrera BM , Cader MZ , Willer CJ , Lincoln MR , Sadovnick AD . et al. ( 2005 ) Complex interactions among MHC haplotypes in multiple sclerosis: susceptibility and resistance . Hum Mol Genet 14 : 2019 ± 2026 . https://doi.org/10.1093/hmg/ddi206 PMID: 15930013
14. Hafler DA , Compston A , Sawcer S , Lander ES , Daly MJ , De Jager P , et al. ( 2007 ) Risk alleles for multiple sclerosis identified by a genomewide study . N. Engl. J. Med . 357 , 851 ± 862 . https://doi.org/10.1056/ NEJMoa073493 PMID: 17660530
15. Ramagopalan SV , Anderson C , Sadovnick AD , Ebers GC . ( 2007 ) Genomewide study of multiple sclerosis . N. Engl. J. Med . 357 , 2199 ± 2200 . https://doi.org/10.1056/NEJMc072836 PMID: 18032773
16. Link J , Kockum I , Lorentzen AR , Lie BA , Celius EG , Westerlind H , et al. ( 2012 ) Importance of Human Leukocyte Antigen (HLA) Class I and II Alleles on the Risk of Multiple Sclerosis . PLoS One 7 ( 5 ): e36779. https://doi.org/10.1371/journal.pone. 0036779 PMID: 22586495
17. Patsopoulos NA , Barcellos LF , Hintzen RQ , Schaefer C , van Duijn CM , Noble JA , et al. ( 2014 ) FineMapping the Genetic Association of the Major Histocompatibility Complex in Multiple Sclerosis: HLA and Non-HLA Effects . PLoS Genet 9 ( 11 ): e1003926 .
18. Chao MJ , Barnardo MC , Lincoln MR , Ramagopalan SV , Herrera BM , Dyment DA , et al. HLA class I alleles tag HLA-DRB1*1501 haplotypes for differential risk in multiple sclerosis susceptibility . Proc Natl Acad Sci USA 2008 ; 105 : 13069 ± 74 . https://doi.org/10.1073/pnas.0801042105 PMID: 18765817
19. Lincoln MR , Ramagopalan SV , Chao MJ , Herrera BM , Deluca GC , Orton SM , et al. Epistasis among HLA-DRB1, HLA-DQA1, and HLA-DQB1 loci determines multiple sclerosis susceptibility . Proc Natl Acad Sci USA 2009 ; 106 : 7542 ±7. https://doi.org/10.1073/pnas.0812664106 PMID: 19380721
20. Multiple Sclerosis Genetics Group ( 1998 ) Linkage of the MHC to familial multiple sclerosis suggests genetic heterogeneity . Hum Molec Genet 7 : 1229 ± 1234 . PMID: 9668163
21. McElroy JP , Cree BAC , Caillier SJ , Gregersen PK , Herbert J , Khan OA , et al. Refining the association of MHC with multiple sclerosis in African Americans . Hum Mol Genet 2010 ; 19 : 3080 ± 3088 . https://doi. org/10.1093/hmg/ddq197 PMID: 20466734
22. Dikson SP , Wang K , Krantz I , Hakonarson H , Goldstein DB . ( 2010 ) Rare Variants Create Synthetic Genome-Wide Associations . PLoS Biol 8 ( 1 ): e1000294. https://doi.org/10.1371/journal.pbio. 1000294 PMID: 20126254
23. Bugawan TL , Klitz W , Blair A , Erlich HA . High-resolution HLA class I typing in the CEPH families: analysis of linkage disequilibrium among HLA loci . Tissue Antigens 2000 : 56 : 392 ± 404 . PMID: 11144287
24. Ahmad T , Neville M , Marshall SE , Armuzzi A , Mulcahy-Hawes K , Crawshaw J , et al. Haplotype-specific linkage disequilibrium patterns define the genetic topography of the human MHC . Hum Mol Genet 2003 ; 12 : 647 ± 656 . PMID: 12620970
25. Yunis EJ , Larsen CE , Fernandez-Viña M , Awdeh ZL , Romero T , Hansen JA , et al. Inheritable variable sizes of DNA stretches in the human MHC: conserved extended haplotypes and their fragments or blocks . Tissue Antigens 2003 ; 62 :1± 20 . PMID: 12859592
26. Horton R , Gibson R , Coggill P , Miretti M , Allcock RJ , Almeida J , et al. Variation analysis and gene annotation of eight MHC haplotypes: The MHC Haplotype Project . Immunogenetics 2008 ; 60 :1± 18 . https:// doi.org/10.1007/s00251-007 -0262-2 PMID: 18193213
27. WennerstroÈm A , Vlachopoulou E , Lahtela LE , Paakkanen R , Eronen KT , SeppaÈnen M , et al. Diversity of Extended HLA DRB1 Haplotypes in the Finnish Population . PLoS One 2013 ; 8 ( 11 ):e79690. https:// doi.org/10.1371/journal.pone. 0079690 PMID: 24278156
28. ZuÂñiga J , Yu N , Barquera R , Alosco S , Ohashi M , Lebedeva T , et al. HLA Class I and Class II Conserved Extended Haplotypes and Their Fragments or Blocks in Mexicans: Implications for the Study of Genetic Diversity in Admixed Populations . PLoS One 2013 ; 8 ( 9 ):e74442 https://doi.org/10.1371/ journal.pone. 0074442 PMID: 24086347
29. Goodin DS , Khankhanian P. Single Nucleotide Polymorphism (SNP ) -Strings: An Alternative Method for Assessing Genetic Associations . PLoS One 2014 ; 9 ( 4 ):e90034. https://doi.org/10.1371/journal.pone. 0090034 PMID: 24727690
30. Khankhanian P , Gourraud PA , Lizee A , Goodin DS . Haplotype-based approach to known MS-associated regions increases the amount of explained risk . J Med Genet . 2015 ; 52 : 587 ± 594 . https://doi.org/ 10.1136/jmedgenet-2015 -103071 PMID: 26185143
31. Testi M , Battarra M , Lucarelli G , et al. HLA-A-B-C- DRB1- DQB1 phased haplotypes in 124 Nigerian families indicate extreme HLA diversity and low linkage disequilibrium in Central-West Africa . Tissue Antigens 2015 ; 86 : 285 ±292 https://doi.org/10.1111/tan.12642 PMID: 26300115
32. Isobe N , Keshavan A , Gourraud PA , Zhu AH , Datta E , Schlaeger R , et al. Association of HLA Genetic Risk Burden With Disease Phenotypes in Multiple Sclerosis . JAMA Neurol May 31 , 2016 ( E-pub ahead of print).
33. Sanchez-Mazas A Djoulah M , Busson M , Le Monnier de Gouville I , Poirier JC , Dehay C , et al., 2000 A linkage disequilibrium map of the MHC region based on the analysis of 14 loci haplotypes in 50 French families . Eur J Hum Genet 2000 ; 8 : 33 ± 41 . https://doi.org/10.1038/sj.ejhg. 5200391 PMID: 10713885
34. Arnheim A , Calabrese P , Nordborg M. Hot and cold spots of recombination in the human genome: the reason we should find them and how this can be achieved . Am J Hum Genet 2003 ; 73 :5± 16 . https://doi. org/10.1086/376419 PMID: 12772086
35. Gragert L , Madbouly A , Freeman F , Maiers M . Six-locus high resolution HLA haplotype frequencies derived from mixed-resolution DNA typing for the entire US donor registry . Hum Immunol 2013 : 1313 ± 1320 . https://doi.org/10.1016/j.humimm. 2013 . 06 .025 PMID: 23806270
36. Pappas DP , Tomich A , Garnier F , Marry E , Gourraud PA . Comparison of high-resolution human leukocyte antigen haplotype frequencies in different ethnic groups: Consequences of sampling fluctuation and haplotype frequency distribution tail truncation . Hum Immunol 2015 : 374 ± 380 . https://doi.org/10. 1016/j.humimm. 2015 . 01 .029 PMID: 25637668
37. Taylan F , Altiok E. Meiotic recombinations within major histocompatibility complex of human embryos . Immunogenetics 2012 ; 64 : 839 ± 44 . https://doi.org/10.1007/s00251-012-0644-y PMID: 22893033
38. Vandiedonck C , Knight JC . The human major histocompatability complex as a paradigm in genomics research . Brief Funct Genomic Proteomic . 2009 ; 8 : 379 ± 394 . https://doi.org/10.1093/bfgp/elp010 PMID: 19468039
39. Dawkins R , Leelayuwat C , Gaudieri S , Tay G , Hui J , Cattley S , et al. Genomics of the major histocompatibility complex: haplotypes, duplication, retroviruses and disease . Immunol Rev 1999 ; 167 : 275 ± 304 . PMID: 10319268
40. Trowsdale J , Knight JC . Major histocompatibility complex genomics and human disease . Annu Rev Genomics Hum Genet 2013 ; 14 : 301 ± 323 . https://doi.org/10.1146/annurev-genom- 091212 -153455 PMID: 23875801
41. van Oosterhout C. A new theory of MHC evolution: beyond selection on the immune genes . Proc. R. Soc. B 2009 ; 276 : 657 ± 66541 . https://doi.org/10.1098/rspb. 2008 .1299 PMID: 18986972
42. Ardlie KG , Kruglyak L , Seielstad. Patterns of linkage disequilibrium in the human genome . Nat Rev Genet . 2002 ; 3 : 299 ± 309 . https://doi.org/10.1038/nrg777 PMID: 11967554
43. Paul P , Nag D , Chakraborty S. Recombination hotspots: Models and tools for detection . DNA Repair 2016 ; 40 : 47 ± 56 . https://doi.org/10.1016/j.dnarep. 2016 . 02 .005 PMID: 26991854
44. Zheng X , J Shen J , Cox C , Wakefield JC , Ehm MG , Nelson MR , et al. ( 2014 ) HIBAG±HLA genotype imputation with attribute bagging . Pharmacogenom J 14 : 192 ± 200 .
45. Gillespie JH . Population Genetics: A Precise Guide . Johns Hopkins University Press, ( Baltimore and London), 1998 .
46. IPD-IMGT/HLA Database . http://www.ebi.ac.uk/ipd/imgt/hla
47. Poser CM , Paty DW , Scheinberg L , McDonald WI , Davis FA , Ebers GC , et al. New diagnostic criteria for multiple sclerosis: guidelines for research protocols . Ann Neurol 13 , 227 ± 231 ( 1983 ). https://doi.org/ 10.1002/ana.410130302 PMID: 6847134
48. McDonald WI , Compston A , Edan G , Goodkin D , Hartung HP , Lublin FD , et al. Recommended diagnostic criteria for multiple sclerosis: guidelines from the International Panel on the diagnosis of multiple sclerosis . Ann Neurol 50 , 121 ± 127 ( 2001 ). PMID: 11456302
49. Polman CH , Reingold SC , Edan G , Filippi M , Hartung HP , Kappos L , et al. Diagnostic criteria for multiple sclerosis: 2005 revisions to the "McDonald Criteria" . Ann Neurol 58 , 840 ± 846 ( 2005 ). https://doi.org/10. 1002/ana.20703 PMID: 16283615
50. Gourraud PA , Lamiraux P , El-Kadhi N , Raffoux C , Cambon-Thomsen A . Inferred HLA haplotype information for donors from hematopoietic stem cells donor registries . Hum Immunol 2005 ; 66 : 563 ± 70 . https://doi.org/10.1016/j.humimm. 2005 . 01 .011 PMID: 15935894
51. Gourraud PA , Khankhanian P , Cereb N , Yang SY , Feolo M , Maiers M , et al. HLA diversity in the 1000 genomes dataset . PLoS One 2014 ; 9 : e9782 .
52. Delaneau O , Zagury JF , Marchini J ( 2012 ) Improved whole-chromosome phasing for disease and population genetic studies . Nat Methods 10 : 5 ± 6 .
53. Delaneau O , Marchini J , Zagury JF ( 2011 ) A linear complexity phasing method for thousands of genomes . Nat Methods 9 : 179 ± 81 . https://doi.org/10.1038/nmeth.1785 PMID: 22138821
54. Howie B , Marchini J , Stephens M ( 2011 ) Genotype imputation with thousands of genomes . G3 (Bethesda) 1 ( 6 ): 457 ± 470 .
55. Benjamini Y , Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing . J R Statist Soc B . ( 1995 ) 57 : 289 ± 300 .
56. Lydersen S , Fagerland MW , Laake P. Tutorials in biostatistics: Recommended tests for association in 2x2 tables . Statist Med . ( 2009 ) 28 : 1159 ± 1175 .
57. Price AL Patterson NJ , Plenge RM , Weinblatt ME , Shadick NA , Reich D. ( 2006 ) Principal components analysis corrects for stratification in genome-wide association studies . Nat Genet 38 : 904 ± 909 . https:// doi.org/10.1038/ng1847 PMID: 16862161