Automatic workflow for the classification of local DNA conformations
Petr Čech (0, 3, 4), Jaromír Kukal (2, 3), Jiří Černý (1, 5), Bohdan Schneider (1, 5), Daniel Svozil (0, 4)
0 Laboratory of Informatics and Chemistry, ICT Prague, Technicka 5, Prague 6 166 28, Czech Republic
1 Institute of Biotechnology AS CR, v. v. i., Videnska 1083, Prague 4 142 00, Czech Republic
2 Faculty of Nuclear Sciences and Physical Engineering, CTU Prague, Trojanova 13, Prague 2 122 00, Czech Republic
3 Department of Computing and Control Engineering, ICT Prague, Technicka 5, Prague 6 166 28, Czech Republic
4 Laboratory of Informatics and Chemistry, ICT Prague, Technicka 5, Prague 6 166 28, Czech Republic
5 Institute of Biotechnology AS CR, v. v. i., Videnska 1083, Prague 4 142 00, Czech Republic
Background: A growing number of crystal and NMR structures reveals considerable structural polymorphism of DNA architecture, going well beyond the usual image of a double-helical molecule. DNA is highly variable, with dinucleotide steps exhibiting substantial flexibility in a sequence-dependent manner. An analysis of the conformational space of the DNA backbone and a better understanding of the conformational dependencies in DNA are therefore important for a full comprehension of DNA structural polymorphism.
Results: A detailed classification of local DNA conformations based on the technique of Fourier averaging was published in our previous work. However, this procedure requires a considerable amount of manual work. To overcome this limitation, we developed an automatic classification method that combines supervised and unsupervised approaches. The proposed workflow consists of the k-NN method followed by a nonhierarchical single-pass clustering algorithm. We applied this workflow to analyze 816 X-ray and 664 NMR DNA structures released up to February 2013. We identified and annotated six new conformers, and we assigned four of these conformers to two structurally important DNA families: guanine quadruplexes and Holliday (four-way) junctions. We also compared the populations of the assigned conformers in the datasets of X-ray and NMR structures.
Conclusions: In the present work we developed a machine learning workflow for the automatic classification of dinucleotide conformations. Dinucleotides with unassigned conformations can either be classified into one of the 24 already known classes or be flagged as unclassifiable. The proposed machine learning workflow permits the identification of new classes among so far unclassifiable data, and we identified and annotated six new conformations in the X-ray structures released since our previous analysis. The results illustrate the utility of machine learning approaches in the classification of local DNA conformations.
Background
The antiparallel double helical structure of DNA and its
self-recognition form the basis for the conservation and
the transfer of genetic information. The model of the
canonical B-DNA form proposed by Watson and Crick
[1] was later enriched by detailed structural data
from single-crystal structures of the biologically
prevailing B-form [2] and of its kin, the right-handed A-form
[3,4]. In addition, the first DNA single-crystal structure [5]
revealed atomic details of a third major form of the DNA
double helix, left-handed Z-DNA. The atomic resolution
structures of B-DNA duplexes [6] revealed the existence of
sequence-dependent structural deviations which provide
the required specificity for DNA recognition by proteins
and drugs [7]. The association of DNA with proteins is
known to induce a local deformation of the B-form toward
the A-form [8-13] in various protein-DNA complexes such
as high mobility group (HMG) proteins [14], trp
repressor/operator complex [15], TATA box binding protein
[16-18], HIV-1 reverse transcriptase [19], various DNA
polymerases [20-23], zinc finger protein [24],
hyperthermophile Sac7d protein [25], and EcoRV endonuclease [26-28].
Along the transition pathway between the B- and A-forms
[29] various intermediate B-to-A conformations were
identified [9,30-32]. The importance of conformational
substates of the DNA backbone for protein binding to the
minor groove was suggested by several analyses [13,33,34].
Besides the A-, B- and Z-forms, DNA can also adopt other
biologically relevant structures, such as single-stranded
hairpins [35], triple helices [36], three- and four-way
junctions [37,38], four-stranded G-quadruplexes [39] or parallel
helices [40]. Their existence indicates that DNA structure
is much more polymorphic than might be deduced from
the misleading simplicity of the canonical B-DNA duplex.
The base morphology in a DNA double helix is
commonly described [12,41-46] by parameters giving the mutual
position of bases in a base pair (e.g. propeller
twist or stagger) and in a base step (e.g. rise or twist)
[47]. The same parameters can also be used for
unusual DNA structures such as triple helices [48-50],
G-quadruplexes [51] or three- and four-way junctions
[52,53]. For the last two groups of structures,
specific parameters such as the G-quartet
planarity [54] or the angle between the junction arms [55]
were also defined. Another set of quantitative measures
that can be used to characterize the secondary structure of
DNA are the backbone torsional angles α, β, γ, δ, ε, ζ
together with the glycosidic torsion χ [56]. Though the
relationship between the phosphodiester backbone states
and local distortions of DNA double helix was described
in the 1980s and 1990s [57,58], the backbone was regarded as
a passive link holding bases at their positions in several
early analyses [7,59,60]. However, nowadays it is clear
that the backbone must be considered an active,
dynamic element when defining the conformational
properties of double-helical DNA [34,61-69]. The main role
of the backbone is in restricting the conformational
space available for the placement of bases, and in steric
coupling of the adjacent base steps [61]. The overall
conformational flexibility of DNA thus results from the
interplay between the optimal base positions and the
preferred conformations of the sugar-phosphate
backbone. The increasing number and quality of DNA
structures has led to several detailed analyses of the conformational
space of the DNA backbone. Most of these studies have
been based on crystal structures [32,70-73], but structures
determined by various solution-based techniques of NMR
spectroscopy have also contributed significantly to our
understanding of the biology of nucleic acids [74-76]. NMR
methods have been successfully applied to study the dynamics of
the DNA phosphodiester backbone in solution [77-82]. NMR
studies also provide evidence for the BII states in solution
and help to unravel the role of the phosphorus atom in the
BI-BII transition [68,83-87].
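To make the torsional description of the backbone concrete, the following minimal Python sketch shows how a single torsion angle could be computed from the Cartesian coordinates of four consecutively bonded atoms, and how two torsions could be compared on the circle. It is an illustration only and is not taken from the workflow described in this paper; the function names `torsion_angle` and `angular_difference` and the example coordinates are hypothetical. The sign is intended to follow the usual convention in which a clockwise rotation, viewed along the central bond, is positive.

```python
import math

def _sub(a, b):
    # Component-wise difference of two 3D points.
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def torsion_angle(p0, p1, p2, p3):
    """Signed torsion (degrees) about the p1-p2 bond for atoms p0-p1-p2-p3."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def angular_difference(a, b):
    """Smallest signed difference a - b in degrees, accounting for periodicity."""
    return (a - b + 180.0) % 360.0 - 180.0

# Hypothetical coordinates, chosen only to exercise the functions.
example = torsion_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
wrapped = angular_difference(170.0, -170.0)
```

Because torsions are periodic, any distance used to compare conformations has to treat 170° and -170° as close neighbours, which is why the difference is wrapped rather than taken directly.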
To uncover a potential role of the sugar-phosphate
backbone in the DNA structural polymorphism we have
analyzed a set of carefully selected double-helical
structures of naked and protein-bound DNA resolved at high
resolution (1.9 Å) [32] (...truncated)