Using sequence data to predict the self-assembly of supramolecular collagen structures.
Article
Anna M. Puszkarska,1 Daan Frenkel,1 Lucy J. Colwell,1,2 and Melinda J. Duer1,*
1Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, United Kingdom and 2Google Research, Mountain View, California
ABSTRACT Collagen fibrils are the major constituents of the extracellular matrix, which provides structural support to vertebrate connective tissues. It is widely assumed that the superstructure of collagen fibrils is encoded in the primary sequences of
the molecular building blocks. However, the interplay between large-scale architecture and small-scale molecular interactions
makes the ab initio prediction of collagen structure challenging. Here, we propose a model that allows us to predict the periodic
structure of collagen fibers and the axial offset between the molecules, purely on the basis of simple predictive rules for the interaction between amino acid residues. With our model, we identify the sequence-dependent collagen fiber geometries with the
lowest free energy and validate the predicted geometries against the available experimental data. We propose a procedure
for searching for optimal staggering distances. Finally, we build a classification algorithm and use it to scan 11 data sets of vertebrate fibrillar collagens and predict the periodicity of the resulting assemblies. We analyze the experimentally observed variance of the optimal stagger distances across species and find that these distances, and the resulting fibrillar phenotypes, are
evolutionarily well preserved. Moreover, we observe that the energy minimum at the optimal stagger distance is broad in all
cases, suggesting a further evolutionary adaptation designed to improve the assembly kinetics. Our periodicity predictions
are in good agreement with the experimental data on collagen molecular staggering, not only for all collagen types analyzed
but also for synthetic peptides. We argue that, with our model, it becomes possible to design tailor-made, periodic collagen structures, thereby enabling the design of novel biomimetic materials based on collagen-mimetic trimers.
SIGNIFICANCE The pathway for protein self-assembly is determined by the free energy landscape coded in the
noncovalent interactions between the building blocks. We use this basic principle to develop a model that describes the
mechanisms involved in the staggering of collagen molecules in fibrillar assemblies. In this work we present a simple,
parameter-free model for collagen fibril design that allows us to predict the structure of self-assembling collagen fibers on
the basis of the amino acid sequence of the constituent α-chain subunits. We develop a classification algorithm and use it
to scan through large data sets of collagen molecules to predict the periodicity of the resulting assemblies. We argue that
the interaction model presented in this work provides a foundation for engineering of novel collagen molecules with specific
material properties for targeted applications.
Submitted March 9, 2022, and accepted for publication July 12, 2022.
*Correspondence:
Editor: Markus Buehler.
https://doi.org/10.1016/j.bpj.2022.07.019
© 2022 Biophysical Society.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Biophysical Journal 121, 3023–3033, August 16, 2022
INTRODUCTION
The material properties of connective tissues, such as tendon, skin, bone, and cartilage, are largely controlled by fibrillar assemblies of collagen proteins. Collagen molecules are long (≈300 nm), rope-like structures, formed from three monomeric α-chains twisted together into a triple helix (1). In vertebrates, there are at least 10 distinct collagen molecules, each comprising 3 monomers, drawn from 12 different α-chains, encoded by 11 genes. The primary structure of the individual α-chains determines the geometrical and biophysical parameters of the collagen helix, which in turn govern the organization of molecules within the fibril, thereby establishing interactions necessary for quaternary structures to form.
Collagen fibrils are composed of hundreds of aligned helices. The major collagens, types I, II, and III, form wide, long, unbranched fibrils, which are the dominant components of structural tissue, typically in conjunction with smaller quantities of the minor collagens, types V and XI, which are thought to act as fibril nucleators (1). TEM studies of these fibrils show periodic dark-light bands along their length with periodicity D ≈ 67 nm, attributed to the constituent molecules being longitudinally staggered relative to their neighbors by integer multiples of D (2–5). Such fibrils are found in tendons, cornea, skin, and cartilage (6–8). However, not all collagen molecular species assemble into these classical periodic fibrils. Regulatory or developmental collagen proteins do not form wide, striated fibrils under physiological conditions. These polymers are incorporated into the structurally defined suprastructure as a result of heterotypic interactions (collagen types V and XI) (9). In addition, some collagens form thin, nonbanded assemblies (types XXIV and XXVII) (10–13).
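The stagger geometry described above can be illustrated with a few lines of arithmetic. The sketch below is not part of the original analysis. It assumes the rounded values quoted in the text (a molecule length of about 300 nm and D of about 67 nm) and the commonly used picture in which neighboring molecules are offset by integer multiples of D, so that each period splits into an overlap zone and a gap zone. The function name `d_period_geometry` and its parameter names are hypothetical.

```python
import math


def d_period_geometry(molecule_length_nm: float = 300.0, d_nm: float = 67.0) -> dict:
    """Split one D-period into overlap and gap zones for a D-staggered array.

    Both default values are the rounded figures quoted in the text; they are
    illustrative assumptions, not fitted parameters.
    """
    # Molecule length expressed in units of D.
    length_in_d = molecule_length_nm / d_nm
    # Number of whole D-periods spanned by one molecule.
    full_periods = math.floor(length_in_d)
    # Remainder of the molecule beyond the whole periods: the overlap zone.
    overlap_nm = molecule_length_nm - full_periods * d_nm
    # The rest of the period is left unfilled: the gap zone.
    gap_nm = d_nm - overlap_nm
    # Axial offsets (integer multiples of D) available to neighboring molecules.
    stagger_offsets_nm = [k * d_nm for k in range(full_periods + 1)]
    return {
        "length_in_d": length_in_d,
        "full_periods": full_periods,
        "overlap_nm": overlap_nm,
        "gap_nm": gap_nm,
        "stagger_offsets_nm": stagger_offsets_nm,
    }


if __name__ == "__main__":
    geometry = d_period_geometry()
    for key, value in geometry.items():
        print(f"{key}: {value}")
```

With these rounded inputs the molecule spans about 4.48 D, which leaves an overlap zone of roughly 32 nm and a gap zone of roughly 35 nm in each period. The alternation of the two zones is the usual explanation for the dark-light banding seen in TEM.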
To unravel the design principles of collagen assembly,
we must find a mapping between the primary sequence
of the collagen trimer and the phenotypic, structural features of the collagen fibril. Given the primary sequence
of the α-chain subunits, is it possible to predict the value
of the axial offset between assembled polymers? Previous
work has provided evidence for a link between sequence
and the supramolecular structure of collagen assemblies
(14–17). In particular, interaction-based scoring systems for
linear sequences have been proposed (14,15,17). In
what follows, we use a more physically detailed model
to arrive at a simple theoretical tool to predict the observed
molecular geometry. Given the size of each collagen
monomer of around 3000 amino acid residues, and the
lack of detailed structural data, a fully atomistic (free) energy optimization procedure to model collagenous assemblies would be prohibitively expensive. Consequently, we
take a coarse-grained approach to estimate the free energy
of assembly. We make use of well-established empirical
estimates of the strength of residue-residue interactions,
based on so-called statistical contact potentials (CPs).
We integrate these CPs in a simplified representation of
collagen molecular structure. The resulting model allows
us to estimate the relative stability of various collagen arrangements.
We analyzed the primary structures of
collagen proteins that can be classified into various functional types, across several vertebrate organisms. We
used primary sequence data for collagen types for which
experimental data regarding the phenotype of higher-order
structure are available (Table 1), to establish a procedure
for periodicity p (...truncated)