Application of a sensitive collection heuristic for very large protein families: Evolutionary relationship between adipose triglyceride lipase (ATGL) and classic mammalian lipases
BMC Bioinformatics
Methodology article
Georg Schneider 1
Georg Neuberger 1
Michael Wildpaner 1
Sun Tian 1
Igor Berezovsky 0
Frank Eisenhaber 1
0 Department of Chemistry and Chemical Biology, Harvard University , 12 Oxford str., M-105, 02138 Cambridge, MA , USA
1 IMP - Research Institute of Molecular Pathology , Dr. Bohr-Gasse 7, A-1030 Vienna , Republic of Austria
Background: Manually finding subtle yet statistically significant links to distantly related homologues becomes practically impossible for highly populated protein families due to the sheer number of similarity searches to be invoked and analyzed. The unclear evolutionary relationship between classical mammalian lipases and the recently discovered human adipose triglyceride lipase (ATGL; a patatin family member) is an exemplary case for such a problem.
Results: We describe an unsupervised, sensitive sequence segment collection heuristic suitable for assembling very large protein families. It is based on fan-like expanding, iterative database searches. To prevent inclusion of unrelated hits, additional criteria are introduced: minimal alignment length and overlap with starting sequence segments, finding starting sequences in reciprocal searches, and automated filtering for compositional bias and repetitive patterns. This heuristic was implemented as FAMILYSEARCHER in the ANNIE sequence analysis environment and applied to search for protein links between the classical lipase family and the patatin-like group.
Conclusion: The FAMILYSEARCHER is an efficient tool for tracing distant evolutionary relationships involving large protein families. Although classical lipases and ATGL have no obvious sequence similarity and differ with regard to fold and catalytic mechanism, homology links detected with FAMILYSEARCHER show that they are evolutionarily related. The conserved sequence parts can be narrowed down to an ancestral core module consisting of three β-strands, one α-helix and a turn containing the typical nucleophilic serine. Moreover, this ancestral module also appears in numerous enzymes that have various substrate specificities but critically rely on nucleophilic attack mechanisms.
Background
The failure to develop a rational, generally applicable cure
for obesity-related diseases can be attributed to the highly
complex regulation of energy metabolism, which is not
yet fully understood. On the other hand, considering the
historic successes in deciphering the underlying
biochemical pathways, it is assumed that the chemical
transformation steps of basic metabolites are known in their entirety.
This view is seriously questioned in light of the recent
discovery of ATGL, a protein that catalyzes the initial step of
hydrolysis of triacylglycerides at the surface of lipid
droplets in adipocytes [1]. It is surprising that the fundamental
activity of this key enzyme has escaped attention so far
[2,3]. Considering the many dozens of additional
hypothetical human protein sequences with low but
statistically significant sequence similarity to known
metabolic enzymes that can be collected with PSI-BLAST
searches [4], more such findings can be expected.
One of the key steps in energy metabolism is the
separation of fatty acids from glycerol moieties. A diverse set of
lipases performs this task in various contexts by
hydrolyzing the connecting ester-bonds [5]. One of the best
characterized lipases, pancreatic lipase, acts at the stage of food
digestion [6]. Other lipases, such as hormone sensitive
lipase or lipoprotein lipase, are involved in lipid
accumulation and release in tissue [7,8].
Most lipases share a common type of 3D structure known
as α/β-hydrolase fold, which is present in enzymes with
quite diverse substrate specificities [9,10]. The catalytic
mechanism of most lipases is reminiscent of serine
proteases as it proceeds via the nucleophilic attack of a
serine-histidine-aspartate triad [10].
The recently discovered, novel key enzyme involved in
fatty acid release from adipocytes, adipose triglyceride
lipase (ATGL) [1], does not share any direct sequence
similarity with known mammalian lipases. In fact, it appears
to belong to a protein family that is centered around
patatin, a potato storage protein with lipid acyl hydrolase
activity [11,12]. The catalytic mechanism of these
enzymes is inherently different from classic lipases as it
proceeds via a serine-aspartate dyad [13,14] as opposed to
the well described serine-histidine-aspartate triad.
In this work, we present sequence-analytic evidence that
the ATGL/patatin family and the classic mammalian
lipases represented by the human pancreatic lipase
evolved from a common ancestor. Moreover, we present a
set of key structural and sequence features that are
conserved between these two enzyme groups and also in
related protein families.
The analysis of homology relationships within large
superfamilies of protein sequences is a recurring
theme in biomolecular sequence analysis. Finding the
pancreatic lipase/ATGL relationship is just one
application for the respective methodologies. It should be noted
that detecting subtle yet statistically significant and
structurally plausible relationships in families involving
thousands of members is not a straightforward task since the
manual analysis of myriads of reports generated by
standard BLAST/PSI-BLAST [4] installations for sequence
comparisons in databases is impossible in practice. Progress in
this area was hampered by insufficiently developed tools.
Here, we developed a computer implementation of a
family searching heuristic involving: (i) Automated
invocation of fan-like iterative PSI-BLAST [4] searches with
starting sequences. (ii) Filtering of starting sequences with
various sequence-analytic methods for detecting
compositional and repetitive pattern bias. (iii) Automatic
re-detection of starting sequence segments in reciprocal searches.
(iv) Criteria for alignment length and overlap with the
starting sequence segments. (v) Automated parsing of
outputs and (vi) database-supported analysis of similarity
networks. The user-parameterized measures (ii-iv) are
designed to suppress the detection of unrelated hits for
the case of starting sequences that are thought to
represent a single globular domain, a functionally and
structurally independent elementary module. This
FAMILYSEARCHER is part of the sequence-analytic
workbench ANNIE [15] that is being developed in our
laboratory. To our knowledge, this article describes the first
software package for sequence family collection with fully
automated checks for bidirectional search criteria,
transitive hit overlap criteria and generic procedures for
masking repetitive regions that is applicable for extremely large
sequence families.
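
The collection logic outlined in points (i), (iii) and (iv) can be illustrated with a minimal Python sketch. It is not the FAMILYSEARCHER or ANNIE code: the names Segment, Hit, overlap, collect_family and toy_search, as well as all threshold values, are hypothetical and chosen only for illustration. The search argument stands in for a PSI-BLAST run that returns parsed hits, and the compositional and repeat filtering of point (ii) is assumed to have been applied to the segments beforehand.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Segment:
    seq_id: str
    start: int  # 1-based, inclusive
    end: int    # inclusive

    @property
    def length(self) -> int:
        return self.end - self.start + 1


@dataclass(frozen=True)
class Hit:
    query: Segment    # part of the query covered by the alignment
    subject: Segment  # aligned region on the database sequence
    evalue: float


def overlap(a: Segment, b: Segment) -> int:
    """Number of positions shared by two segments of the same sequence."""
    if a.seq_id != b.seq_id:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def collect_family(
    seed: Segment,
    search: Callable[[Segment], List[Hit]],  # stand-in for a PSI-BLAST run
    max_evalue: float = 1e-3,   # illustrative thresholds, not from the paper
    min_length: int = 40,
    min_overlap: float = 0.5,
    max_rounds: int = 3,
) -> Tuple[Dict[str, Segment], List[Tuple[str, str]]]:
    cache: Dict[Segment, List[Hit]] = {}

    def cached(seg: Segment) -> List[Hit]:
        if seg not in cache:
            cache[seg] = list(search(seg))
        return cache[seg]

    family = {seed.seq_id: seed}
    links: List[Tuple[str, str]] = []
    queue = deque([(seed, 0)])
    while queue:  # fan-like expansion: every accepted hit seeds a new search
        query, depth = queue.popleft()
        if depth >= max_rounds:
            continue
        needed = min_overlap * query.length
        for hit in cached(query):
            subj = hit.subject
            # criteria on significance and alignment length
            if hit.evalue > max_evalue or subj.length < min_length:
                continue
            # the alignment must overlap sufficiently with the starting segment
            if overlap(hit.query, query) < needed:
                continue
            # reciprocal search: the hit must re-detect the starting segment
            if not any(
                back.evalue <= max_evalue
                and overlap(back.subject, query) >= needed
                for back in cached(subj)
            ):
                continue
            links.append((query.seq_id, subj.seq_id))
            if subj.seq_id not in family:
                family[subj.seq_id] = subj
                queue.append((subj, depth + 1))
    return family, links


# Toy data: A and C are linked only through the intermediate sequence B.
TOY_HITS = {
    "A": [
        Hit(Segment("A", 1, 100), Segment("B", 11, 110), 1e-10),
        Hit(Segment("A", 1, 20), Segment("D", 5, 24), 1e-5),    # too short
        Hit(Segment("A", 1, 100), Segment("E", 1, 100), 1e-4),  # not reciprocal
    ],
    "B": [
        Hit(Segment("B", 11, 110), Segment("A", 1, 100), 1e-10),
        Hit(Segment("B", 11, 110), Segment("C", 1, 100), 1e-6),
    ],
    "C": [Hit(Segment("C", 1, 100), Segment("B", 11, 110), 1e-6)],
}


def toy_search(seg: Segment) -> List[Hit]:
    return TOY_HITS.get(seg.seq_id, [])


family, links = collect_family(Segment("A", 1, 100), toy_search)
print(sorted(family))
```

In this toy data set, sequence C is reached from A only via the intermediate B, while D is rejected for its short alignment and E because its own search does not return the starting segment. Caching the search results keeps the reciprocal check from repeating a search that is needed again when the hit itself becomes a starting segment.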
Results
FAMILYSEARCHER: Methodical (...truncated)