LEAping to conclusions: A computational reanalysis of late embryogenesis abundant proteins and their possible roles
BMC Bioinformatics
Michael J Wise*
Address: Department of Genetics, Cambridge University, Cambridge, U.K.
Background: The late embryogenesis abundant (LEA) proteins cover a number of loosely related groups of proteins, originally found in plants but now being found in non-plant species. Their precise function is unknown, though considerable evidence suggests that LEA proteins are involved in desiccation resistance. Using a number of statistically-based bioinformatics tools, the classification of a large set of LEA proteins, covering all Groups, is reexamined together with some previous findings. Searches based on peptide composition return proteins with similar composition to different LEA Groups; keyword clustering is then applied to reveal keywords and phrases suggestive of the Groups' properties.
Results: Previous research has suggested that glycine is characteristic of LEA proteins, but it is only highly over-represented in Groups 1 and 2, while alanine, thought characteristic of Group 2, is over-represented in Groups 3, 4 and 6 but under-represented in Groups 1 and 2. However, for LEA Groups 1, 2 and 3 it is shown that glutamine is very significantly over-represented, while cysteine, phenylalanine, isoleucine, leucine and tryptophan are significantly under-represented. There is also evidence that the Group 4 LEA proteins are more appropriately redistributed to Group 2 and Group 3. Similarly, the Group 5 proteins are better placed among the Group 3 LEA proteins.
Conclusions: There is evidence that Group 2 and Group 3 LEA proteins, though distinct, might be related. This relationship is also evident in the overlapping sets of keywords for the two Groups, emphasising alpha-helical structure and, at a larger scale, filaments, all of which fits well with experimental evidence that proteins from both Groups are natively unstructured, but become structured under stress conditions. The keywords support localisation of LEA proteins both in the nucleus and associated with the cytoskeleton, and a mode of action similar to chaperones, perhaps the cold shock chaperones, via a role in DNA-binding. In general, non-globular and low-complexity proteins, such as the LEA proteins, pose particular challenges in determining their functions and modes of action. Rather than masking off and ignoring low-complexity domains, novel tools and tool combinations are needed which are capable of analysing such proteins in their entirety.
Background
The late embryogenesis abundant (LEA) proteins cover a
number of loosely related groups of proteins whose
precise function is unknown. While considerable evidence
suggests that LEA proteins are involved in desiccation
resistance, a variety of mechanisms for achieving this end
have been proposed, including protecting cellular
structures from the effects of water loss by retention of water,
sequestration of ions, direct protection of other proteins
or membranes, or renaturation of unfolded proteins
[1-4]. LEA proteins are primarily found in plants, where they
were originally found in seeds [5-7], and then in other plant
tissues. In addition, a number of putative LEA genes have
been found in non-plant species, including the eubacteria
Haemophilus influenzae and Bacillus subtilis [8], the
extremophile Deinococcus radiodurans [9] and the nematodes
Caenorhabditis elegans and Aphelenchus avenae [10]. Most
of the literature to date on LEA proteins has been in the
form of reports on individual LEA proteins with general
surveys appearing some time ago [1,11,12]. The
somewhat more recent survey by Close [13] of Group 2 LEA
proteins also includes a discussion of predicted secondary
structure for this Group.
LEA proteins are generally grouped on the basis of their
similarity to prototypical LEA proteins from the cotton
plant Gossypium hirsutum. In the Dure naming scheme,
LEA protein groups are named after particular G. hirsutum
cDNA clones, resulting in Group names such as D7, D11,
D19, D95 and D113. Many authors since Dure, however,
use an assignment to Groups originating with [12],
though revised (and to some extent contradictory)
assignments also appear in [3] and [4]. A consensus exists
for only three LEA protein groups: Group 1 (D19),
Group 2 (also known as dehydrins, D11) and Group 3
(D7). Other LEA protein groups from [12] are Group 4
(D113), Group 5 (D29) and Group 6 (D34). Four of the
LEA protein groups are also represented by Pfam [14]
domain families:
Small Hydrophilic Plant Seed Protein (PF00477): Group 1
Dehydrin (PF00257): Group 2
LEA (PF02987): Group 3
LEA-1 (PF03760): Group 4
In addition, there are groups which do not appear in the
Bray [1] scheme: Lea5 (D73) and Lea14 (D95) [15],
although both are represented by Pfam families:
Lea5 (D73) by LEA-3, PF03242, and Lea14 (D95) by
LEA-2, PF03168.
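The correspondence between Group names, G. hirsutum cDNA clones and Pfam accessions described above can be held in a small lookup table. The following is only an illustrative sketch: the names `LeaGroup`, `LEA_GROUPS`, `groups_with_pfam` and `group_for_pfam` are hypothetical and do not come from any published tool, and Groups 5 and 6 are given no Pfam accession because none is listed in the text.

```python
from typing import Dict, List, NamedTuple, Optional


class LeaGroup(NamedTuple):
    """One LEA group: its Dure cDNA clone name and Pfam accession (if any)."""
    clone: str
    pfam: Optional[str]


# Values transcribed from the text above; None means no Pfam family is listed.
LEA_GROUPS: Dict[str, LeaGroup] = {
    "Group 1": LeaGroup("D19", "PF00477"),
    "Group 2": LeaGroup("D11", "PF00257"),
    "Group 3": LeaGroup("D7", "PF02987"),
    "Group 4": LeaGroup("D113", "PF03760"),
    "Group 5": LeaGroup("D29", None),
    "Group 6": LeaGroup("D34", None),
    "Lea5": LeaGroup("D73", "PF03242"),
    "Lea14": LeaGroup("D95", "PF03168"),
}


def groups_with_pfam(table: Dict[str, LeaGroup]) -> List[str]:
    """Return the names of groups that have a Pfam accession, sorted."""
    return sorted(name for name, group in table.items() if group.pfam is not None)


def group_for_pfam(accession: str) -> Optional[str]:
    """Return the group name for a Pfam accession, or None if it is not listed."""
    for name, group in LEA_GROUPS.items():
        if group.pfam == accession:
            return name
    return None
```

For example, `group_for_pfam("PF00257")` looks up the dehydrin family and yields its Group name, while an accession not in the table yields `None`.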
Previous work, using just amino acid percentage
composition and the Kyte-Doolittle hydrophobicity metric,
found that LEA proteins are characterised by a
preponderance of hydrophilic amino acids together with high
glycine content, resulting in their characterisation as
"hydrophilins" [16]. Certain LEA protein Groups are also
said to be rich in alanine, but deficient in cysteine and
tryptophan [3,4].
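The two measures used in that earlier work, amino acid percentage composition and mean Kyte-Doolittle hydropathy, are simple to compute. The sketch below is a minimal illustration and not the code used in [16]; the function names are hypothetical, and non-standard residue codes (X, B, Z and so on) are assumed to be skipped rather than scored.

```python
from collections import Counter
from typing import Dict

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE: Dict[str, float] = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def percent_composition(sequence: str) -> Dict[str, float]:
    """Percentage of each standard amino acid; other symbols are ignored."""
    counts = Counter(r for r in sequence.upper() if r in KYTE_DOOLITTLE)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: 100.0 * counts.get(aa, 0) / total for aa in KYTE_DOOLITTLE}


def mean_hydropathy(sequence: str) -> float:
    """Mean Kyte-Doolittle hydropathy; negative values indicate hydrophilicity."""
    values = [KYTE_DOOLITTLE[r] for r in sequence.upper() if r in KYTE_DOOLITTLE]
    if not values:
        raise ValueError("sequence contains no standard amino acids")
    return sum(values) / len(values)
```

A protein would look "hydrophilin-like" under these measures if `percent_composition(seq)["G"]` is high and `mean_hydropathy(seq)` is clearly negative; the thresholds applied in [16] are not restated here.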
However, a significant, though often overlooked, feature
of LEA proteins is that the majority are low complexity
proteins. This is amply demonstrated through the use of
the low complexity sequence demarcation tool, 0j.py [17],
which was applied, first to all the sequences longer than 40 aa in
SwissProt and SpTrEMBL (also called Swall) and then to a
database of 112 LEA proteins, which will be described
shortly. The sequences in the large database returned a
median score of 3, with 13% having a score of 0 and 32%
a score greater than 3; a low score implies that the
protein has high sequence complexity. By contrast, the
LEA sequences had a median score of 11.5, and 80%
returned a score greater than 3 (equivalent to a p-value of
1.1 × 10^-25).
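The comparison above rests on two summary statistics (the median score and the fraction of sequences scoring above 3) and a p-value. The text does not state which test produced the p-value; the sketch below assumes, purely for illustration, an exact one-sided binomial test against the background fraction of 32%. The function names are hypothetical, and the count of 90 sequences used in the usage note is an assumed rounding of 80% of 112.

```python
from math import comb
from statistics import median
from typing import Sequence, Tuple


def summarise_scores(scores: Sequence[float], threshold: float = 3) -> Tuple[float, float]:
    """Return (median score, fraction of scores strictly above threshold)."""
    if not scores:
        raise ValueError("no scores supplied")
    above = sum(1 for s in scores if s > threshold)
    return median(scores), above / len(scores)


def binomial_upper_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summed exactly term by term.

    Assumption: this is one plausible test, not necessarily the one
    used to obtain the p-value quoted in the text.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))
```

Calling `binomial_upper_tail(112, 90, 0.32)` gives the probability that at least 90 of 112 sequences score above 3 if each did so independently at the background rate. Under these assumptions the result is vanishingly small, which is consistent in spirit with the reported significance.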
Low complexity sequences pose a particular problem for
local alignment tools such as BLAST, which owe much
of their discriminative power to scoring schemes based on
the extreme value distribution [18]. For example, [19]
compares the efficacy of both BLAST and FASTA with an
implementation of the Smith-Waterman algorithm, each
both with and without the use of scoring schemes based
on the extreme-value distribution. The benefit of having
statistically based scoring schemes is conclusively
demonstrated [19]. However, it is well known that low
complexity sequences prejudice statistical scoring based on
the extreme value distribution [20]. The standard way of dealing with
low complexity regions in the context of database searches
is to mask these off in the query sequence using
applications such as SEG [21]. When SEG was run across the set
of 112 LEA proteins, 11 high complexity sequences were
returned unaltered; the remainder were mask (...truncated)