Metagenomics: Exploring unseen communities
TECHNOLOGY FEATURE METAGENOMICS
NATURE|Vol 453|29 May 2008
This year marks the tenth birthday for metagenomics — the cloning and functional
analysis of the collective genomes of previously unculturable soil microorganisms in
an attempt to reconstruct and characterize
individual community inhabitants. Since the
term was coined by Jo Handelsman and her
colleagues at the University of Wisconsin in
Madison, its scope has expanded greatly with
descriptions of the microbial inhabitants of
environments as diverse as the human gut,
the air over New York, the Sargasso Sea and
honeybee colonies. And within these communities researchers are now uncovering a wider
range of microorganisms, thanks in large part
to advances in DNA-sequencing technology.
“We can look at the metagenomic analysis
so much more deeply, at such a better cost,”
says Jane Peterson, associate director of
the Division of Extramural Research of the
National Human Genome Research Institute in Bethesda, Maryland, which recently
launched a five-year initiative to explore the
human microbiome.
[Photo caption: The 454 Life Sciences GS FLX sequencing system is used in many metagenomics projects.]
Although sequencing technology is creating opportunities for metagenomics research,
all these new data are straining downstream
analysis. “Computational analysis of metagenomic data still has quite a few outstanding
questions,” says Isidore Rigoutsos, manager
of the bioinformatics and pattern-discovery
group at IBM’s Thomas J. Watson Research
Center in Yorktown Heights, New York. For example,
genome assembly and the prediction of gene function for
high-complexity microbial communities still
pose challenges [1] (see ‘Benchmarks and standards’).
Maybe it is the promise of rapidly improving sequencing technology or the new environments being explored, but Peterson says
that she has seen a growing interest in large
metagenomics projects — particularly the
Human Microbiome Project, which aims to
unravel the microbial communities associated
with various parts of the human body, including the gut (see page 578). “People somehow
identify with the Human Microbiome Project.
It is interesting how this project, especially as
it is studying the gut, has really caught a lot of
people’s attention.”
Over the past few years, the race to sequence
DNA faster and more cheaply has been taking
BENCHMARKS AND STANDARDS
The complexity of microbial
communities can vary drastically,
from a couple of microorganisms to
thousands or even millions, making
the reconstruction of whole
genomes from some samples
tricky. “If the community is low in
complexity, it should allow one to
reconstruct genomes with high
accuracy,” says Isidore Rigoutsos,
manager of the bioinformatics
and pattern-discovery group at
IBM’s Thomas J. Watson Research
Center in Yorktown Heights, New
York. But when it comes to highly
complex communities, things are
less straightforward.
Rigoutsos and his team have
tested several genome assemblers
and gene-prediction tools on
simulated metagenomic data sets
with varying degrees of complexity.
Knowing the composition of the
community allowed the team to
benchmark and evaluate the tools.
“We found that as the
complexity increased, many of
the computational tools had
an increasingly hard time,”
says Rigoutsos. For most high-complexity samples, he says,
the genome assemblers could
not generate larger contigs,
and several contigs that were
assembled were actually chimaeric
mixtures of sequences.
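Because the composition of a simulated data set is known, a chimaeric contig can be spotted by checking which genome each of its reads came from. The Python sketch below is one minimal way to do this and is not the method used by Rigoutsos's team; the contig names, genome labels and the 0.95 purity threshold are hypothetical, and each contig is assumed to contain at least one read.

```python
from collections import Counter


def contig_purity(read_sources):
    """Return (dominant_genome, purity) for one contig.

    read_sources: list of true source-genome labels, one per read.
    purity: fraction of reads that come from the most common genome.
    """
    counts = Counter(read_sources)
    genome, n = counts.most_common(1)[0]
    return genome, n / len(read_sources)


def flag_chimeric(contigs, min_purity=0.95):
    """Return sorted names of contigs whose purity falls below min_purity."""
    return sorted(
        name
        for name, sources in contigs.items()
        if contig_purity(sources)[1] < min_purity
    )


# Hypothetical toy assembly: contig name -> true source genome of each read.
contigs = {
    "contig_1": ["genome_A"] * 20,                      # pure
    "contig_2": ["genome_A"] * 6 + ["genome_B"] * 4,    # mixed 60/40
    "contig_3": ["genome_C"] * 49 + ["genome_B"],       # 98% one genome
}

print(flag_chimeric(contigs))  # ['contig_2']
```

A purity threshold is only one possible criterion; a benchmark on real tools might also weigh contig length or where in the contig the foreign reads fall.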
For metagenomic analysis,
smaller contigs and single reads
make assigning the sequence to a
specific microorganism difficult.
“We want to be able to assign a read
of less than 1,000 nucleotides,”
says Rigoutsos, which might allow
researchers to determine species
composition from high-complexity
samples without the need to
generate larger contigs.
Rigoutsos and his colleagues
have made three simulated data
sets available to researchers
interested in testing assembly and
prediction programs.
The problem of data analysis is
not restricted to metagenomics — a
growing number of researchers are
using next-generation sequencing
platforms and generating the
quantity of data that in the past
might only have been possible
at large genome centres. Several
companies are developing software
to address this issue.
CLC bio in Cambridge,
Massachusetts, offers the CLC
Genomics Workbench, which
provides reference assemblies of
data from various next-generation
sequencing systems as well as
mutation detection. A future
version of the program will
incorporate algorithms for the de
novo assembly of Sanger as well as
next-generation sequence data.
Meanwhile, Geospiza in Seattle,
Washington, and GenomeQuest
in Westborough, Massachusetts,
are developing software to
analyse data generated by Applied
Biosystems’ SOLiD next-generation
sequencing platform.
The combination of assembly
software and data sets to
benchmark results should help
solve some of the complexity
problems associated with
metagenomics. “If you sequence
sufficiently, even 200 base-pair
reads are enough,” says Rigoutsos.
But he adds that the real question
is how many 200 base-pair reads
will be needed before we can truly
understand complex communities.
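One rough way to frame the 'how many reads' question is the Lander-Waterman model of shotgun sequencing, which is an assumption introduced here and not something the article attributes to Rigoutsos. It treats reads as landing uniformly at random, so mean coverage is c = N × L / G for N reads of length L on a genome of size G, and the expected fraction of the genome covered is 1 − e^(−c). The genome size, target coverage and abundances in this Python sketch are hypothetical.

```python
import math


def reads_for_coverage(genome_size_bp, read_length_bp, coverage, abundance=1.0):
    """Total reads needed for one community member to reach a mean coverage.

    abundance: fraction of all reads expected to come from that member
    (an idealization; real samples are not sampled this evenly).
    """
    return math.ceil(coverage * genome_size_bp / (read_length_bp * abundance))


def expected_fraction_covered(coverage):
    """Lander-Waterman estimate of the genome fraction hit by at least one read."""
    return 1.0 - math.exp(-coverage)


# Hypothetical community member: 5 Mb genome, 200 bp reads, 8x mean coverage.
dominant = reads_for_coverage(5_000_000, 200, 8, abundance=0.5)
rare = reads_for_coverage(5_000_000, 200, 8, abundance=0.001)

print(dominant)  # 400000
print(rare)      # far larger: the rare member sets the sequencing budget
print(round(expected_fraction_covered(8), 4))
```

Under these assumptions the read count scales inversely with abundance, which is one way to see why high-complexity samples with many rare members are so demanding.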
Others are finding that with
enough reads, fewer than 200
base pairs might be sufficient. Jens
Stoye from Bielefeld University in
Germany has compared a data set
of 35 base-pair reads generated on
the Genome Analyzer from Illumina
in San Diego, California, with a
454 data set for the same low-complexity sample. Although 99%
of the Genome Analyzer’s sequence
data were discarded, the
system generates up to 50 million
reads, so he could assign the species in
the sample with the same efficiency
from both data sets.
N.B.
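The arithmetic behind that comparison can be made explicit. The Python sketch below uses only the figures quoted in the box and assumes, for illustration, that the 99% discard rate applies evenly to the reads of a full 50-million-read run.

```python
total_reads = 50_000_000   # "up to 50 million reads" per run (from the text)
read_length_bp = 35        # read length quoted in the text
discarded_percent = 99     # share of sequence data discarded (from the text)

# Integer arithmetic keeps the result exact.
kept_reads = total_reads * (100 - discarded_percent) // 100
kept_bases = kept_reads * read_length_bp

print(kept_reads, kept_bases)  # 500000 17500000
```

On these assumptions, as many as half a million short reads, about 17.5 million bases, would remain for assigning species.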
Advances in sequencing technology and tools for analysis are allowing researchers to unravel the
environmental diversity of microbes faster and in greater detail than ever before. Nathan Blow reports.
centre stage. Several next-generation DNA
sequencing systems are now available, boasting
gigabase outputs for a variety of genetic applications. But when it comes to sequencing environmental samples that contain many different
microorganisms in varying amounts, the next-generation options have their limitations.
“I would say the only next-generation
sequencing technology suitable for metagenomics at the moment is the 454 system,”
says Stephan Schuster, a biochemist at Pennsylvania State University in University Park.
Schuster is not alone — almost all metagenomic studies currently being reported rely on
either 454 technology or conventional Sanger
sequencing. The main reason is simple: read
length.
Long-term tool
Developed by 454 Life Sciences in Branford,
Connecticut, the 454 system relies on an
emulsion polymerase chain reaction (PCR)
step that is coupled to pyrosequencing. Individual fragments of DNA, 300–500 base pairs
long, are attached to beads in vitro and amplified with PCR to generate millions of identical copie (...truncated)