Metagenomics: Exploring unseen communities

https://www.nature.com/articles/453687a.pdf

TECHNOLOGY FEATURE METAGENOMICS
NATURE | Vol 453 | 29 May 2008

Exploring unseen communities

Advances in sequencing technology and tools for analysis are allowing researchers to unravel the environmental diversity of microbes faster and in greater detail than ever before. Nathan Blow reports.

This year marks the tenth birthday for metagenomics, the cloning and functional analysis of the collective genomes of previously unculturable soil microorganisms in an attempt to reconstruct and characterize individual community inhabitants. Since the term was coined by Jo Handelsman and her colleagues at the University of Wisconsin in Madison, its scope has expanded greatly with descriptions of the microbial inhabitants of environments as diverse as the human gut, the air over New York, the Sargasso Sea and honeybee colonies.

And within these communities researchers are now uncovering a wider range of microorganisms, thanks in large part to advances in DNA-sequencing technology. “We can look at the metagenomic analysis so much more deeply, at such a better cost,” says Jane Peterson, associate director of the Division of Extramural Research of the National Human Genome Research Institute in Bethesda, Maryland, which recently launched a five-year initiative to explore the human microbiome.

[Figure: The 454 Life Sciences GS FLX sequencing system is used in many metagenomics projects. Credit: 454 Life Sciences]

Although sequencing technology is creating opportunities for metagenomics research, all these new data are straining downstream analysis. “Computational analysis of metagenomic data still has quite a few outstanding questions,” says Isidore Rigoutsos, manager of the bioinformatics and pattern-discovery group at IBM’s Thomas J. Watson Research Center in Yorktown Heights, New York. The assembly and prediction of gene function for high-complexity microbial communities still poses challenges [1], for example (see ‘Benchmarks and standards’).

Maybe it is the promise of rapidly improving sequencing technology or the new environments being explored, but Peterson says that she has seen a growing interest in large metagenomics projects, particularly the Human Microbiome Project, which aims to unravel the microbial communities associated with various parts of the human body, including the gut (see page 578). “People somehow identify with the Human Microbiome Project. It is interesting how this project, especially as it is studying the gut, has really caught a lot of people’s attention.”

BENCHMARKS AND STANDARDS

The complexity of microbial communities can vary drastically, from a couple of microorganisms to thousands or even millions, making the reconstruction of whole genomes from some samples tricky. “If the community is low in complexity, it should allow one to reconstruct genomes with high accuracy,” says Isidore Rigoutsos, manager of the bioinformatics and pattern-discovery group at IBM’s Thomas J. Watson Research Center in Yorktown Heights, New York. But when it comes to highly complex communities, things are less straightforward.

Rigoutsos and his team have tested several genome assemblers and gene-prediction tools on simulated metagenomic data sets with varying degrees of complexity. Knowing the composition of the community allowed the team to benchmark and evaluate the tools. “We found that as the complexity increased, many of the computational tools had an increasingly hard time,” says Rigoutsos. For most high-complexity samples, he says, the genome assemblers could not generate larger contigs, and several contigs that were assembled were actually chimaeric mixtures of sequences. For metagenomic analysis, smaller contigs and single reads make assigning the sequence to a specific microorganism difficult. “We want to be able to assign a read of less than 1,000 nucleotides,” says Rigoutsos, which might allow researchers to determine species composition from high-complexity samples without the need to generate larger contigs. Rigoutsos and his colleagues have made three simulated data sets available to researchers interested in testing assembly and prediction programs.

The problem of data analysis is not restricted to metagenomics: a growing number of researchers are using next-generation sequencing platforms and generating the quantity of data that in the past might only have been possible at large genome centres. Several companies are developing software to address this issue. CLC bio in Cambridge, Massachusetts, offers the CLC Genomics Workbench, which provides reference assemblies of data from various next-generation sequencing systems as well as mutation detection. A future version of the program will incorporate algorithms for the de novo assembly of Sanger as well as next-generation sequence data. Meanwhile, Geospiza in Seattle, Washington, and GenomeQuest in Westborough, Massachusetts, are developing software to analyse data generated by the Applied Biosystems SOLiD next-generation sequencing platform.

The combination of assembly software and data sets to benchmark results should help solve some of the complexity problems associated with metagenomics. “If you sequence sufficiently, even 200 base-pair reads are enough,” says Rigoutsos. But he adds that the real question is how many 200 base-pair reads will be needed before we can truly understand complex communities.

Others are finding that with enough reads, fewer than 200 base pairs might be sufficient. Jens Stoye from Bielefeld University in Germany has compared a data set of 35 base-pair reads generated on the Genome Analyzer from Illumina in San Diego, California, with a 454 data set for the same low-complexity sample. Although 99% of the Genome Analyzer’s sequence data were discarded, because the system generates up to 50 million reads he could assign the species in the sample with the same efficiency from both data sets. N.B.

Over the past few years, the race to sequence DNA faster and more cheaply has been taking centre stage. Several next-generation DNA sequencing systems are now available, boasting gigabase outputs for a variety of genetic applications. But when it comes to sequencing environmental samples that contain many different microorganisms in varying amounts, the next-generation options have their limitations. “I would say the only next-generation sequencing technology suitable for metagenomics at the moment is the 454 system,” says Stephan Schuster, a biochemist at Pennsylvania State University in University Park. Schuster is not alone: almost all metagenomic studies currently being reported rely on either 454 technology or conventional Sanger sequencing. The main reason is simple: read length.

Long-term tool

Developed by 454 Life Sciences in Branford, Connecticut, the 454 system relies on an emulsion polymerase chain reaction (PCR) step that is coupled to pyrosequencing. Individual fragments of DNA, 300–500 base pairs long, are attached to beads in vitro and amplified with PCR to generate millions of identical copie (...truncated)