TOFU-MAaPO: fast, scalable and reproducible analysis of large metagenome sequence data from the Sequence Read Archive

Nature Communications, Jun 2026

Metagenomic shotgun sequencing data from over 600,000 metagenomes are publicly available in repositories such as NCBI’s Sequence Read Archive (SRA). Technically advanced and easy-to-use best-practice metagenome software workflows for raw data pre-processing, assembly of metagenome-assembled genomes, and taxonomic and functional annotation of metagenome-assembled genomes are needed for reproducible analysis and harmonization of large-scale metagenomic datasets. We introduce TOFU-MAaPO (Taxonomic Or FUnctional Metagenomic Assembly and PrOfiling), a portable, automated single-command Nextflow pipeline for large-scale analysis of metagenomic short-read sequencing data. It analyzes metagenome files locally or directly from the SRA using accession or study IDs. In a benchmark against three established metagenome software pipelines, the TOFU-MAaPO workflow yielded 12%, 42% to 77% more high-quality metagenome-assembled genomes, likely reflecting the integration of multiple complementary binning tools with a unified refinement strategy. Using its assembly-free taxonomic abundance profiling module, we also automatically downloaded 16,462 uniquely identifiable and accessible human gut metagenome samples from the SRA and taxonomically annotated them against the Genome Taxonomy Database on a high-performance cluster in less than 55 hours, including download time. TOFU-MAaPO makes large metagenome projects more accessible to individual research groups and is freely available at https://github.com/ikmb/TOFU-MAaPO.

Article
https://doi.org/10.1038/s41467-026-74033-9

Received: 20 March 2025
Accepted: 27 May 2026

Eike Matthias Wacker 1, Malte Christoph Rühlemann 1,2, Andre Franke 1 & David Ellinghaus 1
1 Institute of Clinical Molecular Biology, Kiel University, Kiel, Germany. 2 Institute for Medical Microbiology and Hospital Epidemiology, Hannover Medical School, Hannover, Germany.

Nature Communications | (2026) 17:5215

Metagenome sequencing methods enable the analysis of the collective genomes of microbial communities in biological samples. However, comprehensive and high-quality reference genome catalogs are required for robust functional characterization and taxonomic classification of microbiome data. To catalog the human gut microbiome, for example, the Unified Human Gastrointestinal Genome (UHGG) project [1] compiled more than 200,000 published reference genomes representing 4644 gut prokaryotes from multiple databases, without re-assembling from the raw data. Such collections support the detection and quantification of microorganisms in biological samples and are excellent for rough estimation of the microbial content of newly sequenced samples, but naturally do not cover all ecological niches or host- and disease-associated microbial variation. To address this, we and others have generated species-level genome resources from specific patient cohorts. For example, we previously assembled 27,745 metagenome-assembled genomes (MAGs) from raw FASTQ files of 839 gut metagenomes from a prospective inflammatory bowel disease family cohort [2], resulting in 1652 non-redundant species-level genome bins. Such catalogs provide an important basis for downstream analyses, including phenotype-specific microbiome-wide association studies (MWAS) [3]. Very large genome catalogs for different regions of the human body are expected to be published in the near future (e.g., The Million Microbiome of Humans Project [4]), which will further refine phenotype-specific studies.
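The catalog sizes quoted above imply a substantial degree of redundancy before dereplication. The following is a minimal sketch that only restates the figures from the text as ratios; the variable and function names are illustrative, and "more than 200,000" is treated as a lower bound of exactly 200,000.

```python
# Figures quoted in the paragraph above; names are illustrative only.
IBD_MAGS = 27_745          # MAGs assembled from the IBD family cohort
IBD_SAMPLES = 839          # gut metagenomes in that cohort
IBD_SPECIES_BINS = 1_652   # non-redundant species-level genome bins
UHGG_GENOMES_MIN = 200_000 # "more than 200,000", used here as a lower bound
UHGG_SPECIES = 4_644       # gut prokaryotes represented in UHGG


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, rejecting non-positive denominators."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


mags_per_sample = ratio(IBD_MAGS, IBD_SAMPLES)
mags_per_species_bin = ratio(IBD_MAGS, IBD_SPECIES_BINS)
uhgg_genomes_per_species = ratio(UHGG_GENOMES_MIN, UHGG_SPECIES)

print(f"MAGs per sample (IBD cohort):       {mags_per_sample:.1f}")           # 33.1
print(f"MAGs per species-level bin (IBD):   {mags_per_species_bin:.1f}")      # 16.8
print(f"Genomes per species (UHGG, >=):     {uhgg_genomes_per_species:.1f}")  # 43.1
```

On these numbers, each species-level bin in the cohort catalog is backed by roughly 17 MAGs on average, which is why dereplication is a necessary step before downstream association analyses.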
At the same time, the generation of such resources depends critically on robust, scalable, and reproducible computational workflows for processing and harmonizing large metagenomic datasets. In general, phenotype-specific genome catalog studies, MWAS, and other metagenome analyses place high demands on the processing and harmonization of raw sequencing data. Differences in software selection, parameterization, and computational environments can substantially reduce reproducibility. From a software perspective, integrated workflow systems such as Nextflow [5] or Snakemake [6] are therefore strongly recommended [7]. Nextflow and Snakemake improve scalability, portability, reproducibility, and ease of use by supporting modular workflow design, code sharing via platforms such as GitHub, BitBucket, and GitLab, automated error handling, resumption of interrupted runs, visualization of software dependencies and workflow structure, use of software containers such as Docker [7] and Apptainer (formerly Singularity) [8], and automated job control on high-performance computing (HPC) systems. From a user's perspective, a sophisticated bioinformatics workflow for metagenomic analysis must provide, in a simple manner, quality control (QC), assembly of metagenomes, binning of contigs into species, and estimation of pathway and taxonomic abundances, either assembly-free or based on MAGs with subsequent bin refinement [9]. Analytical pipelines [10–12] and software packages [13–15] for subtasks have been published for basic QC and processing of metagenome samples, but none of these pipelines combines all of the steps listed above. For example, MetaWRAP [12] supports QC, taxonomic annotation, assembly, binning, and bin refinement, but it does not feature pathway annotation. In addition, its software implementation does not integrate workflow management systems such as Nextflow or Snakemake, with corresponding limitations for parallelization and error handling.
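The requirement list above can be read as a simple coverage check. The sketch below is illustrative only: the step labels are shorthand chosen here, and the MetaWRAP feature set is transcribed from the sentence above rather than taken from the tool's documentation.

```python
# Shorthand labels for the steps the text says a workflow must provide.
# These labels are assumptions made for illustration, not official terms.
REQUIRED_STEPS = (
    "quality control",
    "assembly",
    "binning",
    "bin refinement",
    "taxonomic annotation",
    "pathway annotation",
)

# Feature set for MetaWRAP as described in the paragraph above.
METAWRAP_STEPS = {
    "quality control",
    "taxonomic annotation",
    "assembly",
    "binning",
    "bin refinement",
}


def missing_steps(required, provided):
    """Return the required steps that are absent from `provided`, in order."""
    return [step for step in required if step not in provided]


print(missing_steps(REQUIRED_STEPS, METAWRAP_STEPS))  # ['pathway annotation']
```

The same function can be applied to any other pipeline once its supported steps are written down as a set, which makes gaps such as the missing pathway annotation explicit.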
Similarly, the recently published pipeline metaFun [16], although implemented in Nextflow, requires multiple commands for different analysis steps and does not provide the same streamlined support for large-scale distributed execution on HPC systems. ATLAS [11] provides QC, single-sample assembly, and bin refinement via Snakemake (Table 1) and is fast, but lacks the use of software containers and the ability to perform co-assembly and assembly-free taxonomic and pathway annotations. nf-core/mag [10] is a Nextflow-based pipeline for metagenome assembly and binning that covers many of the required steps (so far without assembly-free pathway annotation; Table 1). However, even on an HPC system, nf-core/mag requires several hours to process a single metagenome, making it unsuitable for processing and assembling hundreds, thousands, or even tens of thousands of raw metagenome samples from the SRA and creating genome catalogs. The microbiome community, therefore, lacks an easy-to-use (i.e., single-command) and efficient software tool for large-scale metagenome analysis implemented in a workflow manager environment that can

Table 1 | Comparison of TOFU-MAaPO’s functionality with metagenome software tools implemented with workflow management systems and used for quality and qu (...truncated)
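To put the scaling argument into numbers, the following back-of-the-envelope sketch estimates wall-clock time for a large sample collection. The per-sample runtime of 3 hours and the 100 parallel jobs are hypothetical values standing in for "several hours" and an unspecified cluster size; the wave-based model (all jobs in a wave finish together) is a deliberate simplification. The 16,462 samples and 55 hours are the figures reported in the abstract for assembly-free taxonomic profiling, which is a different task from assembly, so the two results are not directly comparable.

```python
import math


def wall_clock_hours(n_samples: int, hours_per_sample: float, parallel_jobs: int) -> float:
    """Estimate wall-clock hours when samples run in equal-length waves.

    Simplifying assumption: every sample takes `hours_per_sample`, and
    `parallel_jobs` samples run at the same time.
    """
    if n_samples < 0 or hours_per_sample <= 0 or parallel_jobs < 1:
        raise ValueError("invalid workload parameters")
    waves = math.ceil(n_samples / parallel_jobs)
    return waves * hours_per_sample


N_SAMPLES = 16_462               # from the abstract
ASSUMED_HOURS_PER_SAMPLE = 3.0   # hypothetical stand-in for "several hours"
ASSUMED_PARALLEL_JOBS = 100      # hypothetical cluster capacity

sequential = wall_clock_hours(N_SAMPLES, ASSUMED_HOURS_PER_SAMPLE, 1)
parallel = wall_clock_hours(N_SAMPLES, ASSUMED_HOURS_PER_SAMPLE, ASSUMED_PARALLEL_JOBS)

# Lower bound on throughput implied by "less than 55 hours" in the abstract.
reported_min_throughput = N_SAMPLES / 55

print(f"Sequential, assumed runtime: {sequential:.0f} h")                 # 49386 h
print(f"100 parallel jobs, assumed:  {parallel:.0f} h")                   # 495 h
print(f"Reported profiling rate:     >{reported_min_throughput:.1f} samples/h")  # >299.3
```

Under these assumed values, even 100 concurrent jobs would need about three weeks of wall-clock time, which illustrates why per-sample runtime dominates feasibility at SRA scale.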


This is a preview of a remote PDF: https://www.nature.com/articles/s41467-026-74033-9.pdf
Article home page: https://www.nature.com/articles/s41467-026-74033-9

Eike Matthias Wacker, Malte Christoph Rühlemann, Andre Franke, David Ellinghaus. TOFU-MAaPO: fast, scalable and reproducible analysis of large metagenome sequence data from the Sequence Read Archive, Nature Communications, 2026, DOI: 10.1038/s41467-026-74033-9