TOFU-MAaPO: fast, scalable and reproducible analysis of large metagenome sequence data from the Sequence Read Archive

Article
https://doi.org/10.1038/s41467-026-74033-9
Nature Communications | (2026) 17:5215

Received: 20 March 2025
Accepted: 27 May 2026

Eike Matthias Wacker 1, Malte Christoph Rühlemann 1,2, Andre Franke 1 & David Ellinghaus 1

1 Institute of Clinical Molecular Biology, Kiel University, Kiel, Germany. 2 Institute for Medical Microbiology and Hospital Epidemiology, Hannover Medical School, Hannover, Germany.

Metagenomic shotgun sequencing data from over 600,000 metagenomes are publicly available in repositories such as NCBI’s Sequence Read Archive (SRA). Technically advanced and easy-to-use best-practice metagenome software workflows for raw data pre-processing, assembly of metagenome-assembled genomes, and taxonomic and functional annotation of metagenome-assembled genomes are needed for reproducible analysis and harmonization of large-scale metagenomic datasets. We introduce TOFU-MAaPO (Taxonomic Or FUnctional Metagenomic Assembly and PrOfiling), a portable, automated single-command Nextflow pipeline for large-scale analysis of metagenomic short-read sequencing data. It analyzes metagenome files locally or directly from the SRA using accession or study IDs. In a benchmark against three established metagenome software pipelines, the TOFU-MAaPO workflow yielded 12%, 42% to 77% more high-quality metagenome-assembled genomes, likely reflecting the integration of multiple complementary binning tools with a unified refinement strategy. Using its assembly-free taxonomic abundance profiling module, we also automatically downloaded 16,462 uniquely identifiable and accessible human gut metagenome samples from the SRA and taxonomically annotated them against the Genome Taxonomy Database on a high-performance cluster in less than 55 hours, including download time. TOFU-MAaPO makes large metagenome projects more accessible to individual research groups and is freely available at https://github.com/ikmb/TOFU-MAaPO.

Metagenome sequencing methods enable the analysis of the collective genomes of microbial communities in biological samples. However, comprehensive and high-quality reference genome catalogs are required for robust functional characterization and taxonomic classification of microbiome data. To catalog the human gut microbiome, for example, the Unified Human Gastrointestinal Genome (UHGG) project [1] compiled more than 200,000 published reference genomes representing 4644 gut prokaryotes from multiple databases, without re-assembling from the raw data. Such collections support the detection and quantification of microorganisms in biological samples and are excellent for rough estimation of the microbial content of newly sequenced samples, but naturally do not cover all ecological niches or host- and disease-associated microbial variation. To address this, we and others have generated species-level genome resources from specific patient cohorts. For example, we previously assembled 27,745 metagenome-assembled genomes (MAGs) from raw FASTQ files of 839 gut metagenomes from a prospective inflammatory bowel disease family cohort [2], resulting in 1652 non-redundant species-level genome bins. Such catalogs provide an important basis for downstream analyses, including phenotype-specific microbiome-wide association studies (MWAS) [3]. Very large genome catalogs for different regions of the human body are expected to be published in the near future, e.g., by the Million Microbiome of Humans Project [4], which will further refine phenotype-specific studies. At the same time, the generation of such resources depends critically on robust, scalable, and reproducible computational workflows for processing and harmonizing large metagenomic datasets.
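To put the scale of the figures quoted above into perspective, the following back-of-the-envelope sketch derives a few ratios from them. It is an illustration only and not part of the published analysis: the function names are hypothetical, the counts are copied from the text, and because the runtime is stated only as "less than 55 hours", the throughput is a lower bound and the per-sample time an upper bound on the cluster-wide average.

```python
# Back-of-the-envelope ratios from figures quoted in the text.
# All derived values are illustrative; they are not reported by the authors.

SRA_SAMPLES = 16_462         # human gut metagenomes profiled from the SRA
WALL_CLOCK_HOURS = 55        # stated upper bound ("less than 55 hours")

COHORT_SAMPLES = 839         # gut metagenomes in the IBD family cohort
COHORT_MAGS = 27_745         # MAGs assembled from that cohort
COHORT_SPECIES_BINS = 1_652  # non-redundant species-level genome bins


def samples_per_hour(n_samples: int, hours: float) -> float:
    """Average number of samples completed per wall-clock hour."""
    return n_samples / hours


def seconds_per_sample(n_samples: int, hours: float) -> float:
    """Average wall-clock seconds per sample across the whole cluster."""
    return hours * 3600 / n_samples


def mean_ratio(numerator: int, denominator: int) -> float:
    """Plain ratio, e.g. MAGs per sample or MAGs per species-level bin."""
    return numerator / denominator


if __name__ == "__main__":
    print(f"samples/hour (lower bound): "
          f"{samples_per_hour(SRA_SAMPLES, WALL_CLOCK_HOURS):.1f}")
    print(f"seconds/sample (upper bound): "
          f"{seconds_per_sample(SRA_SAMPLES, WALL_CLOCK_HOURS):.1f}")
    print(f"MAGs per cohort sample: "
          f"{mean_ratio(COHORT_MAGS, COHORT_SAMPLES):.1f}")
    print(f"MAGs per species-level bin: "
          f"{mean_ratio(COHORT_MAGS, COHORT_SPECIES_BINS):.1f}")
```

Under these assumptions, the profiling run corresponds to roughly 299 samples per hour, or about 12 seconds of wall-clock time per sample when averaged over the whole cluster, and the cohort catalog averages about 33 MAGs per sample. The per-sample figure reflects parallel execution across many nodes and says nothing about the runtime of a single sample on a single node.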
In general, phenotype-specific genome catalog studies, MWAS, and other metagenome analyses place high demands on the processing and harmonization of raw sequencing data. Differences in software selection, parameterization, and computational environments can substantially reduce reproducibility. From a software perspective, integrated workflow systems such as Nextflow [5] or Snakemake [6] are therefore strongly recommended [7]. Nextflow and Snakemake improve scalability, portability, reproducibility and ease of use by supporting modular workflow design, code sharing via platforms such as GitHub, BitBucket and GitLab, automated error handling, resumption of interrupted runs, visualization of software dependencies and workflow structure, use of software containers such as Docker [7] and Apptainer (formerly Singularity) [8], and automated job control on high-performance computing (HPC) systems. From a user’s perspective, a sophisticated bioinformatics workflow for metagenomic analysis must provide, in a simple manner, quality control (QC), assembly of metagenomes, binning of contigs into species, and estimation of pathway and taxonomic abundances, either assembly-free or based on MAGs with subsequent bin refinement [9]. Analytical pipelines [10–12] and software packages [13–15] for subtasks have been published for basic QC and processing of metagenome samples, but none of these pipelines combines all the steps required above. For example, MetaWRAP [12] supports QC, taxonomic annotation, assembly, binning, and bin refinement, yet it does not feature pathway annotation. In addition, its software implementation does not integrate workflow management systems such as Nextflow or Snakemake, with corresponding limitations for parallelization and error handling. Similarly, the recently published pipeline metaFun [16], although implemented in Nextflow, requires multiple commands for different analysis steps and does not provide the same streamlined support for large-scale distributed execution on HPC systems.
ATLAS [11] provides QC, single-sample assembly, and bin refinement via Snakemake (Table 1) and is fast, but lacks software containers and the ability to perform co-assembly and assembly-free taxonomic and pathway annotation. nf-core/mag [10] is a Nextflow-based pipeline for metagenome assembly and binning that covers many of the required steps (so far without assembly-free pathway annotation; Table 1). However, even on an HPC system, nf-core/mag takes several hours to process a single metagenome, making it unsuitable for processing and assembling hundreds, thousands, or even tens of thousands of raw metagenome samples from the SRA and creating genome catalogs. The microbiome community, therefore, lacks an easy-to-use (i.e., single-command) and efficient software tool for large-scale metagenome analysis implemented in a workflow manager environment that can

Table 1 | Comparison of TOFU-MAaPO’s functionality with metagenome software tools implemented with workflow management systems and used for quality and qu (...truncated)