TOFU-MAaPO: fast, scalable and reproducible analysis of large metagenome sequence data from the Sequence Read Archive
Article
https://doi.org/10.1038/s41467-026-74033-9
Received: 20 March 2025
Accepted: 27 May 2026
Eike Matthias Wacker1, Malte Christoph Rühlemann1,2, Andre Franke1 & David Ellinghaus1
Metagenomic shotgun sequencing data from over 600,000 metagenomes are
publicly available in repositories such as NCBI’s Sequence Read Archive (SRA).
Technically advanced and easy-to-use best-practice metagenome software
workflows for raw data pre-processing, assembly of metagenome-assembled
genomes, and taxonomic and functional annotation of metagenome-assembled genomes are needed for reproducible analysis and harmonization
of large-scale metagenomic datasets. We introduce TOFU-MAaPO (Taxonomic
Or FUnctional Metagenomic Assembly and PrOfiling), a portable, automated
single-command Nextflow pipeline for large-scale analysis of metagenomic
short-read sequencing data. It analyzes metagenome files locally or directly
from the SRA using accession or study IDs. In a benchmark against three
established metagenome software pipelines, the TOFU-MAaPO workflow
yielded 12%, 42%, and 77% more high-quality metagenome-assembled genomes,
likely reflecting the integration of multiple complementary binning tools with
a unified refinement strategy. Using its assembly-free taxonomic abundance
profiling module, we also automatically downloaded 16,462 uniquely identifiable and accessible human gut metagenome samples from the SRA and
taxonomically annotated them against the Genome Taxonomy Database on a
high-performance cluster in less than 55 hours, including download time.
TOFU-MAaPO makes large metagenome projects more accessible to individual
research groups and is freely available at https://github.com/ikmb/TOFU-MAaPO.
1Institute of Clinical Molecular Biology, Kiel University, Kiel, Germany. 2Institute for Medical Microbiology and Hospital Epidemiology, Hannover Medical School, Hannover, Germany.
Nature Communications | (2026)17:5215
Metagenome sequencing methods enable the analysis of the collective genomes of microbial communities in biological samples. However, comprehensive and high-quality reference genome catalogs are
required for robust functional characterization and taxonomic classification of microbiome data. To catalog the human gut
microbiome, for example, the Unified Human Gastrointestinal Genome (UHGG) project1 compiled more than 200,000 published
reference genomes representing 4644 gut prokaryotes from multiple
databases, without re-assembling from the raw data. Such collections
support the detection and quantification of microorganisms in
biological samples and are excellent for rough estimation of the
microbial content of newly sequenced samples, but naturally do not
cover all ecological niches or host- and disease-associated microbial
variation. To address this, we and others have generated species-level genome resources from specific patient cohorts. For example,
we previously assembled 27,745 metagenome-assembled genomes
(MAGs) from raw FASTQ files of 839 gut metagenomes from a prospective inflammatory bowel disease family cohort2, resulting in 1652
non-redundant species-level genome bins. Such catalogs provide an
important basis for downstream analyses, including phenotype-specific microbiome-wide association studies (MWAS)3. Very large
genome catalogs covering different regions of the human body, for example
from The Million Microbiome of Humans Project4, are expected to be
published in the near future and will further refine phenotype-specific studies. At the
same time, the generation of such resources depends critically on
robust, scalable, and reproducible computational workflows for
processing and harmonizing large metagenomic datasets.
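As a rough illustration of the scale involved, the figures quoted above can be turned into simple per-sample rates. The following is a minimal sketch that uses only numbers stated in this article; the function and variable names are illustrative and not part of TOFU-MAaPO. Because the profiling run is reported as taking "less than 55 hours", the throughput is a lower bound, and the time per sample is aggregate wall-clock time across the cluster rather than the compute time of any single job.

```python
# Back-of-the-envelope scale figures, using only numbers quoted in the text.
# All names here are illustrative; nothing below is part of TOFU-MAaPO.


def per_unit(total: float, units: float) -> float:
    """Return total divided by units, rejecting a non-positive denominator."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total / units


# Cohort-specific catalog (inflammatory bowel disease family cohort).
MAGS = 27_745
METAGENOMES = 839
SPECIES_BINS = 1_652

# Assembly-free profiling run reported in the abstract.
SRA_SAMPLES = 16_462
MAX_HOURS = 55  # reported as "less than 55 hours", so rates are bounds

mags_per_metagenome = per_unit(MAGS, METAGENOMES)
mags_per_species_bin = per_unit(MAGS, SPECIES_BINS)
min_samples_per_hour = per_unit(SRA_SAMPLES, MAX_HOURS)
max_seconds_per_sample = per_unit(MAX_HOURS * 3600, SRA_SAMPLES)

print(f"MAGs per metagenome:            {mags_per_metagenome:.1f}")
print(f"MAGs per species-level bin:     {mags_per_species_bin:.1f}")
print(f"Samples per hour (lower bound): {min_samples_per_hour:.0f}")
print(f"Seconds per sample (upper bound, wall-clock): {max_seconds_per_sample:.1f}")
```

These ratios only summarize the reported totals; they say nothing about how runtime is distributed across individual samples or pipeline steps.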
In general, phenotype-specific genome catalog studies, MWAS,
and other metagenome analyses place high demands on the processing and harmonization of raw sequencing data. Differences in software selection, parameterization, and computational environments
can substantially reduce reproducibility. From a software perspective,
integrated workflow systems such as Nextflow5 or Snakemake6 are
therefore strongly recommended7. Nextflow and Snakemake improve
scalability, portability, reproducibility and ease of use by supporting
modular workflow design, code sharing via platforms such as GitHub,
BitBucket and GitLab, automated error handling, resumption of
interrupted runs, visualization of software dependencies and workflow
structure, use of software containers such as Docker7 and Apptainer
(formerly Singularity)8, and automated job control on high-performance
computing (HPC) systems. From a user’s perspective, a
sophisticated bioinformatics workflow for metagenomic analysis must
provide, in a simple manner, quality control (QC), assembly of metagenomes, binning of
contigs into species, and estimation of pathway and taxonomic abundances either assembly-free or based on MAGs with subsequent bin
refinement9. Analytical pipelines10–12 and software
packages13–15 for subtasks have been published for basic QC and processing of metagenome samples, but none of these pipelines combines
all the steps listed above. For example, MetaWRAP12 supports QC,
taxonomic annotation, assembly, binning, and bin refinement. Yet, it
does not feature pathway annotation. In addition, its software implementation does not integrate workflow management systems such as
Nextflow or Snakemake, with corresponding limitations for parallelization and error handling. Similarly, the recently published pipeline
metaFun16, although implemented in Nextflow, requires multiple
commands for different analysis steps and does not provide the same
streamlined support for large-scale distributed execution on HPC
systems. ATLAS11 provides QC, single-sample assembly, and bin
refinement via Snakemake (Table 1) and is fast, but lacks the use of
software containers and the ability to perform co-assembly and
assembly-free taxonomic and pathway annotations. nf-core/mag10 is a
Nextflow-based pipeline for metagenome assembly and binning that
covers many of the required steps (so far without assembly-free
pathway annotation; Table 1). However, even on an HPC system, nf-core/mag
takes several hours to process a single metagenome,
making it unsuitable for processing and assembling hundreds, thousands, or even tens of thousands of raw metagenome samples from the SRA and for creating genome catalogs. The
microbiome community, therefore, lacks an easy-to-use (i.e., single-command) and efficient software tool for large-scale metagenome
analysis implemented in a workflow manager environment that can
Table 1 | Comparison of TOFU-MAaPO’s functionality with metagenome software tools implemented with workflow management systems and used for quality and qu (...truncated)