ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition

https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0140644&type=printable

David Koslicki, Saikat Chatterjee, Damon Shahrivar, Alan W. Walker, Suzanna C. Francis, Louise J. Fraser, Mikko Vehkaperä, Yueheng Lan, Jukka Corander

1 Dept of Mathematics, Oregon State University, Corvallis, United States of America
2 Dept of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden
3 Microbiology Group, Rowett Institute of Nutrition and Health, University of Aberdeen, Aberdeen, United Kingdom
4 MRC Tropical Epidemiology Group, London School of Hygiene and Tropical Medicine, London, United Kingdom
5 Illumina Cambridge Ltd., Chesterford Research Park, Essex, United Kingdom
6 Dept of Electronic and Electrical Engineering, University of Sheffield, Sheffield, United Kingdom
7 Dept of Physics, Tsinghua University, Beijing, China
8 Dept of Mathematics and Statistics, University of Helsinki, Helsinki, Finland

Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging. There has been a recent surge of interest in using compressed-sensing-inspired and convex-optimization-based methods to solve the estimation problem for bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order k-mers and matching this information statistically with a taxonomically structured database.

Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation of reads is a pre-processing step: a standard K-means clustering algorithm partitions a large set of reads into subsets at reasonable computational cost, yielding several vectors of first-order statistics instead of only a single statistical summarization in terms of k-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via a mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity.

An open source, platform-independent implementation of the method in the Julia programming language is freely available at https://github.com/dkoslicki/ARK. A Matlab implementation is available at http://www.ee.kth.se/ctsoftware.

Editor: Jonathan H. Badger, National Cancer Institute, United States

Funding: This work was supported by the Swedish Research Council Linnaeus Centre ACCESS (S.C.), ERC grant 239784 (J.C.), the Academy of Finland Center of Excellence COIN (J.C.), the Academy of Finland (M.V.), the Scottish Government's Rural and Environment Science and Analytical Services Division (RESAS) (A.W.W.), and the UK MRC/DFID grant G1002369 (S.C.F.). L.J.F. received funding in the form of salary from Illumina Cambridge Ltd. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: L.J.F. received funding in the form of salary from Illumina Cambridge Ltd. This does not alter the authors' adherence to all the PLOS ONE policies on sharing data and materials.

The advent of high-throughput sequencing technologies has enabled detection of bacterial community composition at an unprecedented level of detail.
One technological approach is to produce, for each sample, a large number of reads from amplicons of the 16S rRNA gene, which enables the identification and comparison of the relative frequencies of different taxonomic units present across samples. The rapidly increasing number of reads produced per sample creates a need for fast taxonomic classification of samples. This problem has attracted considerable recent attention [1–5].

Many existing approaches to the bacterial community composition estimation problem use 16S rRNA gene amplicon sequencing, in which a large number of moderate-length reads (around 250–500 bp) are produced from each sample and then generally either clustered or classified to obtain a composition estimate of taxonomic units. In the clustering approach, reads are grouped into taxonomic units by either distance-based or probabilistic methods [6–8], and the actual taxonomic labels are assigned to the clusters afterwards by matching their consensus sequences to a reference database. In contrast to the clustering methods, the classification approach is based on usi (...truncated)
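
The aggregation step summarized in the abstract (K-means partitioning of reads, followed by one k-mer frequency vector per subset rather than a single vector per sample) can be illustrated with a minimal sketch. This is not the authors' Julia or Matlab implementation. The function names (`kmer_frequencies`, `kmeans`, `aggregate_reads`), the plain Lloyd's iteration, the random initialization, and the choice to report cluster proportions alongside the cluster means are all assumptions made for illustration. The further processing that turns these summaries into a composition estimate is not shown.

```python
from itertools import product
import random

ALPHABET = "ACGT"


def kmer_frequencies(read, k):
    """Normalized k-mer frequency vector of one read (length 4**k)."""
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    counts = [0.0] * len(index)
    for i in range(len(read) - k + 1):
        j = index.get(read[i:i + k])  # k-mers with ambiguous bases are skipped
        if j is not None:
            counts[j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(vectors, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd's algorithm (illustrative); returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, n_clusters)  # assumed initialization
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(n_clusters), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(n_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return centroids, labels


def aggregate_reads(reads, k=2, n_clusters=2, seed=0):
    """Return one (proportion, mean k-mer vector) pair per non-empty cluster."""
    vectors = [kmer_frequencies(r, k) for r in reads]
    centroids, labels = kmeans(vectors, n_clusters, seed=seed)
    summaries = []
    for c in range(n_clusters):
        size = labels.count(c)
        if size:
            summaries.append((size / len(reads), centroids[c]))
    return summaries


if __name__ == "__main__":
    toy_reads = ["AAAAAAAA", "AAAAAAAT", "CCCCCCCC", "CCCCCCCG"]  # made-up data
    for proportion, vector in aggregate_reads(toy_reads, k=1, n_clusters=2):
        print(proportion, vector)
```

With real amplicon data, each read would be hundreds of bases long and the number of clusters would be a tuning parameter. The sketch only shows how several per-cluster summaries replace the single per-sample k-mer vector.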