kMermaid: Ultrafast metagenomic read assignment to protein clusters by hashing of amino acid k-mer frequencies.
Shotgun metagenomic sequencing can determine both the taxonomic and functional content of microbiomes. However, functional classification of metagenomic reads remains highly challenging, as protein mapping tools require substantial computational resources and yield ambiguous classifications when short reads map to homologous proteins originating from different bacteria. Here we introduce kMermaid, which uniquely maps bacterial short reads to taxa-agnostic clusters of homologous proteins; these assignments can then be used for downstream analysis tasks such as read quantification and pathway or global functional analysis. Using a nested hash map containing amino acid k-mer profiles as a model for protein assignment, kMermaid achieves the sensitivity of popular existing protein mapping tools while remaining highly resource-efficient. We evaluate kMermaid on simulated data and on data from human fecal samples, and demonstrate its utility for classifying reads originating from new, unseen proteins. kMermaid allows for highly accurate, unambiguous and ultrafast metagenomic read assignment into protein clusters, with a fixed memory usage, and can easily be employed on a typical computer.

Author summary: Whole-genome shotgun sequencing has allowed for the collection of a wealth of metagenomic data. Evidence that microbiomes play key roles in human health and disease is growing, but approaches for studying functional metagenomic content are still limited. Current protein mapping approaches do not allow for direct quantification of protein coding potential because short reads commonly map to similar proteins in different bacteria. Mapping metagenomic sequencing reads to proteins in such a way that a microbiome's coding potential can be quantified is a key first step toward pinpointing specific functional mechanisms or disease associations.
Here, we present a framework that first groups similar proteins together, then uniquely maps reads directly to these homologous protein groups. Our results show that by using k-mer frequencies stored in a two-layer hash map, we can sensitively classify metagenomic reads from high-depth sequencing data in only a few hours. We present our protein mapping method in an easy-to-use, resource-efficient Python package, kMermaid. kMermaid results can be directly quantified, which in turn will enable linkage of microbiome amino acid content to numerous health and disease phenotypes. [ABSTRACT FROM AUTHOR]
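To make the idea of a two-layer hash map of k-mer frequencies concrete, the following is a minimal sketch and not kMermaid's actual implementation or API. It assumes the outer key is an amino acid k-mer and the inner key is a protein cluster identifier, with the stored value being that k-mer's relative frequency within the cluster. The k-mer length, the function names (`build_index`, `assign`), the toy sequences, and the scoring rule (sum of frequencies, highest score wins) are all hypothetical choices made for illustration. Real reads are nucleotide sequences and would first need to be translated; the sketch starts from amino acid fragments to stay short.

```python
from collections import Counter, defaultdict

K = 5  # hypothetical k-mer length, chosen only for this illustration


def kmers(seq, k=K):
    """Yield every overlapping amino acid k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(clusters, k=K):
    """Build a nested map: k-mer -> {cluster id -> relative frequency}.

    `clusters` maps a cluster id to a list of member protein sequences.
    This layout is an assumption, not the published data structure.
    """
    index = defaultdict(dict)
    for cluster_id, proteins in clusters.items():
        counts = Counter(km for p in proteins for km in kmers(p, k))
        total = sum(counts.values())
        for km, n in counts.items():
            index[km][cluster_id] = n / total
    return index


def assign(fragment, index, k=K):
    """Return the single best-scoring cluster for a fragment, or None.

    Assumed scoring rule: sum the stored frequencies of every k-mer the
    fragment shares with a cluster and pick the cluster with the top score.
    """
    scores = defaultdict(float)
    for km in kmers(fragment, k):
        for cluster_id, freq in index.get(km, {}).items():
            scores[cluster_id] += freq
    if not scores:
        return None  # no k-mer of the fragment is present in the index
    return max(scores, key=scores.get)


# Toy clusters of made-up sequences, for illustration only.
clusters = {
    "cluster_A": ["MKTAYIAKQRQISFVK", "MKTAYIAKQRQLSFVK"],
    "cluster_B": ["MSDNELLKKAGVEW"],
}
index = build_index(clusters)
```

With this toy index, `assign("AYIAKQRQ", index)` resolves to `"cluster_A"` because all of its 5-mers occur only in that cluster, and a fragment sharing no k-mer with any cluster returns `None`. Because each fragment receives exactly one cluster label, per-cluster read counts can be tallied directly, which is the kind of unambiguous quantification the summary describes.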
Copyright of PLoS Computational Biology is the property of Public Library of Science and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)