Result: TaxaPLN: a taxonomy-aware augmentation strategy for microbiome-trait classification including metadata.
Further information
Background: The gut microbiome plays a crucial role in human health, making it a cornerstone of modern biomedical research. To study its structure and dynamics, machine learning models are increasingly used to identify key microbial patterns associated with disease and environmental factors, but their performance is often limited by the intrinsic complexity of microbiome data and the small size of available cohorts. In this context, data augmentation has emerged as a promising strategy to overcome these challenges by generating artificial microbiome profiles.
Results: We introduce TaxaPLN, a data augmentation method based on PLN-Tree generative models, which leverages the taxonomy and a data-driven sampler to generate realistic synthetic microbiome compositions. Additionally, we propose a conditional extension based on feature-wise linear modulation, enabling covariate-aware generation. Experiments on diverse curated microbiome datasets show that TaxaPLN preserves ecological properties and generally improves or maintains predictive performance, outperforming state-of-the-art baselines on most tasks. Furthermore, the conditional variant of TaxaPLN establishes a new benchmark for metadata-aware microbiome augmentation.
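To make the two ingredients named in the Results concrete, the following is a minimal sketch of feature-wise linear modulation applied to the latent mean of a plain (non-hierarchical) Poisson log-normal sampler. It is an illustration under stated assumptions, not the TaxaPLN implementation and not the plntree API: the function names `film` and `sample_pln`, the linear maps producing the scale and shift, the diagonal covariance, and the toy dimensions are all hypothetical, and the taxonomy-aware PLN-Tree structure is omitted entirely.

```python
import numpy as np


def film(h, c, W_gamma, b_gamma, W_beta, b_beta):
    """Feature-wise linear modulation: scale and shift each feature of h.

    The scale (gamma) and shift (beta) are computed from the covariates c.
    Using a single linear map for each is an assumption made for brevity.
    """
    gamma = c @ W_gamma + b_gamma  # per-feature scale, shape (n, d)
    beta = c @ W_beta + b_beta     # per-feature shift, shape (n, d)
    return gamma * h + beta


def sample_pln(mu, sigma, rng):
    """Draw counts from a Poisson log-normal model with diagonal covariance.

    Z ~ Normal(mu, diag(sigma^2)), then X | Z ~ Poisson(exp(Z)).
    """
    z = mu + sigma * rng.standard_normal(mu.shape)
    return rng.poisson(np.exp(z))


rng = np.random.default_rng(0)
n, d, k = 4, 3, 2  # toy sizes: samples, taxa, covariates (hypothetical)

h = rng.standard_normal((n, d))  # unconditioned latent features
c = rng.standard_normal((n, k))  # host covariates (metadata)

# Small random modulation weights; a scale bias of 1 keeps gamma near identity.
W_gamma = 0.1 * rng.standard_normal((k, d))
W_beta = 0.1 * rng.standard_normal((k, d))
b_gamma = np.ones(d)
b_beta = np.zeros(d)

mu = film(h, c, W_gamma, b_gamma, W_beta, b_beta)  # covariate-aware latent mean
counts = sample_pln(mu, 0.5 * np.ones(d), rng)     # synthetic count table, shape (n, d)
```

In this sketch the covariates only enter through the scale and shift of the latent mean, which is the core idea of feature-wise linear modulation; how the modulation is placed inside the actual PLN-Tree architecture is not specified in this record.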
Conclusion: TaxaPLN provides a model-based framework for augmenting microbiome datasets while preserving their ecological and clinical relevance. By integrating taxonomic structure and host metadata, it enhances predictive modeling across diverse real-world settings. To facilitate reproducible and scalable microbiome analysis using our method, TaxaPLN is released as an open-source Python package available on PyPI (plntree), with MIT-licensed source code hosted at https://github.com/AlexandreChaussard/PLNTree-package.
(© 2025. The Author(s).)
Declarations. Ethics approval and consent to participate: Microbiome data used in this study originate from the publicly available curatedMetagenomicData database, which aggregates datasets approved by the respective institutional review boards. No additional ethics approval was required for our work. Consent for publication: Not applicable. Conflict of interest: The authors declare no conflict of interest.