Result: Sustainable data analysis with Snakemake.

Title:
Sustainable data analysis with Snakemake.
Authors:
Mölder F; Bioinformatics and Computational Oncology, Institute for AI in Medicine (IKIM), University Hospital Essen, University of Duisburg-Essen, Essen, Germany.; Institute of Pathology, University Hospital Essen, University of Duisburg-Essen, Essen, Germany.
Jablonski KP; Swiss Institute of Bioinformatics (SIB), Basel, Switzerland.; Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland.
Letcher B; EMBL-EBI, Hinxton, UK.
Hall MB; EMBL-EBI, Hinxton, UK.
van Dyken PC; Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada.
Tomkins-Tinch CH; Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, USA.; Broad Institute of MIT and Harvard, Cambridge, USA.
Sochat V; Stanford University Research Computing Center, Stanford University, Stanford, USA.
Forster J; Bioinformatics and Computational Oncology, Institute for AI in Medicine (IKIM), University Hospital Essen, University of Duisburg-Essen, Essen, Germany.; German Cancer Consortium (DKTK, partner site Essen) and German Cancer Research Center, DKFZ, Heidelberg, Germany.
Vieira FG; Centre for Ancient Environmental Genomics, University of Copenhagen Globe Institute, Copenhagen, Denmark.
Meesters C; University of Mainz, Mainz, Germany.
Lee S; Biomedical Informatics, Harvard Medical School, Harvard University, Boston, USA.
Twardziok SO; Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health (BIH), Center for Digital Health, Berlin, Germany.
Kanitz A; Biozentrum, University of Basel, Basel, Switzerland.; SIB Swiss Institute of Bioinformatics / ELIXIR Switzerland, Lausanne, Switzerland.
VanCampen J; Earle A Chiles Research Institute, Portland, Oregon, USA.; Providence Cancer Institute, Portland, Oregon, USA.
Malladi V; Biomedical Platforms and Genomics, Microsoft Research, Redmond, USA.
Wilm A; Microsoft Singapore, Singapore, Singapore.
Holtgrewe M; Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health (BIH), Center for Digital Health, Berlin, Germany.; CUBI - Core Unit Bioinformatics, Berlin Institute of Health, Berlin, Germany.
Rahmann S; Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany.
Nahnsen S; Quantitative Biology Center (QBiC), University of Tübingen, Tübingen, Germany.
Köster J; Bioinformatics and Computational Oncology, Institute for AI in Medicine (IKIM), University Hospital Essen, University of Duisburg-Essen, Essen, Germany.; Medical Oncology, Harvard Medical School, Harvard University, Boston, USA.
Source:
F1000Research [F1000Res] 2021 Jan 18; Vol. 10, pp. 33. Date of Electronic Publication: 2021 Jan 18 (Print Publication: 2021).
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: F1000 Research Ltd; Country of Publication: England; NLM ID: 101594320; Publication Model: eCollection; Cited Medium: Internet; ISSN: 2046-1402 (Electronic); Linking ISSN: 20461402; NLM ISO Abbreviation: F1000Res; Subsets: MEDLINE
Imprint Name(s):
Original Publication: London : F1000 Research Ltd
Contributed Indexing:
Keywords: adaptability; data analysis; reproducibility; scalability; sustainability; transparency; workflow management
Entry Date(s):
Date Created: 20250926; Date Completed: 20250926; Latest Revision: 20250926
Update Code:
20250926
PubMed Central ID:
PMC8114187
DOI:
10.12688/f1000research.29032.3
PMID:
34035898.3. Version: 3. Publisher Version ID: 3. Version Date: 2025/09/23
Database:
MEDLINE

Further Information

Data analysis often entails a multitude of heterogeneous steps, from the application of various command line tools to the usage of scripting languages like R or Python for the generation of plots and tables. It is widely recognized that data analyses should ideally be conducted in a reproducible way. Reproducibility enables technical validation and regeneration of results on the original or even new data. However, reproducibility alone is by no means sufficient to deliver an analysis that is of lasting impact (i.e., sustainable) for the field, or even just one research group. We postulate that it is equally important to ensure adaptability and transparency. The former describes the ability to modify the analysis to answer extended or slightly different research questions. The latter describes the ability to understand the analysis in order to judge whether it is not only technically, but methodologically valid. Here, we analyze the properties needed for data analysis to become reproducible, adaptable, and transparent. We show how the popular workflow management system Snakemake can be used to guarantee this, and how it enables an ergonomic, combined, unified representation of all steps involved in data analysis, ranging from raw data processing, to quality control and fine-grained, interactive exploration and plotting of final results.
(Copyright: © 2025 Mölder F et al.)

No competing interests were disclosed.