Result: Developing and reusing bioinformatics data analysis pipelines using scientific workflow systems.

Title:
Developing and reusing bioinformatics data analysis pipelines using scientific workflow systems.
Authors:
Djaffardjy M; Universite Paris-Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique, Orsay 91405, France.
Marchment G; Universite Paris-Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique, Orsay 91405, France.
Sebe C; Universite Paris-Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique, Orsay 91405, France.
Blanchet R; Nantes Université, CNRS, INSERM, l'institut du thorax, 8 quai Moncousu, Nantes F-44000, France.
Bellajhame K; PSL, Universite Paris-Dauphine, LAMSADE, Place du Maréchal de Lattre de Tassigny, Paris 75775, France.
Gaignard A; Nantes Université, CNRS, INSERM, l'institut du thorax, 8 quai Moncousu, Nantes F-44000, France.
Lemoine F; Institut Pasteur, Université Paris Cité, G5 Evolutionary Genomics of RNA Viruses, 28, rue du Dr Roux, Paris 75015, France.; Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, Paris, France, 28, rue du Dr Roux, Paris 75015, France.
Cohen-Boulakia S; Universite Paris-Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique, Orsay 91405, France.
Source:
Computational and structural biotechnology journal [Comput Struct Biotechnol J] 2023 Mar 07; Vol. 21, pp. 2075-2085. Date of Electronic Publication: 2023 Mar 07 (Print Publication: 2023).
Publication Type:
Journal Article; Review
Language:
English
Journal Info:
Publisher: Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology
Country of Publication: Netherlands
NLM ID: 101585369
Publication Model: eCollection
Cited Medium: Print
ISSN: 2001-0370 (Print)
Linking ISSN: 20010370
NLM ISO Abbreviation: Comput Struct Biotechnol J
Subsets: PubMed not MEDLINE
Imprint Name(s):
Publication: Amsterdam : Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology
Original Publication: Gothenburg, Sweden : Research Network of Computational and Structural Biotechnology
Contributed Indexing:
Keywords: Bioinformatics; Reproducibility; Reuse; Scientific workflows
Entry Date(s):
Date Created: 20230327
Latest Revision: 20230328
Update Code:
20250114
PubMed Central ID:
PMC10030817
DOI:
10.1016/j.csbj.2023.03.003
PMID:
36968012
Database:
MEDLINE

Further information

Data analysis pipelines are now established as an effective means of specifying and executing bioinformatics data analyses and experiments. Scripting languages such as Python and R, along with notebooks, are popular and sufficient for developing small-scale pipelines that are often intended for a single user. It is now widely recognized, however, that they are not sufficient to support the development of large-scale, shareable, maintainable and reusable pipelines capable of handling large volumes of data and running on high-performance computing clusters. This review outlines the key requirements for building large-scale data pipelines and provides a mapping of existing solutions that fulfill them. We then highlight the benefits of using scientific workflow systems to obtain modular, reproducible and reusable bioinformatics data analysis pipelines. Finally, we discuss current workflow reuse practices based on an empirical study we performed on a large collection of workflows.
(© 2023 Published by Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology.)

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.