Result: Developing and reusing bioinformatics data analysis pipelines using scientific workflow systems.
Original Publication: Gothenburg, Sweden : Research Network of Computational and Structural Biotechnology
Further information
Data analysis pipelines are now an established and effective means of specifying and executing bioinformatics analyses and experiments. Scripting languages (particularly Python and R) and notebooks are popular and sufficient for developing small-scale pipelines, which are often intended for a single user. It is now widely recognized, however, that they are not enough to support the development of large-scale, shareable, maintainable and reusable pipelines capable of handling large volumes of data and running on high-performance computing clusters. This review outlines the key requirements for building large-scale data pipelines and maps existing solutions to the requirements they fulfill. We then highlight the benefits of using scientific workflow systems to obtain modular, reproducible and reusable bioinformatics data analysis pipelines. Finally, we discuss current workflow reuse practices, based on an empirical study we performed on a large collection of workflows.
(© 2023 Published by Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology.)
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.