Title:
Scalable transcriptomics analysis with Dask: applications in data science and machine learning.
Authors:
Moreno M; Department of Computer Science, Faculty of Sciences, University of Porto, Rua do Campo Alegre, 4169-007, Porto, Portugal.; Laboratory of Artificial Intelligence and Decision Support, INESC TEC, Rua Dr. Roberto Frias, 4200-465, Porto, Portugal.
Vilaça R; High-Assurance Software Laboratory, INESC TEC, Rua Dr. Roberto Frias, 4200-465, Porto, Portugal.; Department of Informatics, Minho Advanced Computing Center, University of Minho, Gualtar, 4710-070, Braga, Portugal.
Ferreira PG (pgferreira@fc.up.pt); Department of Computer Science, Faculty of Sciences, University of Porto, Rua do Campo Alegre, 4169-007, Porto, Portugal.; Laboratory of Artificial Intelligence and Decision Support, INESC TEC, Rua Dr. Roberto Frias, 4200-465, Porto, Portugal.; Institute of Molecular Pathology and Immunology of the University of Porto, Institute for Research and Innovation in Health (i3s), R. Alfredo Allen 208, 4200-135, Porto, Portugal.
Source:
BMC bioinformatics [BMC Bioinformatics] 2022 Nov 30; Vol. 23 (1), pp. 514. Date of Electronic Publication: 2022 Nov 30.
Publication Type:
Review; Journal Article
Language:
English
Journal Info:
Publisher: BioMed Central; Country of Publication: England; NLM ID: 100965194; Publication Model: Electronic; Cited Medium: Internet; ISSN: 1471-2105 (Electronic); Linking ISSN: 14712105; NLM ISO Abbreviation: BMC Bioinformatics; Subsets: MEDLINE
Imprint Name(s):
Original Publication: [London] : BioMed Central, 2000-
Contributed Indexing:
Keywords: Data analysis; Gene expression; Machine learning; Scalable data science; Transcriptomics
Entry Date(s):
Date Created: 20221130; Date Completed: 20221202; Latest Revision: 20221213
Update Code:
20250114
PubMed Central ID:
PMC9710082
DOI:
10.1186/s12859-022-05065-3
PMID:
36451115
Database:
MEDLINE

Further Information

Background: Gene expression studies are an important tool in biological and biomedical research. The signal carried in expression profiles helps derive signatures for the prediction, diagnosis and prognosis of different diseases. Data science, and machine learning in particular, has many applications in gene expression analysis. However, as the dimensionality of genomics datasets grows, scalable solutions become necessary.
Methods: In this paper we review the main steps and bottlenecks in machine learning pipelines, as well as the main concepts behind scalable data science, including concurrent and parallel programming. We discuss the benefits of the Dask framework and how it can be integrated with the Python scientific environment to perform data analysis in computational biology and bioinformatics.
Results: This review illustrates the role of Dask in accelerating data science applications through different case studies. Detailed documentation and code for these procedures are available at https://github.com/martaccmoreno/gexp-ml-dask.
Conclusion: By showing when and how Dask can be used in transcriptomics analysis, this review will serve as an entry point to help genomic data scientists develop more scalable data analysis procedures.
(© 2022. The Author(s).)
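
Illustrative note: the chunk-wise parallelism mentioned under Methods can be sketched with the Python standard library alone. The example below is not taken from the paper or its repository. It is a minimal sketch, assuming a small, non-empty, in-memory expression matrix with samples as rows and genes as columns; the function names (`split_rows`, `partial_sums`, `chunked_gene_means`) and the toy data are hypothetical. It shows the partition, map and combine pattern that frameworks such as Dask automate for data too large to process in one piece.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Matrix = List[List[float]]  # rows = samples, columns = genes


def split_rows(matrix: Matrix, chunk_size: int) -> List[Matrix]:
    """Partition the sample rows into consecutive chunks."""
    return [matrix[i:i + chunk_size] for i in range(0, len(matrix), chunk_size)]


def partial_sums(chunk: Matrix) -> Tuple[List[float], int]:
    """Return per-gene column sums and the number of samples in one chunk."""
    sums = [0.0] * len(chunk[0])
    for row in chunk:
        for j, value in enumerate(row):
            sums[j] += value
    return sums, len(chunk)


def chunked_gene_means(matrix: Matrix, chunk_size: int = 2, workers: int = 2) -> List[float]:
    """Compute per-gene means by reducing each chunk independently, then combining."""
    chunks = split_rows(matrix, chunk_size)
    # Map step: each chunk is reduced on a worker thread. In CPython, threads
    # do not speed up pure-Python arithmetic; the point here is the pattern.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sums, chunks))
    # Combine step: merge the small per-chunk results into global means.
    totals = [0.0] * len(matrix[0])
    count = 0
    for sums, n_samples in partials:
        count += n_samples
        for j, value in enumerate(sums):
            totals[j] += value
    return [total / count for total in totals]


# Hypothetical toy expression matrix: 4 samples x 3 genes.
expression = [
    [1.0, 10.0, 0.0],
    [3.0, 20.0, 0.0],
    [5.0, 30.0, 4.0],
    [7.0, 40.0, 4.0],
]
gene_means = chunked_gene_means(expression, chunk_size=2, workers=2)
print(gene_means)
```

Only the small per-chunk summaries are held together at the combine step, which is why the same pattern extends to matrices that do not fit in memory. Dask's array and dataframe collections express this partitioning and scheduling declaratively, so the manual chunk handling above would not normally be written by hand.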