Result: Proteomics and machine learning: Leveraging domain knowledge for feature selection in a skeletal muscle tissue meta-analysis.

Title:
Proteomics and machine learning: Leveraging domain knowledge for feature selection in a skeletal muscle tissue meta-analysis.
Authors:
Shahin-Shamsabadi A; Evolved.Bio, 280 Joseph Street, Kitchener, Ontario, Canada., Cappuccitti J; Evolved.Bio, 280 Joseph Street, Kitchener, Ontario, Canada.
Source:
Heliyon [Heliyon] 2024 Nov 29; Vol. 10 (24), pp. e40772. Date of Electronic Publication: 2024 Nov 29 (Print Publication: 2024).
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: Elsevier Ltd Country of Publication: England NLM ID: 101672560 Publication Model: eCollection Cited Medium: Print ISSN: 2405-8440 (Print) Linking ISSN: 24058440 NLM ISO Abbreviation: Heliyon Subsets: PubMed not MEDLINE
Imprint Name(s):
Original Publication: London : Elsevier Ltd, [2015]-
Contributed Indexing:
Keywords: Domain knowledge; Feature selection; Machine learning; Proteomics; Skeletal muscle tissue
Entry Date(s):
Date Created: 20241225 Latest Revision: 20250104
Update Code:
20250114
PubMed Central ID:
PMC11667615
DOI:
10.1016/j.heliyon.2024.e40772
PMID:
39720035
Database:
MEDLINE

Further Information

Omics techniques such as proteomics generate data that are crucial for understanding biological processes, but these data remain underutilized because of their high dimensionality. Proteomics research typically focuses on a limited number of datasets, which hinders cross-study comparisons, a problem that machine learning could potentially address. Despite this potential, machine learning has seen limited adoption in proteomics. Here, skeletal muscle proteomics datasets from five separate studies were combined. These studies covered in vitro models (both 2D and 3D), in vivo skeletal muscle tissue, and adjacent tissues such as tendons. The collected data were preprocessed using MaxQuant and then enriched using a Python script that fetched structural and compositional details from the UniProt and Ensembl databases. This enrichment was used to handle the high-dimensional, sparsely labeled dataset by breaking it down into five smaller categories based on cellular composition information and then training a separate Random Forest model for each category. Using biological context to interpret the data improved model performance and made tailored analysis possible by reducing dimensionality, increasing the signal-to-noise ratio, and preserving only biologically relevant features in each category. This integration of domain knowledge into data analysis and model training facilitated the discovery of new patterns while retaining critical details that are often overlooked when blind feature selection methods exclude proteins with minimal expression or variance. The approach was shown to be suitable for diverse analyses of individual as well as combined datasets within a broader biological context, ultimately leading to the identification of biologically relevant patterns.
Besides generating new biological insights, this approach can be used to perform tasks such as biomarker discovery, cluster analysis, classification, and anomaly detection more accurately, but more datasets need to be incorporated to further expand the computational capabilities of such models in clinical settings.
(© 2024 The Authors.)
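
The category-wise modeling described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the five category names, the synthetic abundance matrix, the function name `train_per_category`, and all parameter values are assumptions made for illustration, and scikit-learn's `RandomForestClassifier` stands in for whatever Random Forest implementation the study used. The idea shown is only the partitioning of protein features by an annotation-derived category, followed by one model per category.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def train_per_category(X, y, feature_categories, n_estimators=50, seed=0):
    """Fit one Random Forest per feature category.

    X: (n_samples, n_proteins) abundance matrix.
    y: (n_samples,) condition labels.
    feature_categories: one category name per column of X.
    Returns {category: (fitted model, column indices, mean CV accuracy)}.
    """
    feature_categories = np.asarray(feature_categories)
    results = {}
    for category in np.unique(feature_categories):
        # Keep only the proteins annotated with this category.
        cols = np.flatnonzero(feature_categories == category)
        model = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
        # 3-fold cross-validated accuracy on the reduced feature set.
        cv_acc = cross_val_score(model, X[:, cols], y, cv=3).mean()
        model.fit(X[:, cols], y)
        results[str(category)] = (model, cols, float(cv_acc))
    return results


# Hypothetical category names; the study's actual five categories are not
# listed in the abstract.
categories = ["membrane", "cytoplasm", "nucleus", "extracellular", "other"]
n_samples, n_per_category = 30, 8

# Synthetic stand-in for a combined, annotated proteomics matrix.
rng = np.random.default_rng(0)
feature_categories = np.repeat(categories, n_per_category)
X = rng.normal(size=(n_samples, len(feature_categories)))
y = np.repeat([0, 1, 2], n_samples // 3)  # three made-up sample conditions

# Inject a signal into one "extracellular" protein for condition 1.
signal_col = np.flatnonzero(feature_categories == "extracellular")[0]
X[y == 1, signal_col] += 3.0

results = train_per_category(X, y, feature_categories)
for name, (_, cols, acc) in results.items():
    print(f"{name}: {len(cols)} proteins, CV accuracy {acc:.2f}")
```

In a real analysis the `feature_categories` vector would come from the UniProt and Ensembl annotations attached during enrichment rather than from a fixed list, and each per-category model could then be inspected separately, for example through its `feature_importances_` attribute.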

The authors declare no conflict of interest. All expenses are covered by the authors’ institution, Evolved.Bio.