Result: Storing Mass-Spectrometry Data in Simple Databases Enables Flexible and Intuitive Exploration without Time or Space Penalties.

Title:
Storing Mass-Spectrometry Data in Simple Databases Enables Flexible and Intuitive Exploration without Time or Space Penalties.
Authors:
Kumler W; School of Oceanography, University of Washington, 1501 NE Boat St, Seattle, Washington 98195, United States.
LaRue S; Department of Mechanical Engineering, University of Washington, 3900 E Stevens Way NE, Seattle, Washington 98195, United States.
Ingalls AE; School of Oceanography, University of Washington, 1501 NE Boat St, Seattle, Washington 98195, United States.
Source:
Journal of proteome research [J Proteome Res] 2025 Dec 05; Vol. 24 (12), pp. 6174-6185. Date of Electronic Publication: 2025 Nov 24.
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: American Chemical Society; Country of Publication: United States; NLM ID: 101128775; Publication Model: Print-Electronic; Cited Medium: Internet; ISSN: 1535-3907 (Electronic); Linking ISSN: 15353893; NLM ISO Abbreviation: J Proteome Res; Subsets: MEDLINE
Imprint Name(s):
Original Publication: Washington, D.C. : American Chemical Society, c2002-
Contributed Indexing:
Keywords: SQL; benchmarking; data storage; exploratory data analysis; human-centered design; liquid chromatography; mass spectrometry
Entry Date(s):
Date Created: 20251124; Date Completed: 20251205; Latest Revision: 20251211
Update Code:
20251211
PubMed Central ID:
PMC12687358
DOI:
10.1021/acs.jproteome.5c00721
PMID:
41277771
Database:
MEDLINE

More Information

Mass spectrometry (MS) generates large data sets that are stored in increasingly optimized and complex file types, so extracting information rapidly and easily demands technical expertise. We wondered whether a simple structured query language (SQL) database could hold raw MS data and allow for easily readable queries without incurring major penalties in read time or disk space relative to other popular MS formats. Here, we describe a basic MS schema with intuitive database tables and fields that can outperform other formats for exploratory and interactive analysis across six commonly extracted data subsets: single scans (both MS1 and MS2), ion chromatograms, retention time ranges, and fragmentation searches (both precursor and fragment searches). Additionally, we compare SQLite, DuckDB, and Parquet implementations and find that they can perform these tasks in under a second, even when the files occupy over a gigabyte on disk. We believe that this tidy data schema extends naturally to most forms of MS data and offers a way to transparently query data sets while preserving computational performance.
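
To make the six query types named in the abstract concrete, the following is a minimal sketch using Python's built-in sqlite3 module. The abstract does not give the actual schema, so the table names (ms1, ms2), the column names (filename, scan_idx, rt, mz, premz, fragmz, intensity), the helper functions, and the demo values are all illustrative assumptions, not the authors' published design. It only shows how a one-row-per-data-point layout might turn each extraction into a short, readable SQL statement.

```python
import sqlite3

# Hypothetical tidy schema: one row per measured data point.
# Table and column names are assumptions made for this sketch only.
SCHEMA = """
CREATE TABLE ms1 (filename TEXT, scan_idx INTEGER, rt REAL,
                  mz REAL, intensity REAL);
CREATE TABLE ms2 (filename TEXT, scan_idx INTEGER, rt REAL,
                  premz REAL, fragmz REAL, intensity REAL);
"""

# One query per data subset listed in the abstract. For brevity these
# ignore the filename column, which a multi-file database would filter on.
QUERIES = {
    "ms1_scan": "SELECT mz, intensity FROM ms1 WHERE scan_idx = ? ORDER BY mz",
    "ms2_scan": "SELECT fragmz, intensity FROM ms2 WHERE scan_idx = ? ORDER BY fragmz",
    "chromatogram": "SELECT rt, intensity FROM ms1 WHERE mz BETWEEN ? AND ? ORDER BY rt",
    "rt_range": "SELECT scan_idx, rt, mz, intensity FROM ms1 "
                "WHERE rt BETWEEN ? AND ? ORDER BY rt, mz",
    "precursor_search": "SELECT DISTINCT scan_idx FROM ms2 "
                        "WHERE premz BETWEEN ? AND ? ORDER BY scan_idx",
    "fragment_search": "SELECT DISTINCT scan_idx FROM ms2 "
                       "WHERE fragmz BETWEEN ? AND ? ORDER BY scan_idx",
}


def build_demo_db():
    """Create an in-memory database holding a few made-up data points."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.executemany("INSERT INTO ms1 VALUES (?, ?, ?, ?, ?)", [
        ("demo.mzML", 1, 1.00, 118.0865, 1000.0),
        ("demo.mzML", 1, 1.00, 200.1000, 50.0),
        ("demo.mzML", 2, 1.50, 118.0866, 5000.0),
        ("demo.mzML", 3, 2.00, 118.0864, 2000.0),
        ("demo.mzML", 3, 2.00, 300.2000, 75.0),
    ])
    con.executemany("INSERT INTO ms2 VALUES (?, ?, ?, ?, ?, ?)", [
        ("demo.mzML", 4, 1.55, 118.0865, 58.0657, 400.0),
        ("demo.mzML", 4, 1.55, 118.0865, 59.0735, 900.0),
        ("demo.mzML", 5, 2.05, 300.2000, 58.0655, 30.0),
    ])
    # Indexes on the filtered columns let range queries avoid full scans.
    con.execute("CREATE INDEX idx_ms1_mz ON ms1 (mz)")
    con.execute("CREATE INDEX idx_ms1_rt ON ms1 (rt)")
    con.execute("CREATE INDEX idx_ms2_premz ON ms2 (premz)")
    con.execute("CREATE INDEX idx_ms2_fragmz ON ms2 (fragmz)")
    return con


def ppm_window(mz, ppm):
    """Return the (low, high) m/z bounds for a +/- ppm tolerance."""
    delta = mz * ppm / 1e6
    return mz - delta, mz + delta


def run(con, name, params):
    """Run one of the named queries and return all rows as tuples."""
    return con.execute(QUERIES[name], params).fetchall()


if __name__ == "__main__":
    con = build_demo_db()
    # Extracted ion chromatogram for m/z 118.0865 within 10 ppm.
    for rt, intensity in run(con, "chromatogram", ppm_window(118.0865, 10)):
        print(rt, intensity)
```

Because the m/z tolerance is computed in Python and passed as bound parameters, the SQL itself stays a plain range filter. The same statements should carry over to other SQL engines with little change, although that portability is an expectation of this sketch rather than something the abstract states.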