NucleoSeeker: precision filtering of RNA databases to curate high-quality datasets.
Computational prediction of biomolecular structures complements wet-lab experiments, which are often complex. Unlike protein structure prediction, RNA structure prediction remains a significant challenge in bioinformatics, primarily because annotated RNA structure data are scarce and vary in quality. Many methods have trained deep learning models on these limited data, but redundancy, data leakage, and poor data quality hamper their performance. In this work, we present NucleoSeeker, a tool designed to curate high-quality, tailored datasets from the Protein Data Bank (PDB). It is a unified framework that combines multiple tools and streamlines the otherwise complicated process of data curation. It offers multiple filters at the structure, sequence, and annotation levels, giving researchers full control over data curation. We further present several use cases. In particular, we show how NucleoSeeker can be used to create a nonredundant RNA structure dataset for assessing AlphaFold3's performance in RNA structure prediction. This demonstrates NucleoSeeker's effectiveness in curating valuable, nonredundant, tailored datasets both for training novel methods and for evaluating existing ones. NucleoSeeker is easy to use, highly flexible, and can significantly increase the quality of RNA structure datasets.
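The layered filtering described in the abstract can be illustrated with a minimal sketch. This is not NucleoSeeker's code or API: the `RnaEntry` record, its field names, the thresholds, and the function names are all hypothetical, and the similarity measure is Python's `difflib.SequenceMatcher`, used here only as a crude stand-in for a proper sequence alignment or clustering tool.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class RnaEntry:
    """Hypothetical record; field names are illustrative, not NucleoSeeker's."""
    entry_id: str
    sequence: str
    resolution: float   # in angstroms; lower is better
    method: str         # experimental method label
    molecule_type: str  # annotation, e.g. "RNA" or "RNA-protein complex"


def passes_structure_filter(e, max_resolution=3.5,
                            methods=("X-RAY DIFFRACTION",)):
    # Structure level: keep well-resolved entries from accepted methods.
    return e.resolution <= max_resolution and e.method in methods


def passes_sequence_filter(e, min_len=10, max_len=500):
    # Sequence level: keep entries within a length window.
    return min_len <= len(e.sequence) <= max_len


def passes_annotation_filter(e, allowed_types=("RNA",)):
    # Annotation level: keep entries with an accepted molecule type.
    return e.molecule_type in allowed_types


def similarity(a, b):
    # Assumption: difflib ratio as a rough proxy for sequence identity.
    return SequenceMatcher(None, a, b).ratio()


def remove_redundancy(entries, max_similarity=0.8):
    # Greedy selection: visit entries from best to worst resolution and
    # keep one only if it is dissimilar to everything already kept.
    kept = []
    for e in sorted(entries, key=lambda x: x.resolution):
        if all(similarity(e.sequence, k.sequence) < max_similarity
               for k in kept):
            kept.append(e)
    return kept


def curate(entries):
    # Apply the three filter levels, then drop near-duplicate sequences.
    filtered = [
        e for e in entries
        if passes_structure_filter(e)
        and passes_sequence_filter(e)
        and passes_annotation_filter(e)
    ]
    return remove_redundancy(filtered)
```

Applying the per-entry filters before the redundancy step keeps the pairwise comparison, which is the expensive part, restricted to entries that already meet the quality criteria. The same nonredundancy idea is what guards against data leakage when a curated set is split for training and evaluation.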
(© The Author(s) 2025. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.)
Conflict of interest: None declared.