Result: Representing DNA for machine learning algorithms: A primer on one-hot, binary, and integer encodings.

Title:
Representing DNA for machine learning algorithms: A primer on one-hot, binary, and integer encodings.
Authors:
Gupta YM; Department of Biology, Faculty of Science, Naresuan University, Phitsanulok, Thailand; Center of Excellence for Innovation and Technology for Detection and Advanced Materials (ITDAM), Naresuan University, Phitsanulok, Thailand.
Kirana SN; Business Management and Languages, Faculty of Management Science, Silpakorn University, Phetchaburi, Thailand.
Homchan S; Department of Biology, Faculty of Science, Naresuan University, Phitsanulok, Thailand; Center of Excellence for Innovation and Technology for Detection and Advanced Materials (ITDAM), Naresuan University, Phitsanulok, Thailand.
Source:
Biochemistry and molecular biology education : a bimonthly publication of the International Union of Biochemistry and Molecular Biology [Biochem Mol Biol Educ] 2025 Mar-Apr; Vol. 53 (2), pp. 142-146. Date of Electronic Publication: 2024 Dec 05.
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: John Wiley & Sons; Country of Publication: United States; NLM ID: 100970605; Publication Model: Print-Electronic; Cited Medium: Internet; ISSN: 1539-3429 (Electronic); Linking ISSN: 1470-8175; NLM ISO Abbreviation: Biochem Mol Biol Educ; Subsets: MEDLINE
Imprint Name(s):
Publication: 2002- : Hoboken, NJ : John Wiley & Sons
Original Publication: Oxford, UK : Elsevier, c2000-
Grant Information:
R2567E063 Naresuan University (NU), and National Science, Research and Innovation Fund (NSRF)
Contributed Indexing:
Keywords: Jupyter notebook; bioinformatics; python; sequence encoding; teaching material
Substance Nomenclature:
9007-49-2 (DNA)
Entry Date(s):
Date Created: 20241205 Date Completed: 20250426 Latest Revision: 20250520
Update Code:
20250521
DOI:
10.1002/bmb.21870
PMID:
39633594
Database:
MEDLINE

Further Information

This short paper presents an educational approach to teaching three popular methods for encoding DNA sequences: one-hot encoding, binary encoding, and integer encoding. Aimed at bioinformatics and computational biology students, our learning intervention focuses on developing practical skills in implementing these techniques for efficient representation and analysis of genetic data. The primary goal of this study is to enhance students' understanding and practical application of DNA encoding methods, which are crucial for various computational analyses in bioinformatics. Our intervention consists of three key components: (1) a conceptual framework that contextualizes these encoding methods within broader bioinformatics applications; (2) an interactive Jupyter Notebook with Python code examples (https://github.com/yashmgupta/Representing-DNA/tree/main); and (3) a user-friendly Streamlit application (https://dnaencoding.streamlit.app/) in which students can input their own DNA sequences and visualize the different encodings. By combining a conceptual overview with practical coding and visualization tools, our approach provides a foundation for students to apply these DNA sequence encoding methods in their future work. This study contributes to bioinformatics education by offering hands-on learning resources that bridge the gap between theoretical knowledge and practical application in DNA sequence analysis, preparing students for advanced research and data analysis projects in the field.
(© 2024 International Union of Biochemistry and Molecular Biology.)
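
As an illustration of the three encodings named in the abstract, the following is a minimal sketch in plain Python. It is not taken from the authors' notebook. The function names, the A/C/G/T ordering, and the specific integer and two-bit assignments are assumptions chosen for this example, and other conventions are equally valid. Ambiguous bases such as N are not handled and would raise a KeyError.

```python
# Illustrative mappings only (assumed, not from the paper's notebook).
INTEGER_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BINARY_CODES = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def integer_encode(seq):
    """Map each base to a single integer."""
    return [INTEGER_CODES[base] for base in seq.upper()]


def binary_encode(seq):
    """Map each base to two bits and concatenate them into one flat list."""
    return [bit for base in seq.upper() for bit in BINARY_CODES[base]]


def one_hot_encode(seq):
    """Map each base to a length-4 vector with a single 1 at the base's index."""
    rows = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        row[INTEGER_CODES[base]] = 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    example = "ACGT"
    print(integer_encode(example))
    print(binary_encode(example))
    print(one_hot_encode(example))
```

The three outputs differ in size: for a sequence of length n, integer encoding yields n values, this binary scheme yields 2n bits, and one-hot encoding yields an n-by-4 matrix. One-hot encoding avoids implying an order among bases, while integer encoding assigns them an arbitrary ranking.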

The authors declare no conflicts of interest.