Prepared binned DNA data storage datasets for reconstruction benchmarking.
This repository includes datasets from the following publications:
1. Grass, R. N., Heckel, R., Puddu, M., Paunescu, D., & Stark, W. J. Robust chemical preservation of digital information on DNA in silica with error-correcting codes. Angewandte Chemie International Edition, 54, 8, 2552–2555 (2015).
2. Erlich, Y. & Zielinski, D. DNA fountain enables a robust and efficient storage architecture. Science, 355, 6328, 950–954 (2017).
3. Srinivasavaradhan, S. R., Gopi, S., Pfister, H. D. & Yekhanin, S. Trellis BMA: Coded Trace Reconstruction on IDS Channels for DNA Storage. In 2021 IEEE International Symposium on Information Theory (ISIT), Melbourne, Australia, 2453–2458 (2021).

The datasets are given in a binned format to enhance the reproducibility of the results presented in the following paper:
Bar-Lev, D., Orr, I., Sabary, O., Etzion, T., & Yaakobi, E. Scalable and robust DNA-based storage via coding theory and deep learning. 2024.

Detailed description of the format
The binned format was created using the binning step described in the paper ("Scalable and robust DNA-based storage via coding theory and deep learning"). Each cluster of reads appears in the file as a header followed by the reads. More specifically:
- The header consists of 2 lines: the first is the encoded sequence of the cluster, and the second is a line of 18 "*" characters that should be ignored.
- The reads in the cluster are provided after the header, one read per line.
- Each cluster ends with two empty lines.

Data processing
To ease the processing of our datasets, we also provide the following Python scripts (see https://github.com/itaiorr/Deep-DNA-based-storage):
- reads_preprocessor.py - includes our preprocessing procedure for the raw reads. The procedure detects and truncates the primers.
- binning.py - parses the file of the binned reads and creates two Python dictionaries. In the first dictionary, each key is an encoded sequence, and the value is a list of the reads in the cluster.
In the second dictionary the keys are the ...
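For readers who want to inspect the binned files without the repository scripts, the format described above can be read with a few lines of Python. The following is a minimal sketch written only from that description; it is not the authors' binning.py, and the function name parse_binned_lines is hypothetical. It assumes a non-empty line that follows a blank line (or the start of the file) is an encoded sequence, that the next line is the "*" separator, and that every further non-empty line is a read. It builds only the first dictionary (encoded sequence to list of reads).

```python
def parse_binned_lines(lines):
    """Group reads by encoded sequence (hypothetical helper, not binning.py).

    `lines` is any iterable of text lines, for example an open file handle.
    Returns a dict mapping each encoded sequence to the list of its reads.
    """
    clusters = {}
    current = None            # encoded sequence of the cluster being read
    expect_separator = False  # True right after a header's first line

    for raw in lines:
        line = raw.strip()
        if not line:
            # Blank lines terminate the current cluster.
            current = None
            continue
        if current is None:
            # First non-empty line of a cluster: the encoded sequence.
            current = line
            clusters.setdefault(current, [])
            expect_separator = True
            continue
        if expect_separator:
            expect_separator = False
            if set(line) == {"*"}:
                # Second header line: a row of "*" that should be ignored.
                continue
        # Every other non-empty line is a read of the current cluster.
        clusters[current].append(line)
    return clusters


if __name__ == "__main__":
    # Tiny in-memory example with made-up sequences, for illustration only.
    sample = (
        "ACGT\n" + "*" * 18 + "\nACGA\nACGT\n\n\n"
        "TTGA\n" + "*" * 18 + "\nTTGA\n\n\n"
    )
    for encoded, reads in parse_binned_lines(sample.splitlines()).items():
        print(encoded, len(reads))
```

To process a real file, pass an open text file handle to the function in place of the in-memory sample. Streaming over lines keeps memory use proportional to the output dictionary rather than to an extra copy of the file.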