Title:
SeisAug: A data augmentation python toolkit
Source:
Applied Computing and Geosciences, Vol 25, Iss , Pp 100232- (2025)
Publisher Information:
Elsevier, 2025.
Publication Year:
2025
Collection:
LCC:Geography. Anthropology. Recreation
LCC:Geology
LCC:Electronic computers. Computer science
Document Type:
Academic journal article
File Description:
electronic resource
Language:
English
ISSN:
2590-1974
DOI:
10.1016/j.acags.2025.100232
Accession Number:
edsdoj.9802e9dc3fe4700a390abbc19e986f7
Database:
Directory of Open Access Journals

Further Information

A common limitation in applying deep learning and machine learning techniques is the scarcity of labelled data, which can be addressed through data augmentation (DA). SeisAug is a Python DA toolkit that addresses this challenge in seismological studies. DA helps balance imbalanced datasets by creating more examples of under-represented classes. It also mitigates overfitting by increasing the volume of training data and introducing variability, thereby improving the model's performance on unseen data. Given the rapid advancements in deep learning for seismology, SeisAug supports extensibility by generating a substantial amount of additional data (2–6 times more), which can aid in developing a robust indigenous model. Further, this study demonstrates the role of DA in developing a robust model, using a basic two-class model that distinguishes earthquake (signal) from noise (non-earthquake). The model is trained on the original dataset and on datasets augmented 1 and 5 times, and the performance metrics of each are evaluated. The model trained on the 5-times augmented dataset performs substantially better, with an accuracy of 0.991, AUC of 0.999 and AUC-PR of 0.999, compared with an accuracy of 0.50, AUC of 0.75 and AUC-PR of 0.80 for the model trained on the original dataset. Furthermore, with all code available on GitHub, the toolkit facilitates the easy application of DA techniques, enabling end-users to enhance their seismological waveform datasets effectively and overcome the initial drawbacks posed by the scarcity of labelled data.
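
To make the idea of waveform augmentation concrete, the following is a minimal NumPy sketch of three generic transformations (random time shift, additive Gaussian noise at a target signal-to-noise ratio, and amplitude scaling) applied to a labelled set of traces. It does not reproduce SeisAug's API or the specific techniques the toolkit implements: the function names, the parameter values (a 50-sample shift, 10 dB SNR, scaling between 0.5 and 1.5) and the `n_copies` convention are all assumptions chosen for illustration.

```python
import numpy as np


def random_time_shift(waveform, max_shift, rng):
    """Circularly shift a trace by a random number of samples.

    np.roll wraps samples around the ends; a real pipeline might pad
    or re-window instead (assumption made here for brevity).
    """
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(waveform, shift, axis=-1)


def add_gaussian_noise(waveform, snr_db, rng):
    """Add white Gaussian noise at a target signal-to-noise ratio (dB)."""
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise


def random_amplitude_scale(waveform, low, high, rng):
    """Multiply a trace by a random factor drawn uniformly from [low, high)."""
    return waveform * rng.uniform(low, high)


def augment_dataset(waveforms, labels, n_copies, seed=0):
    """Return the original traces plus n_copies augmented versions of each.

    waveforms: array of shape (n_traces, n_samples)
    labels:    array of shape (n_traces,)
    The output holds (n_copies + 1) * n_traces traces; labels are
    repeated unchanged because these transformations preserve the class.
    """
    rng = np.random.default_rng(seed)
    all_x, all_y = [waveforms], [labels]
    for _ in range(n_copies):
        batch = np.stack([
            random_amplitude_scale(
                add_gaussian_noise(random_time_shift(w, 50, rng), 10.0, rng),
                0.5, 1.5, rng,
            )
            for w in waveforms
        ])
        all_x.append(batch)
        all_y.append(labels)
    return np.concatenate(all_x), np.concatenate(all_y)
```

Under this sketch's convention, calling `augment_dataset(x, y, n_copies=5)` returns six times as many traces as the input, with the original traces kept first. Whether that matches how the study counts "5 times augmented" is not stated in the abstract.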