Building a searchable online corpus of Australian and New Zealand aligned speech.
Advances in automatic speech recognition technology, increases in bandwidth availability, and the widespread use of video streaming and sharing platforms have opened new horizons for corpus phonetics. CoANZSE Audio, a searchable online version of the Corpus of Australian and New Zealand Spoken English, provides access to over 195 million words of transcribed speech from videos uploaded to YouTube by councils and other local government entities in Australia and New Zealand. Audio and forced alignment files are also available, making the resource suitable for the investigation of a range of research questions pertaining to morphosyntax, phonetics, and discourse. The resource, which is freely available via login through CLARIN, Europe's main language resources infrastructure network, was created with open-source tools and software: yt-dlp, a Python library for collecting data from video and streaming websites; the Montreal Forced Aligner, a recent neural network alignment suite; and Parselmouth-Praat, Python bindings for the Praat acoustic analysis software. The website is powered by BlackLab, which combines a powerful search engine based on Apache Lucene with an intuitive web frontend. CoANZSE Audio may be useful for the investigation of regional differentiation of language features and, with additional annotation, of differences in feature use according to social or demographic groups. Recent applications have included studies of double modals, a rare syntactic feature, and of apology sequences. The nature of the audio and alignment data may make the resource especially suitable for the study of regional phonetic variation. Furthermore, the methods used to create the resource may be of interest to researchers seeking to adopt a pipeline approach for the creation of specialized corpora from publicly available online content. [ABSTRACT FROM AUTHOR]
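As a rough illustration of the pipeline approach the abstract describes, the following Python sketch strings the three named tools together. It is not the authors' code: the function names, directory layout, choice of command-line flags, and the `english_us_arpa` dictionary and acoustic model names are all assumptions made for this example, and the models actually used for Australian and New Zealand speech may differ. The command builders only assemble argument lists; running them requires yt-dlp, ffmpeg, and the Montreal Forced Aligner to be installed, and the formant helper requires the third-party `praat-parselmouth` package.

```python
from pathlib import Path


def ytdlp_command(url: str, out_dir: Path) -> list[str]:
    """Build a yt-dlp call that saves WAV audio plus automatic English captions.

    The flag selection is an assumption for this sketch, not the authors' setup.
    """
    return [
        "yt-dlp",
        "--extract-audio", "--audio-format", "wav",   # keep audio only, as WAV
        "--write-auto-subs", "--sub-langs", "en",     # automatic captions, English
        "--output", str(out_dir / "%(id)s.%(ext)s"),  # name files by video id
        url,
    ]


def mfa_align_command(
    corpus_dir: Path,
    out_dir: Path,
    dictionary: str = "english_us_arpa",      # assumed pretrained dictionary
    acoustic_model: str = "english_us_arpa",  # assumed pretrained acoustic model
) -> list[str]:
    """Build an `mfa align` call: corpus, dictionary, acoustic model, output."""
    return ["mfa", "align", str(corpus_dir), dictionary, acoustic_model, str(out_dir)]


def f1_f2_at_midpoint(wav_path: Path, start: float, end: float) -> tuple[float, float]:
    """Measure F1 and F2 (Hz) at the temporal midpoint of an aligned segment."""
    import parselmouth  # imported lazily: third-party, pip install praat-parselmouth

    formants = parselmouth.Sound(str(wav_path)).to_formant_burg()
    midpoint = (start + end) / 2
    return (
        formants.get_value_at_time(1, midpoint),
        formants.get_value_at_time(2, midpoint),
    )
```

Each command list could be passed to `subprocess.run(cmd, check=True)`. The segment start and end times given to `f1_f2_at_midpoint` would come from the interval boundaries in the alignment files that the aligner writes.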
Copyright of Australian Journal of Linguistics is the property of Taylor & Francis Ltd and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use.