Evaluating Data-Efficient LLMs on a Benchmark of Disfluency Minimal Pairs
Zero-shot benchmarks based on minimal pairs have become an essential part of the toolkit for evaluating large language models' linguistic capacities. Most of these tasks focus on syntactic, semantic, and morphological phenomena and are built from expert-crafted or semi-automatically generated sentences. Motivated by the crucial role of spontaneous speech in language processing, we created a benchmark that leverages spontaneous speech corpora in three languages (English, French, and Mandarin). Crucially, the benchmark tests LLMs on disfluencies, a ubiquitous and essential feature of spontaneous speech. Our findings show that models pretrained on conversational data exhibit a clear advantage in handling disfluencies compared to those trained on written encyclopedic text. Furthermore, cross-linguistic LLMs trained on much larger datasets did not exhibit strong advantages on our proposed benchmark, highlighting the potential of disfluency-based tasks as a challenging problem for language models.