Retrieval-augmented Chinese text-to-SQL generation for conversational bibliographic search.
To overcome the limitations of current bibliographic search systems, such as low semantic precision and inadequate handling of complex queries, this study introduces a novel conversational search framework for the Chinese bibliographic domain. Our approach makes several contributions. We first developed BibSQL, the first Chinese Text-to-SQL dataset for bibliographic metadata. Using this dataset, we built a two-stage conversational system that combines semantic retrieval of relevant question-SQL pairs with in-context SQL generation by large language models (LLMs). To enhance retrieval, we designed SoftSimMatch, a supervised similarity learning model that improves semantic alignment. We further refined SQL generation using a Program-of-Thoughts (PoT) prompting strategy, which guides the LLM to produce more accurate output by first creating Python pseudocode. Experimental results demonstrate the framework's effectiveness. Retrieval-augmented generation (RAG) significantly boosts performance, achieving up to 96.6% execution accuracy. Our SoftSimMatch-enhanced RAG approach surpasses zero-shot prompting and random example selection in both semantic alignment and SQL accuracy. Ablation studies confirm that the PoT strategy and self-correction mechanism are particularly beneficial under low-resource conditions, increasing one model's exact matching accuracy from 74.8% to 82.9%. While acknowledging limitations such as potential logic errors in complex queries and reliance on domain-specific knowledge, the proposed framework shows strong generalizability and practical applicability. By uniquely integrating semantic similarity learning, RAG, and PoT prompting, this work establishes a scalable foundation for future intelligent bibliographic retrieval systems and domain-specific Text-to-SQL applications. [ABSTRACT FROM AUTHOR]
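The two-stage design described in the abstract (retrieve similar question-SQL pairs, then prompt an LLM with them and ask for pseudocode before the SQL) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the example pool, the `books` table and its columns, and the prompt wording are hypothetical, and a simple character-bigram Jaccard score stands in for SoftSimMatch, which the abstract describes only as a supervised similarity learning model. The LLM call itself is omitted.

```python
from typing import List, Set, Tuple

# Hypothetical question-SQL pool. The `books` table and its columns are
# invented for this sketch and are not taken from BibSQL.
EXAMPLE_POOL: List[Tuple[str, str]] = [
    ("查找鲁迅写的书", "SELECT title FROM books WHERE author = '鲁迅';"),
    ("2020年出版了多少本书", "SELECT COUNT(*) FROM books WHERE pub_year = 2020;"),
    ("列出人民文学出版社的图书", "SELECT title FROM books WHERE publisher = '人民文学出版社';"),
]


def char_bigrams(text: str) -> Set[str]:
    """Return the set of character bigrams (no word segmentation needed for Chinese)."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of character bigrams.

    This is an unsupervised stand-in, NOT the SoftSimMatch model from the abstract.
    """
    grams_a, grams_b = char_bigrams(a), char_bigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def retrieve(question: str, pool: List[Tuple[str, str]], k: int = 2) -> List[Tuple[str, str]]:
    """Stage 1: return the k pool entries whose questions are most similar to the input."""
    ranked = sorted(pool, key=lambda pair: similarity(question, pair[0]), reverse=True)
    return ranked[:k]


def build_prompt(question: str, examples: List[Tuple[str, str]]) -> str:
    """Stage 2 input: a few-shot prompt that asks for pseudocode first, then SQL.

    The instruction wording is hypothetical; it only mimics the PoT idea of
    producing Python-style pseudocode before the final query.
    """
    parts = [
        "Translate the question into SQL. First write Python-style pseudocode "
        "describing the query steps, then give the final SQL."
    ]
    for example_question, example_sql in examples:
        parts.append(f"Question: {example_question}\nSQL: {example_sql}")
    parts.append(f"Question: {question}\nPseudocode:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    user_question = "查找巴金写的书"
    shots = retrieve(user_question, EXAMPLE_POOL, k=2)
    # The resulting string would be sent to an LLM of your choice.
    print(build_prompt(user_question, shots))
```

In this toy pool, the question about books by 巴金 ranks the structurally identical question about 鲁迅 first, so the LLM sees a demonstration with the same query shape. A learned similarity model would presumably matter most where surface overlap and query structure diverge, which a bigram score cannot capture.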