Result: Retrieval-augmented Chinese text-to-SQL generation for conversational bibliographic search.

Title:
Retrieval-augmented Chinese text-to-SQL generation for conversational bibliographic search.
Authors:
Wang, Zhenyu (AUTHOR); Zhu, Mark Xuefang (AUTHOR), xfzhu@nju.edu.cn; Li, Guo (AUTHOR); Kong, Shanshan (AUTHOR)
Source:
PLoS ONE. 10/27/2025, Vol. 20 Issue 10, p1-18. 18p.
Database:
Academic Search Index

Further Information

To overcome the limitations of current bibliographic search systems, such as low semantic precision and inadequate handling of complex queries, this study introduces a novel conversational search framework for the Chinese bibliographic domain. Our approach makes several contributions. We first developed BibSQL, the first Chinese Text-to-SQL dataset for bibliographic metadata. Using this dataset, we built a two-stage conversational system that combines semantic retrieval of relevant question-SQL pairs with in-context SQL generation by large language models (LLMs). To enhance retrieval, we designed SoftSimMatch, a supervised similarity learning model that improves semantic alignment. We further refined SQL generation using a Program-of-Thoughts (PoT) prompting strategy, which guides the LLM to produce more accurate output by first creating Python pseudocode. Experimental results demonstrate the framework's effectiveness. Retrieval-augmented generation (RAG) significantly boosts performance, achieving up to 96.6% execution accuracy. Our SoftSimMatch-enhanced RAG approach surpasses zero-shot prompting and random example selection in both semantic alignment and SQL accuracy. Ablation studies confirm that the PoT strategy and self-correction mechanism are particularly beneficial under low-resource conditions, increasing one model's exact matching accuracy from 74.8% to 82.9%. While acknowledging limitations such as potential logic errors in complex queries and reliance on domain-specific knowledge, the proposed framework shows strong generalizability and practical applicability. By uniquely integrating semantic similarity learning, RAG, and PoT prompting, this work establishes a scalable foundation for future intelligent bibliographic retrieval systems and domain-specific Text-to-SQL applications. [ABSTRACT FROM AUTHOR]
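
As a rough illustration of the two-stage pipeline the abstract describes (retrieve similar question-SQL pairs, then prompt an LLM with them), the following minimal Python sketch may help. It makes several assumptions that are not from the record: the similarity function is a plain character-bigram cosine, used only as a stand-in for SoftSimMatch (a supervised model whose details are not given here); the example pairs, the `books` table, its column names, and the prompt wording are all hypothetical; and no LLM is actually called.

```python
from collections import Counter
from math import sqrt

# Hypothetical store of (question, SQL) pairs. The schema is invented for
# illustration and is NOT taken from the BibSQL dataset.
EXAMPLES = [
    ("List books written by Lu Xun",
     "SELECT title FROM books WHERE author = 'Lu Xun';"),
    ("How many books were published in 2020",
     "SELECT COUNT(*) FROM books WHERE pub_year = 2020;"),
    ("Show books from Commercial Press",
     "SELECT title FROM books WHERE publisher = 'Commercial Press';"),
]


def bigrams(text):
    """Count character bigrams; needs no tokenizer, so it also works on Chinese."""
    s = text.lower().replace(" ", "")
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def cosine(a, b):
    """Cosine similarity between two bigram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, examples, k=2):
    """Stage 1: return the k stored pairs whose questions are most similar.

    Assumption: an unsupervised stand-in for a learned similarity model.
    """
    q = bigrams(question)
    ranked = sorted(examples, key=lambda ex: cosine(q, bigrams(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(question, shots):
    """Stage 2 input: a few-shot prompt that asks for pseudocode before SQL.

    Assumption: the wording only gestures at Program-of-Thoughts prompting.
    """
    lines = [
        "Translate the question into SQL.",
        "First outline the logic as Python pseudocode, then give the final SQL.",
    ]
    for q, sql in shots:
        lines += [f"Question: {q}", f"SQL: {sql}"]
    lines += [f"Question: {question}", "Pseudocode:"]
    return "\n".join(lines)


if __name__ == "__main__":
    user_question = "List books written by Ba Jin"
    shots = retrieve(user_question, EXAMPLES, k=2)
    print(build_prompt(user_question, shots))
```

The resulting prompt string would then be sent to an LLM of one's choice. Replacing `cosine` with a trained similarity scorer is the point at which a model such as SoftSimMatch would plug in.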