ScienceXGuide Pipeline: An AI-Pipeline for Accelerating Scientific Education and Research
Further Information
The rapid expansion of metabolomics and other scientific fields demands tools that keep pace with emerging research areas and interdisciplinary methodologies. Large Language Models (LLMs) are powerful for synthesizing vast amounts of information, but they often fall short on specialized facts and lack transparency about their information sources. Retrieval-Augmented Generation (RAG) addresses this limitation by retrieving relevant passages from text at query time and supplying them to the Artificial Intelligence (AI) chatbot as context, which improves the reliability and applicability of the generated content.

We introduce ScienceXGuide, an open computational pipeline that enables the scalable deployment of AI chatbots for scientific education and research. The pipeline starts by building a knowledge base from a set of scientific publications and content from other sources (YouTube videos, GitHub repositories, and websites). ScienceXGuide required the creation of two components: the BibTeX2FAISS Python package, which transforms open-access bibliographic references into searchable FAISS databases, and UserXGuide, a user-friendly, semi-automated system that lets end users deploy these AI chatbots as web apps on the Streamlit Community Cloud. Additionally, the pipeline supports integration with open LLMs through tools such as Ollama, providing flexibility for users with programming skills. The pipeline is accompanied by a series of proof-of-concept websites, the ScienceXGuide Curated Series, a set of expert-curated and evaluated scientific content across various disciplines, including metabolomics.

The ScienceXGuide pipeline has been implemented successfully, with initial proof-of-concept applications in metabolomics and other scientific fields, including computational applications for biology, chemistry, and omics sciences. It also allows users to convert bibliographic data into a dynamic, searchable knowledge base that can be interrogated through a chatbot interface, which provides accurate responses and cites the source documents. The UserXGuide and ScienceXGuide Series websites serve as practical, real-world applications of AI in science, offering an open and free solution to make scientific education accessible to a global audience. The integration of RAG with a scalable, complete, and user-focused pipeline represents an advance in the dissemination of scientific knowledge and a step toward integrating large language models into scientific education. While the pipeline currently focuses on a few types of bibliographic data, future enhancements could include broader data sources, support for a wider range of LLMs, and a more advanced RAG mechanism.
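
The flow described above (bibliographic entries turned into a searchable index, then retrieved passages passed to a chatbot along with their citation keys) can be illustrated with a minimal sketch. It uses only the Python standard library and is not the BibTeX2FAISS API: all function names and the two sample entries are hypothetical, the BibTeX reader handles only flat `field = {value}` pairs, and a brute-force cosine search over word counts stands in for a FAISS index built on learned embeddings.

```python
import math
import re
from collections import Counter

# Hypothetical sample data: two made-up BibTeX entries, for illustration only.
SAMPLE_BIBTEX = """
@article{doe2020,
  title = {Mass spectrometry methods for metabolomics},
  abstract = {We review mass spectrometry workflows for metabolite annotation.}
}
@article{roe2021,
  title = {Deep learning for protein structure},
  abstract = {Neural networks predict protein folding from sequence data.}
}
"""


def parse_bibtex(text):
    """Tiny BibTeX reader (assumption: flat `field = {value}` pairs, no nested braces)."""
    entries = {}
    for match in re.finditer(r"@\w+\{([^,]+),(.*?)\n\}", text, flags=re.S):
        key, body = match.group(1).strip(), match.group(2)
        entries[key] = dict(re.findall(r"(\w+)\s*=\s*\{(.*?)\}", body, flags=re.S))
    return entries


def vectorize(text):
    """Bag-of-words vector; a real pipeline would use an embedding model instead."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(entries):
    """Index each entry by its title plus abstract (stand-in for a FAISS database)."""
    return [
        (key, vectorize(fields.get("title", "") + " " + fields.get("abstract", "")))
        for key, fields in entries.items()
    ]


def search(index, query, k=1):
    """Return the citation keys of the k entries most similar to the query."""
    q = vectorize(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [key for key, _ in ranked[:k]]


def build_prompt(entries, keys, question):
    """Assemble a RAG-style prompt that exposes citation keys to the language model."""
    context = "\n".join(f"[{k}] {entries[k].get('abstract', '')}" for k in keys)
    return (
        "Answer using only the sources below and cite their keys.\n"
        f"{context}\nQuestion: {question}"
    )


if __name__ == "__main__":
    entries = parse_bibtex(SAMPLE_BIBTEX)
    index = build_index(entries)
    question = "Which methods are used in metabolomics?"
    print(build_prompt(entries, search(index, question), question))
```

Keeping the citation key attached to every retrieved passage is what would let a chatbot name the documents it drew on; the string returned by `build_prompt` would then be sent to whichever LLM backend is configured.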