Evaluating Inductive Reasoning and Programming Capabilities of Large Language Models With The One-Dimensional Abstract Reasoning Corpus.
We present an initial automated test to evaluate LLMs' capacity to perform inductive reasoning tasks. We use the GPT-3.5 and GPT-4 models to create a system that generates Python programs as inductive-reasoning hypotheses for transforming sequences from the One-Dimensional Abstract Reasoning Corpus (1D-ARC) challenge. We experiment with three prompting techniques, namely standard prompting, Chain of Thought (CoT), and direct feedback. We provide results and an analysis of the cost-to-success rate and the benefit-cost ratio. Our best result is an overall 25% success rate with CoT prompting on GPT-4, significantly surpassing the standard prompting approach. We assess the programming capabilities of the LLMs by analysing the execution rate and the errors of the generated code. We discuss potential avenues for improving our experiments, testing other strategies, and combining deductive reasoning with LLM-based inductive reasoning.
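
The following is a minimal sketch, not the authors' implementation, of the kind of evaluation loop the abstract describes: an LLM proposes a Python transformation as a hypothesis, and a harness executes it on 1D-ARC-style training and test pairs before accepting it. The task representation (lists of integers), the function name "transform", and the helper generate_hypothesis are assumptions introduced here for illustration.

    # Sketch of an LLM-generated-hypothesis harness for 1D-ARC-style tasks.
    # Assumptions: tasks are (input, output) pairs of integer lists, and the
    # generated code defines a function `transform(seq)`.
    from typing import Callable, List, Optional, Tuple

    Sequence1D = List[int]
    Pair = Tuple[Sequence1D, Sequence1D]


    def generate_hypothesis(prompt: str) -> str:
        """Placeholder for a GPT-3.5/GPT-4 call that returns Python source code.

        In the setup described by the abstract, this would be a standard, CoT,
        or feedback prompt sent to the model; here it is stubbed out.
        """
        raise NotImplementedError("plug in an LLM client of your choice")


    def compile_hypothesis(source: str) -> Optional[Callable[[Sequence1D], Sequence1D]]:
        """Try to turn generated source into a callable; failures feed the execution-rate statistic."""
        namespace: dict = {}
        try:
            exec(source, namespace)  # does the generated code run at all?
            fn = namespace.get("transform")
            return fn if callable(fn) else None
        except Exception:
            return None


    def hypothesis_solves_task(
        fn: Callable[[Sequence1D], Sequence1D],
        train_pairs: List[Pair],
        test_pairs: List[Pair],
    ) -> bool:
        """Accept a hypothesis only if it reproduces every training and test output."""
        try:
            return all(fn(x) == y for x, y in train_pairs + test_pairs)
        except Exception:
            return False

A direct-feedback variant, as named in the abstract, could re-prompt generate_hypothesis with the failing pair or error message appended; that loop is omitted here.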