Project Auto-World: Towards Automated Benchmarking of Neural Relational Reasoners


Authors: Anirban Das, Joanne Boisson, Irtaza Khalid, Sumita Garai, Steven Schockaert

Submitted: 23 June 2026

Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Submitted to: NeurIPS 2026 E&D track

Code: GitHub Repository


Abstract


Reasoning about relational structures remains a significant challenge for neural models, particularly when they are required to systematically apply learned knowledge to problem instances that are more complex than those encountered during training. Progress in this area is hindered by the difficulty of evaluating such generalization, as it is rarely clear a priori what constitutes a genuinely hard instance. In this work, we explore how this challenge can be addressed by using large language models (LLMs) to automate benchmark generation, so that the generation process learns end to end to produce increasingly difficult instances.


Specifically, given a world parameterized by Datalog rules and using an Edge Transformer as the reasoning evaluator, we employ LLM-driven evolutionary search (inspired by FunSearch) and autonomous agentic search to discover sampling functions that yield hard problem instances. We further demonstrate that the Edge Transformer can be improved using this automatically generated data, enabling it to generalize effectively to additional data perturbations. Finally, we show that the same framework can be applied to novel worlds proposed by LLMs, paving the way for autonomous research in neural relational reasoning.
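To make the setup concrete, the following is a minimal sketch of what a world defined by Datalog rules and a single problem instance might look like. The rule set, the relation names, and the `forward_chain` helper are illustrative assumptions and are not taken from the paper; the sketch only handles rules of the form head(X, Z) :- body1(X, Y), body2(Y, Z).

```python
from itertools import product

# Hypothetical toy world. Each triple (head, body1, body2) encodes the rule
#   head(X, Z) :- body1(X, Y), body2(Y, Z).
RULES = [
    ("grandparent", "parent", "parent"),
    ("uncle", "brother", "parent"),
]


def forward_chain(facts, rules):
    """Naive bottom-up evaluation: apply every rule until no new fact appears."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so that `derived` can grow inside the loop.
        snapshot = list(derived)
        for head, body1, body2 in rules:
            for (r1, x, y1), (r2, y2, z) in product(snapshot, repeat=2):
                if r1 == body1 and r2 == body2 and y1 == y2:
                    fact = (head, x, z)
                    if fact not in derived:
                        derived.add(fact)
                        changed = True
    return derived


# A problem instance: a small set of base facts (edges of a relational graph).
facts = {
    ("brother", "bob", "alice"),
    ("parent", "alice", "carol"),
    ("parent", "carol", "dave"),
}
closure = forward_chain(facts, RULES)
```

In this toy instance a reasoner would be given the base facts and asked which relation holds between `bob` and `carol`; the symbolic closure supplies the ground-truth label against which a neural model's prediction could be checked.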


Key Contributions


  • Automated Benchmark Generation: Introduces a method using LLMs to automatically create challenging benchmarks for neural relational reasoners, addressing the long-standing issue of evaluating systematic generalization.
  • LLM-Driven Evolutionary Search: Uses FunSearch-inspired evolutionary search and autonomous agentic search to discover sampling functions that yield hard problem instances.
  • Improved Model Generalization: Demonstrates that the Edge Transformer can be improved by training on this automatically generated data, enabling it to generalize to additional data perturbations.
  • Towards Autonomous Research: Extends the framework to novel worlds proposed by LLMs, suggesting a path toward fully automated discovery in neural reasoning research.
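The evolutionary search over sampling functions described above can be sketched as a simple select-and-mutate loop. This is only a rough illustration under strong assumptions: in the paper the candidates are sampling functions proposed by an LLM and scored against an Edge Transformer, whereas here a candidate is a single hypothetical parameter (`mean_hops`), the LLM proposal step is replaced by a random perturbation, and the evaluator is replaced by a threshold rule. All function names are invented for this example.

```python
import random


def sample_instance(params, rng):
    """Hypothetical sampler: draw a chain length around params['mean_hops']."""
    hops = max(1, round(rng.gauss(params["mean_hops"], 1.0)))
    return {"hops": hops}


def toy_reasoner_correct(instance):
    """Stand-in for the trained evaluator (assumption: fails beyond 6 hops)."""
    return instance["hops"] <= 6


def hardness(params, rng, n=200):
    """Score a sampler by the evaluator's error rate on instances it produces."""
    errors = sum(
        not toy_reasoner_correct(sample_instance(params, rng)) for _ in range(n)
    )
    return errors / n


def propose_variant(params, rng):
    """Stand-in for the LLM proposal step: randomly perturb the parent."""
    return {"mean_hops": max(1.0, params["mean_hops"] + rng.uniform(-1.0, 2.0))}


def evolve(generations=20, population_size=6, keep=2, seed=0):
    """Keep the hardest samplers each generation and refill with variants."""
    rng = random.Random(seed)
    population = [{"mean_hops": 2.0} for _ in range(population_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=lambda p: hardness(p, rng), reverse=True)
        parents = ranked[:keep]
        history.append(hardness(parents[0], rng))
        children = [
            propose_variant(rng.choice(parents), rng)
            for _ in range(population_size - keep)
        ]
        population = parents + children
    return parents[0], history


best_sampler, hardness_history = evolve()
```

The toy evaluator makes hardness trivially equal to chain length, which is exactly the kind of prior knowledge that is usually unavailable; the point of the sketch is only the loop structure, in which the score comes from the evaluator's failures rather than from a hand-written difficulty measure.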

Significance


As of 2026, systematic generalization remains a critical bottleneck for neural networks, especially in domains requiring compositional reasoning (e.g., program synthesis, knowledge graph reasoning, and robotics). This work contributes to the growing trend of using LLMs as research assistants or autonomous agents, not only to generate code or data but also to actively drive the scientific discovery process. By automating benchmark creation and model improvement, the authors propose a scalable approach to evaluating and enhancing relational reasoning capabilities, which could accelerate progress in the field.

via ArXiv AI
