ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

Summary


Large language models (LLMs) deployed as agents over extensive tool catalogs face a critical bottleneck in tool retrieval. Embedding-based methods often under-capture specialized tool semantics because they rely on compact encoders. Parametric tool retrieval addresses this by encoding each tool as a virtual token appended to the LLM's vocabulary. The model is then fine-tuned in two stages, memorization followed by retrieval supervised fine-tuning (SFT), so that the LLM acts as its own retriever. Although this approach achieves strong performance on standard benchmarks such as ToolBench, existing evaluations rely on verbose, fully specified queries and on constrained decoding that restricts outputs to valid token paths. Such evaluations do not reveal whether the model truly understands its tools.
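To make the role of constrained decoding concrete, here is a minimal sketch in Python. It assumes a toy vocabulary in which a few ids stand for virtual tool tokens. The logits, the ids, and both function names are hypothetical and are not taken from ToolSense or any specific model.

```python
def unconstrained_argmax(logits):
    """Pick the highest-scoring token id over the whole vocabulary."""
    return max(range(len(logits)), key=lambda t: logits[t])


def constrained_argmax(logits, tool_token_ids):
    """Pick the highest-scoring token id among the allowed tool tokens only."""
    return max(tool_token_ids, key=lambda t: logits[t])


# Toy vocabulary (assumption): ids 0-3 are ordinary tokens,
# ids 4-6 are virtual tool tokens appended to the vocabulary.
logits = [2.5, 0.1, -1.0, 0.3, 1.2, 0.4, -0.7]
tool_token_ids = [4, 5, 6]

free_choice = unconstrained_argmax(logits)                  # ordinary token wins
masked_choice = constrained_argmax(logits, tool_token_ids)  # forced to a tool token
```

In this toy case the unconstrained choice is an ordinary token, while the constrained choice is always some tool token. This is one way to see why a benchmark that only scores constrained outputs may overstate how confidently a model associates a query with a tool.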


To bridge this gap, the authors introduce ToolSense, an open-source, LLM-powered diagnostic framework. Given any tool catalog, ToolSense automatically generates three benchmarks:

  • Realistic Retrieval Benchmark (RRB): Queries at three ambiguity tiers, simulating real-world usage.
  • Multiple-Choice Question (MCQ) Probing Benchmark: Tests factual tool knowledge.
  • Question-Answering (QA) Probing Benchmark: Evaluates deeper understanding.

Key Findings


Applying ToolSense to ToolBench (~47,000 tools) and evaluating five parametric model training configurations reveals a knowledge-retrieval dissociation:

  • On RRB queries, several configurations suffer a performance collapse of 50–64 percentage points relative to their results on fully specified ToolBench queries, falling below the embedding-model baseline.
  • Despite strong retrieval performance, some models score near-random on factual probes, indicating that retrieval success does not equate to genuine tool knowledge.
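
The two findings above rest on simple arithmetic: a drop measured in percentage points, and a comparison against chance level on multiple-choice probes. The sketch below shows both calculations in Python. The accuracy values, the number of answer options, and the tolerance are illustrative assumptions, not figures from the paper.

```python
def percentage_point_drop(baseline_acc, new_acc):
    """Difference between two accuracies (given as fractions), in percentage points."""
    return round((baseline_acc - new_acc) * 100, 1)


def is_near_chance(accuracy, num_options, tolerance=0.05):
    """True if accuracy is within `tolerance` of uniform random guessing."""
    chance = 1.0 / num_options
    return abs(accuracy - chance) <= tolerance


# Hypothetical numbers for illustration only.
drop = percentage_point_drop(0.90, 0.30)          # fully specified vs. ambiguous queries
guessing = is_near_chance(0.27, num_options=4)    # assumed 4-option MCQ
informed = is_near_chance(0.80, num_options=4)
```

With these made-up inputs, the drop is 60 percentage points, an accuracy of 0.27 on a four-option question counts as near chance, and 0.80 does not.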

Relevance in 2026


As LLM-based agents proliferate in enterprise and open-source ecosystems, handling thousands of API calls, database queries, and firmware commands, rigorous auditing under realistic conditions becomes increasingly important. ToolSense addresses a gap in the 2026 AI landscape: most retrieval benchmarks do not stress-test a model's ability to handle ambiguous or incomplete user requests. By providing a flexible, automated diagnostic pipeline, ToolSense lets developers identify and correct knowledge blind spots before deploying agents in production.


Availability


The ToolSense framework and ToolBench diagnostic benchmarks are open-sourced at github.com/SAP/toolsense.

via ArXiv AI
