Contrastive Reflection for Iterative Prompt Optimization

Abstract


LLM agents are becoming central to information retrieval (IR): they issue retrieval queries, synthesize answers, and increasingly serve as judges for IR evaluation. Improving the prompts that control these agents is an optimization problem, but in applied IR settings it often looks less like blind search and more like debugging. Engineers need to know which behavior failed, which nearby behavior still worked, what distinguishes the two, and whether a prompt edit improves held-out quality without introducing regressions.


We present Contrastive Reflection, an iterative prompt-optimization framework for agentic IR workflows. The framework starts from a task-centric quality definition: QA agents expose retrieval or reasoning traces, and grading agents expose dimension-level scores and rationales. These structured traces are used to identify error-anchored behavioral slices, add nearby successful examples from the same region, and ask a Teacher LLM to propose a targeted prompt edit. Candidate edits are accepted only when validation performance improves, optionally subject to regression checks. We instantiate the framework with a tree-based slice selector, but the contribution is the contrastive reflection loop itself rather than the tree.


On a public HotpotQA retrieval-augmented QA setup, one tree-selected contrastive repair improves held-out exact-match accuracy from 51.4% to 60.4%. Failure-only and random-evidence variants improve less and break more previously correct examples. A light instruction-only comparison places the method near modern prompt optimizers: MIPROv2 reaches 59.4% and GEPA 57.0%. The result is an interpretable optimization loop for IR agents, aimed at making prompt repair more inspectable and validation-driven.


Introduction


In 2026, as LLM-based agents increasingly power information retrieval systems, from query formulation to answer synthesis and evaluation, prompt optimization needs to be both robust and debuggable. Traditional prompt engineering often relies on manual trial and error, which is time-consuming and opaque. Our work addresses this by framing prompt improvement as a contrastive debugging task: identify what went wrong and what went right, then use those distinctions to make targeted edits.


Method: Contrastive Reflection Loop


Contrastive Reflection proceeds through the following steps:

  1. Trace collection: QA agents output retrieval or reasoning traces; grading agents output dimension-level scores and rationales.
  2. Error slicing: Using a tree-based slice selector, the framework identifies behavioral slices anchored to failure modes.
  3. Contrastive addition: Nearby successful examples from the same region are added for comparison.
  4. Teacher LLM proposal: A Teacher LLM proposes a targeted prompt edit based on the contrast between the slice's failures and its nearby successes.
  5. Validation: The edit is accepted only if it improves validation performance, with optional regression checks to guard against breaking previously correct behavior.
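The five steps above can be sketched as a single accept-or-reject iteration. This is a minimal illustration under stated assumptions, not the authors' implementation: the trace fields `slice` and `correct`, the callables `teacher`, `validate`, and `regressions`, and the failure-count selector (a simple stand-in for the tree-based slice selector) are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional

# Hypothetical trace record, e.g. {"slice": "bridge", "correct": False}.
Trace = Dict[str, object]


def pick_error_slice(traces: List[Trace]) -> Optional[str]:
    """Return the slice with the most failures, or None if nothing failed.

    Assumption: a plain failure count replaces the tree-based selector.
    """
    failures: Dict[str, int] = defaultdict(int)
    for t in traces:
        if not t["correct"]:
            failures[str(t["slice"])] += 1
    if not failures:
        return None
    return max(failures, key=lambda key: failures[key])


def build_contrastive_batch(traces: List[Trace], slice_key: str, k: int = 2) -> dict:
    """Collect up to k failures and up to k successes from the same slice."""
    in_slice = [t for t in traces if t["slice"] == slice_key]
    return {
        "failures": [t for t in in_slice if not t["correct"]][:k],
        "successes": [t for t in in_slice if t["correct"]][:k],
    }


def reflection_step(
    prompt: str,
    traces: List[Trace],
    teacher: Callable[[str, dict], str],
    validate: Callable[[str], float],
    regressions: Optional[Callable[[str, str], int]] = None,
    max_regressions: int = 0,
) -> str:
    """Run one iteration; return the candidate only if it passes the gates."""
    slice_key = pick_error_slice(traces)
    if slice_key is None:
        return prompt  # no failures to anchor an edit on
    batch = build_contrastive_batch(traces, slice_key)
    candidate = teacher(prompt, batch)  # Teacher LLM call in a real system
    improved = validate(candidate) > validate(prompt)
    safe = regressions is None or regressions(prompt, candidate) <= max_regressions
    return candidate if improved and safe else prompt
```

In this sketch the regression check is optional, mirroring step 5: when `regressions` is omitted, acceptance depends only on the validation score.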

Results and Comparison


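On a public HotpotQA retrieval-augmented QA setup, one tree-selected contrastive repair raised held-out exact-match accuracy from 51.4% to 60.4%, a gain of 9.0 percentage points. For comparison:

• Failure-only and random-evidence variants improved less and broke more previously correct examples.
• MIPROv2 reached 59.4% in a light instruction-only comparison.
• GEPA reached 57.0% in the same comparison.

These figures place the method near modern prompt optimizers, with margins of 1.0 and 3.4 points over MIPROv2 and GEPA respectively in this light comparison.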
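The two quantities reported here, exact-match accuracy and the number of previously correct examples that an edit breaks, can be computed as in the sketch below. The normalization (lowercasing, stripping punctuation and English articles) follows the common SQuAD-style convention and is an assumption; the source does not specify its exact-match scoring. The function names and the sample data in the usage lines are hypothetical.

```python
import re
import string
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed convention)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> bool:
    """True when prediction and gold answer agree after normalization."""
    return normalize(pred) == normalize(gold)


def compare_prompts(before: List[str], after: List[str], gold: List[str]) -> Dict[str, float]:
    """Compare predictions from the old and new prompt on the same held-out set."""
    was_ok = [exact_match(p, g) for p, g in zip(before, gold)]
    now_ok = [exact_match(p, g) for p, g in zip(after, gold)]
    n = len(gold)
    return {
        "em_before": sum(was_ok) / n,
        "em_after": sum(now_ok) / n,
        # Examples the edit repaired: wrong before, right after.
        "fixed": sum(1 for b, a in zip(was_ok, now_ok) if not b and a),
        # Regressions: right before, wrong after.
        "broken": sum(1 for b, a in zip(was_ok, now_ok) if b and not a),
    }


# Illustrative data only, not taken from the paper.
gold = ["Paris", "The Beatles", "1969", "Ada Lovelace"]
before = ["paris", "Rolling Stones", "1969", "Alan Turing"]
after = ["Paris.", "the beatles", "1970", "Ada Lovelace"]
report = compare_prompts(before, after, gold)
```

Reporting `fixed` and `broken` separately matters because two edits with the same net accuracy gain can differ in how many working examples they break.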

Conclusion


Contrastive Reflection offers a transparent, validation-driven approach to prompt optimization for IR agents. By focusing on contrastive behavioral analysis, it enables engineers to understand and fix prompt failures without sacrificing held-out quality. The framework is lightweight, interpretable, and ready for integration into modern agentic IR pipelines.


6 pages, 1 figure. To appear at Agent4IR @ KDD 2026 (KDD 2026 Workshop on AI Agents for Information Retrieval)

Subjects: Artificial Intelligence (cs.AI)

ACM Classes: I.2.7; H.3.3; I.2...
