Dr-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion
Technical Summary
Agents that search large corpora increasingly need to work with documents directly rather than only through ranked results. As of early 2026, retrieval-augmented generation (RAG) systems have become standard, but they still struggle to verify complex constraints that span multiple documents. This paper introduces Dr-DCI, a framework that bridges the gap between scalable retrieval and precise, hands-on document analysis.
The Challenge of Direct Corpus Interaction
Traditional agentic search over large corpora relies on retriever-mediated interfaces, such as BM25 or ColBERT, for scalable candidate discovery. While effective at ranking relevant documents, these interfaces expose evidence only as ranked results or bounded document views. This limits an agent's ability to reorganize material or verify constraints across multiple documents. Direct Corpus Interaction (DCI) addresses this by exposing shell-executable corpus operations for flexible search, filtering, comparison, and verification. However, full-corpus terminal commands become slow and unstable as the corpus grows, degrading both performance and efficiency.
The Dr-DCI Solution
Dr-DCI is a retriever-steered DCI framework that treats retrieval as an agent-callable action for expanding a local workspace. Instead of operating directly over the full corpus, the agent dynamically pulls relevant documents into an evolving workspace and conducts DCI operations within that confined environment. This design combines the recall power of retrieval with the precision of DCI: retrieval keeps exploration scalable, while DCI preserves the local operations needed for effective evidence resolution.
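The idea of retrieval as a workspace-expanding action can be illustrated with a small sketch. This is not the paper's implementation: the `Workspace` class, its `expand` and `grep` methods, and the term-overlap scoring are all hypothetical stand-ins chosen for illustration, with an in-memory dictionary playing the role of the corpus and a regex search playing the role of a shell-executable DCI operation.

```python
import re
from dataclasses import dataclass, field


def _tokens(text: str) -> set[str]:
    """Lowercased word tokens; a deliberately crude tokenizer."""
    return set(re.findall(r"\w+", text.lower()))


@dataclass
class Workspace:
    """Hypothetical local workspace mapping doc_id -> text."""
    docs: dict[str, str] = field(default_factory=dict)

    def expand(self, corpus: dict[str, str], query: str, k: int = 2) -> list[str]:
        """Retrieval as an agent-callable action: pull the top-k matches into the workspace.

        Scoring is plain term overlap, an assumed stand-in for a real retriever.
        """
        terms = _tokens(query)
        scored = [(len(terms & _tokens(text)), doc_id) for doc_id, text in corpus.items()]
        matching = [pair for pair in scored if pair[0] > 0]
        top = sorted(matching, key=lambda pair: (-pair[0], pair[1]))[:k]
        added = [doc_id for _, doc_id in top]
        for doc_id in added:
            self.docs[doc_id] = corpus[doc_id]
        return added

    def grep(self, pattern: str) -> list[tuple[str, str]]:
        """DCI-style local operation: search only the documents already in the workspace."""
        regex = re.compile(pattern, re.IGNORECASE)
        return [
            (doc_id, line)
            for doc_id, text in sorted(self.docs.items())
            for line in text.splitlines()
            if regex.search(line)
        ]


corpus = {
    "d1": "Ada Lovelace wrote notes on the Analytical Engine.\nShe was born in 1815.",
    "d2": "Charles Babbage designed the Analytical Engine.",
    "d3": "The Eiffel Tower is in Paris.",
}
ws = Workspace()
first = ws.expand(corpus, "analytical engine", k=2)   # workspace now holds d1 and d2
hits = ws.grep(r"born")                               # searched locally, not over the corpus
missing = ws.grep(r"eiffel")                          # d3 has not been pulled in yet
second = ws.expand(corpus, "Eiffel Paris", k=1)       # a later retrieval call grows the workspace
```

In this toy version the local search only ever touches documents the agent has already retrieved, which is the property the design relies on to keep direct operations cheap as the corpus grows.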
Key Performance Gains (2026 Benchmarks)
Dr-DCI demonstrates both effectiveness and efficiency across scales:
- BrowseComp-Plus: Achieves 71.2% accuracy, improving over raw DCI and ablated variants by up to 8.3 percentage points while reducing tool usage, wall time, and estimated cost.
- Workspace-preserving context reset: Further boosts accuracy to 73.3%.
- Corpus scaling experiments: Remains effective from 100K to 10M documents. In contrast, raw DCI becomes unstable and BM25 performs substantially worse at scale.
- Wiki-18 QA (20M-scale): Achieves an average score of 63.0 across six benchmarks, outperforming both retrieval-based and trained search-agent baselines.
Critical Components
Ablation analysis reveals that two features are key to Dr-DCI's performance:
- Ranked previews: Allow the agent to quickly assess document relevance.
- Inter-document DCI operations: Enable constraint verification and information consolidation across multiple documents.
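To make the two components above concrete, here is a minimal sketch under stated assumptions. The function names `ranked_previews` and `common_matches`, the preview format, and the choice of a regex intersection as the cross-document check are all hypothetical; the summary does not describe how the paper implements either feature.

```python
import re


def ranked_previews(ranked_docs: list[tuple[str, str]], width: int = 40) -> list[str]:
    """Format (doc_id, text) pairs, assumed already in rank order, as one-line previews."""
    previews = []
    for rank, (doc_id, text) in enumerate(ranked_docs, start=1):
        flat = " ".join(text.split())  # collapse newlines and repeated spaces
        snippet = flat if len(flat) <= width else flat[:width].rstrip() + "..."
        previews.append(f"{rank}. [{doc_id}] {snippet}")
    return previews


def common_matches(docs: dict[str, str], pattern: str) -> set[str]:
    """Inter-document operation: regex matches that occur in every document."""
    per_doc = [set(re.findall(pattern, text)) for text in docs.values()]
    return set.intersection(*per_doc) if per_doc else set()


previews = ranked_previews([("a", "short text"), ("b", "x" * 50)], width=10)
docs = {
    "a": "Founded in 1889 and rebuilt in 1950.",
    "b": "The fair of 1889 drew crowds.",
}
shared_years = common_matches(docs, r"\b\d{4}\b")  # years mentioned in both documents
```

The preview step lets an agent skim candidates before committing to a full read, while the intersection is one simple example of a constraint that no single ranked result can confirm on its own.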
Broader Implications
As AI systems in 2026 increasingly need to handle massive, heterogeneous document collections, Dr-DCI offers a practical blueprint for scaling agentic search. By combining retrieval and direct interaction within a dynamic workspace, it overcomes the limitations of both pure ranking and full-corpus command execution.
Paper details: 25 pages, 4 figures, 22 tables. Submitted to arXiv on 12 June 2026.
