arXiv:2606.18385 (cs) | Submitted on 16 Jun 2026
Title
CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework
Authors
Sneha Rao, Shaina Raza, Dhanesh Ramachandram
Abstract
Vision-Language Models (VLMs) continue to exhibit a critical limitation: they produce fluent yet visually unfaithful outputs, commonly referred to as hallucinations. This undermines trust in VLM applications in domains such as medical imaging, autonomous driving, and accessibility tools. As of 2026, chain-of-thought (CoT) reasoning and retrieval-augmented generation (RAG) have advanced significantly, yet existing methods leave two gaps: they neither enforce step-level citation grounding, in which each reasoning step must cite its evidence, nor route verification failures back to the retrieval stage for correction.
To address these gaps, we introduce CaVe-VLM-CoT, a modular, reflection-based agentic-RAG framework. Our system operates through a five-stage closed-loop pipeline:
- Extractor – Identifies claims requiring evidence.
- Retriever – Gathers relevant multimodal evidence.
- Solver – Generates a step-by-step reasoning chain.
- Citation Injector – Associates each reasoning step with its retrieved evidence.
- Verifier – Checks whether each claim is properly grounded; ungrounded claims trigger structured feedback to the Extractor for targeted re-retrieval.
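The closed loop formed by the five stages above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract names the stages but not their interfaces, so every function name, signature, the `Step` type, the `max_rounds` cap, and the toy stand-in stages are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Step:
    """One reasoning step plus the evidence ids cited for it (hypothetical type)."""
    text: str
    citations: List[str] = field(default_factory=list)


def run_closed_loop(question, extract, retrieve, solve, inject, verify, max_rounds=3):
    """Run the five stages; Verifier feedback re-enters the Extractor.

    Returns (steps, rounds_used, fully_grounded). The round cap is an
    assumption added so the sketch always terminates.
    """
    feedback: List[str] = []        # Verifier output from the previous round
    evidence: Dict[str, str] = {}   # evidence accumulated across rounds
    steps: List[Step] = []
    for round_no in range(1, max_rounds + 1):
        claims = extract(question, feedback)                 # Extractor
        evidence.update(retrieve(claims))                    # Retriever
        steps = inject(solve(question, evidence), evidence)  # Solver, Citation Injector
        feedback = verify(steps, evidence)                   # Verifier
        if not feedback:                                     # every step is grounded
            return steps, round_no, True
    return steps, max_rounds, False


# Toy stand-ins so the loop can be executed; real stages would call a VLM
# and a multimodal retriever.
CORPUS = {"e1": "plants absorb light", "e2": "chlorophyll is green"}


def toy_extract(question, feedback):
    # First round: the question itself; later rounds: the flagged steps.
    return feedback or [question]


def toy_retrieve(claims):
    # Keep corpus entries that share at least one word with any claim.
    words = {w for c in claims for w in c.lower().split()}
    return {k: v for k, v in CORPUS.items() if words & set(v.split())}


def toy_solve(question, evidence):
    # Fixed two-step chain; it ignores its inputs to keep the demo deterministic.
    return ["plants absorb light", "chlorophyll is green"]


def toy_inject(step_texts, evidence):
    # Cite every evidence item whose text matches the step exactly.
    return [Step(t, [k for k, v in evidence.items() if v == t]) for t in step_texts]


def toy_verify(steps, evidence):
    # Feedback is the text of each step that has no citation.
    return [s.text for s in steps if not s.citations]


steps, rounds, grounded = run_closed_loop(
    "How do plants get energy?",
    toy_extract, toy_retrieve, toy_solve, toy_inject, toy_verify,
)
print(rounds, grounded, [s.citations for s in steps])
```

In this toy run the first round retrieves evidence for only one of the two steps, so the Verifier returns the uncited step and the second round retrieves specifically for it. That is the targeted re-retrieval behaviour the Verifier bullet describes, reduced to word matching.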
Because no prior framework jointly measures retrieval quality, step-wise citation faithfulness, and cross-modal grounding, we propose a comprehensive suite of 23 component-wise metrics covering all pipeline stages. At the core of this suite is CaVeScore, a composite metric that weights accuracy, citation precision and recall, attribution correctness, and evidence grounding.
On the ScienceQA benchmark, our framework attains 87.1% accuracy and a CaVeScore of 56.6%. On the more challenging MMMU benchmark (30 subjects), it reaches 55.2% accuracy and a CaVeScore of 35.7%. Crucially, CaVe-VLM-CoT requires no architectural or prompt modifications to the underlying VLM. These results demonstrate that a closed-loop verification and retrieval correction mechanism significantly improves VLM faithfulness while preserving flexibility for future model upgrades.
Subjects
Artificial Intelligence (cs.AI)
Cite as
arXiv:2606.18385 [cs.AI]
DOI: https://doi.org/10.48550/arXiv.2606.18385

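The abstract describes CaVeScore as a composite that weights accuracy, citation precision and recall, attribution correctness, and evidence grounding, but it gives neither the weights nor the aggregation rule. As a rough illustration only, the sketch below assumes a normalized weighted average of components in [0, 1]. The function name, the component keys, the equal weights, and the example values are all hypothetical placeholders, not figures from the paper.

```python
from typing import Dict


def composite_score(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Normalized weighted average of component metrics in [0, 1].

    Assumption: a linear combination. The actual CaVeScore formula
    is not stated in the abstract.
    """
    if set(components) != set(weights):
        raise ValueError("components and weights must name the same metrics")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return sum(weights[k] * components[k] for k in components) / total_weight


# Placeholder inputs chosen only to exercise the function.
example_components = {
    "accuracy": 0.8,
    "citation_precision": 0.6,
    "citation_recall": 0.5,
    "attribution_correctness": 0.7,
    "evidence_grounding": 0.4,
}
equal_weights = {name: 1.0 for name in example_components}
print(round(composite_score(example_components, equal_weights), 3))
```

With a linear rule like this, a high accuracy cannot hide weak citation or grounding components, which is consistent with the reported CaVeScore values sitting well below the reported accuracies. The true weighting may differ.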