Zyphra Releases Zamba2-VL: Hybrid Mamba2–Transformer Vision-Language Models Cut Time-to-First-Token by an Order of Magnitude

AI Agents · 2026-06-12

Zyphra has unveiled Zamba2-VL, a family of hybrid vision-language models (VLMs) that combine Mamba2 state-space layers with Transformer attention. According to the release, the models cut time-to-first-token (TTFT) roughly tenfold compared with conventional Transformer-based VLMs, which makes them well suited to latency-sensitive, real-time applications.

The Zamba2-VL architecture integrates a pretrained Mamba2 backbone with a lightweight Transformer decoder, enabling efficient processing of both textual and visual inputs. The models replace standard self-attention layers with Mamba2's selective state-space model for the majority of computations. Because state-space layers scale linearly with sequence length while self-attention scales quadratically, this sharply reduces the delay before generation starts, especially for long-context or interactive use cases such as conversational agents, document analysis, and video understanding.

Key technical highlights include:

- A hybrid design that uses Mamba2 for most of the computation, reserving Transformer attention for a small subset of residual interactions.
- An order-of-magnitude improvement in TTFT versus similarly sized pure-Transformer VLMs, as reported on multiple benchmarks.
- Support for diverse vision-language tasks, including image captioning, visual question answering, and multimodal dialogue.

The models are released under an open-source license, with weights and inference code available on GitHub.

Zyphra's release is a step toward rethinking the efficiency tradeoffs of VLMs at a time when the industry increasingly demands faster, more responsive multimodal systems. As of June 2026, Zamba2-VL is among the first production-ready VLMs to use state-space model backbones at scale, signaling a potential shift away from attention-heavy architectures for latency-critical deployments.
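For readers who want to check a TTFT claim on their own hardware, the metric itself is simple to measure: start a clock when the request is issued and stop it when the first output token arrives. The sketch below is a minimal illustration in Python, not Zyphra's benchmark harness. `measure_ttft` and `fake_stream` are hypothetical names introduced here, and `fake_stream` merely sleeps to imitate a prefill phase followed by token-by-token decoding.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_seconds, total_seconds, token_count) for a token stream.

    The clock starts before the first token is requested, so any prompt
    processing (prefill) done lazily by the stream counts toward TTFT.
    """
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            # First token has just arrived: record time-to-first-token.
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    if ttft is None:
        raise ValueError("stream produced no tokens")
    return ttft, total, count


def fake_stream(prefill_s: float, per_token_s: float, n_tokens: int) -> Iterator[str]:
    """Hypothetical stand-in for a model (an assumption, not a real API).

    Sleeps once to mimic prefill over the prompt and image tokens, then
    yields tokens with a small per-token decode delay.
    """
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_s)
        yield f"tok{i}"


if __name__ == "__main__":
    # Two made-up latency profiles; the numbers are illustrative only.
    for name, prefill in [("slow-prefill", 0.20), ("fast-prefill", 0.02)]:
        ttft, total, n = measure_ttft(fake_stream(prefill, 0.005, 20))
        print(f"{name}: TTFT={ttft * 1000:.1f} ms, total={total * 1000:.1f} ms, tokens={n}")
```

To time a real model, `fake_stream` would be replaced by whatever streaming iterator the inference stack exposes. TTFT is dominated by prompt processing, so a comparison should hold prompt length, image resolution, batch size, and hardware fixed across the models being compared.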

via MarkTechPost

