MiniMax Sparse Attention (MSA): A Two-Branch Block-Sparse Attention Mechanism Trained on a 109B-Parameter MoE with a 3T-Token Budget

The rapid growth of large language models (LLMs) has intensified the need for attention mechanisms that scale to long contexts without prohibitive compute costs. Standard full attention has quadratic complexity in sequence length, so it becomes a bottleneck for models processing millions of tokens, especially in latency-sensitive or resource-constrained environments. To address this, MiniMax introduced MiniMax Sparse Attention (MSA), a two-branch block-sparse attention mechanism designed to cut attention compute substantially while preserving model quality. In 2026, as enterprises increasingly deploy LLMs for real-time document analysis, code generation, and multi-turn dialogue, sparse attention methods like MSA matter for running production-scale inference within tight hardware budgets.

MSA uses a dual-branch architecture. The first branch applies dense attention to a local window, capturing fine-grained, position-dependent relationships within a fixed radius (e.g., 4,096 tokens). The second branch applies block-sparse, learnable attention over global tokens: a learned router selects a small set of key-value blocks, so the model can retrieve relevant information from anywhere in the sequence without attending to every token. This hybrid design lets MSA maintain long-range coherence (critical for tasks like summarization and reasoning) while cutting total attention operations by 80–90% compared to dense baselines.

MSA was trained on a 109-billion-parameter Mixture-of-Experts (MoE) model with a budget of 3 trillion tokens and demonstrated strong empirical performance. Training followed a sparsity schedule: early stages used denser attention so the router could learn meaningful key-value selections, and later stages gradually increased sparsity to maximize efficiency without sacrificing accuracy.
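The two-branch pattern described above can be illustrated with a small NumPy sketch. This is not MiniMax's implementation: the function names, the mean-pooled block score standing in for the learned router, the restriction to fully past blocks, and the choice to merge both branches into a single softmax are all assumptions made for illustration. The sketch also builds a dense mask, so it shows which positions are attended rather than delivering any actual speedup.

```python
import numpy as np

def two_branch_mask(q, k, window, block_size, top_k):
    """Boolean [T, T] mask: True where query i may attend to key j.

    Assumes T is divisible by block_size (illustrative simplification).
    """
    T = q.shape[0]
    idx = np.arange(T)
    dist = idx[:, None] - idx[None, :]          # i - j
    # Branch 1: dense causal attention inside a local window.
    local = (dist >= 0) & (dist < window)

    # Branch 2: each query selects top_k key blocks.
    # Assumption: a dot product with the mean-pooled block key stands in
    # for the learned router, whose form the article does not specify.
    n_blocks = T // block_size
    block_keys = k.reshape(n_blocks, block_size, -1).mean(axis=1)
    scores = q @ block_keys.T                    # [T, n_blocks]
    # Only blocks that end at or before the query position are eligible,
    # so the pooled score never looks at future keys.
    block_end = (np.arange(n_blocks) + 1) * block_size - 1
    scores = np.where(block_end[None, :] <= idx[:, None], scores, -np.inf)
    chosen = np.argsort(-scores, axis=1)[:, :top_k]
    block_sel = np.zeros((T, n_blocks), dtype=bool)
    np.put_along_axis(block_sel, chosen, True, axis=1)
    block_sel &= np.isfinite(scores)             # drop ineligible picks
    global_branch = np.repeat(block_sel, block_size, axis=1)
    return local | global_branch

def masked_attention(q, k, v, mask):
    """Softmax attention restricted to the positions allowed by mask."""
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

def density(mask):
    """Fraction of the causal (lower-triangular) positions that are kept."""
    T = mask.shape[0]
    return mask.sum() / (T * (T + 1) / 2)

rng = np.random.default_rng(0)
T, d = 256, 16
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
mask = two_branch_mask(q, k, window=32, block_size=16, top_k=2)
out = masked_attention(q, k, v, mask)
print(out.shape, density(mask))
```

In this toy setting each query attends to at most `window + top_k * block_size` keys regardless of sequence length, which is where the reduction relative to dense causal attention comes from. A real kernel would gather only the selected blocks instead of masking a full score matrix.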
On standard benchmarks including LongBench and RULER, as well as custom in-house long-context evaluations, the model matched or exceeded full-attention baselines on tasks with sequences of up to 128K tokens, while reducing attention-related FLOPs by over 70% during inference. In 2026 deployment scenarios, this translates to 3–5× faster generation on the same hardware, which enables cost-sensitive applications such as interactive document assistants and real-time chat systems.

A key technical contribution of MSA is a learnable block-sparse routing mechanism that predicts attention sparsity patterns jointly with the MoE expert routing, minimizing overhead and enabling end-to-end optimization. The block-sparse formulation also maps well onto modern GPU hardware (e.g., NVIDIA H100/B200, AMD MI300X): each selected block is a dense tile, so the work that remains after block selection can run on the same matrix units that accelerate dense attention. This hardware-software co-design allows MSA to achieve high utilization even at extreme sparsity levels (e.g., 90%+ sparsity).

By the first half of 2026, MiniMax reported that MSA-enabled models process contexts of up to 512K tokens on a single 8-GPU node, a milestone for long-document legal analysis and codebase-level reasoning.

In summary, MiniMax Sparse Attention offers a practical, trainable path to efficient long-context inference without sacrificing quality. Its two-branch block-sparse design, validated at the 109B-parameter MoE scale, provides a blueprint for future LLMs that must balance compute cost, speed, and context length. As the AI infrastructure landscape evolves in 2026, with increasing emphasis on reducing inference costs and carbon footprints, MSA-like mechanisms are likely to become standard components in next-generation transformer architectures.

via MarkTechPost
