Today, Xiaomi MiMo releases and open-sources MiMo-V2-Flash, a powerful, efficient, and ultra-fast foundation language model that particularly excels in reasoning, coding, and agentic scenarios, while also serving as an excellent general-purpose assistant for everyday tasks.

The model also debuts globally on Hugging Face, the API Platform, and AI Studio.

MiMo-V2-Flash is a Mixture-of-Experts model with 309B total parameters and 15B active parameters. It adopts a hybrid attention architecture that interleaves sliding-window and full attention, using an aggressive 128-token sliding window and a 5:1 ratio of sliding-window to full-attention layers. Even with this lightweight architecture, the model delivers superior intelligence.
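As a rough sketch (not the official implementation), the 5:1 interleaving can be pictured as a repeating per-layer pattern; the layer count and helper below are hypothetical:

```python
# Illustrative sketch of a 5:1 sliding-window / full-attention interleave.
# The layer count and helper name are hypothetical, not MiMo's actual config.
SLIDING_WINDOW = 128   # tokens each sliding-window layer can attend back to
HYBRID_RATIO = 5       # five sliding-window layers per full-attention layer

def build_layer_pattern(num_layers: int) -> list[str]:
    """Return the attention type used by each layer: 'swa' or 'full'."""
    pattern = []
    for i in range(num_layers):
        # Every (HYBRID_RATIO + 1)-th layer uses full (global) attention.
        pattern.append("full" if (i + 1) % (HYBRID_RATIO + 1) == 0 else "swa")
    return pattern

print(build_layer_pattern(12))
# ['swa', 'swa', 'swa', 'swa', 'swa', 'full', 'swa', 'swa', 'swa', 'swa', 'swa', 'full']
```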
In the math competition AIME 2025 and the scientific knowledge benchmark GPQA-Diamond, MiMo-V2-Flash ranks among the top 2 open-source models, demonstrating strong reasoning ability. On the SWE-bench Verified and SWE-bench Multilingual benchmarks for software engineering, it takes the #1 spot among all open-source models and is on par with the world’s top closed-source models. The model is built for reasoning, coding, and agentic scenarios. It supports a hybrid thinking mode, letting users toggle whether the model “thinks” or answers instantly; it can generate functional HTML webpages with one click and works seamlessly with vibe-coding scaffolds such as Claude Code, Cursor, and Cline; and it offers an ultra-long 256k context window, enabling it to complete tasks across hundreds of rounds of agent interactions and tool calls.
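If you access the model through an OpenAI-compatible endpoint, toggling the hybrid thinking mode might look like the sketch below; the base URL, model name, and the `thinking` field are assumptions for illustration, not documented parameters:

```python
# Hypothetical sketch: calling MiMo-V2-Flash via an OpenAI-compatible API and
# toggling thinking mode. The base_url, model name, and "thinking" field are
# assumptions, not documented parameters.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

response = client.chat.completions.create(
    model="MiMo-V2-Flash",
    messages=[{"role": "user", "content": "Plan a three-day trip to Kyoto."}],
    extra_body={"thinking": False},  # set True to let the model reason first
)
print(response.choices[0].message.content)
```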
Meet MiMo
MiMo-V2-Flash is not just a specialist that can only write code and do math. It can also become your assistant for everyday tasks, and a friend you can exchange ideas with, sparking your inspiration.
Efficiency with Lightning-Fast Inference
MiMo-V2-Flash is engineered for maximum efficiency. It delivers blazing-fast inference at 150 tokens per second while maintaining an ultra-low cost of $0.1 per million input tokens and $0.3 per million output tokens—making it one of the most cost-effective high-performance models available.

MiMo-V2-Flash adopts a 1:5 hybrid of Global Attention (GA) and Sliding Window Attention (SWA). Extensive empirical results show that SWA is simple, efficient, and easy to use, delivering better overall performance than Linear Attention across general tasks, long-context workloads, and reasoning. SWA also provides a fixed-size KV cache, making it easy to integrate with existing training and inference infrastructure. The model further rethinks parallel decoding to achieve extremely high output token throughput: Multi-Token Prediction (MTP) training strengthens the base model, and during inference the MTP draft tokens are verified in parallel.
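As a rough illustration of the fixed-size KV cache point above, the sketch below compares the per-layer cache of a global-attention layer with that of a 128-token SWA layer; the head count, head dimension, and dtype are placeholder values, not MiMo's actual configuration:

```python
# Back-of-the-envelope KV-cache size per layer for one request
# (placeholder dimensions, not MiMo's actual configuration).
def kv_cache_bytes(num_tokens: int, n_kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    # Factor of 2 accounts for storing both K and V.
    return 2 * num_tokens * n_kv_heads * head_dim * dtype_bytes

context_len = 256_000   # a global-attention layer caches the whole context
window = 128            # a sliding-window layer caches at most 128 tokens

print(f"GA layer : {kv_cache_bytes(context_len) / 1e9:.2f} GB")   # ~1.05 GB
print(f"SWA layer: {kv_cache_bytes(window) / 1e6:.2f} MB")        # ~0.52 MB
```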

MiMo-V2-Flash leverages MTP as a native draft model for self-speculative decoding, delivering real deployment speedups. LLM decoding is inherently memory-bound due to low arithmetic intensity. Batch-level parallelism is commonly used to increase FFN arithmetic intensity but does not benefit attention computation, as each request maintains its own KV cache. In contrast, MTP lifts the arithmetic intensity of both FFN and attention by generating multiple draft tokens, which the main model then verifies in parallel.
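As a simplified worked example of the arithmetic-intensity argument (placeholder dimensions; activation and KV-cache traffic ignored), verifying several tokens in one pass reuses each weight read for that many tokens' worth of FLOPs:

```python
# Simplified arithmetic-intensity estimate for one weight matrix during
# decoding (placeholder sizes; activations and KV traffic are ignored).
def arithmetic_intensity(d_in: int, d_out: int, tokens_per_pass: int,
                         dtype_bytes: int = 2) -> float:
    flops = 2 * d_in * d_out * tokens_per_pass   # one matmul per token
    bytes_moved = d_in * d_out * dtype_bytes     # weights are read once per pass
    return flops / bytes_moved

# Plain decoding handles one token per forward pass; MTP verification scores
# the draft tokens against the same weight read.
print(arithmetic_intensity(4096, 4096, tokens_per_pass=1))  # 1.0 FLOP/byte
print(arithmetic_intensity(4096, 4096, tokens_per_pass=4))  # 4.0 FLOP/byte
```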
This approach enables token-level parallelism without increasing KV cache I/O. In MiMo-V2-Flash, the MTP block is deliberately kept lightweight to prevent it from becoming a new inference bottleneck. It uses a dense FFN (not MoE) to limit parameter count and SWA (instead of GA) to reduce KV cache and attention computation costs. Despite this lean design, the MTP module achieves a high acceptance rate. In our measurements with a 3-layer MTP, it attains an accepted length of 2.8–3.6 tokens and an effective speedup of 2.0–2.6×.
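The verification loop itself can be sketched as follows; `draft_fn` and `verify_fn` are hypothetical stand-ins for the MTP head and the main model, and real systems accept or reject against logits rather than greedy token IDs:

```python
# Minimal sketch of self-speculative decoding with a lightweight draft head.
# draft_fn and verify_fn are hypothetical stand-ins, not MiMo's actual API.
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft_fn: Callable[[List[int], int], List[int]],
                     verify_fn: Callable[[List[int]], List[int]],
                     num_draft: int = 3) -> List[int]:
    # 1) The lightweight MTP head proposes several future tokens cheaply.
    draft = draft_fn(prefix, num_draft)
    # 2) The main model scores prefix + draft in one parallel forward pass,
    #    returning its own next-token choice at every position.
    verified = verify_fn(prefix + draft)
    # 3) Accept draft tokens until the first disagreement, then take the main
    #    model's token there; at least one new token is always produced.
    out = list(prefix)
    for i, tok in enumerate(draft):
        main_tok = verified[len(prefix) + i - 1]
        out.append(main_tok)
        if main_tok != tok:
            break
    return out

# Toy usage: a draft head and a "main model" that both continue with last+1,
# so every draft token is accepted.
draft = lambda seq, k: [seq[-1] + j + 1 for j in range(k)]
verify = lambda seq: [t + 1 for t in seq]
print(speculative_step([1, 2, 3], draft, verify))  # [1, 2, 3, 4, 5, 6]
```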

Try Xiaomi MiMo Now
Open-source
Xiaomi MiMo is officially open-sourced to the public. Read the technical report for full model details.
Model weights, including MiMo-V2-Flash-Base, are available on Hugging Face under the MIT license.
On Day 0, we contributed all inference code to SGLang. We work closely with the SGLang team and have shared insights on MiMo-V2-Flash inference on the LMSYS blog.


