NVIDIA has open-sourced DFlash, a block diffusion model that speeds up autoregressive LLM inference by up to 15×. It works with vLLM and SGLang and ships with checkpoints on Hugging Face, making high-performance speculative decoding more accessible to developers.
NVIDIA researchers have open-sourced DFlash, a lightweight block diffusion model that accelerates autoregressive large language model (LLM) inference by replacing sequential speculative drafting with parallel block generation. Alongside the release, the team published 20 DFlash model checkpoints on Hugging Face, deployment recipes for Blackwell and Hopper GPUs, and support for model families including Qwen, Kimi K2.6, Llama, Gemma and gpt-oss.
The release integrates natively with leading open-source inference frameworks including vLLM, SGLang and the Speculators library, allowing engineers to swap EAGLE-3 speculative decoding for DFlash with a simple configuration change. Open checkpoints are also supported through TensorRT-LLM.
Unlike conventional speculative decoding, which still drafts tokens sequentially, DFlash predicts an entire block of masked future tokens in a single forward pass. The approach enables block-parallel drafting and verification, improving GPU utilisation, reducing latency and preserving the target model’s output quality.
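The draft-then-verify loop described above can be illustrated with a toy sketch. It is not DFlash's implementation: `target_next`, `draft_block` and `verify` are hypothetical stand-ins that use a simple arithmetic rule in place of real models, and the drafter is deliberately made imperfect. The sketch only shows why greedy verification keeps the output identical to plain autoregressive decoding while needing fewer target steps.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption for this sketch)


def target_next(prefix: List[int]) -> int:
    """Hypothetical stand-in for the target model's greedy next token."""
    return (sum(prefix) * 31 + len(prefix)) % VOCAB


def draft_block(prefix: List[int], k: int) -> List[int]:
    """Hypothetical stand-in for a block drafter proposing k tokens at once.

    A real block drafter would fill all k positions in one forward pass;
    the loop here only simulates guesses that are sometimes wrong.
    """
    ctx = list(prefix)
    block = []
    for _ in range(k):
        tok = target_next(ctx)
        if len(ctx) % 3 == 0:  # inject a wrong guess at some positions
            tok = (tok + 1) % VOCAB
        block.append(tok)
        ctx.append(tok)
    return block


def verify(prefix: List[int], block: List[int]) -> List[int]:
    """Keep the longest drafted prefix the target agrees with.

    On the first mismatch, the target's own token is emitted instead.
    A real system scores every drafted position in one target pass.
    """
    ctx = list(prefix)
    accepted = []
    for tok in block:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # correction token from the target
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # extra token when all drafts pass
    return accepted


def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def speculative_decode(prompt: List[int], n_new: int, k: int) -> Tuple[List[int], int]:
    """Draft k tokens, verify them, repeat. Returns (tokens, verify steps)."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n_new:
        out.extend(verify(out, draft_block(out, k)))
        steps += 1
    return out[: len(prompt) + n_new], steps


if __name__ == "__main__":
    tokens, steps = speculative_decode([1, 2, 3], n_new=12, k=4)
    print("same output as greedy:", tokens == greedy_decode([1, 2, 3], 12))
    print("verification steps:", steps, "vs 12 sequential target steps")
```

Each verification step emits at least one token, so the loop never needs more target steps than plain decoding. Any real speedup depends on how often the drafted tokens are accepted.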
On an eight-node NVIDIA DGX B300 system running gpt-oss-120B with TensorRT-LLM, DFlash delivered more than 15× higher throughput than autoregressive decoding at 500–600 tokens per second per user, 1.5× higher throughput than EAGLE-3, and more than 2× higher interactivity at batch size one. It also achieved throughput gains of up to 5.8× in vLLM deployments and 5.1× in SGLang. These gains make it well suited for coding, reasoning, multilingual and agentic AI workloads.