Peking University and DeepSeek have open-sourced DSpark, a speculative decoding framework that delivers up to a 661% throughput gain and slashes compute waste in live production systems.
On 27 June 2026, researchers from Peking University and DeepSeek jointly introduced and open-sourced DSpark, a speculative decoding framework designed to make Large Language Model (LLM) inference more efficient.
Released under an MIT license via the DeepSpec GitHub repository, DSpark functions as an engineering layer rather than a newly trained standalone model. It builds on existing checkpoints by attaching a speculative draft module. The framework rests on two core architectural changes:
- Semi-Autoregressive Generation: It pairs an enhanced parallel backbone network with lightweight sequential modules (Markov heads) to preserve token dependencies within blocks. This successfully resolves the traditional “acceptance rate decay” over long text sequences, allowing a two-layer DSpark module to outperform traditional five-layer parallel architectures.
- Confidence-Scheduled Verification: Using a hardware-aware prefix scheduler, the system tracks real-time server concurrency and token survival probability. It dynamically scales the verification length, checking more tokens when GPUs are idle and fewer during peak traffic, to avoid wasted compute.
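To make the scheduling idea concrete, here is a minimal sketch of how a verification length could be chosen from draft confidences and server load. It is not DSpark's actual scheduler: the function names, the linear load-to-threshold rule, and the constants 0.1 and 0.8 are all assumptions made for illustration.

```python
from typing import List, Sequence


def survival_probabilities(confidences: Sequence[float]) -> List[float]:
    """Probability that the draft is still correct at each position.

    Assumes (hypothetically) that per-token draft confidences multiply.
    """
    out: List[float] = []
    p = 1.0
    for c in confidences:
        p *= c
        out.append(p)
    return out


def choose_verify_length(
    confidences: Sequence[float],
    active_requests: int,
    max_requests: int,
) -> int:
    """Pick how many drafted tokens to send to the target model.

    Hypothetical rule: the survival threshold rises linearly with load,
    so an idle server verifies long prefixes and a busy one short prefixes.
    """
    if max_requests <= 0:
        raise ValueError("max_requests must be positive")
    if not confidences:
        return 0
    load = min(max(active_requests / max_requests, 0.0), 1.0)
    threshold = 0.1 + 0.8 * load  # illustrative constants, not from the source
    length = 0
    for p in survival_probabilities(confidences):
        if p < threshold:
            break
        length += 1
    return max(1, length)  # always verify at least one drafted token


draft_confidences = [0.95, 0.9, 0.8, 0.7, 0.6]
idle = choose_verify_length(draft_confidences, active_requests=0, max_requests=64)
busy = choose_verify_length(draft_confidences, active_requests=64, max_requests=64)
print(idle, busy)
```

With these toy confidences the idle server verifies the whole five-token draft, while the fully loaded server verifies only the first token, which is the behaviour the bullet describes.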
DSpark is already deployed in DeepSeek-V4’s online production systems. Compared with the previous single-token production baseline, end-to-end text generation speed for users increased by 60% to 85% on DeepSeek-V4-Flash and by 57% to 78% on DeepSeek-V4-Pro, with identical output quality.
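Output quality can stay identical because, in speculative decoding generally, the target model checks every drafted token and keeps only those it would have produced itself. The sketch below shows the greedy variant of that check with toy stand-in models. `toy_target`, `toy_draft`, and the other names are hypothetical, and a real system would verify the whole block in one batched forward pass rather than with sequential calls.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(target_next: NextToken, draft_next: NextToken,
                     prefix: List[int], block_size: int) -> List[int]:
    """Draft a block, then keep the longest prefix the target agrees with."""
    draft: List[int] = []
    ctx = list(prefix)
    for _ in range(block_size):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    accepted: List[int] = []
    ctx = list(prefix)
    for token in draft:
        expected = target_next(ctx)
        if token != expected:
            accepted.append(expected)  # target's correction replaces the miss
            return accepted
        accepted.append(token)
        ctx.append(token)
    accepted.append(target_next(ctx))  # bonus token when the whole draft passes
    return accepted


def generate(target_next: NextToken, draft_next: NextToken,
             prompt: List[int], n_tokens: int, block_size: int = 4) -> List[int]:
    """Generate n_tokens with speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target_next, draft_next, out, block_size))
    return out[: len(prompt) + n_tokens]


def greedy(target_next: NextToken, prompt: List[int], n_tokens: int) -> List[int]:
    """Plain one-token-at-a-time decoding, used as the reference."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out


def toy_target(ctx: List[int]) -> int:
    """Deterministic stand-in for the large model."""
    return (31 * sum(ctx) + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    """Stand-in draft model that is wrong whenever len(ctx) is a multiple of 3."""
    guess = toy_target(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 50


prompt = [1, 2, 3]
assert generate(toy_target, toy_draft, prompt, 20) == greedy(toy_target, prompt, 20)
```

The draft model only affects how many tokens are accepted per step, and therefore speed. Every emitted token is one the target model chose for that exact context.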
Under heavy server loads and strict latency Service Level Agreements (SLAs), the aggregate throughput gains are substantial:
- V4-Flash: Achieved a 51% throughput gain at an 80 token/s SLA, surging to a 661% throughput gain at a strict 120 token/s SLA.
- V4-Pro: Registered a 52% throughput gain at 35 token/s and a 406% throughput gain at a 50 token/s SLA.
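Percentage gains of this size are easier to compare as multipliers over the baseline. The short snippet below does that conversion for the four reported figures, assuming each gain is measured relative to the baseline throughput at the same SLA. The helper name `gain_to_multiplier` is illustrative.

```python
def gain_to_multiplier(gain_percent: float) -> float:
    """A gain of g% means the new throughput is (1 + g/100) times the baseline."""
    return 1.0 + gain_percent / 100.0


# (model, SLA in token/s) -> reported throughput gain in percent
reported_gains = {
    ("V4-Flash", 80): 51,
    ("V4-Flash", 120): 661,
    ("V4-Pro", 35): 52,
    ("V4-Pro", 50): 406,
}

for (model, sla), gain in reported_gains.items():
    multiplier = gain_to_multiplier(gain)
    print(f"{model} at {sla} token/s: +{gain}% = {multiplier:.2f}x baseline")
```

Read this way, the 661% figure corresponds to roughly 7.6 times the baseline throughput under the strict V4-Flash SLA, and the 406% figure to roughly 5.1 times the baseline for V4-Pro.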
The DeepSpec codebase is compatible across models and has been validated on other open-source architectures, including Alibaba’s Qwen3 series and Google’s Gemma4-12B. Against baselines such as Eagle3, DSpark increased the average effective token acceptance length by up to 30.9%.