arXiv cs.CL
6/18/2026

JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting
Short summary
JetFlow addresses a fundamental scaling limitation of speculative decoding by combining efficient single-forward-pass drafting with branch-wise causal conditioning. It achieves up to 9.64x speedup on MATH-500 and up to 4.58x on conversational workloads on H100 GPUs. Open-source code and models are available on GitHub, along with a vLLM integration.
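The summary does not say how JetFlow implements branch-wise causal conditioning. A common building block in tree-based speculative decoding, assumed here purely for illustration, is an ancestor-only attention mask: each draft node sees the verified prefix plus its own branch, but never sibling branches, so every branch can be scored in one forward pass. The sketch below builds only the draft-to-draft part of that mask; the function name and the `parents` encoding are hypothetical and are not taken from JetFlow's code.

```python
def tree_attention_mask(parents: list[int]) -> list[list[bool]]:
    """Hypothetical helper: build an ancestor-only mask for a draft tree.

    parents[i] is the index of node i's parent, or -1 if node i attaches
    directly to the already-verified prefix. Nodes must be ordered so each
    parent precedes its children (e.g. BFS order). mask[i][j] is True iff
    node j is node i itself or one of its ancestors. Attention to the
    verified prefix is assumed to be always allowed and is not shown.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i, p in enumerate(parents):
        if p < -1 or p >= i:
            raise ValueError("each parent must be -1 or precede its child")
        if p >= 0:
            # Copy the parent's row: the child inherits its branch prefix.
            mask[i] = list(mask[p])
        # Every node attends to itself.
        mask[i][i] = True
    return mask


if __name__ == "__main__":
    # Two root-level candidates (0, 1); node 0 has children 2 and 3; node 4 extends 2.
    for row in tree_attention_mask([-1, -1, 0, 0, 2]):
        print("".join("x" if allowed else "." for allowed in row))
```

Because sibling nodes such as 2 and 3 cannot see each other, each branch is conditioned causally only on its own path, which is what lets a single target pass verify all branches at once.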
- Solves the scaling-ceiling problem, in which a larger draft budget stops improving LLM inference speed because of the tradeoff between token acceptance and drafting overhead
- Achieves up to 9.64x speedup on MATH-500 and up to 4.58x on conversational tasks through causal parallel tree drafting
- Production-ready, with an open-source implementation and vLLM integration demonstrated under realistic serving loads
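The scaling ceiling in the first bullet can be made concrete with the standard expected-speedup model for chain-style speculative decoding from Leviathan et al. (2023). This is a minimal sketch of that classical analysis, not JetFlow's own model; `alpha` (per-token acceptance rate, assumed i.i.d.), `gamma` (draft length) and `c` (cost of one draft pass relative to one target pass) are that model's assumptions, and the numbers below are illustrative choices.

```python
def expected_accepted_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens produced per verification step with a chain of
    gamma drafted tokens: (1 - alpha**(gamma + 1)) / (1 - alpha)."""
    if alpha == 1.0:
        # Every draft is accepted, plus the target's own extra token.
        return gamma + 1.0
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Walltime improvement factor: tokens per step divided by step cost,
    where a step is gamma sequential draft passes (each costing c) plus
    one target verification pass (cost 1)."""
    return expected_accepted_tokens(alpha, gamma) / (gamma * c + 1)


if __name__ == "__main__":
    alpha, c = 0.8, 0.05  # illustrative values, not from the paper
    for gamma in (1, 2, 4, 8, 16, 32):
        print(gamma, round(expected_speedup(alpha, gamma, c), 2))
```

With `alpha = 0.8` and `c = 0.05`, tokens per step can never exceed 1 / (1 - alpha) = 5, while step cost keeps growing linearly in `gamma`, so the modeled speedup peaks around `gamma = 8` and then declines. That diminishing-then-negative return on draft budget is the ceiling the summary describes.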
Generated with AI, which can make mistakes.