arXiv cs.CL
6/18/2026
JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

Short summary

JetFlow addresses a fundamental scaling limitation of speculative decoding by combining efficient one-forward drafting with branch-wise causal conditioning. On H100 GPUs it reaches speedups of up to 9.64x on MATH-500 and 4.58x on conversational workloads. Code and models are open source on GitHub, with vLLM integration.

  • Targets the scaling ceiling of speculative decoding, where a larger draft budget stops improving LLM inference speed because of the tradeoff between token acceptance and drafting overhead
  • Achieves up to 9.64x speedup on MATH-500 and 4.58x on conversational tasks through causal parallel tree drafting
  • Ships an open-source implementation with vLLM integration, evaluated under realistic serving loads
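
The scaling ceiling can be illustrated with the standard expected-speedup analysis for speculative decoding (Leviathan et al., 2023). The sketch below is not JetFlow's method. It assumes an i.i.d. per-token acceptance rate alpha, a sequential drafter that spends gamma * c target-forward equivalents to propose gamma tokens, and it ignores tree drafting entirely. The function names and the parameter values in the usage example are hypothetical.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target verification step.

    Assumes each drafted token is accepted independently with probability alpha.
    The result saturates at 1 / (1 - alpha) as gamma grows.
    """
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Walltime speedup over plain autoregressive decoding.

    c is the cost of one draft forward pass relative to one target forward pass.
    A sequential drafter pays gamma * c per step, so cost grows linearly in gamma.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * c + 1.0)


def best_draft_length(alpha: float, c: float, max_gamma: int = 64) -> int:
    """Draft length in [1, max_gamma] that maximizes the modeled speedup."""
    return max(range(1, max_gamma + 1), key=lambda g: expected_speedup(alpha, g, c))


if __name__ == "__main__":
    alpha, c = 0.8, 0.05  # hypothetical acceptance rate and draft cost ratio
    for gamma in (1, 2, 4, 8, 16, 32):
        print(f"gamma={gamma:2d}  speedup={expected_speedup(alpha, gamma, c):.3f}")
    print("best gamma:", best_draft_length(alpha, c))
```

With alpha = 0.8 and c = 0.05, the modeled speedup peaks at gamma = 8 (about 3.09x) and falls to about 1.92x at gamma = 32: accepted tokens per step cannot exceed 1 / (1 - alpha) = 5, while sequential drafting cost keeps growing. This is the acceptance-versus-overhead tradeoff the summary describes; how JetFlow's parallel tree drafting changes each term is specified in the paper, not in this toy model.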

Generated with AI, which can make mistakes.