arXiv cs.CL
6/19/2026

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
Short summary
DeepSeek releases the V4 series: two MoE models (1.6T parameters with 49B active, and 284B parameters with 13B active) that support 1M-token context windows. Architectural improvements, namely compressed sparse attention (CSA), heavily compressed attention (HCA), and manifold-constrained hyper-connections, reduce inference FLOPs by 73% and KV cache size by 90% relative to V3.2. The models are pre-trained on 32T diverse tokens with the Muon optimizer for faster convergence and improved training stability.
- MoE architecture with 1.6T- and 284B-parameter variants supporting a 1M-token context
- 73% FLOPs reduction and 90% KV cache savings vs V3.2 through hybrid attention (CSA/HCA)
- Pre-trained on 32T diverse tokens; the Muon optimizer enables faster, more stable training
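
To put the headline figures in proportion, the short sketch below works through the arithmetic implied by the summary: the share of parameters that is active in each model, and the cost that remains after the quoted reductions. The model labels, function names, and constants are illustrative choices rather than anything from the paper, and the summary does not say at which context length the 73% and 90% figures were measured, so the ratios are only a rough reading.

```python
# Figures quoted in the summary (parameter counts in billions).
# The dictionary keys are hypothetical labels, not official model names.
MODELS = {
    "V4 (1.6T total)": {"total_b": 1600.0, "active_b": 49.0},
    "V4 (284B total)": {"total_b": 284.0, "active_b": 13.0},
}

# Reductions relative to V3.2, as stated in the summary.
FLOPS_REDUCTION = 0.73
KV_CACHE_REDUCTION = 0.90


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the total parameters that is active."""
    return active_b / total_b


def remaining_ratio(reduction: float) -> float:
    """Cost left after a fractional reduction, as a multiple of the baseline."""
    return 1.0 - reduction


if __name__ == "__main__":
    for name, params in MODELS.items():
        print(f"{name}: {active_fraction(**params):.1%} of parameters active")
    print(f"Inference FLOPs vs V3.2: {remaining_ratio(FLOPS_REDUCTION):.2f}x")
    print(f"KV cache vs V3.2: {remaining_ratio(KV_CACHE_REDUCTION):.2f}x")
```

Under these figures, roughly 3.1% of the larger model's parameters and 4.6% of the smaller model's are active, while inference FLOPs drop to about 0.27x and the KV cache to about 0.10x of the V3.2 baseline.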
Generated with AI, which can make mistakes.