DeepSeek-V4: a million-token context that agents can actually use
Summary (EN)
On April 24, Hugging Face published an analysis of DeepSeek's newly released V4 family, presenting it as one of the most practically important open-weight launches for agentic workloads rather than simply another benchmark-driven model update. The post says DeepSeek released two new checkpoints, V4 Pro and V4 Flash, both with 1-million-token context windows, and argues that the real novelty is not raw scale but the architecture's focus on making very long context economically usable.

According to the write-up, DeepSeek alternates compressed sparse attention and heavily compressed attention across layers so that single-token inference cost and KV-cache growth stay far lower than in prior architectures at long sequence lengths. The article highlights DeepSeek's claim that, at 1 million tokens, V4 Pro needs only 27% of the single-token inference FLOPs of V3.2 and around 10% of the KV cache, while V4 Flash reduces these requirements further.

Beyond the core architecture, the post describes several agent-specific choices: preserving reasoning traces across user turns when tool calls are involved, adopting a dedicated XML-based tool-call schema, and training with a large sandbox system designed for reinforcement-learning rollouts in real tool environments. Benchmark results cited in the piece place V4 near frontier closed models on software and tool-use tasks. The release matters because it directly targets a real deployment pain point for long-running AI agents: keeping long tool trajectories coherent without exploding inference cost or memory usage.
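To make the scale of the claimed KV-cache savings concrete, here is a minimal back-of-the-envelope sketch in Python. The model dimensions (layer count, KV heads, head size, fp16 storage) are assumptions chosen only for illustration, not DeepSeek's actual configuration; only the ~10% ratio comes from the article's claim.

```python
# Back-of-the-envelope KV-cache sizing at a 1M-token context.
# All dimensions below are hypothetical; only the ~10% ratio is from the article.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Standard KV-cache size: K and V tensors (factor of 2) for every layer."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

SEQ_LEN = 1_000_000
baseline = kv_cache_bytes(SEQ_LEN, n_layers=61, n_kv_heads=8, head_dim=128)
compressed = baseline // 10  # the article's ~10% claim for V4 Pro vs. V3.2

print(f"baseline KV cache : {baseline / 2**30:7.1f} GiB")
print(f"compressed (~10%) : {compressed / 2**30:7.1f} GiB")
```

Even at these modest assumed dimensions the uncompressed cache runs to hundreds of GiB at 1 million tokens, which is why a roughly 10x reduction changes what hardware a long agent session can realistically run on.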
Summary (ZH)
Hugging Face 于 4 月 24 日发布了一篇对 DeepSeek 新一代 V4 模型家族的分析文章,将其描述为近期最值得关注的开源权重发布之一,原因并不只是基准分数,而是它明显围绕 agentic 工作负载做了更现实的工程优化。文章称,DeepSeek 当天推出了 V4 Pro 与 V4 Flash 两个新检查点,二者都支持 100 万 token 上下文窗口。分析强调,这次发布的真正新意不在于参数规模本身,而在于模型架构试图让超长上下文在经济上真正可用。

根据文中说明,DeepSeek 通过在不同层交替使用 compressed sparse attention 与 heavily compressed attention,显著降低了长序列场景下单 token 推理成本和 KV cache 增长速度。文章援引 DeepSeek 的数据称,在 100 万 token 条件下,V4 Pro 的单 token 推理 FLOPs 约为 V3.2 的 27%,KV cache 仅约为其 10%,而 V4 Flash 还进一步下降。

除核心架构外,文章还指出 DeepSeek 针对代理任务加入了多项设计,例如在涉及工具调用时跨用户轮次保留推理痕迹、采用专门的 XML 工具调用格式,以及借助大规模沙盒系统进行真实工具环境下的强化学习训练。文中引用的基准结果显示,V4 在软件工程和工具使用任务上已逼近部分闭源前沿模型。其重要性在于,它直面长时运行 AI 代理的真实部署难题,也就是如何在不让成本和内存失控的前提下维持长工具链任务的连贯性。
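Both summaries mention a dedicated XML-based tool-call schema, but the article does not reproduce DeepSeek's actual format. The sketch below is therefore a generic illustration with hypothetical tag and attribute names, parsed with Python's standard-library ElementTree, just to show what a structured XML tool-call message could look like in practice:

```python
import xml.etree.ElementTree as ET

# Hypothetical tool-call message; the tag and attribute names are illustrative,
# not DeepSeek's actual schema.
raw = """<tool_call name="read_file">
  <arg name="path">src/main.py</arg>
  <arg name="max_lines">200</arg>
</tool_call>"""

def parse_tool_call(text):
    """Extract the tool name and its arguments from one XML tool-call block."""
    root = ET.fromstring(text)
    args = {a.get("name"): a.text for a in root.findall("arg")}
    return root.get("name"), args

name, args = parse_tool_call(raw)
print(name, args)
```

A fixed XML envelope like this is easy to validate and to extract reliably from a long generation stream, which is one plausible reason to prefer it over free-form JSON embedded in prose.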
Source