Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
Summary (EN)
The paper introduces X-WAM, a unified 4D world model designed to combine robot action generation with high-fidelity future-world synthesis in a single framework. The authors argue that earlier unified world models remained limited to 2D pixel-space modeling and struggled to balance action efficiency against detailed world modeling. X-WAM instead predicts multi-view RGB-D video, so it can support both action execution and reconstruction of future spatial structure. To accomplish this, the system reuses pretrained video diffusion priors and adds a lightweight depth-prediction branch by replicating the late Diffusion Transformer blocks, enabling future 3D reconstruction alongside visual generation. The paper's second technical contribution is Asynchronous Noise Sampling, an inference scheme that uses fewer denoising steps for action decoding while keeping a fuller denoising process for high-quality video generation; the aim is to preserve real-time controllability without giving up detailed world synthesis (rough sketches of both mechanisms follow the summaries). According to the paper, X-WAM was pretrained on more than 5,800 hours of robotic data and achieved average success rates of 79.2% on RoboCasa and 90.7% on RoboTwin 2.0, while outperforming prior methods on visual and geometric metrics for 4D reconstruction and generation. The work points to a practical direction for robotics and embodied AI, in which a single model can both decide what to do next and imagine the physical consequences in a temporally and spatially richer form.
Summary (ZH)
The paper proposes X-WAM, a unified 4D world model that aims to perform robot action generation and high-fidelity future-world synthesis within a single framework. The authors note that previous unified world models mostly stayed at the level of 2D pixel-space modeling, making it hard to balance action-execution efficiency with fine-grained world modeling. X-WAM instead predicts multi-view RGB-D video, coupling action execution with reconstruction of future spatial structure. Concretely, it reuses the visual priors of a pretrained video diffusion model and replicates several late Diffusion Transformer blocks to build a lightweight depth-prediction branch, so that the corresponding 3D spatial information is reconstructed while future video is generated. The paper's second key technique is Asynchronous Noise Sampling, an asynchronous denoising inference strategy that decodes actions quickly with fewer steps to meet real-time control requirements, while retaining a fuller denoising chain to generate high-quality video. The authors report that, after pretraining on more than 5,800 hours of robotic data, the model achieves average success rates of 79.2% on RoboCasa and 90.7% on RoboTwin 2.0, and surpasses existing methods on visual and geometric metrics for 4D reconstruction and generation. Overall, this work demonstrates a more deployment-oriented route for embodied intelligence, in which a single model can both decide the next action and "imagine" the consequences of that action in a richer spatiotemporal form.
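To make the depth-branch idea concrete, here is a minimal PyTorch sketch of "replicating late Diffusion Transformer blocks" as both summaries describe it. Everything here is an assumption drawn from that one-sentence description, not the paper's code: the class name `DepthBranchDiT`, the `blk(h, cond)` block signature, the linear depth head, and the choice of `num_replicated = 4` are all hypothetical.

```python
import copy

import torch
import torch.nn as nn


class DepthBranchDiT(nn.Module):
    """Sketch: shared DiT trunk plus a depth branch made by deep-copying
    the last few pretrained blocks (hypothetical interface)."""

    def __init__(self, blocks: nn.ModuleList, rgb_head: nn.Module,
                 hidden_dim: int, num_replicated: int = 4):
        super().__init__()
        split = len(blocks) - num_replicated
        # Early blocks stay shared between the RGB and depth paths.
        self.trunk = blocks[:split]
        # Original late blocks keep producing RGB video tokens.
        self.rgb_tail = blocks[split:]
        # Deep-copied late blocks become the lightweight depth branch;
        # only these copies and the depth head would need fine-tuning.
        self.depth_tail = copy.deepcopy(self.rgb_tail)
        self.rgb_head = rgb_head
        self.depth_head = nn.Linear(hidden_dim, 1)  # per-token depth value

    def forward(self, tokens: torch.Tensor, cond: torch.Tensor):
        h = tokens
        for blk in self.trunk:           # shared computation
            h = blk(h, cond)             # assumed (tokens, conditioning) API
        h_rgb, h_depth = h, h
        for blk in self.rgb_tail:        # original video path
            h_rgb = blk(h_rgb, cond)
        for blk in self.depth_tail:      # replicated depth path
            h_depth = blk(h_depth, cond)
        return self.rgb_head(h_rgb), self.depth_head(h_depth)
```

Duplicating only the tail keeps the branch "lightweight" in the sense the summary suggests: the bulk of the network is computed once and reused by both outputs.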
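Similarly, the following sketch illustrates the general shape of an asynchronous scheme in which action tokens take a short denoising schedule while video tokens take the full one. It assumes a rectified-flow-style velocity predictor integrated with Euler steps; the function names `predict_action_velocity` and `predict_video_velocity`, the step counts, and the two-phase ordering are illustrative assumptions, not the paper's Asynchronous Noise Sampling algorithm.

```python
import torch


@torch.no_grad()
def asynchronous_denoise(model, x_video, x_action, cond,
                         video_steps: int = 50, action_steps: int = 5):
    """Denoise action tokens on a short schedule and video tokens on a
    full schedule, so actions are available quickly for real-time control.
    (Hypothetical interface; velocity-based Euler sampling is assumed.)"""
    # Time runs from t=1 (pure noise) to t=0 (clean sample).
    t_a = torch.linspace(1.0, 0.0, action_steps + 1)
    t_v = torch.linspace(1.0, 0.0, video_steps + 1)

    # Phase 1: a few coarse steps decode the action early, while the
    # video tokens are still noisy.
    for i in range(action_steps):
        vel = model.predict_action_velocity(x_action, x_video, cond, t_a[i])
        x_action = x_action + (t_a[i + 1] - t_a[i]) * vel  # Euler update

    # Phase 2: the video tokens take the full schedule, conditioned on
    # the already-decoded action.
    for i in range(video_steps):
        vel = model.predict_video_velocity(x_video, x_action, cond, t_v[i])
        x_video = x_video + (t_v[i + 1] - t_v[i]) * vel

    return x_action, x_video
```

The design intent mirrored here is the one the summaries state: the short action schedule buys real-time controllability, while the long video schedule preserves generation quality.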
Source