MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation

2026-06-09 Korean inference-time success-rate WAM fine-tuning component-scratch-training

Cosmos-Predict2.5 기반 Video DiT의 intermediate denoising feature를 Motion DiT action policy에 주입하고, SONIC 기반 unified whole-body motion token으로 humanoid의 상·하체를 한 action space에 묶어 Unitree G1에서 real-time loco-manipulation을 수행

Overview Figure

motionwam_overview

Summary

기존 humanoid loco-manipulation은 보통 high-level manipulation policy가 upper body만 세밀하게 제어하고, lower body는 low-level locomotion controller가 velocity, torso height, orientation 같은 coarse command를 추종하는 hierarchical 구조라서 upper/lower body action space가 불일치하고 다리가 task-driven interaction에 적극적으로 쓰이지 못한다.
동시에 WAM은 video dynamics prior를 policy에 넣을 수 있어 temporal coherence와 physical grounding 측면에서 유망하지만, high-dimensional video-action latent를 반복 denoising해야 해서 real-time humanoid control에는 너무 느리다는 문제가 있다.
MotionWAM은 Video DiT + Motion DiT의 dual-DiT 구조를 사용하되, fully denoised future video를 만들지 않고 Video DiT branch의 intermediate denoising feature를 single forward pass로 뽑아 Motion DiT action policy에 condition으로 넣는다.
학습은 Stage 1 egocentric video pretraining, Stage 2 cross-embodiment action post-training, Stage 3 Unitree G1 whole-body teleoperation fine-tuning의 3단계로 구성되며, output은 SONIC 기반 unified whole-body motion token과 continuous end-effector/gripper channel로 구성된다.
실험에서는 9개 real-world Unitree G1 loco-manipulation task에서 MotionWAM이 strongest baseline인 GR00T-N1.7의 43.9% 대비 76.1% overall success rate를 달성하고, A100 기준 Cosmos Policy보다 7배 빠른 4.9 Hz chunk-wise inference frequency를 달성한다.