Weekly Review #1
2026.05.19 ~ 2026.06.05
Because this is the first Weekly Review, it covers every paper read so far rather than only the past week.
Recent VLA/WAM research is moving beyond a simple “image + language → action” policy toward a structured robot intelligence stack that, in service of closed-loop robot behavior, also covers action representation, latent reasoning, future prediction, inference scheduling, and the data repair loop.
| Axis | Key shift | Representative papers |
|---|---|---|
| Generalist VLA / Foundation Model | Unifies multiple tasks, embodiments, and environments into a single embodied action/world-modeling framework | Qwen-VLA, Cosmos 3, τ0-WM |
| World Action Model / Video-Action Model | Uses future video, action-conditioned rollouts, 3D point dynamics, and skeleton-conditioned representations for action reasoning | τ0-WM, OSCAR, PointAction, SANTS, Flash-WAM |
| Inference-time Adaptation | Instead of spending the same compute at every control step, adjusts scheduling, post-training, and execution according to state, phase, uncertainty, delay, and task deadline | ElegantVLA, SANTS, DVAC, PACE, Realtime-VLA FLASH, OxyGen, DEFLECT |
| Data and Sim-to-Real Loop | Expands data coverage through failure cases, recovery trajectories, high-fidelity simulation, and digital generation | HyperSim, VLAMotor, GRAIL, Factory-Floor case study |
| Structured Reasoning / Skill / Spatial Priors | Introduces continuous latent reasoning, 3D priors, skill-aware MoE, and visual evidence budgets in place of language CoT | Continuous Reasoning, 3DThinkVLA, SMoDP, See Less Specify More |
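
To make the Inference-time Adaptation axis concrete, here is a toy sketch of a per-step compute scheduler. It is not taken from any of the papers listed above. The function name `choose_denoising_steps`, the uncertainty signal in [0, 1], the per-step cost, and the step bounds are all hypothetical and exist only to illustrate the idea of not spending the same compute at every control step.

```python
def choose_denoising_steps(
    uncertainty: float,
    time_budget_ms: float,
    ms_per_step: float = 5.0,   # hypothetical cost of one denoising step
    min_steps: int = 2,         # hypothetical lower bound
    max_steps: int = 10,        # hypothetical upper bound
) -> int:
    """Toy rule: higher uncertainty asks for more steps, capped by the deadline."""
    if not 0.0 <= uncertainty <= 1.0:
        raise ValueError("uncertainty must be in [0, 1]")
    # Interpolate linearly between the minimum and maximum step counts.
    wanted = min_steps + round(uncertainty * (max_steps - min_steps))
    # Number of steps that fit into the remaining time budget.
    affordable = int(time_budget_ms // ms_per_step)
    # Never go below min_steps, even if that overruns a very tight budget.
    return max(min_steps, min(wanted, affordable))


if __name__ == "__main__":
    for u, budget in [(0.0, 100.0), (0.5, 100.0), (1.0, 30.0)]:
        print(u, budget, choose_denoising_steps(u, budget))
```

A learned scheduler, as in the scheduler-training rows of the table that follows, would replace this hand-written rule with a trained policy; the sketch only fixes the interface (state signal in, compute budget out).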

| Paper | Core regime | Interpretation |
|---|---|---|
| Qwen-VLA | mixed foundation-model training | pretrained VLM + scratch action expert + T2A/CPT/SFT/RL |
| Cosmos 3 | foundation-model pretraining/post-training | omnimodal world model + robot policy post-training |
| τ0-WM | WAM training / foundation-model-style training | Integrates VAM, ACVS, and action selection/rectification |
| SMoDP | scratch-training | Trains a skill-conditioned MoE diffusion policy |
| Factory-Floor VLA | fine-tuning | Fine-tunes pretrained π0.5 for industrial tasks |
| HyperSim | fine-tuning / sim-to-real co-training | Trains the policy on simulation data plus a small amount of real data |
| VLAMotor | fine-tuning via data synthesis | Fine-tunes the VLA on repaired data derived from failures |
| GRAIL | data-generation + policy fine-tuning | Trains tracking/visual policies on generated 4D HOI data |
| 3DThinkVLA | fine-tuning + auxiliary-module-training + component-scratch-training | Uses 3D teacher/adapters during training; runs as a 2D image-only VLA at inference |
| PointAction | fine-tuning + component-scratch-training | Video model fine-tuning + action decoder training |
| OSCAR | fine-tuning | Fine-tunes a pretrained video DiT into a skeleton-conditioned WAM |
| Flash-WAM | distillation / fine-tuning | Distills the teacher WAM via modality-aware consistency distillation |
| ElegantVLA | scheduler-training | Trains a lightweight RL scheduler on top of a frozen VLA |
| SANTS | scheduler-training | Trains a state-adaptive denoising scheduler on top of a frozen WAM |
| Realtime-VLA FLASH | auxiliary-module-training | draft model + consistency verification framework |
| DEFLECT | fine-tuning | Post-trains a flow-matching VLA on delay-derived preference pairs |
| OxyGen | training-free | Inference system centered on a unified KV cache manager |
| PACE | training-free | Test-time execution based on speed valleys in action chunks |
| DVAC | training-free | Adaptive chunking based on denoising variance |
| Continuous Reasoning | fine-tuning | Continuous thought interface with a self-verification objective |
| See Less, Specify More | fine-tuning | Learns local language relabeling and a visual evidence budget |
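
As an illustration of what a training-free, variance-based chunking rule can look like, the sketch below picks how many steps of an action chunk to execute before replanning. It is a generic example and not the DVAC algorithm: the function name `adaptive_chunk_length`, the use of scalar actions, the cross-sample variance as the disagreement signal, and the threshold are all assumptions made for illustration.

```python
from statistics import pvariance


def adaptive_chunk_length(samples, threshold, min_len=1):
    """Return how many leading steps of a chunk to execute before replanning.

    samples: K sampled action chunks, each a list of H scalar actions
             (hypothetical input format; real actions are vectors).
    threshold: variance above which the samples are treated as disagreeing.
    """
    if len(samples) < 2:
        raise ValueError("need at least two sampled chunks")
    horizon = len(samples[0])
    if horizon < min_len or any(len(s) != horizon for s in samples):
        raise ValueError("all chunks must share a horizon of at least min_len")

    length = 0
    for t in range(horizon):
        # Disagreement across samples at time step t.
        var_t = pvariance([s[t] for s in samples])
        if var_t > threshold:
            break  # stop at the first step where the samples diverge
        length += 1
    # Always execute at least min_len steps so the robot keeps moving.
    return max(min_len, length)


if __name__ == "__main__":
    chunks = [
        [0.0, 0.1, 0.5, 1.0],
        [0.0, 0.1, 0.7, 0.2],
        [0.0, 0.1, 0.3, 0.6],
    ]
    print(adaptive_chunk_length(chunks, threshold=0.01))
```

The design choice here is to truncate at the first high-variance step rather than average over the whole chunk, so a confident prefix is still executed open-loop while uncertain later steps trigger an earlier replan.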