Weekly Review #1

2026.05.19 ~ 2026.06.05

Because this is the first Weekly Review, it covers every paper I have read so far rather than only the past week.

Recent VLA/WAM research is moving beyond a simple “image + language → action” policy toward a structured robot intelligence stack that, in service of closed-loop robot behavior, also covers action representation, latent reasoning, future prediction, inference scheduling, and data repair loops.

| Axis | Key shift | Representative papers |
| --- | --- | --- |
| Generalist VLA / Foundation Model | Unifies multiple tasks, embodiments, and environments into a single embodied action/world-modeling framework | Qwen-VLA, Cosmos 3, τ0-WM |
| World Action Model / Video-Action Model | Uses future video, action-conditioned rollouts, 3D point dynamics, and skeleton-conditioned representations for action reasoning | τ0-WM, OSCAR, PointAction, SANTS, Flash-WAM |
| Inference-time Adaptation | Instead of spending the same compute on every control step, adjusts scheduling/post-training/execution according to state, phase, uncertainty, delay, and task deadline | ElegantVLA, SANTS, DVAC, PACE, Realtime-VLA FLASH, OxyGen, DEFLECT |
| Data and Sim-to-Real Loop | Expands data coverage through failure cases, recovery trajectories, high-fidelity simulation, and digital generation | HyperSim, VLAMotor, GRAIL, Factory-Floor case study |
| Structured Reasoning / Skill / Spatial Priors | Introduces continuous latent reasoning, 3D priors, skill-aware MoE, and a visual evidence budget in place of language CoT | Continuous Reasoning, 3DThinkVLA, SMoDP, See Less Specify More |
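
The Inference-time Adaptation axis is easier to picture with a toy example. The sketch below is not the method of any paper listed above; it only illustrates the general idea of not spending the same compute on every control step. The function name `denoising_steps`, the uncertainty score in [0, 1], and the step bounds are all hypothetical.

```python
def denoising_steps(uncertainty: float, min_steps: int = 2, max_steps: int = 10) -> int:
    """Map a (hypothetical) uncertainty score in [0, 1] to a denoising-step budget.

    Low uncertainty -> few steps (cheap control step),
    high uncertainty -> many steps (more compute where it matters).
    """
    u = min(max(uncertainty, 0.0), 1.0)  # clamp out-of-range scores
    return min_steps + round(u * (max_steps - min_steps))


# Toy episode: uncertainty rises near contact, then falls again.
episode_uncertainty = [0.0, 0.25, 1.0, 0.5, 0.0]
budgets = [denoising_steps(u) for u in episode_uncertainty]
print(budgets)
```

A learned scheduler (for example the RL scheduler or state-adaptive denoising scheduler mentioned in the notes) would replace this hand-written linear mapping with a trained policy; the linear rule here is only a placeholder.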
| Paper | Core regime | Interpretation |
| --- | --- | --- |
| Qwen-VLA | mixed foundation-model training | pretrained VLM + scratch action expert + T2A/CPT/SFT/RL |
| Cosmos 3 | foundation-model pretraining/post-training | omnimodal world model + robot policy post-training |
| τ0-WM | WAM training / foundation-model-style training | integrates VAM, ACVS, and action selection/rectification |
| SMoDP | scratch-training | trains a skill-conditioned MoE diffusion policy |
| Factory-Floor VLA | fine-tuning | fine-tunes pretrained π0.5 for industrial tasks |
| HyperSim | fine-tuning / sim-to-real co-training | trains the policy on simulation data + small real data |
| VLAMotor | fine-tuning via data synthesis | fine-tunes a VLA on failure-derived repaired data |
| GRAIL | data-generation + policy fine-tuning | trains tracking/visual policies on generated 4D HOI data |
| 3DThinkVLA | fine-tuning + auxiliary-module-training + component-scratch-training | uses 3D teacher/adapters during training; runs as a 2D image-only VLA at inference |
| PointAction | fine-tuning + component-scratch-training | video model fine-tuning + action decoder training |
| OSCAR | fine-tuning | fine-tunes a pretrained video DiT into a skeleton-conditioned WAM |
| Flash-WAM | distillation / fine-tuning | modality-aware consistency distillation of a teacher WAM |
| ElegantVLA | scheduler-training | trains a lightweight RL scheduler on top of a frozen VLA |
| SANTS | scheduler-training | trains a state-adaptive denoising scheduler on top of a frozen WAM |
| Realtime-VLA FLASH | auxiliary-module-training | draft model + consistency verification framework |
| DEFLECT | fine-tuning | post-trains a flow-matching VLA with delay-derived preference pairs |
| OxyGen | training-free | inference system centered on a unified KV cache manager |
| PACE | training-free | test-time execution based on action-chunk speed valleys |
| DVAC | training-free | adaptive chunking based on denoising variance |
| Continuous Reasoning | fine-tuning | continuous thought interface and self-verification objective |
| See Less, Specify More | fine-tuning | trains local language relabeling + visual evidence budget |
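
As a concrete picture of what a training-free, variance-based chunking rule could look like, here is a minimal sketch. It is one plausible reading of "adaptive chunking based on denoising variance", not DVAC's actual procedure. Assumptions: K candidate action chunks are sampled from the policy, actions are scalars for simplicity, and the executed prefix ends at the first timestep whose across-sample variance exceeds a threshold. The function name `adaptive_chunk_length` and its parameters are hypothetical.

```python
from statistics import pvariance


def adaptive_chunk_length(samples, threshold, min_len=1):
    """Return how many steps of an action chunk to execute before replanning.

    samples:   list of K candidate chunks, each a list of H scalar actions
               (assumed to come from K independent denoising runs).
    threshold: variance above which the candidates are treated as disagreeing.
    min_len:   always execute at least this many steps.
    """
    horizon = len(samples[0])
    for t in range(horizon):
        # Population variance across the K candidates at timestep t.
        var_t = pvariance([chunk[t] for chunk in samples])
        if var_t > threshold:
            return max(t, min_len)
    return horizon  # candidates agree over the whole chunk


# Three toy candidates that agree for two steps and then diverge.
candidates = [
    [0.0, 0.1, 0.5, 1.0],
    [0.0, 0.1, -0.5, -1.0],
    [0.0, 0.1, 0.0, 0.0],
]
print(adaptive_chunk_length(candidates, threshold=0.05))
```

The point of the sketch is only that no weights are updated: the frozen policy is sampled as-is, and the execution length is decided at test time.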
