Tags

English

Graduate-School

  • OpenPI 2026-05-22: having ChatGPT and Codex analyze the openpi repository

Korean

MoE

Nsight-Systems

Profiling

Python

VLA

WAM

  • ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing? 2026-06-22: a lightweight WAM that fine-tunes a pretrained image-editing model as the robot policy backbone and, instead of generating future video, passes the layer-wise KV cache formed while learning a single future endpoint to a flow-matching Action Expert to generate action chunks

  • SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation 2026-06-19: a method that jointly fine-tunes the pretrained Cosmos3-Nano video foundation model in three modes (forward dynamics, inverse dynamics, and cross-view inpainting) and, at inference, uses the discrepancy between the commanded action and the action recovered by inverse dynamics from the generated video as a rollout reliability signal, evaluating a frozen VLA policy closed-loop inside a multi-view video world model

  • PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation 2026-06-18: a multi-view world foundation model that combines a pretrained 14B flow-matching video DiT with Geometry-Aware Cross-View Attention, camera-aware Geo-RoPE, and Depth Anything 3-based Latent 3D-REPA to generate 3D-consistent future video for multiple robot cameras, and whose action-conditioned rollouts can serve as the world-prediction backbone of a WAM

  • WAM-RL: World-Action Model Reinforcement Learning with Reconstruction Rewards and Online Video SFT 2026-06-17: a WAM post-training framework that, rather than RL fine-tuning only the actor of a pretrained WAM, also applies KL-regularized video SFT to the world model on successful online rollouts, while the actor receives RL updates from a reconstruction-consistency reward between the imagined and executed futures

  • Geometric Action Model for Robot Policy Learning 2026-06-16: a geometry-grounded World-Action Policy that reuses a pretrained Geometric Foundation Model (GFM) as the robot policy backbone itself rather than as a mere feature extractor, jointly predicting future geometry and action chunks within the GFM latent space

  • Inference-time Policy Steering via Vision and Touch 2026-06-16: an inference-time steering method that leaves the weights of a frozen diffusion robot policy unchanged, predicts the future outcomes of candidate action chunks with an action-conditioned visuo-tactile latent world model, then selects the global action mode using long-horizon vision and refines local contact execution through diffusion editing using short-horizon touch

  • Retrieve, Don’t Retrain: Extending Vision-Language-Action Models to New Tasks at Test Time 2026-06-16: a test-time task adaptation method that, instead of fine-tuning a VLA/WAM policy again for each new task, adds low-cost pool-embodiment demonstrations to a retrieval pool and has the frozen policy generate action chunks conditioned on retrieved trajectories at every control step

  • WAM4D: Fast 4D World Action Model via Spatial Register Tokens 2026-06-15: rather than producing 4D geometry directly as an inference-time output, trains spatial register tokens to predict future depth during training, distilling a geometric foundation prior into a causal video-action WAM, then removes the geometry branch at deployment to generate action chunks quickly

  • µ0: A Scalable 3D Interaction-Trace World Model 2026-06-15: a 3D trace-space world model that, during pretraining, learns semantic 3D interaction traces extracted from heterogeneous videos without action-labeled robot data and, downstream, injects the hidden features of the frozen trace world model into an action expert to build a robot policy

  • WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation 2026-06-12: an action-conditioned latent world model that takes multi-view RGB, proprioception, and action chunks as input and quickly predicts future latent rollouts and reward/value, enabling offline evaluation, synthetic-data policy improvement, and test-time best-of-N planning for VLA policies such as π0.5

  • Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination 2026-06-10: reframes a WAM's future video prediction as low-cost, coarse future guidance that aids action generation rather than as photorealistic video generation, and uses a compact video expert, low-resolution future latents, and asymmetric video-action denoising to bring real-world policy inference latency down to about 98 ms/chunk at roughly 1B parameters

  • AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing 2026-06-09: an asynchronous WAM in which a Video-DiT world planner produces long-horizon latent context at low frequency, while an Action-DiT executor uses OVCR to correct that context against the latest observation and executes short action chunks in a high-frequency closed loop

  • MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation 2026-06-09: injects intermediate denoising features from a Cosmos-Predict2.5-based Video DiT into a Motion DiT action policy and uses SONIC-based unified whole-body motion tokens to bind the humanoid's upper and lower body into a single action space, performing real-time loco-manipulation on the Unitree G1

  • Flash-WAM: Modality-Aware Distillation for World Action Models 2026-06-05: a step-distillation method that distills a WAM's video and action diffusion denoising differently, each according to its own noise regime, accelerating the WAM to the point where real-time chunk-level control is possible while keeping performance close to the teacher's

  • OSCAR: Omni-Embodiment Skeleton-Conditioned World Action Model for Robotics 2026-06-04: fine-tunes the pretrained Cosmos-Predict2.5-2B video DiT with 2D kinematic skeleton conditioning to generate action-conditioned future video across multiple robot embodiments and human hands, and uses it as a proxy for RoboArena policy evaluation

  • Cosmos 3: Omnimodal World Models for Physical AI 2026-06-03: NVIDIA's large-scale foundation model that unifies language, image, video, audio, and action in a single Mixture-of-Transformers (MoT)-based omnimodal world model, treating the VLM, video generator, forward/inverse dynamics, and robot policy as one Physical AI backbone

  • PointAction: 3D Points as Universal Action Representations for Robot Control 2026-06-03: makes a pretrained video diffusion model generate temporally consistent XYZ pointmaps in addition to RGB, and has an embodiment-specific diffusion action decoder convert these 3D point dynamics into action chunks

  • τ0-WM: A Unified Video-Action World Model for Robotic Manipulation 2026-06-02: a manipulation framework that unifies action generation, video prediction, and action-conditioned evaluation on a single shared video diffusion backbone

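  • SANTS: A State-Adaptive Scheduler for World Action Models 2026-05-28: a state-adaptive video denoising scheduler in which the WAM does not denoise the future video to completion every time but decides, based on the current robot state, whether to stop at the current step and how far to skip ahead, improving the success-latency tradeoff over a full-denoising WAM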

Writing

auxiliary-module-training

  • InSight: Self-Guided Skill Acquisition via Steerable VLAs 2026-06-25: a VLM-guided continual skill acquisition framework that automatically decomposes existing demonstrations into primitives to turn a pretrained π0.5 into a primitive-steerable policy, then, for novel tasks, autonomously collects and verifies the missing primitives identified by a VLM using a single-axis controller and retrains the VLA on them so they become a permanent part of its skill vocabulary

  • SPACE: Enabling Learning from Cross-Robot Data Toward Generalist Policies 2026-06-25: has the VLA predict the 6-DoF Cartesian end-effector displacement that must actually be achieved instead of robot-specific control commands, and adapts a linear Action Adapter for each target robot through offline calibration and online LMS, yielding an execution interface robust to cross-embodiment, cross-hardware, and deployment dynamics shifts

  • FlowDPG: Deterministic Policy Gradient on Flow Matching Policies for Real-World Manipulation 2026-06-24: projects the intermediate noisy action of a flow matching robot policy to a clean action chunk in one step, then distills the critic gradient at that point into a value-improved velocity target, enabling offline-to-online real-world RL without backpropagating through the entire denoising ODE

  • SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation 2026-06-19: a method that jointly fine-tunes the pretrained Cosmos3-Nano video foundation model in three modes (forward dynamics, inverse dynamics, and cross-view inpainting) and, at inference, uses the discrepancy between the commanded action and the action recovered by inverse dynamics from the generated video as a rollout reliability signal, evaluating a frozen VLA policy closed-loop inside a multi-view video world model

  • DREAM-Chunk: Reactive Action Chunking with Latent World Model 2026-06-18: a test-time scaling method that predicts, with a lightweight world model, the latent futures of N candidate chunks sampled from a frozen action-chunking VLA and, at every control step, switches to the action of the phase-aligned dreamed state closest to the current observation, increasing within-chunk reactivity without calling the VLA again

  • Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement 2026-06-18: a sim-to-real VLA enhancement framework that freezes a VLA fine-tuned on real-robot demonstrations, trains in simulation a lightweight residual RL policy whose only inputs are the 6-DoF poses of task-relevant objects, proprioception, and the current base VLA action, and attaches it to the real robot without adaptation, raising the average real-world success rate across five FR3 tasks from 42% to 76%

  • LAGO Policy: Latency-Aware Asynchronous Diffusion Policies with Goal-Directed Collision-Free Planning for Smooth Manipulation 2026-06-17: a real-robot manipulation policy that reduces the chunk-boundary jerk and obstacle collisions of a Diffusion Policy run with asynchronous inference, using latency-aware classifier-free guidance, demonstration-derived goal prediction, collision-free trajectory optimization, and spatial-temporal smoothing

  • Inference-time Policy Steering via Vision and Touch 2026-06-16: an inference-time steering method that leaves the weights of a frozen diffusion robot policy unchanged, predicts the future outcomes of candidate action chunks with an action-conditioned visuo-tactile latent world model, then selects the global action mode using long-horizon vision and refines local contact execution through diffusion editing using short-horizon touch

  • T-Rex: Tactile-Reactive Dexterous Manipulation 2026-06-16: a tactile-reactive dexterous VLA that takes the visuomotor prior obtained from tactile-free human egocentric pretraining, aligns it to contact dynamics through tactile-rich robot mid-training, and then links a slow action expert and a fast tactile expert via cascaded flow matching so that it reacts to tactile feedback even within an action chunk

  • Elastic Queries Reinforcement Learning: Self-Aware Policy Execution for VLA Models 2026-06-15: an elastic VLA execution framework that leaves the frozen flow-based VLA untouched while a lightweight RL adaptor dynamically selects, at every query, the latent steering w, the number of denoising steps K, and the execution chunk length C, spending more compute and replanning more often in hard states and using less compute with longer open-loop execution in easy states

  • WAM4D: Fast 4D World Action Model via Spatial Register Tokens 2026-06-15: rather than producing 4D geometry directly as an inference-time output, trains spatial register tokens to predict future depth during training, distilling a geometric foundation prior into a causal video-action WAM, then removes the geometry branch at deployment to generate action chunks quickly

  • EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations 2026-06-12: a human-video-to-robot-demo data engine that converts egocentric human manipulation videos via a digital twin, jointly generating robot observation videos and executable robot action trajectories, and uses them to train real-robot dexterous visuomotor policies

  • Improving Robotic Generalist Policies via Flow Reversal Steering 2026-06-12: a training-free steering method that maps a coarse semantic action to latent noise through the reverse ODE of a frozen flow-matching VLA and then denoises it again, invoking more refined action modes within the generalist policy prior

  • WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation 2026-06-12: an action-conditioned latent world model that takes multi-view RGB, proprioception, and action chunks as input and quickly predicts future latent rollouts and reward/value, enabling offline evaluation, synthetic-data policy improvement, and test-time best-of-N planning for VLA policies such as π0.5

  • Dynamic Execution Horizon Prediction for Chunk-based Robot Policies 2026-06-11: a lightweight execution-horizon predictor that keeps the action generator of a pretrained action-chunking robot policy completely frozen and learns with PPO, from the current observation and the predicted action chunk, how many steps to execute open-loop on each query

  • SARM2: Multi-Task Stage Aware Reward Modeling for Self Improving Robotic Manipulation 2026-06-10: for self-improvement of VLA policies in long-horizon robotic manipulation, builds a dense reward/value model from an action-primitive stage estimator and a multi-gate MoE value head, and integrates it into SPIRAL's offline-to-online residual RL data flywheel

  • AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing 2026-06-09: an asynchronous WAM in which a Video-DiT world planner produces long-horizon latent context at low frequency, while an Action-DiT executor uses OVCR to correct that context against the latest observation and executes short action chunks in a high-frequency closed loop

  • GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation 2026-06-09: a geometry-aware manipulation policy that augments a Qwen2.5-VL-based VLA with a stop-gradient DiT flow action expert conditioned on the K/V cache of latent action tokens, a VGGT-based 3D spatial encoder, and embodiment canonicalization based on relative end-effector actions, improving transfer to unseen objects, background shifts, and robot embodiments unseen in pretraining

  • Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies 2026-06-09: fine-tunes a few-shot SFT π0.5 flow-matching VLA with offline RL using a fixed self-rollout buffer and the action-gradient of a learned Q-critic, converting the Q-gradient into denoising-time residual velocity supervision rather than a terminal action label

  • 3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training 2026-06-04: co-trains a pretrained VLA on VLA data plus real-world 3D reasoning data, using a 3D foundation model and a reasoning-prompt teacher only during training, so that implicit 3D spatial reasoning is injected into action prediction even at 2D image-only inference

  • GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors 2026-06-04: a data-generation / sim-to-real framework that uses 3D assets and video foundation model priors to generate 4D human-object interaction data for humanoid loco-manipulation entirely digitally, then converts it into a tracking policy and an egocentric visual policy for the Unitree G1 and deploys them on the real robot

  • Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs 2026-05-21: a speculative inference framework that reduces the replanning latency of π0-style flow-matching dVLAs through a lightweight draft and flow-consistency verification

benchmark

  • SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation 2026-06-19: a method that jointly fine-tunes the pretrained Cosmos3-Nano video foundation model in three modes (forward dynamics, inverse dynamics, and cross-view inpainting) and, at inference, uses the discrepancy between the commanded action and the action recovered by inverse dynamics from the generated video as a rollout reliability signal, evaluating a frozen VLA policy closed-loop inside a multi-view video world model

component-scratch-training

cross-embodiment

  • Learning Action Priors for Cross-embodiment Robot Manipulation 2026-06-26: a cross-embodiment policy training framework that, instead of attaching an action head that has not yet learned motor structure directly to a pretrained VLM and training them jointly, first pretrains a flow-matching action encoder-decoder on state-action trajectories alone and then transplants it into the VLA through decoder initialization, decaying latent distillation, and history compression

  • Grounding Generative Policies in Physics: Optimization-Guided Diffusion for Robot Control 2026-06-25: an inference-time constrained diffusion method that, instead of drawing the DDIM sampling noise of a frozen task-space diffusion policy at random, optimizes it to satisfy robot reachability, collision, and controller trackability constraints, enabling cross-embodiment deployment

  • SPACE: Enabling Learning from Cross-Robot Data Toward Generalist Policies 2026-06-25: has the VLA predict the 6-DoF Cartesian end-effector displacement that must actually be achieved instead of robot-specific control commands, and adapts a linear Action Adapter for each target robot through offline calibration and online LMS, yielding an execution interface robust to cross-embodiment, cross-hardware, and deployment dynamics shifts

  • Do as I Do: Dexterous Manipulation Data from Everyday Human Videos 2026-06-18: an offline robot-data engine that reconstructs monocular RGB human manipulation videos into 4D hand-object trajectories, repurposes the pretrained SAM 3D as an object tracker via training-free guided flow sampling, and then converts the result into robot trajectories executable by the 22-DoF Sharpa Wave hand through dynamics-aware sampling optimization in MuJoCo Warp

  • MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction 2026-06-18: pretrains Molmo2 on large-scale human, robot, and in-the-wild video to predict the future 3D world-frame trajectories of object-attached points from RGB history, 2D query points on the object with their corresponding initial 3D coordinates, and a language instruction, and shows that this motion prior transfers to robot policy initialization and video generation guidance

  • ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining 2026-06-17: a unified VLA pretraining framework that converts large-scale egocentric human video into robot-compatible pseudo-actions and combines camera-space actions, morphology conditioning, time-aligned chunking, and a reliability-aware auxiliary loss to use human, robot, and simulation data together for VLA pretraining

  • Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models 2026-06-17: a robot manipulation foundation model that augments a Qwen-VL-based VLA with canonical state/action alignment, camera-frame EEF actions, in-context policy adaptation, and Human-to-Robot synthesis, scaling heterogeneous robot manipulation data coherently and improving OOD task/scene, instruction, and cross-embodiment generalization

  • Retrieve, Don’t Retrain: Extending Vision-Language-Action Models to New Tasks at Test Time 2026-06-16: a test-time task adaptation method that, instead of fine-tuning a VLA/WAM policy again for each new task, adds low-cost pool-embodiment demonstrations to a retrieval pool and has the frozen policy generate action chunks conditioned on retrieved trajectories at every control step

  • EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations 2026-06-12: a human-video-to-robot-demo data engine that converts egocentric human manipulation videos via a digital twin, jointly generating robot observation videos and executable robot action trajectories, and uses them to train real-robot dexterous visuomotor policies

  • GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation 2026-06-09: a geometry-aware manipulation policy that augments a Qwen2.5-VL-based VLA with a stop-gradient DiT flow action expert conditioned on the K/V cache of latent action tokens, a VGGT-based 3D spatial encoder, and embodiment canonicalization based on relative end-effector actions, improving transfer to unseen objects, background shifts, and robot embodiments unseen in pretraining

  • GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors 2026-06-04: a data-generation / sim-to-real framework that uses 3D assets and video foundation model priors to generate 4D human-object interaction data for humanoid loco-manipulation entirely digitally, then converts it into a tracking policy and an egocentric visual policy for the Unitree G1 and deploys them on the real robot

  • OSCAR: Omni-Embodiment Skeleton-Conditioned World Action Model for Robotics 2026-06-04: fine-tunes the pretrained Cosmos-Predict2.5-2B video DiT with 2D kinematic skeleton conditioning to generate action-conditioned future video across multiple robot embodiments and human hands, and uses it as a proxy for RoboArena policy evaluation

diffusion-policy

distillation

  • Flash-WAM: Modality-Aware Distillation for World Action Models 2026-06-05: a step-distillation method that distills a WAM's video and action diffusion denoising differently, each according to its own noise regime, accelerating the WAM to the point where real-time chunk-level control is possible while keeping performance close to the teacher's

fine-tuning

foundation-Model

foundation-model

  • World Value Models for Robotic Manipulation 2026-06-25: a generalist robotic value model that jointly fine-tunes the pretrained Wan2.2 video world model on robot video while attaching a separate lightweight value DiT via Mixture-of-Transformers, generating 4-frame task-progress chunks from video and language with flow matching and using the change in progress to filter and reweight suboptimal data

  • MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction 2026-06-18: pretrains Molmo2 on large-scale human, robot, and in-the-wild video to predict the future 3D world-frame trajectories of object-attached points from RGB history, 2D query points on the object with their corresponding initial 3D coordinates, and a language instruction, and shows that this motion prior transfers to robot policy initialization and video generation guidance

  • PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation 2026-06-18: a multi-view world foundation model that combines a pretrained 14B flow-matching video DiT with Geometry-Aware Cross-View Attention, camera-aware Geo-RoPE, and Depth Anything 3-based Latent 3D-REPA to generate 3D-consistent future video for multiple robot cameras, and whose action-conditioned rollouts can serve as the world-prediction backbone of a WAM

  • Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models 2026-06-17: a robot manipulation foundation model that augments a Qwen-VL-based VLA with canonical state/action alignment, camera-frame EEF actions, in-context policy adaptation, and Human-to-Robot synthesis, scaling heterogeneous robot manipulation data coherently and improving OOD task/scene, instruction, and cross-embodiment generalization

  • Geometric Action Model for Robot Policy Learning 2026-06-16: a geometry-grounded World-Action Policy that reuses a pretrained Geometric Foundation Model (GFM) as the robot policy backbone itself rather than as a mere feature extractor, jointly predicting future geometry and action chunks within the GFM latent space

  • T-Rex: Tactile-Reactive Dexterous Manipulation 2026-06-16: a tactile-reactive dexterous VLA that takes the visuomotor prior obtained from tactile-free human egocentric pretraining, aligns it to contact dynamics through tactile-rich robot mid-training, and then links a slow action expert and a fast tactile expert via cascaded flow matching so that it reacts to tactile feedback even within an action chunk

  • µ0: A Scalable 3D Interaction-Trace World Model 2026-06-15: a 3D trace-space world model that, during pretraining, learns semantic 3D interaction traces extracted from heterogeneous videos without action-labeled robot data and, downstream, injects the hidden features of the frozen trace world model into an action expert to build a robot policy

inference-time

multi-task

real-world

scheduler-training

scratch-training

sim2real

success-rate

  • Learning Action Priors for Cross-embodiment Robot Manipulation 2026-06-26: a cross-embodiment policy training framework that, instead of attaching an action head that has not yet learned motor structure directly to a pretrained VLM and training them jointly, first pretrains a flow-matching action encoder-decoder on state-action trajectories alone and then transplants it into the VLA through decoder initialization, decaying latent distillation, and history compression

  • InSight: Self-Guided Skill Acquisition via Steerable VLAs 2026-06-25: a VLM-guided continual skill acquisition framework that automatically decomposes existing demonstrations into primitives to turn a pretrained π0.5 into a primitive-steerable policy, then, for novel tasks, autonomously collects and verifies the missing primitives identified by a VLM using a single-axis controller and retrains the VLA on them so they become a permanent part of its skill vocabulary

  • PolicyTrim: Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models 2026-06-25: fine-tunes a pretrained VLA with two-stage GRPO-based RL post-training, lengthening the action chunk that can be safely executed per inference and reducing the total number of physical control steps

  • SPACE: Enabling Learning from Cross-Robot Data Toward Generalist Policies 2026-06-25: has the VLA predict the 6-DoF Cartesian end-effector displacement that must actually be achieved instead of robot-specific control commands, and adapts a linear Action Adapter for each target robot through offline calibration and online LMS, yielding an execution interface robust to cross-embodiment, cross-hardware, and deployment dynamics shifts

  • World Value Models for Robotic Manipulation 2026-06-25: a generalist robotic value model that jointly fine-tunes the pretrained Wan2.2 video world model on robot video while attaching a separate lightweight value DiT via Mixture-of-Transformers, generating 4-frame task-progress chunks from video and language with flow matching and using the change in progress to filter and reweight suboptimal data

  • FlowDPG: Deterministic Policy Gradient on Flow Matching Policies for Real-World Manipulation 2026-06-24: projects the intermediate noisy action of a flow matching robot policy to a clean action chunk in one step, then distills the critic gradient at that point into a value-improved velocity target, enabling offline-to-online real-world RL without backpropagating through the entire denoising ODE

  • UniFS: Unified Fast-to-Slow Hierarchical Architecture for Vision-Language-Action Models 2026-06-24: a scheduler-aware VLA architecture that trains each layer group of the pretrained VLM and action expert to run and be cached at a different period, and reverses the order in which VLM features connect to the action decoding stages, raising VLA-Adapter's success rate while lowering average inference latency

  • UniviewVLA: A Unified Multiview Vision-Language-Action Model with World Modeling 2026-06-24: an autoregressive multiview VLA that generates next-scene tokens for candidate auxiliary views from just two frames (agent view and wrist view), compresses them into 16 motion-relevant tokens, and selects the view with the lowest action entropy to generate FAST action tokens

  • ENPIRE: Agentic Robot Policy Self-Improvement in the Real World 2026-06-22: a physical autoresearch harness in which a coding agent directly runs the reset → rollout → verification → policy/code refinement research loop on a real robot, and multiple robot-agent workers share experimental knowledge through Git to automatically improve the task policy

  • Start Right, Arrive Right: Asynchronous Execution via Initial Noise Selection 2026-06-22: an asynchronous inference method that adjusts the initial action noise of a frozen flow-matching robot policy through backward ODE inversion and repainting, connecting the already-executed action prefix to the new action chunk continuously without gradients or retraining

  • SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation 2026-06-19: a method that jointly fine-tunes the pretrained Cosmos3-Nano video foundation model in three modes (forward dynamics, inverse dynamics, and cross-view inpainting) and, at inference, uses the discrepancy between the commanded action and the action recovered by inverse dynamics from the generated video as a rollout reliability signal, evaluating a frozen VLA policy closed-loop inside a multi-view video world model

  • Do as I Do: Dexterous Manipulation Data from Everyday Human Videos 2026-06-18: an offline robot-data engine that reconstructs monocular RGB human manipulation videos into 4D hand-object trajectories, repurposes the pretrained SAM 3D as an object tracker via training-free guided flow sampling, and then converts the result into robot trajectories executable by the 22-DoF Sharpa Wave hand through dynamics-aware sampling optimization in MuJoCo Warp

  • DREAM-Chunk: Reactive Action Chunking with Latent World Model 2026-06-18: a test-time scaling method that predicts, with a lightweight world model, the latent futures of N candidate chunks sampled from a frozen action-chunking VLA and, at every control step, switches to the action of the phase-aligned dreamed state closest to the current observation, increasing within-chunk reactivity without calling the VLA again

  • MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction 2026-06-18: pretrains Molmo2 on large-scale human, robot, and in-the-wild video to predict the future 3D world-frame trajectories of object-attached points from RGB history, 2D query points on the object with their corresponding initial 3D coordinates, and a language instruction, and shows that this motion prior transfers to robot policy initialization and video generation guidance

  • Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement 2026-06-18: a sim-to-real VLA enhancement framework that freezes a VLA fine-tuned on real-robot demonstrations, trains in simulation a lightweight residual RL policy whose only inputs are the 6-DoF poses of task-relevant objects, proprioception, and the current base VLA action, and attaches it to the real robot without adaptation, raising the average real-world success rate across five FR3 tasks from 42% to 76%

  • PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation 2026-06-18: a multi-view world foundation model that combines a pretrained 14B flow-matching video DiT with Geometry-Aware Cross-View Attention, camera-aware Geo-RoPE, and Depth Anything 3-based Latent 3D-REPA to generate 3D-consistent future video for multiple robot cameras, and whose action-conditioned rollouts can serve as the world-prediction backbone of a WAM

  • ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining 2026-06-17: a unified VLA pretraining framework that converts large-scale egocentric human video into robot-compatible pseudo-actions and combines camera-space actions, morphology conditioning, time-aligned chunking, and a reliability-aware auxiliary loss to use human, robot, and simulation data together for VLA pretraining

  • LAGO Policy: Latency-Aware Asynchronous Diffusion Policies with Goal-Directed Collision-Free Planning for Smooth Manipulation 2026-06-17: a real-robot manipulation policy that reduces the chunk-boundary jerk and obstacle collisions of a Diffusion Policy run with asynchronous inference, using latency-aware classifier-free guidance, demonstration-derived goal prediction, collision-free trajectory optimization, and spatial-temporal smoothing

  • Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models 2026-06-17: a robot manipulation foundation model that augments a Qwen-VL-based VLA with canonical state/action alignment, camera-frame EEF actions, in-context policy adaptation, and Human-to-Robot synthesis, scaling heterogeneous robot manipulation data coherently and improving OOD task/scene, instruction, and cross-embodiment generalization

  • WAM-RL: World-Action Model Reinforcement Learning with Reconstruction Rewards and Online Video SFT 2026-06-17: a WAM post-training framework that, rather than RL fine-tuning only the actor of a pretrained WAM, also applies KL-regularized video SFT to the world model on successful online rollouts, while the actor receives RL updates from a reconstruction-consistency reward between the imagined and executed futures

  • Where Should Action Generation Begin? A Learnable Source Prior for Generative Robot Policies 2026-06-17: a source-prior learning method that changes the action generation source of flow matching-based generative robot policies from observation-independent Gaussian noise to a proprioception-conditioned learnable Gaussian prior, and shows that the same source prior can also be plugged into a diffusion-bridge generator

  • Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement 2026-06-17: an inference-time steering and autonomous policy improvement framework that uses a pretrained generalist robot policy as a stochastic action generator, selects a Best-of-N action chunk with a geometric verifier, and reuses successful verified rollouts as BC fine-tuning data

  • Geometric Action Model for Robot Policy Learning 2026-06-16: a geometry-grounded World-Action Policy that reuses a pretrained Geometric Foundation Model (GFM) as the robot policy backbone itself rather than as a mere feature extractor, jointly predicting future geometry and action chunks within the GFM latent space

  • Inference-time Policy Steering via Vision and Touch 2026-06-16: an inference-time steering method that leaves the weights of a frozen diffusion robot policy unchanged, predicts the future outcomes of candidate action chunks with an action-conditioned visuo-tactile latent world model, then selects the global action mode using long-horizon vision and refines local contact execution through diffusion editing using short-horizon touch

  • Retrieve, Don’t Retrain: Extending Vision-Language-Action Models to New Tasks at Test Time 2026-06-16: a test-time task adaptation method that, instead of fine-tuning a VLA/WAM policy again for each new task, adds low-cost pool-embodiment demonstrations to a retrieval pool and has the frozen policy generate action chunks conditioned on retrieved trajectories at every control step

  • T-Rex: Tactile-Reactive Dexterous Manipulation 2026-06-16: a tactile-reactive dexterous VLA that takes the visuomotor prior obtained from tactile-free human egocentric pretraining, aligns it to contact dynamics through tactile-rich robot mid-training, and then links a slow action expert and a fast tactile expert via cascaded flow matching so that it reacts to tactile feedback even within an action chunk

  • Elastic Queries Reinforcement Learning: Self-Aware Policy Execution for VLA Models 2026-06-15: an elastic VLA execution framework that leaves the frozen flow-based VLA untouched while a lightweight RL adaptor dynamically selects, at every query, the latent steering w, the number of denoising steps K, and the execution chunk length C, spending more compute and replanning more often in hard states and using less compute with longer open-loop execution in easy states

  • WAM4D: Fast 4D World Action Model via Spatial Register Tokens 2026-06-15: rather than producing 4D geometry directly as an inference-time output, trains spatial register tokens to predict future depth during training, distilling a geometric foundation prior into a causal video-action WAM, then removes the geometry branch at deployment to generate action chunks quickly

  • µ0: A Scalable 3D Interaction-Trace World Model 2026-06-15: a 3D trace-space world model that, during pretraining, learns semantic 3D interaction traces extracted from heterogeneous videos without action-labeled robot data and, downstream, injects the hidden features of the frozen trace world model into an action expert to build a robot policy

  • EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations 2026-06-12: a human-video-to-robot-demo data engine that converts egocentric human manipulation videos via a digital twin, jointly generating robot observation videos and executable robot action trajectories, and uses them to train real-robot dexterous visuomotor policies

  • Improving Robotic Generalist Policies via Flow Reversal Steering 2026-06-12: a training-free steering method that maps a coarse semantic action to latent noise through the reverse ODE of a frozen flow-matching VLA and then denoises it again, invoking more refined action modes within the generalist policy prior

  • WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation 2026-06-12: an action-conditioned latent world model that takes multi-view RGB, proprioception, and action chunks as input and quickly predicts future latent rollouts and reward/value, enabling offline evaluation, synthetic-data policy improvement, and test-time best-of-N planning for VLA policies such as π0.5

  • Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics 2026-06-11: an imitation learning method that, rather than simply mixing suboptimal / OOD robot demonstrations into Diffusion Policy training, restricts the range of diffusion timesteps over which they can be used, extracting only the useful global plans or local motion primitives

  • Dynamic Execution Horizon Prediction for Chunk-based Robot Policies 2026-06-11: a lightweight execution-horizon predictor that keeps the action generator of a pretrained action-chunking robot policy fully frozen and uses PPO to learn, from the current observation and the predicted action chunk, “how many steps to execute open-loop this time”

  • Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination 2026-06-10: redefines the WAM's future video prediction not as photorealistic video generation but as low-cost coarse future guidance that aids action generation, and with a compact video expert + low-resolution future latent + asymmetric video-action denoising brings real-world policy inference latency down to about 98 ms/chunk at roughly 1B parameters

  • SARM2: Multi-Task Stage Aware Reward Modeling for Self Improving Robotic Manipulation 2026-06-10: for self-improvement of VLA policies in long-horizon robotic manipulation, builds a dense reward/value model from an action-primitive stage estimator and a multi-gate MoE value head, and integrates it into SPIRAL's offline-to-online residual RL data flywheel

  • AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing 2026-06-09: an asynchronous WAM in which a Video-DiT world planner produces long-horizon latent context at low frequency and an Action-DiT executor uses OVCR to correct that context against the latest observation, executing short action chunks in a high-frequency closed loop

  • GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation 2026-06-09: a geometry-aware manipulation policy that combines a Qwen2.5-VL-based VLA with a latent action token K/V cache-conditioned stop-gradient DiT flow action expert, a VGGT-based 3D spatial encoder, and embodiment canonicalization based on relative end-effector actions, improving transfer to unseen objects, background shifts, and robot embodiments unseen in pretraining

  • MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation 2026-06-09: injects intermediate denoising features of a Cosmos-Predict2.5-based Video DiT into a Motion DiT action policy, and uses SONIC-based unified whole-body motion tokens to bind the humanoid's upper and lower body into a single action space, performing real-time loco-manipulation on the Unitree G1

  • Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies 2026-06-09: fine-tunes a few-shot SFT π0.5 flow-matching VLA with offline RL, using a fixed self-rollout buffer and the action-gradient of a learned Q-critic, and turns the Q-gradient into denoising-time residual velocity supervision rather than a terminal action label

  • ActionMap: Robot Policy Learning via Voxel Action Heatmap 2026-06-08: replaces the VLA's existing single-point action decoder with a 3D translation / 3D rotation / gripper voxel heatmap action head, using geometric proximity in the action space as a training signal

  • 3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training 2026-06-04: co-trains a pretrained VLA on VLA data + real-world 3D reasoning data, using a 3D foundation model and a reasoning-prompt teacher only during training, so that implicit 3D spatial reasoning is injected into action prediction even at 2D image-only inference

  • GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors 2026-06-04: a data-generation / sim-to-real framework that uses 3D assets and video foundation model priors to generate 4D human-object interaction data for humanoid loco-manipulation entirely digitally, then converts it into a tracking policy and an egocentric visual policy for the Unitree G1 and deploys them on the real robot

  • OSCAR: Omni-Embodiment Skeleton-Conditioned World Action Model for Robotics 2026-06-04: fine-tunes the pretrained Cosmos-Predict2.5-2B video DiT with a 2D kinematic skeleton condition to generate action-conditioned future video across multiple robot embodiments and human hands, and uses it as a RoboArena policy evaluation proxy

  • Cosmos 3: Omnimodal World Models for Physical AI 2026-06-03: NVIDIA's large-scale foundation model that unifies language, image, video, audio, and action in a single Mixture-of-Transformers (MoT) based omnimodal world model, treating the VLM, video generator, forward/inverse dynamics, and robot policy as one Physical AI backbone

  • PointAction: 3D Points as Universal Action Representations for Robot Control 2026-06-03: makes a pretrained video diffusion model generate temporally consistent XYZ pointmaps in addition to RGB, and has an embodiment-specific diffusion action decoder convert these 3D point dynamics into action chunks

  • See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs 2026-06-03: a planner-executor VLA generalization framework that jointly learns goal-preserving local language and a learned visual evidence budget, so that the VLA executor does not have to infer “what to do / what to look at” on its own from a coarse goal and the full image

  • Continuous Reasoning for Vision-Language-Action 2026-06-02: defines VLA reasoning not as natural-language CoT but as a WAE-regularized Gaussian continuous reasoning interface that other VLA instances can also consume

  • VLAMotor: Test-Guided Enhancement of Vision-Language-Action Models via Agent-Based Data Synthesis 2026-06-02: a failure-driven VLA enhancement framework that actively finds VLA failures using test cases that are far from the training distribution and non-redundant with one another, and has a VLM agent repair the failing trajectories into successful ones for use as fine-tuning data

  • τ0-WM: A Unified Video-Action World Model for Robotic Manipulation 2026-06-02: a manipulation framework that unifies action generation, video prediction, and action-conditioned evaluation on a single shared video diffusion backbone

training-data

  • ENPIRE: Agentic Robot Policy Self-Improvement in the Real World 2026-06-22: a physical autoresearch harness in which a coding agent directly runs the reset → rollout → verification → policy/code refinement research loop on a real robot, while multiple robot–agent workers share experimental knowledge via Git to improve the task policy automatically

  • Do as I Do: Dexterous Manipulation Data from Everyday Human Videos 2026-06-18: an offline robot-data engine that reconstructs monocular RGB human manipulation video into 4D hand–object trajectories, reuses the pretrained SAM 3D as an object tracker via training-free guided flow sampling, and then applies dynamics-aware sampling optimization in MuJoCo Warp to convert the result into robot trajectories executable by the 22-DoF Sharpa Wave hand

  • MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction 2026-06-18: pretrains Molmo2 on large-scale human/robot/in-the-wild video to predict the future 3D world-frame trajectories of object-attached points from RGB history, 2D query points on the object with their corresponding initial 3D coordinates, and a language instruction, and shows that this motion prior transfers to robot policy initialization and video generation guidance

  • Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement 2026-06-18: a sim-to-real VLA enhancement framework that freezes a VLA fine-tuned on real-robot demonstrations, trains in simulation a lightweight residual RL policy that takes only the task-relevant object 6-DoF pose, proprioception, and the current base VLA action as input, and combines it with the VLA on the real robot without adaptation, raising the average real success rate over 5 FR3 tasks from 42% to 76%

  • ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining 2026-06-17: a unified VLA pretraining framework that converts large-scale egocentric human video into robot-compatible pseudo-actions and combines camera-space action / morphology conditioning / time-aligned chunking / reliability-aware auxiliary loss to use human + robot + simulation data together for VLA pretraining

  • Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models 2026-06-17: a robot manipulation foundation model that combines a Qwen-VL-based VLA with canonical state/action alignment, camera-frame EEF action, in-context policy adaptation, and Human-to-Robot synthesis to scale heterogeneous robot manipulation data coherently and improve OOD task/scene, instruction, and cross-embodiment generalization

  • Uncertainty Quantification for Flow-Based Vision-Language-Action Models 2026-06-17: measures ensemble velocity field disagreement (VFD) in the action generation ODE of a flow-matching-based VLA to estimate epistemic uncertainty, and uses it for failure detection and SAVE active fine-tuning data acquisition to improve the sample efficiency of expert demonstrations

  • T-Rex: Tactile-Reactive Dexterous Manipulation 2026-06-16: a tactile-reactive dexterous VLA that adapts a visuomotor prior obtained from tactile-free human egocentric pretraining to contact dynamics through tactile-rich robot mid-training, then connects a slow action expert and a fast tactile expert via cascaded flow matching so that it reacts to tactile feedback even within an action chunk

  • µ0: A Scalable 3D Interaction-Trace World Model 2026-06-15: a 3D trace-space world model that, during pretraining, learns semantic 3D interaction traces extracted from heterogeneous videos without action-labeled robot data, and downstream injects the frozen trace world model's hidden features into an action expert to build a robot policy

  • EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations 2026-06-12: a human-video-to-robot-demo data engine that converts egocentric human manipulation video through a digital twin to generate both robot observation video and executable robot action trajectories, and uses them to train a real-robot dexterous visuomotor policy

  • WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation 2026-06-12: an action-conditioned latent world model that takes multi-view RGB + proprioception + action chunk as input and quickly predicts future latent rollouts and reward/value, enabling offline evaluation, synthetic-data policy improvement, and test-time best-of-N planning for VLA policies such as π0.5

  • Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics 2026-06-11: an imitation learning method that, rather than simply mixing suboptimal / OOD robot demonstrations into Diffusion Policy training, restricts the “usable range” by diffusion timestep so that only the useful global plan or local motion primitive is extracted

  • GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors 2026-06-04: a data-generation / sim-to-real framework that uses 3D assets and video foundation model priors to generate 4D human-object interaction data for humanoid loco-manipulation entirely digitally, then converts it into a tracking policy and an egocentric visual policy for the Unitree G1 and deploys them on the real robot

training-free