Tags

English

2. Limited Closed-loop Reproduction, Route-level Profiling, and Wrist-camera Robustness 2026-06-16: A limited closed-loop reproduction and probing project for Realtime-VLA FLASH on Runpod L40S: official checkpoint conversion, LIBERO Goal baseline, synchronized route-level profiling, wrist-camera dropout robustness, and a minimal WristHealthGuard extension
1. Runpod Server Manifest 2026-06-15: public-safe Runpod GPU server snapshot after initial setup

Graduate-School

OpenPI 2026-05-22: ChatGPT랑 codex를 이용해서 openpi 레포지토리 분석하게 시켜보기

Korean

Learning Action Priors for Cross-embodiment Robot Manipulation 2026-06-26: Pretrained VLM에 아직 motor structure를 배우지 못한 action head를 바로 붙여 joint train하는 대신, state-action trajectory만으로 flow-matching action encoder-decoder를 먼저 pretrain한 뒤 decoder initialization, decaying latent distillation, history compression을 통해 VLA에 이식하는 cross-embodiment policy training framework
Grounding Generative Policies in Physics: Optimization-Guided Diffusion for Robot Control 2026-06-25: Frozen task-space diffusion policy의 DDIM sampling noise를 무작위로 뽑는 대신, robot reachability·collision·controller trackability를 만족하도록 최적화하여 cross-embodiment deployment를 수행하는 inference-time constrained diffusion method
InSight: Self-Guided Skill Acquisition via Steerable VLAs 2026-06-25: 기존 demonstration을 자동으로 primitive 단위로 분해해 pretrained π0.5를 primitive-steerable policy로 만들고, novel task에서 VLM이 발견한 missing primitive를 single-axis controller로 자율 수집·검증한 뒤 VLA에 재학습하여 영속적인 skill vocabulary로 편입하는 VLM-guided continual skill acquisition framework
PolicyTrim: Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models 2026-06-25: pretrained VLA를 두 단계의 GRPO 기반 RL post-training으로 fine-tuning하여, 한 번의 inference에서 안전하게 실행할 수 있는 action chunk 길이를 늘리고 전체 physical control step은 줄임
SPACE: Enabling Learning from Cross-Robot Data Toward Generalist Policies 2026-06-25: VLA가 robot-specific control command 대신 실제로 달성해야 할 6-DoF Cartesian end-effector displacement를 예측하게 하고, target robot마다 선형 Action Adapter를 offline calibration과 online LMS로 적응시켜 cross-embodiment·cross-hardware·deployment dynamics shift에 강한 execution interface를 만든다
World Value Models for Robotic Manipulation 2026-06-25: Pretrained Wan2.2 video world model을 robot video로 jointly fine-tune하면서 별도의 lightweight value DiT를 Mixture-of-Transformers로 결합해, video와 language로부터 4-frame task-progress chunk를 flow matching으로 생성하고 그 progress 변화량으로 suboptimal data를 filtering·reweighting하는 generalist robotic value model
FlowDPG: Deterministic Policy Gradient on Flow Matching Policies for Real-World Manipulation 2026-06-24: flow matching robot policy의 중간 noisy action을 clean action chunk로 한 번에 projection한 뒤, 그 지점의 critic gradient를 value-improved velocity target으로 distillation하여 전체 denoising ODE를 backpropagation하지 않고도 offline-to-online real-world RL을 수행
UniFS: Unified Fast-to-Slow Hierarchical Architecture for Vision-Language-Action Models 2026-06-24: pretrained VLM과 action expert의 각 layer group을 서로 다른 주기로 실행·cache하도록 학습하고, VLM feature와 action decoding stage의 연결 순서를 뒤집어, VLA-Adapter의 success rate를 높이면서 평균 inference latency를 줄인 scheduler-aware VLA architecture
UniviewVLA: A Unified Multiview Vision-Language-Action Model with World Modeling 2026-06-24: agent-view와 wrist-view의 두 프레임만으로 candidate auxiliary-views의 다음 장면 token을 생성하고, motion-relevant token 16개로 압축한 뒤 action entropy가 가장 낮은 view를 선택해 FAST action token을 생성하는 autoregressive multiview VLA
ENPIRE: Agentic Robot Policy Self-Improvement in the Real World 2026-06-22: coding agent가 실제 로봇의 reset → rollout → verification → policy/code refinement research loop를 직접 운영하고, 여러 robot–agent worker가 Git으로 실험 지식을 공유하면서 task policy를 자동 개선하게 만든 physical autoresearch harness
Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think 2026-06-22: pretrained π0·GR00T-N1.5·SmolVLA의 VLM backbone과 continuous-action head에서 Centered Kernel Alignment로 표현이 거의 변하지 않는 연속 Transformer layer를 찾아 fine-tuning 전에 정적으로 제거하고, 남은 작은 모델을 downstream fine-tuning하여 학습·추론 비용을 함께 줄이는 VLA structural pruning method
ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing? 2026-06-22: pretrained image-editing model을 robot policy backbone으로 fine-tuning하고, future video를 생성하는 대신 single future endpoint를 학습할 때 형성되는 layer-wise KV cache를 flow-matching Action Expert에 전달하여 action chunk를 생성하는 경량 WAM
Start Right, Arrive Right: Asynchronous Execution via Initial Noise Selection 2026-06-22: frozen flow-matching robot policy의 initial action noise를 backward ODE inversion과 repainting으로 조정하여, 이미 실행된 action prefix와 새 action chunk를 gradient·retraining 없이 연속적으로 연결하는 asynchronous inference method
5. Valid Image Embedding Batching 2026-06-22: observation image를 각각 embed하지 말고 한번에 embedding을 구해서 나중에 split
Weekly Review #3 2026-06-20: 2026.06.15 ~ 2026.06.19
SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation 2026-06-19: pretrained Cosmos3-Nano video foundation model을 forward dynamics, inverse dynamics, cross-view inpainting의 세 mode로 공동 fine-tuning하고, inference에서는 commanded action과 generated video에서 inverse dynamics로 복원한 action의 불일치를 rollout reliability signal로 사용해 frozen VLA policy를 multi-view video world model 안에서 closed-loop 평가하는 method
Do as I Do: Dexterous Manipulation Data from Everyday Human Videos 2026-06-18: monocular RGB human manipulation video를 4D hand–object trajectory로 복원하고, pretrained SAM 3D를 training-free guided flow sampling으로 object tracker처럼 재활용한 뒤, MuJoCo Warp의 dynamics-aware sampling optimization으로 22-DoF Sharpa Wave hand가 실행할 수 있는 robot trajectory로 변환하는 offline robot-data engine
DREAM-Chunk: Reactive Action Chunking with Latent World Model 2026-06-18: frozen action-chunking VLA가 샘플링한 N개 candidate chunk의 latent future를 lightweight world model로 예측하고, 매 control step마다 현재 observation과 가장 가까운 phase-aligned dreamed state의 action으로 전환해 VLA를 다시 호출하지 않고 within-chunk reactivity를 높이는 test-time scaling method
MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction 2026-06-18: RGB history, object 위의 2D query points와 corresponding initial 3D coordinates, language instruction을 입력받아 object-attached point들의 미래 3D world-frame trajectory를 예측하도록 Molmo2를 대규모 human/robot/in-the-wild video로 pretrain하고, 이 motion prior가 robot policy initialization과 video generation guidance로 전이됨을 보임
Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement 2026-06-18: Real-robot demonstrations로 fine-tune한 VLA를 고정한 뒤, task-relevant object 6-DoF pose·proprioception·현재 base VLA action만 입력받는 lightweight residual RL policy를 simulation에서 학습하고 real robot에 adaptation 없이 결합해, FR3 5-task 평균 real success rate를 42%에서 76%로 높인 sim-to-real VLA enhancement framework
PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation 2026-06-18: pretrained 14B flow-matching video DiT에 Geometry-Aware Cross-View Attention, camera-aware Geo-RoPE, Depth Anything 3 기반 Latent 3D-REPA를 결합해 여러 로봇 카메라의 미래 영상을 3D-consistent하게 생성하고, action-conditioned rollout을 WAM의 world-prediction backbone으로 활용할 수 있는 multi-view world foundation model
ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining 2026-06-17: 대규모 egocentric human video를 robot-compatible pseudo-action으로 변환하고, camera-space action / morphology conditioning / time-aligned chunking / reliability-aware auxiliary loss를 결합해 human + robot + simulation 데이터를 함께 VLA pretraining에 쓰는 unified VLA pretraining framework
LAGO Policy: Latency-Aware Asynchronous Diffusion Policies with Goal-Directed Collision-Free Planning for Smooth Manipulation 2026-06-17: asynchronous inference로 실행되는 Diffusion Policy의 chunk boundary jerk와 obstacle collision 문제를 latency-aware classifier-free guidance, demonstration-derived goal prediction, collision-free trajectory optimization, spatial-temporal smoothing으로 줄이는 real-robot manipulation policy
Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models 2026-06-17: Qwen-VL 기반 VLA에 canonical state/action alignment, camera-frame EEF action, in-context policy adaptation, Human-to-Robot synthesis를 결합해 heterogeneous robot manipulation data를 coherent하게 scale하고 OOD task/scene·instruction·cross-embodiment generalization을 끌어올린 robot manipulation foundation model
Uncertainty Quantification for Flow-Based Vision-Language-Action Models 2026-06-17: flow matching 기반 VLA의 action generation ODE에서 ensemble velocity field disagreement(VFD)를 측정해 epistemic uncertainty를 추정하고, 이를 failure detection과 SAVE active fine-tuning data acquisition에 사용해 expert demonstration sample efficiency를 높임
WAM-RL: World-Action Model Reinforcement Learning with Reconstruction Rewards and Online Video SFT 2026-06-17: pretrained WAM에서 actor만 RL fine-tuning하지 않고, successful online rollout으로 world model을 KL-regularized video SFT하며, actor는 imagined future와 executed future의 reconstruction consistency reward로 RL update하는 WAM post-training framework
Where Should Action Generation Begin? A Learnable Source Prior for Generative Robot Policies 2026-06-17: flow matching 기반 generative robot policy의 action generation source를 observation-independent Gaussian noise에서 proprioception-conditioned learnable Gaussian prior로 바꾸고, 같은 source prior가 diffusion-bridge generator에도 plug-in될 수 있음을 보인 source-prior learning method
Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement 2026-06-17: pretrained generalist robot policy를 stochastic action generator로 사용하고, geometric verifier로 Best-of-N action chunk를 선택한 뒤, 성공한 verified rollout을 BC fine-tuning data로 재사용하는 inference-time steering + autonomous policy improvement framework
4. Prefix Fixed-Cost Breakdown and Masked Image Skip 2026-06-17: prefix fixed cost를 image embedding과 prefill 단계로 분해한 뒤, LIBERO에서 mask 처리된 right-wrist image branch가 여전히 vision tower를 통과하는 낭비를 찾아 제거
Acting While Understanding: Asynchronous Semantic-Action Decoupling for Real-Time Vision-Language-Action Models 2026-06-16: VLA 내부 semantic-action interface를 slow semantic understanding과 fast action generation으로 분리하고, stale semantic cache를 action history와 delay-aware training으로 보완해 full VLA를 control rate로 돌리지 않는 high-frequency state-feedback VLA deployment framework
Geometric Action Model for Robot Policy Learning 2026-06-16: pretrained Geometric Foundation Model(GFM)을 단순 feature extractor가 아니라 robot policy backbone 자체로 재활용해, GFM latent space 안에서 future geometry와 action chunk를 함께 예측하는 geometry-grounded World-Action Policy
Inference-time Policy Steering via Vision and Touch 2026-06-16: frozen diffusion robot policy의 weights는 바꾸지 않고, action-conditioned visuo-tactile latent world model로 후보 action chunk의 future outcome을 예측한 뒤, long-horizon vision으로 global action mode를 선택하고 short-horizon touch로 local contact execution을 diffusion editing하는 inference-time steering method
Retrieve, Don’t Retrain: Extending Vision-Language-Action Models to New Tasks at Test Time 2026-06-16: VLA/WAM policy를 새 task마다 다시 fine-tuning하지 않고, 저비용 pool embodiment demonstration을 retrieval pool에 추가한 뒤 frozen policy가 매 control step마다 retrieved trajectory를 조건으로 action chunk를 생성하게 만든 test-time task adaptation method
T-Rex: Tactile-Reactive Dexterous Manipulation 2026-06-16: tactile-free human egocentric pretraining으로 얻은 visuomotor prior를 tactile-rich robot mid-training으로 contact dynamics에 맞춘 뒤, slow action expert와 fast tactile expert를 cascaded flow matching으로 연결해 action chunk 내부에서도 tactile feedback에 반응하는 tactile-reactive dexterous VLA
Elastic Queries Reinforcement Learning: Self-Aware Policy Execution for VLA Models 2026-06-15: frozen flow-based VLA는 그대로 둔 채, lightweight RL adaptor가 매 query마다 latent steering w, denoising steps K, execution chunk length C를 동적으로 선택해 hard state에서는 더 많은 compute와 잦은 replanning을, easy state에서는 낮은 compute와 긴 open-loop execution을 수행하도록 만드는 elastic VLA execution framework
ReactVLA: Fast and Lightweight Reactive Robot Manipulation via Improved Mean Flow Action Generation 2026-06-15: diffusion / flow 기반 VLA policy의 inference latency 병목을 줄이기 위해, action generation을 improved Mean Flow(iMF) 기반 one-to-few-step continuous action chunk generation으로 바꾸고 Attention Residuals(AttnRes) Transformer를 결합한 low-latency reactive robot manipulation policy
WAM4D: Fast 4D World Action Model via Spatial Register Tokens 2026-06-15: 4D geometry를 inference-time output으로 직접 만들지 않고, training-time spatial register token으로 future depth를 예측하게 만들어 geometric foundation prior를 causal video-action WAM에 distill한 뒤, deploy 시 geometry branch를 제거해 action chunk를 빠르게 생성
µ0: A Scalable 3D Interaction-Trace World Model 2026-06-15: pretraining 단계에서는 action-labeled robot data 없이 heterogeneous videos에서 추출한 semantic 3D interaction traces를 학습하고, downstream에서는 frozen trace world model의 hidden features를 action expert에 주입해 robot policy를 만드는 3D trace-space world model
Weekly Review #2 2026-06-13: 2026.06.08 ~ 2026.06.12
EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations 2026-06-12: egocentric human manipulation video를 digital twin 기반으로 변환해, robot observation video와 실행 가능한 로봇 action trajectory를 함께 생성하고, 이를 이용해 real-robot dexterous visuomotor policy를 학습하는 human-video-to-robot-demo data engine
Improving Robotic Generalist Policies via Flow Reversal Steering 2026-06-12: coarse semantic action을 frozen flow-matching VLA의 역방향 ODE로 latent noise에 매핑한 뒤 다시 denoise해, generalist policy prior 안의 더 정교한 action mode를 호출하는 training-free steering 방법
WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation 2026-06-12: multi-view RGB + proprioception + action chunk를 입력으로 미래 latent rollout과 reward/value를 빠르게 예측해, π0.5 같은 VLA policy의 offline evaluation, synthetic-data policy improvement, test-time best-of-N planning을 가능하게 만든 action-conditioned latent world model
Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics 2026-06-11: suboptimal / OOD robot demonstrations를 Diffusion Policy 학습에 그냥 섞지 않고, diffusion timestep에 따라 “쓸 수 있는 구간”을 제한해 유용한 global plan 또는 local motion primitive만 뽑아 쓰는 imitation learning 방법
Dynamic Execution Horizon Prediction for Chunk-based Robot Policies 2026-06-11: pretrained action-chunking robot policy의 action generator는 완전히 고정하고, 현재 observation과 예측된 action chunk를 보고 “이번에 몇 step을 open-loop로 실행할지”를 PPO로 학습하는 lightweight execution-horizon predictor
DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model 2026-06-11: VLA의 synchronous clock 가정이 contact-rich manipulation의 multi-rate sensor structure와 맞지 않는다고 보고, modality별 asynchronous latent buffer + gated cross-attention으로 X-VLA를 100 Hz controller 기반 closed-loop execution에 맞춘다
Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination 2026-06-10: WAM의 미래 영상 예측을 photorealistic video generation이 아니라 action generation을 돕는 저비용 coarse future guidance로 재정의하고, compact video expert + low-resolution future latent + asymmetric video-action denoising으로 약 1B 규모에서 real-world policy inference latency를 약 98 ms/chunk까지 낮춤
SARM2: Multi-Task Stage Aware Reward Modeling for Self Improving Robotic Manipulation 2026-06-10: long-horizon robotic manipulation에서 VLA policy의 self-improvement를 위해, action-primitive stage estimator와 multi-gate MoE value head로 dense reward/value model을 만들고, 이를 SPIRAL의 offline-to-online residual RL data flywheel에 통합한다
AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing 2026-06-09: Video-DiT world planner는 low-frequency로 long-horizon latent context를 만들고, Action-DiT executor는 OVCR로 최신 observation에 맞게 context를 보정해 short action chunk를 high-frequency closed-loop로 실행하는 asynchronous WAM
GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation 2026-06-09: Qwen2.5-VL 기반 VLA에 latent action token K/V cache-conditioned stop-gradient DiT flow action expert, VGGT 기반 3D spatial encoder, relative end-effector action 기반 embodiment canonicalization을 결합해 unseen object / background shift / pretraining-unseen robot embodiment transfer를 개선하는 geometry-aware manipulation policy
MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation 2026-06-09: Cosmos-Predict2.5 기반 Video DiT의 intermediate denoising feature를 Motion DiT action policy에 주입하고, SONIC 기반 unified whole-body motion token으로 humanoid의 상·하체를 한 action space에 묶어 Unitree G1에서 real-time loco-manipulation을 수행
Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies 2026-06-09: few-shot SFT된 π0.5 flow-matching VLA를 고정된 self-rollout buffer와 learned Q-critic의 action-gradient로 offline RL fine-tuning하되, Q-gradient를 terminal action label이 아니라 denoising-time residual velocity supervision으로 바꾸어 학습
ActionMap: Robot Policy Learning via Voxel Action Heatmap 2026-06-08: VLA의 기존 single-point action decoder를 3D translation / 3D rotation / gripper voxel heatmap action head로 교체해, action space의 geometric proximity(인접성)를 학습 신호로 활용
Weekly Review #1 2026-06-06: 2026.05.19 ~ 2026.06.05
Flash-WAM: Modality-Aware Distillation for World Action Models 2026-06-05: WAM의 video/action diffusion denoising을 각각의 noise regime에 맞게 다르게 distill해서, WAM을 거의 teacher 성능에 가깝게 유지하면서 real-time chunk-level control이 가능한 수준까지 가속하는 step-distillation method
3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training 2026-06-04: pretrained VLA를 VLA data + real-world 3D reasoning data로 co-training하면서, 3D foundation model과 reasoning-prompt teacher를 학습 중에만 사용해 2D image-only inference에서도 implicit 3D spatial reasoning을 action prediction에 주입
GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors 2026-06-04: 3D asset과 video foundation model prior를 이용해 humanoid loco-manipulation용 4D human-object interaction 데이터를 완전 디지털로 생성하고, 이를 Unitree G1용 tracking policy와 egocentric visual policy로 변환해 실제 로봇에 배포하는 data-generation / sim-to-real framework
OSCAR: Omni-Embodiment Skeleton-Conditioned World Action Model for Robotics 2026-06-04: pretrained Cosmos-Predict2.5-2B video DiT를 2D kinematic skeleton condition으로 fine-tuning하여, 여러 robot embodiment와 human hand에 걸쳐 action-conditioned future video를 생성하고 이를 RoboArena policy evaluation proxy로 쓴다
Cosmos 3: Omnimodal World Models for Physical AI 2026-06-03: language, image, video, audio, action을 하나의 Mixture-of-Transformers (MoT) 기반 omnimodal world model로 통합해, VLM·video generator·forward/inverse dynamics·robot policy를 하나의 Physical AI backbone으로 다루는 NVIDIA의 대규모 foundation model
Denoising Tells When to Replan: Denoising-Variance Adaptive Chunking for Flow-Based Robot Policies 2026-06-03: last denoising step들에서 clean-action estimate들의 variance를 future action별 stability proxy로 사용해, 안정적인 action prefix만 실행하고 고분산 구간 전에 replan
PointAction: 3D Points as Universal Action Representations for Robot Control 2026-06-03: pretrained video diffusion model이 RGB뿐 아니라 temporally consistent XYZ pointmap까지 생성하게 만들고, 이 3D point dynamics를 embodiment-specific diffusion action decoder가 action chunk로 변환
See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs 2026-06-03: VLA executor가 coarse goal과 full image에서 “무엇을 할지/무엇을 볼지”를 스스로 추론하지 않도록 goal-preserving local language와 learned visual evidence budget을 함께 학습시키는 planner-executor VLA generalization framework
3. Nsight Systems profiling & further optimization 2026-06-03: Nsight Systems를 이용해서 bottleneck 지점을 더 정확하게 찾고 원인 분석 및 최적화
Continuous Reasoning for Vision-Language-Action 2026-06-02: VLA의 reasoning을 자연어 CoT가 아니라, 다른 VLA instance도 consume할 수 있는 WAE-regularized Gaussian continuous reasoning interface로 정의
PACE: Phase-Aware Chunk Execution for Robot Policies with Action Chunking 2026-06-02: action chunking robot policy에서 고정 execution horizon 대신, predicted action chunk의 low-speed valley를 phase boundary로 사용해 매 query마다 실행 길이를 동적으로 선택하는 training-free test-time execution 방법
VLAMotor: Test-Guided Enhancement of Vision-Language-Action Models via Agent-Based Data Synthesis 2026-06-02: training distribution에서 멀고 서로 중복되지 않는 테스트 케이스로 VLA 실패를 적극적으로 찾고, 그 실패 trajectory를 VLM agent가 성공 trajectory로 고쳐 fine-tuning data로 쓰는 failure-driven VLA enhancement framework
τ0-WM: A Unified Video-Action World Model for Robotic Manipulation 2026-06-02: action generation, video prediction, action-conditioned evaluation을 하나의 shared video diffusion backbone 위에서 통합한 manipulation framework
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments 2026-06-01: Qwen3.5 VLM + DiT flow-matching action decoder / embodiment-aware prompt / joint pretraining → generalist VLA (manipulation, navigation, human egocentric motion, trajectory prediction)
0. VAE(Variational AutoEncoder) 2026-05-30: DDPM의 variational perspective를 이해하는 데 필요한 VAE의 핵심 개념을 정리
ElegantVLA: Learning When to Think for Efficient Vision-Language-Action Models 2026-05-29: VLA가 매 control step마다 전부 “생각”하지 않고, 현재 로봇 phase가 안정적인지/민감한지를 보고 Vision-LLM과 action head 계산을 동적으로 재사용하는 plug-in inference scheduler
SANTS: A State-Adaptive Scheduler for World Action Models 2026-05-28: WAM이 매번 미래 영상을 끝까지 denoise하지 않고, 현재 로봇 상태에 따라 “여기서 멈출지”와 “얼마나 크게 건너뛸지”를 결정해 full-denoising WAM 대비 success-latency tradeoff를 개선하는 state-adaptive video denoising scheduler
A Factory-Floor Deployment Case Study of VLA Pipelines for Industrial Packaging Task: Workflow, Failures, and Lessons 2026-05-28: 데이터 수집·teleoperation·runtime·failure analysis 루프를 설계해서 pretrained π0.5를 실제 공장 포장 작업에 배포하는 시도, 그리고 거기서 얻은 교훈들
HyperSim: A Holistic Sim-To-Real Framework For Robust Robotic Manipulation 2026-05-27: 더 현실적인 시뮬레이션 + 더 다양한 recovery trajectory + 소량 real data co-training → zero-shot/few-shot sim-to-real 성능 향상
2. Shallow-π Baseline Latency Check 2026-05-27: Profiling tool들을 사용하기 전에 profiler 없는 순수 latency를 먼저 확인
SMoDP: Semantically Structured Mixture-of-Experts for Compositional Robotic Manipulation 2026-05-25: Diffusion policy의 MoE router를 skill-aware하게 만들어 multi-task manipulation에서 expert를 의미 있는 skill 단위로 재사용하게 만든다
OpenPI 2026-05-22: ChatGPT랑 codex를 이용해서 openpi 레포지토리 분석하게 시켜보기
Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs 2026-05-21: π0-style flow-matching dVLA의 replanning latency를 lightweight draft와 flow-consistency verification으로 줄이는 speculative inference framework
DEFLECT: Delay-Robust Execution via Flow-matching Likelihood-Estimated Counterfactual Tuning for VLA Policies 2026-05-20: fresh observation에서 나온 action이 stale observation에서 나온 action보다 선호된다는 label-free preference pair를 이용해서 async VLA의 delay-robustness를 높이는 offline post-training 방법
1. Shallow-π implementation 2026-05-20: π0 distillation을 통해 Shallow-π 구현 완료
OxyGen: Unified KV Cache Management for VLA Inference under Multi-Task Parallelism 2026-05-19: MoT VLA에서 action과 language task가 공유하는 observation KV cache를 통합 관리해 중복 prefill과 resource contention을 줄이고 action frequency와 language throughput을 동시에 높이는 inference system

MoE

T-Rex: Tactile-Reactive Dexterous Manipulation 2026-06-16: tactile-free human egocentric pretraining으로 얻은 visuomotor prior를 tactile-rich robot mid-training으로 contact dynamics에 맞춘 뒤, slow action expert와 fast tactile expert를 cascaded flow matching으로 연결해 action chunk 내부에서도 tactile feedback에 반응하는 tactile-reactive dexterous VLA
SARM2: Multi-Task Stage Aware Reward Modeling for Self Improving Robotic Manipulation 2026-06-10: long-horizon robotic manipulation에서 VLA policy의 self-improvement를 위해, action-primitive stage estimator와 multi-gate MoE value head로 dense reward/value model을 만들고, 이를 SPIRAL의 offline-to-online residual RL data flywheel에 통합한다
SMoDP: Semantically Structured Mixture-of-Experts for Compositional Robotic Manipulation 2026-05-25: Diffusion policy의 MoE router를 skill-aware하게 만들어 multi-task manipulation에서 expert를 의미 있는 skill 단위로 재사용하게 만든다

Nsight-Systems

3. Nsight Systems profiling & further optimization 2026-06-03: Nsight Systems를 이용해서 bottleneck 지점을 더 정확하게 찾고 원인 분석 및 최적화

Profiling

5. Valid Image Embedding Batching 2026-06-22: observation image를 각각 embed하지 말고 한번에 embedding을 구해서 나중에 split
4. Prefix Fixed-Cost Breakdown and Masked Image Skip 2026-06-17: prefix fixed cost를 image embedding과 prefill 단계로 분해한 뒤, LIBERO에서 mask 처리된 right-wrist image branch가 여전히 vision tower를 통과하는 낭비를 찾아 제거
2. Limited Closed-loop Reproduction, Route-level Profiling, and Wrist-camera Robustness 2026-06-16: A limited closed-loop reproduction and probing project for Realtime-VLA FLASH on Runpod L40S: official checkpoint conversion, LIBERO Goal baseline, synchronized route-level profiling, wrist-camera dropout robustness, and a minimal WristHealthGuard extension
3. Nsight Systems profiling & further optimization 2026-06-03: Nsight Systems를 이용해서 bottleneck 지점을 더 정확하게 찾고 원인 분석 및 최적화
2. Shallow-π Baseline Latency Check 2026-05-27: Profiling tool들을 사용하기 전에 profiler 없는 순수 latency를 먼저 확인

Python

5. Valid Image Embedding Batching 2026-06-22: observation image를 각각 embed하지 말고 한번에 embedding을 구해서 나중에 split
4. Prefix Fixed-Cost Breakdown and Masked Image Skip 2026-06-17: prefix fixed cost를 image embedding과 prefill 단계로 분해한 뒤, LIBERO에서 mask 처리된 right-wrist image branch가 여전히 vision tower를 통과하는 낭비를 찾아 제거
2. Limited Closed-loop Reproduction, Route-level Profiling, and Wrist-camera Robustness 2026-06-16: A limited closed-loop reproduction and probing project for Realtime-VLA FLASH on Runpod L40S: official checkpoint conversion, LIBERO Goal baseline, synchronized route-level profiling, wrist-camera dropout robustness, and a minimal WristHealthGuard extension
3. Nsight Systems profiling & further optimization 2026-06-03: Nsight Systems를 이용해서 bottleneck 지점을 더 정확하게 찾고 원인 분석 및 최적화
2. Shallow-π Baseline Latency Check 2026-05-27: Profiling tool들을 사용하기 전에 profiler 없는 순수 latency를 먼저 확인
1. Shallow-π implementation 2026-05-20: π0 distillation을 통해 Shallow-π 구현 완료

VLA

Learning Action Priors for Cross-embodiment Robot Manipulation 2026-06-26: Pretrained VLM에 아직 motor structure를 배우지 못한 action head를 바로 붙여 joint train하는 대신, state-action trajectory만으로 flow-matching action encoder-decoder를 먼저 pretrain한 뒤 decoder initialization, decaying latent distillation, history compression을 통해 VLA에 이식하는 cross-embodiment policy training framework
InSight: Self-Guided Skill Acquisition via Steerable VLAs 2026-06-25: 기존 demonstration을 자동으로 primitive 단위로 분해해 pretrained π0.5를 primitive-steerable policy로 만들고, novel task에서 VLM이 발견한 missing primitive를 single-axis controller로 자율 수집·검증한 뒤 VLA에 재학습하여 영속적인 skill vocabulary로 편입하는 VLM-guided continual skill acquisition framework
PolicyTrim: Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models 2026-06-25: pretrained VLA를 두 단계의 GRPO 기반 RL post-training으로 fine-tuning하여, 한 번의 inference에서 안전하게 실행할 수 있는 action chunk 길이를 늘리고 전체 physical control step은 줄임
SPACE: Enabling Learning from Cross-Robot Data Toward Generalist Policies 2026-06-25: VLA가 robot-specific control command 대신 실제로 달성해야 할 6-DoF Cartesian end-effector displacement를 예측하게 하고, target robot마다 선형 Action Adapter를 offline calibration과 online LMS로 적응시켜 cross-embodiment·cross-hardware·deployment dynamics shift에 강한 execution interface를 만든다
UniFS: Unified Fast-to-Slow Hierarchical Architecture for Vision-Language-Action Models 2026-06-24: pretrained VLM과 action expert의 각 layer group을 서로 다른 주기로 실행·cache하도록 학습하고, VLM feature와 action decoding stage의 연결 순서를 뒤집어, VLA-Adapter의 success rate를 높이면서 평균 inference latency를 줄인 scheduler-aware VLA architecture
UniviewVLA: A Unified Multiview Vision-Language-Action Model with World Modeling 2026-06-24: agent-view와 wrist-view의 두 프레임만으로 candidate auxiliary-views의 다음 장면 token을 생성하고, motion-relevant token 16개로 압축한 뒤 action entropy가 가장 낮은 view를 선택해 FAST action token을 생성하는 autoregressive multiview VLA
ENPIRE: Agentic Robot Policy Self-Improvement in the Real World 2026-06-22: coding agent가 실제 로봇의 reset → rollout → verification → policy/code refinement research loop를 직접 운영하고, 여러 robot–agent worker가 Git으로 실험 지식을 공유하면서 task policy를 자동 개선하게 만든 physical autoresearch harness
Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think 2026-06-22: pretrained π0·GR00T-N1.5·SmolVLA의 VLM backbone과 continuous-action head에서 Centered Kernel Alignment로 표현이 거의 변하지 않는 연속 Transformer layer를 찾아 fine-tuning 전에 정적으로 제거하고, 남은 작은 모델을 downstream fine-tuning하여 학습·추론 비용을 함께 줄이는 VLA structural pruning method
Start Right, Arrive Right: Asynchronous Execution via Initial Noise Selection 2026-06-22: frozen flow-matching robot policy의 initial action noise를 backward ODE inversion과 repainting으로 조정하여, 이미 실행된 action prefix와 새 action chunk를 gradient·retraining 없이 연속적으로 연결하는 asynchronous inference method
SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation 2026-06-19: pretrained Cosmos3-Nano video foundation model을 forward dynamics, inverse dynamics, cross-view inpainting의 세 mode로 공동 fine-tuning하고, inference에서는 commanded action과 generated video에서 inverse dynamics로 복원한 action의 불일치를 rollout reliability signal로 사용해 frozen VLA policy를 multi-view video world model 안에서 closed-loop 평가하는 method
DREAM-Chunk: Reactive Action Chunking with Latent World Model 2026-06-18: frozen action-chunking VLA가 샘플링한 N개 candidate chunk의 latent future를 lightweight world model로 예측하고, 매 control step마다 현재 observation과 가장 가까운 phase-aligned dreamed state의 action으로 전환해 VLA를 다시 호출하지 않고 within-chunk reactivity를 높이는 test-time scaling method
Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement 2026-06-18: Real-robot demonstrations로 fine-tune한 VLA를 고정한 뒤, task-relevant object 6-DoF pose·proprioception·현재 base VLA action만 입력받는 lightweight residual RL policy를 simulation에서 학습하고 real robot에 adaptation 없이 결합해, FR3 5-task 평균 real success rate를 42%에서 76%로 높인 sim-to-real VLA enhancement framework
ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining 2026-06-17: 대규모 egocentric human video를 robot-compatible pseudo-action으로 변환하고, camera-space action / morphology conditioning / time-aligned chunking / reliability-aware auxiliary loss를 결합해 human + robot + simulation 데이터를 함께 VLA pretraining에 쓰는 unified VLA pretraining framework
Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models 2026-06-17: Qwen-VL 기반 VLA에 canonical state/action alignment, camera-frame EEF action, in-context policy adaptation, Human-to-Robot synthesis를 결합해 heterogeneous robot manipulation data를 coherent하게 scale하고 OOD task/scene·instruction·cross-embodiment generalization을 끌어올린 robot manipulation foundation model
Uncertainty Quantification for Flow-Based Vision-Language-Action Models 2026-06-17: flow matching 기반 VLA의 action generation ODE에서 ensemble velocity field disagreement(VFD)를 측정해 epistemic uncertainty를 추정하고, 이를 failure detection과 SAVE active fine-tuning data acquisition에 사용해 expert demonstration sample efficiency를 높임
Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement 2026-06-17: pretrained generalist robot policy를 stochastic action generator로 사용하고, geometric verifier로 Best-of-N action chunk를 선택한 뒤, 성공한 verified rollout을 BC fine-tuning data로 재사용하는 inference-time steering + autonomous policy improvement framework
Acting While Understanding: Asynchronous Semantic-Action Decoupling for Real-Time Vision-Language-Action Models 2026-06-16: VLA 내부 semantic-action interface를 slow semantic understanding과 fast action generation으로 분리하고, stale semantic cache를 action history와 delay-aware training으로 보완해 full VLA를 control rate로 돌리지 않는 high-frequency state-feedback VLA deployment framework
Geometric Action Model for Robot Policy Learning 2026-06-16: pretrained Geometric Foundation Model(GFM)을 단순 feature extractor가 아니라 robot policy backbone 자체로 재활용해, GFM latent space 안에서 future geometry와 action chunk를 함께 예측하는 geometry-grounded World-Action Policy
Retrieve, Don’t Retrain: Extending Vision-Language-Action Models to New Tasks at Test Time 2026-06-16: VLA/WAM policy를 새 task마다 다시 fine-tuning하지 않고, 저비용 pool embodiment demonstration을 retrieval pool에 추가한 뒤 frozen policy가 매 control step마다 retrieved trajectory를 조건으로 action chunk를 생성하게 만든 test-time task adaptation method
T-Rex: Tactile-Reactive Dexterous Manipulation 2026-06-16: tactile-free human egocentric pretraining으로 얻은 visuomotor prior를 tactile-rich robot mid-training으로 contact dynamics에 맞춘 뒤, slow action expert와 fast tactile expert를 cascaded flow matching으로 연결해 action chunk 내부에서도 tactile feedback에 반응하는 tactile-reactive dexterous VLA
Elastic Queries Reinforcement Learning: Self-Aware Policy Execution for VLA Models 2026-06-15: frozen flow-based VLA는 그대로 둔 채, lightweight RL adaptor가 매 query마다 latent steering w, denoising steps K, execution chunk length C를 동적으로 선택해 hard state에서는 더 많은 compute와 잦은 replanning을, easy state에서는 낮은 compute와 긴 open-loop execution을 수행하도록 만드는 elastic VLA execution framework
ReactVLA: Fast and Lightweight Reactive Robot Manipulation via Improved Mean Flow Action Generation 2026-06-15: diffusion / flow 기반 VLA policy의 inference latency 병목을 줄이기 위해, action generation을 improved Mean Flow(iMF) 기반 one-to-few-step continuous action chunk generation으로 바꾸고 Attention Residuals(AttnRes) Transformer를 결합한 low-latency reactive robot manipulation policy
Improving Robotic Generalist Policies via Flow Reversal Steering 2026-06-12: coarse semantic action을 frozen flow-matching VLA의 역방향 ODE로 latent noise에 매핑한 뒤 다시 denoise해, generalist policy prior 안의 더 정교한 action mode를 호출하는 training-free steering 방법
WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation 2026-06-12: multi-view RGB + proprioception + action chunk를 입력으로 미래 latent rollout과 reward/value를 빠르게 예측해, π0.5 같은 VLA policy의 offline evaluation, synthetic-data policy improvement, test-time best-of-N planning을 가능하게 만든 action-conditioned latent world model
DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model 2026-06-11: VLA의 synchronous clock 가정이 contact-rich manipulation의 multi-rate sensor structure와 맞지 않는다고 보고, modality별 asynchronous latent buffer + gated cross-attention으로 X-VLA를 100 Hz controller 기반 closed-loop execution에 맞춘다
SARM2: Multi-Task Stage Aware Reward Modeling for Self Improving Robotic Manipulation 2026-06-10: long-horizon robotic manipulation에서 VLA policy의 self-improvement를 위해, action-primitive stage estimator와 multi-gate MoE value head로 dense reward/value model을 만들고, 이를 SPIRAL의 offline-to-online residual RL data flywheel에 통합한다
GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation 2026-06-09: Qwen2.5-VL 기반 VLA에 latent action token K/V cache-conditioned stop-gradient DiT flow action expert, VGGT 기반 3D spatial encoder, relative end-effector action 기반 embodiment canonicalization을 결합해 unseen object / background shift / pretraining-unseen robot embodiment transfer를 개선하는 geometry-aware manipulation policy
Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies 2026-06-09: few-shot SFT된 π0.5 flow-matching VLA를 고정된 self-rollout buffer와 learned Q-critic의 action-gradient로 offline RL fine-tuning하되, Q-gradient를 terminal action label이 아니라 denoising-time residual velocity supervision으로 바꾸어 학습
ActionMap: Robot Policy Learning via Voxel Action Heatmap 2026-06-08: VLA의 기존 single-point action decoder를 3D translation / 3D rotation / gripper voxel heatmap action head로 교체해, action space의 geometric proximity(인접성)를 학습 신호로 활용
3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training 2026-06-04: pretrained VLA를 VLA data + real-world 3D reasoning data로 co-training하면서, 3D foundation model과 reasoning-prompt teacher를 학습 중에만 사용해 2D image-only inference에서도 implicit 3D spatial reasoning을 action prediction에 주입
Denoising Tells When to Replan: Denoising-Variance Adaptive Chunking for Flow-Based Robot Policies 2026-06-03: last denoising step들에서 clean-action estimate들의 variance를 future action별 stability proxy로 사용해, 안정적인 action prefix만 실행하고 고분산 구간 전에 replan
See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs 2026-06-03: VLA executor가 coarse goal과 full image에서 “무엇을 할지/무엇을 볼지”를 스스로 추론하지 않도록 goal-preserving local language와 learned visual evidence budget을 함께 학습시키는 planner-executor VLA generalization framework
Continuous Reasoning for Vision-Language-Action 2026-06-02: VLA의 reasoning을 자연어 CoT가 아니라, 다른 VLA instance도 consume할 수 있는 WAE-regularized Gaussian continuous reasoning interface로 정의
PACE: Phase-Aware Chunk Execution for Robot Policies with Action Chunking 2026-06-02: action chunking robot policy에서 고정 execution horizon 대신, predicted action chunk의 low-speed valley를 phase boundary로 사용해 매 query마다 실행 길이를 동적으로 선택하는 training-free test-time execution 방법
VLAMotor: Test-Guided Enhancement of Vision-Language-Action Models via Agent-Based Data Synthesis 2026-06-02: training distribution에서 멀고 서로 중복되지 않는 테스트 케이스로 VLA 실패를 적극적으로 찾고, 그 실패 trajectory를 VLM agent가 성공 trajectory로 고쳐 fine-tuning data로 쓰는 failure-driven VLA enhancement framework
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments 2026-06-01: Qwen3.5 VLM + DiT flow-matching action decoder / embodiment-aware prompt / joint pretraining → generalist VLA (manipulation, navigation, human egocentric motion, trajectory prediction)
ElegantVLA: Learning When to Think for Efficient Vision-Language-Action Models 2026-05-29: VLA가 매 control step마다 전부 “생각”하지 않고, 현재 로봇 phase가 안정적인지/민감한지를 보고 Vision-LLM과 action head 계산을 동적으로 재사용하는 plug-in inference scheduler
A Factory-Floor Deployment Case Study of VLA Pipelines for Industrial Packaging Task: Workflow, Failures, and Lessons 2026-05-28: 데이터 수집·teleoperation·runtime·failure analysis 루프를 설계해서 pretrained π0.5를 실제 공장 포장 작업에 배포하는 시도, 그리고 거기서 얻은 교훈들
HyperSim: A Holistic Sim-To-Real Framework For Robust Robotic Manipulation 2026-05-27: 더 현실적인 시뮬레이션 + 더 다양한 recovery trajectory + 소량 real data co-training → zero-shot/few-shot sim-to-real 성능 향상
Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs 2026-05-21: π0-style flow-matching dVLA의 replanning latency를 lightweight draft와 flow-consistency verification으로 줄이는 speculative inference framework
DEFLECT: Delay-Robust Execution via Flow-matching Likelihood-Estimated Counterfactual Tuning for VLA Policies 2026-05-20: fresh observation에서 나온 action이 stale observation에서 나온 action보다 선호된다는 label-free preference pair를 이용해서 async VLA의 delay-robustness를 높이는 offline post-training 방법
OxyGen: Unified KV Cache Management for VLA Inference under Multi-Task Parallelism 2026-05-19: MoT VLA에서 action과 language task가 공유하는 observation KV cache를 통합 관리해 중복 prefill과 resource contention을 줄이고 action frequency와 language throughput을 동시에 높이는 inference system

WAM

ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing? 2026-06-22: pretrained image-editing model을 robot policy backbone으로 fine-tuning하고, future video를 생성하는 대신 single future endpoint를 학습할 때 형성되는 layer-wise KV cache를 flow-matching Action Expert에 전달하여 action chunk를 생성하는 경량 WAM
SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation 2026-06-19: pretrained Cosmos3-Nano video foundation model을 forward dynamics, inverse dynamics, cross-view inpainting의 세 mode로 공동 fine-tuning하고, inference에서는 commanded action과 generated video에서 inverse dynamics로 복원한 action의 불일치를 rollout reliability signal로 사용해 frozen VLA policy를 multi-view video world model 안에서 closed-loop 평가하는 method
PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation 2026-06-18: pretrained 14B flow-matching video DiT에 Geometry-Aware Cross-View Attention, camera-aware Geo-RoPE, Depth Anything 3 기반 Latent 3D-REPA를 결합해 여러 로봇 카메라의 미래 영상을 3D-consistent하게 생성하고, action-conditioned rollout을 WAM의 world-prediction backbone으로 활용할 수 있는 multi-view world foundation model
WAM-RL: World-Action Model Reinforcement Learning with Reconstruction Rewards and Online Video SFT 2026-06-17: pretrained WAM에서 actor만 RL fine-tuning하지 않고, successful online rollout으로 world model을 KL-regularized video SFT하며, actor는 imagined future와 executed future의 reconstruction consistency reward로 RL update하는 WAM post-training framework
Geometric Action Model for Robot Policy Learning 2026-06-16: pretrained Geometric Foundation Model(GFM)을 단순 feature extractor가 아니라 robot policy backbone 자체로 재활용해, GFM latent space 안에서 future geometry와 action chunk를 함께 예측하는 geometry-grounded World-Action Policy
Inference-time Policy Steering via Vision and Touch 2026-06-16: frozen diffusion robot policy의 weights는 바꾸지 않고, action-conditioned visuo-tactile latent world model로 후보 action chunk의 future outcome을 예측한 뒤, long-horizon vision으로 global action mode를 선택하고 short-horizon touch로 local contact execution을 diffusion editing하는 inference-time steering method
Retrieve, Don’t Retrain: Extending Vision-Language-Action Models to New Tasks at Test Time 2026-06-16: VLA/WAM policy를 새 task마다 다시 fine-tuning하지 않고, 저비용 pool embodiment demonstration을 retrieval pool에 추가한 뒤 frozen policy가 매 control step마다 retrieved trajectory를 조건으로 action chunk를 생성하게 만든 test-time task adaptation method
WAM4D: Fast 4D World Action Model via Spatial Register Tokens 2026-06-15: 4D geometry를 inference-time output으로 직접 만들지 않고, training-time spatial register token으로 future depth를 예측하게 만들어 geometric foundation prior를 causal video-action WAM에 distill한 뒤, deploy 시 geometry branch를 제거해 action chunk를 빠르게 생성
µ0: A Scalable 3D Interaction-Trace World Model 2026-06-15: pretraining 단계에서는 action-labeled robot data 없이 heterogeneous videos에서 추출한 semantic 3D interaction traces를 학습하고, downstream에서는 frozen trace world model의 hidden features를 action expert에 주입해 robot policy를 만드는 3D trace-space world model
WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation 2026-06-12: multi-view RGB + proprioception + action chunk를 입력으로 미래 latent rollout과 reward/value를 빠르게 예측해, π0.5 같은 VLA policy의 offline evaluation, synthetic-data policy improvement, test-time best-of-N planning을 가능하게 만든 action-conditioned latent world model
Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination 2026-06-10: WAM의 미래 영상 예측을 photorealistic video generation이 아니라 action generation을 돕는 저비용 coarse future guidance로 재정의하고, compact video expert + low-resolution future latent + asymmetric video-action denoising으로 약 1B 규모에서 real-world policy inference latency를 약 98 ms/chunk까지 낮춤
AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing 2026-06-09: Video-DiT world planner는 low-frequency로 long-horizon latent context를 만들고, Action-DiT executor는 OVCR로 최신 observation에 맞게 context를 보정해 short action chunk를 high-frequency closed-loop로 실행하는 asynchronous WAM
MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation 2026-06-09: Cosmos-Predict2.5 기반 Video DiT의 intermediate denoising feature를 Motion DiT action policy에 주입하고, SONIC 기반 unified whole-body motion token으로 humanoid의 상·하체를 한 action space에 묶어 Unitree G1에서 real-time loco-manipulation을 수행
Flash-WAM: Modality-Aware Distillation for World Action Models 2026-06-05: WAM의 video/action diffusion denoising을 각각의 noise regime에 맞게 다르게 distill해서, WAM을 거의 teacher 성능에 가깝게 유지하면서 real-time chunk-level control이 가능한 수준까지 가속하는 step-distillation method
OSCAR: Omni-Embodiment Skeleton-Conditioned World Action Model for Robotics 2026-06-04: pretrained Cosmos-Predict2.5-2B video DiT를 2D kinematic skeleton condition으로 fine-tuning하여, 여러 robot embodiment와 human hand에 걸쳐 action-conditioned future video를 생성하고 이를 RoboArena policy evaluation proxy로 쓴다
Cosmos 3: Omnimodal World Models for Physical AI 2026-06-03: language, image, video, audio, action을 하나의 Mixture-of-Transformers (MoT) 기반 omnimodal world model로 통합해, VLM·video generator·forward/inverse dynamics·robot policy를 하나의 Physical AI backbone으로 다루는 NVIDIA의 대규모 foundation model
PointAction: 3D Points as Universal Action Representations for Robot Control 2026-06-03: pretrained video diffusion model이 RGB뿐 아니라 temporally consistent XYZ pointmap까지 생성하게 만들고, 이 3D point dynamics를 embodiment-specific diffusion action decoder가 action chunk로 변환
τ0-WM: A Unified Video-Action World Model for Robotic Manipulation 2026-06-02: action generation, video prediction, action-conditioned evaluation을 하나의 shared video diffusion backbone 위에서 통합한 manipulation framework
SANTS: A State-Adaptive Scheduler for World Action Models 2026-05-28: WAM이 매번 미래 영상을 끝까지 denoise하지 않고, 현재 로봇 상태에 따라 “여기서 멈출지”와 “얼마나 크게 건너뛸지”를 결정해 full-denoising WAM 대비 success-latency tradeoff를 개선하는 state-adaptive video denoising scheduler

Writing

Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement 2026-06-17: pretrained generalist robot policy를 stochastic action generator로 사용하고, geometric verifier로 Best-of-N action chunk를 선택한 뒤, 성공한 verified rollout을 BC fine-tuning data로 재사용하는 inference-time steering + autonomous policy improvement framework
0. VAE(Variational AutoEncoder) 2026-05-30: DDPM의 variational perspective를 이해하는 데 필요한 VAE의 핵심 개념을 정리
DEFLECT: Delay-Robust Execution via Flow-matching Likelihood-Estimated Counterfactual Tuning for VLA Policies 2026-05-20: fresh observation에서 나온 action이 stale observation에서 나온 action보다 선호된다는 label-free preference pair를 이용해서 async VLA의 delay-robustness를 높이는 offline post-training 방법

auxiliary-module-training

InSight: Self-Guided Skill Acquisition via Steerable VLAs 2026-06-25: 기존 demonstration을 자동으로 primitive 단위로 분해해 pretrained π0.5를 primitive-steerable policy로 만들고, novel task에서 VLM이 발견한 missing primitive를 single-axis controller로 자율 수집·검증한 뒤 VLA에 재학습하여 영속적인 skill vocabulary로 편입하는 VLM-guided continual skill acquisition framework
SPACE: Enabling Learning from Cross-Robot Data Toward Generalist Policies 2026-06-25: VLA가 robot-specific control command 대신 실제로 달성해야 할 6-DoF Cartesian end-effector displacement를 예측하게 하고, target robot마다 선형 Action Adapter를 offline calibration과 online LMS로 적응시켜 cross-embodiment·cross-hardware·deployment dynamics shift에 강한 execution interface를 만든다
FlowDPG: Deterministic Policy Gradient on Flow Matching Policies for Real-World Manipulation 2026-06-24: flow matching robot policy의 중간 noisy action을 clean action chunk로 한 번에 projection한 뒤, 그 지점의 critic gradient를 value-improved velocity target으로 distillation하여 전체 denoising ODE를 backpropagation하지 않고도 offline-to-online real-world RL을 수행
SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation 2026-06-19: pretrained Cosmos3-Nano video foundation model을 forward dynamics, inverse dynamics, cross-view inpainting의 세 mode로 공동 fine-tuning하고, inference에서는 commanded action과 generated video에서 inverse dynamics로 복원한 action의 불일치를 rollout reliability signal로 사용해 frozen VLA policy를 multi-view video world model 안에서 closed-loop 평가하는 method
DREAM-Chunk: Reactive Action Chunking with Latent World Model 2026-06-18: frozen action-chunking VLA가 샘플링한 N개 candidate chunk의 latent future를 lightweight world model로 예측하고, 매 control step마다 현재 observation과 가장 가까운 phase-aligned dreamed state의 action으로 전환해 VLA를 다시 호출하지 않고 within-chunk reactivity를 높이는 test-time scaling method
Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement 2026-06-18: Real-robot demonstrations로 fine-tune한 VLA를 고정한 뒤, task-relevant object 6-DoF pose·proprioception·현재 base VLA action만 입력받는 lightweight residual RL policy를 simulation에서 학습하고 real robot에 adaptation 없이 결합해, FR3 5-task 평균 real success rate를 42%에서 76%로 높인 sim-to-real VLA enhancement framework
LAGO Policy: Latency-Aware Asynchronous Diffusion Policies with Goal-Directed Collision-Free Planning for Smooth Manipulation 2026-06-17: asynchronous inference로 실행되는 Diffusion Policy의 chunk boundary jerk와 obstacle collision 문제를 latency-aware classifier-free guidance, demonstration-derived goal prediction, collision-free trajectory optimization, spatial-temporal smoothing으로 줄이는 real-robot manipulation policy
Inference-time Policy Steering via Vision and Touch 2026-06-16: frozen diffusion robot policy의 weights는 바꾸지 않고, action-conditioned visuo-tactile latent world model로 후보 action chunk의 future outcome을 예측한 뒤, long-horizon vision으로 global action mode를 선택하고 short-horizon touch로 local contact execution을 diffusion editing하는 inference-time steering method
T-Rex: Tactile-Reactive Dexterous Manipulation 2026-06-16: tactile-free human egocentric pretraining으로 얻은 visuomotor prior를 tactile-rich robot mid-training으로 contact dynamics에 맞춘 뒤, slow action expert와 fast tactile expert를 cascaded flow matching으로 연결해 action chunk 내부에서도 tactile feedback에 반응하는 tactile-reactive dexterous VLA
Elastic Queries Reinforcement Learning: Self-Aware Policy Execution for VLA Models 2026-06-15: frozen flow-based VLA는 그대로 둔 채, lightweight RL adaptor가 매 query마다 latent steering w, denoising steps K, execution chunk length C를 동적으로 선택해 hard state에서는 더 많은 compute와 잦은 replanning을, easy state에서는 낮은 compute와 긴 open-loop execution을 수행하도록 만드는 elastic VLA execution framework
WAM4D: Fast 4D World Action Model via Spatial Register Tokens 2026-06-15: 4D geometry를 inference-time output으로 직접 만들지 않고, training-time spatial register token으로 future depth를 예측하게 만들어 geometric foundation prior를 causal video-action WAM에 distill한 뒤, deploy 시 geometry branch를 제거해 action chunk를 빠르게 생성
EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations 2026-06-12: egocentric human manipulation video를 digital twin 기반으로 변환해, robot observation video와 실행 가능한 로봇 action trajectory를 함께 생성하고, 이를 이용해 real-robot dexterous visuomotor policy를 학습하는 human-video-to-robot-demo data engine
Improving Robotic Generalist Policies via Flow Reversal Steering 2026-06-12: coarse semantic action을 frozen flow-matching VLA의 역방향 ODE로 latent noise에 매핑한 뒤 다시 denoise해, generalist policy prior 안의 더 정교한 action mode를 호출하는 training-free steering 방법
WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation 2026-06-12: multi-view RGB + proprioception + action chunk를 입력으로 미래 latent rollout과 reward/value를 빠르게 예측해, π0.5 같은 VLA policy의 offline evaluation, synthetic-data policy improvement, test-time best-of-N planning을 가능하게 만든 action-conditioned latent world model
Dynamic Execution Horizon Prediction for Chunk-based Robot Policies 2026-06-11: pretrained action-chunking robot policy의 action generator는 완전히 고정하고, 현재 observation과 예측된 action chunk를 보고 “이번에 몇 step을 open-loop로 실행할지”를 PPO로 학습하는 lightweight execution-horizon predictor
SARM2: Multi-Task Stage Aware Reward Modeling for Self Improving Robotic Manipulation 2026-06-10: long-horizon robotic manipulation에서 VLA policy의 self-improvement를 위해, action-primitive stage estimator와 multi-gate MoE value head로 dense reward/value model을 만들고, 이를 SPIRAL의 offline-to-online residual RL data flywheel에 통합한다
AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing 2026-06-09: Video-DiT world planner는 low-frequency로 long-horizon latent context를 만들고, Action-DiT executor는 OVCR로 최신 observation에 맞게 context를 보정해 short action chunk를 high-frequency closed-loop로 실행하는 asynchronous WAM
GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation 2026-06-09: Qwen2.5-VL 기반 VLA에 latent action token K/V cache-conditioned stop-gradient DiT flow action expert, VGGT 기반 3D spatial encoder, relative end-effector action 기반 embodiment canonicalization을 결합해 unseen object / background shift / pretraining-unseen robot embodiment transfer를 개선하는 geometry-aware manipulation policy
Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies 2026-06-09: few-shot SFT된 π0.5 flow-matching VLA를 고정된 self-rollout buffer와 learned Q-critic의 action-gradient로 offline RL fine-tuning하되, Q-gradient를 terminal action label이 아니라 denoising-time residual velocity supervision으로 바꾸어 학습
3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training 2026-06-04: pretrained VLA를 VLA data + real-world 3D reasoning data로 co-training하면서, 3D foundation model과 reasoning-prompt teacher를 학습 중에만 사용해 2D image-only inference에서도 implicit 3D spatial reasoning을 action prediction에 주입
GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors 2026-06-04: 3D asset과 video foundation model prior를 이용해 humanoid loco-manipulation용 4D human-object interaction 데이터를 완전 디지털로 생성하고, 이를 Unitree G1용 tracking policy와 egocentric visual policy로 변환해 실제 로봇에 배포하는 data-generation / sim-to-real framework
Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs 2026-05-21: π0-style flow-matching dVLA의 replanning latency를 lightweight draft와 flow-consistency verification으로 줄이는 speculative inference framework

benchmark

SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation 2026-06-19: pretrained Cosmos3-Nano video foundation model을 forward dynamics, inverse dynamics, cross-view inpainting의 세 mode로 공동 fine-tuning하고, inference에서는 commanded action과 generated video에서 inverse dynamics로 복원한 action의 불일치를 rollout reliability signal로 사용해 frozen VLA policy를 multi-view video world model 안에서 closed-loop 평가하는 method

component-scratch-training

Learning Action Priors for Cross-embodiment Robot Manipulation 2026-06-26: Pretrained VLM에 아직 motor structure를 배우지 못한 action head를 바로 붙여 joint train하는 대신, state-action trajectory만으로 flow-matching action encoder-decoder를 먼저 pretrain한 뒤 decoder initialization, decaying latent distillation, history compression을 통해 VLA에 이식하는 cross-embodiment policy training framework
FlowDPG: Deterministic Policy Gradient on Flow Matching Policies for Real-World Manipulation 2026-06-24: flow matching robot policy의 중간 noisy action을 clean action chunk로 한 번에 projection한 뒤, 그 지점의 critic gradient를 value-improved velocity target으로 distillation하여 전체 denoising ODE를 backpropagation하지 않고도 offline-to-online real-world RL을 수행
UniFS: Unified Fast-to-Slow Hierarchical Architecture for Vision-Language-Action Models 2026-06-24: pretrained VLM과 action expert의 각 layer group을 서로 다른 주기로 실행·cache하도록 학습하고, VLM feature와 action decoding stage의 연결 순서를 뒤집어, VLA-Adapter의 success rate를 높이면서 평균 inference latency를 줄인 scheduler-aware VLA architecture
Acting While Understanding: Asynchronous Semantic-Action Decoupling for Real-Time Vision-Language-Action Models 2026-06-16: VLA 내부 semantic-action interface를 slow semantic understanding과 fast action generation으로 분리하고, stale semantic cache를 action history와 delay-aware training으로 보완해 full VLA를 control rate로 돌리지 않는 high-frequency state-feedback VLA deployment framework
Geometric Action Model for Robot Policy Learning 2026-06-16: pretrained Geometric Foundation Model(GFM)을 단순 feature extractor가 아니라 robot policy backbone 자체로 재활용해, GFM latent space 안에서 future geometry와 action chunk를 함께 예측하는 geometry-grounded World-Action Policy
ReactVLA: Fast and Lightweight Reactive Robot Manipulation via Improved Mean Flow Action Generation 2026-06-15: diffusion / flow 기반 VLA policy의 inference latency 병목을 줄이기 위해, action generation을 improved Mean Flow(iMF) 기반 one-to-few-step continuous action chunk generation으로 바꾸고 Attention Residuals(AttnRes) Transformer를 결합한 low-latency reactive robot manipulation policy
WAM4D: Fast 4D World Action Model via Spatial Register Tokens 2026-06-15: 4D geometry를 inference-time output으로 직접 만들지 않고, training-time spatial register token으로 future depth를 예측하게 만들어 geometric foundation prior를 causal video-action WAM에 distill한 뒤, deploy 시 geometry branch를 제거해 action chunk를 빠르게 생성
µ0: A Scalable 3D Interaction-Trace World Model 2026-06-15: pretraining 단계에서는 action-labeled robot data 없이 heterogeneous videos에서 추출한 semantic 3D interaction traces를 학습하고, downstream에서는 frozen trace world model의 hidden features를 action expert에 주입해 robot policy를 만드는 3D trace-space world model
DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model 2026-06-11: VLA의 synchronous clock 가정이 contact-rich manipulation의 multi-rate sensor structure와 맞지 않는다고 보고, modality별 asynchronous latent buffer + gated cross-attention으로 X-VLA를 100 Hz controller 기반 closed-loop execution에 맞춘다
Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination 2026-06-10: WAM의 미래 영상 예측을 photorealistic video generation이 아니라 action generation을 돕는 저비용 coarse future guidance로 재정의하고, compact video expert + low-resolution future latent + asymmetric video-action denoising으로 약 1B 규모에서 real-world policy inference latency를 약 98 ms/chunk까지 낮춤
AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing 2026-06-09: Video-DiT world planner는 low-frequency로 long-horizon latent context를 만들고, Action-DiT executor는 OVCR로 최신 observation에 맞게 context를 보정해 short action chunk를 high-frequency closed-loop로 실행하는 asynchronous WAM
GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation 2026-06-09: Qwen2.5-VL 기반 VLA에 latent action token K/V cache-conditioned stop-gradient DiT flow action expert, VGGT 기반 3D spatial encoder, relative end-effector action 기반 embodiment canonicalization을 결합해 unseen object / background shift / pretraining-unseen robot embodiment transfer를 개선하는 geometry-aware manipulation policy
MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation 2026-06-09: Cosmos-Predict2.5 기반 Video DiT의 intermediate denoising feature를 Motion DiT action policy에 주입하고, SONIC 기반 unified whole-body motion token으로 humanoid의 상·하체를 한 action space에 묶어 Unitree G1에서 real-time loco-manipulation을 수행
ActionMap: Robot Policy Learning via Voxel Action Heatmap 2026-06-08: VLA의 기존 single-point action decoder를 3D translation / 3D rotation / gripper voxel heatmap action head로 교체해, action space의 geometric proximity(인접성)를 학습 신호로 활용
3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training 2026-06-04: pretrained VLA를 VLA data + real-world 3D reasoning data로 co-training하면서, 3D foundation model과 reasoning-prompt teacher를 학습 중에만 사용해 2D image-only inference에서도 implicit 3D spatial reasoning을 action prediction에 주입
PointAction: 3D Points as Universal Action Representations for Robot Control 2026-06-03: pretrained video diffusion model이 RGB뿐 아니라 temporally consistent XYZ pointmap까지 생성하게 만들고, 이 3D point dynamics를 embodiment-specific diffusion action decoder가 action chunk로 변환

cross-embodiment

Learning Action Priors for Cross-embodiment Robot Manipulation 2026-06-26: Pretrained VLM에 아직 motor structure를 배우지 못한 action head를 바로 붙여 joint train하는 대신, state-action trajectory만으로 flow-matching action encoder-decoder를 먼저 pretrain한 뒤 decoder initialization, decaying latent distillation, history compression을 통해 VLA에 이식하는 cross-embodiment policy training framework
Grounding Generative Policies in Physics: Optimization-Guided Diffusion for Robot Control 2026-06-25: Frozen task-space diffusion policy의 DDIM sampling noise를 무작위로 뽑는 대신, robot reachability·collision·controller trackability를 만족하도록 최적화하여 cross-embodiment deployment를 수행하는 inference-time constrained diffusion method
SPACE: Enabling Learning from Cross-Robot Data Toward Generalist Policies 2026-06-25: VLA가 robot-specific control command 대신 실제로 달성해야 할 6-DoF Cartesian end-effector displacement를 예측하게 하고, target robot마다 선형 Action Adapter를 offline calibration과 online LMS로 적응시켜 cross-embodiment·cross-hardware·deployment dynamics shift에 강한 execution interface를 만든다
Do as I Do: Dexterous Manipulation Data from Everyday Human Videos 2026-06-18: monocular RGB human manipulation video를 4D hand–object trajectory로 복원하고, pretrained SAM 3D를 training-free guided flow sampling으로 object tracker처럼 재활용한 뒤, MuJoCo Warp의 dynamics-aware sampling optimization으로 22-DoF Sharpa Wave hand가 실행할 수 있는 robot trajectory로 변환하는 offline robot-data engine
MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction 2026-06-18: RGB history, object 위의 2D query points와 corresponding initial 3D coordinates, language instruction을 입력받아 object-attached point들의 미래 3D world-frame trajectory를 예측하도록 Molmo2를 대규모 human/robot/in-the-wild video로 pretrain하고, 이 motion prior가 robot policy initialization과 video generation guidance로 전이됨을 보임
ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining 2026-06-17: 대규모 egocentric human video를 robot-compatible pseudo-action으로 변환하고, camera-space action / morphology conditioning / time-aligned chunking / reliability-aware auxiliary loss를 결합해 human + robot + simulation 데이터를 함께 VLA pretraining에 쓰는 unified VLA pretraining framework
Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models 2026-06-17: Qwen-VL 기반 VLA에 canonical state/action alignment, camera-frame EEF action, in-context policy adaptation, Human-to-Robot synthesis를 결합해 heterogeneous robot manipulation data를 coherent하게 scale하고 OOD task/scene·instruction·cross-embodiment generalization을 끌어올린 robot manipulation foundation model
Retrieve, Don’t Retrain: Extending Vision-Language-Action Models to New Tasks at Test Time 2026-06-16: VLA/WAM policy를 새 task마다 다시 fine-tuning하지 않고, 저비용 pool embodiment demonstration을 retrieval pool에 추가한 뒤 frozen policy가 매 control step마다 retrieved trajectory를 조건으로 action chunk를 생성하게 만든 test-time task adaptation method
EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations 2026-06-12: egocentric human manipulation video를 digital twin 기반으로 변환해, robot observation video와 실행 가능한 로봇 action trajectory를 함께 생성하고, 이를 이용해 real-robot dexterous visuomotor policy를 학습하는 human-video-to-robot-demo data engine
GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation 2026-06-09: Qwen2.5-VL 기반 VLA에 latent action token K/V cache-conditioned stop-gradient DiT flow action expert, VGGT 기반 3D spatial encoder, relative end-effector action 기반 embodiment canonicalization을 결합해 unseen object / background shift / pretraining-unseen robot embodiment transfer를 개선하는 geometry-aware manipulation policy
GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors 2026-06-04: 3D asset과 video foundation model prior를 이용해 humanoid loco-manipulation용 4D human-object interaction 데이터를 완전 디지털로 생성하고, 이를 Unitree G1용 tracking policy와 egocentric visual policy로 변환해 실제 로봇에 배포하는 data-generation / sim-to-real framework
OSCAR: Omni-Embodiment Skeleton-Conditioned World Action Model for Robotics 2026-06-04: pretrained Cosmos-Predict2.5-2B video DiT를 2D kinematic skeleton condition으로 fine-tuning하여, 여러 robot embodiment와 human hand에 걸쳐 action-conditioned future video를 생성하고 이를 RoboArena policy evaluation proxy로 쓴다

diffusion-policy

Grounding Generative Policies in Physics: Optimization-Guided Diffusion for Robot Control 2026-06-25: Frozen task-space diffusion policy의 DDIM sampling noise를 무작위로 뽑는 대신, robot reachability·collision·controller trackability를 만족하도록 최적화하여 cross-embodiment deployment를 수행하는 inference-time constrained diffusion method
LAGO Policy: Latency-Aware Asynchronous Diffusion Policies with Goal-Directed Collision-Free Planning for Smooth Manipulation 2026-06-17: asynchronous inference로 실행되는 Diffusion Policy의 chunk boundary jerk와 obstacle collision 문제를 latency-aware classifier-free guidance, demonstration-derived goal prediction, collision-free trajectory optimization, spatial-temporal smoothing으로 줄이는 real-robot manipulation policy
Where Should Action Generation Begin? A Learnable Source Prior for Generative Robot Policies 2026-06-17: flow matching 기반 generative robot policy의 action generation source를 observation-independent Gaussian noise에서 proprioception-conditioned learnable Gaussian prior로 바꾸고, 같은 source prior가 diffusion-bridge generator에도 plug-in될 수 있음을 보인 source-prior learning method
Inference-time Policy Steering via Vision and Touch 2026-06-16: frozen diffusion robot policy의 weights는 바꾸지 않고, action-conditioned visuo-tactile latent world model로 후보 action chunk의 future outcome을 예측한 뒤, long-horizon vision으로 global action mode를 선택하고 short-horizon touch로 local contact execution을 diffusion editing하는 inference-time steering method
Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics 2026-06-11: suboptimal / OOD robot demonstrations를 Diffusion Policy 학습에 그냥 섞지 않고, diffusion timestep에 따라 “쓸 수 있는 구간”을 제한해 유용한 global plan 또는 local motion primitive만 뽑아 쓰는 imitation learning 방법
Dynamic Execution Horizon Prediction for Chunk-based Robot Policies 2026-06-11: pretrained action-chunking robot policy의 action generator는 완전히 고정하고, 현재 observation과 예측된 action chunk를 보고 “이번에 몇 step을 open-loop로 실행할지”를 PPO로 학습하는 lightweight execution-horizon predictor
SMoDP: Semantically Structured Mixture-of-Experts for Compositional Robotic Manipulation 2026-05-25: Diffusion policy의 MoE router를 skill-aware하게 만들어 multi-task manipulation에서 expert를 의미 있는 skill 단위로 재사용하게 만든다

distillation

Flash-WAM: Modality-Aware Distillation for World Action Models 2026-06-05: WAM의 video/action diffusion denoising을 각각의 noise regime에 맞게 다르게 distill해서, WAM을 거의 teacher 성능에 가깝게 유지하면서 real-time chunk-level control이 가능한 수준까지 가속하는 step-distillation method

fine-tuning

Learning Action Priors for Cross-embodiment Robot Manipulation 2026-06-26: Pretrained VLM에 아직 motor structure를 배우지 못한 action head를 바로 붙여 joint train하는 대신, state-action trajectory만으로 flow-matching action encoder-decoder를 먼저 pretrain한 뒤 decoder initialization, decaying latent distillation, history compression을 통해 VLA에 이식하는 cross-embodiment policy training framework
InSight: Self-Guided Skill Acquisition via Steerable VLAs 2026-06-25: 기존 demonstration을 자동으로 primitive 단위로 분해해 pretrained π0.5를 primitive-steerable policy로 만들고, novel task에서 VLM이 발견한 missing primitive를 single-axis controller로 자율 수집·검증한 뒤 VLA에 재학습하여 영속적인 skill vocabulary로 편입하는 VLM-guided continual skill acquisition framework
PolicyTrim: Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models 2026-06-25: pretrained VLA를 두 단계의 GRPO 기반 RL post-training으로 fine-tuning하여, 한 번의 inference에서 안전하게 실행할 수 있는 action chunk 길이를 늘리고 전체 physical control step은 줄임
SPACE: Enabling Learning from Cross-Robot Data Toward Generalist Policies 2026-06-25: VLA가 robot-specific control command 대신 실제로 달성해야 할 6-DoF Cartesian end-effector displacement를 예측하게 하고, target robot마다 선형 Action Adapter를 offline calibration과 online LMS로 적응시켜 cross-embodiment·cross-hardware·deployment dynamics shift에 강한 execution interface를 만든다
World Value Models for Robotic Manipulation 2026-06-25: Pretrained Wan2.2 video world model을 robot video로 jointly fine-tune하면서 별도의 lightweight value DiT를 Mixture-of-Transformers로 결합해, video와 language로부터 4-frame task-progress chunk를 flow matching으로 생성하고 그 progress 변화량으로 suboptimal data를 filtering·reweighting하는 generalist robotic value model
FlowDPG: Deterministic Policy Gradient on Flow Matching Policies for Real-World Manipulation 2026-06-24: flow matching robot policy의 중간 noisy action을 clean action chunk로 한 번에 projection한 뒤, 그 지점의 critic gradient를 value-improved velocity target으로 distillation하여 전체 denoising ODE를 backpropagation하지 않고도 offline-to-online real-world RL을 수행
UniFS: Unified Fast-to-Slow Hierarchical Architecture for Vision-Language-Action Models 2026-06-24: pretrained VLM과 action expert의 각 layer group을 서로 다른 주기로 실행·cache하도록 학습하고, VLM feature와 action decoding stage의 연결 순서를 뒤집어, VLA-Adapter의 success rate를 높이면서 평균 inference latency를 줄인 scheduler-aware VLA architecture
UniviewVLA: A Unified Multiview Vision-Language-Action Model with World Modeling 2026-06-24: agent-view와 wrist-view의 두 프레임만으로 candidate auxiliary-views의 다음 장면 token을 생성하고, motion-relevant token 16개로 압축한 뒤 action entropy가 가장 낮은 view를 선택해 FAST action token을 생성하는 autoregressive multiview VLA
Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think 2026-06-22: pretrained π0·GR00T-N1.5·SmolVLA의 VLM backbone과 continuous-action head에서 Centered Kernel Alignment로 표현이 거의 변하지 않는 연속 Transformer layer를 찾아 fine-tuning 전에 정적으로 제거하고, 남은 작은 모델을 downstream fine-tuning하여 학습·추론 비용을 함께 줄이는 VLA structural pruning method
ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing? 2026-06-22: pretrained image-editing model을 robot policy backbone으로 fine-tuning하고, future video를 생성하는 대신 single future endpoint를 학습할 때 형성되는 layer-wise KV cache를 flow-matching Action Expert에 전달하여 action chunk를 생성하는 경량 WAM
SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation 2026-06-19: pretrained Cosmos3-Nano video foundation model을 forward dynamics, inverse dynamics, cross-view inpainting의 세 mode로 공동 fine-tuning하고, inference에서는 commanded action과 generated video에서 inverse dynamics로 복원한 action의 불일치를 rollout reliability signal로 사용해 frozen VLA policy를 multi-view video world model 안에서 closed-loop 평가하는 method
Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement 2026-06-18: Real-robot demonstrations로 fine-tune한 VLA를 고정한 뒤, task-relevant object 6-DoF pose·proprioception·현재 base VLA action만 입력받는 lightweight residual RL policy를 simulation에서 학습하고 real robot에 adaptation 없이 결합해, FR3 5-task 평균 real success rate를 42%에서 76%로 높인 sim-to-real VLA enhancement framework
ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining 2026-06-17: 대규모 egocentric human video를 robot-compatible pseudo-action으로 변환하고, camera-space action / morphology conditioning / time-aligned chunking / reliability-aware auxiliary loss를 결합해 human + robot + simulation 데이터를 함께 VLA pretraining에 쓰는 unified VLA pretraining framework
Uncertainty Quantification for Flow-Based Vision-Language-Action Models 2026-06-17: flow matching 기반 VLA의 action generation ODE에서 ensemble velocity field disagreement(VFD)를 측정해 epistemic uncertainty를 추정하고, 이를 failure detection과 SAVE active fine-tuning data acquisition에 사용해 expert demonstration sample efficiency를 높임
WAM-RL: World-Action Model Reinforcement Learning with Reconstruction Rewards and Online Video SFT 2026-06-17: pretrained WAM에서 actor만 RL fine-tuning하지 않고, successful online rollout으로 world model을 KL-regularized video SFT하며, actor는 imagined future와 executed future의 reconstruction consistency reward로 RL update하는 WAM post-training framework
Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement 2026-06-17: pretrained generalist robot policy를 stochastic action generator로 사용하고, geometric verifier로 Best-of-N action chunk를 선택한 뒤, 성공한 verified rollout을 BC fine-tuning data로 재사용하는 inference-time steering + autonomous policy improvement framework
Acting While Understanding: Asynchronous Semantic-Action Decoupling for Real-Time Vision-Language-Action Models 2026-06-16: VLA 내부 semantic-action interface를 slow semantic understanding과 fast action generation으로 분리하고, stale semantic cache를 action history와 delay-aware training으로 보완해 full VLA를 control rate로 돌리지 않는 high-frequency state-feedback VLA deployment framework
Geometric Action Model for Robot Policy Learning 2026-06-16: pretrained Geometric Foundation Model(GFM)을 단순 feature extractor가 아니라 robot policy backbone 자체로 재활용해, GFM latent space 안에서 future geometry와 action chunk를 함께 예측하는 geometry-grounded World-Action Policy
Retrieve, Don’t Retrain: Extending Vision-Language-Action Models to New Tasks at Test Time 2026-06-16: VLA/WAM policy를 새 task마다 다시 fine-tuning하지 않고, 저비용 pool embodiment demonstration을 retrieval pool에 추가한 뒤 frozen policy가 매 control step마다 retrieved trajectory를 조건으로 action chunk를 생성하게 만든 test-time task adaptation method
T-Rex: Tactile-Reactive Dexterous Manipulation 2026-06-16: tactile-free human egocentric pretraining으로 얻은 visuomotor prior를 tactile-rich robot mid-training으로 contact dynamics에 맞춘 뒤, slow action expert와 fast tactile expert를 cascaded flow matching으로 연결해 action chunk 내부에서도 tactile feedback에 반응하는 tactile-reactive dexterous VLA
WAM4D: Fast 4D World Action Model via Spatial Register Tokens 2026-06-15: 4D geometry를 inference-time output으로 직접 만들지 않고, training-time spatial register token으로 future depth를 예측하게 만들어 geometric foundation prior를 causal video-action WAM에 distill한 뒤, deploy 시 geometry branch를 제거해 action chunk를 빠르게 생성
WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation 2026-06-12: multi-view RGB + proprioception + action chunk를 입력으로 미래 latent rollout과 reward/value를 빠르게 예측해, π0.5 같은 VLA policy의 offline evaluation, synthetic-data policy improvement, test-time best-of-N planning을 가능하게 만든 action-conditioned latent world model
DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model 2026-06-11: VLA의 synchronous clock 가정이 contact-rich manipulation의 multi-rate sensor structure와 맞지 않는다고 보고, modality별 asynchronous latent buffer + gated cross-attention으로 X-VLA를 100 Hz controller 기반 closed-loop execution에 맞춘다
Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination 2026-06-10: WAM의 미래 영상 예측을 photorealistic video generation이 아니라 action generation을 돕는 저비용 coarse future guidance로 재정의하고, compact video expert + low-resolution future latent + asymmetric video-action denoising으로 약 1B 규모에서 real-world policy inference latency를 약 98 ms/chunk까지 낮춤
SARM2: Multi-Task Stage Aware Reward Modeling for Self Improving Robotic Manipulation 2026-06-10: long-horizon robotic manipulation에서 VLA policy의 self-improvement를 위해, action-primitive stage estimator와 multi-gate MoE value head로 dense reward/value model을 만들고, 이를 SPIRAL의 offline-to-online residual RL data flywheel에 통합한다
AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing 2026-06-09: Video-DiT world planner는 low-frequency로 long-horizon latent context를 만들고, Action-DiT executor는 OVCR로 최신 observation에 맞게 context를 보정해 short action chunk를 high-frequency closed-loop로 실행하는 asynchronous WAM
GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation 2026-06-09: Qwen2.5-VL 기반 VLA에 latent action token K/V cache-conditioned stop-gradient DiT flow action expert, VGGT 기반 3D spatial encoder, relative end-effector action 기반 embodiment canonicalization을 결합해 unseen object / background shift / pretraining-unseen robot embodiment transfer를 개선하는 geometry-aware manipulation policy
MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation 2026-06-09: Cosmos-Predict2.5 기반 Video DiT의 intermediate denoising feature를 Motion DiT action policy에 주입하고, SONIC 기반 unified whole-body motion token으로 humanoid의 상·하체를 한 action space에 묶어 Unitree G1에서 real-time loco-manipulation을 수행
Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies 2026-06-09: few-shot SFT된 π0.5 flow-matching VLA를 고정된 self-rollout buffer와 learned Q-critic의 action-gradient로 offline RL fine-tuning하되, Q-gradient를 terminal action label이 아니라 denoising-time residual velocity supervision으로 바꾸어 학습
ActionMap: Robot Policy Learning via Voxel Action Heatmap 2026-06-08: VLA의 기존 single-point action decoder를 3D translation / 3D rotation / gripper voxel heatmap action head로 교체해, action space의 geometric proximity(인접성)를 학습 신호로 활용
Flash-WAM: Modality-Aware Distillation for World Action Models 2026-06-05: WAM의 video/action diffusion denoising을 각각의 noise regime에 맞게 다르게 distill해서, WAM을 거의 teacher 성능에 가깝게 유지하면서 real-time chunk-level control이 가능한 수준까지 가속하는 step-distillation method
3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training 2026-06-04: pretrained VLA를 VLA data + real-world 3D reasoning data로 co-training하면서, 3D foundation model과 reasoning-prompt teacher를 학습 중에만 사용해 2D image-only inference에서도 implicit 3D spatial reasoning을 action prediction에 주입
GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors 2026-06-04: 3D asset과 video foundation model prior를 이용해 humanoid loco-manipulation용 4D human-object interaction 데이터를 완전 디지털로 생성하고, 이를 Unitree G1용 tracking policy와 egocentric visual policy로 변환해 실제 로봇에 배포하는 data-generation / sim-to-real framework
OSCAR: Omni-Embodiment Skeleton-Conditioned World Action Model for Robotics 2026-06-04: pretrained Cosmos-Predict2.5-2B video DiT를 2D kinematic skeleton condition으로 fine-tuning하여, 여러 robot embodiment와 human hand에 걸쳐 action-conditioned future video를 생성하고 이를 RoboArena policy evaluation proxy로 쓴다
PointAction: 3D Points as Universal Action Representations for Robot Control 2026-06-03: pretrained video diffusion model이 RGB뿐 아니라 temporally consistent XYZ pointmap까지 생성하게 만들고, 이 3D point dynamics를 embodiment-specific diffusion action decoder가 action chunk로 변환
See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs 2026-06-03: VLA executor가 coarse goal과 full image에서 “무엇을 할지/무엇을 볼지”를 스스로 추론하지 않도록 goal-preserving local language와 learned visual evidence budget을 함께 학습시키는 planner-executor VLA generalization framework
Continuous Reasoning for Vision-Language-Action 2026-06-02: VLA의 reasoning을 자연어 CoT가 아니라, 다른 VLA instance도 consume할 수 있는 WAE-regularized Gaussian continuous reasoning interface로 정의
VLAMotor: Test-Guided Enhancement of Vision-Language-Action Models via Agent-Based Data Synthesis 2026-06-02: training distribution에서 멀고 서로 중복되지 않는 테스트 케이스로 VLA 실패를 적극적으로 찾고, 그 실패 trajectory를 VLM agent가 성공 trajectory로 고쳐 fine-tuning data로 쓰는 failure-driven VLA enhancement framework
A Factory-Floor Deployment Case Study of VLA Pipelines for Industrial Packaging Task: Workflow, Failures, and Lessons 2026-05-28: 데이터 수집·teleoperation·runtime·failure analysis 루프를 설계해서 pretrained π0.5를 실제 공장 포장 작업에 배포하는 시도, 그리고 거기서 얻은 교훈들
HyperSim: A Holistic Sim-To-Real Framework For Robust Robotic Manipulation 2026-05-27: 더 현실적인 시뮬레이션 + 더 다양한 recovery trajectory + 소량 real data co-training → zero-shot/few-shot sim-to-real 성능 향상
DEFLECT: Delay-Robust Execution via Flow-matching Likelihood-Estimated Counterfactual Tuning for VLA Policies 2026-05-20: fresh observation에서 나온 action이 stale observation에서 나온 action보다 선호된다는 label-free preference pair를 이용해서 async VLA의 delay-robustness를 높이는 offline post-training 방법

foundation-Model

Cosmos 3: Omnimodal World Models for Physical AI 2026-06-03: language, image, video, audio, action을 하나의 Mixture-of-Transformers (MoT) 기반 omnimodal world model로 통합해, VLM·video generator·forward/inverse dynamics·robot policy를 하나의 Physical AI backbone으로 다루는 NVIDIA의 대규모 foundation model
τ0-WM: A Unified Video-Action World Model for Robotic Manipulation 2026-06-02: action generation, video prediction, action-conditioned evaluation을 하나의 shared video diffusion backbone 위에서 통합한 manipulation framework
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments 2026-06-01: Qwen3.5 VLM + DiT flow-matching action decoder / embodiment-aware prompt / joint pretraining → generalist VLA (manipulation, navigation, human egocentric motion, trajectory prediction)

foundation-model

World Value Models for Robotic Manipulation 2026-06-25: Pretrained Wan2.2 video world model을 robot video로 jointly fine-tune하면서 별도의 lightweight value DiT를 Mixture-of-Transformers로 결합해, video와 language로부터 4-frame task-progress chunk를 flow matching으로 생성하고 그 progress 변화량으로 suboptimal data를 filtering·reweighting하는 generalist robotic value model
MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction 2026-06-18: RGB history, object 위의 2D query points와 corresponding initial 3D coordinates, language instruction을 입력받아 object-attached point들의 미래 3D world-frame trajectory를 예측하도록 Molmo2를 대규모 human/robot/in-the-wild video로 pretrain하고, 이 motion prior가 robot policy initialization과 video generation guidance로 전이됨을 보임
PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation 2026-06-18: pretrained 14B flow-matching video DiT에 Geometry-Aware Cross-View Attention, camera-aware Geo-RoPE, Depth Anything 3 기반 Latent 3D-REPA를 결합해 여러 로봇 카메라의 미래 영상을 3D-consistent하게 생성하고, action-conditioned rollout을 WAM의 world-prediction backbone으로 활용할 수 있는 multi-view world foundation model
Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models 2026-06-17: Qwen-VL 기반 VLA에 canonical state/action alignment, camera-frame EEF action, in-context policy adaptation, Human-to-Robot synthesis를 결합해 heterogeneous robot manipulation data를 coherent하게 scale하고 OOD task/scene·instruction·cross-embodiment generalization을 끌어올린 robot manipulation foundation model
Geometric Action Model for Robot Policy Learning 2026-06-16: pretrained Geometric Foundation Model(GFM)을 단순 feature extractor가 아니라 robot policy backbone 자체로 재활용해, GFM latent space 안에서 future geometry와 action chunk를 함께 예측하는 geometry-grounded World-Action Policy
T-Rex: Tactile-Reactive Dexterous Manipulation 2026-06-16: tactile-free human egocentric pretraining으로 얻은 visuomotor prior를 tactile-rich robot mid-training으로 contact dynamics에 맞춘 뒤, slow action expert와 fast tactile expert를 cascaded flow matching으로 연결해 action chunk 내부에서도 tactile feedback에 반응하는 tactile-reactive dexterous VLA
µ0: A Scalable 3D Interaction-Trace World Model 2026-06-15: pretraining 단계에서는 action-labeled robot data 없이 heterogeneous videos에서 추출한 semantic 3D interaction traces를 학습하고, downstream에서는 frozen trace world model의 hidden features를 action expert에 주입해 robot policy를 만드는 3D trace-space world model

inference-time

Grounding Generative Policies in Physics: Optimization-Guided Diffusion for Robot Control 2026-06-25: Frozen task-space diffusion policy의 DDIM sampling noise를 무작위로 뽑는 대신, robot reachability·collision·controller trackability를 만족하도록 최적화하여 cross-embodiment deployment를 수행하는 inference-time constrained diffusion method
UniFS: Unified Fast-to-Slow Hierarchical Architecture for Vision-Language-Action Models 2026-06-24: pretrained VLM과 action expert의 각 layer group을 서로 다른 주기로 실행·cache하도록 학습하고, VLM feature와 action decoding stage의 연결 순서를 뒤집어, VLA-Adapter의 success rate를 높이면서 평균 inference latency를 줄인 scheduler-aware VLA architecture
Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think 2026-06-22: pretrained π0·GR00T-N1.5·SmolVLA의 VLM backbone과 continuous-action head에서 Centered Kernel Alignment로 표현이 거의 변하지 않는 연속 Transformer layer를 찾아 fine-tuning 전에 정적으로 제거하고, 남은 작은 모델을 downstream fine-tuning하여 학습·추론 비용을 함께 줄이는 VLA structural pruning method
ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing? 2026-06-22: pretrained image-editing model을 robot policy backbone으로 fine-tuning하고, future video를 생성하는 대신 single future endpoint를 학습할 때 형성되는 layer-wise KV cache를 flow-matching Action Expert에 전달하여 action chunk를 생성하는 경량 WAM
Start Right, Arrive Right: Asynchronous Execution via Initial Noise Selection 2026-06-22: frozen flow-matching robot policy의 initial action noise를 backward ODE inversion과 repainting으로 조정하여, 이미 실행된 action prefix와 새 action chunk를 gradient·retraining 없이 연속적으로 연결하는 asynchronous inference method
DREAM-Chunk: Reactive Action Chunking with Latent World Model 2026-06-18: frozen action-chunking VLA가 샘플링한 N개 candidate chunk의 latent future를 lightweight world model로 예측하고, 매 control step마다 현재 observation과 가장 가까운 phase-aligned dreamed state의 action으로 전환해 VLA를 다시 호출하지 않고 within-chunk reactivity를 높이는 test-time scaling method
LAGO Policy: Latency-Aware Asynchronous Diffusion Policies with Goal-Directed Collision-Free Planning for Smooth Manipulation 2026-06-17: asynchronous inference로 실행되는 Diffusion Policy의 chunk boundary jerk와 obstacle collision 문제를 latency-aware classifier-free guidance, demonstration-derived goal prediction, collision-free trajectory optimization, spatial-temporal smoothing으로 줄이는 real-robot manipulation policy
Uncertainty Quantification for Flow-Based Vision-Language-Action Models 2026-06-17: flow matching 기반 VLA의 action generation ODE에서 ensemble velocity field disagreement(VFD)를 측정해 epistemic uncertainty를 추정하고, 이를 failure detection과 SAVE active fine-tuning data acquisition에 사용해 expert demonstration sample efficiency를 높임
Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement 2026-06-17: pretrained generalist robot policy를 stochastic action generator로 사용하고, geometric verifier로 Best-of-N action chunk를 선택한 뒤, 성공한 verified rollout을 BC fine-tuning data로 재사용하는 inference-time steering + autonomous policy improvement framework
Acting While Understanding: Asynchronous Semantic-Action Decoupling for Real-Time Vision-Language-Action Models 2026-06-16: VLA 내부 semantic-action interface를 slow semantic understanding과 fast action generation으로 분리하고, stale semantic cache를 action history와 delay-aware training으로 보완해 full VLA를 control rate로 돌리지 않는 high-frequency state-feedback VLA deployment framework
Inference-time Policy Steering via Vision and Touch 2026-06-16: frozen diffusion robot policy의 weights는 바꾸지 않고, action-conditioned visuo-tactile latent world model로 후보 action chunk의 future outcome을 예측한 뒤, long-horizon vision으로 global action mode를 선택하고 short-horizon touch로 local contact execution을 diffusion editing하는 inference-time steering method
Retrieve, Don’t Retrain: Extending Vision-Language-Action Models to New Tasks at Test Time 2026-06-16: VLA/WAM policy를 새 task마다 다시 fine-tuning하지 않고, 저비용 pool embodiment demonstration을 retrieval pool에 추가한 뒤 frozen policy가 매 control step마다 retrieved trajectory를 조건으로 action chunk를 생성하게 만든 test-time task adaptation method
Elastic Queries Reinforcement Learning: Self-Aware Policy Execution for VLA Models 2026-06-15: frozen flow-based VLA는 그대로 둔 채, lightweight RL adaptor가 매 query마다 latent steering w, denoising steps K, execution chunk length C를 동적으로 선택해 hard state에서는 더 많은 compute와 잦은 replanning을, easy state에서는 낮은 compute와 긴 open-loop execution을 수행하도록 만드는 elastic VLA execution framework
ReactVLA: Fast and Lightweight Reactive Robot Manipulation via Improved Mean Flow Action Generation 2026-06-15: diffusion / flow 기반 VLA policy의 inference latency 병목을 줄이기 위해, action generation을 improved Mean Flow(iMF) 기반 one-to-few-step continuous action chunk generation으로 바꾸고 Attention Residuals(AttnRes) Transformer를 결합한 low-latency reactive robot manipulation policy
Improving Robotic Generalist Policies via Flow Reversal Steering 2026-06-12: coarse semantic action을 frozen flow-matching VLA의 역방향 ODE로 latent noise에 매핑한 뒤 다시 denoise해, generalist policy prior 안의 더 정교한 action mode를 호출하는 training-free steering 방법
WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation 2026-06-12: multi-view RGB + proprioception + action chunk를 입력으로 미래 latent rollout과 reward/value를 빠르게 예측해, π0.5 같은 VLA policy의 offline evaluation, synthetic-data policy improvement, test-time best-of-N planning을 가능하게 만든 action-conditioned latent world model
Dynamic Execution Horizon Prediction for Chunk-based Robot Policies 2026-06-11: pretrained action-chunking robot policy의 action generator는 완전히 고정하고, 현재 observation과 예측된 action chunk를 보고 “이번에 몇 step을 open-loop로 실행할지”를 PPO로 학습하는 lightweight execution-horizon predictor
Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination 2026-06-10: WAM의 미래 영상 예측을 photorealistic video generation이 아니라 action generation을 돕는 저비용 coarse future guidance로 재정의하고, compact video expert + low-resolution future latent + asymmetric video-action denoising으로 약 1B 규모에서 real-world policy inference latency를 약 98 ms/chunk까지 낮춤
AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing 2026-06-09: Video-DiT world planner는 low-frequency로 long-horizon latent context를 만들고, Action-DiT executor는 OVCR로 최신 observation에 맞게 context를 보정해 short action chunk를 high-frequency closed-loop로 실행하는 asynchronous WAM
MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation 2026-06-09: Cosmos-Predict2.5 기반 Video DiT의 intermediate denoising feature를 Motion DiT action policy에 주입하고, SONIC 기반 unified whole-body motion token으로 humanoid의 상·하체를 한 action space에 묶어 Unitree G1에서 real-time loco-manipulation을 수행
Flash-WAM: Modality-Aware Distillation for World Action Models 2026-06-05: WAM의 video/action diffusion denoising을 각각의 noise regime에 맞게 다르게 distill해서, WAM을 거의 teacher 성능에 가깝게 유지하면서 real-time chunk-level control이 가능한 수준까지 가속하는 step-distillation method
Denoising Tells When to Replan: Denoising-Variance Adaptive Chunking for Flow-Based Robot Policies 2026-06-03: last denoising step들에서 clean-action estimate들의 variance를 future action별 stability proxy로 사용해, 안정적인 action prefix만 실행하고 고분산 구간 전에 replan
PACE: Phase-Aware Chunk Execution for Robot Policies with Action Chunking 2026-06-02: action chunking robot policy에서 고정 execution horizon 대신, predicted action chunk의 low-speed valley를 phase boundary로 사용해 매 query마다 실행 길이를 동적으로 선택하는 training-free test-time execution 방법

multi-task

SMoDP: Semantically Structured Mixture-of-Experts for Compositional Robotic Manipulation 2026-05-25: Diffusion policy의 MoE router를 skill-aware하게 만들어 multi-task manipulation에서 expert를 의미 있는 skill 단위로 재사용하게 만든다

real-world

A Factory-Floor Deployment Case Study of VLA Pipelines for Industrial Packaging Task: Workflow, Failures, and Lessons 2026-05-28: 데이터 수집·teleoperation·runtime·failure analysis 루프를 설계해서 pretrained π0.5를 실제 공장 포장 작업에 배포하는 시도, 그리고 거기서 얻은 교훈들

scheduler-training

UniFS: Unified Fast-to-Slow Hierarchical Architecture for Vision-Language-Action Models 2026-06-24: pretrained VLM과 action expert의 각 layer group을 서로 다른 주기로 실행·cache하도록 학습하고, VLM feature와 action decoding stage의 연결 순서를 뒤집어, VLA-Adapter의 success rate를 높이면서 평균 inference latency를 줄인 scheduler-aware VLA architecture
Elastic Queries Reinforcement Learning: Self-Aware Policy Execution for VLA Models 2026-06-15: frozen flow-based VLA는 그대로 둔 채, lightweight RL adaptor가 매 query마다 latent steering w, denoising steps K, execution chunk length C를 동적으로 선택해 hard state에서는 더 많은 compute와 잦은 replanning을, easy state에서는 낮은 compute와 긴 open-loop execution을 수행하도록 만드는 elastic VLA execution framework
Dynamic Execution Horizon Prediction for Chunk-based Robot Policies 2026-06-11: pretrained action-chunking robot policy의 action generator는 완전히 고정하고, 현재 observation과 예측된 action chunk를 보고 “이번에 몇 step을 open-loop로 실행할지”를 PPO로 학습하는 lightweight execution-horizon predictor
ElegantVLA: Learning When to Think for Efficient Vision-Language-Action Models 2026-05-29: VLA가 매 control step마다 전부 “생각”하지 않고, 현재 로봇 phase가 안정적인지/민감한지를 보고 Vision-LLM과 action head 계산을 동적으로 재사용하는 plug-in inference scheduler
SANTS: A State-Adaptive Scheduler for World Action Models 2026-05-28: WAM이 매번 미래 영상을 끝까지 denoise하지 않고, 현재 로봇 상태에 따라 “여기서 멈출지”와 “얼마나 크게 건너뛸지”를 결정해 full-denoising WAM 대비 success-latency tradeoff를 개선하는 state-adaptive video denoising scheduler

scratch-training

LAGO Policy: Latency-Aware Asynchronous Diffusion Policies with Goal-Directed Collision-Free Planning for Smooth Manipulation 2026-06-17: asynchronous inference로 실행되는 Diffusion Policy의 chunk boundary jerk와 obstacle collision 문제를 latency-aware classifier-free guidance, demonstration-derived goal prediction, collision-free trajectory optimization, spatial-temporal smoothing으로 줄이는 real-robot manipulation policy
Where Should Action Generation Begin? A Learnable Source Prior for Generative Robot Policies 2026-06-17: flow matching 기반 generative robot policy의 action generation source를 observation-independent Gaussian noise에서 proprioception-conditioned learnable Gaussian prior로 바꾸고, 같은 source prior가 diffusion-bridge generator에도 plug-in될 수 있음을 보인 source-prior learning method
Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics 2026-06-11: suboptimal / OOD robot demonstrations를 Diffusion Policy 학습에 그냥 섞지 않고, diffusion timestep에 따라 “쓸 수 있는 구간”을 제한해 유용한 global plan 또는 local motion primitive만 뽑아 쓰는 imitation learning 방법
SMoDP: Semantically Structured Mixture-of-Experts for Compositional Robotic Manipulation 2026-05-25: Diffusion policy의 MoE router를 skill-aware하게 만들어 multi-task manipulation에서 expert를 의미 있는 skill 단위로 재사용하게 만든다

sim2real

Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement 2026-06-18: Real-robot demonstrations로 fine-tune한 VLA를 고정한 뒤, task-relevant object 6-DoF pose·proprioception·현재 base VLA action만 입력받는 lightweight residual RL policy를 simulation에서 학습하고 real robot에 adaptation 없이 결합해, FR3 5-task 평균 real success rate를 42%에서 76%로 높인 sim-to-real VLA enhancement framework
GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors 2026-06-04: 3D asset과 video foundation model prior를 이용해 humanoid loco-manipulation용 4D human-object interaction 데이터를 완전 디지털로 생성하고, 이를 Unitree G1용 tracking policy와 egocentric visual policy로 변환해 실제 로봇에 배포하는 data-generation / sim-to-real framework
HyperSim: A Holistic Sim-To-Real Framework For Robust Robotic Manipulation 2026-05-27: 더 현실적인 시뮬레이션 + 더 다양한 recovery trajectory + 소량 real data co-training → zero-shot/few-shot sim-to-real 성능 향상

success-rate

Learning Action Priors for Cross-embodiment Robot Manipulation 2026-06-26: Pretrained VLM에 아직 motor structure를 배우지 못한 action head를 바로 붙여 joint train하는 대신, state-action trajectory만으로 flow-matching action encoder-decoder를 먼저 pretrain한 뒤 decoder initialization, decaying latent distillation, history compression을 통해 VLA에 이식하는 cross-embodiment policy training framework
InSight: Self-Guided Skill Acquisition via Steerable VLAs 2026-06-25: 기존 demonstration을 자동으로 primitive 단위로 분해해 pretrained π0.5를 primitive-steerable policy로 만들고, novel task에서 VLM이 발견한 missing primitive를 single-axis controller로 자율 수집·검증한 뒤 VLA에 재학습하여 영속적인 skill vocabulary로 편입하는 VLM-guided continual skill acquisition framework
PolicyTrim: Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models 2026-06-25: pretrained VLA를 두 단계의 GRPO 기반 RL post-training으로 fine-tuning하여, 한 번의 inference에서 안전하게 실행할 수 있는 action chunk 길이를 늘리고 전체 physical control step은 줄임
SPACE: Enabling Learning from Cross-Robot Data Toward Generalist Policies 2026-06-25: VLA가 robot-specific control command 대신 실제로 달성해야 할 6-DoF Cartesian end-effector displacement를 예측하게 하고, target robot마다 선형 Action Adapter를 offline calibration과 online LMS로 적응시켜 cross-embodiment·cross-hardware·deployment dynamics shift에 강한 execution interface를 만든다
World Value Models for Robotic Manipulation 2026-06-25: Pretrained Wan2.2 video world model을 robot video로 jointly fine-tune하면서 별도의 lightweight value DiT를 Mixture-of-Transformers로 결합해, video와 language로부터 4-frame task-progress chunk를 flow matching으로 생성하고 그 progress 변화량으로 suboptimal data를 filtering·reweighting하는 generalist robotic value model
FlowDPG: Deterministic Policy Gradient on Flow Matching Policies for Real-World Manipulation 2026-06-24: flow matching robot policy의 중간 noisy action을 clean action chunk로 한 번에 projection한 뒤, 그 지점의 critic gradient를 value-improved velocity target으로 distillation하여 전체 denoising ODE를 backpropagation하지 않고도 offline-to-online real-world RL을 수행
UniFS: Unified Fast-to-Slow Hierarchical Architecture for Vision-Language-Action Models 2026-06-24: pretrained VLM과 action expert의 각 layer group을 서로 다른 주기로 실행·cache하도록 학습하고, VLM feature와 action decoding stage의 연결 순서를 뒤집어, VLA-Adapter의 success rate를 높이면서 평균 inference latency를 줄인 scheduler-aware VLA architecture
UniviewVLA: A Unified Multiview Vision-Language-Action Model with World Modeling 2026-06-24: agent-view와 wrist-view의 두 프레임만으로 candidate auxiliary-views의 다음 장면 token을 생성하고, motion-relevant token 16개로 압축한 뒤 action entropy가 가장 낮은 view를 선택해 FAST action token을 생성하는 autoregressive multiview VLA
ENPIRE: Agentic Robot Policy Self-Improvement in the Real World 2026-06-22: coding agent가 실제 로봇의 reset → rollout → verification → policy/code refinement research loop를 직접 운영하고, 여러 robot–agent worker가 Git으로 실험 지식을 공유하면서 task policy를 자동 개선하게 만든 physical autoresearch harness
Start Right, Arrive Right: Asynchronous Execution via Initial Noise Selection 2026-06-22: frozen flow-matching robot policy의 initial action noise를 backward ODE inversion과 repainting으로 조정하여, 이미 실행된 action prefix와 새 action chunk를 gradient·retraining 없이 연속적으로 연결하는 asynchronous inference method
SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation 2026-06-19: pretrained Cosmos3-Nano video foundation model을 forward dynamics, inverse dynamics, cross-view inpainting의 세 mode로 공동 fine-tuning하고, inference에서는 commanded action과 generated video에서 inverse dynamics로 복원한 action의 불일치를 rollout reliability signal로 사용해 frozen VLA policy를 multi-view video world model 안에서 closed-loop 평가하는 method
Do as I Do: Dexterous Manipulation Data from Everyday Human Videos 2026-06-18: monocular RGB human manipulation video를 4D hand–object trajectory로 복원하고, pretrained SAM 3D를 training-free guided flow sampling으로 object tracker처럼 재활용한 뒤, MuJoCo Warp의 dynamics-aware sampling optimization으로 22-DoF Sharpa Wave hand가 실행할 수 있는 robot trajectory로 변환하는 offline robot-data engine
DREAM-Chunk: Reactive Action Chunking with Latent World Model 2026-06-18: frozen action-chunking VLA가 샘플링한 N개 candidate chunk의 latent future를 lightweight world model로 예측하고, 매 control step마다 현재 observation과 가장 가까운 phase-aligned dreamed state의 action으로 전환해 VLA를 다시 호출하지 않고 within-chunk reactivity를 높이는 test-time scaling method
MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction 2026-06-18: RGB history, object 위의 2D query points와 corresponding initial 3D coordinates, language instruction을 입력받아 object-attached point들의 미래 3D world-frame trajectory를 예측하도록 Molmo2를 대규모 human/robot/in-the-wild video로 pretrain하고, 이 motion prior가 robot policy initialization과 video generation guidance로 전이됨을 보임
Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement 2026-06-18: Real-robot demonstrations로 fine-tune한 VLA를 고정한 뒤, task-relevant object 6-DoF pose·proprioception·현재 base VLA action만 입력받는 lightweight residual RL policy를 simulation에서 학습하고 real robot에 adaptation 없이 결합해, FR3 5-task 평균 real success rate를 42%에서 76%로 높인 sim-to-real VLA enhancement framework
PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation 2026-06-18: pretrained 14B flow-matching video DiT에 Geometry-Aware Cross-View Attention, camera-aware Geo-RoPE, Depth Anything 3 기반 Latent 3D-REPA를 결합해 여러 로봇 카메라의 미래 영상을 3D-consistent하게 생성하고, action-conditioned rollout을 WAM의 world-prediction backbone으로 활용할 수 있는 multi-view world foundation model
ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining 2026-06-17: 대규모 egocentric human video를 robot-compatible pseudo-action으로 변환하고, camera-space action / morphology conditioning / time-aligned chunking / reliability-aware auxiliary loss를 결합해 human + robot + simulation 데이터를 함께 VLA pretraining에 쓰는 unified VLA pretraining framework
LAGO Policy: Latency-Aware Asynchronous Diffusion Policies with Goal-Directed Collision-Free Planning for Smooth Manipulation 2026-06-17: asynchronous inference로 실행되는 Diffusion Policy의 chunk boundary jerk와 obstacle collision 문제를 latency-aware classifier-free guidance, demonstration-derived goal prediction, collision-free trajectory optimization, spatial-temporal smoothing으로 줄이는 real-robot manipulation policy
Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models 2026-06-17: Qwen-VL 기반 VLA에 canonical state/action alignment, camera-frame EEF action, in-context policy adaptation, Human-to-Robot synthesis를 결합해 heterogeneous robot manipulation data를 coherent하게 scale하고 OOD task/scene·instruction·cross-embodiment generalization을 끌어올린 robot manipulation foundation model
WAM-RL: World-Action Model Reinforcement Learning with Reconstruction Rewards and Online Video SFT 2026-06-17: pretrained WAM에서 actor만 RL fine-tuning하지 않고, successful online rollout으로 world model을 KL-regularized video SFT하며, actor는 imagined future와 executed future의 reconstruction consistency reward로 RL update하는 WAM post-training framework
Where Should Action Generation Begin? A Learnable Source Prior for Generative Robot Policies 2026-06-17: flow matching 기반 generative robot policy의 action generation source를 observation-independent Gaussian noise에서 proprioception-conditioned learnable Gaussian prior로 바꾸고, 같은 source prior가 diffusion-bridge generator에도 plug-in될 수 있음을 보인 source-prior learning method
Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement 2026-06-17: pretrained generalist robot policy를 stochastic action generator로 사용하고, geometric verifier로 Best-of-N action chunk를 선택한 뒤, 성공한 verified rollout을 BC fine-tuning data로 재사용하는 inference-time steering + autonomous policy improvement framework
Geometric Action Model for Robot Policy Learning 2026-06-16: pretrained Geometric Foundation Model(GFM)을 단순 feature extractor가 아니라 robot policy backbone 자체로 재활용해, GFM latent space 안에서 future geometry와 action chunk를 함께 예측하는 geometry-grounded World-Action Policy
Inference-time Policy Steering via Vision and Touch 2026-06-16: frozen diffusion robot policy의 weights는 바꾸지 않고, action-conditioned visuo-tactile latent world model로 후보 action chunk의 future outcome을 예측한 뒤, long-horizon vision으로 global action mode를 선택하고 short-horizon touch로 local contact execution을 diffusion editing하는 inference-time steering method
Retrieve, Don’t Retrain: Extending Vision-Language-Action Models to New Tasks at Test Time 2026-06-16: VLA/WAM policy를 새 task마다 다시 fine-tuning하지 않고, 저비용 pool embodiment demonstration을 retrieval pool에 추가한 뒤 frozen policy가 매 control step마다 retrieved trajectory를 조건으로 action chunk를 생성하게 만든 test-time task adaptation method
T-Rex: Tactile-Reactive Dexterous Manipulation 2026-06-16: tactile-free human egocentric pretraining으로 얻은 visuomotor prior를 tactile-rich robot mid-training으로 contact dynamics에 맞춘 뒤, slow action expert와 fast tactile expert를 cascaded flow matching으로 연결해 action chunk 내부에서도 tactile feedback에 반응하는 tactile-reactive dexterous VLA
Elastic Queries Reinforcement Learning: Self-Aware Policy Execution for VLA Models 2026-06-15: frozen flow-based VLA는 그대로 둔 채, lightweight RL adaptor가 매 query마다 latent steering w, denoising steps K, execution chunk length C를 동적으로 선택해 hard state에서는 더 많은 compute와 잦은 replanning을, easy state에서는 낮은 compute와 긴 open-loop execution을 수행하도록 만드는 elastic VLA execution framework
WAM4D: Fast 4D World Action Model via Spatial Register Tokens 2026-06-15: 4D geometry를 inference-time output으로 직접 만들지 않고, training-time spatial register token으로 future depth를 예측하게 만들어 geometric foundation prior를 causal video-action WAM에 distill한 뒤, deploy 시 geometry branch를 제거해 action chunk를 빠르게 생성
µ0: A Scalable 3D Interaction-Trace World Model 2026-06-15: pretraining 단계에서는 action-labeled robot data 없이 heterogeneous videos에서 추출한 semantic 3D interaction traces를 학습하고, downstream에서는 frozen trace world model의 hidden features를 action expert에 주입해 robot policy를 만드는 3D trace-space world model
EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations 2026-06-12: egocentric human manipulation video를 digital twin 기반으로 변환해, robot observation video와 실행 가능한 로봇 action trajectory를 함께 생성하고, 이를 이용해 real-robot dexterous visuomotor policy를 학습하는 human-video-to-robot-demo data engine
Improving Robotic Generalist Policies via Flow Reversal Steering 2026-06-12: coarse semantic action을 frozen flow-matching VLA의 역방향 ODE로 latent noise에 매핑한 뒤 다시 denoise해, generalist policy prior 안의 더 정교한 action mode를 호출하는 training-free steering 방법
WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation 2026-06-12: multi-view RGB + proprioception + action chunk를 입력으로 미래 latent rollout과 reward/value를 빠르게 예측해, π0.5 같은 VLA policy의 offline evaluation, synthetic-data policy improvement, test-time best-of-N planning을 가능하게 만든 action-conditioned latent world model
Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics 2026-06-11: suboptimal / OOD robot demonstrations를 Diffusion Policy 학습에 그냥 섞지 않고, diffusion timestep에 따라 “쓸 수 있는 구간”을 제한해 유용한 global plan 또는 local motion primitive만 뽑아 쓰는 imitation learning 방법
Dynamic Execution Horizon Prediction for Chunk-based Robot Policies 2026-06-11: pretrained action-chunking robot policy의 action generator는 완전히 고정하고, 현재 observation과 예측된 action chunk를 보고 “이번에 몇 step을 open-loop로 실행할지”를 PPO로 학습하는 lightweight execution-horizon predictor
Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination 2026-06-10: WAM의 미래 영상 예측을 photorealistic video generation이 아니라 action generation을 돕는 저비용 coarse future guidance로 재정의하고, compact video expert + low-resolution future latent + asymmetric video-action denoising으로 약 1B 규모에서 real-world policy inference latency를 약 98 ms/chunk까지 낮춤
SARM2: Multi-Task Stage Aware Reward Modeling for Self Improving Robotic Manipulation 2026-06-10: long-horizon robotic manipulation에서 VLA policy의 self-improvement를 위해, action-primitive stage estimator와 multi-gate MoE value head로 dense reward/value model을 만들고, 이를 SPIRAL의 offline-to-online residual RL data flywheel에 통합한다
AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing 2026-06-09: Video-DiT world planner는 low-frequency로 long-horizon latent context를 만들고, Action-DiT executor는 OVCR로 최신 observation에 맞게 context를 보정해 short action chunk를 high-frequency closed-loop로 실행하는 asynchronous WAM
GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation 2026-06-09: Qwen2.5-VL 기반 VLA에 latent action token K/V cache-conditioned stop-gradient DiT flow action expert, VGGT 기반 3D spatial encoder, relative end-effector action 기반 embodiment canonicalization을 결합해 unseen object / background shift / pretraining-unseen robot embodiment transfer를 개선하는 geometry-aware manipulation policy
MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation 2026-06-09: Cosmos-Predict2.5 기반 Video DiT의 intermediate denoising feature를 Motion DiT action policy에 주입하고, SONIC 기반 unified whole-body motion token으로 humanoid의 상·하체를 한 action space에 묶어 Unitree G1에서 real-time loco-manipulation을 수행
Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies 2026-06-09: few-shot SFT된 π0.5 flow-matching VLA를 고정된 self-rollout buffer와 learned Q-critic의 action-gradient로 offline RL fine-tuning하되, Q-gradient를 terminal action label이 아니라 denoising-time residual velocity supervision으로 바꾸어 학습
ActionMap: Robot Policy Learning via Voxel Action Heatmap 2026-06-08: VLA의 기존 single-point action decoder를 3D translation / 3D rotation / gripper voxel heatmap action head로 교체해, action space의 geometric proximity(인접성)를 학습 신호로 활용
3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training 2026-06-04: pretrained VLA를 VLA data + real-world 3D reasoning data로 co-training하면서, 3D foundation model과 reasoning-prompt teacher를 학습 중에만 사용해 2D image-only inference에서도 implicit 3D spatial reasoning을 action prediction에 주입
GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors 2026-06-04: 3D asset과 video foundation model prior를 이용해 humanoid loco-manipulation용 4D human-object interaction 데이터를 완전 디지털로 생성하고, 이를 Unitree G1용 tracking policy와 egocentric visual policy로 변환해 실제 로봇에 배포하는 data-generation / sim-to-real framework
OSCAR: Omni-Embodiment Skeleton-Conditioned World Action Model for Robotics 2026-06-04: pretrained Cosmos-Predict2.5-2B video DiT를 2D kinematic skeleton condition으로 fine-tuning하여, 여러 robot embodiment와 human hand에 걸쳐 action-conditioned future video를 생성하고 이를 RoboArena policy evaluation proxy로 쓴다
Cosmos 3: Omnimodal World Models for Physical AI 2026-06-03: language, image, video, audio, action을 하나의 Mixture-of-Transformers (MoT) 기반 omnimodal world model로 통합해, VLM·video generator·forward/inverse dynamics·robot policy를 하나의 Physical AI backbone으로 다루는 NVIDIA의 대규모 foundation model
PointAction: 3D Points as Universal Action Representations for Robot Control 2026-06-03: pretrained video diffusion model이 RGB뿐 아니라 temporally consistent XYZ pointmap까지 생성하게 만들고, 이 3D point dynamics를 embodiment-specific diffusion action decoder가 action chunk로 변환
See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs 2026-06-03: VLA executor가 coarse goal과 full image에서 “무엇을 할지/무엇을 볼지”를 스스로 추론하지 않도록 goal-preserving local language와 learned visual evidence budget을 함께 학습시키는 planner-executor VLA generalization framework
Continuous Reasoning for Vision-Language-Action 2026-06-02: VLA의 reasoning을 자연어 CoT가 아니라, 다른 VLA instance도 consume할 수 있는 WAE-regularized Gaussian continuous reasoning interface로 정의
VLAMotor: Test-Guided Enhancement of Vision-Language-Action Models via Agent-Based Data Synthesis 2026-06-02: training distribution에서 멀고 서로 중복되지 않는 테스트 케이스로 VLA 실패를 적극적으로 찾고, 그 실패 trajectory를 VLM agent가 성공 trajectory로 고쳐 fine-tuning data로 쓰는 failure-driven VLA enhancement framework
τ0-WM: A Unified Video-Action World Model for Robotic Manipulation 2026-06-02: action generation, video prediction, action-conditioned evaluation을 하나의 shared video diffusion backbone 위에서 통합한 manipulation framework

training-data

ENPIRE: Agentic Robot Policy Self-Improvement in the Real World 2026-06-22: coding agent가 실제 로봇의 reset → rollout → verification → policy/code refinement research loop를 직접 운영하고, 여러 robot–agent worker가 Git으로 실험 지식을 공유하면서 task policy를 자동 개선하게 만든 physical autoresearch harness
Do as I Do: Dexterous Manipulation Data from Everyday Human Videos 2026-06-18: monocular RGB human manipulation video를 4D hand–object trajectory로 복원하고, pretrained SAM 3D를 training-free guided flow sampling으로 object tracker처럼 재활용한 뒤, MuJoCo Warp의 dynamics-aware sampling optimization으로 22-DoF Sharpa Wave hand가 실행할 수 있는 robot trajectory로 변환하는 offline robot-data engine
MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction 2026-06-18: RGB history, object 위의 2D query points와 corresponding initial 3D coordinates, language instruction을 입력받아 object-attached point들의 미래 3D world-frame trajectory를 예측하도록 Molmo2를 대규모 human/robot/in-the-wild video로 pretrain하고, 이 motion prior가 robot policy initialization과 video generation guidance로 전이됨을 보임
Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement 2026-06-18: Real-robot demonstrations로 fine-tune한 VLA를 고정한 뒤, task-relevant object 6-DoF pose·proprioception·현재 base VLA action만 입력받는 lightweight residual RL policy를 simulation에서 학습하고 real robot에 adaptation 없이 결합해, FR3 5-task 평균 real success rate를 42%에서 76%로 높인 sim-to-real VLA enhancement framework
ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining 2026-06-17: 대규모 egocentric human video를 robot-compatible pseudo-action으로 변환하고, camera-space action / morphology conditioning / time-aligned chunking / reliability-aware auxiliary loss를 결합해 human + robot + simulation 데이터를 함께 VLA pretraining에 쓰는 unified VLA pretraining framework
Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models 2026-06-17: Qwen-VL 기반 VLA에 canonical state/action alignment, camera-frame EEF action, in-context policy adaptation, Human-to-Robot synthesis를 결합해 heterogeneous robot manipulation data를 coherent하게 scale하고 OOD task/scene·instruction·cross-embodiment generalization을 끌어올린 robot manipulation foundation model
Uncertainty Quantification for Flow-Based Vision-Language-Action Models 2026-06-17: flow matching 기반 VLA의 action generation ODE에서 ensemble velocity field disagreement(VFD)를 측정해 epistemic uncertainty를 추정하고, 이를 failure detection과 SAVE active fine-tuning data acquisition에 사용해 expert demonstration sample efficiency를 높임
T-Rex: Tactile-Reactive Dexterous Manipulation 2026-06-16: tactile-free human egocentric pretraining으로 얻은 visuomotor prior를 tactile-rich robot mid-training으로 contact dynamics에 맞춘 뒤, slow action expert와 fast tactile expert를 cascaded flow matching으로 연결해 action chunk 내부에서도 tactile feedback에 반응하는 tactile-reactive dexterous VLA
µ0: A Scalable 3D Interaction-Trace World Model 2026-06-15: pretraining 단계에서는 action-labeled robot data 없이 heterogeneous videos에서 추출한 semantic 3D interaction traces를 학습하고, downstream에서는 frozen trace world model의 hidden features를 action expert에 주입해 robot policy를 만드는 3D trace-space world model
EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations 2026-06-12: egocentric human manipulation video를 digital twin 기반으로 변환해, robot observation video와 실행 가능한 로봇 action trajectory를 함께 생성하고, 이를 이용해 real-robot dexterous visuomotor policy를 학습하는 human-video-to-robot-demo data engine
WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation 2026-06-12: multi-view RGB + proprioception + action chunk를 입력으로 미래 latent rollout과 reward/value를 빠르게 예측해, π0.5 같은 VLA policy의 offline evaluation, synthetic-data policy improvement, test-time best-of-N planning을 가능하게 만든 action-conditioned latent world model
Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics 2026-06-11: suboptimal / OOD robot demonstrations를 Diffusion Policy 학습에 그냥 섞지 않고, diffusion timestep에 따라 “쓸 수 있는 구간”을 제한해 유용한 global plan 또는 local motion primitive만 뽑아 쓰는 imitation learning 방법
GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors 2026-06-04: 3D asset과 video foundation model prior를 이용해 humanoid loco-manipulation용 4D human-object interaction 데이터를 완전 디지털로 생성하고, 이를 Unitree G1용 tracking policy와 egocentric visual policy로 변환해 실제 로봇에 배포하는 data-generation / sim-to-real framework

training-free

Grounding Generative Policies in Physics: Optimization-Guided Diffusion for Robot Control 2026-06-25: Frozen task-space diffusion policy의 DDIM sampling noise를 무작위로 뽑는 대신, robot reachability·collision·controller trackability를 만족하도록 최적화하여 cross-embodiment deployment를 수행하는 inference-time constrained diffusion method
Start Right, Arrive Right: Asynchronous Execution via Initial Noise Selection 2026-06-22: frozen flow-matching robot policy의 initial action noise를 backward ODE inversion과 repainting으로 조정하여, 이미 실행된 action prefix와 새 action chunk를 gradient·retraining 없이 연속적으로 연결하는 asynchronous inference method
Do as I Do: Dexterous Manipulation Data from Everyday Human Videos 2026-06-18: monocular RGB human manipulation video를 4D hand–object trajectory로 복원하고, pretrained SAM 3D를 training-free guided flow sampling으로 object tracker처럼 재활용한 뒤, MuJoCo Warp의 dynamics-aware sampling optimization으로 22-DoF Sharpa Wave hand가 실행할 수 있는 robot trajectory로 변환하는 offline robot-data engine
Retrieve, Don’t Retrain: Extending Vision-Language-Action Models to New Tasks at Test Time 2026-06-16: VLA/WAM policy를 새 task마다 다시 fine-tuning하지 않고, 저비용 pool embodiment demonstration을 retrieval pool에 추가한 뒤 frozen policy가 매 control step마다 retrieved trajectory를 조건으로 action chunk를 생성하게 만든 test-time task adaptation method
Improving Robotic Generalist Policies via Flow Reversal Steering 2026-06-12: coarse semantic action을 frozen flow-matching VLA의 역방향 ODE로 latent noise에 매핑한 뒤 다시 denoise해, generalist policy prior 안의 더 정교한 action mode를 호출하는 training-free steering 방법
Denoising Tells When to Replan: Denoising-Variance Adaptive Chunking for Flow-Based Robot Policies 2026-06-03: last denoising step들에서 clean-action estimate들의 variance를 future action별 stability proxy로 사용해, 안정적인 action prefix만 실행하고 고분산 구간 전에 replan
PACE: Phase-Aware Chunk Execution for Robot Policies with Action Chunking 2026-06-02: action chunking robot policy에서 고정 execution horizon 대신, predicted action chunk의 low-speed valley를 phase boundary로 사용해 매 query마다 실행 길이를 동적으로 선택하는 training-free test-time execution 방법
OxyGen: Unified KV Cache Management for VLA Inference under Multi-Task Parallelism 2026-05-19: MoT VLA에서 action과 language task가 공유하는 observation KV cache를 통합 관리해 중복 prefill과 resource contention을 줄이고 action frequency와 language throughput을 동시에 높이는 inference system