2. Limited Closed-loop Reproduction, Route-level Profiling, and Wrist-camera Robustness

2026-06-16 English Profiling Python

A limited closed-loop reproduction and probing project for Realtime-VLA FLASH on Runpod L40S: official checkpoint conversion, LIBERO Goal baseline, synchronized route-level profiling, wrist-camera dropout robustness, and a minimal WristHealthGuard extension

TL;DR

I ran Realtime-VLA FLASH on a Runpod NVIDIA L40S instance using the official repository and public checkpoints. I resolved and converted the public pi0_libero base checkpoint and draft_libero_goal draft checkpoint into Triton artifacts, and then executed a limited closed-loop LIBERO Goal baseline. The project culminated in synchronized route-level profiling, a wrist-camera dropout robustness probe, and a minimal local inference-time extension called WristHealthGuard.

The most important result is not “paper reproduced.” The correct result is more precise:

This is a completed limited closed-loop reproduction + probing + profiling + minimal extension project. It partially supports the Realtime-VLA FLASH latency mechanism in a limited setting, but it is not a full paper reproduction, not a hardware-exact latency reproduction, and not a full robustness evaluation.

The final measured highlights were:

Component	Result
Stage 4 limited baseline	27/30 success on LIBERO Goal tasks 0–2
Stage 4 route mix	full = 88, draft = 276
Stage 4 draft/full ratio	0.758 / 0.242
Stage 4 accepted prefix mean	10.074
Stage 6 synchronized server-side p50	8.083 ms
Stage 6 draft route p50	8.067 ms
Stage 6 full route p50	33.156 ms
Stage 6 client roundtrip p50	11.957 ms
Stage 5 `wrist_zero_every4`	4/15 success
Stage 5 `wrist_zero_all`	0/15 success
Stage 7 WristHealthGuard `every4`	6/15 success
Stage 7 WristHealthGuard `allzero`	0/9 success

The paper claims that FLASH reduces LIBERO inference latency by replacing many 58.0 ms full-inference rounds with speculative rounds as fast as 7.8 ms, lowering task-level average latency to 19.1 ms with a 3.04× speedup and only a 0.3-point average success drop. My project supports the mechanism-level story in a limited Runpod L40S setting, but it does not reproduce the full paper benchmark or the real-world conveyor experiment. (arXiv)

1. Why this paper

Realtime-VLA FLASH is interesting because it attacks a real deployment bottleneck in diffusion-based vision-language-action models, or dVLAs. dVLAs can generate high-quality continuous action chunks, but full-path inference can be too slow for reactive closed-loop robot control. In a robot system, the problem is not just throughput. The policy must repeatedly consume fresh observations, produce action chunks, and avoid executing stale actions when the scene has changed.

The project page describes the central motivation clearly: synchronous full-path inference can make robot commands stale in reactive scenes, and FLASH replaces slow full-path replanning with speculative rounds that execute only the verified action prefix and fall back when the draft is inconsistent. (Project Page)

This makes the paper attractive as a paper-to-prototype-lab project for three reasons.

First, the paper’s claim is not purely model-centric. It is a runtime and control-loop claim. The system has a full path, a draft path, parallel verification, accepted-prefix execution, and fallback. If any part of that chain fails in practice, the paper’s real-time control story becomes much weaker.

Second, the official repository and public checkpoint path make the project reproducible enough to audit systematically. The repository provides installation instructions, Triton conversion commands, a policy server command, and a LIBERO client command. It also explicitly supports a workflow where the LIBERO client/evaluation code can run in a separate environment. (GitHub)

Third, the failure modes are valuable even when the project does not fully reproduce the paper. Closed-loop robot model projects often fail because of renderer issues, checkpoint ambiguity, server/client incompatibility, action-shape mismatch, logging gaps, or invalid latency measurement. This project was designed to make those failure modes explicit instead of hiding them.

2. What this project claims and does not claim

This project makes a narrow set of claims.

This project claims

I successfully ran the official Realtime-VLA FLASH repository on Runpod L40S under a limited evaluation scope.
I resolved the public pi0_libero base checkpoint and converted both base and draft artifacts into Triton layout.
I booted the official pi0_libero Triton policy server and connected it to the LIBERO client.
I executed a limited closed-loop LIBERO Goal baseline on tasks 0–2 and obtained 27/30 success.
I added synchronized server-side timing and observed that the draft route p50 latency was lower than the full route p50 latency in this limited setting.
I probed synthetic wrist-camera dropout and found severe degradation.
I implemented a minimal local inference-time extension, WristHealthGuard, and observed modest recovery under intermittent dropout.

This project does not claim

It does not claim full paper reproduction.
It does not claim full LIBERO benchmark coverage.
It does not claim hardware-exact paper latency reproduction.
It does not claim real-world conveyor reproduction.
It does not claim general robustness.
It does not claim that WristHealthGuard is an upstream-quality method.

The Stage 8 claim matrix explicitly labels the full benchmark and real-world conveyor result as not claimed, while the limited baseline, latency mechanism, wrist-dropout probe, and WristHealthGuard extension are marked as limited or partial support only.

3. Paper claim decomposition

I did not treat “reproduce Realtime-VLA FLASH” as one monolithic objective. Instead, I decomposed the paper and project into progressively stronger claims.

Claim ID	Claim	My result	Status
C1	Official repo and model environment can be prepared	model env and imports passed	supported
C2	Draft and base artifacts for `pi0_libero` can be resolved and converted	draft/base conversion passed	supported
C3	Policy server can boot with converted artifacts	server readiness reached	supported
C4	End-to-end server/client/LIBERO integration works for one task	2/2 task 0 episodes succeeded	supported
C5	FLASH+Triton can complete a limited LIBERO Goal baseline	27/30 success on tasks 0–2	supported limited
C6	FLASH latency mechanism can be profiled	draft p50 < full p50	partially supported limited
C7	Flash route and accepted-prefix behavior can be inspected	route/prefix logs collected	supported limited
C8	Wrist-camera dropout robustness can be probed synthetically	strong degradation observed	negative limited
C9	WristHealthGuard can recover some intermittent dropout	4/15 → 6/15	partially supported limited
C10	Full paper benchmark is reproduced	not attempted	not claimed
C11	Real-world conveyor result is reproduced	not attempted	not claimed

This decomposition matters. It prevents a common reproduction mistake: using one successful smoke test as evidence for a much larger claim.

The project’s final status is therefore:

complete limited closed-loop reproduction + probing + profiling + minimal extension; full paper reproduction: no; paper claim status: partially supported in limited setting.

4. Method recap: Realtime-VLA FLASH

Realtime-VLA FLASH can be understood as a dual-path runtime for diffusion-based VLA policies.

4.1 Full path

The full path is the reliable but expensive route. It runs the full image encoding, VLM prefill, and action denoising process. In a closed-loop robot system, this path refreshes the high-fidelity context and provides the anchor against which speculative outputs can be checked.

4.2 Draft path

The draft path generates a candidate action chunk more cheaply. The idea is not to discard the main model, but to avoid rerunning the full pipeline at every replanning round. The draft path becomes valuable only if it is sufficiently fast and sufficiently aligned with the full model’s action behavior.

The official repository highlights “speculative inference as fast as 7.8 ms” and customized Triton serving as central to the system. (GitHub)

4.3 Parallel verification

The difficult part is that dVLAs output continuous actions through flow matching or denoising, not discrete language tokens. Therefore, token-level speculative decoding cannot be used directly.

FLASH instead verifies draft action chunks using the main Action Expert at selected flow-matching timesteps. The project page summarizes this as a flash path that drafts and verifies a candidate action chunk and returns the longest consistent prefix. (Project Page)

4.4 Accepted prefix

The accepted prefix is the portion of the candidate action chunk that the system decides to execute. This is a control-loop variable, not just an inference artifact. A longer accepted prefix reduces replanning frequency and can improve runtime efficiency, but it also risks executing stale actions if the environment changes or the observation is corrupted.

In my experiments, accepted prefix was useful but not sufficient as an uncertainty signal. In the clean Stage 4 baseline, the accepted prefix mean was 10.074 and the p50/p95/p99 were all 12. Under wrist-camera dropout, the mean decreased, but high percentiles often remained saturated at 12. This means the accepted prefix captured some degradation but did not reliably flag all failure-prone states.

4.5 Phase-aware fallback

FLASH also includes fallback logic. In principle, smooth motion phases tolerate approximate draft behavior, while precision-sensitive phases such as final alignment or gripper switching should fall back to the full path. My project did not isolate phase-aware fallback at the same depth as the paper, but it did track route mix, accepted prefix, success, and failure modes across clean, perturbed, and guarded settings.

5. Action representation and control-loop meaning

The policy returned action chunks with shape (50, 7). In the LIBERO setting, this is a continuous robot control representation rather than a discrete token sequence.

The critical control-loop variables in this project were:

Variable	Why it matters
Action chunk shape	Confirms policy-client action interface compatibility
Route type	Indicates whether the runtime used full or draft path
Accepted prefix length	Determines how many actions are executed before replanning
Client roundtrip	Measures effective closed-loop delay observed by the simulator client
Server-side action generation time	Measures model-side computation more directly
Success/failure at horizon	Captures whether the closed-loop system actually solved the task

This is why I did not stop at server boot. Server boot proves readiness; it does not prove closed-loop behavior. The project only became meaningful after the LIBERO client received valid action chunks, stepped the simulator, produced episode logs, and completed tasks.

6. Stage 0: scope lock

I locked the scope before moving into model and checkpoint setup.

The scope lock stated:

closed-loop simulator scope: allowed
checkpoint/model setup: allowed
full benchmark: not allowed yet
current project type: conditional partial closed-loop reproduction
paper reproduction claim: not allowed yet

This was not bureaucracy. It prevented a common failure mode: treating setup progress as reproduction evidence. The Stage 0 scope document explicitly preserved the distinction between infrastructure readiness, model readiness, and paper reproduction.

7. Stage 1–2B: model, checkpoint, and server readiness

7.1 Model environment

The top-level model environment was kept separate from the .venv-libero client environment.

The final model environment used:

Component	Version
Python	3.11.13
torch	2.7.1+cu126
torch CUDA	12.6
Triton	3.3.1
JAX	0.5.3
transformers	4.53.2
GPU	NVIDIA L40S

This separation was important because the official repository requires Python ≥3.11 and dependencies such as torch==2.7.1, triton==3.3.1, jax[cuda12]==0.5.3, and transformers==4.53.2, while LIBERO uses a different evaluation environment. The repository’s pyproject.toml confirms the Python and major dependency requirements. (GitHub)

The import smoke tested torch, triton, jax, transformers, openpi, openpi.training.config, openpi.policies.policy_config, CUDA availability, JAX devices, and the transformer replacement patch.

7.2 Draft checkpoint

The draft checkpoint draft_libero_goal.pt was downloaded and loaded on CPU. The Stage 2 draft load script recorded file existence, file size, top-level object type, keys, tensor count, and tensor shapes/dtypes without printing large tensors.

The draft Triton conversion succeeded and produced a local-only draft_triton.pkl.

7.3 Base checkpoint blocker and resolution

The first serious blocker was base checkpoint ambiguity. The repository quick start expects a JAX checkpoint path for base conversion, but the exact public path for pi0_libero had to be resolved carefully. The official README shows the expected conversion flow: first convert a pretrained base checkpoint with --mode base --jax-path /path/to/jax/checkpoint, then convert the draft checkpoint, then serve with --config pi0_libero, --base-triton-path, --draft-triton-path, and --backend triton. (GitHub)

I did not force a blind large download. Instead, I created a metadata-only checkpoint probe and classified candidates. The probe explicitly distinguished:

MATCH_FLASH_PI0_LIBERO_JAX
MATCH_FLASH_PI0_LIBERO_TORCH_ONLY
BASE_PI0_PRETRAIN_ONLY
PUBLIC_PI05_LIBERO_MISMATCH
LOCAL_PLACEHOLDER_ONLY
NOT_FOUND
UNKNOWN_ACCESS_ERROR

The metadata probe classified gs://openpi-assets/checkpoints/pi0_libero as the correct JAX/Orbax-style root only if it had params, assets, and checkpoint metadata.

The base checkpoint resolver then downloaded only the single candidate classified as MATCH_FLASH_PI0_LIBERO_JAX, validated that it contained params, assets, checkpoint metadata, and norm stats, and wrote the final base resolution decision.

This resolved the blocker:

Base checkpoint: gs://openpi-assets/checkpoints/pi0_libero
Local size: approximately 12 GB
Converted artifact: converted/base/base_weights.pkl
Norm stats: assets/physical-intelligence/libero/norm_stats.json

All checkpoint and converted-weight artifacts were kept local-only and excluded from Git. The Stage 8 artifact index confirms that checkpoints, converted weights, videos, datasets, profiler binaries, private endpoints, and credentials were excluded from Git.

7.4 Server boot

With converted base and draft artifacts present, the server boot smoke used the official serving path:

--config pi0_libero
--base-triton-path converted/base
--draft-triton-path converted/draft_goal
--task-suite-name libero_goal
--backend triton

The boot smoke reached readiness on port 8000.

This still did not prove closed-loop success. It only proved that the server could start with the converted artifacts.

8. Stage 3: one-task closed-loop smoke

Stage 3 was the first real end-to-end closed-loop test.

Setup:

Item	Value
Suite	`libero_goal`
Task	0
Trials	2
Seed	7
Render backend	EGL
Server backend	Triton
Config	`pi0_libero`

The server was launched with the converted pi0_libero base and draft_libero_goal draft artifacts. The LIBERO client ran in .venv-libero with EGL offscreen rendering.

Results:

Metric	Value
Episodes requested	2
Episodes completed	2
Success	2
Success rate	1.000
Infer calls	21
Route counts	full = 2, draft = 19
Accepted prefix mean/p50/p95	12 / 12 / 12
Peak VRAM	7390 MiB

Stage 3 verified that the server, client, policy, action interface, simulator, logs, and videos were connected. It also confirmed action chunks with valid shape. The Stage 3 report correctly frames this as a tiny smoke test rather than a benchmark result.

9. Stage 4: limited closed-loop baseline

Stage 4 was the clean baseline.

Setup:

Item	Value
Suite	`libero_goal`
Tasks	0, 1, 2
Episodes per task	10
Total measured episodes	30
Seed	7
Backend	Triton + EGL
Warm-up	excluded

Results:

Task	Episodes	Success	Success rate
0	10	9	0.900
1	10	9	0.900
2	10	9	0.900
Aggregate	30	27	0.900

Route behavior:

Metric	Value
Full route count	88
Draft route count	276
Draft ratio	0.758
Full ratio	0.242
Accepted prefix mean	10.074

This is the strongest closed-loop baseline result in the project. It shows that the official setup can run a nontrivial limited LIBERO Goal subset on Runpod L40S.

But it is still not a full paper reproduction. It used only LIBERO Goal tasks 0–2, one seed, and 30 measured episodes. The Stage 8 final metrics and project summary explicitly preserve this limited interpretation.

10. Stage 6: synchronized latency profiling

Stage 4 already had latency fields, but they were preliminary wall-clock timings. That was not enough for a latency-focused paper.

Therefore, Stage 6 added synchronized server-side timing. The code change is captured in scripts/stage6_apply_server_timing_patch.py and patches/stage6_server_timing.patch. The key new field was:

policy_time_gpu_sync_ms

The measurement pattern used torch.cuda.synchronize() before and after the server-side action-generation path. This does not make the experiment hardware-exact relative to the paper, but it is much better than unsynchronized wall-clock timing.

Setup:

Item	Value
Suite	`libero_goal`
Tasks	0, 1, 2
Episodes per task	3
Total measured episodes	9
Warm-up	excluded
Timing	server-side synchronized timing
Claim status	`PARTIALLY_SUPPORTED_LIMITED`

Results:

Metric	Value
`policy_time_gpu_sync_ms` p50	8.083 ms
`policy_time_gpu_sync_ms` p95	33.501 ms
Draft route p50	8.067 ms
Full route p50	33.156 ms
Client roundtrip p50	11.957 ms

Interpretation:

The draft route was substantially faster than the full route at p50 in this limited setting. This supports the latency mechanism: speculative/draft rounds can be much cheaper than full-path rounds.

But the result is not a hardware-exact paper latency reproduction. It used Runpod L40S, a small LIBERO Goal subset, and custom synchronized timing fields. The Stage 8 final metrics label the latency claim as PARTIALLY_SUPPORTED_LIMITED.

Also, p99 latency remained noisy due to residual first-use and route-specific warm-up tails. This is why the blog should emphasize p50/p95 mechanism-level support and avoid overclaiming p99 stability. The caveats document explicitly says Stage 6 p99 values should be interpreted cautiously.

11. Stage 5: wrist-camera dropout robustness probe

Stage 5 added a synthetic robustness probe. The client-side perturbation is captured in scripts/stage5_apply_client_perturb_patch.py and patches/stage5_client_camera_perturb.patch.

The perturbation targeted the wrist camera image:

Source key: robot0_eye_in_hand_image
Sent as: observation/wrist_image

Two conditions were tested:

Condition	Perturbation
`wrist_zero_every4`	Zero wrist image every fourth policy query
`wrist_zero_all`	Zero wrist image at every policy query

Setup:

Item	Value
Suite	`libero_goal`
Tasks	0, 1, 2
Episodes per task per condition	5
Total measured robustness episodes	30
Seed	7
Baseline reference	Stage 4 clean baseline

Results:

Condition	Task 0	Task 1	Task 2	Aggregate
Stage 4 clean	9/10	9/10	9/10	27/30
`wrist_zero_every4`	0/5	0/5	4/5	4/15
`wrist_zero_all`	0/5	0/5	0/5	0/15

Route and prefix behavior:

Metric	Stage 4 clean	`wrist_zero_every4`	`wrist_zero_all`
Full ratio	0.242	0.322	0.388
Accepted prefix mean	10.074	7.929	6.550

The interpretation is clear:

Wrist-camera dropout strongly harmed this subset.
Full-route ratio increased under perturbation.
Accepted-prefix mean decreased under perturbation.
However, high-percentile accepted prefix remained saturated at 12.
Therefore, accepted prefix is only a partial uncertainty signal.

This is a negative result, and it is valuable. It shows that speculative execution logs can reveal some degradation, but they do not fully diagnose observation corruption.

The Stage 8 caveats explicitly preserve this interpretation: synthetic wrist-camera dropout caused strong degradation, and accepted prefix remained saturated at high percentiles under failure.

12. Stage 7: WristHealthGuard minimal extension

Stage 7 implemented a small local inference-time extension called WristHealthGuard.

12.1 Motivation

Stage 5 suggested that accepted-prefix-only adaptation would not be the best first extension. The failure source was not primarily an action-prefix scheduling problem. It was a corrupted wrist observation problem.

Therefore, the extension targeted observation health directly.

12.2 Design

WristHealthGuard uses a simple last-valid wrist-frame cache. The patch application script is scripts/stage7_apply_wrist_health_guard_patch.py.

The anti-cheating order was:

Get simulator observation.
Apply Stage 5 dropout perturbation.
Compute wrist-image health on the perturbed image.
If unhealthy and a valid cached wrist image exists, replace the image with the cached copy.
If healthy, update the cache.
Send the final observation to the policy.

The guard never sees the clean pre-perturb image after dropout. The cache resets at the start of every episode.

The health metric was:

Metric	Rule
std	healthy if `std >= 1.0`
range	healthy if `range >= 5.0`

This is intentionally simple. The goal was not to invent a robust perception module, but to test whether a minimal observation-health heuristic can recover some synthetic intermittent dropout.

12.3 Results

Setup:

Condition	Episodes
`guard_wrist_zero_every4`	15
`guard_wrist_zero_all`	9
Sanity clean guard	1, excluded

Success:

Condition	Stage 5 no guard	Stage 7 guard	Recovery
`wrist_zero_every4`	4/15	6/15	+0.133
`wrist_zero_all`	0/15	0/9	+0.000

Guard behavior:

Metric	Every4	All-zero
Cache hits	96	0
Cache misses	15	433
Cache updates	308	0
Replacements	96	0
Cache hit rate	0.229	0.000
Healthy wrist rate	0.735	0.000
Unhealthy wrist rate	0.265	1.000

Interpretation:

WristHealthGuard modestly improved intermittent dropout.
It did not solve persistent all-zero dropout.
This is expected because the cache resets per episode and all-zero dropout provides no valid frame to cache.
The all-zero result is a useful sanity check against cheating: the guard did not secretly access clean pre-perturb frames.

The Stage 8 claim matrix marks this as partially_supported_limited, not as a general robustness solution.

13. Failure mode taxonomy

A major purpose of this project was to classify failures rather than simply report success rates.

Failure mode	Stage	Interpretation
Checkpoint ambiguity	Stage 2	resolved in Stage 2B
Server boot uncertainty	Stage 2B	resolved by readiness smoke
Integration uncertainty	Stage 3	resolved by 2/2 closed-loop smoke
Horizon no-success	Stage 4	policy behavior failure, not infrastructure
Wrist-camera perturbation failure	Stage 5	perturbation-induced perception failure
Accepted-prefix saturation under failure	Stage 5/7	accepted prefix is partial signal
Persistent all-zero dropout	Stage 7	no valid cache, expected guard failure
p99 latency tail	Stage 6/7	residual first-use / tail behavior

The key shift across the project was that failures moved from infrastructure and artifact problems to actual closed-loop behavior problems. That is a good sign. It means the system was running deeply enough for policy-level and observation-level analysis to become possible.

14. Reproducibility and artifact policy

The project used a strict public-safe artifact policy.

Tracked artifacts included:

stage reports
configs
scripts
patches
lightweight logs
parsed JSON/CSV summaries
figures
final claim matrix
final project summary
final blog draft

Excluded artifacts included:

checkpoints
converted weights
videos
datasets
profiler binaries
private endpoints
credentials

The Stage 8 artifact index confirms that all required source artifacts were present and that local-only artifacts were excluded from Git.

The fixed reproducibility inputs were:

Input	Value
Official repo	`dexmal/realtime-vla-flash`
Official commit	`da6ceccad603695a8a3d6fa14dd410c3aadb536f`
Project repo	Changhyun-Choi-98/realtime-vla-flash-runpod-project
Hardware	Runpod NVIDIA L40S
Simulator	LIBERO / MuJoCo with EGL
Main config	`pi0_libero`
Main suite	`libero_goal`
Main tasks	0, 1, 2
Seed	7

The reproducibility checklist also notes that re-running from scratch requires enough disk for the public base checkpoint and converted Triton artifacts.

15. Comparison to the paper

The paper-level claim should be kept separate from my project-level result.

15.1 What the paper reports

The paper and project page report:

58.0 ms full-inference rounds
speculative rounds as fast as 7.8 ms
19.1 ms task-level average inference latency
3.04× speedup
0.3-point average success drop
real-world conveyor-belt sorting demonstration (arXiv)

15.2 What this project measured

My project measured:

27/30 success on LIBERO Goal tasks 0–2
draft/full ratio 0.758 / 0.242 in Stage 4
synchronized server-side policy_time_gpu_sync_ms p50 8.083 ms in Stage 6
draft route p50 8.067 ms
full route p50 33.156 ms
client roundtrip p50 11.957 ms
wrist-camera dropout degradation
WristHealthGuard minimal extension behavior

15.3 Claim status

The correct claim status is:

Paper area	My status
Full benchmark success preservation	not reproduced
Latency mechanism	partially supported in limited setting
Flash path faster than full path	supported in limited synchronized profiling
Accepted-prefix behavior	supported as logged behavior, but partial uncertainty signal
Real-world conveyor result	not attempted
Robustness under wrist-camera dropout	project-added negative probe
WristHealthGuard	project-added minimal extension

Thus the final phrase should be:

Partially supported in a limited setting. Not a full paper reproduction.

16. What this project does not claim

This section is intentionally explicit.

This project does not claim:

Full LIBERO benchmark reproduction.
Full paper result reproduction.
Hardware-exact latency reproduction.
Real-world conveyor reproduction.
General robustness.
General sensor-dropout robustness.
That WristHealthGuard is an upstream-quality extension.
That accepted prefix is a reliable uncertainty estimator.
That the measured latency should be directly compared to the paper without task, hardware, and timing-method caveats.

The safest description is:

limited closed-loop reproduction + probing + profiling + minimal extension.

17. Final takeaway

Realtime-VLA FLASH was reproducible enough to become a strong portfolio-quality research-engineering project.

The strongest project-level result is not a single success-rate number. It is the full chain:

official repo
public checkpoint resolution
Triton conversion
server readiness
closed-loop LIBERO rollout
limited baseline
synchronized profiling
robustness probing
minimal extension
claim matrix and caveats

The clean limited baseline was strong: 27/30 success on LIBERO Goal tasks 0–2. The synchronized profiling supported the latency mechanism: draft route p50 was lower than full route p50 in this limited setting. The robustness probe revealed a concrete weakness: wrist-camera dropout severely reduced success. The minimal extension gave a modest but interpretable improvement under intermittent dropout and failed, as expected, under persistent all-zero dropout.

This is exactly the kind of result that is useful for robot foundation model research: not overclaimed, not merely a README run, and not just a failed attempt. It turns a paper into a controlled artifact trail.

18. Career takeaway

From a Physical AI / Robot AI Model Researcher perspective, this project is valuable because it demonstrates more than implementation ability.

It demonstrates claim discipline.

Many reproduction projects fail because they jump from “the repo runs” to “the paper is reproduced.” This project did not do that. It separated:

model environment readiness
checkpoint provenance
Triton conversion
server readiness
closed-loop integration
limited baseline success
synchronized latency profiling
robustness probing
local extension

The Stage 8 career takeaway summarizes the project well: the value is that I did not believe or reject the paper wholesale; I narrowed the problem stage by stage and left a reproducible artifact trail.

This is important for robot AI research because real systems fail at the boundaries: renderer, simulator, checkpoint, client/server protocol, action representation, timing measurement, observation corruption, and control-loop latency. A strong researcher should be able to identify which layer failed and avoid turning infrastructure success into algorithmic evidence.

In this project, I reached the point where the remaining failures were no longer environment failures. They were meaningful behavior failures: horizon no-success, perturbation-induced perception failure, accepted-prefix saturation, and persistent observation loss. That is the point where research begins.

Appendix A. Final metrics

Category	Metric	Value
Stage 4 baseline	Task 0	9/10
Stage 4 baseline	Task 1	9/10
Stage 4 baseline	Task 2	9/10
Stage 4 baseline	Aggregate	27/30
Stage 4 route	Full	88
Stage 4 route	Draft	276
Stage 4 route	Draft ratio	0.758
Stage 4 route	Full ratio	0.242
Stage 4 prefix	Accepted prefix mean	10.074
Stage 6 latency	`policy_time_gpu_sync_ms` p50	8.083 ms
Stage 6 latency	`policy_time_gpu_sync_ms` p95	33.501 ms
Stage 6 latency	Full route p50	33.156 ms
Stage 6 latency	Draft route p50	8.067 ms
Stage 6 latency	Client roundtrip p50	11.957 ms
Stage 5 robustness	Clean Stage 4	27/30
Stage 5 robustness	`wrist_zero_every4`	4/15
Stage 5 robustness	`wrist_zero_all`	0/15
Stage 7 extension	`guard_wrist_zero_every4`	6/15
Stage 7 extension	`guard_wrist_zero_all`	0/9
Stage 7 extension	Every4 recovery	+0.133
Stage 7 extension	All-zero recovery	+0.000

These metrics are also encoded in results/stage8_final_metrics.json.

Appendix B. Final claim matrix summary

Claim	Status	Blog wording	Caveat
C1 official repo/env readiness	supported	model env could be prepared	not evaluation
C2 checkpoint/draft/base conversion	supported	public checkpoint conversion worked	weights local-only
C3 policy server boot	supported	server readiness confirmed	not rollout
C4 one-task closed-loop smoke	supported	end-to-end connection confirmed	tiny smoke
C5 limited LIBERO Goal baseline	supported limited	27/30 success	tasks 0–2 only
C6 latency mechanism	partially supported limited	draft p50 < full p50	not hardware-exact
C7 route / accepted prefix	supported limited	accepted prefix inspected	partial signal
C8 wrist dropout robustness	negative limited	wrist dropout was damaging	synthetic only
C9 WristHealthGuard	partially supported limited	intermittent dropout modestly recovered	not general robustness
C10 full paper benchmark	not claimed	full benchmark not run	out of scope
C11 real-world conveyor	not claimed	real-world result outside scope	no robot setup

The full matrix exists in both Markdown and JSON form.