07 — LIBERO eval (Spatial / Object / Goal / Long)
One-line goal: Install LIBERO, run a fine-tuned OpenVLA (or Octo) checkpoint from project 11 against the four LIBERO suites, report per-suite success rates with confidence intervals, save a failure rollout video per suite, and write a one-paragraph reflection on the LIBERO-PRO "saturation" critique.
Goal
By the end of this project you will have:
- A working LIBERO install (Robosuite + MuJoCo) on a single-GPU box, with a smoke test that renders one task from each of the four suites — `libero_spatial`, `libero_object`, `libero_goal`, `libero_10` (a.k.a. LIBERO-Long).
- A loaded VLA policy: either the OpenVLA-7B checkpoint you fine-tuned in project 11, or one of the public `openvla-7b-finetuned-libero-*` checkpoints from HuggingFace if project 11 isn't done.
- Four success-rate numbers — one per suite — measured over N rollouts × 10 tasks per suite (canonical N is 50, but 10 is fine on a single GPU; the notebook lets you pick), with a Wilson 95% confidence interval reported next to each.
- Four MP4/GIF artifacts in `outputs/failures/` — one chosen failure rollout per suite, annotated with the language instruction, final pose, and failure mode (timeout / wrong-object / dropped).
- A reflection paragraph in the notebook contrasting your numbers against the published OpenVLA / OpenVLA-OFT / pi0.5 numbers, and against the LIBERO-PRO "perturbed" numbers, with a one-line answer to: which of the four suites does my model actually generalise on, and which has it just memorised?
The whole loop should fit in 2–4 hours on a single GPU after install (install itself is the slow part — budget another 1–2 hours the first time).
Loops touched
This is an EVAL project in the four-loop AV/robot data flywheel (collect → label → train → evaluate, then curate the failures back into collect).
- EVAL — you take a trained policy (your artifact from project 11) and stress-test it against a standardised, reproducible benchmark. You produce summary numbers and per-failure artifacts. You read the benchmark literature critically and decide which of your numbers are real and which are benchmark-overfit.
You do not fine-tune or curate data here — that's projects 01, 02, 06. The "curate failures back into collect" half of the flywheel is what project 17 does end-to-end.
Why this matters for AI Data Intelligence
LIBERO is the de facto VLA evaluation benchmark in 2024–2026. Every paper that claims a new "generalist robot policy" — OpenVLA, Octo, RT-2-X follow-ons, pi0, pi0.5, GR00T-N1, RDT-2 — reports LIBERO numbers, and most of the published headline numbers (>90% on LIBERO-Spatial, >95% on LIBERO-Goal, etc.) come from this benchmark. If you're ever in a room arguing about whether a robotics-foundation-model claim is real, you need to know:
- What LIBERO actually measures. Four suites of 10 tasks each, all in Robosuite + MuJoCo, all with a Franka Panda arm, all with the same low-DoF action space (7-D delta end-effector + gripper). Spatial = same objects, varying spatial relations. Object = varying object types, fixed scene. Goal = same scene + objects, varying goal predicate. Long (LIBERO-10) = 10 long-horizon multi-step tasks pulled from LIBERO-100. 50 demonstrations per task in the training data.
- What the canonical eval protocol is, and how teams quietly bend it. The OpenVLA paper standardised "N=50 rollouts × 10 tasks × 4 suites = 2000 episodes per model", with episode-length caps of 280 / 280 / 300 / 520 steps for spatial / object / goal / long. Several papers since report "10 rollouts per task" or "20 rollouts per task" without saying so on the headline plot. A 5-point success-rate gap at 10 rollouts × 10 tasks (n=100 trials) is statistically a coin-flip — the Wilson 95% CI is roughly ±10 points (the helper sketched after this list makes that concrete). Always read the eval section, not just the bar chart.
- Why the original splits are saturated. LIBERO-PRO (arXiv 2510.03827) shows that frontier VLAs that score >90% on standard LIBERO collapse to 0% under reasonable perturbations of object identity, initial pose, language phrasing, and lighting. The conclusion: the original splits are now mostly memorisation tests. Reporting >95% on LIBERO-Spatial in 2026 means roughly the same thing as reporting >99% on MNIST in 2018: the benchmark is full, not the model.
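To make the ±10-point claim concrete, here is a minimal, self-contained Wilson-interval helper (plain Python; the function name is ours, not the notebook's). At n=100 with 50 successes the 95% interval spans roughly 40–60%, i.e. about ±10 points; at the full 2000-episode OpenVLA protocol it tightens to about ±2 points.

```python
import math

def wilson_ci(n_success, n_total, z=1.96):
    """Wilson score confidence interval for a binomial proportion (default 95%)."""
    if n_total == 0:
        return (0.0, 1.0)
    p = n_success / n_total
    denom = 1 + z**2 / n_total
    center = (p + z**2 / (2 * n_total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n_total + z**2 / (4 * n_total**2))
    return (max(0.0, center - half), min(1.0, center + half))

print(wilson_ci(50, 100))     # ~ (0.40, 0.60): 10 rollouts x 10 tasks, ±10 points
print(wilson_ci(1000, 2000))  # ~ (0.48, 0.52): the 50-rollout protocol, ±2 points
```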
This is the conversation Applied Intuition's Data Intelligence customers have constantly with vendors who claim "97% generalist robot performance". The job of a data-intelligence engineer is to be the person in the room who knows which number, on which split, with which protocol, and what perturbation it would take to make that number fall apart. That's what this project rehearses.
Prerequisites
- Project 11 complete, OR you skip it and use a public checkpoint such as `openvla/openvla-7b-finetuned-libero-spatial` (see notebook Step 4). Either path works; the eval code is the same.
- Python 3.8 (mandatory). LIBERO pins to Python 3.8.13 — Robosuite 1.4.x and the bddl 1.0.1 parser predate Python 3.10's removal of `collections.Iterable` and similar aliases. 3.9 mostly works; 3.10+ requires patches.
- OS: Linux x86_64 is the reference platform. macOS works for training-side code (project 11), but MuJoCo rendering on macOS is fragile — see "Hardware" below.
- CUDA: any modern NVIDIA GPU (≥ 8 GB VRAM for OpenVLA-7B 4-bit inference; ≥ 24 GB for fp16 inference; Octo-base runs on a 4070).
- Disk: ~10 GB. LIBERO datasets are ~3 GB; OpenVLA-7B int4 weights are ~7 GB; rollout videos add up fast.
- Background: comfortable Python; you've installed PyTorch with CUDA before; you've at least seen Robosuite's `env.step(action)` API.
Hardware
LIBERO is a fast simulator (Robosuite under MuJoCo, no Unreal/Unity). On a single 4090, ~200 frames/sec wallclock per env, single-process. The bottleneck is policy inference, not the sim — OpenVLA-7B at 4-bit is ~4 Hz on a 4090, ~6 Hz on an A100. So a 280-step episode takes ~70 sec on a 4090; a full 50-rollout × 40-task eval is ~40 hours. Plan accordingly.
If you only have an 8 GB consumer GPU: use `--load_in_4bit True` for OpenVLA, or switch to Octo-base (~93M params, runs on anything). Octo's per-suite numbers will be ~25–40% lower than OpenVLA's; that's expected.
Apple Silicon caveat: MuJoCo 3.x has arm64 wheels and works on macOS, but Robosuite's `mjpython`/GLFW path is finicky on M-series. The notebook uses off-screen MuJoCo rendering (no on-screen window), which works on macOS, but you need to install MuJoCo via `pip install mujoco==3.2.7` before Robosuite, and you need `MUJOCO_GL=glfw` (not `egl`) on Mac. See "Common pitfalls" #2.
Headless Linux servers (no display): set `MUJOCO_GL=egl`; the included `setup.sh` will install the EGL libs (`libegl1`, `libgles2`). Do not set `MUJOCO_GL=osmesa` unless you know you need it; software rendering is 5–10× slower.
Setup
```bash
# from the repo root
cd projects/12-libero-eval

# 1. Create venv + install pinned deps + clone & install LIBERO. Idempotent.
bash setup.sh

# 2. Activate
source .venv/bin/activate

# 3. Download LIBERO benchmark task descriptions + (optionally) demos.
#    Task BDDL files are pulled in by the LIBERO clone in step 1.
#    Demos are only needed if you want to retrain or replay teleop — for
#    *eval only*, you do NOT need to download them. The notebook will tell you.
python LIBERO/benchmark_scripts/download_libero_datasets.py --datasets libero_spatial  # optional
# alt: --use-huggingface

# 4. Open the notebook.
jupytext --to notebook notebook.py
jupyter lab notebook.ipynb
# OR: open notebook.py directly in VSCode and use the "Run Cell" gutter
```

The non-obvious detail: install MuJoCo and Robosuite before LIBERO. LIBERO's `setup.py` does not pin them, and the wrong order produces silent ABI mismatches at the first `env.reset()`. `setup.sh` orders them correctly.
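Before opening the notebook, a quick import-level check of the install is worth 30 seconds (this mirrors Step 1 of the notebook; the only assumption is that `setup.sh` put `torch`, `mujoco`, `robosuite`, and `libero` into the venv):

```python
# Run inside the activated .venv — verifies the GPU, MuJoCo, Robosuite, and LIBERO imports.
import os

import torch
import mujoco
import robosuite
import libero  # noqa: F401  (the import succeeding is the test)

print("CUDA available :", torch.cuda.is_available(), "| torch", torch.__version__)
print("MuJoCo         :", mujoco.__version__)
print("Robosuite      :", robosuite.__version__)
print("MUJOCO_GL      :", os.environ.get("MUJOCO_GL", "<unset>"))
```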
Steps
Each step maps to a labeled section in `notebook.py`. The notebook is jupytext percent-format — `# %%` cells are runnable, `# %% [markdown]` cells are narrative.
- Sanity check (`# %% Step 1`). Print `torch.cuda.is_available()`, the CUDA version, `mujoco.__version__`, and the `MUJOCO_GL` env var, and try `import robosuite, libero`. If any of these fail, fix them before proceeding — the eval loop will be 100× harder to debug if the install is half-broken. (A minimal version of this check appears at the end of the Setup section above.)
- Smoke test: render one task from each suite (`# %% Step 2`). Use `libero.libero.benchmark.get_benchmark_dict()` to enumerate the four suites; for each, instantiate task 0, call `env.reset()`, render an `agentview_image` and a `robot0_eye_in_hand_image`, and save side-by-side PNGs to `outputs/smoke/`. This proves your install works and tells you what the policy will see. (A minimal render sketch appears after this list.)
- Read & display the language instructions (`# %% Step 3`). For each task in each suite, print the BDDL goal predicate and the natural-language instruction (e.g. "pick up the alphabet soup and place it in the basket"). This is what gets fed to the VLA's text encoder. Eyeball them — many LIBERO instructions are ambiguous or robot-speak ("pick up the bowl" when there are three bowls). This matters when you read failure modes.
- Load the policy (`# %% Step 4`). Two paths (a loading sketch appears after this list):
  - (A) Local checkpoint from project 11: `AutoModelForVision2Seq.from_pretrained("/path/to/your/ckpt", torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)`. The notebook tries this first.
  - (B) Public checkpoint: falls back to `openvla/openvla-7b-finetuned-libero-spatial` (and the matching `-object`, `-goal`, `-10` variants — there is one fine-tuned checkpoint per suite for OpenVLA). Loads in 4-bit if `--load_in_4bit True`.
  Print the model size, parameter count, and the `unnorm_key` you'll use for action denormalisation (`"libero_spatial_no_noops"` for the spatial-tuned checkpoint, etc.). Action denormalisation is the most common silent bug — see "Common pitfalls" #4.
- Build the eval loop (`# %% Step 5`). One function: `eval_suite(suite_name, model, n_rollouts_per_task, episode_length)` that loops over the 10 tasks in the suite, runs `n_rollouts_per_task` episodes per task with different seeds, queries the policy at every step, calls `env.step(action)`, and counts `info["success"] == True`. Returns `(n_successes, n_total, per_task_breakdown, failure_seeds)`. (A sketch of this function appears after this list.)
- Run all four suites (`# %% Step 6`). Use the canonical episode-length caps (280, 280, 300, 520). Default N=10 rollouts per task (so n=100 per suite, ~30 min per suite on a 4090); flip a flag to bump to 50 if you have the compute. Tabulate the four numbers with Wilson 95% CIs.
- Drill-down: save one failure rollout per suite (`# %% Step 7`). Pick the seed of the first failed episode in each suite, replay it with frames captured at every step, and write `outputs/failures/{suite}_task{i}_seed{s}.mp4` (10 fps, 256×256). Annotate each video with the language instruction overlaid on frame 0. This is the highest-pedagogical-value artifact in the project.
- Compare against published numbers (`# %% Step 8`). The notebook has a markdown table prefilled with OpenVLA, OpenVLA-OFT, pi0, and pi0.5 LIBERO numbers from their respective papers (cited inline). You fill in your row and write a one-paragraph commentary: where you matched, where you didn't, and why.
- Read the LIBERO-PRO summary + reflect (`# %% Step 9`). The notebook has a 200-word inline summary of LIBERO-PRO (arXiv 2510.03827) — the paper that shows OpenVLA goes from 90%+ on standard LIBERO to ~0% under spatial-perturbation evaluation, with pi0.5 holding up slightly better. Write a 100-word reflection: which of your per-suite numbers do you believe most? least? Which suite, if you were buying this model for production, would you re-evaluate under perturbations before trusting?
- User TODO: baseline policies for grounding (`# %% Step 10`). Two empty cells (the eval-loop sketch after this list includes the random baseline as a usage example):
  - Random policy — sample `action ~ Uniform([-1,1]^7)` at every step. Run on `libero_spatial` only (it's the easiest suite). You should see ~0–5% success rate. If you see more, your "success" detector is too lenient.
  - Scripted move-up policy — at every step, output `[0, 0, 0.05, 0, 0, 0, 0]` (just lift z). Run on `libero_object`. Expected ~0%. If higher, BDDL goal predicates are firing on robot pose alone — investigate.

These two baselines catch ~80% of "I think my number is real" mistakes. Always run them before reporting headline numbers anywhere.
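A minimal sketch of the Step 2 smoke test. The env-construction pattern (`get_benchmark_dict`, `OffScreenRenderEnv`, `get_task_init_states`) follows LIBERO's own README example; the 180° frame rotation matches what OpenVLA's eval utilities do and may be unnecessary for your setup — treat it as an assumption to verify visually.

```python
import os

import imageio
import numpy as np
from libero.libero import benchmark, get_libero_path
from libero.libero.envs import OffScreenRenderEnv

os.makedirs("outputs/smoke", exist_ok=True)
benchmark_dict = benchmark.get_benchmark_dict()

for suite_name in ["libero_spatial", "libero_object", "libero_goal", "libero_10"]:
    suite = benchmark_dict[suite_name]()
    task = suite.get_task(0)  # task 0 of the suite
    bddl_file = os.path.join(get_libero_path("bddl_files"), task.problem_folder, task.bddl_file)

    env = OffScreenRenderEnv(bddl_file_name=bddl_file, camera_heights=256, camera_widths=256)
    env.seed(0)
    env.reset()
    env.set_init_state(suite.get_task_init_states(0)[0])  # canonical initial state for rollout 0
    for _ in range(5):  # a few no-op steps so objects settle and obs is populated
        obs, _, _, _ = env.step([0.0] * 7)

    # Raw renders typically come back upside down; [::-1, ::-1] is the 180° rotation used by
    # OpenVLA's LIBERO eval utilities. Drop it if your PNGs come out inverted.
    agent = obs["agentview_image"][::-1, ::-1]
    wrist = obs["robot0_eye_in_hand_image"][::-1, ::-1]
    imageio.imwrite(f"outputs/smoke/{suite_name}.png", np.concatenate([agent, wrist], axis=1))
    print(f"{suite_name}: {task.language}")
    env.close()
```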
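For Step 4, path (B), a hedged sketch of loading the public spatial checkpoint and wrapping it as a per-step policy. The `AutoProcessor` / `predict_action` / `unnorm_key` interface is the one documented on the OpenVLA model cards; the prompt template, bf16 dtype, and the wrapper's name and signature are assumptions to reconcile with your project-11 code.

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

CKPT = "openvla/openvla-7b-finetuned-libero-spatial"  # or your local project-11 checkpoint path
UNNORM_KEY = "libero_spatial_no_noops"                # must match the checkpoint (pitfall #4)

processor = AutoProcessor.from_pretrained(CKPT, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    CKPT, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, trust_remote_code=True
).to("cuda:0")
vla.eval()
print(f"{sum(p.numel() for p in vla.parameters()) / 1e9:.2f}B params | unnorm_key={UNNORM_KEY}")

def policy_fn(image_np, instruction):
    """image_np: HxWx3 uint8 agentview frame; returns a denormalised 7-D action."""
    prompt = f"In: What action should the robot take to {instruction.lower()}?\nOut:"
    inputs = processor(prompt, Image.fromarray(image_np)).to("cuda:0", dtype=torch.bfloat16)
    with torch.no_grad():
        action = vla.predict_action(**inputs, unnorm_key=UNNORM_KEY, do_sample=False)
    return action
```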
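And a sketch of the Step 5 loop. The function name and return shape come from the step description above; the env construction mirrors the smoke-test sketch. One assumption to flag: the raw LIBERO env signals task success through `done`, while the notebook's stated convention is `info["success"]` — the sketch accepts either. Breaking out on success is the any-step convention; drop the `break` for strict last-step success (pitfall #5).

```python
import os
from collections import defaultdict

import numpy as np
from libero.libero import benchmark, get_libero_path
from libero.libero.envs import OffScreenRenderEnv

benchmark_dict = benchmark.get_benchmark_dict()

def eval_suite(suite_name, policy_fn, n_rollouts_per_task=10, episode_length=280):
    """Roll out policy_fn on every task of a suite; count successes per task."""
    suite = benchmark_dict[suite_name]()
    per_task = defaultdict(lambda: [0, 0])   # task name -> [n_success, n_total]
    failure_seeds = []

    for task_id in range(suite.n_tasks):
        task = suite.get_task(task_id)
        bddl_file = os.path.join(get_libero_path("bddl_files"), task.problem_folder, task.bddl_file)
        init_states = suite.get_task_init_states(task_id)
        env = OffScreenRenderEnv(bddl_file_name=bddl_file, camera_heights=256, camera_widths=256)

        for seed in range(n_rollouts_per_task):
            env.seed(seed)
            env.reset()
            env.set_init_state(init_states[seed % len(init_states)])
            obs, _, done, info = env.step([0.0] * 7)   # settle one step so obs is populated
            for _ in range(episode_length):
                action = policy_fn(obs["agentview_image"][::-1, ::-1], task.language)
                obs, _, done, info = env.step(np.asarray(action, dtype=np.float64).tolist())
                if done or info.get("success", False):
                    break                               # any-step success; remove for last-step
            success = bool(done or info.get("success", False))
            per_task[task.name][0] += int(success)
            per_task[task.name][1] += 1
            if not success:
                failure_seeds.append((task_id, seed))
        env.close()

    n_success = sum(s for s, _ in per_task.values())
    n_total = sum(t for _, t in per_task.values())
    return n_success, n_total, dict(per_task), failure_seeds
```

Used with a random policy, this doubles as the Step 10 grounding check:

```python
# Random actions on libero_spatial should land at ~0–5% success.
random_policy = lambda image, instruction: np.random.uniform(-1.0, 1.0, size=7)
n_s, n_t, _, _ = eval_suite("libero_spatial", random_policy, n_rollouts_per_task=2)
print(n_s, "/", n_t)
```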
Done criterion
You are done when all five of these are true:
- `outputs/results.csv` exists with four rows (`libero_spatial`, `libero_object`, `libero_goal`, `libero_10`) and at minimum the columns `n_total, n_success, success_rate, ci_lo, ci_hi, episode_length_cap, n_rollouts_per_task` (a minimal writer sketch follows below).
- `outputs/smoke/` has 4 side-by-side PNGs, one per suite, showing `agentview` + `eye_in_hand` from task 0.
- `outputs/failures/` has at least 4 MP4 files (one per suite); each plays correctly and has the instruction text legible.
- The Step 8 results table in `notebook.py` is filled in with your numbers next to the published baselines, and the Step 9 LIBERO-PRO reflection paragraph is at least 100 words and names a specific suite that you'd distrust most under perturbations.
- At least one baseline (random or scripted) was run in Step 10 and produced ≤ 5% on its target suite. (If higher: stop, debug your success detector, and do not report headline numbers.)
If any of those five are missing, you're not done.
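For reference, a minimal writer for the `results.csv` schema above (plain pandas; `eval_suite`, `policy_fn`, and `wilson_ci` are the sketches from earlier sections, so treat the wiring as illustrative):

```python
import pandas as pd

caps = {"libero_spatial": 280, "libero_object": 280, "libero_goal": 300, "libero_10": 520}
rows = []
for suite_name, cap in caps.items():
    # Note: the public OpenVLA checkpoints are per-suite — reload the matching one here.
    n_success, n_total, per_task, failure_seeds = eval_suite(
        suite_name, policy_fn, n_rollouts_per_task=10, episode_length=cap
    )
    ci_lo, ci_hi = wilson_ci(n_success, n_total)
    rows.append({"suite": suite_name, "n_total": n_total, "n_success": n_success,
                 "success_rate": n_success / n_total, "ci_lo": ci_lo, "ci_hi": ci_hi,
                 "episode_length_cap": cap, "n_rollouts_per_task": 10})

pd.DataFrame(rows).to_csv("outputs/results.csv", index=False)
```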
Common pitfalls
1. MuJoCo licence confusion. MuJoCo was proprietary (Roboti LLC) until Oct 2021, when DeepMind acquired it; it was open-sourced under Apache 2.0 in May 2022 (MuJoCo 2.1.0+). The MIT-licensed `mujoco-py` Python binding (OpenAI, 2016–2021) is deprecated; do not use it. Use the official `mujoco` package (DeepMind), version 3.2.x. Robosuite 1.4+ uses the new bindings. If you see `import mujoco_py` anywhere, it's stale tutorial code from 2020 — replace it with `import mujoco`.
2. `MUJOCO_GL` environment variable. This selects the OpenGL backend. On a headless Linux server: `export MUJOCO_GL=egl` (fastest; needs `libegl1` / `libgles2` / `libglvnd-dev`). On a Linux box with a display: `MUJOCO_GL=glfw` (the default; will pop up a window unless you also set `MUJOCO_GL_OFFSCREEN=1`). On macOS: `MUJOCO_GL=glfw` only — `egl` does not work. On WSL2: `MUJOCO_GL=osmesa` is the safe fallback but ~5–10× slower than EGL. Symptom of a wrong setting: `RuntimeError: Could not initialize OpenGL context` at the first `env.reset()`.
3. Eval-protocol differences across papers. OpenVLA (2024) reports 50 rollouts × 10 tasks × 4 suites = 2000 trials, with episode caps of 280/280/300/520. Octo (2024) reports 25 rollouts × 10 tasks. pi0 (2024) reports 10 rollouts × 10 tasks but with a different, looser success detector. pi0.5 (2025) goes back to OpenVLA's protocol. Do not compare numbers across papers without checking the protocol. The notebook's Step 8 table shows the protocol next to each number for exactly this reason.
4. Action denormalisation / `unnorm_key` mismatch. OpenVLA outputs normalised delta-EE actions in `[-1, 1]`. To produce real-scale actions you must denormalise using the per-dataset stats stored in `model.norm_stats[unnorm_key]`. If you set `unnorm_key="bridge_orig"` (your fine-tune source) but the LIBERO env expects LIBERO action stats, the gripper action sign will be flipped and the arm motion will be ~3× too small. Symptom: the arm waves around but never grasps; success rate is 0–2% across the board. Fix: use the matching `libero_*_no_noops` unnorm_key, or refit norm stats to your fine-tune data and load them into the model card.
5. Episode-length cutoffs and "last-step success". LIBERO's success predicate checks the goal predicate at the timeout step; some tasks are intermittently satisfiable (e.g. an object briefly resting on a target plate). If you cap at 280 steps and the policy succeeds at step 251 but the object slides off at step 279, the task is reported as a failure. This matches OpenVLA's protocol but disagrees with some earlier papers that took "any-step success". Be explicit about which one you use; the notebook uses last-step.
6. Headless server with no GPU OpenGL → MuJoCo silently falls back to software rendering. If `MUJOCO_GL=egl` but `libEGL.so.1` isn't installed, MuJoCo may fall back to software OSMesa without error, and the sim will run at ~30 fps instead of ~200 fps. Diagnose: time a single `env.reset(); for _ in range(280): env.step(zero_action)` — it should take <2 sec on EGL; if it takes 10+ sec, OpenGL fell back (a timing sketch follows this list). Re-run `setup.sh` and check `dpkg -l | grep -E 'libegl|libgles'`.
7. OpenVLA fp16 OOM on 16 GB GPUs. OpenVLA-7B in fp16 is ~14.5 GB just for the weights, and inference activations push it past 16 GB. Either drop to bf16 + gradient_checkpointing (no help for inference), or set `load_in_4bit=True` (BitsAndBytes; drops to ~7 GB). The quality difference vs fp16 on LIBERO is <2 points in our experience.
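A quick timing check for pitfall #6 (env construction as in the Steps sketches; the <2 s vs 10+ s thresholds are the ones quoted above):

```python
import os
import time

from libero.libero import benchmark, get_libero_path
from libero.libero.envs import OffScreenRenderEnv

suite = benchmark.get_benchmark_dict()["libero_spatial"]()
task = suite.get_task(0)
bddl_file = os.path.join(get_libero_path("bddl_files"), task.problem_folder, task.bddl_file)
env = OffScreenRenderEnv(bddl_file_name=bddl_file, camera_heights=256, camera_widths=256)

env.reset()
t0 = time.time()
for _ in range(280):
    env.step([0.0] * 7)          # zero action; rendering still happens every step
elapsed = time.time() - t0
env.close()

print(f"MUJOCO_GL={os.environ.get('MUJOCO_GL', '<unset>')}  280 steps in {elapsed:.1f}s")
# <~2 s on a GPU-backed EGL context; 10+ s usually means a silent software fallback.
```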
Further reading
- LIBERO paper — Liu et al., 2023 — https://arxiv.org/abs/2306.03310
- LIBERO project page (task suite descriptions, dataset downloads) — https://libero-project.github.io/main.html
- LIBERO GitHub repo — https://github.com/Lifelong-Robot-Learning/LIBERO
- LIBERO-PRO — Zhang et al., 2025 — https://arxiv.org/abs/2510.03827 (the saturation critique)
- LIBERO-Plus (concurrent robustness benchmark) — https://arxiv.org/abs/2510.13626
- OpenVLA paper — Kim et al., 2024 — https://arxiv.org/abs/2406.09246
- OpenVLA LIBERO eval scripts — https://github.com/openvla/openvla/tree/main/experiments/robot/libero
- OpenVLA-OFT (faster fine-tuning + improved LIBERO numbers) — https://arxiv.org/abs/2502.19645 — https://github.com/moojink/openvla-oft
- Octo (small VLA baseline) — Octo Model Team, 2024 — https://arxiv.org/abs/2405.12213
- pi0 / pi0.5 — Black et al., Physical Intelligence — https://www.physicalintelligence.company/
- Robosuite (the underlying simulator) — https://github.com/ARISE-Initiative/robosuite — docs https://robosuite.ai
- MuJoCo 3.x docs — https://mujoco.readthedocs.io
- CALVIN (the language-conditioned competitor benchmark; smaller, more language diversity) — Mees et al., 2022 — https://arxiv.org/abs/2112.03227
- LeRobot's LIBERO integration (alternative install path via HuggingFace) — https://huggingface.co/docs/lerobot/libero
- Applied Intuition Data Explorer (the production analog of "find me the failure rollouts") — https://www.appliedintuition.com/products/data-explorer
Files in this project
- README.md
- notebook.py
- requirements.txt
- setup.sh
The notebook (`notebook.py`) is in jupytext percent format — open it in VS Code or convert with `jupytext --to notebook`.