Physical AI
Project 11 · Phase E · Robotics adjacency · Hardware: A100 40GB / 4090
Loops: COLLECT · LABEL

06 — OpenVLA + BridgeData V2 LoRA fine-tune (with Octo fallback)

One-line goal: Stream BridgeData V2 through LeRobotDataset v3.0, LoRA-fine-tune OpenVLA-7B for 1–2k steps on a single A100, evaluate language-conditioned task success on a held-out Bridge subset, and write up where the policy generalizes vs. where it falls apart. If you don't have A100-class hardware, run the same loop with Octo-Small (27M) as the fallback.


Goal

By the end of this project you will have:

  1. A working LeRobotDataset v3.0 streaming pipeline against IPEC-COMMUNITY/bridge_orig_lerobot (the community LeRobot conversion of BridgeData V2 from Berkeley's RAIL lab — the same Bridge slice that lives inside Open X-Embodiment). You will iterate batches without ever materializing the dataset to disk.
  2. A loaded OpenVLA-7B checkpoint (openvla/openvla-7b) at bf16, with a LoRA adapter (rank 32 by default, target_modules="all-linear") wrapped via PEFT — exactly the recipe in the upstream vla-scripts/finetune.py.
  3. A 1–2k-step LoRA training run on Bridge with logged loss curves (W&B if you logged in, otherwise local matplotlib), saved adapter weights, and a small ablation hook so you can re-run with a different LoRA rank.
  4. Numbers. Language-conditioned task success rate on at least 3 held-out Bridge tasks, measured by open-loop action-prediction error against the human demonstration plus a replay-style closed-loop proxy (full closed-loop evaluation on hardware is out of scope without a WidowX), along with a per-task breakdown of where the LoRA helped, where it hurt, and where the base OpenVLA was already saturated.
  5. A short generalization probe: take an unseen-language phrasing, an unseen object color, and an unseen camera angle from the Bridge held-out set, run the policy, and write down what changes in the action distribution.
  6. A reflection memo (5–10 bullets) on failure modes — the kind of paragraph an interviewer at a Data Intelligence team for robotics will pull on.

The whole loop fits in 3–5 hours of A100 time after the cache is warm. On the Octo fallback path it fits in ~90 minutes on a single 4070.


Loops touched

In the four-loop physical-AI data flywheel (collect → label → train → evaluate, then curate failures back into collect), this project hits COLLECT plus LABEL & TRAIN:

  • COLLECT — you ingest BridgeData V2, the most-cited single embodiment inside Open X-Embodiment. You learn what's in it: the camera rig, the WidowX action schema, the language label distribution, the episode-length statistics. You feel the v3.0 streaming format from the inside, which is essential because v3.0's whole pitch is "datasets too large to materialize."
  • LABEL & TRAIN — Bridge already ships with natural-language task labels (one string per episode, mapped through meta/tasks.jsonl). You pair those labels with frames and feed them to a vision-language-action (VLA) model. You learn the action-tokenization trick, the LoRA target-module choice, and the action de-normalization gotcha that bites everyone the first time.

You do not run a real WidowX in this project. Closed-loop evaluation on hardware lives in project 12 (LIBERO), and the data-engine framing for AV lives in project 17 (BDD100K).


Why this matters for AI Data Intelligence

Applied Intuition's Data Intelligence team has built its credibility on AV/ADAS data — Data Explorer, Axion, the curation tooling that pulls 200 weird clips out of 50 hours of fleet logs. Your career trajectory is AV-centric, and that is the right thing to lead with in interviews. So why a robotics-manipulation project?

Three reasons.

One — the curation primitives port one-to-one. A Bridge episode is a (t, multi-camera RGB, robot proprio, language goal, action) tuple. An AV log is a (t, multi-camera RGB, ego state, route goal, control) tuple. Same shape. The data-engine motions you'd build to mine "all the merges where the truck cut me off" are the same motions you'd build to mine "all the Bridge episodes where the gripper missed because the cloth was bunched up." Sensor multiplicity, time alignment, semantic search, slice-based eval — every primitive transfers. Owning that vocabulary is what lets you walk into a robotics customer pitch without sounding like you only speak car.

Two — Open X-Embodiment is the closest the field has to ImageNet for robot learning. Bridge V2 is one of the embodiments inside it. RT-X (Padalkar et al., 2023) showed that training across embodiments transfers; OpenVLA (Kim et al., 2024) showed that a 7B VLM backbone with action tokens beats Octo and RT-2-X on the same mixture. The whole point of cross-embodiment training is that the data engine — what you mine, how you mix robots, how you weight tasks — is the model. If Applied Intuition wants to be the data infra for embodied AI more broadly, the team that owns the data engine wins. Knowing the Open-X mechanics from the inside is how you talk credibly about that opportunity.

Three — VLAs are how AV and robotics converge. A VLA takes pixels + language and emits a motor program, expressed as discretized action tokens. End-to-end driving stacks (Wayve LINGO, Tesla v12, Comma's openpilot) are converging on the same recipe: pixels + language + a tokenized motor head. The pretraining data mixture for those systems is the data-engine problem; the curation, balancing, and slice-based eval primitives are exactly the team's wheelhouse. Going through the VLA loop end-to-end once means you can read every paper in this corner of the field.

You should not pretend to be a manipulation-policy researcher after this. You should be able to say, with receipts, that you've trained an OpenVLA LoRA on Bridge, you know what unnorm_key does, you've watched the loss curve, you know why the LoRA generalizes to one held-out object family and breaks on another, and you can map every step of that loop back to a corresponding AV data-engine motion.


Prerequisites

  • Python 3.10 or 3.11. lerobot 0.4.x supports 3.10 – 3.12, but bitsandbytes and flash-attn wheels are easiest on 3.10/3.11.
  • OS: Linux (Ubuntu 22.04 or similar) for the OpenVLA path. macOS works for the Octo fallback in JAX-CPU mode but is painfully slow.
  • Disk: ~30 GB for caches (HF model files + a partial Bridge stream). v3.0 streaming means you do not need the full ~400 GB Bridge dump on disk.
  • RAM: 32 GB minimum on the OpenVLA path, 16 GB is fine for Octo-Small.
  • Network: the first epoch of streaming will pull on the order of 10–20 GB from the Hub. A flaky connection will hurt — see the pitfalls section.
  • Background: comfortable PyTorch, you've fine-tuned a transformer at least once, you've at least skimmed the OpenVLA paper.

Hardware

Primary path (recommended): single A100 40 GB or H100 80 GB. OpenVLA-7B at bf16 with LoRA rank 32 over all-linear modules takes ~28 GB resident, which fits A100-40 with batch size 4–6 and gradient accumulation. On 80 GB you can push batch 12. If you only have a 24 GB card (3090 / 4090), enable 4-bit base weights via bitsandbytes (load_in_4bit=True); that drops resident VRAM to ~13 GB at the cost of ~25 % throughput.
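
A minimal sketch of the 4-bit variant for 24 GB cards, assuming bitsandbytes is installed (the resident-VRAM and throughput numbers above will vary with driver and CUDA stack):

import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,   # keep activations and LoRA math in bf16
)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)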

LeRobotDataset v3.0 streaming reduces the disk picture dramatically: the full Bridge corpus is ~1.9 M frames across ~53 k episodes, but you only ever pull the parquet/MP4 chunks you actually iterate over. Expect ~10–20 GB of HF cache after a 1k-step run, not 400 GB.

Fallback path: any GPU with >=10 GB VRAM (RTX 4070, 3080, 4080, A4000) for Octo-Small. Octo-Small is 27M parameters and trains end-to-end (no LoRA needed). On a 4070 you can run the same Bridge slice at full fine-tune in <90 minutes. Octo-Base (93M) also fits but is slower; the brief asks for the 27M variant.

The notebook detects available VRAM and prints the recommended path. You can override with FORCE_PATH=openvla or FORCE_PATH=octo env var.
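
A sketch of what that path-selection logic boils down to (the notebook's actual heuristic may differ; the 22 GB threshold comes from step 1 of the walkthrough below):

import os
import torch

def pick_path():
    # honor an explicit override first
    forced = os.environ.get("FORCE_PATH")
    if forced in ("openvla", "octo"):
        return forced
    if not torch.cuda.is_available():
        return "octo"
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    # >= 22 GB: OpenVLA-7B (4-bit base weights on 24 GB cards); otherwise Octo-Small
    return "openvla" if vram_gb >= 22 else "octo"

print("Recommended path:", pick_path())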


Setup

# from the repo root
cd projects/11-openvla-bridge-finetune
 
# 1. Create venv + install pinned deps. Idempotent. ~10 minutes the first time.
bash setup.sh
 
# 2. Activate
source .venv/bin/activate
 
# 3. HuggingFace login. Bridge LeRobot conversion is public but rate-limited
#    anonymously; OpenVLA depends on LLaMA-2 weights bundled into the checkpoint.
#    Get a read token at https://huggingface.co/settings/tokens
huggingface-cli login
 
# 4. (Optional) W&B for nicer loss curves
wandb login
 
# 5. (Fallback only) Install Octo. JAX needs the matching CUDA wheels.
#    Skip this if you're going OpenVLA-only.
pip install "jax[cuda12_pip]==0.4.20" \
    -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
pip install flax==0.7.5 distrax==0.1.5
pip install "git+https://github.com/kvablack/dlimp.git"
pip install "git+https://github.com/octo-models/octo.git@main"
 
# 6. Run the notebook
jupytext --to notebook notebook.py        # convert once
jupyter lab notebook.ipynb
# OR open notebook.py in VSCode and use the "Run Cell" gutter

The verified install pieces:

  • lerobot 0.4.x ships LeRobotDataset (cached) and StreamingLeRobotDataset (Hub-direct). The streaming class is in lerobot.datasets.streaming_dataset.
  • transformers + peft + accelerate is the OpenVLA upstream recipe (vla-scripts/finetune.py); LoRA rank 32, target_modules="all-linear", bf16, no quantization unless you opt in.
  • bitsandbytes only needed for the 4-bit fallback on 24 GB cards.
  • flash-attn is optional; OpenVLA falls back cleanly to PyTorch SDPA if you skip it.

Steps

The notebook walks through the following 12 cells. Each cell is self-contained and idempotent.

  1. Hardware sanity check. torch.cuda.is_available(), total VRAM, free VRAM, dtype support. Decide path: OpenVLA-7B if VRAM ≥ 22 GB, otherwise Octo-Small. Override via env var.
  2. Install verification. Import lerobot, transformers, peft, accelerate; print versions; assert lerobot >= 0.4.0 so the v3.0 dataset classes exist.
  3. Stream BridgeData V2 via LeRobotDataset v3.0. Build a StreamingLeRobotDataset("IPEC-COMMUNITY/bridge_orig_lerobot"), iterate the first 8 frames, print the schema. Confirm observation.images.image_0, action[7], observation.state[7|8], plus the task string resolved from meta/tasks.jsonl. A minimal streaming sketch appears after this list.
  4. Visualize a few episodes. Pull 3 random episodes, plot the wrist + side-camera frames as a grid, plot the 7-DoF action stream over time, print the language goal under each. This is your sanity check that streaming + decoding actually works.
  5. (Markdown — important) Action tokenization, the trick that makes VLAs work. OpenVLA discretizes each of the 7 action dimensions into 256 bins, then re-uses the LLaMA tokenizer's least-frequent text tokens as action tokens. Inference is "predict the next 7 tokens, then de-quantize." That's why fine-tuning a VLA is literally an LM cross-entropy loss on a sequence of action tokens. Read this cell carefully. A toy discretize/de-quantize sketch appears after this list.
  6. Load OpenVLA-7B (or Octo-Small). OpenVLA path: AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True) + AutoModelForVision2Seq.from_pretrained(..., torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, trust_remote_code=True). Octo path: OctoModel.load_pretrained("hf://rail-berkeley/octo-small-1.5").
  7. Wrap with LoRA (OpenVLA path). LoraConfig(r=32, lora_alpha=16, lora_dropout=0.0, target_modules="all-linear", bias="none") + get_peft_model. Print trainable param count (should be ~110M of 7B = ~1.5 %). See the PEFT sketch after this list.
  8. Build the training collate fn. Take a streamed sample, format the prompt as "In: What action should the robot take to {task}?\nOut:", run processor(prompt, image), pack actions into the label tokens, return the dict expected by HF Trainer. This is where action-de-normalization with unnorm_key="bridge_orig" enters.
  9. Train 1–2k steps. Plain Trainer or a hand-rolled loop on accelerate; either is fine. Defaults: bs=4, grad-accum=2, lr=5e-4 (LoRA), warmup 50 steps, cosine schedule. Log every 10 steps, save every 500. Stop early at 1k if loss plateaus; push to 2k if it's still descending.
  10. Evaluate on held-out Bridge tasks. Define 3+ tasks held out at the task-string level (so the model has never seen this exact instruction). Open-loop metric: mean L1 action error vs. demonstration over each held-out episode. Closed-loop-ish metric: replay the trajectory's initial frame, predict 16 steps, measure end-effector pose drift. Report success rate per task as a table. An open-loop L1 sketch appears after this list.
  11. Generalization probe. Pick one held-out task. Render the rollout video. Try perturbing the language ("pick up the cloth" → "grasp the towel"). Try a held-out camera angle. Write down what changes in 5–10 bullets. This is the artifact you take to interviews.
  12. User TODO. Two ablations the notebook leaves to you: (a) re-run with LoRA rank ∈ {8, 16, 64} and plot success vs. rank, (b) construct a held-out split that holds out objects instead of task strings and re-measure. Both fit in another A100-hour.
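
The following sketches flesh out a few of the cells above. First, step 3: iterating Bridge frames straight from the Hub. Field names follow the schema listed in step 3; confirm the exact keys against what the notebook actually prints.

# Sketch for step 3 — stream a handful of Bridge frames without materializing the dataset.
from itertools import islice
from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset

ds = StreamingLeRobotDataset("IPEC-COMMUNITY/bridge_orig_lerobot")
for frame in islice(iter(ds), 8):
    print(
        frame["task"],                              # language goal resolved from meta/tasks.jsonl
        frame["observation.images.image_0"].shape,  # RGB frame tensor
        frame["action"].shape,                      # 7-DoF action
    )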
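Step 5 in code form: a toy, numpy-only version of the 256-bin discretize / de-quantize round trip. The real OpenVLA action tokenizer maps these bin indices onto the least-used LLaMA tokens and uses per-dataset stats from the checkpoint rather than the illustrative [-1, 1] range assumed here.

import numpy as np

N_BINS = 256
bin_edges = np.linspace(-1.0, 1.0, N_BINS + 1)        # illustrative normalized range
bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])

def discretize(action):
    # map each normalized action dimension to a bin index in [0, 255]
    return np.digitize(action, bin_edges[1:-1])

def dequantize(bins):
    # map predicted bin indices back to continuous normalized actions
    return bin_centers[bins]

a = np.array([0.02, -0.4, 0.1, 0.0, 0.0, 0.3, 1.0])   # toy 7-DoF action
print(discretize(a))
print(dequantize(discretize(a)))

Fine-tuning is then just LM cross-entropy on the 7 bin-index tokens per frame.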
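Step 7 is a few lines of PEFT. A minimal sketch, assuming vla is the OpenVLA model loaded in step 6:

from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules="all-linear",   # vision encoder + projector + LLM, per the upstream recipe
    bias="none",
)
vla = get_peft_model(vla, lora_cfg)
vla.print_trainable_parameters()   # expect on the order of 1-2 % of the 7B parameters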
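Finally, a sketch of the open-loop half of step 10. predict_action here is a hypothetical helper that wraps the processor, the model's prediction call, and de-normalization with unnorm_key="bridge_orig"; episode is a list of frame dicts as yielded by the dataset.

import numpy as np

def open_loop_l1(episode, predict_action):
    # mean L1 error between predicted and demonstrated 7-DoF actions over one episode
    errors = []
    for frame in episode:
        pred = predict_action(frame["observation.images.image_0"], frame["task"])
        errors.append(np.abs(np.asarray(pred) - np.asarray(frame["action"])).mean())
    return float(np.mean(errors))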

Done criterion

You're done when all of the following are true:

  1. The streaming dataloader produces at least 1k batches without crashing or repeated-sample errors.
  2. Training loss has descended monotonically (with smoothing) over the first 500 steps. A flat curve means the action-token labels are not aligned — go back to step 8.
  3. You have evaluated at least 3 held-out Bridge tasks and have a numeric per-task success rate table. Format: task_name | n_episodes | mean_l1_action_error | replay_pose_drift_cm | qualitative_success_rate.
  4. The generalization-probe section of your notebook contains at least 5 bullets distinguishing where the policy generalizes (e.g. across small color variation) from where it falls apart (e.g. unseen object category, novel camera pose).
  5. You can answer in one paragraph: "Why didn't 1k steps of LoRA on Bridge produce a state-of-the-art policy?" — the honest answer is that the base OpenVLA was already trained on Bridge, your LoRA was tiny, your batch was small, and your held-out split overlapped the training distribution more than you'd like. The point is to feel the loop, not publish.

Common pitfalls

Five pitfalls, plus a bonus one that has bitten upstream OpenVLA users.

  1. Action-de-normalization key mismatch. OpenVLA stores per-dataset action stats inside the model checkpoint. If you forget to pass unnorm_key="bridge_orig" at inference, you'll get actions normalized against the first dataset key alphabetically and the gripper will do nothing useful. Always pass the key explicitly. See the inference sketch after this list.
  2. LeRobotDataset version mismatch. IPEC-COMMUNITY/bridge_orig_lerobot is currently published as codebase_version: v2.0. lerobot 0.4.x can read v2.0 and v3.0 with the same LeRobotDataset(...) constructor, but StreamingLeRobotDataset(...) requires v3.0. If streaming throws "no v3 metadata," fall back to the cached LeRobotDataset(...) path; the notebook handles this automatically. If you want true streaming, convert with python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 --repo-id=IPEC-COMMUNITY/bridge_orig_lerobot --push-to-hub to your own user, or wait for the upstream conversion.
  3. LoRA target modules. OpenVLA recommends target_modules="all-linear", which catches the vision encoder, the projector, and the LLM. If you instead pass ["q_proj", "v_proj"] (the LLaMA default you might paste from a chat-LM tutorial), you'll only LoRA-fy the LLM and miss the visual adapter — your loss will still descend but on a much narrower subspace.
  4. Prompt template gotcha. OpenVLA was trained with the exact string "In: What action should the robot take to {<INSTRUCTION>}?\nOut:". Drop the leading "In: ", change the punctuation, or omit "\nOut:" and the action-token head goes off-distribution. Don't get clever. Copy the string verbatim from the model card.
  5. Embodiment-specific tokens. OpenVLA's action-token vocabulary is fixed at 256 bins per dim; the unnormalization stats are per dataset key. If you fine-tune on Bridge but accidentally pull stats from fractal20220817_data (Google RT-1) you'll quantize Bridge's much-smaller action ranges into 2 bins out of 256 and the model will collapse. Inspect vla.norm_stats.keys() and pick "bridge_orig" deliberately.
  6. (Bonus) Streaming-mode shuffle is an iter-stream shuffle, not a global shuffle. StreamingLeRobotDataset shuffles within a sliding buffer, which means early-in-shard episodes from the same robot setup tend to cluster. For a 1k-step training run that's mostly fine; for a careful held-out split, materialize the held-out episodes via cached LeRobotDataset(...) and exclude their task_index values from the streamed train iterator. A split sketch follows this list.
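
Two sketches to make pitfalls 1, 4, and 5 concrete. First, inference with the de-normalization key passed explicitly and the norm-stats keys inspected up front; this mirrors the model-card snippet, so double-check the exact call signature against the card before relying on it.

import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")

print(sorted(vla.norm_stats.keys()))                   # pitfall 5: confirm "bridge_orig" is present

image = Image.open("frame.png")                        # any Bridge-style RGB frame
prompt = "In: What action should the robot take to pick up the cloth?\nOut:"  # pitfall 4: verbatim template
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)  # pitfall 1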
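Second, the held-out-split motion from pitfall 6: materialize the evaluation episodes through the cached path and filter the streamed training iterator by task_index. The episode indices below are placeholders, and the import paths and constructor kwargs should be checked against the lerobot docs for your installed version.

from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset

REPO = "IPEC-COMMUNITY/bridge_orig_lerobot"

# Cached, exact copy of the held-out episodes (indices are placeholders).
held_out = LeRobotDataset(REPO, episodes=[12, 345, 678])
held_out_tasks = {int(held_out[i]["task_index"]) for i in range(len(held_out))}

# Streamed training iterator with the held-out tasks excluded.
train_stream = StreamingLeRobotDataset(REPO)
train_iter = (f for f in train_stream if int(f["task_index"]) not in held_out_tasks)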

Further reading

  • Open X-Embodiment / RT-X. Open X-Embodiment: Robotic Learning Datasets and RT-X Models. Padalkar et al., 2023. arXiv:2310.08864. The cross-embodiment dataset and the demonstration that scaling across robots transfers.
  • OpenVLA paper. OpenVLA: An Open-Source Vision-Language-Action Model. Kim, Pertsch, Karamcheti, Xiao, Balakrishna, et al., 2024. arXiv:2406.09246. The 7B VLA built on Prismatic + LLaMA-2 with discretized action tokens. Read sections 3 (architecture) and 5 (LoRA fine-tuning).
  • Octo paper. Octo: An Open-Source Generalist Robot Policy. Octo Model Team, 2024. arXiv:2405.12213. The transformer-with-diffusion-action-head fallback model.
  • BridgeData V2. BridgeData V2: A Dataset for Robot Learning at Scale. Walke, Black, et al., 2023. arXiv:2308.12952. The 53k-episode WidowX manipulation dataset on which you fine-tune.
  • LeRobot v3.0 announcement. LeRobotDataset v3.0: Bringing large-scale datasets to lerobot. HuggingFace blog, August 2025. https://huggingface.co/blog/lerobot-datasets-v3
  • LeRobot v3.0 docs. https://huggingface.co/docs/lerobot/lerobot-dataset-v3 — full reference for LeRobotDataset and StreamingLeRobotDataset.
  • OpenVLA model card. https://huggingface.co/openvla/openvla-7b — canonical inference snippet, unnorm_key table, prompt template.
  • OpenVLA fine-tune script. vla-scripts/finetune.py in https://github.com/openvla/openvla — the recipe this project mirrors.

Project layout

11-openvla-bridge-finetune/
├── README.md            # this file
├── requirements.txt     # pinned deps
├── setup.sh             # idempotent venv + install
├── notebook.py          # jupytext percent script — the whole walkthrough
├── data/                # gitignored; HF cache lives here if HF_HOME is set
└── outputs/             # gitignored; checkpoints, eval json, loss plots
    ├── checkpoints/
    ├── eval/
    └── logs/

Files in this project

  • README.md
  • notebook.py
  • requirements.txt
  • setup.sh

Notebook (notebook.py) is in jupytext percent format — open in VS Code or convert with jupytext --to notebook.