Project 15 · Phase F · Behavior, sim agents, closed-loop · Hardware: Workstation 16 GB+ · Loop: EVAL

Project 15 — Bench2Drive closed-loop evaluation

A hands-on extension of project 07. Where project 07 ran a single OpenSCENARIO 1.0 cut-in scenario through CARLA's rule-based Traffic Manager autopilot, this project runs the Bench2Drive (NeurIPS 2024) benchmark — 220 curated routes spanning 44 interactive scenarios × 23 weathers × 12 towns — through a learned planner (UniAD-base, the cleanest baseline Bench2DriveZoo ships). You will install the Bench2Drive fork of CARLA Leaderboard 2.0, drop in the UniAD checkpoint, run a 30–50 route subset on your workstation GPU, compute Driving Score per route, and break the result down by Bench2Drive's five disentangled abilities (Merging, Overtaking, Emergency Brake, Give Way, Traffic Sign).

By the end you should be able to answer, in your own words: "why does the same planner often rank #1 on open-loop log-replay metrics and #4 on closed-loop driving score, and which number does an OEM customer actually care about?"


Goal

  1. Install Bench2Drive on top of the CARLA 0.9.15 server you already have from project 07.
  2. Apply the Bench2Drive patches to CARLA Leaderboard 2.0 and scenario_runner (these are not the upstream tags — Bench2Drive ships its own fork with a curated route set).
  3. Download the UniAD-base Bench2Drive checkpoint (~1.4 GB) from the Bench2DriveZoo release.
  4. Run a 5-route smoke test, then a 30–50 route subset (the full 220 takes 6–10 hours on a single 3090/4090 — your call whether to commit overnight).
  5. Compute and report:
    • Driving Score per route = Route Completion × Infraction Penalty.
    • Per-ability breakdown table: Success Rate by ability across the 5 ability buckets.
    • Efficiency and Comfort as secondary metrics.
  6. Identify the worst-performing ability, replay one failure rollout, and write a 1-page reflection on closed-loop vs open-loop evaluation — quoting the specific compounding-error argument that made nuPlan's open-loop leaderboard a cautionary tale in 2023–24.

The outcome artifacts you are aiming for are outputs/bench2drive_subset/results.json (per-route scores), outputs/bench2drive_subset/ability_breakdown.csv, and a single PNG visualising the worst rollout. The learning artifact is the reflection.
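
To make goal 5's arithmetic concrete, here is a minimal sketch of the Driving Score computation, assuming the multiplicative penalty factors quoted in step 7 of the Steps section (red light 0.7, vehicle collision 0.6). The authoritative factor table lives in the Bench2Drive leaderboard fork; treat these values as illustrative.

# Minimal sketch of the Driving Score arithmetic (goal 5 / step 7).
# Penalty factors are illustrative; take the authoritative table from the
# Bench2Drive leaderboard fork.
from typing import List

PENALTY = {
    "red_light": 0.7,
    "vehicle_collision": 0.6,
}

def driving_score(route_completion_pct: float, infractions: List[str]) -> float:
    """Driving Score = Route Completion % x the product of infraction penalties."""
    score = route_completion_pct
    for infraction in infractions:
        score *= PENALTY[infraction]
    return score

# 90% completion with one red-light infraction: 90 x 0.7 = 63.0
print(driving_score(90.0, ["red_light"]))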

Loops touched

This project is firmly in the EVAL loop of docs/04-four-loops.md. Project 07 set up the harness; project 15 puts a real learned policy under that harness and gets a number out the other end.

It also touches COLLECT indirectly: the Bench2Drive training set (10 000 clips, 2 M frames, generated by their Think2Drive RL expert) is one of the cleanest examples in 2024–26 of synthetic-data-as-the-training-set, which is exactly the pattern Applied Intuition's customers use to bootstrap planner training before any real-world miles. You won't train on it in this project, but you will reuse the checkpoint that was trained on it.

Why this matters for AI Data Intelligence

Closed-loop evaluation is the load-bearing claim of any 2026 V&V pitch — for OEM, robotaxi, and trucking customers alike. The reason is straightforward and the field has converged on it: open-loop metrics (ADE, FDE, log-replay collision rate, even nuPlan's old open-loop score) systematically underestimate failure modes, because they evaluate a planner on the expert's state distribution rather than the planner's own state distribution. The two distributions diverge as soon as the planner makes a non-trivial decision.

This is the same covariate-shift argument that motivated DAgger in 2011, but it has been re-discovered every two years since:

  • 2018–2022 era (open-loop dominant). ADE / FDE on Argoverse, nuScenes, Waymo Open Motion. Easy to leaderboard, easy to publish, and — as it turned out — easy to game with history-only baselines (e.g., the famous result that a constant velocity extrapolator beat several SOTA Argoverse submissions).
  • 2022–2023 reckoning. nuPlan launched as a closed-loop benchmark, but the open-loop score on its leaderboard remained the headline number. Several papers (most influentially Dauner et al.'s "Parting with Misconceptions about Learning-based Planning" in CoRL 2023, and Waymo's Beyond Behavior Cloning survey) showed that open-loop and closed-loop nuPlan rankings do not correlate, and that the simplest possible IDM-rule-based baseline beat most learned planners under closed-loop scoring. The Waymo argument: closed-loop compounds errors over thousands of timesteps; open-loop measures the planner against a single-step expert label that the planner never actually sees in deployment.
  • 2024–2025 standardization. Bench2Drive (NeurIPS 2024 D&B) and NAVSIM v2 (CVPR 2025 Autonomous Grand Challenge) are the two community responses. Bench2Drive is CARLA-based, fully reactive, sensor-in-the-loop. NAVSIM v2 is nuPlan-based, partially reactive, BEV-abstraction-in-the-loop (with synthetic novel-view rendering as of 2025). Together they bracket the trade-off: Bench2Drive gives you a true closed-loop signal at the cost of sim-to-real gap; NAVSIM v2 gives you real-sensor inputs at the cost of a shorter, partially-reactive simulation horizon.

Applied Intuition's Validation Toolset is the productized version of exactly this workflow — a closed-loop simulator (Simian) wired to a scenario library (Strada / OpenSCENARIO 2.0) with a coverage-aware test orchestrator. Bench2Drive is the cheapest open analog. If you walk into a Data Intelligence interview having actually run a learned planner through 220 closed-loop routes and read off a per-ability breakdown, you have done concretely what the Validation Toolset PMs talk about abstractly.

The intellectual lineage to keep in your head:

  • CARLA Leaderboard 1.0 (2019–2020). First serious closed-loop CARLA benchmark. 76 routes, single ability bucket, scoring was effectively binary "did you crash". Hard to disentangle why a method failed.
  • CARLA Leaderboard 2.0 (2023). 90 long routes (~10 km each), 38 scenario types, much richer — but the long routes mean a single early failure tanks the entire route's score, so the metric has high variance.
  • Bench2Drive (NeurIPS 2024). Same scenario library as Leaderboard 2.0, but routes deliberately shortened to ~150 m with exactly one safety-critical scenario each. This is the key methodological move: short routes with a single tested ability give you 220 quasi-independent measurements you can bucket into 5 ability dimensions, instead of 90 noisy compound measurements.
  • NAVSIM v2 (CVPR 2025). The non-CARLA alternative — sensor data is real (nuPlan camera + LiDAR), simulation is BEV-abstraction with reactive agents added in v2, and novel-view synthesis is used to handle the planner driving off the recorded trajectory. EPDMS (Extended Predictive Driver Model Score) is the closed-loop metric.

If a future interview asks "Bench2Drive vs NAVSIM v2, which one do I trust?", the precise answer is: Bench2Drive for the closed-loop reactive signal, NAVSIM v2 for the sensor realism, and an honest planner stack reports both because the answers disagree about 30 % of the time.

Prerequisites

  • Project 07 must be complete. You need CARLA 0.9.15 server installed and verified (the sanity check at the bottom of project 07's setup must print 0.9.15).
  • Comfortable on Linux, comfortable with a ~1.4 GB checkpoint download.
  • Python 3.8 (Bench2DriveZoo's mmdet3d fork pins to 3.8 like CARLA 0.9.15 does — convenient).
  • Familiarity with PyTorch 2.x and mmcv / mmdet style configs (UniAD is built on the OpenMMLab stack; Bench2DriveZoo bundles a merged variant of mmcv 1.x + mmdet3d 0.17.1).
  • ~50 GB free disk (Bench2Drive route configs are small, but the UniAD-base checkpoint is 1.4 GB and per-route logs add up fast over 220 routes).

Hardware

  • Recommended: Workstation Linux box with a single RTX 3090 / 4090 / A6000 (24 GB VRAM). UniAD-base inference is ~6 GB VRAM; CARLA server eats another ~6 GB; running both on one card needs ~16 GB, comfortable on 24 GB.
  • Acceptable: RTX 4080 16 GB, but you will need to run the CARLA server on a second GPU or accept slower frame-by-frame scheduling.
  • Marginal: 12 GB VRAM cards (3060, 4070). UniAD-base will OOM. Drop to UniAD-tiny or use VAD-base (smaller). VAD-base runs in ~4 GB.
  • Time: smoke test (5 routes) ≈ 25 minutes wall-clock. 30-route subset ≈ 2.5 hours. 50-route subset ≈ 4 hours. Full 220-route eval ≈ 8–10 hours on a single 4090. Bench2Drive's authors recommend running the full benchmark on 8 GPUs in parallel; the official scripts shard the route list and aggregate at the end.
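
You won't need to re-implement that sharding, but the idea is simple enough to sketch. This is a hypothetical illustration, not the official script:

# Hypothetical sketch of the multi-GPU sharding idea: split the route list
# into one shard per GPU, run an evaluator per shard, merge the JSONs after.
from typing import List

def shard_routes(route_ids: List[str], num_gpus: int) -> List[List[str]]:
    # Round-robin so shards stay balanced (220 routes on 8 GPUs -> 27 or 28 each).
    shards = [[] for _ in range(num_gpus)]
    for i, route_id in enumerate(route_ids):
        shards[i % num_gpus].append(route_id)
    return shards

shards = shard_routes(["route_%d" % i for i in range(220)], num_gpus=8)
print([len(s) for s in shards])  # [28, 28, 28, 28, 27, 27, 27, 27]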

If you are on a Mac M-series, the same remote-GPU workflow as project 07 applies: rent a 4090 from Lambda / Vast / Runpod for $0.40–$0.80/hour, run Bench2Drive there, and pull the result JSONs back. A 30-route subset costs ~$2 of compute. Don't spin up a separate cloud GPU just for the smoke test; run it at the start of the same rented session as the subset.

Setup

setup.sh extends project 07's environment with Bench2Drive. You can pass --reuse-project07 to symlink against the existing project 07 venv and CARLA install rather than re-downloading. Three things happen:

A. Clone Bench2Drive (the route configs + leaderboard fork). This is the Thinklab-SJTU/Bench2Drive repo — it ships a fork of carla-simulator/leaderboard and carla-simulator/scenario_runner with the patches needed to run the curated 220-route benchmark. The route XMLs live at Bench2Drive/leaderboard/data/bench2drive220.xml.

B. Clone Bench2DriveZoo (the planner zoo). This is Thinklab-SJTU/Bench2DriveZoo, branch uniad/vad. It ships UniAD-base, UniAD-tiny, VAD-base, and BEVFormer-base configs that have been adapted to the Bench2Drive sensor stack (different camera intrinsics and ego frame conventions than the original UniAD nuScenes setup). The agent files (uniad_b2d_agent.py, vad_b2d_agent.py) get symlinked into Bench2Drive/leaderboard/team_code/.

C. Download UniAD-base checkpoint. ~1.4 GB. The script grabs it from the Bench2DriveZoo Hugging Face mirror; if that fails, fall back to the Tsinghua Cloud mirror linked in their README. Verify the SHA-256.
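
A minimal checksum check, with a placeholder digest and a hypothetical checkpoint path; take the real values from the Bench2DriveZoo README:

# Verify the downloaded checkpoint against the published SHA-256.
# EXPECTED and the checkpoint path are placeholders; the real values are in
# the Bench2DriveZoo README / release page.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

EXPECTED = "<sha256-from-the-Bench2DriveZoo-README>"  # placeholder
digest = sha256_of("ckpts/uniad_base_b2d.pth")        # hypothetical path
assert digest == EXPECTED, "checksum mismatch: %s" % digest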

The full setup-and-verify sequence:

cd projects/15-bench2drive-closed-loop
chmod +x setup.sh
./setup.sh --reuse-project07   # or omit the flag to build a fresh venv
source .venv/bin/activate
# Start the CARLA server (same command as project 07)
# Verify Bench2Drive can find everything:
python -c "from team_code.uniad_b2d_agent import UniadAgent; print('OK')"
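
A server-side sanity check mirroring project 07's, assuming CARLA's default RPC port 2000:

# Confirm the CARLA server is reachable and is the expected 0.9.15 build.
import carla

client = carla.Client("localhost", 2000)  # default RPC port
client.set_timeout(10.0)                  # seconds; first connect can be slow
print("server:", client.get_server_version())  # expect 0.9.15
print("client:", client.get_client_version())  # should match the server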

Steps

  1. Verify project 07 still works. Run project 07's sanity check first. If your CARLA server doesn't start, fix that before touching anything in this project. Project 15 cannot debug CARLA itself.
  2. Install Bench2Drive. Run setup.sh. Read the output. There are three pieces (the Bench2Drive and Bench2DriveZoo clones plus the merged-mmcv wheel) and one checkpoint download.
  3. Walk through the 220-route benchmark structure. Section 3 of the notebook parses bench2drive220.xml and shows you the route → scenario → ability mapping. Confirm with your own eyes that there are 220 routes and that each route has exactly one safety-critical scenario.
  4. Load UniAD-base. Section 4 instantiates the agent. First run will JIT-compile some mmdet ops; expect a 60-second pause.
  5. 5-route smoke test. Section 5 runs UniAD on the first 5 routes (one from each ability bucket). ~25 min. Confirm Driving Scores come out non-zero — if they're all zero, the sensor frame conventions are off and you need to recheck uniad_b2d_agent.py's sensors() method against team_code/sensors.json.
  6. Run the 30–50 route subset. Section 6 runs a stratified subset that hits each of the 5 abilities ~10 times. ~3 hours. Use tmux or nohup. The script is resumable.
  7. Compute Driving Score per route. Section 7 parses the Bench2Drive-format JSON output. Driving Score = (Route Completion %) × (Infraction Penalty). Penalty is multiplicative: each infraction multiplies the score by a category-specific factor (red-light = 0.7, vehicle collision = 0.6, etc.).
  8. Per-ability breakdown. Section 8 buckets routes by ability and computes Success Rate per bucket. SR = fraction of routes where DS > 0 and no infractions occurred. This is Bench2Drive's headline contribution: a 5-row table that tells you which aspect of driving the planner is bad at, not just how bad.
  9. Identify the worst ability and replay one failure. Section 9 picks the lowest-SR ability and the lowest-DS route within it, then renders the saved RGB front camera + a BEV plot of the trajectory and the scenario actors at the moment of failure.
  10. Reflection: closed-loop vs open-loop. Section 10 walks through the argument and asks you to write 5–8 sentences in your own words. The notebook has the prompts.
  11. (Optional) Cross-pollinate with NAVSIM v2. If you have time and disk, install NAVSIM v2 and run the same UniAD checkpoint on its devkit. Compare ranking. Expect disagreement.
  12. (User TODO) Swap in your own planner. Section 12 documents how to wrap your project-06 OpenVLA fine-tune (or any custom planner) in a Bench2Drive-compatible agent. The interface is small: implement setup(), sensors(), and run_step(input_data, timestamp) returning a carla.VehicleControl; a minimal skeleton follows this list.
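
For step 12, here is a minimal sketch of that interface. The base-class import follows the CARLA leaderboard convention, and the sensor rig is a single illustrative front camera; mirror uniad_b2d_agent.py for the real 6-camera setup.

# Skeleton Bench2Drive agent (step 12). The sensor rig below is a single
# illustrative front camera; copy the full rig from uniad_b2d_agent.py for
# anything UniAD-shaped (see pitfall 3 on sensor mismatches).
import carla
from leaderboard.autoagents.autonomous_agent import AutonomousAgent

def get_entry_point():
    return "MyPlannerAgent"  # the leaderboard looks the class up by this name

class MyPlannerAgent(AutonomousAgent):
    def setup(self, path_to_conf_file):
        # Runs once per route: load your checkpoint / config here.
        self.model = None  # placeholder for your planner

    def sensors(self):
        return [
            {"type": "sensor.camera.rgb", "id": "CAM_FRONT",
             "x": 0.8, "y": 0.0, "z": 1.6,
             "roll": 0.0, "pitch": 0.0, "yaw": 0.0,
             "width": 1600, "height": 900, "fov": 70},
        ]

    def run_step(self, input_data, timestamp):
        # input_data maps sensor id -> (frame, array). Let OOMs crash the
        # run rather than returning zeros (see pitfall 7).
        control = carla.VehicleControl()
        control.throttle, control.steer, control.brake = 0.0, 0.0, 1.0  # stub: full brake
        return control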

Done criterion

You are done when:

  • outputs/bench2drive_subset/results.json exists and contains per-route Driving Score for ≥ 30 routes.
  • outputs/bench2drive_subset/ability_breakdown.csv exists and reports SR + Efficiency + Comfort across all 5 abilities.
  • A failure rollout PNG exists at outputs/bench2drive_subset/worst_failure.png showing the front camera at the moment of infraction.
  • You can articulate, in 2–3 sentences, which ability your planner is weakest at and what the failure mode looks like (numerical and behavioral).
  • You can articulate, in 2–3 sentences, why open-loop ADE/FDE on the same checkpoint would have ranked it differently, citing the compounding-error argument.

A useful sanity check for the per-ability table: published UniAD-base numbers on the full 220-route benchmark (Bench2Drive paper, Table 4) report DS ≈ 37, SR ≈ 9 %, with Merging being the worst ability (SR ≈ 5 %) and Traffic Sign being the best (SR ≈ 25 %). On a 30-route subset your numbers will have ±15 % variance, but the ranking of abilities should be stable.
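
A sketch of how the breakdown table can be computed. The field names assumed in results.json are hypothetical (adapt them to whatever schema the Bench2Drive scripts emit), and success follows step 8's definition (DS > 0 and no infractions):

# Bucket per-route results into the 5 abilities and compute Success Rate.
# Field names ("ability", "driving_score", "num_infractions") are
# hypothetical; adapt to the actual results.json schema.
import json
from collections import defaultdict

with open("outputs/bench2drive_subset/results.json") as f:
    routes = json.load(f)

buckets = defaultdict(list)
for r in routes:
    buckets[r["ability"]].append(r)

print("%-16s %4s %8s %10s" % ("ability", "n", "SR %", "mean DS"))
for ability in sorted(buckets):
    rs = buckets[ability]
    wins = sum(1 for r in rs if r["driving_score"] > 0 and r["num_infractions"] == 0)
    mean_ds = sum(r["driving_score"] for r in rs) / len(rs)
    print("%-16s %4d %8.1f %10.1f" % (ability, len(rs), 100.0 * wins / len(rs), mean_ds))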

Common pitfalls

  1. CARLA-Leaderboard version mismatch. Bench2Drive ships its own fork of carla-simulator/leaderboard (not just a patchset). If you git clone carla-simulator/leaderboard and try to use it, the route XML schema will not parse, several scenario triggers will silently no-op, and the score will come out wrong. Use the leaderboard inside the Bench2Drive repo. setup.sh does this correctly; the failure mode appears when you reflexively "fix things" by re-cloning upstream.
  2. scenario_runner version mismatch. Same fork issue. Bench2Drive's scenario_runner adds the 44-scenario library in a way that the upstream v0.9.15 tag does not. The two are not drop-in interchangeable. The srunner/scenarios/ directory should contain files like accident.py, lane_change.py, cut_in_with_static_vehicle.py etc. — if those are missing, you have the wrong fork.
  3. Sensor stack mismatch. UniAD was trained on nuScenes (6 cameras, specific intrinsics, specific ego frame). Bench2DriveZoo's uniad_b2d_agent.py re-implements a 6-camera CARLA rig that approximately matches nuScenes geometry, but the intrinsics and FOV are not identical. If you copy a UniAD agent from somewhere else (e.g., the original UniAD repo or a Reasonable Code Studio fork), the camera frames going into the model will be slightly off-axis, and the model's output trajectories will be subtly wrong — Driving Score around 5 instead of 30, with no obvious failure mode. Always use the agent that ships in Bench2DriveZoo for the matching backbone.
  4. Eval-protocol differences across papers. UniAD's nuScenes paper reports L2 displacement error and collision rate in open-loop mode; the Bench2Drive paper reports Driving Score and ability SR in closed-loop. Same model, two different numbers. When you see "UniAD = 0.71 L2" and "UniAD = 37 DS", these are not contradictory — they are measuring different things on different protocols. The notebook's section 10 reflection is exactly about this.
  5. Route-config gotcha: 220 routes, not 220 scenarios. A route is a (town, weather, scenario) tuple. There are 44 scenarios grouped into 5 abilities; each scenario appears in five (town, weather) variants, giving 220 routes. If you accidentally evaluate on the 44-scenario list (which is also shipped, for ablations), you'll get a different denominator and your DS will look ~10 % too low. A quick sanity check follows this list.
  6. Synchronous mode and step-skip. CARLA Leaderboard 2.0 runs at 20 Hz, but UniAD inference at full resolution is ~6 Hz on a 4090. Bench2DriveZoo's agent handles this by running inference at lower frequency and replaying the last trajectory between inferences. If you crank inference frequency up (or the agent loop drops below 4 Hz for any reason), the controller fights itself, infractions spike, and your DS tanks. Don't optimize the agent's inference loop unless you understand the controller.
  7. OOM mid-route is silent in some failure modes. If the GPU OOMs during a forward() call, mmdet sometimes catches the exception, returns zeros, and the agent keeps going — driving the ego into a wall but not crashing the eval script. Watch nvidia-smi during the smoke test. If VRAM is hitting 99 %, drop to UniAD-tiny.
  8. Don't trust the first 5-route smoke test for ranking. Routes are not exchangeable. Five randomly-picked routes will have different ability mixtures and different difficulty ratings (Bench2Drive labels them low/medium/high). The notebook's smoke test deliberately picks one route from each ability bucket, but even then DS variance is ±20 %. Real conclusions need 30+ routes.
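
For pitfall 5, a quick sanity check against the route file. The tag and attribute names are assumptions based on the leaderboard route-XML convention (a routes root with route children carrying a town attribute); verify them against your copy of bench2drive220.xml:

# Confirm you are pointed at the 220-route benchmark, not the 44-scenario
# ablation list (pitfall 5). Tag/attribute names are assumptions from the
# leaderboard route-XML convention; check them against the actual file.
import xml.etree.ElementTree as ET
from collections import Counter

tree = ET.parse("Bench2Drive/leaderboard/data/bench2drive220.xml")
routes = tree.getroot().findall("route")

assert len(routes) == 220, "expected 220 routes, found %d" % len(routes)
print("towns:", Counter(r.get("town") for r in routes))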

Further reading

  • Bench2Drive paper: https://arxiv.org/abs/2406.03877. Read §3 (benchmark design), §4 (multi-ability evaluation), and §5 (the table of UniAD/VAD/TCP/AD-MLP results). The headline result that an "AD-MLP" baseline (a 3-layer MLP regressing ego trajectory from state) achieves competitive nuScenes open-loop scores while completely failing closed-loop is the cleanest empirical demonstration of the open-loop / closed-loop gap I know of.
  • Bench2Drive GitHub: https://github.com/Thinklab-SJTU/Bench2Drive. The leaderboard + route configs.
  • Bench2DriveZoo: https://github.com/Thinklab-SJTU/Bench2DriveZoo (branch uniad/vad). The planner zoo.
  • NAVSIM v2 challenge writeup: https://opendrivelab.github.io/CVPR2025/Autonomous_Grand_Challenge_NAVSIM_v2_End-to-End_Driving_Challenge.pdf. EPDMS, reactive backgrounds, and synthetic novel views. Read at minimum the metric definition section.
  • NAVSIM repo: https://github.com/autonomousvision/navsim. v2 lives on main as of mid-2025.
  • "Parting with Misconceptions about Learning-based Planning" (Dauner et al., CoRL 2023): https://arxiv.org/abs/2306.07962. The paper that publicly demonstrated open-loop nuPlan rankings did not predict closed-loop nuPlan rankings. Foundational reading for the reflection in section 10.
  • "Beyond Behavior Cloning" survey (Waymo, 2024): https://d1qx31qr3h6wln.cloudfront.net/publications/beyond_bc_survey_preprint.pdf. Read §3 on covariate shift and §5 on the closed-loop / open-loop reconciliation strategies.
  • CARLA Leaderboard 2.0 docs: https://leaderboard.carla.org/. Useful background for the broader ecosystem Bench2Drive sits in.
  • UniAD paper (CVPR 2023, Best Paper): https://arxiv.org/abs/2212.10156. The original architecture. Read §3 on the unified perception → prediction → planning head, since that is what you are running.
  • Hydra-NeXt (ICCV 2025): https://arxiv.org/abs/2503.12030. The current SOTA on Bench2Drive as of mid-2025; uses open-loop training data with closed-loop-aware regularizers. Useful as a roadmap for what your project-06-OpenVLA-fine-tune-as-a-planner extension might evolve into.

If you read one paper end-to-end, make it Bench2Drive. If you read two, add Dauner et al. The conjunction of those two is the entire 2023–25 closed-loop-eval reckoning in 60 pages.

Files in this project

  • README.md
  • notebook.py
  • requirements.txt
  • setup.sh

Notebook (notebook.py) is in jupytext percent format — open in VS Code or convert with jupytext --to notebook.