Physical AI
Project 14 · Phase F · Behavior, sim agents, closed-loop · Hardware: Workstation GPU · Loops: EVAL

16 — Waymax Sim Agents (WOSAC realism)

"Without realistic NPCs, closed-loop sim is unfalsifiable." — /docs/06 §E.5. Project 15 closed the loop. This project asks: what are you closing the loop against?


Goal

Install Waymax (Waymo's open-source, JAX-based AV simulator), load Waymo Open Motion Dataset (WOMD) scenarios, and run two sim-agent baselines — log-replay (the trivial non-reactive baseline) and IDM (rule-based, reactive) — and optionally a small learned policy. Score every rollout with the Waymo Open Sim Agents Challenge (WOSAC) realism meta-metric and slice the result by scenario type (free-flow / overtaking / merging / unprotected turns). Then sit with the conclusion that none of these baselines are anywhere near the 2025 SOTA (~0.79 realism), and reason about why that gap is the gating problem for productized AV evaluation.

The narrative pitch:

  • A planner under test makes a different choice than the human did in the log.
  • If the other agents are log-replays, they keep doing what the human would have done. Your planner gets driven through a ghost world.
  • If the other agents are IDM (or a learned policy), they react to your planner's choice. Now you can ask "would this have caused a crash?"
  • Realism of those reactions is the unit of measure. WOSAC is the yardstick.

Loops touched

EVAL. Specifically, the substrate of closed-loop eval. Project 15 (Bench2Drive) closes the loop via CARLA's traffic manager — rule-based, heuristic NPCs. Waymax closes it via the Waymo Open Motion Dataset and data-driven sim-agents. Both are eval. Neither is sufficient on its own.

This is also the multi-agent half of /docs/06 §E.5: the document carves the AV-sim hard-problems landscape into (i) realistic sensor synthesis (NeRF / GS / diffusion — projects 04, 05, 12) and (ii) realistic behavior synthesis (Waymax sim-agents, Nocturne, Nuplan, MetaDrive — this project). Sensor sim without behavior sim is a pretty puppet show.

Why this matters for AI Data Intelligence

Productized AV evaluation frameworks — Applied Intuition's Validation Toolset, Waabi's Waabi World, Waymo's internal SimulationCity — all depend on the same load-bearing claim: "our NPCs behave like real road users, therefore our scenario-coverage numbers are credible, therefore our safety case is defensible." When that claim is shaky, the whole pyramid is shaky:

  1. Coverage claims collapse if NPC reactions are unrealistic. "We tested 10,000 cut-in scenarios" is meaningless if the cutting-in agent reacts to the SDV in a way no human would.
  2. Counterfactual reasoning collapses with non-reactive NPCs. Log-replay only tells you what would happen if your planner did exactly what the human did — which is the trivial counterfactual.
  3. Long-horizon behavior (the hard part: minutes-long rollouts where small distribution drifts compound) is precisely where current SOTA sim-agents fail. The 2025 WOSAC winner (SMART-R1, 0.7858 realism meta) is good — but "0.78 vs human 1.0" still leaves room for distribution drift over a 60-second rollout.
  4. The metric itself is contested. WOSAC is a likelihood-based composite; it does not directly measure "did anyone get hurt." A high-realism agent that occasionally produces weird collisions may score well on aggregate distributional metrics. This is a known issue and is why the 2025 metric added a traffic-light-violation term.

For the Data Intelligence team, this means: when a customer asks "is your closed-loop eval credible?", the honest answer is "credible for the slice of scenarios where our sim-agents are realistic enough" — and the team's job is to characterize that slice and feed back where it's thin. That feedback loop (scenario-mining → relabel → retrain sim-agents → re-evaluate) is the same loop project 16 (active learning) and project 01 (scenario mining) train.

Prerequisites

  • Project 15 for context on closed-loop eval and the rule-based-NPC baseline (CARLA traffic manager).
  • Project 13 is a strong companion: a motion-forecasting model (HiVT, MTR, etc.) trained on Argoverse-2 Motion is a sim-agent — you can drop it into Waymax. The User TODO at the end of this project is exactly that.
  • Waymo Open Motion Dataset access. This requires a Google account, accepting the WOMD license at https://waymo.com/open/download/, and (for the full dataset) gcloud auth login to read from gs://waymo_open_dataset_motion_v_1_3_0/. The license is research-only and forbids redistribution; respect it.
  • Familiarity with JAX is helpful but not required — the notebook treats Waymax as mostly opaque and only touches JAX where the API forces you to (jax.jit, jax.random.split).

Hardware

  • Workstation GPU (one is fine). Waymax was designed for TPU/multi-GPU scale-out at training time, but for evaluation — running rollouts and computing metrics on a few hundred scenarios — a single 24 GB GPU (RTX 3090 / 4090 / A5000 / L4) is comfortable. CPU-only works but is slow enough to be discouraging.
  • Disk: ~50 GB if you mirror a small slice of WOMD locally. The full validation split is ~1 TB; you do not need it. Stream from GCS for development; mirror only the slice you keep iterating on.
  • CUDA: 12.1+ with cuDNN 9.8+ if you go the JAX-CUDA-12 route. Match the JAX wheel to your driver version, not the other way around.

Setup

bash setup.sh

This creates .venv, installs jax[cuda12], Waymax (from GitHub main, since PyPI lags), waymo-open-dataset-tf-2-12-0, and the visualization dependencies. It does not download data — you must register at waymo.com/open separately.
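The key install lines in setup.sh look roughly like this (a sketch, not the script itself — the Waymax install-from-main command follows the waymo-research/waymax repo README; pin the JAX wheel to your own CUDA stack):

```shell
# Create and activate the virtual environment
python -m venv .venv
source .venv/bin/activate

# JAX with CUDA 12 support -- match the wheel to your driver
pip install -U "jax[cuda12]"

# Waymax from GitHub main (PyPI lags behind main)
pip install "git+https://github.com/waymo-research/waymax.git@main#egg=waymo-waymax"

# WOMD protos / utilities
pip install waymo-open-dataset-tf-2-12-0
```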

After setup, verify the GPU is visible to JAX:

import jax
print(jax.devices())  # expect [CudaDevice(id=0)] or similar

If you get CpuDevice, the CUDA install is wrong — fix it before running the notebook, since CPU rollouts on 100+ scenarios will take hours.

Steps

  1. Verify hardware. JAX sees the GPU; XLA reports the right CUDA version.
  2. Install Waymax from GitHub main (PyPI is months behind; the WOSAC submission notebook lives only in main).
  3. Authenticate to WOMD. Either gcloud auth application-default login to stream from GCS, or download a TFRecord shard manually. The notebook provides instructions but cannot register on your behalf.
  4. Load a scenario. waymax.dataloader.simulator_state_from_womd_dict → SimulatorState. Visualize: ego, surrounding agents (vehicles / cyclists / pedestrians), lanes, traffic-light states.
  5. Run log-replay. The trivial baseline. Step the simulator with each agent's logged trajectory; observe that ego choices have no effect on anyone else.
  6. Run IDM. Use waymax.agents.IDMRoutePolicy. Agents follow the spatial path from the log but adjust speed via the IDM rule, reacting to the vehicle ahead. This is reactive — and the difference vs log-replay is visible the moment you perturb the ego.
  7. (Optional) Run a learned agent. A small constant-speed actor or a minimal MTR-style transformer. Waymax's wosac_submission_via_waymax.ipynb ships a constant-speed example you can start from.
  8. Generate 32 rollouts per scenario (WOSAC requirement; stochastic agents need samples). Deterministic agents produce 32 identical rollouts — that's fine.
  9. Compute WOSAC realism. Use waymax.metrics to compute the kinematic / interactive / map-based sub-metrics; combine with the 2025 weights into the realism meta. Compare across baselines.
  10. Slice analysis. Tag each scenario by interaction type (free-flow, overtaking, merging, unprotected turn). Compute realism per slice. Expect: log-replay scores artificially well in free-flow (it just plays the human log) and falls apart only when ego counterfactuals would have changed the world; IDM scores reasonably in car-following but fails in unprotected turns where lateral decisions matter.
  11. Reflection. Write up the log-replay-vs-reactive distinction; connect to project 15.
  12. (User TODO) Wire project 13's HiVT predictor into Waymax as a sim-agent and re-score. Compare to IDM. Sketch how you'd improve the slice you do worst on.
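To make the reactivity contrast in steps 5–6 concrete, here is a minimal stand-alone implementation of the textbook Intelligent Driver Model in plain Python (not Waymax's IDMRoutePolicy; parameter names and values are the usual IDM ones, chosen for illustration). A log-replay lead just replays its recorded speed; the IDM follower reacts to it:

```python
import math

# Textbook IDM parameters (usual symbols): v0 desired speed, T time headway,
# a max acceleration, b comfortable braking, s0 minimum gap, delta exponent.
def idm_accel(v, v_lead, gap, v0=15.0, T=1.5, a=1.5, b=2.0, s0=2.0, delta=4.0):
    dv = v - v_lead                                   # closing speed
    s_star = s0 + v * T + v * dv / (2.0 * math.sqrt(a * b))
    return a * (1.0 - (v / v0) ** delta - (max(s_star, 0.0) / gap) ** 2)

# Lead vehicle "log-replays" a constant 5 m/s; the IDM follower starts fast
# and close, and must react -- something a log-replayed follower never does.
dt, v, v_lead, gap = 0.1, 12.0, 5.0, 20.0
for _ in range(50):                                   # 5 s rollout
    v = max(0.0, v + idm_accel(v, v_lead, gap) * dt)
    gap = max(0.1, gap + (v_lead - v) * dt)           # relative motion

print(f"v={v:.2f} m/s, gap={gap:.1f} m")              # follower settles behind the lead
```

The same asymmetry is what step 5 vs step 6 demonstrates inside Waymax: perturb the ego and only the reactive agent's trajectory changes.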

Done criterion

  • WOSAC realism meta-metric for at least two baseline policies (log-replay and IDM, on the same scenario set).
  • A slice analysis: realism broken out by scenario type (≥ 3 slices).
  • A short written reflection (in the notebook, in the final markdown cell) that names: which baseline is better, on which slices, and why — i.e. what the failure mode looks like in agent-trajectory space, not just in the metric.
  • A working pointer (commented code or a TODO list) for replacing IDM with a project-13-trained motion-forecasting model.
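The last bullet can start from an adapter sketch like this. Everything here is hypothetical scaffolding in pure Python — Waymax's real actor API has its own (jitted) signatures, so treat this as the shape of the idea, not the actual interface:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = List[Tuple[float, float]]  # toy observation: (x, y) per agent

@dataclass
class ForecasterAgent:
    """Wraps any 'predict next positions' model as a reactive sim-agent.

    `predict` is a stand-in for a trained motion forecaster (e.g. the
    project-13 model); here it can be any callable with this signature.
    """
    predict: Callable[[State], State]

    def step(self, state: State) -> State:
        return self.predict(state)  # reactive: sees the *current* world

def constant_velocity(state: State, vel=(1.0, 0.0)) -> State:
    # Simplest possible 'forecaster': everyone keeps moving at `vel`.
    return [(x + vel[0], y + vel[1]) for (x, y) in state]

agent = ForecasterAgent(predict=constant_velocity)
state = [(0.0, 0.0), (5.0, 2.0)]
for _ in range(3):
    state = agent.step(state)
print(state)  # [(3.0, 0.0), (8.0, 2.0)]
```

Swapping `constant_velocity` for a real model call is the whole TODO; the adapter boundary is where you handle coordinate frames and history buffering.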

Common pitfalls

  1. JAX/CUDA version mismatch. JAX wheels are picky about CUDA + cuDNN versions. Use pip install -U "jax[cuda12]" against a CUDA 12.1+ system. If jax.devices() returns CPU, do not proceed — fix it first.
  2. Waymo Open licensing. WOMD requires a Google account and accepting the license. There is no shortcut. Don't redistribute the data, don't commit even small samples to git, don't push them to a public bucket.
  3. Log-replay is not a sim-agent. It is a recording. Anything you "learn" from a log-replay-only eval about your planner's safety is conditional on the planner doing exactly what the human did. The whole point of closed-loop eval is to relax that condition.
  4. IDMRoutePolicy still uses the logged path. It only modulates speed, not the spatial trajectory. So if your ego forces a lane change and the IDM agent "should" yield laterally, it won't — it will just slow down. Document this honestly in the slice analysis.
  5. WOSAC requires 32 rollouts per scenario. Deterministic agents waste compute (all 32 are identical) but still need to be submitted in the expected shape. The format is opinionated; use Waymax's helpers.
  6. Metric interpretation. The realism meta-metric is a likelihood-style composite; "0.78 SOTA" doesn't mean "78 % realistic." It means the model's distributional fit to held-out human behavior is high under the WOSAC weighting. A planner trained against a 0.78-realism sim-agent can still be exploiting weird tails.
  7. Don't confuse the SDC and sim-agents. WOSAC requires the submitted model to control every agent that is valid at the 11th timestep (including the SDC). It is not about evaluating a planner — it is about evaluating the behavior model.
  8. Annual metric churn. WOSAC's metric definition changes most years (2025 added a traffic-light-violation term and changed the time-to-collision filter). Numbers across years are not directly comparable. Cite the year.
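On pitfall 5's "expected shape" point: for a deterministic policy you can tile one rollout across the 32-sample axis instead of re-simulating. The array shapes below are illustrative, not the exact submission proto:

```python
import numpy as np

# One deterministic rollout: T timesteps x A agents x (x, y, heading).
# T, A and the random data are placeholders for a real rollout.
T, A = 80, 8
rollout = np.random.default_rng(0).normal(size=(T, A, 3))

# WOSAC wants 32 parallel samples per scenario; for a deterministic
# agent, broadcast the single rollout instead of re-running it 32 times.
samples = np.broadcast_to(rollout, (32, T, A, 3))

print(samples.shape)                        # (32, 80, 8, 3)
assert np.shares_memory(samples, rollout)   # zero-copy view, no wasted memory
```

Waymax's submission helpers still expect the full 32-sample tensor; the broadcast just makes producing it cheap.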

Further reading

  • Waymax paper — Gulino et al., Waymax: An Accelerated, Data-Driven Simulator for Large-Scale Autonomous Driving Research, NeurIPS 2023. arXiv:2310.08710. The multi_actors_demo.ipynb and wosac_submission_via_waymax.ipynb in the repo are the practical starting points.
  • WOSAC paper — Montali et al., The Waymo Open Sim Agents Challenge, NeurIPS 2023 D&B. arXiv:2305.12032. Definitive metric reference; check the annual updates page for current weights.
  • 2023 winner — Multiverse Transformer, Wang et al. arXiv:2306.11868. First-place 2023 solution.
  • 2025 winner — SMART-R1, Advancing Multi-agent Traffic Simulation via R1-Style Reinforcement Fine-Tuning, arXiv:2509.23993. State-of-the-art realism meta = 0.7858. Key idea: SFT → RFT → SFT iteration with a metric-oriented policy optimization (MPO) reward shaped by the WOSAC metrics themselves.
  • 2025 honorable mentions — TrajTok (arXiv:2506.21618), UniMM, RLFTSim. Useful diversity of approaches: tokenized next-token prediction (TrajTok / SMART), unified mixture models (UniMM), RL fine-tuning (RLFTSim).
  • MTR / MultiPath++ — strong motion-forecasting backbones that can be reused as sim-agents. Drop-in candidates for the User TODO.
  • CATK (NVlabs) — Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models, CVPR 2025 (oral). Closed-loop fine-tuning is the obvious upgrade path for any open-loop next-token sim-agent — it is how the field is moving and is worth understanding before you propose anything to a real AV team.
  • /docs/06 §E.5 — the writeup that motivated this project. Reread after the notebook; the framing will land harder.

Files in this project

  • README.md
  • notebook.py
  • requirements.txt
  • setup.sh

Notebook (notebook.py) is in jupytext percent format — open in VS Code or convert with jupytext --to notebook.