16 — Waymax Sim Agents (WOSAC realism)
"Without realistic NPCs, closed-loop sim is unfalsifiable." —
/docs/06§E.5. Project 15 closed the loop. This project asks: what are you closing the loop against?
Goal
Install Waymax (Waymo's open-source, JAX-based AV simulator), load Waymo Open Motion Dataset (WOMD) scenarios, and run two sim-agent baselines — log-replay (trivial, non-reactive) and IDM (rule-based, reactive) — plus, optionally, a small learned policy. Score every rollout with the Waymo Open Sim Agents Challenge (WOSAC) realism meta-metric and slice the result by scenario type (free-flow / overtaking / merging / unprotected turns). Then sit with the conclusion that none of these baselines is anywhere near the 2025 SOTA (~0.79 realism), and reason about why that gap is the gating problem for productized AV evaluation.
The narrative pitch:
- A planner under test makes a different choice than the human did in the log.
- If the other agents are log-replays, they keep doing what the human would have done. Your planner gets driven through a ghost world.
- If the other agents are IDM (or a learned policy), they react to your planner's choice. Now you can ask "would this have caused a crash?"
- Realism of those reactions is the unit of measure. WOSAC is the yardstick.
Loops touched
EVAL. Specifically, the substrate of closed-loop eval. Project 15 (Bench2Drive) closes the loop via CARLA's traffic manager — rule-based, heuristic NPCs. Waymax closes it via the Waymo Open Motion Dataset and data-driven sim-agents. Both are eval. Neither is sufficient on its own.
This is also the multi-agent half of /docs/06 §E.5: the document carves the AV-sim hard-problems landscape into (i) realistic sensor synthesis (NeRF / GS / diffusion — projects 04, 05, 12) and (ii) realistic behavior synthesis (Waymax sim-agents, Nocturne, nuPlan, MetaDrive — this project). Sensor sim without behavior sim is a pretty puppet show.
Why this matters for AI Data Intelligence
Productized AV evaluation frameworks — Applied Intuition's Validation Toolset, Waabi World, Waymo's internal SimulationCity — all depend on the same load-bearing claim: "our NPCs behave like real road users, therefore our scenario-coverage numbers are credible, therefore our safety case is defensible." When that claim is shaky, the whole pyramid is shaky:
- Coverage claims collapse if NPC reactions are unrealistic. "We tested 10,000 cut-in scenarios" is meaningless if the cutting-in agent reacts to the SDV in a way no human would.
- Counterfactual reasoning collapses with non-reactive NPCs. Log-replay only tells you what would happen if your planner did exactly what the human did — which is the trivial counterfactual.
- Long-horizon behavior (the hard part: minutes-long rollouts where small distribution drifts compound) is precisely where current SOTA sim-agents fail. The 2025 WOSAC winner (SMART-R1, 0.7858 realism meta) is good — but "0.78 vs human 1.0" still leaves room for distribution drift over a 60-second rollout.
- The metric itself is contested. WOSAC is a likelihood-based composite; it does not directly measure "did anyone get hurt." A high-realism agent that occasionally produces weird collisions may score well on aggregate distributional metrics. This is a known issue and is why the 2025 metric added a traffic-light-violation term.
For the Data Intelligence team, this means: when a customer asks "is your closed-loop eval credible?", the honest answer is "credible for the slice of scenarios where our sim-agents are realistic enough" — and the team's job is to characterize that slice and feed back where it's thin. That feedback loop (scenario-mining → relabel → retrain sim-agents → re-evaluate) is the same loop project 16 (active learning) and project 01 (scenario mining) train.
Prerequisites
- Project 15 for context on closed-loop eval and the rule-based-NPC baseline (CARLA traffic manager).
- Project 13 is a strong companion: a motion-forecasting model (HiVT, MTR, etc.) trained on Argoverse-2 Motion is a sim-agent — you can drop it into Waymax. The User TODO at the end of this project is exactly that.
- Waymo Open Motion Dataset access. This requires a Google account, accepting the WOMD license at https://waymo.com/open/download/, and (for the full dataset) `gcloud auth login` to read from `gs://waymo_open_dataset_motion_v_1_3_0/`. The license is research-only and forbids redistribution; respect it.
- Familiarity with JAX is helpful but not required — the notebook treats Waymax as mostly opaque and only touches JAX where the API forces you to (`jax.jit`, `jax.random.split`); a minimal sketch follows this list.
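If those two calls are new to you, this is essentially all the JAX the notebook needs. A minimal, self-contained sketch (pure JAX, no Waymax):

```python
import jax
import jax.numpy as jnp

# jax.jit compiles a pure function once and caches the compiled version.
@jax.jit
def decay(x):
    return x * 0.9 + 1.0

# JAX randomness is stateless: split one key into independent subkeys,
# e.g. one subkey per rollout when sampling stochastic sim-agents.
key = jax.random.PRNGKey(0)
rollout_keys = jax.random.split(key, 32)  # 32 independent streams

print(decay(jnp.ones(4)))   # [1.9 1.9 1.9 1.9]
print(rollout_keys.shape)   # (32, 2)
```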
Hardware
- Workstation GPU (one is fine). Waymax was designed for TPU/multi-GPU scale-out at training time, but for evaluation — running rollouts and computing metrics on a few hundred scenarios — a single 24 GB GPU (RTX 3090 / 4090 / A5000 / L4) is comfortable. CPU-only works but is slow enough to be discouraging.
- Disk: ~50 GB if you mirror a small slice of WOMD locally. The full validation split is ~1 TB; you do not need it. Stream from GCS for development; mirror only the slice you keep iterating on.
- CUDA: 12.1+ with cuDNN 9.8+ if you go the JAX-CUDA-12 route. Match the JAX wheel to your driver version, not the other way around.
Setup
```bash
bash setup.sh
```

This creates `.venv`, installs `jax[cuda12]`, Waymax (from GitHub `main`, since PyPI lags), `waymo-open-dataset-tf-2-12-0`, and the visualization dependencies. It does not download data — you must register at waymo.com/open separately.
After setup, verify the GPU is visible to JAX:
```python
import jax
print(jax.devices())  # expect [CudaDevice(id=0)] or similar
```

If you get `CpuDevice`, the CUDA install is wrong — fix it before running the notebook, since CPU rollouts on 100+ scenarios will take hours.
Steps
- Verify hardware. JAX sees the GPU; XLA reports the right CUDA version.
- Install Waymax from GitHub `main` (PyPI is months behind; the WOSAC submission notebook lives only in `main`).
- Authenticate to WOMD. Either `gcloud auth application-default login` to stream from GCS, or download a TFRecord shard manually. The notebook provides instructions but cannot register on your behalf.
- Load a scenario. `waymax.dataloader.simulator_state_from_womd_dict` → `SimulatorState`. Visualize: ego, surrounding agents (vehicles / cyclists / pedestrians), lanes, traffic-light states. (See the first sketch after this list.)
- Run log-replay. The trivial baseline. Step the simulator with each agent's logged trajectory; observe that ego choices have no effect on anyone else.
- Run IDM. Use `waymax.agents.IDMRoutePolicy`. Agents follow the spatial path from the log but adjust speed via the IDM rule, reacting to the vehicle ahead. This is reactive — and the difference vs log-replay is visible the moment you perturb the ego.
- (Optional) Run a learned agent. A small constant-speed actor or a minimal MTR-style transformer. Waymax's `wosac_submission_via_waymax.ipynb` ships a constant-speed example you can start from.
- Generate 32 rollouts per scenario (WOSAC requirement; stochastic agents need samples). Deterministic agents produce 32 identical rollouts — that's fine. (See the second sketch after this list.)
- Compute WOSAC realism. Use `waymax.metrics` to compute the kinematic / interactive / map-based sub-metrics; combine with the 2025 weights into the realism meta. Compare across baselines.
- Slice analysis. Tag each scenario by interaction type (free-flow, overtaking, merging, unprotected turn). Compute realism per slice. Expect: log-replay scores artificially well in free-flow (it just plays the human log) and falls apart only when ego counterfactuals would have changed the world; IDM scores reasonably in car-following but fails in unprotected turns where lateral decisions matter.
- Reflection. Write up the log-replay-vs-reactive distinction; connect to project 15.
- (User TODO) Wire project 13's HiVT predictor into Waymax as a sim-agent and re-score. Compare to IDM. Sketch how you'd improve the slice you do worst on.
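To make the scenario-loading, log-replay, and IDM steps concrete, here is a minimal single-rollout sketch. It assumes the API shown in Waymax's demo notebooks (`WOD_1_1_0_VALIDATION`, `simulator_state_generator`, `MultiAgentEnvironment`, `IDMRoutePolicy`); the `is_controlled_func` mask and the exact actor-call signature are assumptions to verify against `multi_actors_demo.ipynb`:

```python
import dataclasses
import jax
from waymax import agents, config as _config, dataloader, dynamics, env as _env

# Stream one WOMD scenario from GCS (needs gcloud auth; see Setup).
data_cfg = dataclasses.replace(_config.WOD_1_1_0_VALIDATION, max_num_objects=32)
scenario = next(dataloader.simulator_state_generator(config=data_cfg))

# Multi-agent environment: every valid object is controllable.
env = _env.MultiAgentEnvironment(
    dynamics_model=dynamics.StateDynamics(),
    config=dataclasses.replace(
        _config.EnvironmentConfig(),
        max_num_objects=32,
        controlled_object=_config.ObjectType.VALID,
    ),
)

# IDM actor: keeps the logged route, modulates speed reactively.
# Swap in agents.create_expert_actor(...) for pure log-replay.
idm = agents.IDMRoutePolicy(
    is_controlled_func=lambda state: state.log_trajectory.valid[..., 0]
)

state = env.reset(scenario)
rng = jax.random.PRNGKey(0)
for _ in range(state.remaining_timesteps):
    rng, key = jax.random.split(rng)
    out = idm.select_action({}, state, None, key)  # (params, state, actor_state, rng)
    state = env.step(state, out.action)
```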
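And for the 32-rollout and slice-analysis steps, the bookkeeping pattern. `scenario_tag()` and `realism_score()` are hypothetical placeholders for your own tagging heuristic and for the WOSAC metric computation; `env`, `idm`, and `scenario` continue from the sketch above:

```python
import collections
import jax
import numpy as np

NUM_ROLLOUTS = 32  # WOSAC submission requirement

def run_rollout(scn, actor, key):
    """One closed-loop rollout; env comes from the sketch above."""
    state = env.reset(scn)
    for _ in range(state.remaining_timesteps):
        key, step_key = jax.random.split(key)
        out = actor.select_action({}, state, None, step_key)
        state = env.step(state, out.action)
    return state

def scenario_tag(scn):
    # Hypothetical placeholder: replace with interaction-type heuristics.
    return "free_flow"

def realism_score(scn, rollouts):
    # Hypothetical placeholder: replace with WOSAC sub-metrics + 2025 weights.
    return 0.0

eval_set = [scenario]  # extend to a few hundred scenarios for real numbers
per_slice = collections.defaultdict(list)
for i, scn in enumerate(eval_set):
    keys = jax.random.split(jax.random.PRNGKey(i), NUM_ROLLOUTS)
    rollouts = [run_rollout(scn, idm, k) for k in keys]
    per_slice[scenario_tag(scn)].append(realism_score(scn, rollouts))

for tag, scores in per_slice.items():
    print(f"{tag}: mean realism {np.mean(scores):.3f} over {len(scores)} scenarios")
```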
Done criterion
- WOSAC realism meta-metric for at least two baseline policies (log-replay and IDM, on the same scenario set).
- A slice analysis: realism broken out by scenario type (≥ 3 slices).
- A short written reflection (in the notebook, in the final markdown cell) that names: which baseline is better, on which slices, and why — i.e. what the failure mode looks like in agent-trajectory space, not just in the metric.
- A working pointer (commented code or a TODO list) for replacing IDM with a project-13-trained motion-forecasting model; a hypothetical adapter stub follows this list.
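The last item could start from a stub like this. Every name below is illustrative rather than confirmed Waymax API; check the actor interfaces in `waymax.agents` for the real protocol:

```python
class PredictorSimAgent:
    """Hypothetical adapter: reuse a motion-forecasting model (e.g. project
    13's HiVT) as a Waymax sim-agent. Mirrors the select_action() call
    pattern from the rollout sketch in Steps."""

    def __init__(self, model, is_controlled_func):
        self.model = model  # maps agent history to sampled future waypoints
        self.is_controlled_func = is_controlled_func

    def select_action(self, params, state, actor_state, rng):
        # TODO 1: build model inputs from state.sim_trajectory (field name is
        #         an assumption; check waymax.datatypes.SimulatorState).
        # TODO 2: sample ONE future per controlled agent per rollout, so the
        #         32 WOSAC rollouts carry real diversity.
        # TODO 3: convert next waypoints into an action under StateDynamics.
        raise NotImplementedError("waypoint-to-action conversion")
```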
Common pitfalls
- JAX/CUDA version mismatch. JAX wheels are picky about CUDA + cuDNN versions. Use `pip install -U "jax[cuda12]"` against a CUDA 12.1+ system. If `jax.devices()` returns CPU, do not proceed — fix it first.
- Waymo Open licensing. WOMD requires a Google account and accepting the license. There is no shortcut. Don't redistribute the data, don't commit even small samples to git, don't push them to a public bucket.
- Log-replay is not a sim-agent. It is a recording. Anything you "learn" from a log-replay-only eval about your planner's safety is conditional on the planner doing exactly what the human did. The whole point of closed-loop eval is to relax that condition.
- IDMRoutePolicy still uses the logged path. It only modulates speed, not the spatial trajectory. So if your ego forces a lane change and the IDM agent "should" yield laterally, it won't — it will just slow down. Document this honestly in the slice analysis.
- WOSAC requires 32 rollouts per scenario. Deterministic agents waste compute (all 32 are identical) but still need to be submitted in the expected shape. The format is opinionated; use Waymax's helpers.
- Metric interpretation. The realism meta-metric is a likelihood-style composite; "0.78 SOTA" doesn't mean "78% realistic." It means the model's distributional fit to held-out human behavior is high under the WOSAC weighting. A planner trained against a 0.78-realism sim-agent can still be exploiting weird tails.
- Don't confuse the SDC and sim-agents. WOSAC controls every agent that is valid at the 11th (last observed) timestep, including the SDC. It is not about evaluating a planner — it is about evaluating the behavior model. (See the snippet after this list.)
- Annual metric churn. WOSAC's metric definition changes most years (2025 added a traffic-light-violation term and changed the time-to-collision filter). Numbers across years are not directly comparable. Cite the year.
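The SDC pitfall in one line: the controlled-agent set is just a validity mask at the last observed history step. Field names below follow Waymax's `SimulatorState` and are an assumption to verify:

```python
# WOSAC's controlled set: every agent valid at the last observed history
# step (index 10, i.e. the 11th step), including the SDC itself.
is_controlled = scenario.log_trajectory.valid[..., 10]  # bool, (num_objects,)
```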
Further reading
- Waymax paper — Gulino et al., Waymax: An Accelerated, Data-Driven Simulator for Large-Scale Autonomous Driving Research, NeurIPS 2023. arXiv:2310.08710. The `multi_actors_demo.ipynb` and `wosac_submission_via_waymax.ipynb` notebooks in the repo are the practical starting points.
- WOSAC paper — Montali et al., The Waymo Open Sim Agents Challenge, NeurIPS 2023 D&B. arXiv:2305.12032. The definitive metric reference; check the annual updates page for current weights.
- 2023 winner — Multiverse Transformer, Wang et al. arXiv:2306.11868. First-place 2023 solution.
- 2025 winner — SMART-R1, Advancing Multi-agent Traffic Simulation via R1-Style Reinforcement Fine-Tuning, arXiv:2509.23993. State-of-the-art realism meta = 0.7858. Key idea: SFT → RFT → SFT iteration with a metric-oriented policy optimization (MPO) reward shaped by the WOSAC metrics themselves.
- 2025 honorable mentions — TrajTok (arXiv:2506.21618), UniMM, RLFTSim. Useful diversity of approaches: tokenized next-token prediction (TrajTok / SMART), unified mixture models (UniMM), RL fine-tuning (RLFTSim).
- MTR / MultiPath++ — strong motion-forecasting backbones that can be reused as sim-agents. Drop-in candidates for the User TODO.
- CATK (NVlabs) — Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models, CVPR 2025 oral. Closed-loop fine-tuning is the obvious upgrade path for any open-loop next-token sim-agent — it is how the field is moving and is worth understanding before you propose anything to a real AV team.
- /docs/06 §E.5 — the writeup that motivated this project. Reread after the notebook; the framing will land harder.
Files in this project
- README.md
- notebook.py
- requirements.txt
- setup.sh
Notebook (notebook.py) is in jupytext percent format — open in VS Code or convert with `jupytext --to notebook`.