Project 07 — CARLA OpenSCENARIO scenarios
A hands-on introduction to scenario-based testing as a discipline. You will install CARLA 0.9.15, author one OpenSCENARIO 1.0 cut-in scenario in XML (.xosc), run it through scenario_runner, sweep two parameters (time-to-collision × precipitation) over a 10×10 grid, and plot the failure surface for the default CARLA Traffic Manager autopilot.
By the end you should be able to answer, in your own words, the question: "What does scenario-coverage actually measure, and why is it different from a unit test?"
Goal
- Stand up a CARLA 0.9.15 server in headless mode.
- Read and modify a clean OpenSCENARIO 1.0 XML file (`cut_in_baseline.xosc`).
- Drive that scenario through `scenario_runner.py` from Python.
- Capture front-camera frames, lidar point clouds, ground-truth ego-vs-lead distances, and collision events for each run.
- Sweep two parameters — `ttcAtCutIn` ∈ [0.5, 3.0] s and `precipitation` ∈ [0, 100] — over a 10×10 grid (100 runs).
- Plot the failure surface as a heatmap and reflect on what it does and does not tell you.
The outcome artifact you are aiming for is a single `failure_surface.png` plus the 100 per-run JSONs feeding it. The learning artifact is a sharp, defensible answer to the reflection question above.
Loops touched
This project lives at the intersection of two of the four loops in docs/04-four-loops.md:
- COLLECT (synthetic). Scenarios are deterministic-ish data sources. Every run produces RGB + lidar + ground-truth tracks, parameterised by a knob you control. This is exactly the role NVIDIA's Cosmos and Waymo's CarCraft play in their respective stacks: synthetic data with known ground truth and no labelling cost.
- EVAL. A scenario sweep is an evaluation harness. The failure surface is a coverage metric and the per-run JSONs are an evaluation log. When you swap a learned policy in for the TM autopilot in the user-TODO section, the same harness becomes your closed-loop benchmark.
Why this matters for AI Data Intelligence
Applied Intuition's flagship product is Simian — at its core, Simian is industrial-scale scenario-based simulation. The atomic units of work are scenario specifications (M-SDL or OpenSCENARIO 2.0 today), the deliverables are coverage surfaces over scenario parameter spaces, and the value proposition is that you can fail safely in software a hundred million miles before you fail expensively on a public road. Foretellix's Foretify, Cognata's suite, and Waymo's internal CarCraft stack are all the same idea with different tooling and a different price point.
CARLA + scenario_runner is the open-source academic analog of that workflow. It is what every autonomous-driving paper uses for closed-loop evaluation; it is what the Bench2Drive (NeurIPS 2024) and CARLA Leaderboard 2.0 benchmarks are built on; and it is the cheapest way to internalise the workflow before walking into a Data Intelligence interview.
The intellectual lineage is worth knowing:
- PEGASUS (2016-2019, BMWK-funded). German consortium project that produced the first attempt at a "what does enough simulation coverage look like for L3 highway pilot homologation in Europe?" answer. PEGASUS produced a scenario catalogue, a parameter-space taxonomy, and the conceptual machinery (logical scenario → concrete scenario → test case) that ASAM later standardised.
- ASAM OpenSCENARIO 1.x XML (2019 →). The PEGASUS deliverables became an ASAM standard. Declarative XML, schema-validated, deliberately limited in expressiveness. Versions 1.0, 1.1, 1.2, and 1.3 are all backwards-compatible variants of the same idea.
- ASAM OpenSCENARIO 2.0 DSL (2022 →). A programming language, not a markup format. Procedural composition, modifiers, constraint solving, parameter ranges. This is what Applied Intuition and Foretellix built their commercial stacks around.
- CARLA Leaderboard 2.0 (2023 →) and Bench2Drive (NeurIPS 2024). The current academic state of the art: 44 interactive scenarios × 23 weathers × 12 towns, scored with disentangled multi-ability metrics. Built directly on CARLA + scenario_runner. The "44 scenarios" includes a cut-in family that maps almost directly onto what you are building here.
- NAVSIM v2 (2024-25). Open-loop / planner-only counterpoint to the closed-loop CARLA path. Worth knowing about because the open-loop-vs-closed-loop debate is one of the fault lines in the field.
The single most important distinction you will leave this project with: OpenSCENARIO 1.x XML vs OpenSCENARIO 2.0 DSL. They are not the same thing despite the shared name.
- 1.x is declarative XML. CARLA's `scenario_runner` parses it natively and has done so since 2020. It is what 95% of public CARLA scenarios are written in. Its expressiveness is roughly "trigger-action automaton with a few parameter knobs". The `.xosc` file in this project is OpenSCENARIO 1.0.
- 2.0 DSL is a procedural language. It supports composition, inheritance, parameter ranges with constraint solving, and stochastic sampling. CARLA's scenario_runner has experimental 2.0 support (`--openscenario2`), but the toolchain is much less mature. Applied Intuition, Foretellix, and most commercial tools target 2.0.
If a future interview asks you "which OpenSCENARIO does CARLA support?", the precise answer is "1.0 fully, 1.1+ partially, 2.0 experimentally — and the 1.x and 2.0 lineages are surprisingly distinct despite the shared name."
Prerequisites
- Comfortable on Linux command line.
- Python 3.8 (CARLA 0.9.15 ships official wheels for 3.7 and 3.8 only).
- ~30 GB free disk (CARLA tarball is 7 GB unpacked; sweep outputs add a few GB).
- Familiarity with the project-01 and project-02 notebooks; this project assumes you already know how to manage a venv and a Jupytext notebook.
Hardware
CARLA is an Unreal Engine 4 rendering pipeline. It needs a real GPU. Specifics:
- Recommended: Linux (Ubuntu 20.04 or 22.04) with an NVIDIA GPU, driver ≥ 525, ≥ 6 GB VRAM. RTX 3060 or better is comfortable.
- Acceptable: Windows 10/11 with the same NVIDIA setup. CARLA's Windows tarball works, but Docker workflow is rougher.
- Not supported: macOS arm64 (Apple Silicon). CARLA does not ship arm64 binaries and Apple Silicon does not support GPU passthrough into Linux VMs. Your only realistic options on a Mac M-series are:
  - Run CARLA on a remote Linux box (a Lambda Labs or Vast.ai single-GPU instance is ~$0.50–$1.00/hour for a 3090/4090) and SSH-tunnel port 2000 back to your laptop. The Python client side of this notebook runs fine locally.
  - Use a cloud workstation with an NVIDIA GPU (AWS g5.xlarge, GCP n1 with T4) for the duration of the project.
- Marginal: Intel Mac with an NVIDIA eGPU. In theory works; in practice a maintenance nightmare.
The 100-run sweep takes roughly 70 minutes on an RTX 3060. Time scales linearly — a 5×5 sweep is 17 minutes and gets you most of the qualitative shape.
Setup
The `setup.sh` script handles the Python side. The CARLA server (the 7 GB Unreal binary) you have to download separately. There are three reasonable paths:
A. Docker (easiest on Linux + NVIDIA). Pull `carlasim/carla:0.9.15`, run with `--gpus all` and `-RenderOffScreen`. The full command is printed by `setup.sh`. Requires `nvidia-container-toolkit`.
B. Tarball (Linux or Windows, when you want fine control). Download from the official GitHub release and unpack. The Python client `.egg` file is shipped inside the tarball at `PythonAPI/carla/dist/`; if `pip install carla==0.9.15` fails (it sometimes does on non-3.8 Python), `setup.sh` will tell you to add the egg to `PYTHONPATH`.
C. Remote (Mac users). Run option A on a remote Linux GPU box. SSH-tunnel ports 2000–2002 (e.g. `ssh -N -L 2000:localhost:2000 -L 2001:localhost:2001 -L 2002:localhost:2002 user@gpu-box`). Set `CARLA_HOST=localhost` (the tunnel makes the remote server look local). The notebook works unchanged.
After running `setup.sh`:

```bash
cd projects/07-carla-scenarios
chmod +x setup.sh
PYTHON_BIN=python3.8 ./setup.sh
# follow the printed instructions to start the server
source .venv/bin/activate
python -c "import carla; c=carla.Client('localhost',2000); c.set_timeout(10.0); print(c.get_server_version())"
# expected output: 0.9.15
```

`setup.sh` clones scenario_runner at the matching v0.9.15 tag into `./scenario_runner/`. The notebook invokes `scenario_runner.py` from there as a subprocess.
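If the wheel route fails and you fall back to path B's `.egg`, the standard pattern is to put the egg on `sys.path` before `import carla`. A sketch, assuming a hypothetical unpack location (adjust the glob to wherever you unpacked the tarball):

```python
import glob
import sys

# Hypothetical unpack location; adjust to your tarball path.
candidates = glob.glob('CARLA_0.9.15/PythonAPI/carla/dist/carla-*py3.*.egg')
if candidates:
    sys.path.append(candidates[0])

import carla  # resolves against the egg when the pip wheel is absent
```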
Steps
- Verify the install. Run the sanity check from `setup.sh`. The notebook's section 1 repeats the check inline. Server and client versions must match exactly.
- Load Town04 and inspect the OpenDRIVE. Town04 is CARLA's highway map — long, multi-lane, mostly straight. Section 2 of the notebook loads it, switches to synchronous mode (essential for deterministic scenario timing), and dumps the OpenDRIVE topology. Note the road-id / lane-id conventions; they're what `cut_in_baseline.xosc` references. A minimal connection-and-sync sketch follows this list.
- Read `cut_in_baseline.xosc`. Open it. Read every comment. The file is the source of truth for the scenario; the Python notebook only invokes it. The four parameters at the top of the file (`ttcAtCutIn`, `leadSpeed`, `egoSpeed`, `cutInDuration`, plus weather knobs) are the knobs you'll sweep.
- Run the baseline scenario once. Section 4 of the notebook drives `scenario_runner.py --openscenario cut_in_baseline.xosc` as a subprocess, attaches a front camera, lidar, and collision sensor to the ego from a second CARLA client, and tabulates the run into a `RunOutcome` dataclass. ~40 seconds wall-clock.
- Inspect a captured frame. A front-camera JPEG and a lidar BEV scatter plot. This is your visual sanity check that sensors are firing correctly.
- Define the parameter sweep. TTC × precipitation, 10×10. Note in section 5 how `fog_visibility_for_precip` couples the two — heavy rain implies low visibility — to keep the sweep physically realistic.
- Run the sweep. ~70 minutes of compute. The `skip_existing=True` flag makes this resumable. The sweep-driver sketch after this list shows the shape of the loop.
- Plot the failure surface. Two heatmaps: categorical (success / near-miss / collision) and continuous (min ego↔lead distance). The continuous one is more informative — it shows the gradient approaching collision, not just the binary outcome.
- Reflect. Section 7 of the notebook lays out four critiques of the heatmap you just produced. Read them. Decide which you agree with.
- Pick one user-TODO (section 8 of the notebook) and spend 30–90 minutes on it. Recommended: option B (swap the scenario to `lead_brake`) — fastest to set up, biggest qualitative payoff.
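The connection-and-sync sketch referenced in step 2. This is a minimal version assuming a server on `localhost:2000`; the notebook's section 2 is the authoritative code.

```python
import carla

client = carla.Client('localhost', 2000)
client.set_timeout(10.0)
world = client.load_world('Town04')  # CARLA's highway map

# Synchronous mode: the world advances only when a client calls world.tick(),
# which is what makes scenario timing deterministic.
settings = world.get_settings()
settings.synchronous_mode = True
settings.fixed_delta_seconds = 0.05  # 20 Hz fixed simulation step
world.apply_settings(settings)

# The OpenDRIVE topology is a list of (start, end) waypoint pairs; the
# road/lane ids printed here are what cut_in_baseline.xosc references.
for wp_start, _wp_end in world.get_map().get_topology():
    print(wp_start.road_id, wp_start.lane_id)
```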
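And the sweep-driver sketch referenced in step 6, with step 7's plotting folded in. This is a hedged sketch: `fog_visibility_for_precip`, `run_scenario_once`, and `skip_existing` are the notebook's names, but the coupling formula, the output layout, and the `min_distance_m` JSON field are illustrative assumptions. Per-run parameter overrides are left as a comment; scenario_runner's `--openscenarioparams` flag is one way to pass them, if your checkout supports it.

```python
import json
import subprocess
import sys
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np

OUT = Path('outputs/sweeps')
OUT.mkdir(parents=True, exist_ok=True)

def fog_visibility_for_precip(precip: float) -> float:
    # Illustrative coupling only: heavier rain implies shorter visibility.
    return max(100.0, 1000.0 - 9.0 * precip)

ttcs = np.linspace(0.5, 3.0, 10)
precips = np.linspace(0.0, 100.0, 10)

for i, ttc in enumerate(ttcs):
    for j, precip in enumerate(precips):
        out_json = OUT / f'run_{i:02d}_{j:02d}.json'
        if out_json.exists():  # skip_existing behaviour: the sweep is resumable
            continue
        vis = fog_visibility_for_precip(precip)  # passed along with precip
        # The notebook wraps this call in run_scenario_once(), which also
        # attaches sensors from a second client and writes out_json.
        subprocess.run(
            [sys.executable, 'scenario_runner/scenario_runner.py',
             '--openscenario', 'cut_in_baseline.xosc',
             # per-run overrides for ttcAtCutIn / precipitation / vis go here
             ],
            check=True)

# Continuous failure surface: min ego-lead distance per cell.
dist = np.full((len(ttcs), len(precips)), np.nan)
for i in range(len(ttcs)):
    for j in range(len(precips)):
        p = OUT / f'run_{i:02d}_{j:02d}.json'
        if p.exists():
            dist[i, j] = json.loads(p.read_text())['min_distance_m']

plt.imshow(dist, origin='lower', aspect='auto',
           extent=(precips[0], precips[-1], ttcs[0], ttcs[-1]))
plt.xlabel('precipitation')
plt.ylabel('ttcAtCutIn [s]')
plt.colorbar(label='min ego-lead distance [m]')
plt.savefig(OUT / 'failure_surface.png', dpi=150)
```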
Done criterion
You are done when:
- `outputs/sweeps/failure_surface.png` exists and shows a recognisable structure (typically: collisions concentrated in the low-TTC × high-precipitation corner, successes elsewhere, a near-miss boundary along the diagonal).
- You can articulate, in 2–3 sentences, what the heatmap is and is not telling you.
- You can articulate the OpenSCENARIO 1.x XML vs 2.0 DSL distinction and where each is used in industry.
Common pitfalls
- Python version mismatch. CARLA 0.9.15's official `pip install carla` only ships 3.7 and 3.8 wheels. On Python 3.10 or 3.11, the install will fail or produce a stub package that fails at import. `setup.sh` handles the fallback (use the `.egg` from the tarball), but if you're on a fresh distro where `python3.8` isn't available, `pyenv` is your friend: `pyenv install 3.8.18 && PYTHON_BIN=$(pyenv which python3.8) ./setup.sh`.
- `scenario_runner` vs `srunner` confusion. Two related but distinct names. `scenario_runner` is the GitHub repo (carla-simulator/scenario_runner); the entry-point script inside it is `scenario_runner.py`; and the Python package shipped inside the repo is named `srunner`. You import `from srunner.scenarios.basic_scenario import BasicScenario`; you run `python scenario_runner.py --openscenario ...`. Both names will appear in error messages. They refer to the same thing.
- OpenSCENARIO version drift. The `<FileHeader revMajor="1" revMinor="0">` line in your `.xosc` is load-bearing. If you copy a snippet from a 1.1 or 1.2 example online (e.g. the newer `<UserDefinedAction>` shapes, or `<TrafficSwarmAction>`), `scenario_runner` will accept the file but silently ignore the unsupported elements. Symptom: the trigger you wrote never fires. Stick to 1.0 and the elements documented in scenario_runner's `openscenario_support.md`.
- Autopilot non-determinism. CARLA's TrafficManager is not bit-deterministic across runs even in synchronous mode — it has internal timers and seeds that drift. If you re-run the same `(ttc, precip)` cell twice, expect outcomes to differ in 5–10% of edge cases. To get reproducible runs, seed the Traffic Manager: `tm = client.get_trafficmanager(); tm.set_random_device_seed(42); tm.set_synchronous_mode(True)` (shown in context in the sketch after this list). The notebook does not do this by default, partly because the non-determinism itself is informative: it tells you which scenarios are robustly safe vs marginal.
- Synchronous mode left on. If your sweep crashes mid-run, the world stays in synchronous mode and the next CARLA client (yours or someone else's) will deadlock waiting for ticks. The notebook's section 9 resets to async mode on clean exit; if you hit a crash, restart the CARLA server entirely.
- Sensor handle leaks. Every scenario run spawns a new ego, and the notebook attaches new sensors. If `run_scenario_once` raises mid-loop, those sensors leak. Symptoms: VRAM creeps up over the sweep, eventually OOMing the GPU around run 50–70. The `try/finally` block in `run_scenario_once` mitigates this; if you modify it, keep the cleanup.
- Town04 road IDs hard-coded. The `.xosc` references `roadId="38"` for the main highway. CARLA renumbers roads occasionally between releases. If you upgrade past 0.9.15, verify the road id from `world.get_map().get_topology()` before assuming the spawn poses are still on the highway.
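A sketch tying the determinism, sync-mode, and sensor-leak pitfalls together. Minimal and hedged: the spawn-and-run body is elided with `...`, and the `localhost:2000` server address is an assumption.

```python
import carla

client = carla.Client('localhost', 2000)
client.set_timeout(10.0)
world = client.get_world()

# Seeded, synchronous Traffic Manager: the reproducibility fix described above.
tm = client.get_trafficmanager()
tm.set_random_device_seed(42)
tm.set_synchronous_mode(True)

sensors = []
try:
    ...  # spawn the ego, attach sensors (append each to `sensors`), run the scenario
finally:
    # Destroy sensor actors, or VRAM leaks accumulate across the sweep.
    for s in sensors:
        if s.is_alive:
            s.destroy()
    # Hand the world back in async mode, or the next client deadlocks on ticks.
    settings = world.get_settings()
    settings.synchronous_mode = False
    settings.fixed_delta_seconds = None
    world.apply_settings(settings)
    tm.set_synchronous_mode(False)
```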
Further reading
- ASAM OpenSCENARIO 1.0 user guide — https://releases.asam.net/OpenSCENARIO/1.0.0/. Read sections 4 (Storyboard), 5 (Actions), 6 (Conditions). 90 minutes well spent.
- ASAM OpenSCENARIO 2.0 spec — https://www.asam.net/standards/detail/openscenario-dsl/. Skim for the conceptual contrast with 1.x. You don't need to learn 2.0 to do this project, but knowing what it adds is the point.
- CARLA scenario_runner docs — https://scenario-runner.readthedocs.io/en/latest/. The `openscenario_support.md` page is the canonical answer to "which OpenSCENARIO 1.x elements does CARLA actually implement?".
- CARLA Leaderboard 2.0 — https://leaderboard.carla.org/. The reference framework on top of scenario_runner. If you do option C of the user-TODOs (swap in a learned agent), this is where the agents live.
- Bench2Drive (NeurIPS 2024) — https://arxiv.org/abs/2406.03877. Read §3 (benchmark design) and §4 (multi-ability evaluation protocol). This paper is the cleanest argument I know of for why a coverage surface beats a single driving score.
- PEGASUS final report — https://www.pegasusprojekt.de/en/. The intellectual ancestor. Mostly relevant for the "logical → concrete → test case" abstraction hierarchy that ASAM later standardised.
- Mobileye RSS (Responsibility-Sensitive Safety) — https://arxiv.org/abs/1708.06374. Argues for continuous safety envelopes over binary collision metrics. Useful counterpoint to the heatmap you'll produce here.
If you want one paper that ties everything in this project together, read Bench2Drive end-to-end. Everything else can be skimmed.
Files in this project
- README.md
- cut_in_baseline.xosc
- notebook.py
- requirements.txt
- setup.sh
The notebook (`notebook.py`) is in Jupytext percent format — open it in VS Code or convert with `jupytext --to notebook notebook.py`.