Project 09 — CARLA + Cosmos Transfer 2.5 hybrid sim, with a sim-to-real gap measurement
TL;DR. Render a CARLA scenario with full ground truth (boxes, masks, depth, HD-map projections), feed those renders into NVIDIA Cosmos Transfer 2.5 as control inputs under a "rainy night, city" prompt, and propagate the CARLA labels onto the photoreal output. Train a small detector (YOLOv8 or DETR) on three datasets — pure CARLA, CARLA + Cosmos Transfer, real nuScenes — and benchmark each on the real nuScenes test split. Report three mAP numbers and write up the gap quantitatively.
Goal
The single deliverable is a results table with three rows and one mAP column:
| Train set | mAP@0.5 on nuScenes test |
|---|---|
| Pure CARLA (synthetic only) | ? |
| CARLA + Cosmos-Transfer 2.5 | ? |
| Real nuScenes (same image budget) | ? |
…plus an interpretation paragraph that says how much of the CARLA-vs-real gap was closed by running the photoreal world model. This is a measurement project, not a leaderboard chase. You are quantifying a productization hypothesis, not maximizing a number.
Loops touched
This project lives in two of the seven AV/robotics data loops we use as a mental map:
- COLLECT (synthetic) — CARLA gives us deterministic, replayable, labeled ground truth. Cosmos Transfer rephotographs that ground truth into a target visual domain. The combination is exactly what NVIDIA calls "neural-rendered classical sim" and what Applied Intuition is productizing through Spectral (procedural assets) + Neural Sim (world-model rephotograph).
- LABEL & TRAIN — labels do not come from auto-labeling here; they come from the simulator's perfect ground truth. The interesting engineering question is: how do those labels survive Cosmos's stochastic resynthesis? (Spoiler: imperfectly. We measure it.)
We do not touch DEPLOY, MEASURE-IN-FIELD, EVAL or CURATE in this project. Project 17 (BDD100K data engine) handles the curate/measure loops.
Why this matters for AI Data Intelligence
Every AV company eventually faces the same wall: classical CARLA-style simulation gives you cheap, perfectly labeled data, but the visual domain gap to real-world cameras keeps a model trained purely on it from working in the field. Domain randomization helps a little; large diffusion-based world models help a lot more — Cosmos-Drive-Dreams (NVIDIA Toronto, arXiv 2506.09042) shows >10 mAP improvement over pure-sim baselines on real-world AV detection benchmarks, with the same labels, simply by running the sim through Cosmos.
This is the exact hybrid pattern Applied Intuition is shipping:
- Spectral: classical, ECU-in-the-loop, deterministic sim with full ground truth (the CARLA half of this project).
- Neural Sim: large generative world model that rephotographs the classical render under arbitrary visual conditions (the Cosmos half).
If you can articulate, with numbers, "we closed X% of the sim-to-real mAP gap using Cosmos at $Y/clip of compute, and here are the three failure modes we observed," you are no longer marketing the product. You are doing science on it. That distinction is the entire job in a Data Intelligence org: turn a tool launch into a measured, defensible claim.
Prerequisites
- Project 07 must be done first. This project is downstream of the CARLA scenarios you generated in `../07-carla-scenarios/`. Specifically, we expect on disk (a quick layout check is sketched after this list):
  - `../07-carla-scenarios/outputs/<scenario_id>/rgb/*.png` — front-camera RGB frames at 1280×720.
  - `../07-carla-scenarios/outputs/<scenario_id>/seg/*.png` — semantic segmentation, CARLA palette.
  - `../07-carla-scenarios/outputs/<scenario_id>/depth/*.npy` — metric depth in meters.
  - `../07-carla-scenarios/outputs/<scenario_id>/labels.json` — per-frame 2D and 3D bounding boxes in camera space.
  - `../07-carla-scenarios/outputs/<scenario_id>/hdmap/*.png` — top-down lane projection rasterized into camera space.
- Project 01 must be partially done — you need `data/nuscenes/v1.0-mini/` laid out the way that project's `setup.sh` arranges it, plus the official v1.0-trainval splits if you want a non-trivial real-data baseline. (The notebook degrades gracefully to v1.0-mini if that's all you have, but the detector mAP numbers will be noisy.)
- Hugging Face account with a Read token. Cosmos Transfer 2.5 checkpoints are gated behind the NVIDIA Open Model License Agreement — you must accept it once on the model page on Hugging Face, then `huggingface-cli login` with a token that has Read scope. The notebook will hard-fail with a helpful message if this isn't done.
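If you want to fail fast before spending any GPU time, a cheap existence check over the expected project-07 layout is enough. This is a minimal sketch (not part of the shipped notebook); the paths mirror the list above.

```python
# Hypothetical preflight helper: verify the project-07 outputs this project expects.
from pathlib import Path

CARLA_ROOT = Path("../07-carla-scenarios/outputs")
EXPECTED = {"rgb": "*.png", "seg": "*.png", "depth": "*.npy", "hdmap": "*.png"}

def check_scenario(scenario_dir: Path) -> list[str]:
    """Return a list of human-readable problems for one scenario directory."""
    problems = []
    for sub, pattern in EXPECTED.items():
        if not list((scenario_dir / sub).glob(pattern)):
            problems.append(f"{scenario_dir.name}: no {pattern} files under {sub}/")
    if not (scenario_dir / "labels.json").is_file():
        problems.append(f"{scenario_dir.name}: missing labels.json")
    return problems

if not CARLA_ROOT.is_dir():
    raise SystemExit(f"{CARLA_ROOT} not found — run project 07 first.")
for scenario in sorted(CARLA_ROOT.iterdir()):
    if scenario.is_dir():
        for problem in check_scenario(scenario):
            print("MISSING:", problem)
```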
The data flow is:
```
CARLA scenario (project 07)
├── RGB ──► (reference, not sent to Cosmos)
├── seg / depth / edge / hdmap ──► Cosmos Transfer 2.5 control inputs
└── 3D + 2D bounding boxes ──► label propagation onto Cosmos output
                  │
                  ▼
    photoreal "rainy night" video + boxes
                  │
                  ▼
    train YOLOv8 → eval on real nuScenes test
```
Hardware
This is not a laptop project. Cosmos Transfer 2.5-2B is a 2.36B-parameter diffusion model that needs ~65 GB of VRAM at 720p / 93-frame inference. The canonical hardware on the model card is H100, A100 80 GB, or GB200. A 4090 (24 GB) can technically run the distilled edge variant at 480p with 4 sampling steps if you are patient, but you will not get the full quality.
Practical options for the generation step:
| Option | Approx. cost | Notes |
|---|---|---|
| Lambda Labs on-demand H100 (1×) | ~$2.49/hr (2026) | Easiest; preconfigured CUDA 12.8 / 13.0 |
| Vast.ai spot H100 | ~$1.80–$3.50/hr | Cheaper, but expect a few interrupts |
| RunPod community cloud A100 80 GB | ~$1.40–$1.90/hr | Slightly slower, plenty for this project |
| Local 4090 (distilled, 480p) | $0/hr | Slow, lower quality, OK for sanity check |
| Shortcut: don't generate at all | $0/hr | See Cosmos-Drive-Dreams shortcut below |
A reasonable budget for end-to-end generation of ~200 short clips is 4–8 H100 hours, i.e. $10–$30, on top of detector training (~1 H100 hour).
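The notebook's hardware preflight (step 1 in the Steps section) boils down to a few lines of PyTorch. A minimal standalone sketch, assuming the same thresholds the notebook uses (>= 60 GB for the full model, >= 20 GB for the distilled-edge 480p path):

```python
# Minimal GPU preflight sketch: mirrors the checks described in the Steps section.
import torch

FULL_MODEL_GB = 60   # notebook threshold for the full 720p / 93-frame pipeline
DISTILLED_GB = 20    # notebook threshold for the distilled-edge 480p path

assert torch.cuda.is_available(), "No CUDA device visible — run this on the GPU box."
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
name = torch.cuda.get_device_name(0)

if total_gb >= FULL_MODEL_GB:
    print(f"{name}: {total_gb:.0f} GB — OK for the full 720p pipeline.")
elif total_gb >= DISTILLED_GB:
    print(f"{name}: {total_gb:.0f} GB — distilled-edge 480p path only.")
else:
    raise SystemExit(f"{name}: {total_gb:.0f} GB is below the 20 GB minimum.")
```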
Shortcut mode: skip generation, use the public dataset
If you cannot or do not want to spend on H100 time, NVIDIA Toronto released
the entire output of this pipeline on Hugging Face: 5,843 real RDS-HQ
clips + 81,802 Cosmos-generated synthetic clips (CC BY 4.0), at
nvidia/PhysicalAI-Autonomous-Vehicle-Cosmos-Drive-Dreams. The
notebook has a MODE = "shortcut" flag that skips Cosmos inference entirely
and treats the Cosmos-Drive-Dreams synthetic split as the "CARLA + Cosmos"
column of your results table. You lose the educational value of running the
control-input pipeline yourself, but you keep the measurement value of the
final detector benchmark — and the dataset weighs ~3 TB total, so you can
download just the synthetic + bbox parts (~150 GB) for this project.
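In shortcut mode you do not need the full ~3 TB; `huggingface_hub` can pull a filtered subset. The `allow_patterns` values below are placeholders, not the dataset's real folder names — check the dataset card and adjust them before running.

```python
# Sketch: partial download of the Cosmos-Drive-Dreams dataset for shortcut mode.
# The allow_patterns entries are illustrative placeholders — consult the dataset
# card for the actual directory names of the synthetic clips and bbox labels.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="nvidia/PhysicalAI-Autonomous-Vehicle-Cosmos-Drive-Dreams",
    repo_type="dataset",
    local_dir="data/cosmos-drive-dreams",
    allow_patterns=[
        "synthetic/**",   # placeholder: Cosmos-generated clips
        "labels/**",      # placeholder: bounding-box annotations
        "*.json",         # metadata / split files
    ],
)
```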
Setup
```bash
# from this project directory
./setup.sh                  # creates .venv, installs requirements
source .venv/bin/activate
huggingface-cli login       # paste a Read-scoped HF token
# accept the license once at:
#   https://huggingface.co/nvidia/Cosmos-Transfer2.5-2B
# (this is required, the checkpoint is gated)
```
Cosmos Transfer 2.5 itself is installed as a separate working tree (not as a
pip package — there isn't a stable PyPI distribution as of May 2026; the
upstream uses `uv`). `setup.sh` clones it into `vendor/cosmos-transfer2.5/`
and runs `uv sync --extra=cu128 --active --inexact` against the project venv.
The model checkpoints (nvidia/Cosmos-Transfer2.5-2B, ~12 GB across the four
control-modality heads — blur, edge, depth, segmentation) are auto-downloaded
on first inference run. You can also pre-pull with:
```bash
huggingface-cli download nvidia/Cosmos-Transfer2.5-2B --local-dir checkpoints/cosmos-transfer-2.5-2b
```
Set `HF_HOME=$PWD/checkpoints` if you want everything in the project tree.
License note. Cosmos Transfer 2.5 is released under the NVIDIA Open Model License. It is commercially usable, you may distribute derivative models, and NVIDIA does not claim ownership of generated outputs. Read the full text once, then move on.
Steps
The notebook is organized in 10 numbered sections. Run them top to bottom the first time; on subsequent runs you can jump straight to whatever you're iterating on.
1. Hardware preflight — `nvidia-smi`, `torch.cuda.is_available()`, VRAM check (>= 60 GB for the full model, >= 20 GB for the distilled-edge 480p path). The notebook hard-fails early with a clear error if you're below threshold.
2. Load CARLA outputs from project 07. RGB, seg, depth, hdmap, and the labels JSON. Sanity-render one frame with boxes overlaid.
3. Build conditioning inputs. Convert CARLA's seg palette to a 16-class driving-segmentation palette Cosmos expects. Encode depth as a Cosmos-style inverse-depth video. Compute Canny edges from the RGB. Render the HD-map raster as a separate control video. (HD-map and lidar are not native modalities in 2.5-2B's main checkpoint — see "Common pitfalls" below — so we fold the HD-map raster into the segmentation channel.) A per-frame conversion sketch follows after this list.
4. Pack control videos into 93-frame chunks. Cosmos 2.5 generates exactly 93 frames at 16 fps (~5.8 s per chunk). Pad or trim CARLA scenarios as needed.
5. Author `controlnet_specs.json`. One JSON per chunk, listing each control-modality video path and its `control_weight` (depth: 0.4, seg: 0.6, edge: 0.3 are reasonable starting weights — tune later). A spec-authoring sketch appears a little further below.
6. Run inference. `python -m examples.inference --params_file <spec>.json` from the cosmos-transfer2.5 working tree. Single-GPU is fine for ~10 chunks; for >50 chunks use `torchrun --nproc_per_node=$NUM_GPUS`. Outputs are 720p MP4 at 16 fps.
7. Propagate labels. CARLA boxes are in camera space; the Cosmos output has the same camera intrinsics and roughly the same per-frame layout, because the segmentation+depth control inputs constrain it. Project the 3D boxes into 2D, then run a small reconciliation pass: drop boxes whose centroid pixel is now occluded (compare to a re-extracted mask of the Cosmos frame). Be honest about boxes Cosmos invented or removed — count them, don't gloss over them.
8. Build three datasets in YOLOv8 format. Pure-CARLA, CARLA+Cosmos (CARLA RGB replaced by Cosmos RGB, labels propagated), and a real nuScenes split with the same image count and class set.
9. Train YOLOv8n on each dataset. 100 epochs, identical hyperparameters, same seed. Save weights.
10. Evaluate all three on real nuScenes test. mAP@0.5, mAP@0.5:0.95, per-class breakdown. Print the table. Write the interpretation paragraph in markdown directly in the notebook.
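To make step 3 concrete, here is a minimal per-frame sketch of the depth and edge conversions, using the inverse-depth formula quoted in "Common pitfalls" below. The palette remapping and video packing are left out; treat this as an illustration, not the notebook's full implementation.

```python
# Per-frame control-input conversion sketch (illustrative, not the full pipeline).
import cv2
import numpy as np

MAX_DEPTH_M = 80.0  # clamp distance used by the inverse-depth conversion

def metric_depth_to_inverse_uint8(depth_m: np.ndarray) -> np.ndarray:
    """CARLA metric depth (float32 meters) -> uint8 inverse-depth frame.
    Matches the formula in Common pitfalls: 255 * (1 - clip(d / max_depth, 0, 1))."""
    normalized = np.clip(depth_m / MAX_DEPTH_M, 0.0, 1.0)
    return (255.0 * (1.0 - normalized)).astype(np.uint8)

def rgb_to_canny(rgb: np.ndarray, low: int = 100, high: int = 200) -> np.ndarray:
    """Canny edge control frame from the CARLA RGB render (thresholds are a guess)."""
    gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
    return cv2.Canny(gray, low, high)

# Example usage on one frame pair:
# depth = np.load(".../depth/000123.npy")
# rgb = cv2.cvtColor(cv2.imread(".../rgb/000123.png"), cv2.COLOR_BGR2RGB)
# inv_depth = metric_depth_to_inverse_uint8(depth)
# edges = rgb_to_canny(rgb)
```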
A "User TODO" cell at the end invites you to re-run the whole thing with a
different prompt — "snow, dusk", "foggy morning, suburban" — and add
those numbers as additional rows.
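Back to step 5: a per-chunk spec can be authored straight from Python. The field names below are guesses at the schema — match them against the example specs shipped in the cosmos-transfer2.5 working tree before running inference; only the control weights (depth 0.4, seg 0.6, edge 0.3) come from the steps above.

```python
# Sketch: write one controlnet spec per 93-frame chunk.
# NOTE: key names here are assumptions — align them with the example JSON files
# in vendor/cosmos-transfer2.5/ before use. Only the weights come from this README.
import json
from pathlib import Path

def write_spec(chunk_dir: Path, prompt: str, out_path: Path) -> None:
    spec = {
        "prompt": prompt,
        "controls": {
            "depth": {"video": str(chunk_dir / "depth.mp4"), "control_weight": 0.4},
            "seg":   {"video": str(chunk_dir / "seg.mp4"),   "control_weight": 0.6},
            "edge":  {"video": str(chunk_dir / "edge.mp4"),  "control_weight": 0.3},
        },
    }
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(spec, indent=2))

write_spec(
    Path("outputs/chunks/scenario_0001_chunk_00"),   # hypothetical chunk directory
    "rainy night, city",
    Path("outputs/specs/scenario_0001_chunk_00.json"),
)
```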
Done criterion
You are done when all three of the following are checked off:
- Three mAP numbers in the results table, computed on the same nuScenes test split with the same evaluation script.
- A 2–4 sentence interpretation paragraph in the notebook that says (a) which mAP gap was closed and by how much, (b) at what compute cost, (c) what failure modes you observed in the Cosmos output that explain why it didn't close the gap entirely.
- A `results.csv` written to `outputs/results.csv` containing the three rows above plus columns for `n_train_images`, `train_seconds`, `gen_seconds`, `seed`. This is what you'd hand to a reviewer. (A minimal writer sketch appears at the end of this section.)
If your CARLA → CARLA+Cosmos delta is less than 2 mAP points, that is also a valid result — write it up honestly. Cosmos Transfer is not magic, and if your conditioning inputs were misaligned you may have made the photoreal data worse than the CARLA baseline. That's the lesson.
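The writer for the third checkbox can be trivial. A sketch with the columns named above; the values in the example call are placeholders to fill from your actual runs.

```python
# Sketch: append one row per training run to outputs/results.csv.
import csv
from pathlib import Path

COLUMNS = ["train_set", "map50", "n_train_images", "train_seconds", "gen_seconds", "seed"]

def append_result(row: dict) -> None:
    path = Path("outputs/results.csv")
    path.parent.mkdir(parents=True, exist_ok=True)
    new_file = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)

# Placeholder values — replace with real measurements.
append_result({"train_set": "pure_carla", "map50": 0.0, "n_train_images": 0,
               "train_seconds": 0, "gen_seconds": 0, "seed": 0})
```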
Common pitfalls
- Conditioning-input format mismatches. Cosmos Transfer 2.5-2B accepts blur, Canny edge, depth (inverse), and segmentation as control videos. It does not natively take HD-map or LiDAR — those are Cosmos-Drive-Dreams / Cosmos-Transfer1-7B-Sample-AV territory. The naive workaround is to rasterize the HD-map and bake it into the segmentation video as extra classes. The slightly better workaround is to use the Cosmos-Drive-Dreams sample-AV checkpoint instead; the notebook documents the swap.
- Prompt engineering matters more than you think. The same control videos with the prompt "rainy night, city" vs "rainy night, city, wet asphalt reflections, sodium vapor street lights, light fog, shot on Sony A7s, 35mm" will give visibly different generations and different downstream detector accuracy. Don't tune the prompt by eyeballing one frame; tune it by retraining the detector and looking at mAP. If you don't have time for that, at least pick one prompt and stick to it for reproducibility.
- Label propagation when Cosmos hallucinates. Cosmos sometimes adds pedestrians or cars that weren't in the CARLA scene (the prior on "city street" is strong). If you train a detector on labels that say "no person here" while the rendered pixels clearly show a person, you actively teach the detector to miss people. Two mitigations: (a) increase `control_weight` on segmentation to 0.8+ to suppress hallucination, at the cost of less photoreal output; (b) run a pretrained YOLO over each Cosmos frame and either drop the frame entirely if it has many extra detections, or merge the new boxes in as pseudo-labels. The notebook shows option (b). (A frame-flagging sketch along these lines appears after this list.)
- Deterministic seeding is harder than `torch.manual_seed(0)`. Cosmos uses CUDA random number generators in attention dropout and in the diffusion sampler. To get bit-exact reproducibility you need `torch.manual_seed`, `numpy.random.seed`, `random.seed`, `torch.cuda.manual_seed_all`, AND `--seed` on the inference CLI, AND `torch.use_deterministic_algorithms(True)` (which slows things down ~15%). The notebook sets all of these. Even so, expect ~0.1 mAP run-to-run noise from cuDNN nondeterminism; report mean ± std over 3 seeds if you want to be rigorous.
- VRAM budgeting on a single H100. Generating 93 frames at 720p with four control modalities runs ~65 GB. If you ALSO try to keep the YOLO training process resident, you OOM. Generate first, write to disk, then train; do not pipeline them on the same GPU.
- CARLA depth is metric, Cosmos wants inverse normalized. CARLA's `depth.npy` is meters in float32. Cosmos' depth control video is a uint8 inverse-depth video. The conversion is `255 * (1 - clip(depth / max_depth, 0, 1))` with `max_depth = 80.0`. Get the conversion wrong and the model decides every distant car is right next to the camera.
- YOLOv8 class mismatch. CARLA, nuScenes, and your Cosmos output all need the same class taxonomy. nuScenes has 23 classes; CARLA has 28; we collapse to a 6-class union (`car`, `truck`, `bus`, `person`, `bicycle`, `motorcycle`) in `nuscenes_class_map.py`. Don't forget to apply the same collapse to the val/test sets.
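As a lightweight companion to mitigation (b), you can at least flag frames where a pretrained detector sees far more road users than the propagated labels claim. This sketch is not part of the notebook; it uses the COCO-pretrained YOLOv8n from ultralytics purely as a sanity filter, and the slack threshold is arbitrary.

```python
# Sketch (not in the notebook): flag Cosmos frames where a COCO-pretrained YOLOv8n
# sees noticeably more road users than the labels propagated from CARLA claim.
# Assumes the ultralytics package is installed (it is needed for training anyway).
from ultralytics import YOLO

_checker = YOLO("yolov8n.pt")            # used purely as a sanity filter
RELEVANT_COCO_IDS = {0, 1, 2, 3, 5, 7}   # person, bicycle, car, motorcycle, bus, truck

def frame_looks_hallucinated(image_path: str, n_propagated_boxes: int,
                             slack: int = 2, conf: float = 0.4) -> bool:
    """True if the detector finds more relevant objects than the propagated labels
    account for, plus some slack. The slack value is arbitrary — tune it."""
    result = _checker(image_path, conf=conf, verbose=False)[0]
    n_detected = sum(int(c) in RELEVANT_COCO_IDS for c in result.boxes.cls.tolist())
    return n_detected > n_propagated_boxes + slack

# Usage: drop (or send to manual review) any frame that comes back True, or switch
# to merging the extra detections as pseudo-labels (the notebook's option b).
```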
Further reading
- Cosmos Transfer 2.5 docs — https://docs.nvidia.com/cosmos/latest/transfer2.5/index.html
- Cosmos Transfer 2.5 GitHub — https://github.com/nvidia-cosmos/cosmos-transfer2.5
- Cosmos-Drive-Dreams paper — arXiv:2506.09042
- Cosmos-Drive-Dreams dataset on HF — https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicle-Cosmos-Drive-Dreams
- Cosmos cookbook (control modalities deep-dive) — https://nvidia-cosmos.github.io/cosmos-cookbook/core_concepts/control_modalities/overview.html
- NVIDIA Open Model License (the actual text, read it once) — https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license
- CARLA Leaderboard 2.0 — https://leaderboard.carla.org/ (for the scenario conventions used in project 07)
- Applied Intuition Spectral / Neural Sim — public product pages and blog posts; this is the productized version of what you are building here.
Files in this project
- README.md
- notebook.py
- requirements.txt
- setup.sh
The notebook (`notebook.py`) is in jupytext percent format — open it in VS Code or convert it with `jupytext --to notebook notebook.py`.