Project 09 — CARLA + Cosmos Transfer 2.5 hybrid sim, with a sim-to-real gap measurement
TL;DR. Render a CARLA scenario with full ground truth (boxes, masks, depth, HD-map projections), feed those renders into NVIDIA Cosmos Transfer 2.5 as control inputs under a "rainy night, city" prompt, and propagate the CARLA labels onto the photoreal output. Train a small detector (YOLOv8 or DETR) on three datasets — pure CARLA, CARLA + Cosmos Transfer, real nuScenes — and benchmark each on the real nuScenes test split. Report three mAP numbers and write up the gap quantitatively.
Goal
The single deliverable is a results table with three rows and one mAP column:
| Train set | mAP@0.5 on nuScenes test |
|---|---|
| Pure CARLA (synthetic only) | ? |
| CARLA + Cosmos-Transfer 2.5 | ? |
| Real nuScenes (same image budget) | ? |
…plus an interpretation paragraph that says how much of the CARLA-vs-real gap was closed by running the photoreal world model. This is a measurement project, not a leaderboard chase. You are quantifying a productization hypothesis, not maximizing a number.
Loops touched
This project lives in two of the seven AV/robotics data loops we use as a mental map:
- COLLECT (synthetic) — CARLA gives us deterministic, replayable, labeled ground truth. Cosmos Transfer rephotographs that ground truth into a target visual domain. The combination is exactly what NVIDIA calls "neural-rendered classical sim" and what Applied Intuition is productizing through Spectral (procedural assets) + Neural Sim (world-model rephotograph).
- LABEL & TRAIN — labels do not come from auto-labeling here; they come from the simulator's perfect ground truth. The interesting engineering question is: how do those labels survive Cosmos's stochastic resynthesis? (Spoiler: imperfectly. We measure it.)
We do not touch DEPLOY, MEASURE-IN-FIELD, EVAL or CURATE in this project. Project 17 (BDD100K data engine) handles the curate/measure loops.
Why this matters for AI Data Intelligence
Every AV company eventually faces the same wall: classical CARLA-style simulation gives you cheap, perfectly labeled data, but the visual domain gap to real-world cameras keeps a model trained purely on it from working in the field. Domain randomization helps a little; large diffusion-based world models help a lot more — Cosmos-Drive-Dreams (NVIDIA Toronto, arXiv 2506.09042) shows >10 mAP improvement over pure-sim baselines on real-world AV detection benchmarks, with the same labels, simply by running the sim through Cosmos.
This is the exact hybrid pattern Applied Intuition is shipping:
- Spectral: classical, ECU-in-the-loop, deterministic sim with full ground truth (the CARLA half of this project).
- Neural Sim: large generative world model that rephotographs the classical render under arbitrary visual conditions (the Cosmos half).
If you can articulate, with numbers, "we closed X% of the sim-to-real mAP gap using Cosmos at $Y/clip of compute, and here are the three failure modes we observed," you are no longer marketing the product. You are doing science on it. That distinction is the entire job in a Data Intelligence org: turn a tool launch into a measured, defensible claim.
Prerequisites
- Project 07 must be done first. This project is downstream of the CARLA scenarios you generated in `../07-carla-scenarios/`. Specifically, we expect on disk (a quick layout check is sketched after this list):
  - `../07-carla-scenarios/outputs/<scenario_id>/rgb/*.png` — front-camera RGB frames at 1280×720.
  - `../07-carla-scenarios/outputs/<scenario_id>/seg/*.png` — semantic segmentation, CARLA palette.
  - `../07-carla-scenarios/outputs/<scenario_id>/depth/*.npy` — metric depth in meters.
  - `../07-carla-scenarios/outputs/<scenario_id>/labels.json` — per-frame 2D and 3D bounding boxes in camera space.
  - `../07-carla-scenarios/outputs/<scenario_id>/hdmap/*.png` — top-down lane projection rasterized into camera space.
- Project 01 must be partially done — you need `data/nuscenes/v1.0-mini/` laid out the way that project's `setup.sh` arranges it, plus the official v1.0-trainval splits if you want a non-trivial real-data baseline. (The notebook degrades gracefully to v1.0-mini if that's all you have, but the detector mAP numbers will be noisy.)
- Hugging Face account with a Read token. Cosmos Transfer 2.5 checkpoints are gated behind the NVIDIA Open Model License Agreement — you must accept it once on the model page on Hugging Face, then `huggingface-cli login` with a token that has Read scope. The notebook will hard-fail with a helpful message if this isn't done.
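If you want to fail fast before spending any GPU time, a cheap existence check over the expected project-07 layout is enough. This is a minimal sketch (not part of the shipped notebook); the paths mirror the list above.

```python
# Hypothetical preflight helper: verify the project-07 outputs this project expects.
from pathlib import Path

CARLA_ROOT = Path("../07-carla-scenarios/outputs")
EXPECTED = {"rgb": "*.png", "seg": "*.png", "depth": "*.npy", "hdmap": "*.png"}

def check_scenario(scenario_dir: Path) -> list[str]:
    """Return a list of human-readable problems for one scenario directory."""
    problems = []
    for sub, pattern in EXPECTED.items():
        if not list((scenario_dir / sub).glob(pattern)):
            problems.append(f"{scenario_dir.name}: no {pattern} files under {sub}/")
    if not (scenario_dir / "labels.json").is_file():
        problems.append(f"{scenario_dir.name}: missing labels.json")
    return problems

if not CARLA_ROOT.is_dir():
    raise SystemExit(f"{CARLA_ROOT} not found — run project 07 first.")
for scenario in sorted(CARLA_ROOT.iterdir()):
    if scenario.is_dir():
        for problem in check_scenario(scenario):
            print("MISSING:", problem)
```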
The data flow is:
```
CARLA scenario (project 07)
├── RGB ──► (reference, not sent to Cosmos)
├── seg / depth / edge / hdmap ──► Cosmos Transfer 2.5 control inputs
└── 3D + 2D bounding boxes ──► label propagation onto Cosmos output
                  │
                  ▼
    photoreal "rainy night" video + boxes
                  │
                  ▼
    train YOLOv8 → eval on real nuScenes test
```
Hardware
This is not a laptop project. Cosmos Transfer 2.5-2B is a 2.36B-parameter diffusion model that needs ~65 GB of VRAM at 720p / 93-frame inference. The canonical hardware on the model card is H100, A100 80 GB, or GB200. A 4090 (24 GB) can technically run the distilled edge variant at 480p with 4 sampling steps if you are patient, but you will not get the full quality.
Practical options for the generation step:
| Option | Approx. cost | Notes |
|---|---|---|
| Lambda Labs on-demand H100 (1×) | ~$2.49/hr (2026) | Easiest; preconfigured CUDA 12.8 / 13.0 |
| Vast.ai spot H100 | ~$1.80–$3.50/hr | Cheaper, but expect a few interrupts |
| RunPod community cloud A100 80 GB | ~$1.40–$1.90/hr | Slightly slower, plenty for this project |
| Local 4090 (distilled, 480p) | $0/hr | Slow, lower quality, OK for sanity check |
| Shortcut: don't generate at all | $0/hr | See Cosmos-Drive-Dreams shortcut below |
A reasonable budget for end-to-end generation of ~200 short clips is 4–8 H100 hours, i.e. $10–$30, on top of detector training (~1 H100 hour).
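The notebook's hardware preflight (step 1 in the Steps section) boils down to a few lines of PyTorch. A minimal standalone sketch, assuming the same thresholds the notebook uses (>= 60 GB for the full model, >= 20 GB for the distilled-edge 480p path):

```python
# Minimal GPU preflight sketch: mirrors the checks described in the Steps section.
import torch

FULL_MODEL_GB = 60   # notebook threshold for the full 720p / 93-frame pipeline
DISTILLED_GB = 20    # notebook threshold for the distilled-edge 480p path

assert torch.cuda.is_available(), "No CUDA device visible — run this on the GPU box."
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
name = torch.cuda.get_device_name(0)

if total_gb >= FULL_MODEL_GB:
    print(f"{name}: {total_gb:.0f} GB — OK for the full 720p pipeline.")
elif total_gb >= DISTILLED_GB:
    print(f"{name}: {total_gb:.0f} GB — distilled-edge 480p path only.")
else:
    raise SystemExit(f"{name}: {total_gb:.0f} GB is below the 20 GB minimum.")
```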
Shortcut mode: skip generation, use the public dataset
If you cannot or do not want to spend on H100 time, NVIDIA Toronto released
the entire output of this pipeline on Hugging Face: 5,843 real RDS-HQ
clips + 81,802 Cosmos-generated synthetic clips (CC BY 4.0), at
nvidia/PhysicalAI-Autonomous-Vehicle-Cosmos-Drive-Dreams. The
notebook has a MODE = "shortcut" flag that skips Cosmos inference entirely
and treats the Cosmos-Drive-Dreams synthetic split as the "CARLA + Cosmos"
column of your results table. You lose the educational value of running the
control-input pipeline yourself, but you keep the measurement value of the
final detector benchmark — and the dataset weighs ~3 TB total, so you can
download just the synthetic + bbox parts (~150 GB) for this project.
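In shortcut mode you do not need the full ~3 TB; `huggingface_hub` can pull a filtered subset. The `allow_patterns` values below are placeholders, not the dataset's real folder names — check the dataset card and adjust them before running.

```python
# Sketch: partial download of the Cosmos-Drive-Dreams dataset for shortcut mode.
# The allow_patterns entries are illustrative placeholders — consult the dataset
# card for the actual directory names of the synthetic clips and bbox labels.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="nvidia/PhysicalAI-Autonomous-Vehicle-Cosmos-Drive-Dreams",
    repo_type="dataset",
    local_dir="data/cosmos-drive-dreams",
    allow_patterns=[
        "synthetic/**",   # placeholder: Cosmos-generated clips
        "labels/**",      # placeholder: bounding-box annotations
        "*.json",         # metadata / split files
    ],
)
```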
Setup
```bash
# from this project directory
./setup.sh                  # creates .venv, installs requirements
source .venv/bin/activate
huggingface-cli login       # paste a Read-scoped HF token
# accept the license once at:
#   https://huggingface.co/nvidia/Cosmos-Transfer2.5-2B
# (this is required, the checkpoint is gated)
```
Cosmos Transfer 2.5 itself is installed as a separate working tree (not as a
pip package — there isn't a stable PyPI distribution as of May 2026; the
upstream uses `uv`). `setup.sh` clones it into `vendor/cosmos-transfer2.5/`
and runs `uv sync --extra=cu128 --active --inexact` against the project venv.
The model checkpoints (nvidia/Cosmos-Transfer2.5-2B, ~12 GB across the four
control-modality heads — blur, edge, depth, segmentation) are auto-downloaded
on first inference run. You can also pre-pull with:
```bash
huggingface-cli download nvidia/Cosmos-Transfer2.5-2B --local-dir checkpoints/cosmos-transfer-2.5-2b
```
Set `HF_HOME=$PWD/checkpoints` if you want everything in the project tree.
License note. Cosmos Transfer 2.5 is released under the NVIDIA Open Model License. It is commercially usable, you may distribute derivative models, and NVIDIA does not claim ownership of generated outputs. Read the full text once, then move on.
Steps
The notebook is organized in 10 numbered sections. Run them top to bottom the first time; on subsequent runs you can jump straight to whatever you're iterating on.
1. Hardware preflight — `nvidia-smi`, `torch.cuda.is_available()`, VRAM check (>= 60 GB for the full model, >= 20 GB for the distilled-edge 480p path). The notebook hard-fails early with a clear error if you're below threshold.
2. Load CARLA outputs from project 07. RGB, seg, depth, hdmap, and the labels JSON. Sanity-render one frame with boxes overlaid.
3. Build conditioning inputs. Convert CARLA's seg palette to a 16-class driving-segmentation palette Cosmos expects. Encode depth as a Cosmos-style inverse-depth video. Compute Canny edges from the RGB. Render the HD-map raster as a separate control video. (HD-map and lidar are not native modalities in 2.5-2B's main checkpoint — see "Common pitfalls" below — so we fold the HD-map raster into the segmentation channel.) A per-frame conversion sketch follows after this list.
4. Pack control videos into 93-frame chunks. Cosmos 2.5 generates exactly 93 frames at 16 fps (~5.8 s per chunk). Pad or trim CARLA scenarios as needed.
5. Author `controlnet_specs.json`. One JSON per chunk, listing each control-modality video path and its `control_weight` (depth: 0.4, seg: 0.6, edge: 0.3 are reasonable starting weights — tune later). A spec-authoring sketch appears a little further below.
6. Run inference. `python -m examples.inference --params_file <spec>.json` from the cosmos-transfer2.5 working tree. Single-GPU is fine for ~10 chunks; for >50 chunks use `torchrun --nproc_per_node=$NUM_GPUS`. Outputs are 720p MP4 at 16 fps.
7. Propagate labels. CARLA boxes are in camera space; the Cosmos output has the same camera intrinsics and roughly the same per-frame layout, because the segmentation+depth control inputs constrain it. Project the 3D boxes into 2D, then run a small reconciliation pass: drop boxes whose centroid pixel is now occluded (compare to a re-extracted mask of the Cosmos frame). Be honest about boxes Cosmos invented or removed — count them, don't gloss over them.
8. Build three datasets in YOLOv8 format. Pure-CARLA, CARLA+Cosmos (CARLA RGB replaced by Cosmos RGB, labels propagated), and a real nuScenes split with the same image count and class set.
9. Train YOLOv8n on each dataset. 100 epochs, identical hyperparameters, same seed. Save weights.
10. Evaluate all three on real nuScenes test. mAP@0.5, mAP@0.5:0.95, per-class breakdown. Print the table. Write the interpretation paragraph in markdown directly in the notebook.
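To make step 3 concrete, here is a minimal per-frame sketch of the depth and edge conversions, using the inverse-depth formula quoted in "Common pitfalls" below. The palette remapping and video packing are left out; treat this as an illustration, not the notebook's full implementation.

```python
# Per-frame control-input conversion sketch (illustrative, not the full pipeline).
import cv2
import numpy as np

MAX_DEPTH_M = 80.0  # clamp distance used by the inverse-depth conversion

def metric_depth_to_inverse_uint8(depth_m: np.ndarray) -> np.ndarray:
    """CARLA metric depth (float32 meters) -> uint8 inverse-depth frame.
    Matches the formula in Common pitfalls: 255 * (1 - clip(d / max_depth, 0, 1))."""
    normalized = np.clip(depth_m / MAX_DEPTH_M, 0.0, 1.0)
    return (255.0 * (1.0 - normalized)).astype(np.uint8)

def rgb_to_canny(rgb: np.ndarray, low: int = 100, high: int = 200) -> np.ndarray:
    """Canny edge control frame from the CARLA RGB render (thresholds are a guess)."""
    gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
    return cv2.Canny(gray, low, high)

# Example usage on one frame pair:
# depth = np.load(".../depth/000123.npy")
# rgb = cv2.cvtColor(cv2.imread(".../rgb/000123.png"), cv2.COLOR_BGR2RGB)
# inv_depth = metric_depth_to_inverse_uint8(depth)
# edges = rgb_to_canny(rgb)
```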
A "User TODO" cell at the end invites you to re-run the whole thing with a
different prompt — "snow, dusk", "foggy morning, suburban" — and add
those numbers as additional rows.
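Back to step 5: a per-chunk spec can be authored straight from Python. The field names below are guesses at the schema — match them against the example specs shipped in the cosmos-transfer2.5 working tree before running inference; only the control weights (depth 0.4, seg 0.6, edge 0.3) come from the steps above.

```python
# Sketch: write one controlnet spec per 93-frame chunk.
# NOTE: key names here are assumptions — align them with the example JSON files
# in vendor/cosmos-transfer2.5/ before use. Only the weights come from this README.
import json
from pathlib import Path

def write_spec(chunk_dir: Path, prompt: str, out_path: Path) -> None:
    spec = {
        "prompt": prompt,
        "controls": {
            "depth": {"video": str(chunk_dir / "depth.mp4"), "control_weight": 0.4},
            "seg":   {"video": str(chunk_dir / "seg.mp4"),   "control_weight": 0.6},
            "edge":  {"video": str(chunk_dir / "edge.mp4"),  "control_weight": 0.3},
        },
    }
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(spec, indent=2))

write_spec(
    Path("outputs/chunks/scenario_0001_chunk_00"),   # hypothetical chunk directory
    "rainy night, city",
    Path("outputs/specs/scenario_0001_chunk_00.json"),
)
```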
Done criterion
You are done when all three of the following are checked off:
- Three mAP numbers in the results table, computed on the same nuScenes test split with the same evaluation script.
- A 2–4 sentence interpretation paragraph in the notebook that says (a) which mAP gap was closed and by how much, (b) at what compute cost, (c) what failure modes you observed in the Cosmos output that explain why it didn't close the gap entirely.
- A `results.csv` written to `outputs/results.csv` containing the three rows above plus columns for `n_train_images`, `train_seconds`, `gen_seconds`, `seed`. This is what you'd hand to a reviewer. (A minimal writer sketch appears at the end of this section.)
If your CARLA → CARLA+Cosmos delta is less than 2 mAP points, that is also a valid result — write it up honestly. Cosmos Transfer is not magic, and if your conditioning inputs were misaligned you may have made the photoreal data worse than the CARLA baseline. That's the lesson.
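The writer for the third checkbox can be trivial. A sketch with the columns named above; the values in the example call are placeholders to fill from your actual runs.

```python
# Sketch: append one row per training run to outputs/results.csv.
import csv
from pathlib import Path

COLUMNS = ["train_set", "map50", "n_train_images", "train_seconds", "gen_seconds", "seed"]

def append_result(row: dict) -> None:
    path = Path("outputs/results.csv")
    path.parent.mkdir(parents=True, exist_ok=True)
    new_file = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)

# Placeholder values — replace with real measurements.
append_result({"train_set": "pure_carla", "map50": 0.0, "n_train_images": 0,
               "train_seconds": 0, "gen_seconds": 0, "seed": 0})
```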
Common pitfalls
- Conditioning-input format mismatches. Cosmos Transfer 2.5-2B accepts blur, Canny edge, depth (inverse), and segmentation as control videos. It does not natively take HD-map or LiDAR — those are Cosmos-Drive-Dreams / Cosmos-Transfer1-7B-Sample-AV territory. The naive workaround is to rasterize the HD-map and bake it into the segmentation video as extra classes. The slightly better workaround is to use the Cosmos-Drive-Dreams sample-AV checkpoint instead; the notebook documents the swap.
- Prompt engineering matters more than you think. The same control videos with the prompt "rainy night, city" vs "rainy night, city, wet asphalt reflections, sodium vapor street lights, light fog, shot on Sony A7s, 35mm" will give visibly different generations and different downstream detector accuracy. Don't tune the prompt by eyeballing one frame; tune it by retraining the detector and looking at mAP. If you don't have time for that, at least pick one prompt and stick to it for reproducibility.
- Label propagation when Cosmos hallucinates. Cosmos sometimes adds pedestrians or cars that weren't in the CARLA scene (the prior on "city street" is strong). If you train a detector on labels that say "no person here" while the rendered pixels clearly show a person, you actively teach the detector to miss people. Two mitigations: (a) increase `control_weight` on segmentation to 0.8+ to suppress hallucination, at the cost of less photoreal output; (b) run a pretrained YOLO over each Cosmos frame and either drop the frame entirely if it has many extra detections, or merge the new boxes in as pseudo-labels. The notebook shows option (b). (A frame-flagging sketch along these lines appears after this list.)
- Deterministic seeding is harder than `torch.manual_seed(0)`. Cosmos uses CUDA random number generators in attention dropout and in the diffusion sampler. To get bit-exact reproducibility you need `torch.manual_seed`, `numpy.random.seed`, `random.seed`, `torch.cuda.manual_seed_all`, AND `--seed` on the inference CLI, AND `torch.use_deterministic_algorithms(True)` (which slows things down ~15%). The notebook sets all of these. Even so, expect ~0.1 mAP run-to-run noise from cuDNN nondeterminism; report mean ± std over 3 seeds if you want to be rigorous.
- VRAM budgeting on a single H100. Generating 93 frames at 720p with four control modalities runs ~65 GB. If you ALSO try to keep the YOLO training process resident, you OOM. Generate first, write to disk, then train; do not pipeline them on the same GPU.
- CARLA depth is metric, Cosmos wants inverse normalized. CARLA's `depth.npy` is meters in float32. Cosmos' depth control video is a uint8 inverse-depth video. The conversion is `255 * (1 - clip(depth / max_depth, 0, 1))` with `max_depth = 80.0`. Get the conversion wrong and the model decides every distant car is right next to the camera.
- YOLOv8 class mismatch. CARLA, nuScenes, and your Cosmos output all need the same class taxonomy. nuScenes has 23 classes; CARLA has 28; we collapse to a 6-class union (`car`, `truck`, `bus`, `person`, `bicycle`, `motorcycle`) in `nuscenes_class_map.py`. Don't forget to apply the same collapse to the val/test sets.
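As a lightweight companion to mitigation (b), you can at least flag frames where a pretrained detector sees far more road users than the propagated labels claim. This sketch is not part of the notebook; it uses the COCO-pretrained YOLOv8n from ultralytics purely as a sanity filter, and the slack threshold is arbitrary.

```python
# Sketch (not in the notebook): flag Cosmos frames where a COCO-pretrained YOLOv8n
# sees noticeably more road users than the labels propagated from CARLA claim.
# Assumes the ultralytics package is installed (it is needed for training anyway).
from ultralytics import YOLO

_checker = YOLO("yolov8n.pt")            # used purely as a sanity filter
RELEVANT_COCO_IDS = {0, 1, 2, 3, 5, 7}   # person, bicycle, car, motorcycle, bus, truck

def frame_looks_hallucinated(image_path: str, n_propagated_boxes: int,
                             slack: int = 2, conf: float = 0.4) -> bool:
    """True if the detector finds more relevant objects than the propagated labels
    account for, plus some slack. The slack value is arbitrary — tune it."""
    result = _checker(image_path, conf=conf, verbose=False)[0]
    n_detected = sum(int(c) in RELEVANT_COCO_IDS for c in result.boxes.cls.tolist())
    return n_detected > n_propagated_boxes + slack

# Usage: drop (or send to manual review) any frame that comes back True, or switch
# to merging the extra detections as pseudo-labels (the notebook's option b).
```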
Further reading
- Cosmos Transfer 2.5 docs — https://docs.nvidia.com/cosmos/latest/transfer2.5/index.html
- Cosmos Transfer 2.5 GitHub — https://github.com/nvidia-cosmos/cosmos-transfer2.5
- Cosmos-Drive-Dreams paper — arXiv:2506.09042
- Cosmos-Drive-Dreams dataset on HF — https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicle-Cosmos-Drive-Dreams
- Cosmos cookbook (control modalities deep-dive) — https://nvidia-cosmos.github.io/cosmos-cookbook/core_concepts/control_modalities/overview.html
- NVIDIA Open Model License (the actual text, read it once) — https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license
- CARLA Leaderboard 2.0 — https://leaderboard.carla.org/ (for the scenario conventions used in project 07)
- Applied Intuition Spectral / Neural Sim — public product pages and blog posts; this is the productized version of what you are building here.
Files in this project
- README.md
- notebook.py
- requirements.txt
- setup.sh
The notebook (`notebook.py`) is in jupytext percent format — open it in VS Code or convert it with `jupytext --to notebook notebook.py`.