Physical AI
Project 10 · Phase D · Simulation and world models · Hardware: H100 / A100 80GB
Loops: COLLECT · EVAL

05 — Cosmos Predict 2 Action-Conditioned Rollout vs nuScenes Ground Truth

Take a real nuScenes scene, hand the first camera frame plus the ego trajectory to NVIDIA Cosmos Predict 2 (the Video2World variant), let the world model roll out 5 seconds of forward video, then compare it pixel-for-pixel and trajectory-for-trajectory against the ground-truth recording. Produce numbers, not vibes.

Goal

By the end of this project you should be able to answer, with calibrated specifics:

  • For a 5-second open-loop rollout from a single front-camera frame plus the ego control sequence, how close does Cosmos Predict 2 stay to the actual recorded video?
  • How does similarity decay over time? (LPIPS@1s, LPIPS@2s, ..., LPIPS@5s; same for SSIM.)
  • How well does the model follow the action conditioning you provide? (Lateral and longitudinal drift in meters, measured by re-localizing the synthesized ego from generated frames.)
  • What classes of failure does it produce, and which ones matter for downstream uses?

The deliverables are a reproducible notebook, four quantitative numbers (LPIPS@5s, SSIM@5s, lateral drift in meters, longitudinal drift in meters), drift curves over time, and a written failure-mode taxonomy.

Loops touched

  • COLLECT — Cosmos Predict 2 is a generator, so this project sits in the synthetic-data branch of the COLLECT loop. The output is candidate clips to feed downstream training, not real sensor data.
  • EVAL — The whole point of the exercise is evaluating the generator itself against real ground truth. The metrics here are exactly what you would put on a dashboard if you were the engineer responsible for making "is Cosmos good enough for X?" calls.

This project does not touch the CURATE/LABEL loops — there is no labeling work — and it does not touch the TRAIN loop except by providing a calibrated input to the question "should we use Cosmos rollouts as augmentation?".

Why this matters for AI Data Intelligence

Cosmos Predict 2 is, as of 2026, the most public, most-cited reference world model for autonomous-vehicle and robotics customers. Applied Intuition's Data Intelligence customers will ask, in this order:

  1. "Can we use Cosmos for closed-loop planner evaluation?"
  2. "Can we use Cosmos to generate edge cases we don't have in the fleet?"
  3. "Can we use Cosmos rollouts to augment our training set?"

Every one of those is a different question with a different answer, and every one of them turns on the same underlying numbers: how much does the generated video drift from the ground truth, and how faithful is it to the action conditioning you give it? If you cannot answer that with a graph, you cannot make the call — and you certainly cannot challenge an over-confident customer or vendor.

The industry currently lives in two camps. One camp has read the Cosmos paper and Cosmos blog posts and concluded "it works." The other camp has tried to integrate it and concluded "it drifts." The valuable position — the one a Forward-Deployed Engineer or Data Intelligence lead should hold — is the third one: "here's the LPIPS curve, here's the lateral drift in meters, here's the regime where it's useful and here's the regime where it isn't." That is what this project produces.

A second, slightly subtler reason this matters: action-conditioning fidelity is the load-bearing property for any closed-loop use. A world model that hallucinates a beautiful video in the rough vicinity of the input scene is a great demo and a bad simulator. The only way to tell which one you have is to compare commanded trajectory vs realized trajectory inside the generation. Practitioners who have not done this have no opinion worth listening to.
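
To make "commanded vs realized" concrete: once both trajectories are expressed in the initial ego frame at matched timestamps, the drift numbers are just the along-track and cross-track components of the endpoint error. A minimal sketch, assuming both trajectories are (N, 2) numpy arrays of (x, y) positions with x forward and y left; the function name is illustrative:

import numpy as np

def drift_at_horizon(commanded_xy: np.ndarray, realized_xy: np.ndarray):
    """Longitudinal and lateral drift at the final timestep, in meters."""
    # Unit tangent of the commanded path at its end, used to split the
    # endpoint error into along-track (longitudinal) and cross-track (lateral).
    tangent = commanded_xy[-1] - commanded_xy[-2]
    tangent = tangent / np.linalg.norm(tangent)
    normal = np.array([-tangent[1], tangent[0]])

    error = realized_xy[-1] - commanded_xy[-1]
    return float(error @ tangent), float(error @ normal)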

Prerequisites

  • Python 3.10 (Cosmos Predict 2 requires exactly 3.10).
  • A nuScenes mini split download (~4 GB) from https://www.nuscenes.org/nuscenes#download. The mini split is enough; you do not need the full ~300 GB.
  • A Hugging Face account and a read-scoped access token, accepted into the NVIDIA Cosmos gated repos: nvidia/Cosmos-Predict2-2B-Video2World (and optionally the 14B and 0.6B variants).
  • ffmpeg installed on the system (brew install ffmpeg or apt install ffmpeg).
  • Comfort with PyTorch tensors and basic geometry (rotation quaternions, ego-frame transforms).
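
The "ego-frame transforms" prerequisite amounts to rotating global displacement vectors into the previous ego frame and taking the yaw component of the relative rotation, which is exactly what step 3 below needs. A minimal sketch using pyquaternion (pulled in by nuscenes-devkit), assuming poses is a time-ordered list of nuScenes ego_pose records:

import numpy as np
from pyquaternion import Quaternion

def ego_frame_deltas(poses):
    """Per-timestep (dx, dy, dheading) expressed in the previous ego frame."""
    deltas = []
    for prev, cur in zip(poses[:-1], poses[1:]):
        q_prev = Quaternion(prev["rotation"])   # nuScenes stores [w, x, y, z]
        q_cur = Quaternion(cur["rotation"])
        # Global displacement, rotated into the previous ego frame.
        disp = q_prev.inverse.rotate(np.array(cur["translation"]) - np.array(prev["translation"]))
        # Heading change about the up axis (yaw of the relative rotation).
        dheading = (q_prev.inverse * q_cur).yaw_pitch_roll[0]
        deltas.append((disp[0], disp[1], dheading))
    return np.array(deltas)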

Hardware

This is the awkward part, so be honest with yourself before you start.

  • NVIDIA H100-80GB or A100-80GB is the recommended target. The Cosmos team explicitly recommends ≥64 GB VRAM for the 14B and reports ~32 GB peak for the 2B Video2World during inference at 720p/16fps.
  • A single H100 80GB runs the 2B Video2World comfortably; expect ~30–60 seconds per 5-second clip.
  • A single H100 80GB runs the 14B Video2World if you offload the prompt refiner and guardrail to CPU (the published figure is ~49 GB for 14B text2image at 720p; Video2World peaks a bit higher).
  • A single A100 40GB will OOM the 14B and is tight for the 2B at 720p — drop to 480p or use the 2B with 8-bit weight loading.
  • No GPU at all — this notebook will not run end-to-end. The notebook is structured so cells that require the GPU are flagged; the data-loading and metric cells will run on CPU against a saved video.

A practical fallback: there is a Cosmos-Predict2-0.6B-Text2Image variant (text-to-image only, not video2world) that fits in ~12 GB. It will not produce 5-second video; it can be used to sanity-check the install and the conditioning format. Use it only as a smoke test.

License note: NVIDIA Cosmos models are released under the NVIDIA Open Model License Agreement, which permits commercial use with attribution and a few restrictions (no use to train competing foundation models, etc.). Read it once before you ship anything downstream.
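
Before paying for a long run, it is worth reproducing the hardware check that the first notebook cell performs (step 1 in the Steps section). A minimal sketch using standard PyTorch calls; the 30 GB threshold mirrors the notebook's hard-fail:

import torch

assert torch.cuda.is_available(), "No CUDA device visible; the GPU cells will not run."
device = torch.cuda.current_device()
free_bytes, total_bytes = torch.cuda.mem_get_info(device)
print(f"Device: {torch.cuda.get_device_name(device)}")
print(f"Free VRAM: {free_bytes / 1e9:.1f} GB of {total_bytes / 1e9:.1f} GB")
if free_bytes < 30e9:
    raise RuntimeError("Under 30 GB free VRAM; the 2B Video2World rollout will likely OOM.")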

Setup

cd /Users/zhiruifeng/Workspace/dev/PhysicalAI/projects/10-cosmos-predict-rollout
bash setup.sh

setup.sh does five things:

  1. Installs uv if you do not have it.
  2. Creates a Python 3.10 virtualenv at .venv and installs cosmos-predict2[cu126] from NVIDIA's index.
  3. Installs the Python deps in requirements.txt (lpips, scikit-image, nuscenes-devkit, etc.).
  4. Authenticates to Hugging Face (it will prompt for a token).
  5. Downloads the 2B Video2World checkpoint (~4–6 GB on disk after dequantization) into ./checkpoints/.

Set COSMOS_MODEL_SIZE=14B before running setup.sh if you want the 14B (~28 GB on disk).

Steps

  1. Hardware sanity — open notebook.py, run the first cell. It prints torch.cuda.is_available(), the device name, and the free VRAM, and hard-fails if you are below 30 GB free.
  2. Pick a scene — load nuScenes mini, list the 10 scenes, pick one with non-trivial ego motion (e.g. scene-0061 has a left turn). The notebook defaults to that scene.
  3. Extract first frame and trajectory — pull the first CAM_FRONT keyframe and the next 5 seconds of ego_pose at 12 Hz. Convert global poses into ego-frame deltas (Δx, Δy, Δheading) per timestep. Save the GT video clip to ./gt_clip.mp4 for later comparison.
  4. Format action conditioning — Cosmos Predict 2's core Video2World pipeline takes a text prompt + an initial image. Cosmos Predict 2's action-conditioned sample variant (Cosmos-Predict2-2B-Sample-Action-Conditioned) takes a 7-dim end-effector delta sequence — it was trained on Bridge robot data, not driving data. For nuScenes we therefore use the standard Video2World pipeline and encode the conditioning as (a) a structured natural-language prompt ("ego vehicle drives forward 12 m, turns left 18 degrees, decelerates from 8 m/s to 5 m/s") plus (b) the first camera frame, with generation length and frame rate matched to the ground-truth clip (a prompt-building sketch follows this list). This is the honest workflow today; genuinely action-conditioned driving prediction needs Cosmos-Predict2.5 with post-training on a driving dataset.
  5. Run inference — call Video2WorldPipeline.from_config(...) with the 2B preset, pass the first frame and the prompt, ask for 60 frames at 12 fps. Save to ./gen_clip.mp4.
  6. Pixel-level drift — frame-align GT vs generated, compute LPIPS and SSIM per frame, plot both curves with time on the x-axis. Save ./figures/lpips_ssim_curves.png.
  7. Trajectory-following error — re-localize the ego in the generated frames. Two options: (a) run a small monocular-depth + flow estimate to infer ego motion from generated video, or (b) detect lane markings and infer lateral position relative to lane center. The notebook implements (b) using a simple yellow/white line detector — crude, but it gives a number. Compare to the commanded trajectory.
  8. Failure-mode tour — pick three timestamps (e.g. 0.5 s, 2.5 s, 4.5 s) and write down what is wrong. Tag each with one of: object permanence (a car disappears), counting (3 cars become 2), identity drift (a sedan turns into a truck), action-conditioning fidelity (the ego follows a different path than commanded), texture rot (lane markings turn into mush).
  9. Counterfactual — re-run with a deliberately wrong action prompt ("ego brakes hard to a stop"). Observe whether the generated video actually shows brake behavior or whether the model ignores the change. This is the most diagnostic test of action-conditioning fidelity in the entire notebook.
  10. Write up — fill in the reflection block at the bottom of notebook.py with your four numbers (LPIPS@5s, SSIM@5s, lateral drift in m, longitudinal drift in m) and your verdict on the three customer questions above.
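
For step 4, the structured prompt can be generated mechanically from the ego-frame deltas rather than written by hand. A minimal sketch, assuming deltas is the (N, 3) array of (dx, dy, dheading) from step 3 and dt is the timestep in seconds (1/12 for 12 Hz); the wording template is illustrative, not a Cosmos requirement:

import numpy as np

def trajectory_prompt(deltas: np.ndarray, dt: float) -> str:
    """Summarize a 5 s ego trajectory as a structured natural-language prompt."""
    step_dist = np.linalg.norm(deltas[:, :2], axis=1)
    total_dist = step_dist.sum()
    heading_deg = np.degrees(deltas[:, 2].sum())
    v_start = step_dist[:3].mean() / dt   # average speed over the first few steps
    v_end = step_dist[-3:].mean() / dt    # average speed over the last few steps
    turn = "turns left" if heading_deg > 2 else "turns right" if heading_deg < -2 else "continues straight"
    return (
        f"Front-facing dashcam video. The ego vehicle drives forward {total_dist:.0f} m, "
        f"{turn} {abs(heading_deg):.0f} degrees, and changes speed from "
        f"{v_start:.1f} m/s to {v_end:.1f} m/s over 5 seconds."
    )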

Done criterion

You are done when, in notebook.py, the four metric cells produce concrete values, the corresponding plots are saved, and the failure-mode taxonomy is written:

  • lpips_at_5s — a single float, typically 0.35–0.55 for the 2B at default guidance.
  • ssim_at_5s — a single float, typically 0.20–0.40.
  • lateral_drift_m — a single float in meters, typically 1–3 m for current frontier world models over 5 s of open-loop rollout.
  • longitudinal_drift_m — a single float in meters, typically 2–8 m.
  • figures/lpips_ssim_curves.png and figures/trajectory_overlay.png exist on disk.
  • A written failure-mode taxonomy with at least three annotated frames in the markdown cell at the bottom of the notebook.

If your numbers are far better than the ranges above, double-check that you are not accidentally comparing GT to GT and that the metric is not being computed on the conditioning frame itself. If they are far worse, check the resolution match (LPIPS is sensitive to mismatched resolutions) and the FPS alignment.
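
As a reference for step 6 and the sanity checks above, the per-frame metric computation stays small. A minimal sketch, assuming both clips have already been decoded into lists of HxWx3 uint8 RGB frames at matched resolution and fps with the conditioning frame excluded; it uses the lpips package and scikit-image from requirements.txt:

import numpy as np
import torch
import lpips
from skimage.metrics import structural_similarity as ssim

_lpips = lpips.LPIPS(net="alex")

def _to_lpips_tensor(frame: np.ndarray) -> torch.Tensor:
    # lpips expects NCHW floats in [-1, 1] (see pitfall 7 below).
    x = torch.from_numpy(frame).float().permute(2, 0, 1).unsqueeze(0)
    return x / 127.5 - 1.0

def metric_curves(gt_frames, gen_frames):
    """Per-frame LPIPS and SSIM between ground-truth and generated clips."""
    lpips_curve, ssim_curve = [], []
    for gt, gen in zip(gt_frames, gen_frames):
        with torch.no_grad():
            lpips_curve.append(_lpips(_to_lpips_tensor(gt), _to_lpips_tensor(gen)).item())
        ssim_curve.append(ssim(gt, gen, channel_axis=2))
    # Sanity check: a frame compared against itself should give LPIPS ~0.
    assert _lpips(_to_lpips_tensor(gt_frames[0]), _to_lpips_tensor(gt_frames[0])).item() < 1e-4
    return np.array(lpips_curve), np.array(ssim_curve)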

Common pitfalls

  1. Action-format mismatch. The action-conditioned Cosmos checkpoint is trained on Bridge end-effector deltas (7-D), not on driving trajectories. If you pipe nuScenes ego deltas into the action-conditioned variant directly, it will produce garbage that looks plausible. Use the standard Video2World variant with text+image conditioning and encode trajectory in the prompt, or post-train Predict 2.5 on driving data.
  2. Classifier-free guidance scale. Default CFG is 7. Push it to 12–15 if you want the model to follow the prompt more strictly; this also amplifies hallucinations and reduces visual realism. Run a small sweep (CFG ∈ {5, 7, 10, 14}) and report which one minimizes lateral drift — it is rarely the default.
  3. Frame rate / temporal consistency. Cosmos Predict 2 was trained at 16 fps; nuScenes camera samples come in at 12 Hz with a different aspect ratio. If you do not match resolution and fps in pre/post-processing, your LPIPS curve will be dominated by interpolation artifacts, not generator drift. A frame-alignment sketch follows this list.
  4. Deterministic seeds are not actually deterministic across hardware. Set torch.manual_seed, numpy.random.seed, and cuda seeds, then accept that two H100s and an A100 will give you slightly different videos for the same seed. Report the seed you used and run 3 trials if you care about variance.
  5. Compute-vs-disk gotcha. The 14B checkpoint is ~28 GB on disk. The 2B is ~4–6 GB. Hugging Face downloads via LFS — make sure your home directory has at least 60 GB free if you plan to compare 2B vs 14B side-by-side, since both sets of weights need to be on disk at once.
  6. The first frame is not the conditioning frame. Some Cosmos pipelines expect a 1-frame or 5-frame "history" — passing only one frame can make the rollout immediately inconsistent. Read the chunk_size and num_conditional_frames fields in the inference JSON before you trust your first output.
  7. LPIPS on un-normalized images is silently wrong. The lpips package expects images in [-1, 1]. If you pass [0, 255] uint8 or [0, 1] floats without scaling, you get a number — just not a meaningful one.
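
For pitfall 3, a nearest-frame alignment plus a common resize removes most of the interpolation confound before any metric is computed (the [-1, 1] scaling from pitfall 7 is handled inside the metric sketch above). A minimal sketch using scikit-image; the output resolution is an example value, not a Cosmos requirement:

import numpy as np
from skimage.transform import resize

def align_clips(gt_frames, gt_fps, gen_frames, gen_fps, out_hw=(480, 832)):
    """Sample both clips on the ground-truth timeline and resize to a common shape."""
    duration = min(len(gt_frames) / gt_fps, len(gen_frames) / gen_fps)
    times = np.arange(0.0, duration, 1.0 / gt_fps)
    gt_idx = np.clip(np.round(times * gt_fps).astype(int), 0, len(gt_frames) - 1)
    gen_idx = np.clip(np.round(times * gen_fps).astype(int), 0, len(gen_frames) - 1)

    def prep(frame):
        # skimage resize returns floats in [0, 1]; convert back to uint8 for the metrics.
        return (resize(frame, out_hw, anti_aliasing=True) * 255).astype(np.uint8)

    return [prep(gt_frames[i]) for i in gt_idx], [prep(gen_frames[i]) for i in gen_idx]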

Further reading

  • NVIDIA Cosmos Predict 2 — https://github.com/nvidia-cosmos/cosmos-predict2
  • NVIDIA Cosmos Predict 2.5 (newer; action-conditioning is more first-class) — https://github.com/nvidia-cosmos/cosmos-predict2.5
  • Cosmos Predict 2 inference docs — https://github.com/nvidia-cosmos/cosmos-predict2/blob/main/documentations/inference_video2world.md
  • HF model card (action-conditioned sample, Bridge-trained) — https://huggingface.co/nvidia/Cosmos-Predict2-2B-Sample-Action-Conditioned
  • HF model card (vanilla 2B Video2World) — https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World
  • Cosmos Predict 2 model matrix and VRAM table — https://docs.nvidia.com/cosmos/latest/predict2/model_matrix.html
  • LPIPS reference implementation — https://github.com/richzhang/PerceptualSimilarity
  • nuScenes devkit — https://github.com/nutonomy/nuscenes-devkit
  • "World Models for Autonomy" — internal notes (Section G on lateral drift)

Files in this project

  • README.md
  • notebook.py
  • requirements.txt
  • setup.sh

The notebook (notebook.py) is in jupytext percent format — open it in VS Code or convert it with jupytext --to notebook.