Physical AI
Research·Doc 05·~60 min

World Models and Generative Video for Physical AI

Last verified: 2026-05-08. Initial draft written from training knowledge (cutoff Jan 2026); claims, dates, and parameter counts have been verified and enriched with citations to primary sources (arXiv, official blogs, GitHub) where possible.

"World model" is the most overloaded phrase in Physical AI. It is also where, in 2024–2026, NVIDIA's Cosmos play most directly threatens the classical synthetic-data vendors (Applied Intuition, Parallel Domain, Foretellix, AnyVerse). This doc tries to be precise about what people actually mean, what's been released vs. demoed, and where the value sits today.


A. What is a "world model" in Physical AI context?

The phrase gets used for at least five distinct families of models. They share an idea — predict what happens next — but differ in representation, supervision, and what they're good for.

A.1 Latent dynamics models (RL lineage)

Classical "world model" in the Ha & Schmidhuber sense: an encoder maps observations to a compact latent, a recurrent dynamics model rolls the latent forward conditioned on actions, and a policy is trained on rollouts inside that latent.

  • Dreamer / DreamerV2 / DreamerV3 (Hafner et al.) — single-agent RL world model that solved Atari, DMC, Crafter, and the Minecraft diamond task from scratch. Dynamics live in a small RSSM latent; rollouts cost milliseconds.
  • MuZero (Schrittwieser et al.) — learns a value-equivalent latent dynamics, no pixel reconstruction. Used inside DeepMind's planners.
  • TD-MPC2 (Hansen et al.) — modern continuous-control variant.

These models are tiny (millions of params), do not generate pixels, and are designed for policy training in imagination. They are what RL researchers usually mean by "world model." They are largely not what AV/robotics product teams mean today.
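
To make "policy training in imagination" concrete, here is a minimal PyTorch sketch of the pattern (illustrative module choices and dimensions only, not Dreamer's actual RSSM architecture):

```python
# Minimal sketch of the latent-dynamics / imagination pattern (illustrative,
# not Dreamer's actual RSSM). Shapes and modules are placeholder choices.
import torch
import torch.nn as nn

OBS_DIM, LATENT_DIM, ACTION_DIM, HORIZON = 64, 32, 4, 15

encoder = nn.Linear(OBS_DIM, LATENT_DIM)                      # obs -> latent
dynamics = nn.GRUCell(ACTION_DIM, LATENT_DIM)                 # (action, latent) -> next latent
reward_head = nn.Linear(LATENT_DIM, 1)                        # latent -> predicted reward
policy = nn.Sequential(nn.Linear(LATENT_DIM, ACTION_DIM), nn.Tanh())

def imagine_rollout(obs: torch.Tensor) -> torch.Tensor:
    """Roll the policy forward entirely inside the learned latent space."""
    z = encoder(obs)                                           # start from a real observation
    imagined_rewards = []
    for _ in range(HORIZON):
        a = policy(z)                                          # act from the latent, no pixels
        z = dynamics(a, z)                                     # predict the next latent
        imagined_rewards.append(reward_head(z))
    return torch.stack(imagined_rewards).sum(0)                # imagined return of the trajectory

# Policy improvement happens on imagined returns; the dynamics model itself is
# trained separately on real (obs, action, next_obs, reward) tuples.
ret = imagine_rollout(torch.randn(8, OBS_DIM)).mean()
(-ret).backward()
```

The point of the pattern: once the dynamics model is fit to real transitions, policy improvement runs entirely in the cheap latent space, with no renderer or robot in the loop.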

A.2 Generative video models (pixel-space rollouts)

Diffusion or autoregressive models that produce future video frames, possibly conditioned on text, an initial frame, a trajectory, or other control inputs.

  • OpenAI Sora / Sora 2 (Sora 2, Sep 30 2025) — text-to-video with synchronized audio; no public action-steering API. (Note: as of late April 2026 OpenAI announced the Sora consumer product was being wound down, though the underlying model continues to be available via API.)
  • Google Veo 2 / Veo 3 (Veo 3, May 2025) — joint audio+video diffusion-transformer; 4–8s clips up to 4K at 24fps with synchronized audio.
  • NVIDIA Cosmos Predict / Predict 2 / Predict 2.5 (Cosmos paper, Jan 2025; Cosmos-Predict2 GitHub) — diffusion + autoregressive WFMs explicitly framed as a Physical AI substrate. Predict 2 (released April–June 2025) ships in 2B and 14B variants; Predict 2.5 followed later in 2025 (more in §B).
  • Runway Gen-3 / Gen-4, Kling, Pika, HunyuanVideo, Wan2.1 — general-purpose creative video; sometimes used for AV demos but not designed for it.

These are large (billions of params), generate pixels, and are good for data generation, photoreal augmentation, and qualitative eval. They are usually not what an RL person means when they say "world model."

A.3 Driving world models

Generative video specialised on driving scenes, typically multi-camera, conditioned on high-level controls (route, BEV layout, agent boxes, trajectory, weather/time-of-day).

A.4 Robotics world models

Action-conditioned video or latent models for manipulation, navigation, or whole-body humanoid control.

A.5 Action-conditioned ("steerable") video

Cross-cutting capability: the difference between "make a video of a car turning left" and "given THIS scene, simulate what happens if I steer +5° and brake at 3.2 m/s²." This is what turns a generative video model into a world model. Cosmos Predict, GAIA-2, IRASim, and Genie 2 all expose it; Sora and Veo do not.
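
The distinction is easiest to see as an interface contract. The sketch below is illustrative Python with invented names (no vendor exposes exactly this API): a text-to-video model takes only a prompt, while a steerable world model takes the scene so far plus a per-step control input, which is what makes counterfactual queries possible.

```python
# Illustrative interfaces only; class and method names are hypothetical,
# not any vendor's actual API.
from typing import Protocol, Sequence
import numpy as np

Frame = np.ndarray  # H x W x 3

class TextToVideo(Protocol):
    def generate(self, prompt: str, num_frames: int) -> Sequence[Frame]:
        """'Make a video of a car turning left.' No way to intervene mid-rollout."""

class ActionConditionedWorldModel(Protocol):
    def step(self, history: Sequence[Frame], action: np.ndarray) -> Frame:
        """Given THIS scene so far, simulate one step under a chosen control input."""

def counterfactual(wm: ActionConditionedWorldModel, scene: Sequence[Frame]) -> tuple:
    """Roll the same scene forward under two different control decisions."""
    brake = np.array([+0.087, -3.2])   # steer +5 deg (in rad), brake at 3.2 m/s^2
    coast = np.array([0.0, 0.0])
    return wm.step(scene, brake), wm.step(scene, coast)
```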

A.6 Quick "what is each good for"

Family                                  | Planning                | Sim  | Data           | Eval
Latent dynamics (Dreamer/MuZero)        | yes (in latent)         | weak | no             | weak
Generative video (Sora/Veo)             | no                      | weak | yes (creative) | qualitative
Driving WMs (GAIA-2, Cosmos-Drive)      | partial (closed-loop)   | yes  | yes            | yes
Robotics WMs (1X, IRASim, GR00T-Dreams) | yes (imagined rollouts) | yes  | yes            | partial
JEPA-style (V-JEPA 2)                   | yes (in embedding)      | no   | no             | yes (probing)

B. Driving world models — deep dive

This is the area Applied Intuition (and every classical AV-sim vendor) has to have an opinion on, because Cosmos-Drive is explicitly aimed at the same workflows.

B.1 Wayve GAIA-1, GAIA-2, GAIA-3

GAIA-1 (initial blog Jun 2023; 9B-param tech report Sep 2023) was the field-defining work: a 9B autoregressive model on video + text + action tokens, trained on ~4,700 hours of Wayve London driving (2019–2023). Single-camera, VQ tokeniser, with a separate 2.6B-param video-diffusion decoder (trained 15 days on 32 A100s).

GAIA-2 (announced Mar 26 2025; arXiv 2503.20523) is the production-relevant version: latent-diffusion with flow-matching, native multi-camera, structured conditioning (ego dynamics, agent configurations, environmental factors, road semantics, plus CLIP and proprietary embedding inputs), trained on ~25M curated video sequences from Wayve's UK+US+Germany fleet across multiple vehicle/sensor configurations. It serves both as a closed-loop simulator for evaluating Wayve's end-to-end driver and as a scenario generator for rare events.

GAIA-3 (announced Dec 2 2025) is the current generation: ~15B params (2× GAIA-2), 10× more training data, redesigned next-gen video tokenizer also doubled in size. The framing has shifted from generation to evaluation: Wayve pitches GAIA-3 as a substrate for safety-critical scenario generation, "embodiment transfer" (consistent eval across different vehicle rigs), and controlled visual diversity for robustness testing. Wayve reports a fivefold reduction in synthetic-test rejection rates vs. earlier generations. The world model is not a separate product; it's the same generative substrate they use to train and evaluate the driver — the strongest "vertically integrated" WM story in AV.

B.2 Academic driving world models

Model          | Year | Multi-view   | Action-cond.     | Notes
DriveDreamer   | 2023 | no           | yes (BEV + text) | nuScenes, single-front-cam
DriveDreamer-2 | 2024 | yes          | yes              | LLM expands user prompts to structured controls
MagicDrive     | 2023 | yes (6 cams) | bbox + map       | Strong multi-view geometry conditioning
MagicDrive3D   | 2024 | yes + 3D     | yes              | Generates a 3D scene you can re-render
Vista          | 2024 | no           | yes              | Long-horizon (15s+) at higher res
Drive-WM       | 2024 | yes          | trajectory       | First end-to-end planner-in-the-loop with a WM
GenAD          | 2024 | yes          | yes              | OpenDriveLab; large-scale OpenDV-2K data

The OpenDriveLab OpenDV-2K dataset (~2,000 hours of YouTube driving) is the de-facto open pre-training corpus and what most academic WMs use.

B.3 NVIDIA Cosmos and Cosmos-Drive

Cosmos (first announced at CES Jan 6 2025; major release at GTC Mar 18 2025) is the move that scared the synthetic-data incumbents. Three families, all open-weight on GitHub and Hugging Face under the NVIDIA Open Model License:

  • Cosmos Predict — future-frame / video2world generation from past frames + optional text/trajectory; diffusion-transformer and autoregressive variants. Cosmos Predict-1 (Jan 2025) shipped 4B/5B/14B variants. Cosmos Predict-2 (April–June 2025) ships 0.6B/2B/14B variants for text2image, text2video, and video2world; Predict-2.5 followed later in 2025. The 14B variant targets high-fidelity long-horizon coherence; 2B targets faster inference.
  • Cosmos Transfer (Cosmos-Transfer1, arXiv 2503.14492, Mar 2025) — controlnet-style: takes structured sim inputs (segmentation, depth, edges, lidar, pose, trajectory, HD map) and outputs photoreal video that respects them. Adaptive multimodal control (different conditions weighted differently per spatial region). The operationally important head for AV Sim2Real.
  • Cosmos Reason (Cosmos-Reason1, May 17 2025, 7B params; Cosmos-Reason2, Dec 19 2025, 2B/8B; 32B added Apr 2026) — reasoning VLM that scores whether a generated video is physically plausible / matches the prompt; produces long chain-of-thought rationales over space, time, and physics. Used as filter, planner, and reward signal.

The Cosmos paper does not publish a single canonical "X million hours" headline; the data-curation appendix describes a multi-source pipeline (driving, robotics, navigation, human activity) but exact composition and total hours are not fully disclosed.

Cosmos-Drive-Dreams (arXiv 2506.09042, Jun 10 2025; open dataset on HF) is the AV specialisation: Cosmos Transfer fine-tuned on driving, conditioned on HD-map + 3D-box + lidar layouts. Open dataset includes labels (HD-map, BBox, lidar) for 5,843 10-second NVIDIA-collected clips plus 81,802 synthetic Cosmos-generated video samples derived from those labels. This is the head-on shot at Parallel Domain.

B.4 Other 2025–2026 driving WMs

A flurry of academic and corporate follow-ups appeared in 2025–2026 (DriveWM-2, Vista-style extensions); production deployment of most is unclear. Tesla has shown world-model demos at investor events but published no architecture (treat as rumour-grade). Waabi World is a closed neural simulator for trucking. The Decart/Etched Oasis model (Oct 31 2024; open weights for Oasis-500M; Oasis 2.0 followed in 2025) is a real-time interactive Minecraft world model — important for the interactive WM lineage but unrelated to AV ("OASIS-AD" is not a published artefact and was a hallucination of the original draft, [unverified]).

B.5 What problems they solve

Use case                       | What the WM gives you                               | Replaces
Closed-loop sim for planner    | Photoreal counterfactuals from logged scenes        | Rendered sim + ML perception layer
Edge-case generation           | Prompt for "child runs out from parked van at dusk" | Hand-authored scenarios
Perception data augmentation   | Photoreal frames with propagated labels             | Hand-collected data
Pre-training perception        | Next-frame prediction as self-sup                   | ImageNet/JFT pretraining
Policy training in imagination | RL inside the WM                                    | On-policy real-world miles
Adversarial eval               | Search prompts that break the planner               | Manual red-teaming

In production, the only ones that are unambiguously deployed at scale are (a) edge-case generation for eval and (b) data augmentation. Closed-loop sim against a learned planner is operational at Wayve and Waabi but not yet a general-purpose product elsewhere.


C. Robotics world models — deep dive

Robotics is harder than driving because the action space is high-dimensional, the camera often moves in 6DoF, and contact dynamics matter. The 2024–2025 field split into three camps.

C.1 Generative video for robotics

  • 1X World Model (initial blog Sep 16 2024; tech report PDF; major productized update Jan 13 2026) — first humanoid-OEM world model. Trained on first-person video from the EVE (now succeeded by NEO) fleet plus teleop-action labels. Conditioned on the same action vocabulary the policy outputs, so they can roll out alternative futures for the same scene. 1X released the 1X World Model Challenge (v1.0 Jul 8 2024) on HF with subsets of their fleet data — useful for hands-on experimentation.
  • UniSim (Yang et al., ICLR 2024 outstanding paper) — early "universal simulator" trained on a mixed corpus (driving, manipulation, navigation, human video) with text + action conditioning. Demonstrated cross-embodiment behaviour generation.
  • IRASim (ByteDance, 2024) — manipulation-focused, conditioned on robot end-effector trajectory; trained on RT-1, BridgeData, and Language-Table.
  • RoboDreamer (Zhou et al., ICML 2024) — compositional generation: composes "robot arm" + "object" + "verb" trajectories. Generalisation rather than fidelity story.

C.2 Game-world / interactive WMs

  • DeepMind Genie (Bruce et al., ICML 2024) — 11B foundation model of 2D platformers; learns latent actions from unlabelled video and lets you "play" the generated world.
  • Genie 2 (DeepMind blog, Dec 4 2024) — interactive 3D worlds from a single image; reported up to ~1 minute of consistent rollouts in best cases (most demos shorter), keyboard/mouse control, model never publicly released.
  • Genie 3 (DeepMind blog, Aug 5 2025) — 11B-param autoregressive transformer; real-time, several-minute interactive 720p / 24fps environments from a text prompt with "promptable world events." Builds on Veo 3. Research preview only — not publicly released. Used internally as an RL training environment for SIMA-style agents.

These are not directly robotics models, but the techniques (latent action, interactive rollout) flow into the robotics WMs.

C.3 Joint-embedding predictive (the LeCun line)

  • V-JEPA (LeCun et al., Feb 2024) — predicts in latent space, masked feature prediction on video, no pixel reconstruction.
  • V-JEPA 2 (Meta blog, Jun 11 2025; arXiv 2506.09985; code on GitHub) — 1.2B params, trained on >1M hours of internet video + 1M images, then second-stage action-conditioned training on ~62 hours of robot data. Meta's V-JEPA 2-AC was deployed zero-shot on Franka arms in two different labs for image-goal pick-and-place (reported 65–80% success on novel objects/settings) by searching in the embedding space for actions that move the embedding toward a goal-image embedding. Meta also released three new physical-reasoning benchmarks (IntPhys 2, MVPBench, CausalVQA) alongside V-JEPA 2.

V-JEPA's selling point: doesn't waste capacity on photoreal pixels, learns physical regularities at the right level of abstraction. The selling point is also its limitation — you can't render data out of it, so it competes with classical perception data, not with synthetic-data vendors.
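
A rough sketch of the planning loop described above, as random-shooting control in embedding space; `encode` and `predict_embedding` are hypothetical placeholders for the real V-JEPA 2-AC encoder and action-conditioned predictor:

```python
# Hedged sketch of goal-image planning in embedding space (V-JEPA 2-AC style).
# `encode` and `predict_embedding` are placeholder stand-ins, not the real model.
import numpy as np

rng = np.random.default_rng(0)

def encode(image: np.ndarray) -> np.ndarray:
    return image.reshape(-1)[:256]                    # placeholder embedding

def predict_embedding(z: np.ndarray, action: np.ndarray) -> np.ndarray:
    return z + 0.01 * np.resize(action, z.shape)      # placeholder action-conditioned predictor

def plan_action(current_img, goal_img, action_dim=7, n_samples=512):
    """Sample candidate arm actions, pick the one whose predicted next embedding
    lands closest to the goal-image embedding."""
    z, z_goal = encode(current_img), encode(goal_img)
    candidates = rng.uniform(-1.0, 1.0, size=(n_samples, action_dim))
    dists = [np.linalg.norm(predict_embedding(z, a) - z_goal) for a in candidates]
    return candidates[int(np.argmin(dists))]

best = plan_action(rng.random((64, 64, 3)), rng.random((64, 64, 3)))
```

The actual system plans over multi-step action sequences and replans after each executed step; the sketch keeps only the core idea of scoring candidate actions by embedding distance to the goal.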

C.4 NVIDIA GR00T-Dreams

GR00T is NVIDIA's humanoid foundation-model line. GR00T N1.5 and the GR00T-Dreams blueprint were announced together at Computex 2025 (May 2025); the open-source release of N1.5 and Newton physics engine followed in July 2025. GR00T-Dreams is the data-generation pipeline: feed Cosmos Predict a few real demonstrations from a humanoid, generate hundreds of action-conditioned video futures, run a separate inverse-dynamics network to extract pseudo-labels (joint trajectories), and use those as additional training data for the policy. NVIDIA reports they used GR00T-Dreams to generate the training data for N1.5 in ~36 hours vs. nearly three months of manual collection — treat the headline multipliers as marketing until reproduced.
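
Schematically, the pipeline looks like the sketch below; every name is a hypothetical stand-in (NVIDIA has not published an API shaped like this), and it shows the stages rather than a runnable integration:

```python
# Schematic only: all function and attribute names are hypothetical stand-ins
# for the GR00T-Dreams blueprint stages described above.
def dream_training_data(real_demos, world_model, inverse_dynamics, n_futures=200):
    """Turn a handful of real demonstrations into many pseudo-labelled trajectories."""
    dreamed_dataset = []
    for demo in real_demos:
        # 1. Generate many action-conditioned video futures from one real seed demo.
        futures = [world_model.generate(seed_frames=demo.frames, prompt=demo.task)
                   for _ in range(n_futures)]
        # 2. Run an inverse-dynamics model over each dreamed video to recover
        #    pseudo joint trajectories (the "labels" the policy will train on).
        for video in futures:
            actions = inverse_dynamics.infer(video)
            dreamed_dataset.append({"frames": video, "actions": actions,
                                    "task": demo.task, "source": "dreamed"})
    return dreamed_dataset

# The policy is then trained on real_demos + dreamed_dataset; a quality filter
# (e.g. a Cosmos-Reason-style plausibility scorer) typically sits between steps 1 and 2.
```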

C.5 Physical Intelligence (π)

Physical Intelligence ships VLA policies (π0, Oct 2024; π0.5, Apr 22 2025 / arXiv 2504.16054) — vision-language-action models that map pixels and language directly to actions. π0.5 demonstrated open-world generalization to entirely new homes (cleaning unseen kitchens/bedrooms via 10–15 minute multi-stage behaviors). They have no public world model. Their public stance has been "scale VLAs on diverse robot data" rather than "imagine in pixels." This is a deliberate bet on the LeCun/Karpathy view that pixel rollouts are the wrong abstraction for control; whether they are using a private latent WM internally is unclear.

C.6 Google DeepMind Gemini Robotics

Gemini Robotics (Mar 2025 announcement; arXiv 2503.20020) is DeepMind's VLA built on Gemini 2.0, with physical actions added as an output modality. Released alongside Gemini Robotics-ER (Embodied Reasoning), a multimodal model with 3D bounding-box prediction, pointing, trajectory and grasp prediction, and multi-view correspondence. Followed by Gemini Robotics On-Device (Jul 2025) and Gemini Robotics 1.5 / ER 1.5 later in 2025. Like π0.5 this is a VLA, not a generative world model — included here because it is the most-watched 2025 entrant in the policy half of the stack and increasingly competes for the same "physical-AI foundation model" mindshare.

C.7 Tesla Optimus

Tesla has shown internal "Optimus world model" demos (occupancy-network-style 3D + video rollouts conditioned on humanoid action) at investor events through 2025, but has published no architecture. Treat as rumour-grade.


D. The big debate: does Sora "understand physics"?

Two camps, both partially right.

D.1 The LeCun camp

Yann LeCun's argument, made repeatedly through 2024–2025: a generative pixel model is optimised for likelihood under a tokeniser, not for any physical invariant. It will happily hallucinate violations of object permanence or gravity if the dataset lets it, because the loss is per-pixel. Worse, photoreal generation wastes capacity on irrelevant texture detail. The "true" world model lives in a learned latent (his JEPA argument), not in pixels.

Empirical support: PhyWorld, PhyGenBench, and VBench-Physics / VideoPhy benchmarks showed that even 2025-era frontier video models (Sora, Veo, Kling) fail basic physics tests at rates of 30–60% on tasks like "ball bounces off ramp," "liquid pours into glass," "two objects collide." The community largely agrees the failures are real.

D.2 The "scale is all you need" camp

Counter-argument, articulated by Bill Peebles (Sora lead) and NVIDIA's Cosmos team: pixels are the only universal interface. Every other representation (3D mesh, occupancy, latent embedding) is some team's design choice. If you train at sufficient scale on enough multimodal data, the model will discover the physical regularities as a byproduct, the same way LLMs discovered programming. The Cosmos paper explicitly calls itself a "world foundation model" and frames scale as the path.

Empirical support: scaling laws on PhyWorld show monotonic improvement with both data and compute. NVIDIA reports Cosmos Predict-2 14B is meaningfully better than 2B at preserving object identity over 5–10s [unverified, NVIDIA-internal]; Genie 3 jumped from Genie 2's ~10–20s rollouts to several minutes at 720p / 24fps.

D.3 What benchmarks actually measure

Benchmark       | What it tests                                                | Frontier model results (approx.)
VBench          | 16 dimensions: subject consistency, motion smoothness, etc. | Sora/Veo/Kling all >80% aggregate
VideoPhy        | Physical commonsense in generated video                     | <50% physical-plausibility scores
PhyGenBench     | 27 physics laws across 4 domains                             | best models ~0.5
WorldModelBench | Action-conditioning fidelity for embodied use                | early-stage, low scores

Long-horizon failure modes that show up consistently (a simple automated check is sketched after the list):

  • Object permanence: an obscured car returns as a different car.
  • Counting: 4 cyclists become 3 then 5.
  • Causality: brake lights without deceleration; collisions without contact response.
  • Action drift: ego trajectory diverges from commanded over 5+ seconds.
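
These failure modes are mechanically checkable. A minimal sketch of one such check, for the counting failure, with `detect_objects` as a hypothetical per-frame detector (not a real API):

```python
# Minimal sketch of one benchmark-style check: object-count consistency over a
# rollout. `detect_objects` is a hypothetical per-frame detector callable.
from collections import Counter

def count_consistency(frames, detect_objects, cls="cyclist"):
    """Flag the '4 cyclists become 3 then 5' failure: the per-frame count of a
    class should not fluctuate in a scene where nothing enters or leaves view."""
    counts = [sum(1 for det in detect_objects(f) if det["class"] == cls) for f in frames]
    most_common = Counter(counts).most_common(1)[0][0]
    violations = sum(1 for c in counts if c != most_common)
    return 1.0 - violations / max(len(counts), 1)   # 1.0 = perfectly consistent
```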

D.4 The synthesis view

The honest answer in 2026: generative pixel models are a usable but leaky world model. They are good enough for data augmentation and qualitative eval, not yet good enough as a load-bearing simulator for safety-critical decisions. The disagreement is whether the leaks close with scale (Cosmos/OpenAI bet) or whether you need a different architecture (LeCun/V-JEPA bet). Reasonable people disagree, and Applied Intuition needs an opinion that doesn't require winning that bet.


E. Use cases — where world models actually deliver value today

E.1 Closed-loop simulation for AV planners

Replace classical sim's rendered pixels with WM-generated photoreal video, conditioned on the same scenario script. Wayve uses GAIA-2 this way internally. The advantage over rendered sim: no "sim2real perception gap." The risk: the WM might paper over a real perception failure with plausible-looking pixels.

E.2 Data augmentation (rare scenarios on demand)

The clearest deployed win. Generate 10,000 frames of "ambulance with lights, rainy night, four-way stop" and use them for perception training. Cosmos-Drive-Dreams is built around this. PD's Step and Anyverse Studio are the classical analogues.

E.3 Pre-training (next-frame prediction as SSL)

Predicting future video is a strong self-supervised objective for perception backbones. V-JEPA 2, DINO-WM, and the Cosmos backbones have been used as pretraining sources for downstream perception heads. Quietly important.

E.4 Policy rollouts in imagination

The Dreamer story, applied to robotics: roll out the policy inside the WM, train on those imagined trajectories. Demonstrated at toy scale (RoboDreamer, TWM) but not yet at humanoid-product scale. GR00T-Dreams is the closest production example, though it uses the WM for data not for on-policy imagination.

E.5 Evaluation / adversarial testing

Use the WM as a fuzzer: search the conditioning space for prompts that produce failures in the planner. This is the eval-time flip of E.1 (closed-loop simulation). Wayve and Waabi do this; it's a natural fit for Applied Intuition's existing scenario-search infrastructure.
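
A minimal sketch of that fuzzing loop, with `world_model`, `planner`, and `score_failure` as hypothetical callables; a production version would search structured scenario parameters rather than a small discrete grid:

```python
# Hedged sketch of WM-as-fuzzer: random search over scenario conditioning,
# keeping the conditions that make the planner fail worst. All callables are
# hypothetical stand-ins, not a real product API.
import random

WEATHER = ["clear", "rain", "fog", "snow"]
TIME = ["noon", "dusk", "night"]
EVENTS = ["child runs from parked van", "ambulance runs red light", "tire debris on lane"]

def fuzz_planner(world_model, planner, score_failure, n_trials=100, top_k=10):
    """Search the conditioning space for scenarios that break the planner."""
    results = []
    for _ in range(n_trials):
        cond = {"weather": random.choice(WEATHER),
                "time_of_day": random.choice(TIME),
                "event": random.choice(EVENTS)}
        video = world_model.generate(cond)                  # photoreal rollout of the scenario
        plan = planner.run(video)                           # planner under test
        results.append((score_failure(plan, cond), cond))   # higher score = worse failure
    return sorted(results, key=lambda r: r[0], reverse=True)[:top_k]
```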

E.6 Sim-to-real bridging via Cosmos Transfer

The pattern that is going mainstream in 2025–2026: keep the classical sim for ground truth (boxes, depth, segmentation, HD map, perfect ego physics) and run a Cosmos-Transfer-style head over its renders to add photoreal style. You get the labels of classical sim and the realism of generative video. This is the hybrid workflow and probably what will dominate.
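
Schematically (all function names hypothetical, standing in for the simulator's export APIs and a Cosmos-Transfer-style model server):

```python
# Hedged sketch of the hybrid workflow: classical sim provides labels and
# structured renders, a Transfer-style model re-skins them photorealistically.
# All functions are hypothetical placeholders.
def hybrid_sample(sim, photoreal_model, scenario):
    frame = sim.render(scenario)             # classical sim render (stylised but exact)
    labels = sim.ground_truth(scenario)      # boxes, depth, segmentation, HD map from sim state
    conditioning = {
        "segmentation": labels["segmentation"],
        "depth": labels["depth"],
        "hdmap": labels["hdmap"],
    }
    photoreal = photoreal_model.transfer(frame, conditioning)   # photoreal pixels, same geometry
    # Labels carry over unchanged because the generative head is constrained to
    # respect the structured inputs; that is the whole point of the hybrid.
    return {"image": photoreal, "labels": labels}
```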


F. Threat to classical synthetic data

This is the question that matters for an Applied Intuition product person.

F.1 The bull case for Cosmos against classical sim

If a 14B Cosmos checkpoint can produce 4K, multi-camera, 10-second driving videos from a text prompt or a sketch, with controllable weather and agents, why pay for a classical sim stack? Cosmos's marketing pitch is essentially this: "we replace your $50M sim engineering team with a foundation model and ~$10M of inference compute." Parallel Domain, AnyVerse, Cognata, Spectral, and the rendering portions of Applied Intuition / Foretellix are the targets.

F.2 The counter-arguments

These are real and load-bearing for the incumbents:

  1. Ground truth. Classical sim gives you pixel-perfect 3D boxes, semantic masks, depth, optical flow, lidar, radar — derived from the simulator's internal state. Cosmos gives you pixels and you have to label them with another model. Auto-label noise compounds, especially on the long tail (which is the only tail that matters for AV).
  2. Controllability and determinism. A safety case wants to say "we ran the same scenario 10,000 times across firmware versions." Diffusion video models are stochastic; getting frame-by-frame determinism is not natively supported.
  3. Physical fidelity for dynamics-critical tests. Vehicle dynamics, sensor models (especially radar, lidar near-field), V2X timing — none of this comes out of a pixel model. You still need a physics engine.
  4. Certifiability. ISO 26262 / SOTIF / UN R157 don't have a clear story for "we trained on samples from a 14B diffusion model." They do have a story for parameterised classical sim with provenance.
  5. Long-tail control. Generating "child runs from parked van at dusk in 23°C drizzle on a 6.5%-grade road" requires either prompt engineering against a model that may or may not respect every constraint, or a structured controller. Classical sim has structured controllers natively.

F.3 The hybrid that's actually winning

The compromise that has emerged in 2025: classical sim for scenario, layout, physics, and labels; Cosmos Transfer (or equivalent) on top for photoreal style. This is exactly the Cosmos Transfer / DRIVE Sim + Cosmos pattern. NVIDIA does it. Wayve does a variant where they run Sim-rendered geometry through GAIA-2's layout-to-pixel head. Waabi has hinted at a similar pipeline.

This is good news for Applied Intuition. The classical-sim moat (scenario authoring, ground truth, certifiability) is preserved; the photoreal-pixels moat was already eroding regardless of Cosmos. The right product strategy is to own the conditioning stack (scenario language, BEV layout, ground truth) and treat the photoreal renderer as a swappable backend, possibly Cosmos itself.

F.4 A useful framing

Layer                        | Owned by                      | Cosmos threat
Scenario authoring & ODD     | Applied Intuition             | low
Ground truth labels          | classical sim                 | low
Vehicle dynamics, sensor sim | physics engine                | low
Photoreal rendering          | classical render → WM         | high
Perception training data     | classical sim → hybrid        | medium
Closed-loop planner eval     | classical sim → WM-augmented  | medium

The product team's job is to make sure the photoreal-rendering layer is swappable so customers can choose between Unreal-style rendering, neural rendering, and Cosmos Transfer, and route the choice by use case (certified safety case → Unreal; perception data → Cosmos).


G. Open problems

These are the technical risks that determine whether the WM bet pays off.

  1. 3D / multi-view consistency. Most WMs are 2D-token models; preserving geometry across 5+ surround cameras over 10s is not solved. MagicDrive3D and GAIA-2 are the best public attempts.
  2. Long-horizon coherence (>30s). Identity drift, object permanence, and accumulated error all kick in past ~10s. Genie 3 and Cosmos Predict 14B push this but don't solve it.
  3. Action-conditioning fidelity. Commanded ego trajectory vs. realised trajectory in the generated video drifts. Quantitatively measured by WorldModelBench and similar; current frontier is ~1–2m lateral drift over 5s — too high for closed-loop planner eval.
  4. Physical plausibility. Per benchmarks above, ~50% on hard tests. Especially poor on contact, deformation, fluids, articulation.
  5. Compute cost. A Cosmos-14B 5s 720p multi-camera generation is in the tens of seconds on an H100; an Unreal frame renders in milliseconds. For high-volume training data you can amortise this; for closed-loop sim at 10 Hz it's a non-starter today.
  6. Label generation. "Do you get cuboids out of a generated frame?" is the single most-asked question by AV customers. Today the answer is via auto-labelling with a separately trained detector, which means your generated data is only as labelled as your detector is good — and on long-tail scenes, the detector is the bottleneck. Cosmos-Drive-Dreams sidesteps this by generating from labels rather than labelling generations, which is the right move but constrains diversity.
  7. Distribution shift / hallucination. Out-of-distribution prompts ("snowy night in Phoenix") generate plausible-looking but quietly wrong scenes. Hard to detect without a ground-truth reference.
  8. Open evaluation. No standard "is this WM useful for AV?" benchmark exists yet. Bench2Drive, NAVSIM, and CARLA Leaderboard 2.0 are planning benchmarks, not WM benchmarks.

H. Implications for the user (Applied Intuition Data Intelligence)

H.1 What to read first (in this order)

  1. Cosmos paper — the most important single artefact. Read all of it; the data-curation appendix matters.
  2. GAIA-1, GAIA-2 (arXiv 2503.20523), and GAIA-3 blog — the production-pull counterpart, including the explicit shift from generation to evaluation.
  3. V-JEPA 2 paper — the rival paradigm, plus Meta's robot evaluation.
  4. DriveDreamer-2 and Drive-WM — the academic landscape.
  5. VideoPhy and PhyGenBench — the reality-check benchmarks.
  6. Cosmos-Transfer1 (arXiv 2503.14492) and the Cosmos-Drive-Dreams paper (arXiv 2506.09042) — the hybrid sim+WM pattern and the open synthetic-data release.
  7. The Ha & Schmidhuber world-model paper and DreamerV3 — for the RL grounding so you don't get bluffed by the term.
  8. Genie 3 blog — the strongest 2025 demonstration of an interactive WM and where the pixel-WM frontier is.

H.2 Repos and tooling worth running

H.3 Concrete first projects (in escalating ambition)

  1. Reproduce a Cosmos Predict driving rollout on nuScenes. Take a public Cosmos-Drive checkpoint, condition on a real nuScenes scene's first frame + ego trajectory, generate 5s of multi-camera future, and overlay the ground-truth camera footage. Measure pixel and trajectory drift. ~1–2 weeks. Forces you to feel the action-conditioning failure modes first-hand.
  2. Hybrid sim project: CARLA + Cosmos Transfer. Render a CARLA scenario with ground-truth boxes, run Cosmos Transfer to photorealise it, retrain a YOLO/DETR detector on the photoreal version, and benchmark on real nuScenes. Quantify the sim2real gap closure. ~3–4 weeks. This is the workflow Applied Intuition customers will ask about.
  3. Toy GAIA-1 on nuScenes. Train a small (~500M) autoregressive video tokeniser + transformer on nuScenes front-camera + ego-action, ~4–6 weeks on a single 8xH100 node. You'll learn why GAIA-2 is hard.
  4. WM-as-evaluator. Use Cosmos Reason or a fine-tuned VLM to score generated driving videos on physical plausibility, then do prompt-search to find planner failures. This is the closest to Applied Intuition's actual product surface.
  5. Action-conditioning probe. Take an open WM (Cosmos Predict, IRASim) and quantitatively measure trajectory-following error vs. commanded action. Publishable as a short paper if done carefully — and exactly the kind of internal capability assessment a Data Intelligence team should own. A minimal drift-metric sketch follows below.
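
For projects 1 and 5, the drift measurement itself is simple once you can recover the ego trajectory from the generated video (e.g. with an ego-pose or odometry estimator, which is the hard, hypothetical step not shown here):

```python
# Minimal sketch of a trajectory-drift metric for the action-conditioning probe.
# `realised_xy` is assumed to come from a hypothetical ego-pose / odometry
# estimator run on the generated video; that recovery step is not shown.
import numpy as np

def drift_metrics(commanded_xy: np.ndarray, realised_xy: np.ndarray) -> dict:
    """Both inputs are (T, 2) ego positions in metres over the rollout horizon."""
    per_step = np.linalg.norm(commanded_xy - realised_xy, axis=1)
    return {
        "ade_m": float(per_step.mean()),   # average displacement error
        "fde_m": float(per_step[-1]),      # final displacement error at the horizon
        "max_lateral_m": float(np.abs(commanded_xy[:, 1] - realised_xy[:, 1]).max()),
    }

# Toy usage: a commanded straight line vs. a rollout that drifts laterally
# (second column is the lateral axis in this toy frame).
t = np.linspace(0, 5, 50)
commanded = np.stack([10.0 * t, np.zeros_like(t)], axis=1)                # 10 m/s straight ahead
realised = commanded + np.stack([np.zeros_like(t), 0.3 * t], axis=1)      # 0.3 m/s lateral drift
print(drift_metrics(commanded, realised))
```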

H.4 The strategic frame to bring to the team

The right framing for Applied Intuition's Data Intelligence team is not "do we build a Cosmos competitor" — they would lose. It is:

  • Own the conditioning (scenario language, BEV/HD-map, ground-truth labels) — that's the durable moat.
  • Treat the photoreal renderer as a backend that customers swap (Unreal / Cosmos Transfer / a future model) by use case.
  • Evaluate WMs rigorously as a service to customers — most AV teams can't tell a good WM from a bad one and would pay for that judgment.
  • Generate from labels, not label generations — Cosmos-Drive-Dreams' key insight; Applied Intuition's existing scenario authoring already produces the labels.
  • Be the certified path — the safety-case story for classical sim is real and not threatened by WMs in any near-term regulatory framework.

The biggest risk is not Cosmos. It is that customers conclude "real driving data + fleet learning > any synthetic" and route around all sim vendors. World models actually help the synthetic-data category here, by raising the realism ceiling. The fight is between sim+WM vendors and pure fleet-data plays (Tesla, Wayve, Waymo), not between sim vendors and Cosmos.


Sources

World-model foundations

Generative video

Driving world models

Robotics world models

The physics-understanding debate

Classical synthetic-data context