Physical AI
Research·Doc 05·~60 min

World Models and Generative Video for Physical AI

Last verified: 2026-05-08. Initial draft written from training knowledge (cutoff Jan 2026); claims, dates, and parameter counts have been verified and enriched with citations to primary sources (arXiv, official blogs, GitHub) where possible.

"World model" is the most overloaded phrase in Physical AI. It is also where, in 2024–2026, NVIDIA's Cosmos play most directly threatens the classical synthetic-data vendors (Applied Intuition, Parallel Domain, Foretellix, AnyVerse). This doc tries to be precise about what people actually mean, what's been released vs. demoed, and where the value sits today.


A. What is a "world model" in Physical AI context?

The phrase gets used for at least five distinct families of models. They share an idea — predict what happens next — but differ in representation, supervision, and what they're good for.

A.1 Latent dynamics models (RL lineage)

Classical "world model" in the Ha & Schmidhuber sense: an encoder maps observations to a compact latent, a recurrent dynamics model rolls the latent forward conditioned on actions, and a policy is trained on rollouts inside that latent.

  • Dreamer / DreamerV2 / DreamerV3 (Hafner et al.) — single-agent RL world model that solved Atari, DMC, Crafter, and the Minecraft diamond task from scratch. Dynamics live in a small RSSM latent; rollouts cost milliseconds.
  • MuZero (Schrittwieser et al.) — learns a value-equivalent latent dynamics, no pixel reconstruction. Used inside DeepMind's planners.
  • TD-MPC2 (Hansen et al.) — modern continuous-control variant.

These models are tiny (millions of params), do not generate pixels, and are designed for policy training in imagination. They are what RL researchers usually mean by "world model." They are largely not what AV/robotics product teams mean today.
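
To make "policy training in imagination" concrete, here is a minimal PyTorch sketch of the pattern (illustrative module choices and dimensions only, not Dreamer's actual RSSM architecture):

```python
# Minimal sketch of the latent-dynamics / imagination pattern (illustrative,
# not Dreamer's actual RSSM). Shapes and modules are placeholder choices.
import torch
import torch.nn as nn

OBS_DIM, LATENT_DIM, ACTION_DIM, HORIZON = 64, 32, 4, 15

encoder = nn.Linear(OBS_DIM, LATENT_DIM)                      # obs -> latent
dynamics = nn.GRUCell(ACTION_DIM, LATENT_DIM)                 # (action, latent) -> next latent
reward_head = nn.Linear(LATENT_DIM, 1)                        # latent -> predicted reward
policy = nn.Sequential(nn.Linear(LATENT_DIM, ACTION_DIM), nn.Tanh())

def imagine_rollout(obs: torch.Tensor) -> torch.Tensor:
    """Roll the policy forward entirely inside the learned latent space."""
    z = encoder(obs)                                           # start from a real observation
    imagined_rewards = []
    for _ in range(HORIZON):
        a = policy(z)                                          # act from the latent, no pixels
        z = dynamics(a, z)                                     # predict the next latent
        imagined_rewards.append(reward_head(z))
    return torch.stack(imagined_rewards).sum(0)                # imagined return of the trajectory

# Policy improvement happens on imagined returns; the dynamics model itself is
# trained separately on real (obs, action, next_obs, reward) tuples.
ret = imagine_rollout(torch.randn(8, OBS_DIM)).mean()
(-ret).backward()
```

The point of the pattern: once the dynamics model is fit to real transitions, policy improvement runs entirely in the cheap latent space, with no renderer or robot in the loop.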

A.2 Generative video models (pixel-space rollouts)

Diffusion or autoregressive models that produce future video frames, possibly conditioned on text, an initial frame, a trajectory, or other control inputs.

  • OpenAI Sora / Sora 2 (Sora 2, Sep 30 2025) — text-to-video with synchronized audio; no public action-steering API. (Note: as of late April 2026 OpenAI announced the Sora consumer product was being wound down, though the underlying model continues to be available via API.)
  • Google Veo 2 / Veo 3 (Veo 3, May 2025) — joint audio+video diffusion-transformer; 4–8s clips up to 4K at 24fps with synchronized audio.
  • NVIDIA Cosmos Predict / Predict 2 / Predict 2.5 (Cosmos paper, Jan 2025; Cosmos-Predict2 GitHub) — diffusion + autoregressive WFMs explicitly framed as a Physical AI substrate. Predict 2 (released April–June 2025) ships in 2B and 14B variants; Predict 2.5 followed later in 2025 (more in §B).
  • Runway Gen-3 / Gen-4, Kling, Pika, HunyuanVideo, Wan2.1 — general-purpose creative video; sometimes used for AV demos but not designed for it.

These are large (billions of params), generate pixels, and are good for data generation, photoreal augmentation, and qualitative eval. They are usually not what an RL person means when they say "world model."

A.3 Driving world models

Generative video specialised on driving scenes, typically multi-camera, conditioned on high-level controls (route, BEV layout, agent boxes, trajectory, weather/time-of-day).

A.4 Robotics world models

Action-conditioned video or latent models for manipulation, navigation, or whole-body humanoid control.

A.5 Action-conditioned ("steerable") video

Cross-cutting capability: the difference between "make a video of a car turning left" and "given THIS scene, simulate what happens if I steer +5° and brake at 3.2 m/s²." This is what turns a generative video model into a world model. Cosmos Predict, GAIA-2, IRASim, and Genie 2 all expose it; Sora and Veo do not.
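
The distinction is easiest to see as an interface contract. The sketch below is illustrative Python with invented names (no vendor exposes exactly this API): a text-to-video model takes only a prompt, while a steerable world model takes the scene so far plus a per-step control input, which is what makes counterfactual queries possible.

```python
# Illustrative interfaces only; class and method names are hypothetical,
# not any vendor's actual API.
from typing import Protocol, Sequence
import numpy as np

Frame = np.ndarray  # H x W x 3

class TextToVideo(Protocol):
    def generate(self, prompt: str, num_frames: int) -> Sequence[Frame]:
        """'Make a video of a car turning left.' No way to intervene mid-rollout."""

class ActionConditionedWorldModel(Protocol):
    def step(self, history: Sequence[Frame], action: np.ndarray) -> Frame:
        """Given THIS scene so far, simulate one step under a chosen control input."""

def counterfactual(wm: ActionConditionedWorldModel, scene: Sequence[Frame]) -> tuple:
    """Roll the same scene forward under two different control decisions."""
    brake = np.array([+0.087, -3.2])   # steer +5 deg (in rad), brake at 3.2 m/s^2
    coast = np.array([0.0, 0.0])
    return wm.step(scene, brake), wm.step(scene, coast)
```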

A.6 Quick "what is each good for"

Family                                  | Planning                | Sim  | Data           | Eval
Latent dynamics (Dreamer/MuZero)        | yes (in latent)         | weak | no             | weak
Generative video (Sora/Veo)             | no                      | weak | yes (creative) | qualitative
Driving WMs (GAIA-2, Cosmos-Drive)      | partial (closed-loop)   | yes  | yes            | yes
Robotics WMs (1X, IRASim, GR00T-Dreams) | yes (imagined rollouts) | yes  | yes            | partial
JEPA-style (V-JEPA 2)                   | yes (in embedding)      | no   | no             | yes (probing)

B. Driving world models — deep dive

This is the area Applied Intuition (and every classical AV-sim vendor) has to have an opinion on, because Cosmos-Drive is explicitly aimed at the same workflows.

B.1 Wayve GAIA-1, GAIA-2, GAIA-3

GAIA-1 (initial blog Jun 2023; 9B-param tech report Sep 2023) was the field-defining work: a 9B autoregressive model on video + text + action tokens, trained on ~4,700 hours of Wayve London driving (2019–2023). Single-camera, VQ tokeniser, with a separate 2.6B-param video-diffusion decoder (trained 15 days on 32 A100s).

GAIA-2 (announced Mar 26 2025; arXiv 2503.20523) is the production-relevant version: latent-diffusion with flow-matching, native multi-camera, structured conditioning (ego dynamics, agent configurations, environmental factors, road semantics, plus CLIP and proprietary embedding inputs), trained on ~25M curated video sequences from Wayve's UK+US+Germany fleet across multiple vehicle/sensor configurations. It serves both as a closed-loop simulator for evaluating Wayve's end-to-end driver and as a scenario generator for rare events.

GAIA-3 (announced Dec 2 2025) is the current generation: ~15B params (2× GAIA-2), 10× more training data, redesigned next-gen video tokenizer also doubled in size. The framing has shifted from generation to evaluation: Wayve pitches GAIA-3 as a substrate for safety-critical scenario generation, "embodiment transfer" (consistent eval across different vehicle rigs), and controlled visual diversity for robustness testing. Wayve reports a fivefold reduction in synthetic-test rejection rates vs. earlier generations. The world model is not a separate product; it's the same generative substrate they use to train and evaluate the driver — the strongest "vertically integrated" WM story in AV.

B.2 Academic driving world models

Model          | Year | Multi-view   | Action-cond.     | Notes
DriveDreamer   | 2023 | no           | yes (BEV + text) | nuScenes, single-front-cam
DriveDreamer-2 | 2024 | yes          | yes              | LLM expands user prompts to structured controls
MagicDrive     | 2023 | yes (6 cams) | bbox + map       | Strong multi-view geometry conditioning
MagicDrive3D   | 2024 | yes + 3D     | yes              | Generates a 3D scene you can re-render
Vista          | 2024 | no           | yes              | Long-horizon (15s+) at higher res
Drive-WM       | 2024 | yes          | trajectory       | First end-to-end planner-in-the-loop with a WM
GenAD          | 2024 | yes          | yes              | OpenDriveLab; large-scale OpenDV-2K data

The OpenDriveLab OpenDV-2K dataset (~2,000 hours of YouTube driving) is the de-facto open pre-training corpus and what most academic WMs use.

B.3 NVIDIA Cosmos and Cosmos-Drive

Cosmos (first announced at CES Jan 6 2025; major release at GTC Mar 18 2025) is the move that scared the synthetic-data incumbents. Three families, all open-weight on GitHub and Hugging Face under the NVIDIA Open Model License:

  • Cosmos Predict — future-frame / video2world generation from past frames + optional text/trajectory; diffusion-transformer and autoregressive variants. Cosmos Predict-1 (Jan 2025) shipped 4B/5B/14B variants. Cosmos Predict-2 (April–June 2025) ships 0.6B/2B/14B variants for text2image, text2video, and video2world; Predict-2.5 followed later in 2025. The 14B variant targets high-fidelity long-horizon coherence; 2B targets faster inference.
  • Cosmos Transfer (Cosmos-Transfer1, arXiv 2503.14492, Mar 2025) — controlnet-style: takes structured sim inputs (segmentation, depth, edges, lidar, pose, trajectory, HD map) and outputs photoreal video that respects them. Adaptive multimodal control (different conditions weighted differently per spatial region). The operationally important head for AV Sim2Real.
  • Cosmos Reason (Cosmos-Reason1, May 17 2025, 7B params; Cosmos-Reason2, Dec 19 2025, 2B/8B; 32B added Apr 2026) — reasoning VLM that scores whether a generated video is physically plausible / matches the prompt; produces long chain-of-thought rationales over space, time, and physics. Used as filter, planner, and reward signal.

The Cosmos paper does not publish a single canonical "X million hours" headline; the data-curation appendix describes a multi-source pipeline (driving, robotics, navigation, human activity) but exact composition and total hours are not fully disclosed.

Cosmos-Drive-Dreams (arXiv 2506.09042, Jun 10 2025; open dataset on HF) is the AV specialisation: Cosmos Transfer fine-tuned on driving, conditioned on HD-map + 3D-box + lidar layouts. Open dataset includes labels (HD-map, BBox, lidar) for 5,843 10-second NVIDIA-collected clips plus 81,802 synthetic Cosmos-generated video samples derived from those labels. This is the head-on shot at Parallel Domain.

B.4 Other 2025–2026 driving WMs

A flurry of academic and corporate follow-ups appeared in 2025–2026 (DriveWM-2, Vista-style extensions); production deployment of most is unclear. Tesla has shown world-model demos at investor events but published no architecture (treat as rumour-grade). Waabi World is a closed neural simulator for trucking. The Decart/Etched Oasis model (Oct 31 2024; open weights for Oasis-500M; Oasis 2.0 followed in 2025) is a real-time interactive Minecraft world model — important for the interactive WM lineage but unrelated to AV ("OASIS-AD" is not a published artefact and was a hallucination of the original draft, [unverified]).

B.5 What problems they solve

Use case                       | What the WM gives you                               | Replaces
Closed-loop sim for planner    | Photoreal counterfactuals from logged scenes        | Rendered sim + ML perception layer
Edge-case generation           | Prompt for "child runs out from parked van at dusk" | Hand-authored scenarios
Perception data augmentation   | Photoreal frames with propagated labels             | Hand-collected data
Pre-training perception        | Next-frame prediction as self-sup                   | ImageNet/JFT pretraining
Policy training in imagination | RL inside the WM                                    | On-policy real-world miles
Adversarial eval               | Search prompts that break the planner               | Manual red-teaming

In production, the only ones that are unambiguously deployed at scale are (a) edge-case generation for eval and (b) data augmentation. Closed-loop sim against a learned planner is operational at Wayve and Waabi but not yet a general-purpose product elsewhere.


C. Robotics world models — deep dive

Robotics is harder than driving because the action space is high-dimensional, the camera often moves in 6DoF, and contact dynamics matter. The 2024–2025 field split into three camps.

C.1 Generative video for robotics

  • 1X World Model (initial blog Sep 16 2024; tech report PDF; major productized update Jan 13 2026) — first humanoid-OEM world model. Trained on first-person video from the EVE (now succeeded by NEO) fleet plus teleop-action labels. Conditioned on the same action vocabulary the policy outputs, so they can roll out alternative futures for the same scene. 1X released the 1X World Model Challenge (v1.0 Jul 8 2024) on HF with subsets of their fleet data — useful for hands-on experimentation.
  • UniSim (Yang et al., ICLR 2024 outstanding paper) — early "universal simulator" trained on a mixed corpus (driving, manipulation, navigation, human video) with text + action conditioning. Demonstrated cross-embodiment behaviour generation.
  • IRASim (ByteDance, 2024) — manipulation-focused, conditioned on robot end-effector trajectory; trained on RT-1, BridgeData, and Language-Table.
  • RoboDreamer (Zhou et al., ICML 2024) — compositional generation: composes "robot arm" + "object" + "verb" trajectories. Generalisation rather than fidelity story.

C.2 Game-world / interactive WMs

  • DeepMind Genie (Bruce et al., ICML 2024) — 11B foundation model of 2D platformers; learns latent actions from unlabelled video and lets you "play" the generated world.
  • Genie 2 (DeepMind blog, Dec 4 2024) — interactive 3D worlds from a single image; reported up to ~1 minute of consistent rollouts in best cases (most demos shorter), keyboard/mouse control, model never publicly released.
  • Genie 3 (DeepMind blog, Aug 5 2025) — 11B-param autoregressive transformer; real-time, several-minute interactive 720p / 24fps environments from a text prompt with "promptable world events." Builds on Veo 3. Research preview only — not publicly released. Used internally as an RL training environment for SIMA-style agents.

These are not directly robotics models, but the techniques (latent action, interactive rollout) flow into the robotics WMs.

C.3 Joint-embedding predictive (the LeCun line)

  • V-JEPA (LeCun et al., Feb 2024) — predicts in latent space, masked feature prediction on video, no pixel reconstruction.
  • V-JEPA 2 (Meta blog, Jun 11 2025; arXiv 2506.09985; code on GitHub) — 1.2B params, trained on >1M hours of internet video + 1M images, then second-stage action-conditioned training on ~62 hours of robot data. Meta's V-JEPA 2-AC was deployed zero-shot on Franka arms in two different labs for image-goal pick-and-place (reported 65–80% success on novel objects/settings) by searching in the embedding space for actions that move the embedding toward a goal-image embedding. Meta also released three new physical-reasoning benchmarks (IntPhys 2, MVPBench, CausalVQA) alongside V-JEPA 2.

V-JEPA's selling point: doesn't waste capacity on photoreal pixels, learns physical regularities at the right level of abstraction. The selling point is also its limitation — you can't render data out of it, so it competes with classical perception data, not with synthetic-data vendors.
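
A rough sketch of the planning loop described above, as random-shooting control in embedding space; `encode` and `predict_embedding` are hypothetical placeholders for the real V-JEPA 2-AC encoder and action-conditioned predictor:

```python
# Hedged sketch of goal-image planning in embedding space (V-JEPA 2-AC style).
# `encode` and `predict_embedding` are placeholder stand-ins, not the real model.
import numpy as np

rng = np.random.default_rng(0)

def encode(image: np.ndarray) -> np.ndarray:
    return image.reshape(-1)[:256]                    # placeholder embedding

def predict_embedding(z: np.ndarray, action: np.ndarray) -> np.ndarray:
    return z + 0.01 * np.resize(action, z.shape)      # placeholder action-conditioned predictor

def plan_action(current_img, goal_img, action_dim=7, n_samples=512):
    """Sample candidate arm actions, pick the one whose predicted next embedding
    lands closest to the goal-image embedding."""
    z, z_goal = encode(current_img), encode(goal_img)
    candidates = rng.uniform(-1.0, 1.0, size=(n_samples, action_dim))
    dists = [np.linalg.norm(predict_embedding(z, a) - z_goal) for a in candidates]
    return candidates[int(np.argmin(dists))]

best = plan_action(rng.random((64, 64, 3)), rng.random((64, 64, 3)))
```

The actual system plans over multi-step action sequences and replans after each executed step; the sketch keeps only the core idea of scoring candidate actions by embedding distance to the goal.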

C.4 NVIDIA GR00T-Dreams

GR00T is NVIDIA's humanoid foundation-model line. GR00T N1.5 and the GR00T-Dreams blueprint were announced together at Computex 2025 (May 2025); the open-source release of N1.5 and Newton physics engine followed in July 2025. GR00T-Dreams is the data-generation pipeline: feed Cosmos Predict a few real demonstrations from a humanoid, generate hundreds of action-conditioned video futures, run a separate inverse-dynamics network to extract pseudo-labels (joint trajectories), and use those as additional training data for the policy. NVIDIA reports they used GR00T-Dreams to generate the training data for N1.5 in ~36 hours vs. nearly three months of manual collection — treat the headline multipliers as marketing until reproduced.
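
Schematically, the pipeline looks like the sketch below; every name is a hypothetical stand-in (NVIDIA has not published an API shaped like this), and it shows the stages rather than a runnable integration:

```python
# Schematic only: all function and attribute names are hypothetical stand-ins
# for the GR00T-Dreams blueprint stages described above.
def dream_training_data(real_demos, world_model, inverse_dynamics, n_futures=200):
    """Turn a handful of real demonstrations into many pseudo-labelled trajectories."""
    dreamed_dataset = []
    for demo in real_demos:
        # 1. Generate many action-conditioned video futures from one real seed demo.
        futures = [world_model.generate(seed_frames=demo.frames, prompt=demo.task)
                   for _ in range(n_futures)]
        # 2. Run an inverse-dynamics model over each dreamed video to recover
        #    pseudo joint trajectories (the "labels" the policy will train on).
        for video in futures:
            actions = inverse_dynamics.infer(video)
            dreamed_dataset.append({"frames": video, "actions": actions,
                                    "task": demo.task, "source": "dreamed"})
    return dreamed_dataset

# The policy is then trained on real_demos + dreamed_dataset; a quality filter
# (e.g. a Cosmos-Reason-style plausibility scorer) typically sits between steps 1 and 2.
```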

C.5 Physical Intelligence (π)

Physical Intelligence ships VLA policies (π0, Oct 2024; π0.5, Apr 22 2025 / arXiv 2504.16054) — vision-language-action models that map pixels and language directly to actions. π0.5 demonstrated open-world generalization to entirely new homes (cleaning unseen kitchens/bedrooms via 10–15 minute multi-stage behaviors). They have no public world model. Their public stance has been "scale VLAs on diverse robot data" rather than "imagine in pixels." This is a deliberate bet on the LeCun/Karpathy view that pixel rollouts are the wrong abstraction for control; whether they are using a private latent WM internally is unclear.

C.6 Google DeepMind Gemini Robotics

Gemini Robotics (Mar 2025 announcement; arXiv 2503.20020) is DeepMind's VLA built on Gemini 2.0, with physical actions added as an output modality. Released alongside Gemini Robotics-ER (Embodied Reasoning), a multimodal model with 3D bounding-box prediction, pointing, trajectory and grasp prediction, and multi-view correspondence. Followed by Gemini Robotics On-Device (Jul 2025) and Gemini Robotics 1.5 / ER 1.5 later in 2025. Like π0.5 this is a VLA, not a generative world model — included here because it is the most-watched 2025 entrant in the policy half of the stack and increasingly competes for the same "physical-AI foundation model" mindshare.

C.7 Tesla Optimus

Tesla has shown internal "Optimus world model" demos (occupancy-network-style 3D + video rollouts conditioned on humanoid action) at investor events through 2025, but has published no architecture. Treat as rumour-grade.


D. The big debate: does Sora "understand physics"?

Two camps, both partially right.

D.1 The LeCun camp

Yann LeCun's argument, made repeatedly through 2024–2025: a generative pixel model is optimised for likelihood under a tokeniser, not for any physical invariant. It will happily hallucinate violations of object permanence or gravity if the dataset lets it, because the loss is per-pixel. Worse, photoreal generation wastes capacity on irrelevant texture detail. The "true" world model lives in a learned latent (his JEPA argument), not in pixels.

Empirical support: PhyWorld, PhyGenBench, and VBench-Physics / VideoPhy benchmarks showed that even 2025-era frontier video models (Sora, Veo, Kling) fail basic physics tests at rates of 30–60% on tasks like "ball bounces off ramp," "liquid pours into glass," "two objects collide." The community largely agrees the failures are real.

D.2 The "scale is all you need" camp

Counter-argument, articulated by Bill Peebles (Sora lead) and NVIDIA's Cosmos team: pixels are the only universal interface. Every other representation (3D mesh, occupancy, latent embedding) is some team's design choice. If you train at sufficient scale on enough multimodal data, the model will discover the physical regularities as a byproduct, the same way LLMs discovered programming. The Cosmos paper explicitly calls itself a "world foundation model" and frames scale as the path.

Empirical support: scaling laws on PhyWorld show monotonic improvement with both data and compute. NVIDIA reports Cosmos Predict-2 14B is meaningfully better than 2B at preserving object identity over 5–10s [unverified, NVIDIA-internal]; Genie 3 jumped from Genie 2's ~10–20s rollouts to several minutes at 720p / 24fps.

D.3 What benchmarks actually measure

Benchmark       | What it tests                                                | Frontier model results (approx.)
VBench          | 16 dimensions: subject consistency, motion smoothness, etc. | Sora/Veo/Kling all >80% aggregate
VideoPhy        | Physical commonsense in generated video                     | <50% physical-plausibility scores
PhyGenBench     | 27 physics laws across 4 domains                             | best models ~0.5
WorldModelBench | Action-conditioning fidelity for embodied use                | early-stage, low scores

Long-horizon failure modes that show up consistently (a simple automated check is sketched after the list):

  • Object permanence: an obscured car returns as a different car.
  • Counting: 4 cyclists become 3 then 5.
  • Causality: brake lights without deceleration; collisions without contact response.
  • Action drift: ego trajectory diverges from commanded over 5+ seconds.
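
These failure modes are mechanically checkable. A minimal sketch of one such check, for the counting failure, with `detect_objects` as a hypothetical per-frame detector (not a real API):

```python
# Minimal sketch of one benchmark-style check: object-count consistency over a
# rollout. `detect_objects` is a hypothetical per-frame detector callable.
from collections import Counter

def count_consistency(frames, detect_objects, cls="cyclist"):
    """Flag the '4 cyclists become 3 then 5' failure: the per-frame count of a
    class should not fluctuate in a scene where nothing enters or leaves view."""
    counts = [sum(1 for det in detect_objects(f) if det["class"] == cls) for f in frames]
    most_common = Counter(counts).most_common(1)[0][0]
    violations = sum(1 for c in counts if c != most_common)
    return 1.0 - violations / max(len(counts), 1)   # 1.0 = perfectly consistent
```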

D.4 The synthesis view

The honest answer in 2026: generative pixel models are a usable but leaky world model. They are good enough for data augmentation and qualitative eval, not yet good enough as a load-bearing simulator for safety-critical decisions. The disagreement is whether the leaks close with scale (Cosmos/OpenAI bet) or whether you need a different architecture (LeCun/V-JEPA bet). Reasonable people disagree, and Applied Intuition needs an opinion that doesn't require winning that bet.


E. Use cases — where world models actually deliver value today

E.1 Closed-loop simulation for AV planners

Replace classical sim's rendered pixels with WM-generated photoreal video, conditioned on the same scenario script. Wayve uses GAIA-2 this way internally. The advantage over rendered sim: no "sim2real perception gap." The risk: the WM might paper over a real perception failure with plausible-looking pixels.

E.2 Data augmentation (rare scenarios on demand)

The clearest deployed win. Generate 10,000 frames of "ambulance with lights, rainy night, four-way stop" and use them for perception training. Cosmos-Drive-Dreams is built around this. PD's Step and Anyverse Studio are the classical analogues.

E.3 Pre-training (next-frame prediction as SSL)

Predicting future video is a strong self-supervised objective for perception backbones. V-JEPA 2, DINO-WM, and the Cosmos backbones have been used as pretraining sources for downstream perception heads. Quietly important.

E.4 Policy rollouts in imagination

The Dreamer story, applied to robotics: roll out the policy inside the WM, train on those imagined trajectories. Demonstrated at toy scale (RoboDreamer, TWM) but not yet at humanoid-product scale. GR00T-Dreams is the closest production example, though it uses the WM for data not for on-policy imagination.

E.5 Evaluation / adversarial testing

Use the WM as a fuzzer: search the conditioning space for prompts that produce failures in the planner. This is the eval-time flip of E.1 (closed-loop simulation). Wayve and Waabi do this; it's a natural fit for Applied Intuition's existing scenario-search infrastructure.
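
A minimal sketch of that fuzzing loop, with `world_model`, `planner`, and `score_failure` as hypothetical callables; a production version would search structured scenario parameters rather than a small discrete grid:

```python
# Hedged sketch of WM-as-fuzzer: random search over scenario conditioning,
# keeping the conditions that make the planner fail worst. All callables are
# hypothetical stand-ins, not a real product API.
import random

WEATHER = ["clear", "rain", "fog", "snow"]
TIME = ["noon", "dusk", "night"]
EVENTS = ["child runs from parked van", "ambulance runs red light", "tire debris on lane"]

def fuzz_planner(world_model, planner, score_failure, n_trials=100, top_k=10):
    """Search the conditioning space for scenarios that break the planner."""
    results = []
    for _ in range(n_trials):
        cond = {"weather": random.choice(WEATHER),
                "time_of_day": random.choice(TIME),
                "event": random.choice(EVENTS)}
        video = world_model.generate(cond)                  # photoreal rollout of the scenario
        plan = planner.run(video)                           # planner under test
        results.append((score_failure(plan, cond), cond))   # higher score = worse failure
    return sorted(results, key=lambda r: r[0], reverse=True)[:top_k]
```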

E.6 Sim-to-real bridging via Cosmos Transfer

The pattern that is going mainstream in 2025–2026: keep the classical sim for ground truth (boxes, depth, segmentation, HD map, perfect ego physics) and run a Cosmos-Transfer-style head over its renders to add photoreal style. You get the labels of classical sim and the realism of generative video. This is the hybrid workflow and probably what will dominate.
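
Schematically (all function names hypothetical, standing in for the simulator's export APIs and a Cosmos-Transfer-style model server):

```python
# Hedged sketch of the hybrid workflow: classical sim provides labels and
# structured renders, a Transfer-style model re-skins them photorealistically.
# All functions are hypothetical placeholders.
def hybrid_sample(sim, photoreal_model, scenario):
    frame = sim.render(scenario)             # classical sim render (stylised but exact)
    labels = sim.ground_truth(scenario)      # boxes, depth, segmentation, HD map from sim state
    conditioning = {
        "segmentation": labels["segmentation"],
        "depth": labels["depth"],
        "hdmap": labels["hdmap"],
    }
    photoreal = photoreal_model.transfer(frame, conditioning)   # photoreal pixels, same geometry
    # Labels carry over unchanged because the generative head is constrained to
    # respect the structured inputs; that is the whole point of the hybrid.
    return {"image": photoreal, "labels": labels}
```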


F. Threat to classical synthetic data

This is the question that matters for an Applied Intuition product person.

F.1 The bull case for Cosmos against classical sim

If a 14B Cosmos checkpoint can produce 4K, multi-camera, 10-second driving videos from a text prompt or a sketch, with controllable weather and agents, why pay for a classical sim stack? Cosmos's marketing pitch is essentially this: "we replace your $50M sim engineering team with a foundation model and ~$10M of inference compute." Parallel Domain, AnyVerse, Cognata, Spectral, and the rendering portions of Applied Intuition / Foretellix are the targets.

F.2 The counter-arguments

These are real and load-bearing for the incumbents:

  1. Ground truth. Classical sim gives you pixel-perfect 3D boxes, semantic masks, depth, optical flow, lidar, radar — derived from the simulator's internal state. Cosmos gives you pixels and you have to label them with another model. Auto-label noise compounds, especially on the long tail (which is the only tail that matters for AV).
  2. Controllability and determinism. A safety case wants to say "we ran the same scenario 10,000 times across firmware versions." Diffusion video models are stochastic; getting frame-by-frame determinism is not natively supported.
  3. Physical fidelity for dynamics-critical tests. Vehicle dynamics, sensor models (especially radar, lidar near-field), V2X timing — none of this comes out of a pixel model. You still need a physics engine.
  4. Certifiability. ISO 26262 / SOTIF / UN R157 don't have a clear story for "we trained on samples from a 14B diffusion model." They do have a story for parameterised classical sim with provenance.
  5. Long-tail control. Generating "child runs from parked van at dusk in 23°C drizzle on a 6.5%-grade road" requires either prompt engineering against a model that may or may not respect every constraint, or a structured controller. Classical sim has structured controllers natively.

F.3 The hybrid that's actually winning

The compromise that has emerged in 2025: classical sim for scenario, layout, physics, and labels; Cosmos Transfer (or equivalent) on top for photoreal style. This is exactly the Cosmos Transfer / DRIVE Sim + Cosmos pattern. NVIDIA does it. Wayve does a variant where they run Sim-rendered geometry through GAIA-2's layout-to-pixel head. Waabi has hinted at a similar pipeline.

This is good news for Applied Intuition. The classical-sim moat (scenario authoring, ground truth, certifiability) is preserved; the photoreal-pixels moat was already eroding regardless of Cosmos. The right product strategy is to own the conditioning stack (scenario language, BEV layout, ground truth) and treat the photoreal renderer as a swappable backend, possibly Cosmos itself.

F.4 A useful framing

Layer                        | Owned by                      | Cosmos threat
Scenario authoring & ODD     | Applied Intuition             | low
Ground truth labels          | classical sim                 | low
Vehicle dynamics, sensor sim | physics engine                | low
Photoreal rendering          | classical render → WM         | high
Perception training data     | classical sim → hybrid        | medium
Closed-loop planner eval     | classical sim → WM-augmented  | medium

The product team's job is to make sure the photoreal-rendering layer is swappable so customers can choose between Unreal-style rendering, neural rendering, and Cosmos Transfer, and route the choice by use case (certified safety case → Unreal; perception data → Cosmos).


G. Open problems

These are the technical risks that determine whether the WM bet pays off.

  1. 3D / multi-view consistency. Most WMs are 2D-token models; preserving geometry across 5+ surround cameras over 10s is not solved. MagicDrive3D and GAIA-2 are the best public attempts.
  2. Long-horizon coherence (>30s). Identity drift, object permanence, and accumulated error all kick in past ~10s. Genie 3 and Cosmos Predict 14B push this but don't solve it.
  3. Action-conditioning fidelity. Commanded ego trajectory vs. realised trajectory in the generated video drifts. Quantitatively measured by WorldModelBench and similar; current frontier is ~1–2m lateral drift over 5s — too high for closed-loop planner eval.
  4. Physical plausibility. Per benchmarks above, ~50% on hard tests. Especially poor on contact, deformation, fluids, articulation.
  5. Compute cost. A Cosmos-14B 5s 720p multi-camera generation is in the tens of seconds on an H100; an Unreal frame renders in milliseconds. For high-volume training data you can amortise this; for closed-loop sim at 10 Hz it's a non-starter today.
  6. Label generation. "Do you get cuboids out of a generated frame?" is the single most-asked question by AV customers. Today the answer is via auto-labelling with a separately trained detector, which means your generated data is only as labelled as your detector is good — and on long-tail scenes, the detector is the bottleneck. Cosmos-Drive-Dreams sidesteps this by generating from labels rather than labelling generations, which is the right move but constrains diversity.
  7. Distribution shift / hallucination. Out-of-distribution prompts ("snowy night in Phoenix") generate plausible-looking but quietly wrong scenes. Hard to detect without a ground-truth reference.
  8. Open evaluation. No standard "is this WM useful for AV?" benchmark exists yet. Bench2Drive, NAVSIM, and CARLA Leaderboard 2.0 are planning benchmarks, not WM benchmarks.

H. Implications for the user (Applied Intuition Data Intelligence)

H.1 What to read first (in this order)

  1. Cosmos paper — the most important single artefact. Read all of it; the data-curation appendix matters.
  2. GAIA-1, GAIA-2 (arXiv 2503.20523), and GAIA-3 blog — the production-pull counterpart, including the explicit shift from generation to evaluation.
  3. V-JEPA 2 paper — the rival paradigm, plus Meta's robot evaluation.
  4. DriveDreamer-2 and Drive-WM — the academic landscape.
  5. VideoPhy and PhyGenBench — the reality-check benchmarks.
  6. Cosmos-Transfer1 (arXiv 2503.14492) and the Cosmos-Drive-Dreams paper (arXiv 2506.09042) — the hybrid sim+WM pattern and the open synthetic-data release.
  7. The Ha & Schmidhuber world-model paper and DreamerV3 — for the RL grounding so you don't get bluffed by the term.
  8. Genie 3 blog — the strongest 2025 demonstration of an interactive WM and where the pixel-WM frontier is.

H.2 Repos and tooling worth running

H.3 Concrete first projects (in escalating ambition)

  1. Reproduce a Cosmos Predict driving rollout on nuScenes. Take a public Cosmos-Drive checkpoint, condition on a real nuScenes scene's first frame + ego trajectory, generate 5s of multi-camera future, and overlay the ground-truth camera footage. Measure pixel and trajectory drift. ~1–2 weeks. Forces you to feel the action-conditioning failure modes first-hand.
  2. Hybrid sim project: CARLA + Cosmos Transfer. Render a CARLA scenario with ground-truth boxes, run Cosmos Transfer to photorealise it, retrain a YOLO/DETR detector on the photoreal version, and benchmark on real nuScenes. Quantify the sim2real gap closure. ~3–4 weeks. This is the workflow Applied Intuition customers will ask about.
  3. Toy GAIA-1 on nuScenes. Train a small (~500M) autoregressive video tokeniser + transformer on nuScenes front-camera + ego-action, ~4–6 weeks on a single 8xH100 node. You'll learn why GAIA-2 is hard.
  4. WM-as-evaluator. Use Cosmos Reason or a fine-tuned VLM to score generated driving videos on physical plausibility, then do prompt-search to find planner failures. This is the closest to Applied Intuition's actual product surface.
  5. Action-conditioning probe. Take an open WM (Cosmos Predict, IRASim) and quantitatively measure trajectory-following error vs. commanded action. Publishable as a short paper if done carefully — and exactly the kind of internal capability assessment a Data Intelligence team should own. A minimal drift-metric sketch follows below.
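
For projects 1 and 5, the drift measurement itself is simple once you can recover the ego trajectory from the generated video (e.g. with an ego-pose or odometry estimator, which is the hard, hypothetical step not shown here):

```python
# Minimal sketch of a trajectory-drift metric for the action-conditioning probe.
# `realised_xy` is assumed to come from a hypothetical ego-pose / odometry
# estimator run on the generated video; that recovery step is not shown.
import numpy as np

def drift_metrics(commanded_xy: np.ndarray, realised_xy: np.ndarray) -> dict:
    """Both inputs are (T, 2) ego positions in metres over the rollout horizon."""
    per_step = np.linalg.norm(commanded_xy - realised_xy, axis=1)
    return {
        "ade_m": float(per_step.mean()),   # average displacement error
        "fde_m": float(per_step[-1]),      # final displacement error at the horizon
        "max_lateral_m": float(np.abs(commanded_xy[:, 1] - realised_xy[:, 1]).max()),
    }

# Toy usage: a commanded straight line vs. a rollout that drifts laterally
# (second column is the lateral axis in this toy frame).
t = np.linspace(0, 5, 50)
commanded = np.stack([10.0 * t, np.zeros_like(t)], axis=1)                # 10 m/s straight ahead
realised = commanded + np.stack([np.zeros_like(t), 0.3 * t], axis=1)      # 0.3 m/s lateral drift
print(drift_metrics(commanded, realised))
```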

H.4 The strategic frame to bring to the team

The right framing for Applied Intuition's Data Intelligence team is not "do we build a Cosmos competitor" — they would lose. It is:

  • Own the conditioning (scenario language, BEV/HD-map, ground-truth labels) — that's the durable moat.
  • Treat the photoreal renderer as a backend that customers swap (Unreal / Cosmos Transfer / a future model) by use case.
  • Evaluate WMs rigorously as a service to customers — most AV teams can't tell a good WM from a bad one and would pay for that judgment.
  • Generate from labels, not label generations — Cosmos-Drive-Dreams' key insight; Applied Intuition's existing scenario authoring already produces the labels.
  • Be the certified path — the safety-case story for classical sim is real and not threatened by WMs in any near-term regulatory framework.

The biggest risk is not Cosmos. It is that customers conclude "real driving data + fleet learning > any synthetic" and route around all sim vendors. World models actually help the synthetic-data category here, by raising the realism ceiling. The fight is between sim+WM vendors and pure fleet-data plays (Tesla, Wayve, Waymo), not between sim vendors and Cosmos.


Sources

World-model foundations

Generative video

Driving world models

Robotics world models

The physics-understanding debate

Classical synthetic-data context