Physical AI
Research·Doc 09·~50 min

Research Frontier and 2026–2030 Outlook

Forward-looking companion to docs 01–06. Last verified: 2026-05-08.

This is the opinion doc. The other docs in this folder describe the field as it is; this one describes where it is going, what could derail the trajectory, and how to keep your reading list calibrated. Read 00-overview.md first if you have not — every prediction here connects back to a thread already developed in 01–06.


A. The 2026 research frontier — what's hot right now

A.1 Active arXiv subareas

Eight subareas are absorbing most of the publication oxygen in cs.RO, cs.LG, cs.CV, and cs.AI as of May 2026. They are not independent — each is in some sense a different attack on the "data and evaluation are the bottleneck" thesis from 00 §1.

  1. VLA scaling and compositional generalization. The frontier moved past "does the recipe scale?" (yes, but with diminishing returns on within-embodiment data) to "does it compose?" Physical Intelligence's π0.7 is the canonical reference — a single steerable policy that recombines learned skills into novel long-horizon tasks (espresso, laundry on hardware never seen during training). The open direction is post-training composition — RLHF-style preference tuning, scenario-conditioned distillation, and skill-graph search over a frozen base policy.

  2. World models as evaluators, not just generators. GAIA-3 (Wayve, December 2025, 15B params) reframes WMs from "generate plausible video" to "re-drive a logged scenario with parameterised variations while the rest of the world stays fixed" — the closed-loop counterfactual primitive that AV V&V actually needs (06 §E.3). ReSim (NeurIPS 2025) and Cosmos-Drive-Dreams attack the same problem from different conditioning sides.

  3. Long-horizon VLA composition with hierarchical planners. Gemini Robotics 1.5 / ER 1.5 explicitly splits the stack: a thinking VLM (ER 1.5) that plans and a low-level VLA (1.5) that executes. This is the same pattern as Figure's Helix 02 System 0/1/2 architecture. The bet is that language is the right interface between modules — and that the same trick gives you interpretability for safety arguments without sacrificing E2E learnability.

  4. Closed-loop, scenario-based driving evaluation. Open-loop displacement error is dead as a primary metric; NAVSIM v2 (CoRL '25), nuPlan-R, DriveE2E, and Bench2Drive together cover the grid of (real-world traces vs. CARLA) × (reactive vs. non-reactive) sim. The unsolved problem is agent realism (see §A.2 and §C.6).

  5. Cross-embodiment learning with embodiment normalization. X-VLA, Universal Actions (CVPR '25), GR00T N1.5, and the LeRobotDataset v3 format are converging on a separation: a large pretrained VLA backbone + a thin, embodiment-specific action head learned with kinematic-aware soft prompts. Empirical scaling laws now explicitly favour more embodiments over more trajectories per embodiment.

  6. Multi-view consistent neural rendering for AV closed-loop sim. SplatAD, DrivingGaussian, CoDa-4DGS (ICCV '25), SplatFlow, and feed-forward variants like ReconDrive convert logged drives into editable 4D digital twins. This is the real-to-sim pillar of the hybrid sim pattern from 03 §G.

  7. World-model-driven reward shaping and data flywheels. V-JEPA 2 (Meta, June 2025) trained on >1M hours of internet video plus 62 hours of robot data demonstrates zero-shot pick-and-place planning by imagining roll-outs against a learned video predictor. The thesis: WMs are training-data factories and dense reward signals.

  8. Robust evaluation that survives perturbation. LIBERO-PRO was the most painful paper of the year for VLA shops: SoTA models that score >90% on standard LIBERO collapse to 0.0% under modest object/instruction/layout perturbations. The open question — quietly reshaping every internal eval suite — is which benchmarks measure rote memorisation versus genuine generalization.
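
The LIBERO-PRO finding is easy to check for in your own eval suite. Below is a minimal sketch of a perturbation harness in plain Python; the episode structure, `perturb_episode`, and `robustness_gap` are illustrative names of mine, not the benchmark's actual API, and the instruction rewrite is a toy stand-in for an LLM paraphraser.

```python
import random

# Sketch of a LIBERO-PRO-style robustness check (illustrative, not the
# benchmark's API): score a policy on nominal episodes and on modestly
# perturbed variants, and report the gap between the two.

def perturb_episode(episode, rng):
    """Return a copy of the episode with modest layout/instruction changes."""
    perturbed = dict(episode)
    # Jitter object positions by a few centimetres.
    perturbed["objects"] = {
        name: (x + rng.uniform(-0.05, 0.05), y + rng.uniform(-0.05, 0.05))
        for name, (x, y) in episode["objects"].items()
    }
    # Toy paraphrase; a real harness would use an LLM rewrite here.
    perturbed["instruction"] = episode["instruction"].replace("pick up", "grasp")
    return perturbed

def robustness_gap(policy, episodes, n_variants=5, seed=0):
    """Nominal success rate minus success rate under perturbation."""
    rng = random.Random(seed)
    nominal = sum(policy(e) for e in episodes) / len(episodes)
    variants = [perturb_episode(e, rng)
                for e in episodes for _ in range(n_variants)]
    perturbed = sum(policy(v) for v in variants) / len(variants)
    return nominal - perturbed
```

A policy that memorised object positions scores perfectly on the nominal split and collapses under perturbation; the gap is the number worth putting on the internal dashboard.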

A.2 The 15 papers a 2026 frontier engineer should know

A working bibliography. Each is verified against its primary source.

Driving world models and synthetic data

Robotics foundation models

Evaluation, benchmarks, datasets

If you read those fifteen, you can hold a credible conversation with anyone in the field.

A.3 Conference signals

The publication firehose has consolidated around six venues plus one standing dataset challenge; the dates and standout tracks tell you what to read.

| Venue | Key signal | Why it matters |
| --- | --- | --- |
| CVPR 2026 (June, Nashville) | Workshop chair: Hongyang Li, OpenDriveLab | Continues the Autonomous Grand Challenge (NVIDIA won 2024 and 2025 in End-to-End Driving). 2026 themes: generalisable embodied systems, scenario gen, world models. |
| ICLR 2026 (April–May) | Strong VLA showing: 164 VLA submissions analysed by Reuss | Discrete-diffusion VLAs, reasoning VLAs, scaling-law papers. The frontier-vs-academia gap on VLA budget is now well documented. |
| CoRL 2025 (Seoul, Sept–Oct) | Robot Data Workshop, Generalizable Priors | The de facto VLA conference now. Workshops are where the contrarian work shows up. |
| NeurIPS 2025 | Embodied World Models for Decision Making workshop | The single best signal on world-model trajectory. |
| ICRA 2026 | Dexterous manipulation, humanoid bring-up | More hardware-and-systems flavoured than CoRL. Watch for ablations against teleop-collected priors. |
| RSS 2026 | RoboMIND v2 venue, AgiBot World follow-ups | Smaller, deeper. Check for sim-to-real and benchmark papers. |
| CVPR Waymo Open Dataset Challenges | 4 tracks in 2025: E2E driving, scenario gen, interaction prediction, sim agents | The 2025 Sim Agents winner was SMART-R1 at 0.7858 realism. The realism ceiling has barely moved year-over-year (see §C.6). |

The CoRL workshops on Robot Data and on Generalizable Priors are where you would have first seen V-JEPA 2 and X-VLA in pre-publication form. Subscribe to both.


B. 2026 industry signals — what's just shipped or about to ship

B.1 Models

  • Cosmos 2.5 family is GA. Cosmos-Predict 2.5, Cosmos-Transfer 2.5, Cosmos-Reason 2, released Oct 6 2025; up to 30-second multi-view driving video, ~10× higher accuracy when post-trained on proprietary data. Cosmos 3 was announced at CES 2026 as the first WFM unifying world generation + vision reasoning + action simulation; full release expected later in 2026.
  • GAIA-3 (Wayve, Dec 2025). Internal studies show 5× reduction in synthetic-test rejection rates (Wayve press). No GAIA-4 announcement as of May 2026 — speculation only.
  • Helix 02 (Figure, 2026) is shipping in pilot deployments alongside the Figure 03 hardware (5'8", 61 kg, 5 hr battery, 20 kg payload, palm cameras + tactile).
  • Gemini Robotics 1.5 / ER 1.5 in private preview; ER 1.6 available via the Gemini API.
  • Tesla FSD v14.3 rolled out March 2026; v15 targeted Q4 2026 / Q1 2027, full rewrite of AI compiler/runtime to MLIR (20% faster reaction time). Unsupervised FSD targeted Q4 2026 — Musk's confidence interval, not yours.
  • π0.7 (late 2025) is the strongest open-recipe signal that compositional VLAs work without RL specialisation.

B.2 Datasets

B.3 Hardware

  • Jetson AGX Thor GA: $3,499 dev kit / $2,999 module @ 1k units; 2,070 FP4 TFLOPS; 7.5× the AI perf and 3.5× the efficiency of AGX Orin. Early adopters: Agility Robotics (Digit Gen 6), Amazon Robotics, Boston Dynamics, Caterpillar, Figure, Hexagon, Medtronic, Meta. This is the on-robot inference target every 2026 humanoid stack assumes.
  • DRIVE AGX Thor dev kit shipping September 2025 — this is the AV-side variant.
  • Apple Vision Pro has not (yet) materially affected teleop economics — quality is unmatched but $3.5k unit price + iOS-only data egress make it a research toy. Watch for the rumoured cheaper variant.

B.4 Funding

B.5 Regulation

  • EU AI Act: high-risk system obligations technically apply August 2 2026. Risk: the Digital AI Omnibus (proposed Nov 2025) would defer to Dec 2 2027. Trilogue ongoing as of May 2026 — assume some August 2026 obligations stick.
  • UN R157: Supplement 2025/64 added clearer lane-change and heavy-vehicle rules; speed limit raised from 60 → 130 km/h in 2024. GRVA 25th session second part: 23 June 2026. Expect 2026–2027 to settle the L3 highway-pilot rules and start L4 urban work.
  • California / Texas / Florida: Waymo, Zoox, Cruise (returned), and Tesla robotaxi operations all under active 2026 PUC/DMV rule-making. Texas SB 2205 now governs commercial autonomous trucking.

C. 2027–2028 research directions — calibrated predictions

Format: Direction · Current evidence · What would need to be true · confidence label.

C.1 Generalist VLA at scale (medium confidence)

Evidence: π0.7 shows clean compositional gains; OpenVLA/X-VLA scaling laws favour more embodiments. What needs to be true to keep working through 2028: (a) robot-data growth continues to compound (AgiBot/RoboMIND/Open X-Embodiment trajectory); (b) RL post-training becomes routine; (c) compute on Jetson Thor-class chips is sufficient for 7B-class policies at 30+ Hz. What blocks it: the LIBERO-PRO result. If the 2026 best models stay at 0.0% under perturbation, the entire VLA scaling story is partially fake — and the field reverts to specialist policies + a planner.

C.2 World models as primary closed-loop simulator (medium-low confidence)

Evidence: GAIA-3, ReSim, Cosmos-Predict 2.5 closing on action conditioning. Hybrid (classical sim + WM photoreal augmentation) is the production pattern (03 §G). What needs to be true: WMs reach <10 cm action-conditioning drift over 5 s (06 §E) and ship reproducible re-rendering with a fixed seed. What blocks it: the seed-reproducibility problem in diffusion-based WMs. Simulators need bit-exact replay for regression eval; current WMs are stochastic enough that this is non-trivial. Prediction: by end of 2028, classical sim is still the spine for V&V, but WMs own >50% of photoreal augmentation and ~30% of scenario-counterfactual generation.
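
The seed-reproducibility requirement is concrete enough to unit-test. A toy sketch follows, with `random.Random` standing in for a diffusion sampler; a real world model additionally has to pin GPU-kernel nondeterminism and op ordering, which is the genuinely hard part.

```python
import random

# Sketch of the bit-exact-replay check a regression suite needs: run the
# same stochastic rollout twice from the same seed and demand identical
# trajectories (exact equality, not tolerance-based closeness).

def rollout(seed, steps=50):
    """Toy stochastic 'world-model' rollout: a seeded 1-D random walk."""
    rng = random.Random(seed)
    state, trajectory = 0.0, []
    for _ in range(steps):
        state += rng.gauss(0.0, 1.0)   # stand-in for sampled dynamics noise
        trajectory.append(state)
    return trajectory

def is_bit_exact(seed):
    """True iff two replays from the same seed agree exactly."""
    return rollout(seed) == rollout(seed)
```

For the stdlib case this trivially passes; the point is that a diffusion-based WM behind the same interface frequently does not, and that is what blocks its use as a regression simulator.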

C.3 Cross-embodiment data infrastructure (high confidence)

Evidence: LeRobot v3.0 + Open X-Embodiment is the open spine; Applied Intuition, NVIDIA NeMo Curator, and Voxel51 FiftyOne are the proprietary contenders. The AgiBot/RoboMIND scale is now too large for ad-hoc tooling. What needs to be true: embodiment normalisation becomes standardised (kinematic skeleton, action-space mapping, sensor calibration); large enterprises adopt either LeRobot or a vendor wrapper. Prediction: by 2028 there is a de-facto standard for robot-data exchange (probably LeRobot-format-compatible), and 2–3 enterprise products competing on top of it. The "robot Voxel51" is an acquisition target by 2028.

C.4 SOTIF-defensible auto-labeling (high confidence on demand, medium on supply)

Evidence: 04 §B — auto-labels are everywhere but provenance, calibration, and uncertainty metadata are not standardised. ISO 21448:2022 is in force; EU AI Act enforcement adds traceability obligations. What needs to be true: a labeling-tool vendor (Applied Intuition, Encord, Scale, or a new entrant) publishes a label-provenance schema and gets it adopted. Prediction: by 2027 there is a working draft of an ASAM OpenLABEL extension for label provenance + uncertainty. Whoever owns it owns the regulatory choke point — and your team is well-placed to author it.
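
For a feel of what such a schema would have to carry, here is a sketch as a Python dataclass. The field names are assumptions of mine, not text from any ASAM draft; the point is that source, model identity, calibrated confidence, and an audit trail all travel with the label.

```python
from __future__ import annotations
from dataclasses import dataclass, field, asdict

# Sketch of a label-provenance record (field names are illustrative
# assumptions, not an ASAM OpenLABEL extension).

@dataclass
class LabelProvenance:
    label_id: str
    source: str                  # "auto" | "human" | "auto+human_verified"
    model_name: str | None       # auto-labeler checkpoint, if any
    model_version: str | None
    confidence: float            # calibrated score in [0, 1]
    calibration_method: str      # e.g. "temperature_scaling"
    human_verified: bool = False
    audit_trail: list[str] = field(default_factory=list)

    def verify(self, reviewer: str) -> None:
        """Record a human verification event in the audit trail."""
        self.human_verified = True
        self.audit_trail.append(f"verified_by:{reviewer}")
```

`asdict(record)` gives the JSON-ready form a pipeline would attach to each exported label.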

C.5 Real-to-sim digital twins via 4DGS at fleet scale (medium confidence)

Evidence: SplatAD, DrivingGaussian, CoDa-4DGS, ReconDrive's feed-forward variant. What needs to be true: per-scene optimisation cost drops below ~1 GPU-hour per minute of drive log (currently 5–20×). Feed-forward 4DGS networks must generalise. Prediction: by 2028, every Tier-1 AV team runs an offline 4DGS twin of their fleet's hardest 1–10% of drives, and uses it for closed-loop replay with planner perturbations. Synthetic photons are mostly free; the bottleneck is labels and physics consistency.

C.6 Sim agents realism (low confidence — the ceiling is real)

Evidence: WOSAC 2025 winner SMART-R1 at 0.7858 on the realism meta-metric. Year-over-year improvement is 1–3% — slow. nuPlan-R is the new arena (Nov 2025 paper). What needs to be true: either reactive WMs solve the multi-agent generative consistency problem (no current evidence of a phase change), or the field redefines "realism" away from log-likelihood toward causal counterfactual fidelity. Prediction: realism plateaus near 0.82 on WOSAC by 2028; the metric itself gets replaced.

C.7 Reward modeling and RLHF for E2E driving (medium confidence)

Evidence: PE-RLHF (2025), Tesla's preference-tuning loop, Wayve's "ai-driver" preference data. The post-training LLM playbook transfers cleanly. What needs to be true: OEMs accept that some behavioural choices are matters of preference rather than safety (safety stays with ISO 21448 / 26262) and therefore require labelled preference data. Prediction: by 2027, every Tier-1 AV stack has an RLHF post-training stage. The labour pool shifts: comfort raters / ride-experience experts become a recognised data category, parallel to RLHF annotators in LLM land. This is a direct extension of the Mercor-style neutral-data tier.

C.8 Foundation-model labeling collapse (medium-high confidence)

Evidence: Auto-labels from Grounded-SAM 2 + GPT-4o-mini are >85% as good as humans on common labels at 1/100 cost (04 §B). What humans still uniquely do: edge cases, calibration, semantic disagreements. What needs to be true: uncertainty-aware sampling routes only the contested labels to humans. Prediction: by 2028, ≥90% of bulk labels in production AV pipelines are foundation-model auto-labels with confidence-thresholded human verification. Pure-bulk labeling vendors (the post-Datagen, post-Synthesis-AI category) consolidate or pivot upmarket. The neutral-data tier expands on the high-skill, expert-evaluation side.
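
The routing described above is a few lines of code once the confidence scores are calibrated; the economics live entirely in the threshold. A sketch with made-up unit costs (`route_labels` and `blended_cost` are illustrative names, not any vendor's API):

```python
# Sketch of confidence-thresholded human verification: auto-labels above a
# calibrated threshold ship as-is; everything else is queued for humans.
# Threshold and unit costs below are illustrative assumptions.

def route_labels(labels, threshold=0.9):
    """Split auto-labels into (accepted, needs_human) by calibrated confidence."""
    accepted, needs_human = [], []
    for label in labels:
        (accepted if label["confidence"] >= threshold else needs_human).append(label)
    return accepted, needs_human

def blended_cost(labels, threshold=0.9, auto_cost=0.01, human_cost=1.0):
    """Expected cost per label: every label pays the auto cost, and the
    contested fraction additionally pays the human-verification cost."""
    _, needs_human = route_labels(labels, threshold)
    total = auto_cost * len(labels) + human_cost * len(needs_human)
    return total / len(labels)
```

At a 100:1 cost ratio, the blended cost per label is dominated by the fraction routed to humans, which is why uncertainty calibration, not labeling throughput, becomes the pipeline's key competency.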

C.9 Neutral-data tier consolidation (medium confidence)

Evidence: Mercor 5×'d its valuation in 9 months. Surge bootstrapped to $1B+ revenue. Scale's commercial-LLM business was hollowed out by the Meta deal in 2025. What blocks consolidation: frontier labs explicitly want diverse, non-overlapping vendor pools — Anthropic, OpenAI, and Google won't all buy from the same shop. Prediction: 2027 sees one merger or one new entrant funded above $5B. By 2028, three survivors: Mercor, Surge, and one of {Turing, Invisible}. Scale becomes a defense / government specialist (its 2025 pivot is already visible). Note: this matters for your day job because Applied Intuition's auto-labeling tooling has to interoperate with whichever vendor the OEM customer chose.


D. 2029–2030 outlook — testable forecasts

These are calibrated bets. Each is testable. Each has an implicit "how I would know I was wrong" attached.

D.1 Robotaxi deployment — high confidence

By end-of-2030, the global driverless commercial robotaxi fleet exceeds 100,000 vehicles, with Waymo + Tesla + Apollo Go (Baidu) + Pony.ai accounting for >80%. Waymo's 10-cities-by-Feb-2026 / 17-cities-by-end-2026 / 1M-rides-per-week target sets a 5-year compounding base rate. Falsifier: if Waymo is still <5,000 vehicles at end-of-2027, this is wrong.

D.2 Humanoid commercial deployment — medium confidence

By end-of-2030, ≥3 humanoid OEMs ship >10,000 commercial robots in industrial / logistics settings under paid RaaS or capex contracts. Most likely roster: Apptronik (Mercedes/GXO), Figure (BMW + a US logistics customer), Agility (Amazon/GXO), one Chinese player (Unitree/UBTech/Fourier). Goldman / BoA project 50–100k units in 2026 alone; compounding to 2030 is plausible. Falsifier: if no humanoid OEM is at >2,000 deployed units by end-of-2027, downgrade. Home / household humanoids remain rare in 2030 — high confidence on that subprediction.

D.3 Synthetic-data market consolidation — high confidence

By 2030, the standalone "rendering-only" synthetic-data category is gone (03 §F was already half-confirmed in 2025 with Datagen + Synthesis AI shutdowns). Survivors are vertically integrated either with sim platforms (Applied, NVIDIA) or with labeling+curation (Encord, Voxel51). The synthetic-data spend itself grows ~3–5× from 2025 levels; it migrates from line-item to embedded service.

D.4 Foundation-model-as-labeler economics — high confidence

By 2030, labeling cost per object-frame on common categories drops by ≥90% from 2024 baseline. Humans remain in the loop for: rare classes, calibration, audit, ambiguous semantics, and any safety-case-relevant ground truth. The total labeling spend is flat or up because the category mix shifts upmarket (3D + temporal + behavioural + preference labels).

D.5 ML-perception standard — medium confidence

By 2030, ISO publishes a successor or extension to ISO 21448 specifically for learned perception components. Most likely vehicle: an ISO/IEC TR or PAS that bridges 21448 with ISO 5469 (functional aspects of AI systems). UN R157 has its L4-urban amendment by 2029. Falsifier: if no ISO ML-perception PAS is in committee draft by end-2027, downgrade.

D.6 Compute economics — medium-low confidence

By 2030, the cost of generating one minute of multi-camera, action-conditioned, photoreal driving video at 1080p approaches the cost of rendering the same minute in Unreal/Omniverse with classical assets (i.e., within 2×, on the same hardware). This is the moment WMs become the default renderer for non-safety-critical V&V. Falsifier: if WM cost is still >10× Unreal cost in 2028, downgrade aggressively — most predictions in §C.2 also weaken.
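
The falsifier is just a cost ratio, worth writing down so the checkpoints stay unambiguous. A sketch; the 2× and 10× thresholds come from the paragraph above, and any input numbers are placeholders to be replaced with measurements, not data.

```python
# The §D.6 falsifier as explicit arithmetic: track the ratio of world-model
# rendering cost to classical (Unreal/Omniverse) rendering cost per minute
# of footage. Thresholds are from the prediction above; inputs are
# placeholders, not measurements.

def wm_cost_ratio(wm_gpu_hours_per_min, classical_gpu_hours_per_min):
    """Cost ratio of WM rendering to classical rendering for one minute."""
    return wm_gpu_hours_per_min / classical_gpu_hours_per_min

def d6_verdict(ratio):
    """Map the ratio onto the §D.6 decision rule."""
    if ratio <= 2.0:       # "within 2x": WMs become the default renderer
        return "wm_default_renderer"
    if ratio > 10.0:       # still >10x in 2028: downgrade aggressively
        return "downgrade"
    return "hold"
```

Re-running this each quarter with measured GPU-hour figures is the cheapest way to keep the §C.2 and §D.6 predictions honest.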

D.7 Market structure — medium confidence

By 2030, NVIDIA owns the foundation-model + silicon + low-level robotics-OS layer; Applied Intuition owns the V&V + scenario + ingestion + safety-case layer, with one credible competitor (Foretellix or a new entrant) at half the share; a LeRobot-derivative owns the open-data exchange standard. There is no AWS-of-Physical-AI; the layer cake stays vertically split. Reason: the regulated portions (safety-case, governance) cannot be commoditised the way storage/compute were.


E. Disruption candidates — what could invalidate the thesis

The point of this section is to keep an honest fail-set in mind. Each candidate names the trigger event, the broken assumption, and the early signal a working AI Data Intelligence engineer should track.

  1. Tesla open-sources FSD-grade fleet labels. Trigger: Musk decides labels are not the moat; releases >100M labeled frames + fleet trigger data. Breaks: the high-price assumption behind the 04 §C curation flywheels and most of the synthetic-data economy. Early signal: Tesla AI Day announcements; rumour of label-API sale to Tier-1s; FSD v15 docs hint at an external-data offering.

  2. A frontier WM closes the action-conditioning gap to <10 cm at 5 s. Trigger: a Cosmos-3 or GAIA-4 paper with rigorous closed-loop benchmarks on planner perturbations. Breaks: the V&V-on-classical-sim spine assumption (03, 05). Early signal: WOSAC realism jumps >0.85; nuPlan-R reactive sim agents pass a Turing-style test against logged human drives.

  3. A "robot internet" startup aggregates millions of teleop hours cheaply. Trigger: a $50M-funded company ships a teleop-as-a-service rig at $5/hr fully loaded, with an Apple Vision Pro–class capture stack. Open-licensed, embodiment-normalised data exceeds 10× Open X-Embodiment within 18 months. Breaks: the robotics-data scarcity argument (02 §C). Early signal: Y Combinator batch with two such companies; Mercor or Surge launches a robotics vertical.

  4. EU AI Act bans certain synthetic training data categories. Trigger: the August 2026 enforcement (or Dec 2027 if Omnibus passes) interpreted as forbidding deepfake-class WM outputs in safety-critical training. Breaks: the synthetic-flywheel thesis for EU-regulated AVs. Early signal: AI Office guidance memos in 2026; ENISA reports on WM provenance.

  5. A Mojo-class language obsoletes PyTorch tooling. Trigger: a Mojo / Triton-2 / JAX-XLA push delivers a 2–3× perf advantage and the major frontier labs migrate. Breaks: every internal pipeline built on PyTorch + LeRobot. Low probability (high inertia) but disruptive. Early signal: Modular hires from JAX team; Anthropic / OpenAI commit to a non-PyTorch stack.

  6. OEMs internalise the data engine. Trigger: VW or Toyota stands up a 500-person internal Data Intelligence org and stops buying from Applied. Breaks: the Series F thesis bet 2 (00 §How AI sits). Early signal: senior Applied or Voxel51 engineers leaving for OEMs.

  7. A robust universal eval benchmark emerges and is adopted across industry. LIBERO-PRO-style perturbation harnesses become the public default; SoTA models are forced to compete on robustness, not headline pass rates. Breaks: a lot of marketed numbers. This one is likely net-positive but disruptive to incumbents who marketed inflated success rates.


F. Continuous-learning pipeline — how to stay current

This is the operating manual part of the doc. It assumes you have ~30–60 min/day to spend on the field, not 4 hours.

F.1 arXiv flow

  • Subscribe to arXiv daily mailing lists for cs.RO, cs.LG (action learning subset), cs.CV (video gen, 3DGS). Volume: ~100 abstracts/day combined.
  • Filter through arxiv-sanity-lite (Karpathy) for personalised relevance.
  • alphaxiv and emergentmind.com publish curated weekly digests; alphaxiv has a useful "frontier" filter.
  • Hugging Face Daily Papers is the single highest-signal feed in 2026 — subscribe.

F.2 People to follow on X / Bluesky

These are the 18 accounts that, together, surface most of the relevant signal within 24–48 hours of publication. Curate one X list called "Physical AI" and add them.

F.3 Newsletters and blogs

F.4 Conference firehose strategy

Each major conference has 1,500–4,000 papers. The trick is filter, then triage, then deep-read.

  1. Pull the accepted-papers list as soon as it's posted.
  2. Run a keyword filter against your 6–8 active subareas (§A.1). Cuts to ~150–300.
  3. Read titles + 1-line abstracts. Cut to ~50–80.
  4. Read full abstracts + figures. Cut to ~15–25.
  5. Read the methods and ablations on the ~5–10 that survive.
  6. Reproduce the headline figure on 2–3 of them.

The whole flow is ~6–10 hours over a week. Without it, you're flying blind.
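
Steps 2–3 of the funnel mechanise cleanly. A sketch with illustrative subarea keywords (the scorer stands in for judgement; real triage replaces it with your own reading at each later cut):

```python
# Sketch of the keyword-filter stage of the conference funnel. The subarea
# keyword lists are illustrative, not a recommended taxonomy; substring
# matching is deliberately crude and good enough for a first cut.

SUBAREA_KEYWORDS = {
    "vla": ["vision-language-action", "vla", "manipulation policy"],
    "world_models": ["world model", "video prediction", "counterfactual"],
    "closed_loop_eval": ["closed-loop", "nuplan", "bench2drive"],
}

def score(paper, keywords=SUBAREA_KEYWORDS):
    """Count how many subareas have a keyword hit in the title or abstract."""
    text = (paper["title"] + " " + paper.get("abstract", "")).lower()
    return sum(any(k in text for k in ks) for ks in keywords.values())

def triage(papers, keep=300):
    """Steps 2-3: drop papers with no keyword hits, keep the top scorers."""
    relevant = [p for p in papers if score(p) > 0]
    return sorted(relevant, key=score, reverse=True)[:keep]
```

Run against a 3,000-paper accepted list, this gets you to the ~150–300 range in seconds; every cut after that is human judgement by design.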

F.5 Reading groups

  • Applied Intuition has internal paper-discussion groups; join from week one.
  • The Open Robotics Reading Group (Discord, ~weekly) covers VLA papers.
  • MLST (Machine Learning Street Talk) YouTube + the Latent Space pod are good commuting substitutes.

F.6 A weekly cadence (suggested)

| Day | Activity | Time |
| --- | --- | --- |
| Mon | arXiv + HF Daily Papers triage | 30 min |
| Tue | Deep-read 1 paper (rotating subarea) | 60 min |
| Wed | Industry blog round-up (Wayve / NVIDIA / OpenDriveLab) | 30 min |
| Thu | Reproduce 1 figure or 1 ablation | 60 min |
| Fri | X list scroll + write 5 sentences in a personal "open questions" doc | 30 min |
| Sat / Sun | Optional: 1 long-form blog or chapter (RLHF Book, Sutton & Barto) | 60 min |

F.7 If you only have 30 minutes a day

Run this minimal stack: HF Daily Papers (Mon/Wed/Fri) + Wayve Thinking + Import AI. It alone keeps you in the top decile of "knows what's happening" — and frees the rest of your day for actually building.


G. What's most likely to be wrong about this doc

In descending order of self-doubt.

  1. §C.6 sim-agents-realism plateau prediction. I'm calling a ceiling at 0.82 on WOSAC by 2028 based on a 1–3% YoY trajectory. A Cosmos-3 or a Gemini-3 with strong action conditioning could break this in a single jump. Early signal it's wrong: WOSAC 2026 winner > 0.82, or a paper showing closed-loop driving-from-WM-rollouts within 5% of from-real driving on Bench2Drive.

  2. §D.2 humanoid commercial deployment threshold (>10k units, ≥3 OEMs by 2030). I'm bracketing high — Goldman/BoA say 50–100k in 2026 alone. If they're right, my bound is too conservative; if humanoids hit a perception or reliability wall in 2027 industrial pilots, my bound is too optimistic. Early signal: by end-2027, count installed Apollo + Figure + Digit units. If <3,000 combined, downgrade.

  3. §C.9 neutral-data-tier consolidation timing. Calling 3 survivors by 2028 assumes Surge raises and stays disciplined. If Surge takes a $25B round and over-extends on robotics labeling, the survivor list could be Mercor + Turing + a new entrant. The frame is right, the names may be wrong.

  4. §D.7 market structure — AWS-of-Physical-AI. I'm betting no such thing emerges, and the layer cake stays split. NVIDIA's CES 2026 Cosmos 3 + Jetson Thor + Isaac stack push could prove this wrong if it captures the V&V/safety-case layer. Early signal: an OEM publicly chooses NVIDIA over Applied Intuition for a full-stack V&V commitment.

  5. The whole §C / §D framing assumes the current foundation-model paradigm holds through 2030. If V-JEPA-style joint-embedding non-generative WMs displace generative video models, much of the synthetic-data and renderer-cost economics changes. LeCun is loud about this; the 2025 V-JEPA 2 results are credible but not yet decisive. Early signal: a JEPA-class model wins WOSAC or NAVSIM v2 outright.

I would not be surprised if 2 of these 5 predictions look embarrassing in 2028. The document is most useful as a prior against which to update — re-read it quarterly with the early-signals checklist in hand.


H. Sources

Frontier papers and models

Datasets and benchmarks

Industry and funding

Regulation

Cross-references inside this folder


Last verified: 2026-05-08.