Research Frontier and 2026–2030 Outlook
Forward-looking companion to docs 01–06. Last verified: 2026-05-08.
This is the opinion doc. The other docs in this folder describe the field as it is; this one describes where it is going, what could derail the trajectory, and how to keep your reading list calibrated. Read 00-overview.md first if you have not — every prediction here connects back to a thread already developed in 01–06.
A. The 2026 research frontier — what's hot right now
A.1 Active arXiv subareas
Six subareas are absorbing most of the publication oxygen in cs.RO, cs.LG, cs.CV, and cs.AI as of May 2026. They are not independent — each is, in some sense, a different attack on the "data and evaluation are the bottleneck" thesis from 00 §1.
- VLA scaling and compositional generalization. The frontier moved past "does the recipe scale?" (yes, but with diminishing returns on within-embodiment data) to "does it compose?" Physical Intelligence's π0.7 is the canonical reference — a single steerable policy that recombines learned skills into novel long-horizon tasks (espresso, laundry on hardware never seen during training). The opening direction is post-training composition — RLHF-style preference tuning, scenario-conditioned distillation, and skill-graph search over a frozen base policy.
- World models as evaluators, not just generators. GAIA-3 (Wayve, December 2025, 15B params) reframes WMs from "generate plausible video" to "re-drive a logged scenario with parameterised variations while the rest of the world stays fixed" — the closed-loop counterfactual primitive that AV V&V actually needs (06 §E.3). ReSim (NeurIPS 2025) and Cosmos-Drive-Dreams attack the same problem from different conditioning sides.
- Long-horizon VLA composition with hierarchical planners. Gemini Robotics 1.5 / ER 1.5 explicitly splits the stack: a thinking VLM (ER 1.5) that plans and a low-level VLA (1.5) that executes. This is the same pattern as Figure's Helix 02 System 0/1/2 architecture. The bet is that language is the right interface between modules — and that the same trick gives you interpretability for safety arguments without sacrificing E2E learnability.
- Closed-loop, scenario-based driving evaluation. Open-loop displacement-error metrics are dead as primary metrics; NAVSIM v2 (CoRL '25), nuPlan-R, DriveE2E, and Bench2Drive together cover the grid of (real-world traces vs. CARLA) × (reactive vs. non-reactive) sim. The unsolved problem is agent realism (see §A.2 and §C.6).
- Cross-embodiment learning with embodiment normalization. X-VLA, Universal Actions (CVPR '25), GR00T N1.5, and the LeRobotDataset v3 format are converging on a separation: a large pretrained VLA backbone + a thin, embodiment-specific action head learned with kinematic-aware soft prompts. Empirical scaling laws now explicitly favour more embodiments over more trajectories per embodiment.
- Multi-view consistent neural rendering for AV closed-loop sim. SplatAD, DrivingGaussian, CoDa-4DGS (ICCV '25), SplatFlow, and feed-forward variants like ReconDrive convert logged drives into editable 4D digital twins. This is the real-to-sim pillar of the hybrid sim pattern from 03 §G.
- World-model-driven reward shaping and data flywheels. V-JEPA 2 (Meta, June 2025), trained on >1M hours of internet video plus 62 hours of robot data, demonstrates zero-shot pick-and-place planning by imagining roll-outs against a learned video predictor. The thesis: WMs are training-data factories and dense reward signals.
- Robust evaluation that survives perturbation. LIBERO-PRO was the most painful paper of the year for VLA shops: SoTA models that score >90% on standard LIBERO collapse to 0.0% under modest object/instruction/layout perturbations. The open question — quietly reshaping every internal eval suite — is which benchmarks measure rote memorisation versus genuine generalization.
A.2 The 15 papers a 2026 frontier engineer should know
A working bibliography. Each is verified against its primary source.
Driving world models and synthetic data
- GAIA-3 (Wayve, Dec 2025) — 15B-param controllable WM for evaluation; embodiment transfer; 5× reduction in synthetic-test rejection.
- Cosmos-Drive-Dreams (Ren et al., arXiv:2506.09042, June 2025) — HD-map-conditioned multi-view driving video; 81k synthetic clips released.
- ReSim (NeurIPS 2025) — reliable WM under non-expert actions via dynamics-consistency loss.
- EMMA (Waymo, 2024 → updated Sept 2025) — Gemini-derived multimodal E2E driving, the openly-described counterpart to Tesla v14.
Robotics foundation models
- π0.7 (Physical Intelligence, late 2025) — compositional generalization; matches RL-tuned π*0.6 specialists across tasks.
- Gemini Robotics 1.5 + ER 1.5 (DeepMind, Sept 2025) — hierarchical thinking-VLM + acting-VLA.
- Helix 02 (Figure, 2026) — full-body autonomy; System 0 whole-body controller trained on >1k hours of human motion.
- V-JEPA 2 (Meta, June 2025) — joint-embedding video WM, 62 hours of robot data sufficient for zero-shot planning.
- X-VLA (CoRL 2025) — soft-prompted cross-embodiment transformer.
- OpenVLA scaling work (Kim et al., ICML 2025) — embodiment scaling laws via OXE.
Evaluation, benchmarks, datasets
- LIBERO-PRO (Oct 2025) — perturbation-robust evaluation; collapses π0/π0.5/OpenVLA from 90%+ to 0.0%.
- nuPlan-R (Nov 2025) — diffusion-based reactive sim agents replacing IDM in nuPlan.
- Waymo Open Sim Agents Challenge 2025 (winner: SMART-R1) — first R1-style RFT winning a sim-agents leaderboard at 0.7858 realism meta.
- AgiBot World Colosseo (IROS 2025 best-paper finalist; TRO 2026) — 1M+ trajectories, 217 tasks, multi-embodiment.
- RoboMIND v2 (RSS 2025) and RoboMIND 2.0 (Dec 2025) — 107k → 310k+ trajectories, dual-arm + tactile + mobile.
If you read those fifteen, you can hold a credible conversation with anyone in the field.
A.3 Conference signals
The publication firehose has consolidated around six conferences (plus one recurring challenge series); the deadlines and standout tracks tell you what to read.
| Venue | Key signal | Why it matters |
|---|---|---|
| CVPR 2026 (June, Nashville) | Workshop chair: Hongyang Li, OpenDriveLab | Continues the Autonomous Grand Challenge (NVIDIA won 2024 and 2025 in End-to-End Driving). 2026 themes: generalisable embodied systems, scenario gen, world models. |
| ICLR 2026 (April–May) | Strong VLA showing — 164 VLA submissions analysed by Reuss | Discrete-diffusion VLAs, reasoning VLAs, scaling-law papers. The frontier-vs-academia gap on VLA budget is now well-documented. |
| CoRL 2025 (Seoul, Sept–Oct) | Robot Data Workshop, Generalizable Priors | The de facto VLA conference now. Workshops are where the contrarian work shows up. |
| NeurIPS 2025 | Embodied World Models for Decision Making workshop | The single best signal on world-model trajectory. |
| ICRA 2026 | Dexterous manipulation, humanoid bring-up | More hardware-and-systems flavoured than CoRL. Watch for ablations against teleop-collected priors. |
| RSS 2026 | RoboMIND and AgiBot World follow-ups | Smaller, deeper. Check for sim-to-real and benchmark papers. |
| CVPR Waymo Open Dataset Challenges | 4 tracks in 2025 — E2E driving, scenario gen, interaction prediction, sim agents | The 2025 Sim Agents winner was SMART-R1 at 0.7858 realism. Realism ceiling has barely moved year-over-year — see §C.6. |
The CoRL workshops on Robot Data and on Generalizable Priors are where you would have first seen V-JEPA 2 and X-VLA in pre-publication form. Subscribe to both.
B. 2026 industry signals — what's just shipped or about to ship
B.1 Models
- Cosmos 2.5 family is GA. Cosmos-Predict 2.5, Cosmos-Transfer 2.5, Cosmos-Reason 2, released Oct 6 2025; up to 30-second multi-view driving video, ~10× higher accuracy when post-trained on proprietary data. Cosmos 3 was announced at CES 2026 as the first WFM unifying world generation + vision reasoning + action simulation; full release expected later in 2026.
- GAIA-3 (Wayve, Dec 2025). Internal studies show 5× reduction in synthetic-test rejection rates (Wayve press). No GAIA-4 announcement as of May 2026 — speculation only.
- Helix 02 (Figure, 2026) is shipping in pilot deployments alongside the Figure 03 hardware (5'8", 61 kg, 5 hr battery, 20 kg payload, palm cameras + tactile).
- Gemini Robotics 1.5 / ER 1.5 in private preview; ER 1.6 available via the Gemini API.
- Tesla FSD v14.3 rolled out March 2026; v15 targeted Q4 2026 / Q1 2027, full rewrite of AI compiler/runtime to MLIR (20% faster reaction time). Unsupervised FSD targeted Q4 2026 — Musk's confidence interval, not yours.
- π0.7 (late 2025) is the strongest open-recipe signal that compositional VLAs work without RL specialisation.
B.2 Datasets
- AgiBot World (Beta, March 2025): 1M+ trajectories, 2,976 hours, 217 tasks, 3,000+ objects.
- RoboMIND v2 (RSS 2025) → RoboMIND 2.0 (Dec 2025): 107k → 310k+ trajectories; dual-arm + tactile + mobile.
- Cosmos-Drive-Dreams dataset (June 2025): 5.8k labeled clips + 81.8k synthetic videos.
- LeRobotDataset v3.0: chunked-episode + streaming format that finally makes OXE-scale (>400 GB) datasets workable in HF/PyTorch pipelines.
- Waymo Open Dataset 2025 Challenges (March 31 – May 22 2025) — see §A.3.
B.3 Hardware
- Jetson AGX Thor GA: $3,499 dev kit / $2,999 module @ 1k units; 2,070 FP4 TFLOPS; 7.5× the AI perf and 3.5× the efficiency of AGX Orin. Early adopters: Agility Robotics (Digit Gen 6), Amazon Robotics, Boston Dynamics, Caterpillar, Figure, Hexagon, Medtronic, Meta. This is the on-robot inference target every 2026 humanoid stack assumes.
- DRIVE AGX Thor dev kit shipping September 2025 — this is the AV-side variant.
- Apple Vision Pro has not (yet) materially affected teleop economics — capture quality is unmatched, but the $3.5k unit price plus visionOS-only data egress make it a research toy. Watch for the rumoured cheaper variant.
B.4 Funding
- Mercor: $350M Series C at $10B valuation, Oct 2025 (CNBC); $1B annualised revenue Feb 2026.
- Surge AI: reportedly raising at $25B valuation — bootstrapped, $1B+ revenue 2024.
- Turing: $111M Series E at $2.2B (March 2025).
- Apptronik: $520M Feb 2026 at ~$5B, Google + Mercedes + B Capital + Qatar.
- Applied Intuition: $600M Series F at $15B (June 2025), Kleiner + BlackRock co-led; 50M+ simulations and hundreds of PB of training data in 2025.
- Tesla shut down Dojo in August 2025 (TechCrunch); pivoted to AI5/AI6 + external NVIDIA/AMD/Samsung. Reinforces the 00 §1 thesis: compute is plentiful, data is the moat.
B.5 Regulation
- EU AI Act: high-risk system obligations technically apply August 2 2026. Risk: the Digital AI Omnibus (proposed Nov 2025) would defer to Dec 2 2027. Trilogue ongoing as of May 2026 — assume some August 2026 obligations stick.
- UN R157: Supplement 2025/64 added clearer lane-change and heavy-vehicle rules; speed limit raised from 60 → 130 km/h in 2024. GRVA 25th session second part: 23 June 2026. Expect 2026–2027 to settle the L3 highway-pilot rules and start L4 urban work.
- California / Texas / Florida: Waymo, Zoox, Cruise (returned), and Tesla robotaxi operations all under active 2026 PUC/DMV rule-making. Texas SB 2205 now governs commercial autonomous trucking.
C. 2027–2028 research directions — calibrated predictions
Format: Direction · Current evidence · What would need to be true · confidence label.
C.1 Generalist VLA at scale (medium confidence)
Evidence: π0.7 shows clean compositional gains; OpenVLA/X-VLA scaling laws favour more embodiments. What needs to be true to keep working through 2028: (a) robot-data growth continues to compound (AgiBot/RoboMIND/Open X-Embodiment trajectory); (b) RL post-training becomes routine; (c) compute on Jetson Thor-class chips is sufficient for 7B-class policies at 30+ Hz. What blocks it: the LIBERO-PRO result. If the 2026 best models stay at 0.0% under perturbation, the entire VLA scaling story is partially fake — and the field reverts to specialist policies + a planner.
C.2 World models as primary closed-loop simulator (medium-low confidence)
Evidence: GAIA-3, ReSim, Cosmos-Predict 2.5 closing on action conditioning. Hybrid (classical sim + WM photoreal augmentation) is the production pattern (03 §G). What needs to be true: WMs reach <10 cm action-conditioning drift over 5 s (06 §E) and ship reproducible re-rendering with a fixed seed. What blocks it: the seed-reproducibility problem in diffusion-based WMs. Simulators need bit-exact replay for regression eval; current WMs are stochastic enough that this is non-trivial. Prediction: by end of 2028, classical sim is still the spine for V&V, but WMs own >50% of photoreal augmentation and ~30% of scenario-counterfactual generation.
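The reproducibility requirement is easy to state as a contract even though diffusion WMs make it hard to meet. A minimal sketch of what a regression harness checks — `rollout` here is a hypothetical stand-in for a seeded WM sampler, not any real model's API; real diffusion stacks add GPU nondeterminism on top of the sampler seed, which is exactly the problem:

```python
import hashlib
import random

def rollout(seed: int, steps: int = 50) -> list[float]:
    """Stand-in for a stochastic WM rollout: a seeded sampler
    producing a scalar trajectory."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(steps)]

def fingerprint(traj: list[float]) -> str:
    """Hash a trajectory so regression eval can diff runs cheaply
    instead of storing and comparing full rollouts."""
    payload = ",".join(f"{x:.12f}" for x in traj).encode()
    return hashlib.sha256(payload).hexdigest()

# Bit-exact replay: same seed must give the same fingerprint.
assert fingerprint(rollout(seed=7)) == fingerprint(rollout(seed=7))
# A different seed is a different scenario variant, not a regression.
assert fingerprint(rollout(seed=7)) != fingerprint(rollout(seed=8))
```

The point of the fingerprint is that a CI job can assert equality across software versions; any WM that cannot pass this trivial contract cannot serve as a regression simulator.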
C.3 Cross-embodiment data infrastructure (high confidence)
Evidence: LeRobot v3.0 + Open X-Embodiment is the open spine; Applied Intuition, NVIDIA NeMo Curator, and Voxel51 FiftyOne are the proprietary contenders. The AgiBot/RoboMIND scale is now too large for ad-hoc tooling. What needs to be true: embodiment normalisation becomes standardised (kinematic skeleton, action-space mapping, sensor calibration); large enterprises adopt either LeRobot or a vendor wrapper. Prediction: by 2028 there is a de-facto standard for robot-data exchange (probably LeRobot-format-compatible), and 2–3 enterprise products competing on top of it. The "robot Voxel51" is an acquisition target by 2028.
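What "embodiment normalisation becomes standardised" means in practice is a mapping layer between robot-specific action spaces and a fixed canonical skeleton. A toy sketch, with hypothetical names (`EmbodimentSpec`, `CANONICAL`) — real schemes also handle rotation conventions, control frequency, and sensor calibration:

```python
from dataclasses import dataclass

# Hypothetical canonical joint ordering shared across embodiments.
CANONICAL = ["shoulder", "elbow", "wrist", "gripper"]

@dataclass
class EmbodimentSpec:
    """Per-robot metadata a normalisation layer needs: which
    canonical joints exist, and per-joint unit conversion."""
    joint_names: list[str]
    scale: dict[str, float]

def to_canonical(spec: EmbodimentSpec, action: dict[str, float]) -> list[float]:
    """Map a robot-specific action dict onto the fixed canonical
    vector; absent joints are zero-padded (and masked at training time)."""
    out = []
    for joint in CANONICAL:
        if joint in spec.joint_names:
            out.append(action[joint] * spec.scale[joint])
        else:
            out.append(0.0)
    return out

# A 3-DoF arm whose gripper commands are in mm rather than the
# canonical unit: wrist is zero-padded, gripper rescaled.
arm = EmbodimentSpec(["shoulder", "elbow", "gripper"],
                     {"shoulder": 1.0, "elbow": 1.0, "gripper": 0.01})
vec = to_canonical(arm, {"shoulder": 0.2, "elbow": -0.5, "gripper": 40.0})
# -> [0.2, -0.5, 0.0, 0.4]
```

The standardisation fight is precisely over who defines `CANONICAL` and the metadata schema around it.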
C.4 SOTIF-defensible auto-labeling (high confidence on demand, medium on supply)
Evidence: 04 §B — auto-labels are everywhere but provenance, calibration, and uncertainty metadata are not standardised. ISO 21448:2022 is in force; EU AI Act enforcement adds traceability obligations. What needs to be true: a labeling-tool vendor (Applied Intuition, Encord, Scale, or a new entrant) publishes a label-provenance schema and gets it adopted. Prediction: by 2027 there is a working draft of an ASAM OpenLABEL extension for label provenance + uncertainty. Whoever owns it owns the regulatory choke point — and your team is well placed to author it.
C.5 Real-to-sim digital twins via 4DGS at fleet scale (medium confidence)
Evidence: SplatAD, DrivingGaussian, CoDa-4DGS, ReconDrive's feed-forward variant. What needs to be true: per-scene optimisation cost drops below ~1 GPU-hour per minute of drive log (currently 5–20×). Feed-forward 4DGS networks must generalise. Prediction: by 2028, every Tier-1 AV team runs an offline 4DGS twin of their fleet's hardest 1–10% of drives, and uses it for closed-loop replay with planner perturbations. Synthetic photons are mostly free; the bottleneck is labels and physics consistency.
C.6 Sim agents realism (low confidence — the ceiling is real)
Evidence: WOSAC 2025 winner SMART-R1 at 0.7858 realism meta. Year-over-year improvement is 1–3% — slow. nuPlan-R is the new arena (Nov 2025 paper). What needs to be true: either reactive WMs solve the multi-agent generative consistency problem (no current evidence of a phase change), or the field redefines "realism" away from log-likelihood toward causal counterfactual fidelity. Prediction: realism plateaus near 0.82 on WOSAC by 2028; the metric itself gets replaced.
C.7 Reward modeling and RLHF for E2E driving (medium confidence)
Evidence: PE-RLHF (2025), Tesla's preference-tuning loop, Wayve's "ai-driver" preference data. The post-training LLM playbook transfers cleanly. What needs to be true: OEMs accept that some behavioural choices are matters of preference rather than safety (safety being handled by ISO 26262 / ISO 21448) and require labelled preference data. Prediction: by 2027, every Tier-1 AV stack has an RLHF post-training stage. The labour pool shifts: comfort raters / ride-experience experts become a recognised data category, parallel to RLHF annotators in LLM land. This is a direct extension of the Mercor-style neutral-data tier.
C.8 Foundation-model labeling collapse (medium-high confidence)
Evidence: Auto-labels from Grounded-SAM 2 + GPT-4o-mini are >85% as good as humans on common labels at 1/100 cost (04 §B). What humans still uniquely do: edge cases, calibration, semantic disagreements. What needs to be true: uncertainty-aware sampling routes only the contested labels to humans. Prediction: by 2028, ≥90% of bulk labels in production AV pipelines are foundation-model auto-labels with confidence-thresholded human verification. Pure-bulk labeling vendors (the post-Datagen, post-Synthesis-AI category) consolidate or pivot upmarket. The neutral-data tier expands on the high-skill, expert-evaluation side.
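The routing mechanism itself is simple; the hard part is calibration. A minimal sketch of confidence-thresholded routing — thresholds and the `AutoLabel` type are illustrative assumptions, and in practice thresholds are tuned per category against a calibrated validation set:

```python
from dataclasses import dataclass

@dataclass
class AutoLabel:
    object_id: str
    category: str
    confidence: float  # calibrated model confidence in [0, 1]

def route(labels: list[AutoLabel],
          accept_above: float = 0.9,
          reject_below: float = 0.3):
    """Three-way split: auto-accept, send to human review, discard.
    Only the contested middle band costs human time."""
    accepted = [l for l in labels if l.confidence >= accept_above]
    review = [l for l in labels
              if reject_below <= l.confidence < accept_above]
    discarded = [l for l in labels if l.confidence < reject_below]
    return accepted, review, discarded

batch = [AutoLabel("a", "car", 0.97),
         AutoLabel("b", "cyclist", 0.55),
         AutoLabel("c", "debris", 0.12)]
accepted, review, discarded = route(batch)
# accepted: ["a"]; review: ["b"]; discarded: ["c"]
```

The ≥90% auto-label prediction is equivalent to saying the middle band shrinks to <10% of volume as calibration improves.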
C.9 Neutral-data tier consolidation (medium confidence)
Evidence: Mercor 5×'d valuation in 9 months. Surge bootstrapped to $1B+ revenue. Scale's commercial-LLM business hollowed by Meta deal in 2025. What blocks consolidation: frontier labs explicitly want diverse, non-overlapping vendor pools — Anthropic, OpenAI, and Google won't all buy from the same shop. Prediction: 2027 sees one merger or one new entrant funded above $5B. By 2028, three survivors: Mercor, Surge, and one of {Turing, Invisible}. Scale becomes a defense / government specialist (its 2025 pivot is already visible). Note: this matters for your day job because Applied Intuition's auto-labeling tooling has to interoperate with whichever vendor the OEM customer chose.
D. 2029–2030 outlook — testable forecasts
These are calibrated bets. Each is testable. Each has an implicit "how I would know I was wrong" attached.
D.1 Robotaxi deployment — high confidence
By end-of-2030, the global driverless commercial robotaxi fleet exceeds 100,000 vehicles, with Waymo + Tesla + Apollo Go (Baidu) + Pony.ai accounting for >80%. Waymo's 10-cities-by-Feb-2026 / 17-cities-by-end-2026 / 1M-rides-per-week target sets a 5-year compounding base rate. Falsifier: if Waymo is still <5,000 vehicles at end-of-2027, this is wrong.
D.2 Humanoid commercial deployment — medium confidence
By end-of-2030, ≥3 humanoid OEMs ship >10,000 commercial robots in industrial / logistics settings under paid RaaS or capex contracts. Most likely roster: Apptronik (Mercedes/GXO), Figure (BMW + a US logistics customer), Agility (Amazon/GXO), one Chinese player (Unitree/UBTech/Fourier). Goldman / BoA project 50–100k units in 2026 alone; compounding to 2030 is plausible. Falsifier: if no humanoid OEM is at >2,000 deployed units by end-of-2027, downgrade. Home / household humanoids remain rare in 2030 — high confidence on that subprediction.
D.3 Synthetic-data market consolidation — high confidence
By 2030, the standalone "rendering-only" synthetic-data category is gone (03 §F was already half-confirmed in 2025 with Datagen + Synthesis AI shutdowns). Survivors are vertically integrated either with sim platforms (Applied, NVIDIA) or with labeling+curation (Encord, Voxel51). The synthetic-data spend itself grows ~3–5× from 2025 levels; it migrates from line-item to embedded service.
D.4 Foundation-model-as-labeler economics — high confidence
By 2030, labeling cost per object-frame on common categories drops by ≥90% from 2024 baseline. Humans remain in the loop for: rare classes, calibration, audit, ambiguous semantics, and any safety-case-relevant ground truth. The total labeling spend is flat or up because the category mix shifts upmarket (3D + temporal + behavioural + preference labels).
D.5 ML-perception standard — medium confidence
By 2030, ISO publishes a successor or extension to ISO 21448 specifically for learned perception components. Most likely vehicle: an ISO/IEC TR or PAS that bridges 21448 with ISO/IEC TR 5469 (functional safety and AI systems). UN R157 has its L4-urban amendment by 2029. Falsifier: if no ISO ML-perception PAS is in committee draft by end-2027, downgrade.
D.6 Compute economics — medium-low confidence
By 2030, the cost of generating one minute of multi-camera, action-conditioned, photoreal driving video at 1080p approaches the cost of rendering the same minute in Unreal/Omniverse with classical assets (i.e., within 2×, on the same hardware). This is the moment WMs become the default renderer for non-safety-critical V&V. Falsifier: if WM cost is still >10× Unreal cost in 2028, downgrade aggressively — most predictions in §C.2 also weaken.
D.7 Market structure — medium confidence
By 2030, NVIDIA owns the foundation-model + silicon + low-level robotics-OS layer; Applied Intuition owns the V&V + scenario + ingestion + safety-case layer, with one credible competitor (Foretellix or a new entrant) at half the share; a LeRobot-derivative owns the open-data exchange standard. There is no AWS-of-Physical-AI; the layer cake stays vertically split. Reason: the regulated portions (safety-case, governance) cannot be commoditised the way storage/compute were.
E. Disruption candidates — what could invalidate the thesis
The point of this section is to keep an honest fail-set in mind. Each candidate names the trigger event, the broken assumption, and the early signal a working AI Data Intelligence engineer should track.
- Tesla open-sources FSD-grade fleet labels. Trigger: Musk decides labels are not the moat; releases >100M labeled frames + fleet trigger data. Breaks: the high price of 04 §C curation flywheels and most of the synthetic-data economy. Early signal: Tesla AI Day announcements; rumour of label-API sale to Tier-1s; FSD v15 docs hint at an external-data offering.
- A frontier WM closes the action-conditioning gap to <10 cm at 5 s. Trigger: a Cosmos-3 or GAIA-4 paper with rigorous closed-loop benchmarks on planner perturbations. Breaks: the V&V-on-classical-sim spine assumption (03, 05). Early signal: WOSAC realism jumps >0.85; nuPlan-R reactive sim agents pass a Turing-style test against logged human drives.
- A "robot internet" startup aggregates millions of teleop hours cheaply. Trigger: a $50M-funded company ships a teleop-as-a-service rig at $5/hr fully loaded, with an Apple Vision Pro–class capture stack. Open-licensed, embodiment-normalised data exceeds 10× Open X-Embodiment within 18 months. Breaks: the robotics-data scarcity argument (02 §C). Early signal: a Y Combinator batch with two such companies; Mercor or Surge launches a robotics vertical.
- EU AI Act bans certain synthetic training data categories. Trigger: the August 2026 enforcement (or Dec 2027 if the Omnibus passes) is interpreted as forbidding deepfake-class WM outputs in safety-critical training. Breaks: the synthetic-flywheel thesis for EU-regulated AVs. Early signal: AI Office guidance memos in 2026; ENISA reports on WM provenance.
- A Mojo-class language obsoletes PyTorch tooling. Trigger: a Mojo / Triton-2 / JAX-XLA push delivers a 2–3× perf advantage and the major frontier labs migrate. Breaks: every internal pipeline built on PyTorch + LeRobot. Low probability (high inertia) but disruptive. Early signal: Modular hires from the JAX team; Anthropic / OpenAI commit to a non-PyTorch stack.
- OEMs internalise the data engine. Trigger: VW or Toyota stands up a 500-person internal Data Intelligence org and stops buying from Applied. Breaks: the Series F thesis bet 2 (00 §How AI sits). Early signal: senior Applied or Voxel51 engineers leaving for OEMs.
- A robust universal eval benchmark emerges and is adopted across industry. Trigger: LIBERO-PRO-style perturbation harnesses become the public default, forcing SoTA models to compete on robustness rather than headline pass rates. Breaks: a lot of marketed numbers. Likely net-positive overall, but disruptive to incumbents who marketed inflated success rates.
F. Continuous-learning pipeline — how to stay current
This is the operating manual part of the doc. It assumes you have ~30–60 min/day to spend on the field, not 4 hours.
F.1 arXiv flow
- Subscribe to arXiv daily mailing lists for cs.RO, cs.LG (action learning subset), cs.CV (video gen, 3DGS). Volume: ~100 abstracts/day combined.
- Filter through arxiv-sanity-lite (Karpathy) for personalised relevance.
- alphaxiv and emergentmind.com publish curated weekly digests; alphaxiv has a useful "frontier" filter.
- Hugging Face Daily Papers is the single highest-signal feed in 2026 — subscribe.
F.2 People to follow on X / Bluesky
Together, these accounts surface most relevant signal within 24–48 hours of publication. Curate one X list called "Physical AI" and add them all.
- AVs / driving WMs: @karpathy, @adcock_brett (Figure CEO; useful mostly as product signal), the Wayve research team account, @arohan, @andrew_lampinen.
- Robotics: @chelseabfinn (Physical Intelligence), @svlevine (Physical Intelligence), @ericjang11 (1X / Gemini Robotics), @DrJimFan (NVIDIA GR00T), @RchTangye (occasional), @PeterAbbeel.
- World models / video gen: @hardmaru, @SimoRyu, @willdepue (OpenAI Sora alumnus), @karan_4d.
- Eval / data: @AnthropicAI, @srush_nlp, @_jasonwei, @scale_AI, @mercor_ai.
- Industry signal: @TheRobotReport, @Waymo, @nvidiaAIDev.
F.3 Newsletters and blogs
- Wayve Thinking — best single industry blog on driving WMs and embodied AI.
- NVIDIA Developer — Cosmos / Isaac / GR00T release notes; high signal once you filter.
- OpenDriveLab blog — challenge announcements, benchmark releases.
- Import AI (Jack Clark, Anthropic) — weekly cross-cutting.
- The Algorithmic Bridge — opinion-rich, occasional weak signal.
- The Driverless Digest — robotaxi market signal.
- Robohorizon and Humanoids Daily — humanoid pulse.
- The Sequence — broad ML newsletter; useful when you're behind.
- Physical Intelligence blog (pi.website/blog) — sparse, but every post matters.
F.4 Conference firehose strategy
Each major conference has 1,500–4,000 papers. The trick is filter, then triage, then deep-read.
- Pull the accepted-papers list as soon as it's posted.
- Run a keyword filter against your 6–8 active subareas (§A.1). Cuts to ~150–300.
- Read titles + 1-line abstracts. Cut to ~50–80.
- Read full abstracts + figures. Cut to ~15–25.
- Read the methods and ablations on the ~5–10 that survive.
- Reproduce the headline figure on 2–3 of them.
The whole flow is ~6–10 hours over a week. Without it, you're flying blind.
F.5 Reading groups
- Applied Intuition has internal paper-discussion groups; join from week one.
- The Open Robotics Reading Group (Discord, ~weekly) covers VLA papers.
- MLST (Machine Learning Street Talk) YouTube + the Latent Space pod are good commuting substitutes.
F.6 A weekly cadence (suggested)
| Day | Activity | Time |
|---|---|---|
| Mon | arXiv + HF Daily Papers triage | 30 min |
| Tue | Deep-read 1 paper (rotating subarea) | 60 min |
| Wed | Industry blog round-up (Wayve / NVIDIA / OpenDriveLab) | 30 min |
| Thu | Reproduce 1 figure or 1 ablation | 60 min |
| Fri | X list scroll + write 5 sentences in a personal "open questions" doc | 30 min |
| Sat / Sun | Optional: 1 long-form blog or chapter (RLHF Book, Sutton & Barto) | 60 min |
F.7 If you only have 30 minutes a day
Pick one: HF Daily Papers (Mon-Wed-Fri) + Wayve Thinking + Import AI. This alone keeps you in the top decile of "knows what's happening" — and frees the rest of your day for actually building.
G. What's most likely to be wrong about this doc
In descending order of self-doubt.
- §C.6 sim-agents-realism plateau prediction. I'm calling a ceiling at 0.82 on WOSAC by 2028 based on a 1–3% YoY trajectory. A Cosmos-3 or a Gemini-3 with strong action conditioning could break this in a single jump. Early signal it's wrong: WOSAC 2026 winner > 0.82, or a paper showing closed-loop driving-from-WM-rollouts within 5% of driving-from-real on Bench2Drive.
- §D.2 humanoid commercial deployment threshold (>10k units, ≥3 OEMs by 2030). I'm bracketing high — Goldman/BoA say 50–100k in 2026 alone. If they're right, my bound is too conservative; if humanoids hit a perception or reliability wall in 2027 industrial pilots, my bound is too optimistic. Early signal: by end-2027, count installed Apollo + Figure + Digit units. If <3,000 combined, downgrade.
- §C.9 neutral-data-tier consolidation timing. Calling 3 survivors by 2028 assumes Surge raises and stays disciplined. If Surge takes a $25B round and over-extends on robotics labeling, the survivor list could be Mercor + Turing + a new entrant. The frame is right; the names may be wrong.
- §D.7 market structure — AWS-of-Physical-AI. I'm betting no such thing emerges and the layer cake stays split. NVIDIA's CES 2026 Cosmos 3 + Jetson Thor + Isaac stack push could prove this wrong if it captures the V&V/safety-case layer. Early signal: an OEM publicly chooses NVIDIA over Applied Intuition for a full-stack V&V commitment.
- The whole §C / §D framing assumes the current foundation-model paradigm holds through 2030. If V-JEPA-style joint-embedding non-generative WMs displace generative video models, much of the synthetic-data and renderer-cost economics changes. LeCun is loud about this; the 2025 V-JEPA 2 results are credible but not yet decisive. Early signal: a JEPA-class model wins WOSAC or NAVSIM v2 outright.
I would not be surprised if 2 of these 5 predictions look embarrassing in 2028. The document is most useful as a prior against which to update — re-read it quarterly with the early-signals checklist in hand.
H. Sources
Frontier papers and models
- Pi 0.7 (Physical Intelligence)
- GAIA-3 (Wayve) and press release
- Cosmos-Drive-Dreams paper and project page
- Cosmos-Predict 2.5 GitHub
- Gemini Robotics 1.5 blog and tech report
- Gemini Robotics-ER 1.6 page
- Helix 02 (Figure) and Figure 03
- V-JEPA 2 (Meta blog) and paper
- Genie 3 (DeepMind)
- LIBERO-PRO
- nuPlan-R
- NAVSIM v2 GitHub and HF leaderboard
- DriveE2E
- EMMA (Waymo) and Waymo research page
- SMART-R1 (WOSAC 2025 winner)
- TrajTok (WOSAC 2025 #2)
- OpenVLA (Kim et al.) and GitHub
- X-VLA (CoRL 2025)
- ReSim (NeurIPS 2025)
- SplatAD, DrivingGaussian, CoDa-4DGS, SplatFlow
- PE-RLHF
Datasets and benchmarks
- AgiBot World and paper
- RoboMIND and RoboMIND 2.0
- LeRobotDataset v3 and LeRobot GitHub
- Cosmos-Drive-Dreams dataset (HF)
- Waymo Open Dataset 2025 Challenges
- OpenDriveLab Challenge 2026
- ICLR 2026 VLA review (Reuss)
- CoRL 2025 Robot Data workshop
- CoRL 2025 Generalizable Priors workshop
- Embodied World Models for Decision Making (NeurIPS)
Industry and funding
- Applied Intuition Series F and 2025 Year in Review
- Mercor $10B (CNBC), Mercor metrics (Sacra)
- Apptronik $520M / $5B (Robot Report)
- Tesla Dojo shutdown (TechCrunch)
- Tesla FSD v14.3 / v15 timeline
- Waymo 10 cities (Axios) and Waymo new cities blog
- Jetson Thor GA (NVIDIA)
- NVIDIA Cosmos 3 announcement (CES 2026)
- NVIDIA Cosmos GA (Oct 2025)
Regulation
- EU AI Act timeline
- EU AI Act August 2026 deadline (Holland & Knight)
- Digital AI Omnibus (DLA Piper)
- UN R157 amendments (EFS Consulting)
- UNECE WP.29 GRVA
- ISO 21448 SOTIF overview
Cross-references inside this folder
- 00-overview.md, 01-av-industry-and-data.md, 02-robotics-foundation-models.md, 03-simulation-and-synthetic-data.md, 04-labeling-and-data-curation.md, 05-world-models-and-generative.md, 06-open-problems-and-benchmarks.md, 07-learning-roadmap.md.