Physical AI
Research·Doc 06·~60 min

Open Problems, Public Datasets, and Benchmarks in AV and Embodied AI

Sandbox brief for the Applied Intuition Data Intelligence team. The aim of this doc is to map the external scaffolding of the field — the public datasets, benchmarks, simulation/data standards, and regulatory frameworks that any AV or generalist-robotics team has to either consume, contribute to, or work around — and then to enumerate the deep open problems that explain why the field still needs companies like Applied Intuition.

Last verified: 2026-05-08.


A. AV public datasets

Public AV datasets are how the academic community benchmarks itself and how every AV team bootstraps perception before touching its own (much larger) production logs. Two structural facts: (i) the largest open dataset (BDD100K, ~1,100 hrs) is a rounding error against Tesla's billions of fleet-miles or Waymo's tens of millions of rider-only miles; (ii) most are CC BY-NC research-only — the commercially usable subset (parts of KITTI, ZOD, A2D2, BDD100K) is thin.

| Dataset | Released | Size | Sensor suite | Tasks | License | 2026 relevance |
|---|---|---|---|---|---|---|
| KITTI | 2012 | ~6 hrs / ~50 sequences, Karlsruhe | 2 stereo cameras, Velodyne HDL-64E, GPS/IMU | 3D detection, tracking, stereo, optical flow, odometry | CC BY-NC-SA | Foundational; mostly historical now |
| KITTI-360 | 2021 | 73.7 km, ~320k images, ~100k LiDAR scans | Fisheye + perspective + stereo cameras, Velodyne, SICK 2D, IMU/GPS | Dense 2D+3D semantic/instance segmentation across 19 classes, novel-view synthesis | CC BY-NC-SA | Standard for urban scene understanding (KITTI-360) |
| Cityscapes | 2016 | 5,000 fine + 20,000 coarse images, 50 German cities | Stereo cameras only (no LiDAR) | Semantic / instance segmentation (30 classes) | Free for non-commercial / academic | Still the canonical urban segmentation benchmark |
| Mapillary Vistas / Vistas 2.0 | 2017 / 2020 | 25k high-res images, worldwide, crowdsourced | Cameras only, varied devices | Semantic (124 classes in v2.0) + instance seg (70 classes) | Research license | Used for cross-geography robustness (Vistas paper) |
| nuScenes | 2019 | 1,000 scenes × 20 s = ~5.5 hrs, Boston + Singapore | 6 cameras, 1 spinning LiDAR, 5 radars, GPS/IMU | 3D detection, tracking, prediction, panoptic | CC BY-NC-SA 4.0 | Still the most cited AV benchmark — first dataset to include radar; >1,000 papers cite it (nuScenes) |
| nuScenes-Panoptic | 2022 | 1,000 scenes (re-annotated subset of nuScenes) | Same as nuScenes | LiDAR panoptic segmentation, multi-object panoptic tracking | CC BY-NC-SA | Active for LiDAR-based panoptic research |
| nuPlan | 2022 | 1,282 hours across Las Vegas, Boston, Pittsburgh, Singapore + 120 hrs raw sensor (~16 TB) | 8 cameras, 5 LiDARs (raw subset); auto-labeled tracks for full dataset | Closed-loop ML planning | CC BY-NC-SA | Reference for closed-loop motion planning; community ports active (e.g. SMART, nuPlan-R) (nuPlan) |
| Waymo Open – Perception | 2019 (v1.0) | ~1,950 segments × 20 s ≈ 11 hrs | 5 cameras, 5 LiDARs (1 mid-range + 4 short-range) | 2D/3D detection, tracking, segmentation | Waymo Dataset License (research) | Top-tier 3D perception benchmark (Waymo Open) |
| Waymo Open – Motion | 2021 | 103,354 segments, 20 s each, with HD maps | Object tracks only (sensor-level not in this subset) | Motion forecasting, interaction prediction | Same | Drives 2021–2025 prediction SOTA |
| Waymo Open – Sim Agents | 2023 | Built on Motion subset | Same | Multi-agent behavior generation; uses Waymax | Same | 2025 winners hit ~0.785 realism, ~0.92 map-metric (TrajTok report) |
| Waymo Open – End-to-End (WOD-E2E) | 2025 | 4,021 segments (~12 hrs), curated long-tail (occurrence <0.03%) | 8 cameras (no LiDAR for E2E) | Vision-only end-to-end driving with Rater Feedback Score | Same | Brand-new benchmark; first open-set long-tail E2E set (WOD-E2E paper) |
| Argoverse 1 | 2019 | 113 sequences (sensor) + 324,557 5-s scenarios (motion forecasting) | 7 ring cameras, 2 stereo, 2 roof LiDARs | 3D tracking, motion forecasting, HD maps (290 km) | CC BY-NC-SA 4.0 | Still used; superseded by AV2 for new work (Argoverse) |
| Argoverse 2 | 2022 | 1,000 sensor scenes (~4.2 hrs), 250k motion-forecasting scenarios, 20k LiDAR-only sequences (6M frames) | 7 ring cameras (2048×1550), 2 stereo, 2 32-beam LiDARs | 3D detection, motion forecasting, scene flow, point-cloud forecasting | CC BY-NC-SA 4.0 | Current default for sensor-based AV2 challenges (AV2 paper) |
| Lyft / Woven Planet L5 | 2019 (perception) / 2020 (prediction) | ~1,118 hours, 16,000 mi, Palo Alto | 7 cameras, 3 LiDARs | Perception, prediction | CC BY-NC-SA | Effectively deprecated. Lyft sold L5 to Toyota (Woven Planet) in 2021; Woven was folded back into Toyota Group in 2023. l5kit on GitHub still works but is unmaintained; not used for new SOTA work |
| ZOD (Zenseact Open Dataset) | 2023 | 100k Frames + 1,473 Sequences (20 s each) + 29 Drives (multi-minute), 14 European countries, 2-yr collection | 8MP camera, 2× LiDAR (long-range Velodyne + short-range), radar (added Jan 2025), IMU/GNSS | 2D/3D detection up to 245 m, road segmentation, traffic-sign recognition, road classification | Permissive — research AND commercial | Most commercially relevant open AV dataset — uniquely permissive license (ZOD) |
| ApolloScape | 2018 | 100k+ image frames, 80k LiDAR points, ~1,000 km | Riegl mobile LiDAR, 2 high-res cameras at 30 fps, GPS/IMU | Scene parsing, lane segmentation, 3D detection/tracking, trajectory prediction, self-localization, stereo, instance segmentation | CC BY-NC-SA | Strong for China-domain transfer; less SOTA traction now (ApolloScape) |
| ONCE | 2021 | 1M LiDAR frames + 7M images, 144 hrs, 4 Chinese cities | 40-beam LiDAR + 7 cameras | 3D detection (with massive unlabeled set for self-supervised pretraining) | Research | Largest scale of any public LiDAR dataset by frame count (ONCE) |
| A2D2 | 2020 | 41,277 annotated frames (12,497 with 3D boxes) + unlabeled sequences | 6 cameras, 5 LiDARs, full bus signals | Semantic + instance seg, 3D detection, CAN/bus extraction | CC BY-ND 4.0 — commercial use permitted | Useful for fusion and bus-signal research (A2D2) |
| BDD100K | 2018 | 100,000 videos × 40 s = ~1,100 hours | Phone-grade cameras only (smartphone-collected) | 10 tasks: detection, tracking, lane marking, drivable area, full-frame semantic + instance seg, MOT, MOTS, scene tagging | BSD 3-Clause-style (research), commercial via license | Largest open driving-video corpus by hours (BDD) |
| OpenLane | 2022 | 200k frames (built on Waymo Open) | Waymo cameras | 3D lane detection (first large-scale 3D lane dataset) | Inherits Waymo terms | Still the 3D-lane reference |
| OpenLane-V2 | 2023 (NeurIPS D&B) | 2,000 annotated road scenes | Multi-camera | Topology reasoning: lane centerlines + traffic elements + their relations (HD-map-on-the-fly) | Apache 2.0 | The standard online HD-map / topology benchmark (OpenLane-V2) |
| CODA | 2022 (ECCV); W-CODA workshops at ECCV 2024 | ~10,000 images, 43 object classes, drawn from KITTI / nuScenes / ONCE | Inherits sources | Corner-case object detection (novel-class + novel-instance) | Research | Standard for OOD / corner-case evaluation. SOTA detectors get <12.8% mAR here — the headline result for "perception is unsolved" (CODA paper) |

The takeaway: the public-AV stack has plenty of perception data, modest prediction data, very little closed-loop planning data, and almost no commercially licensed data above 100 hours (ZOD is the exception). That gap is where vendor tooling sits.


B. Robotics public datasets

Robotics public datasets are 5–7 years behind AV in maturity, but the curve is steep. Open X-Embodiment (Oct 2023) was the watershed; AgiBot World Beta (Mar 2025) is plausibly the first robot dataset comparable in frame count to a large AV dataset.

| Dataset | Released | Scale | Embodiments | Tasks | License | Notes |
|---|---|---|---|---|---|---|
| RoboNet | 2019 | ~162k trajectories, ~15M frames | 7 robot platforms | Tabletop manipulation, video prediction | Research | Pre-LLM era; "ImageNet attempt #1" (RoboNet) |
| BridgeData V2 | 2023 | 60,096 trajectories (50,365 expert + 9,731 scripted), 24 environments | WidowX 250 6-DOF (single embodiment), 5 Hz, ~38 timesteps avg | 13 skills, language-conditioned manipulation | MIT-style | The most-used base for cross-task generalization in academic VLAs (BridgeData V2) |
| RT-1 / RT-2 datasets | 2022–2023 | RT-1: ~130k episodes collected on 13 Everyday Robots | Mostly Everyday Robots | Mobile manipulation in real kitchens | Restricted (Google internal) | The internal complement to Open X; partially folded into Open X |
| Open X-Embodiment (OXE) | Oct 2023 | >1M trajectories pooled from 60 datasets, 22 embodiments, 527 skills, ~160k tasks; ~120 timesteps avg, 3–10 Hz | 22 robot platforms (Franka, xArm, WidowX, ALOHA, quadrupeds, etc.) | Manipulation primarily; some navigation | Mixed (per source dataset) | The de-facto pretraining corpus for VLAs (RT-X, OpenVLA, Octo, π0). The "ImageNet of robot learning" (OXE) |
| DROID | 2024 (RSS) | 76k successful trajectories, ~350 hours, 564 scenes, 86 tasks; 1.7 TB | Single embodiment (Franka), 18 robots, 13 institutions, 50 collectors | In-the-wild diverse manipulation | Research | Higher quality and diversity than OXE constituent sets; key for "wild" generalization (DROID) |
| BEHAVIOR-1K | 2022 (CoRL) → ongoing | 1,000 daily-life activities defined in BDDL (predicate logic), 50 scenes, 5,000+ object models | Simulated (OmniGibson) | Long-horizon mobile manipulation, household | Free | Reference benchmark for task definition and long-horizon planning. 2025 BEHAVIOR Challenge demos on HF (BEHAVIOR) |
| Habitat datasets (HM3D, ReplicaCAD, MP3D) | 2020–2022 | HM3D: 1,000 scanned scenes; ReplicaCAD: ~100 articulated scenes | Simulated agents | Navigation, rearrangement, embodied QA | Research | The dominant indoor-nav / embodied-AI sim assets (AI Habitat) |
| LIBERO | 2023 (NeurIPS) | 130 language-conditioned tasks, 4 suites (OBJECT, GOAL, SPATIAL, LIBERO-100 = 90 train + 10 test) | Simulated Franka | Lifelong knowledge transfer, language-conditioned manipulation | MIT | Now the de-facto VLA evaluation suite. Reported by π0, π0.5, OpenVLA, Octo, etc. (LIBERO) |
| AgiBot World – Alpha | Dec 2024 | 92,214 trajectories, ~595 hours, 36 tasks, ~8.5 TB | AgiBot G1 humanoid + dual-arm | Humanoid manipulation | CC BY-NC | Largest Chinese open humanoid set at the time (AgiBot World) |
| AgiBot World – Beta | Mar 2025 | >1M trajectories, 2,976 hours, 217 tasks, 87 skills, 100 robots, 5 deployment domains, 3,000+ objects | AgiBot G1 (100 units) | Humanoid manipulation incl. mobile / bimanual / dexterous | CC BY-NC | Robotics-side counterpart in scale to large AV datasets. IROS 2025 Best Paper Finalist (AgiBot Beta) |
| RoboMIND | Dec 2024 (arXiv) | 107k trajectories, 479 tasks, 96 object classes; 52,926 Franka + 25,170 UR-5e + 19,152 Tien Kung humanoid + 10,629 AgileX bimanual; 5k labeled failure demos; Isaac Sim digital twin | 4 embodiments | Multi-embodiment manipulation w/ failure analysis | Research (HF) | Beijing Innovation Center + PKU. RoboMIND 2.0 (Dec 2025) adds bimanual mobile data (RoboMIND) |
| Hugging Face LeRobot Hub | 2024–ongoing | Thousands of community datasets in the LeRobotDataset format (v3.0 streaming, file-based) | All major embodiments incl. SO-100, Koch, ALOHA, Franka, Stretch | Anything users upload | Per-dataset | The aggregator. π0 / π0.5 integrated 2025; LIBERO + 130 tasks integrated. Closest thing to "GitHub for robot data" (LeRobot, v3.0 blog) |
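
Given LeRobot's role as the aggregator, a loading sketch shows how thin the consumer surface is. Treat the import path and repo id below as assumptions: both have shifted across lerobot releases, and v3.0 moves to a streaming, file-based layout (check the repo for current equivalents).

```python
# Hedged sketch: class name and import path assumed from LeRobot docs;
# they may differ in the v3.0 release.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset("lerobot/pusht")          # any hub dataset in LeRobotDataset format
frame = ds[0]                                 # one timestep: a dict of named tensors
print(ds.num_episodes, ds.fps, list(frame))   # episode count, control rate, feature keys
```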

Takeaway: every "X for robot learning" dataset holds 1–2 orders of magnitude less data than the smallest commercial AV fleet collects in a week. Cross-embodiment is the lever, because no single embodiment has enough data alone.


C. Public benchmarks and leaderboards

Benchmarks turn datasets into competitions. The 2024–2025 trend in both AV and robotics has been a shift from open-loop (predict next waypoint, compare to log) to closed-loop (run the agent in sim and watch), because open-loop metrics correlate poorly with on-road performance.

C.1 AV leaderboards

  • nuScenes detection / tracking / prediction (open-loop): still the most-submitted 3D perception leaderboard, with NDS and AMOTA as headline metrics (nuScenes).
  • Waymo Open Dataset Challenges — 6th annual edition in 2025, four tracks: Vision-Based End-to-End Driving, Scenario Generation, Interaction Prediction, Sim Agents. 2025 submissions closed May 22; leaderboards remain open (2025 blog). The new Rater Feedback Score for E2E compares predicted trajectories to human-rater preference labels rather than log replay — a meaningful methodological move.
  • Argoverse Motion Forecasting (AV1: 324k scenarios; AV2: 250k) — minADE / minFDE / Miss Rate / Brier-minFDE on 6-mode multimodal forecasts (sketched in code after this list).
  • CVPR Workshop on Autonomous Driving (WAD) — 8th edition 2025, run by OpenDriveLab; hosts the Autonomous Grand Challenge with tracks for NAVSIM v2, Planning-Oriented AD, and E2E frontiers; ~$100k prize pool (CVPR 2025 OpenDriveLab).
  • CARLA Leaderboard 2.0 — closed-loop sim eval. Driving Score = Route Completion × Infraction Penalty (see the same sketch after this list). Notoriously hard; most public submissions score below 30.
  • Bench2Drive (NeurIPS 2024 D&B) — built on CARLA L2.0. 220 short routes (5 × 44 scenarios × 23 weathers × 12 towns), 2M annotated training frames. First benchmark to disentangle multi-ability scoring (cut-in vs detour vs overtaking) so you can see where a stack fails (Bench2Drive).
  • nuPlan planning challenge — closed-loop ML planning on real logs. Ran once in 2023; the devkit is no longer maintained by Motional (after Aptiv pulled back). Community forks (Horizon Robotics' is most active) and NAVSIM v2 (CVPR 2025) effectively supersede it (NAVSIM v2).
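
Two of the metric families above are compact enough to state exactly. A minimal NumPy sketch of minADE/minFDE and a CARLA-style Driving Score; the infraction penalty factors are illustrative placeholders, and the official Argoverse and CARLA evaluators remain the reference implementations.

```python
import numpy as np

def min_ade_fde(pred_modes: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """pred_modes: (K, T, 2) multimodal forecasts; gt: (T, 2) logged future."""
    dists = np.linalg.norm(pred_modes - gt[None], axis=-1)   # (K, T) displacements
    return dists.mean(axis=1).min(), dists[:, -1].min()      # minADE, minFDE

# CARLA-style Driving Score: route completion scaled by one multiplicative
# penalty per infraction. Factor values here are illustrative, not the
# official leaderboard table.
PENALTY = {"collision_pedestrian": 0.50, "collision_vehicle": 0.60,
           "collision_static": 0.65, "red_light": 0.70, "stop_sign": 0.80}

def driving_score(route_completion: float, infractions: list[str]) -> float:
    penalty = float(np.prod([PENALTY[i] for i in infractions])) if infractions else 1.0
    return route_completion * penalty

rng = np.random.default_rng(0)
gt = np.cumsum(rng.normal(size=(30, 2)), axis=0)             # 30-step logged future
preds = gt[None] + rng.normal(scale=0.5, size=(6, 30, 2))    # K=6 forecast modes
print(min_ade_fde(preds, gt))
print(driving_score(0.92, ["red_light", "collision_static"]))  # 0.92 * 0.70 * 0.65
```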

C.2 Robotics benchmarks

  • Open X-Embodiment evaluations — no single OXE leaderboard; papers report on source-dataset splits + LIBERO + DROID held-out tasks. The fragmentation is itself a problem.
  • LIBERO — the most-reported VLA benchmark in 2024–2025. Standard practice: success rate on Spatial / Object / Goal / Long. LIBERO-PRO (Oct 2025, arXiv 2510.03827) argues the original splits saturate and proposes harder variants.
  • CALVIN — language-conditioned long-horizon manipulation in sim, 7-DOF Franka + desk; 24 hrs of teleoperated play, 20K language directives. Metric: instructions-completed-in-a-row (1 → 5); see the chain-score sketch after this list.
  • AgiBot World Challenge 2025, BEHAVIOR Challenge 2025, MimicBench / RoboBench / SimplerEnv / Robosuite — second-tier sim benchmarks; mostly for ablations.
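
CALVIN's chain metric is simple to pin down exactly. A minimal sketch, assuming rollout results arrive as per-episode lists of booleans (one per instruction in the 5-instruction chain):

```python
# CALVIN-style scoring: an episode's score is how many of its 5 instructions
# are completed consecutively from the start.
def chain_score(successes: list[bool]) -> int:
    n = 0
    for ok in successes:
        if not ok:
            break
        n += 1
    return n

episodes = [[True, True, False, True, True],   # scores 2
            [True, True, True, True, True],    # scores 5
            [False, True, True, True, True]]   # scores 0
scores = [chain_score(e) for e in episodes]
avg_len = sum(scores) / len(scores)            # average instructions-in-a-row
rate_at = {k: sum(s >= k for s in scores) / len(scores) for k in range(1, 6)}
print(avg_len, rate_at)
```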

The robotics-evaluation field is fragmented and contested — nothing yet with the cultural force of ImageNet-2012 or COCO. LIBERO is the closest, but it's sim-on-a-single-embodiment, and papers regularly disagree on what real-world success rates "should" look like.


D. Industry standards and regulatory frameworks

This is the layer that matters for shipping a real product, and where Applied Intuition has product-market fit — its tooling is the bridge between research-grade ML and ISO-26262-grade systems.

D.1 ASAM standards (simulation interop)

ASAM (Association for Standardization of Automation and Measuring Systems) maintains the open formats that let driving simulators and tool vendors interoperate. If you ship "scenario X to OEM Y," you express it in ASAM formats.

| Standard | Purpose | Latest version | Status |
|---|---|---|---|
| ASAM OpenDRIVE | Logical road network description (lanes, geometry, signs, signals) — fed into simulators | 1.9.0, 2026-03-11 (spec) | Universal in industry sims |
| ASAM OpenSCENARIO 1.x (XML) | Scenario description for dynamic content (actors, triggers, maneuvers) — XML-based | 1.3 (current XML) | Wide adoption |
| ASAM OpenSCENARIO DSL (a.k.a. OSC 2.x) | New DSL approach (human-readable, parametric, composable) — supersedes XML for advanced scenario authoring | DSL 2.1.0 (2024-03-07); 2.2.0 in public review early 2026 | Growing adoption in scenario-based testing; Foretellix and Applied Intuition both support it |
| ASAM OSI (Open Simulation Interface) | Interface for sensor/environment/ground-truth data between simulator and SUT (System Under Test) | OSI 3.x | Standard for sensor-model co-simulation |
| ASAM OpenLABEL | Annotation format for objects and scenarios (JSON-based) | 1.0+ | Adopted but not universal |

Together they form a pipeline: describe a road (OpenDRIVE), populate it with actors (OpenSCENARIO), run it in any compliant simulator (OSI), and exchange labels (OpenLABEL). This is the interoperability foundation Applied Intuition's simulation products build on.
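
To make the division of labor concrete, here is a toy OpenDRIVE-shaped document built with the Python standard library. The element hierarchy (header / road / planView / geometry / lanes) follows the OpenDRIVE pattern, but the attributes shown are a sketch rather than a spec-complete map.

```python
# Toy OpenDRIVE-like road: one straight 100 m segment starting at the origin,
# heading east. Illustrative only; consult the ASAM spec for real authoring.
import xml.etree.ElementTree as ET

root = ET.Element("OpenDRIVE")
ET.SubElement(root, "header", revMajor="1", revMinor="9", name="toy_map")
road = ET.SubElement(root, "road", id="1", length="100.0", junction="-1")
plan = ET.SubElement(road, "planView")
geo = ET.SubElement(plan, "geometry", s="0.0", x="0.0", y="0.0",
                    hdg="0.0", length="100.0")
ET.SubElement(geo, "line")                    # straight-line geometry primitive
lanes = ET.SubElement(road, "lanes")
ET.SubElement(lanes, "laneSection", s="0.0")

print(ET.tostring(root, encoding="unicode"))
```

An OpenSCENARIO file then references this road network and layers actors, triggers, and maneuvers on top; OSI carries the resulting sensor and ground-truth streams to the system under test.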

D.2 Functional safety and SOTIF

| Standard | Scope | Key idea |
|---|---|---|
| ISO 26262 (2018, 2nd ed.) | Functional safety of E/E systems in road vehicles | ASIL classification, hazard analysis, fault tolerance against system failures |
| ISO 21448 (SOTIF, 2022) | Safety of the Intended Functionality — covers SAE Levels 1–5; requires field monitoring after deployment | Risk from functional insufficiencies and foreseeable misuse in the absence of failures (e.g. an ML perception model fails on glare). Distinct from 26262 (faults) and ISO/SAE 21434 (cyber) (ISO 21448) |
| ISO/SAE 21434 (2021) | Cybersecurity engineering for road vehicles | The CSMS process backbone for UN R155 |

For ML-heavy AV systems, SOTIF is the more interesting standard — essentially "how to argue a probabilistic perception/planning stack is safe enough" — and it's the backbone of the AV safety case.

D.3 UN / UNECE WP.29 regulations

UN regs are the international type-approval framework (EU, UK, Japan, Korea, ~40 others; US uses FMVSS self-certification).

  • UN R157 (ALKS) — adopted Jan 2021, in force Jan 2023. First international type-approval for an SAE L3 function. Original cap: 60 km/h; 01 Series Amendment raised it to 130 km/h (lane-change-capable systems only) (UNECE). 2024 supplement added EMC resilience.
  • UN R155 (Cybersecurity / CSMS) — adopted 2020. Certified Cyber Security Management System mandatory for type approval. All new vehicles sold in UN-1958 contracting states from July 2024 must comply.
  • UN R156 (Software Updates / SUMS) — R155 companion; governs the OTA lifecycle.
  • WP.29 — the UN Working Party 29 body that drafts and amends the above through its GRVA and GRSG/GRSP working parties.

D.4 US regulatory framework

  • NHTSA AV TEST Initiative (2020) — voluntary public reporting of testing locations and incidents.
  • NHTSA AV STEP (NPRM Jan 15, 2025) — proposed ADS-equipped Vehicle Safety, Transparency, and Evaluation Program; voluntary, with independent assessment of safety cases (Federal Register).
  • NHTSA AV Framework (April 24, 2025) — three principles: prioritize safety, remove barriers, enable commercial deployment (DOT).
  • California DMV / CPUC — testing + deployment permits + ride-service licensing. As of late 2025, only Waymo holds a full deployment + driverless paid passenger permit; Zoox has driverless pilot permits. CA DMV released a revised regulatory proposal Dec 3, 2025 (CNBC).
  • Texas SB 2807 (2025) — requires commercial AV operating authorization from Sept 1, 2025. Free, non-expiring.
  • Arizona — ADOT self-certification; permissive. Waymo's largest ODD (~315 sq mi metro Phoenix); Tesla supervised-Robotaxi approval Nov 2025.

D.5 EU AI Act and other frameworks

The EU AI Act (adopted 2024, most provisions live Aug 2026) classes AVs as high-risk AI, but Article 25(4) routes them through sectoral type-approval law (UN regs + EU 2018/858 + EU 2019/2144) rather than the standard high-risk obligations. OEMs therefore end up maintaining two documentation stacks: AI-Act-style artifacts (data governance, bias monitoring, transparency logs) layered onto the existing type-approval evidence (Squire Patton Boggs).

  • SAE J3016 (J3016_202104, Apr 2021) — the canonical six-level (L0–L5) taxonomy. Joint SAE / ISO TC204/WG14 effort. The reference for "what level is this?" debates.
  • PEGASUS (DE, completed mid-2019) — produced the scenario-based testing methodology now industry-standard (PEGASUS).
  • ENABLE-S3 (EU, 2016–2019) — earlier validation-of-automated-systems work that fed into PEGASUS.
  • Hi-Drive (EU Horizon, 2022–2026) — large-scale on-road testing across multiple EU countries; complements PEGASUS's sim focus with field data.
  • SET Level (DE) — successor to PEGASUS for L4/L5 simulation pipelines.


E. Deep open problems

The standards above describe the rules of the game. The problems below are why the game isn't won.

E.1 The long-tail / safety-validation problem

The RAND "Driving to Safety" (2016) result: to demonstrate with 95% confidence that an AV's fatality rate is within 20% of the human baseline (1.09 fatalities per 100M VMT) requires 8.8 billion miles of testing; to show better than human, hundreds of billions (RAND RR-1478). Waymo's 170M cumulative rider-only miles are two orders of magnitude short. You cannot drive your way to safety. This is the structural case for simulation, scenario-based testing (PEGASUS, NAVSIM, Bench2Drive, Waabi World), and synthetic data — and the macro case for Applied Intuition. Corollary: edge-case discovery is its own discipline. CODA's <12.8% mAR for SOTA detectors is the evidence that "pretrained on nuScenes" does not generalize.
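
The RAND arithmetic is worth reproducing because it anchors the whole section: under a Poisson model, estimating a rate to within ±20% at 95% confidence requires roughly (1.96/0.20)² ≈ 96 observed events, and at the human fatality baseline that converts to ~8.8 billion miles.

```python
# Reproduces the RAND "Driving to Safety" headline number from first principles.
z, rel_err = 1.96, 0.20                 # 95% confidence, +/-20% relative precision
rate = 1.09 / 100e6                     # human baseline: fatalities per mile
events_needed = (z / rel_err) ** 2      # ~96 observed fatalities
miles_needed = events_needed / rate     # ~8.8e9 miles
print(f"{events_needed:.0f} events -> {miles_needed:.2e} miles")
```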

E.2 The sim-to-real gap

Three sub-gaps: (1) sensor realism — does synthesized lidar/camera match physical noise statistics? (NVIDIA Cosmos, Waymo Sim, Waabi World, 3DGS reconstruction are converging on "yes for most, no for edge cases"); (2) behavior realism — do other agents drive like humans? This is the Sim Agents problem; 2025 winners hit ~0.785 realism against an unknown ceiling; (3) distribution realism — does the sim cover the long tail in the right proportion? Failure on any one collapses sim value.
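
Sub-gap (1) is the most directly testable of the three. A minimal sketch of the kind of check a sensor-sim team runs, using a two-sample KS test on a per-point noise statistic; the arrays here are synthetic stand-ins for real and simulated lidar range residuals.

```python
# Synthetic stand-ins: per-point lidar range residuals (measured minus true range)
# from real logs vs the sensor model. A calibrated simulator should match not
# just the mean but the full residual distribution.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
real_residuals = rng.normal(0.0, 0.020, 50_000)   # meters, "from real logs"
sim_residuals = rng.normal(0.0, 0.025, 50_000)    # slightly over-dispersed sensor model
stat, p = ks_2samp(real_residuals, sim_residuals)
print(f"KS statistic {stat:.4f}, p = {p:.3g}")    # at this n, even a 5 mm sigma gap rejects
```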

E.3 Open-loop vs closed-loop evaluation

Open-loop ("predict 3 s, compare to log") is cheap and reproducible but doesn't compound errors — a planner drifting 1 cm/step looks fine open-loop and crashes closed-loop. Closed-loop is faithful but expensive, hard to standardize, and requires credible sim agents (see E.2). Bench2Drive and NAVSIM v2 are current best attempts at the middle. Waymo's RFS is a different angle: anchor to human preference, not log replay.

E.4 Data quantity vs quality (curation > collection)

Post-2024, consensus shifted from "more" to "better." Tesla's shadow-mode triggers, Waymo's hard-example mining, DROID's deliberate diversity curation, Open X's per-source quality filtering — all point the same way. Applied Intuition's Data Explorer / Axion reflect this: the rate-limiter isn't log volume, it's finding the 0.01% of frames that change model behavior.
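
The core primitive behind that rate-limiter is rarity scoring in embedding space. A minimal sketch: score each fleet frame by its distance to the k-th nearest neighbor in the curated training set and surface the top tail (brute-force distances for clarity; production systems use ANN indices).

```python
import numpy as np

def rarity_scores(train_emb: np.ndarray, fleet_emb: np.ndarray, k: int = 5) -> np.ndarray:
    """k-NN distance from each fleet frame to the curated training set."""
    d2 = ((fleet_emb ** 2).sum(1)[:, None]
          + (train_emb ** 2).sum(1)[None, :]
          - 2.0 * fleet_emb @ train_emb.T)          # (n_fleet, n_train) squared dists
    d2 = np.maximum(d2, 0.0)                        # guard tiny negative round-off
    return np.sqrt(np.sort(d2, axis=1)[:, k - 1])

rng = np.random.default_rng(0)
train = rng.normal(size=(2_000, 64))                       # curated training embeddings
fleet = np.vstack([rng.normal(size=(9_990, 64)),           # ordinary fleet frames
                   rng.normal(loc=4.0, size=(10, 64))])    # 0.1% planted edge cases
scores = rarity_scores(train, fleet)
top = np.argsort(scores)[-10:]                             # frames to triage/label
print(sorted(top.tolist()))                                # recovers indices 9990..9999
```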

E.5 Multi-agent realism

Single-vehicle planning in benign conditions is largely solved. Multi-agent interactive behavior — merging, unprotected lefts, occluded pedestrians, construction zones — is where stacks fail. The Sim Agents Challenge exists because without realistic NPCs, closed-loop sim is unfalsifiable. Arguably the hardest current AV sub-problem.
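
A crude but useful proxy for behavior realism is distribution matching on kinematic features, in the spirit of (though far simpler than) the likelihood-based Sim Agents metrics. A sketch comparing logged and simulated speed distributions with Jensen–Shannon distance, on synthetic stand-in data:

```python
# Histogram-level realism check: do simulated agents drive at human-like speeds?
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
logged_speed = rng.gamma(shape=4.0, scale=3.0, size=100_000)   # m/s, "from logs"
sim_speed = rng.gamma(shape=4.0, scale=3.5, size=100_000)      # "from sim rollouts"

bins = np.linspace(0, 60, 61)
p, _ = np.histogram(logged_speed, bins=bins, density=True)
q, _ = np.histogram(sim_speed, bins=bins, density=True)
print(f"JSD = {jensenshannon(p, q):.3f}")    # 0 = indistinguishable distributions
```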

E.6 Generalization

Cross-city — a Phoenix stack doesn't work in Mumbai; Waymo's expansion recipe is HD map + ODD definition + fleet ops + weeks of fine-tuning per city. Cross-weather — fog, snow, and heavy rain remain perception failure modes. Cross-embodiment (robotics) — Franka-trained policies don't transfer to xArm without retraining; Open X, RoboMIND's 4-embodiment design, and AgiBot World are explicit attacks on this.

E.7 End-to-end vs modular: the pendulum

  • 2018–2022: modular wins (Waymo, Cruise, Aurora) — interpretable, debuggable, certifiable.
  • 2022–2024: end-to-end wins (Tesla FSD v12 deleted ~300k lines of C++; Wayve / Waabi built E2E from day one).
  • 2025–2026: convergence. Waymo's foundation model is "modular transformer-based" — separate transformers per stage, but each is a neural net. Tesla FSD v14 is fully E2E but uses auxiliary tasks for interpretability. Wayve's GAIA-2 (2025) pairs E2E with explicit safety filters. The dichotomy is dissolving toward "neural everywhere, modular interfaces for safety arguments." Applied Intuition's tooling survives the transition because the interfaces survive even as implementations go neural.


F. North-star goals across the field

Stand back far enough and every research direction above aims at one of five long-term outcomes:

  1. Robotaxi at scale. Driverless paid passenger service in arbitrary cities. Today: Waymo + Apollo Go + early Zoox / Tesla pilots. Open question: how fast can ODD expand?
  2. Trustable end-to-end AV. A single neural net, demonstrably safe under SOTIF, certified under R157+, deployable at consumer scale. Today: Tesla FSD v14 (un-certified), Wayve, Waabi (trucking).
  3. Generalist robot policy across embodiments. π0/π0.5, RT-2, OpenVLA — one policy controlling Franka, xArm, ALOHA, humanoids — given language.
  4. True digital-twin sim. Near-zero domain gap on sensors, behavior, and distribution — enough to replace a meaningful fraction of real miles for both training and validation. Waabi World and the Waymo World Model (Feb 2026, built on DeepMind Genie 3) are the most credible attempts.
  5. Data infrastructure that auto-converts drives/demos into training fuel. Closing the loop from "fleet log" → "labeled tensor" → "trained model" → "validated improvement" without human-in-the-loop bottlenecks.

(5) is a horizontal capability that the other four goals depend on. That is the Applied Intuition thesis.


G. Implications for Applied Intuition (and you)

Applied Intuition's portfolio (per the 2025 in-review post) covers:

  • Simulation — ASAM-OpenSCENARIO/OpenDRIVE/OSI compliant, scenario authoring, sensor sim. (D.1)
  • Data Explorer / Data Intelligence — fleet-log search, mining, curation, embedding retrieval. (E.4)
  • Axion Tooling & Infrastructure — all-domain data engine + sim toolkit. (F.5)
  • Acuity Ground / Air / Maritime — end-to-end autonomy stacks. (F.2 across non-passenger domains)
  • ADAS (expansion 2025), Foundation models / voice / GenAI — adjacencies.

Highest-leverage gaps for the Data Intelligence team:

  1. Closed-loop evaluation that's interpretable, auditable, and ASIL-mappable. Bench2Drive / NAVSIM v2 are research-grade; the productized OEM-safety-case version is wide open.
  2. Multi-agent / sim-agents realism. The gating problem for sim-to-validate. Whoever ships realistic NPCs — data-driven, not heuristic — owns closed-loop testing. (Sim Agents Challenge is the research analog.)
  3. Cross-embodiment data infrastructure for robotics. Applied Intuition is mostly AV today, but the same primitives (search, curation, embedding retrieval, scenario mining, replay) are arguably more valuable in robotics, where nobody has solved them. LeRobot is the OSS comparable; nobody has the enterprise version.
  4. Long-tail / corner-case mining at fleet scale. Tesla's shadow-mode and Waymo's hard-mining are internal. Productized edge-case mining is a real wedge — and CODA's <12.8% mAR quantifies the headline claim.
  5. Auto-labeling pipelines that survive SOTIF audit. Applied Intuition's work on mining massive autonomy datasets is already moving here — pretrained foundation models auto-labeling fleet logs at human-level quality per frame. The bridge from "auto-labeled" to "ISO-21448-defensible" is the unsolved piece.

Thesis: the field has converged on the diagnosis ("data and evaluation are the bottleneck") much faster than the cure. Most public datasets and benchmarks codify the old regime (open-loop, log-replay, perception-heavy). The opportunity at Applied Intuition Data Intelligence is to be in the room while the new regime — closed-loop, scenario-based, curation-first, multi-embodiment, SOTIF-defensible — is being defined and standardized.


Sources