Physical AI
Research·Doc 06·~60 min

Open Problems, Public Datasets, and Benchmarks in AV and Embodied AI

Sandbox brief for the Applied Intuition Data Intelligence team. The aim of this doc is to map the external scaffolding of the field — the public datasets, benchmarks, simulation/data standards, and regulatory frameworks that any AV or generalist-robotics team has to either consume, contribute to, or work around — and then to enumerate the deep open problems that explain why the field still needs companies like Applied Intuition.

Last verified: 2026-05-08.


A. AV public datasets

Public AV datasets are how the academic community benchmarks itself and how every AV team bootstraps perception before touching its own (much larger) production logs. Two structural facts: (i) the largest open dataset (BDD100K, ~1,100 hrs) is a rounding error against Tesla's billions of fleet-miles or Waymo's tens of millions of rider-only miles; (ii) most are CC BY-NC research-only — the commercially usable subset (parts of KITTI, ZOD, A2D2, BDD100K) is thin.

| Dataset | Released | Size | Sensor suite | Tasks | License | 2026 relevance |
|---|---|---|---|---|---|---|
| KITTI | 2012 | ~6 hrs / ~50 sequences, Karlsruhe | 2 stereo cameras, Velodyne HDL-64E, GPS/IMU | 3D detection, tracking, stereo, optical flow, odometry | CC BY-NC-SA | Foundational; mostly historical now |
| KITTI-360 | 2021 | 73.7 km, ~320k images, ~100k LiDAR scans | Fisheye + perspective + stereo cameras, Velodyne, SICK 2D, IMU/GPS | Dense 2D+3D semantic/instance segmentation across 19 classes, novel-view synthesis | CC BY-NC-SA | Standard for urban scene understanding (KITTI-360) |
| Cityscapes | 2016 | 5,000 fine + 20,000 coarse images, 50 German cities | Stereo cameras only (no LiDAR) | Semantic / instance segmentation (30 classes) | Free for non-commercial / academic | Still the canonical urban segmentation benchmark |
| Mapillary Vistas / Vistas 2.0 | 2017 / 2020 | 25k high-res images, worldwide, crowdsourced | Cameras only, varied devices | Semantic (124 classes in v2.0) + instance seg (70 classes) | Research license | Used for cross-geography robustness (Vistas paper) |
| nuScenes | 2019 | 1,000 scenes × 20 s = ~5.5 hrs, Boston + Singapore | 6 cameras, 1 spinning LiDAR, 5 radars, GPS/IMU | 3D detection, tracking, prediction, panoptic | CC BY-NC-SA 4.0 | Still the most cited AV benchmark — first dataset to include radar; >1,000 papers cite it (nuScenes) |
| nuScenes-Panoptic | 2022 | 1,000 scenes (re-annotated subset of nuScenes) | Same as nuScenes | LiDAR panoptic segmentation, multi-object panoptic tracking | CC BY-NC-SA | Active for LiDAR-based panoptic research |
| nuPlan | 2022 | 1,282 hours across Las Vegas, Boston, Pittsburgh, Singapore + 120 hrs raw sensor (~16 TB) | 8 cameras, 5 LiDARs (raw subset); auto-labeled tracks for full dataset | Closed-loop ML planning | CC BY-NC-SA | Reference for closed-loop motion planning; community ports active (e.g. SMART, nuPlan-R) (nuPlan) |
| Waymo Open – Perception | 2019 (v1.0) | ~1,950 segments × 20 s ≈ 11 hrs | 5 cameras, 5 LiDARs (1 mid-range + 4 short-range) | 2D/3D detection, tracking, segmentation | Waymo Dataset License (research) | Top-tier 3D perception benchmark (Waymo Open) |
| Waymo Open – Motion | 2021 | 103,354 segments, 20 s each, with HD maps | Object tracks only (sensor-level not in this subset) | Motion forecasting, interaction prediction | Same | Drives 2021–2025 prediction SOTA |
| Waymo Open – Sim Agents | 2023 | Built on Motion subset | Same | Multi-agent behavior generation; uses Waymax | Same | 2025 winners hit ~0.785 realism, ~0.92 map-metric (TrajTok report) |
| Waymo Open – End-to-End (WOD-E2E) | 2025 | 4,021 segments (~12 hrs), curated long-tail (occurrence <0.03%) | 8 cameras (no LiDAR for E2E) | Vision-only end-to-end driving with Rater Feedback Score | Same | Brand-new benchmark; first open-set long-tail E2E set (WOD-E2E paper) |
| Argoverse 1 | 2019 | 113 sequences (sensor) + 324,557 5-s scenarios (motion forecasting) | 7 ring cameras, 2 stereo, 2 roof LiDARs | 3D tracking, motion forecasting, HD maps (290 km) | CC BY-NC-SA 4.0 | Still used; superseded by AV2 for new work (Argoverse) |
| Argoverse 2 | 2022 | 1,000 sensor scenes (~4.2 hrs), 250k motion-forecasting scenarios, 20k LiDAR-only sequences (6M frames) | 7 ring cameras (2048×1550), 2 stereo, 2 32-beam LiDARs | 3D detection, motion forecasting, scene flow, point-cloud forecasting | CC BY-NC-SA 4.0 | Current default for sensor-based AV2 challenges (AV2 paper) |
| Lyft / Woven Planet L5 | 2019 (perception) / 2020 (prediction) | ~1,118 hours, 16,000 mi, Palo Alto | 7 cameras, 3 LiDARs | Perception, prediction | CC BY-NC-SA | Effectively deprecated. Lyft sold L5 to Toyota (Woven Planet) in 2021; Woven was folded back into Toyota Group in 2023. l5kit on GitHub still works but is unmaintained; not used for new SOTA work |
| ZOD (Zenseact Open Dataset) | 2023 | 100k Frames + 1,473 Sequences (20 s each) + 29 Drives (multi-minute), 14 European countries, 2-yr collection | 8MP camera, 2× LiDAR (long-range Velodyne + short-range), radar (added Jan 2025), IMU/GNSS | 2D/3D detection up to 245 m, road segmentation, traffic-sign recognition, road classification | Permissive — research AND commercial | Most commercially relevant open AV dataset — uniquely permissive license (ZOD) |
| ApolloScape | 2018 | 100k+ image frames, 80k LiDAR points, ~1,000 km | Riegl mobile LiDAR, 2 high-res cameras at 30 fps, GPS/IMU | Scene parsing, lane segmentation, 3D detection/tracking, trajectory prediction, self-localization, stereo, instance segmentation | CC BY-NC-SA | Strong for China-domain transfer; less SOTA traction now (ApolloScape) |
| ONCE | 2021 | 1M LiDAR frames + 7M images, 144 hrs, 4 Chinese cities | 40-beam LiDAR + 7 cameras | 3D detection (with massive unlabeled set for self-supervised pretraining) | Research | Largest scale of any public LiDAR dataset by frame count (ONCE) |
| A2D2 | 2020 | 41,277 annotated frames (12,497 with 3D boxes) + unlabeled sequences | 6 cameras, 5 LiDARs, full bus signals | Semantic + instance seg, 3D detection, CAN/bus extraction | CC BY-ND 4.0 — commercial use permitted | Useful for fusion and bus-signal research (A2D2) |
| BDD100K | 2018 | 100,000 videos × 40 s = ~1,100 hours | Phone-grade cameras only (smartphone-collected) | 10 tasks: detection, tracking, lane marking, drivable area, full-frame semantic + instance seg, MOT, MOTS, scene tagging | BSD 3-Clause-style (research), commercial via license | Largest open driving-video corpus by hours (BDD) |
| OpenLane | 2022 | 200k frames (built on Waymo Open) | Waymo cameras | 3D lane detection (first large-scale 3D lane dataset) | Inherits Waymo terms | Still the 3D-lane reference |
| OpenLane-V2 | 2023 (NeurIPS D&B) | 2,000 annotated road scenes | Multi-camera | Topology reasoning: lane centerlines + traffic elements + their relations (HD-map-on-the-fly) | Apache 2.0 | The standard online HD-map / topology benchmark (OpenLane-V2) |
| CODA | 2022 (ECCV); W-CODA workshops at ECCV 2024 | ~10,000 images, 43 object classes, drawn from KITTI / nuScenes / ONCE | Inherits sources | Corner-case object detection (novel-class + novel-instance) | Research | Standard for OOD / corner-case evaluation. SOTA detectors get <12.8% mAR here — the headline result for "perception is unsolved" (CODA paper) |

The takeaway: the public-AV stack has plenty of perception data, modest prediction data, very little closed-loop planning data, and almost no commercially licensed data above 100 hours (ZOD is the exception). That gap is where vendor tooling sits.


B. Robotics public datasets

Robotics public datasets are 5–7 years behind AV in maturity, but the curve is steep. Open X-Embodiment (Oct 2023) was the watershed; AgiBot World Beta (Mar 2025) is plausibly the first robot dataset comparable in frame count to a large AV dataset.

| Dataset | Released | Scale | Embodiments | Tasks | License | Notes |
|---|---|---|---|---|---|---|
| RoboNet | 2019 | ~162k trajectories, ~15M frames | 7 robot platforms | Tabletop manipulation, video prediction | Research | Pre-LLM era; "ImageNet attempt #1" (RoboNet) |
| BridgeData V2 | 2023 | 60,096 trajectories (50,365 expert + 9,731 scripted), 24 environments | WidowX 250 6-DOF (single embodiment), 5 Hz, ~38 timesteps avg | 13 skills, language-conditioned manipulation | MIT-style | The most-used base for cross-task generalization in academic VLAs (BridgeData V2) |
| RT-1 / RT-2 datasets | 2022–2023 | RT-1: ~130k episodes collected on 13 Everyday Robots | Mostly Everyday Robots | Mobile manipulation in real kitchens | Restricted (Google internal) | The internal complement to Open X; partially folded into Open X |
| Open X-Embodiment (OXE) | Oct 2023 | >1M trajectories pooled from 60 datasets, 22 embodiments, 527 skills, ~160k tasks; ~120 timesteps avg, 3–10 Hz | 22 robot platforms (Franka, xArm, WidowX, ALOHA, quadrupeds, etc.) | Manipulation primarily; some navigation | Mixed (per source dataset) | The de-facto pretraining corpus for VLAs (RT-X, OpenVLA, Octo, π0). The "ImageNet of robot learning" (OXE) |
| DROID | 2024 (RSS) | 76k successful trajectories, ~350 hours, 564 scenes, 86 tasks; 1.7 TB | Single embodiment (Franka), 18 robots, 13 institutions, 50 collectors | In-the-wild diverse manipulation | Research | Higher quality and diversity than OXE constituent sets; key for "wild" generalization (DROID) |
| BEHAVIOR-1K | 2022 (CoRL) → ongoing | 1,000 daily-life activities defined in BDDL (predicate logic), 50 scenes, 5,000+ object models | Simulated (OmniGibson) | Long-horizon mobile manipulation, household | Free | Reference benchmark for task definition and long-horizon planning. 2025 BEHAVIOR Challenge demos on HF (BEHAVIOR) |
| Habitat datasets (HM3D, ReplicaCAD, MP3D) | 2020–2022 | HM3D: 1,000 scanned scenes; ReplicaCAD: ~100 articulated scenes | Simulated agents | Navigation, rearrangement, embodied QA | Research | The dominant indoor-nav / embodied-AI sim assets (AI Habitat) |
| LIBERO | 2023 (NeurIPS) | 130 language-conditioned tasks, 4 suites (OBJECT, GOAL, SPATIAL, LIBERO-100 = 90 train + 10 test) | Simulated Franka | Lifelong knowledge transfer, language-conditioned manipulation | MIT | Now the de-facto VLA evaluation suite. Reported by π0, π0.5, OpenVLA, Octo, etc. (LIBERO) |
| AgiBot World – Alpha | Dec 2024 | 92,214 trajectories, ~595 hours, 36 tasks, ~8.5 TB | AgiBot G1 humanoid + dual-arm | Humanoid manipulation | CC BY-NC | Largest Chinese open humanoid set at the time (AgiBot World) |
| AgiBot World – Beta | Mar 2025 | >1M trajectories, 2,976 hours, 217 tasks, 87 skills, 100 robots, 5 deployment domains, 3,000+ objects | AgiBot G1 (100 units) | Humanoid manipulation incl. mobile / bimanual / dexterous | CC BY-NC | Robotics-side counterpart in scale to large AV datasets. IROS 2025 Best Paper Finalist (AgiBot Beta) |
| RoboMIND | Dec 2024 (arXiv) | 107k trajectories, 479 tasks, 96 object classes; 52,926 Franka + 25,170 UR-5e + 19,152 Tien Kung humanoid + 10,629 AgileX bimanual; 5k labeled failure demos; Isaac Sim digital twin | 4 embodiments | Multi-embodiment manipulation w/ failure analysis | Research (HF) | Beijing Innovation Center + PKU. RoboMIND 2.0 (Dec 2025) adds bimanual mobile data (RoboMIND) |
| Hugging Face LeRobot Hub | 2024–ongoing | Thousands of community datasets in the LeRobotDataset format (v3.0 streaming, file-based) | All major embodiments incl. SO-100, Koch, ALOHA, Franka, Stretch | Anything users upload | Per-dataset | The aggregator. π0 / π0.5 integrated 2025; LIBERO + 130 tasks integrated. Closest thing to "GitHub for robot data" (LeRobot, v3.0 blog) |
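
Given LeRobot's role as the aggregator, a loading sketch shows how thin the consumer surface is. Treat the import path and repo id below as assumptions: both have shifted across lerobot releases, and v3.0 moves to a streaming, file-based layout (check the repo for current equivalents).

```python
# Hedged sketch: class name and import path assumed from LeRobot docs;
# they may differ in the v3.0 release.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset("lerobot/pusht")          # any hub dataset in LeRobotDataset format
frame = ds[0]                                 # one timestep: a dict of named tensors
print(ds.num_episodes, ds.fps, list(frame))   # episode count, control rate, feature keys
```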

Takeaway: every "X for robot learning" dataset holds 1–2 orders of magnitude less data than the smallest commercial AV fleet collects in a week. Cross-embodiment is the lever, because no single embodiment has enough data alone.


C. Public benchmarks and leaderboards

Benchmarks turn datasets into competitions. The 2024–2025 trend in both AV and robotics has been a shift from open-loop (predict next waypoint, compare to log) to closed-loop (run the agent in sim and watch), because open-loop metrics correlate poorly with on-road performance.

C.1 AV leaderboards

  • nuScenes detection / tracking / prediction (open-loop): still the most-submitted 3D perception leaderboard, with NDS and AMOTA as headline metrics (nuScenes).
  • Waymo Open Dataset Challenges — 6th annual edition in 2025, four tracks: Vision-Based End-to-End Driving, Scenario Generation, Interaction Prediction, Sim Agents. 2025 submissions closed May 22; leaderboards remain open (2025 blog). The new Rater Feedback Score for E2E compares predicted trajectories to human-rater preference labels rather than log replay — a meaningful methodological move.
  • Argoverse Motion Forecasting (AV1: 324k scenarios; AV2: 250k) — minADE / minFDE / Miss Rate / Brier-minFDE on 6-mode multimodal forecasts (sketched in code after this list).
  • CVPR Workshop on Autonomous Driving (WAD) — 8th edition 2025, run by OpenDriveLab; hosts the Autonomous Grand Challenge with tracks for NAVSIM v2, Planning-Oriented AD, and E2E frontiers; ~$100k prize pool (CVPR 2025 OpenDriveLab).
  • CARLA Leaderboard 2.0 — closed-loop sim eval. Driving Score = Route Completion × Infraction Penalty (see the same sketch after this list). Notoriously hard; most public submissions score below 30.
  • Bench2Drive (NeurIPS 2024 D&B) — built on CARLA L2.0. 220 short routes (5 × 44 scenarios × 23 weathers × 12 towns), 2M annotated training frames. First benchmark to disentangle multi-ability scoring (cut-in vs detour vs overtaking) so you can see where a stack fails (Bench2Drive).
  • nuPlan planning challenge — closed-loop ML planning on real logs. Ran once in 2023; the devkit is no longer maintained by Motional (after Aptiv pulled back). Community forks (Horizon Robotics' is most active) and NAVSIM v2 (CVPR 2025) effectively supersede it (NAVSIM v2).
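
Two of the metric families above are compact enough to state exactly. A minimal NumPy sketch of minADE/minFDE and a CARLA-style Driving Score; the infraction penalty factors are illustrative placeholders, and the official Argoverse and CARLA evaluators remain the reference implementations.

```python
import numpy as np

def min_ade_fde(pred_modes: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """pred_modes: (K, T, 2) multimodal forecasts; gt: (T, 2) logged future."""
    dists = np.linalg.norm(pred_modes - gt[None], axis=-1)   # (K, T) displacements
    return dists.mean(axis=1).min(), dists[:, -1].min()      # minADE, minFDE

# CARLA-style Driving Score: route completion scaled by one multiplicative
# penalty per infraction. Factor values here are illustrative, not the
# official leaderboard table.
PENALTY = {"collision_pedestrian": 0.50, "collision_vehicle": 0.60,
           "collision_static": 0.65, "red_light": 0.70, "stop_sign": 0.80}

def driving_score(route_completion: float, infractions: list[str]) -> float:
    penalty = float(np.prod([PENALTY[i] for i in infractions])) if infractions else 1.0
    return route_completion * penalty

rng = np.random.default_rng(0)
gt = np.cumsum(rng.normal(size=(30, 2)), axis=0)             # 30-step logged future
preds = gt[None] + rng.normal(scale=0.5, size=(6, 30, 2))    # K=6 forecast modes
print(min_ade_fde(preds, gt))
print(driving_score(0.92, ["red_light", "collision_static"]))  # 0.92 * 0.70 * 0.65
```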

C.2 Robotics benchmarks

  • Open X-Embodiment evaluations — no single OXE leaderboard; papers report on source-dataset splits + LIBERO + DROID held-out tasks. The fragmentation is itself a problem.
  • LIBERO — the most-reported VLA benchmark in 2024–2025. Standard practice: success rate on Spatial / Object / Goal / Long. LIBERO-PRO (Oct 2025, arXiv 2510.03827) argues the original splits saturate and proposes harder variants.
  • CALVIN — language-conditioned long-horizon manipulation in sim, 7-DOF Franka + desk; 24 hrs of teleoperated play, 20K language directives. Metric: instructions-completed-in-a-row (1 → 5); see the chain-score sketch after this list.
  • AgiBot World Challenge 2025, BEHAVIOR Challenge 2025, MimicBench / RoboBench / SimplerEnv / Robosuite — second-tier sim benchmarks; mostly for ablations.
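
CALVIN's chain metric is simple to pin down exactly. A minimal sketch, assuming rollout results arrive as per-episode lists of booleans (one per instruction in the 5-instruction chain):

```python
# CALVIN-style scoring: an episode's score is how many of its 5 instructions
# are completed consecutively from the start.
def chain_score(successes: list[bool]) -> int:
    n = 0
    for ok in successes:
        if not ok:
            break
        n += 1
    return n

episodes = [[True, True, False, True, True],   # scores 2
            [True, True, True, True, True],    # scores 5
            [False, True, True, True, True]]   # scores 0
scores = [chain_score(e) for e in episodes]
avg_len = sum(scores) / len(scores)            # average instructions-in-a-row
rate_at = {k: sum(s >= k for s in scores) / len(scores) for k in range(1, 6)}
print(avg_len, rate_at)
```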

The robotics-evaluation field is fragmented and contested — nothing yet with the cultural force of ImageNet-2012 or COCO. LIBERO is the closest, but it's sim-on-a-single-embodiment, and papers regularly disagree on what real-world success rates "should" look like.


D. Industry standards and regulatory frameworks

This is the layer that matters for shipping a real product, and where Applied Intuition has product-market fit — its tooling is the bridge between research-grade ML and ISO-26262-grade systems.

D.1 ASAM standards (simulation interop)

ASAM (Association for Standardization of Automation and Measuring Systems) maintains the open formats that let driving simulators and tool vendors interoperate. If you ship "scenario X to OEM Y," you express it in ASAM formats.

| Standard | Purpose | Latest version | Status |
|---|---|---|---|
| ASAM OpenDRIVE | Logical road network description (lanes, geometry, signs, signals) — fed into simulators | 1.9.0, 2026-03-11 (spec) | Universal in industry sims |
| ASAM OpenSCENARIO 1.x (XML) | Scenario description for dynamic content (actors, triggers, maneuvers) — XML-based | 1.3 (current XML) | Wide adoption |
| ASAM OpenSCENARIO DSL (a.k.a. OSC 2.x) | New DSL approach (human-readable, parametric, composable) — supersedes XML for advanced scenario authoring | DSL 2.1.0 (2024-03-07); 2.2.0 in public review early 2026 | Growing adoption in scenario-based testing; Foretellix and Applied Intuition both support it |
| ASAM OSI (Open Simulation Interface) | Interface for sensor/environment/ground-truth data between simulator and SUT (System Under Test) | OSI 3.x | Standard for sensor-model co-simulation |
| ASAM OpenLABEL | Annotation format for objects and scenarios (JSON-based) | 1.0+ | Adopted but not universal |

Together they form a pipeline: describe a road (OpenDRIVE), populate it with actors (OpenSCENARIO), run it in any compliant simulator (OSI), and exchange labels (OpenLABEL). This is the interoperability foundation Applied Intuition's simulation products build on.
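
To make the division of labor concrete, here is a toy OpenDRIVE-shaped document built with the Python standard library. The element hierarchy (header / road / planView / geometry / lanes) follows the OpenDRIVE pattern, but the attributes shown are a sketch rather than a spec-complete map.

```python
# Toy OpenDRIVE-like road: one straight 100 m segment starting at the origin,
# heading east. Illustrative only; consult the ASAM spec for real authoring.
import xml.etree.ElementTree as ET

root = ET.Element("OpenDRIVE")
ET.SubElement(root, "header", revMajor="1", revMinor="9", name="toy_map")
road = ET.SubElement(root, "road", id="1", length="100.0", junction="-1")
plan = ET.SubElement(road, "planView")
geo = ET.SubElement(plan, "geometry", s="0.0", x="0.0", y="0.0",
                    hdg="0.0", length="100.0")
ET.SubElement(geo, "line")                    # straight-line geometry primitive
lanes = ET.SubElement(road, "lanes")
ET.SubElement(lanes, "laneSection", s="0.0")

print(ET.tostring(root, encoding="unicode"))
```

An OpenSCENARIO file then references this road network and layers actors, triggers, and maneuvers on top; OSI carries the resulting sensor and ground-truth streams to the system under test.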

D.2 Functional safety and SOTIF

| Standard | Scope | Key idea |
|---|---|---|
| ISO 26262 (2018, 2nd ed.) | Functional safety of E/E systems in road vehicles | ASIL classification, hazard analysis, fault tolerance against system failures |
| ISO 21448 (SOTIF, 2022) | Safety of the Intended Functionality — covers SAE Levels 1–5; requires field monitoring after deployment | Risk from functional insufficiencies and foreseeable misuse in the absence of failures (e.g. an ML perception model fails on glare). Distinct from 26262 (faults) and ISO/SAE 21434 (cyber) (ISO 21448) |
| ISO/SAE 21434 (2021) | Cybersecurity engineering for road vehicles | The CSMS process backbone for UN R155 |

For ML-heavy AV systems, SOTIF is the more interesting standard — essentially "how to argue a probabilistic perception/planning stack is safe enough" — and it's the backbone of the AV safety case.

D.3 UN / UNECE WP.29 regulations

UN regs are the international type-approval framework (EU, UK, Japan, Korea, ~40 others; US uses FMVSS self-certification).

  • UN R157 (ALKS) — adopted Jan 2021, in force Jan 2023. First international type-approval for an SAE L3 function. Original cap: 60 km/h; 01 Series Amendment raised it to 130 km/h (lane-change-capable systems only) (UNECE). 2024 supplement added EMC resilience.
  • UN R155 (Cybersecurity / CSMS) — adopted 2020. Certified Cyber Security Management System mandatory for type approval. All new vehicles sold in UN-1958 contracting states from July 2024 must comply.
  • UN R156 (Software Updates / SUMS) — R155 companion; governs the OTA lifecycle.
  • WP.29 — the UN Working Party 29 body that drafts and amends the above through its GRVA and GRSG/GRSP working parties.

D.4 US regulatory framework

  • NHTSA AV TEST Initiative (2020) — voluntary public reporting of testing locations and incidents.
  • NHTSA AV STEP (NPRM Jan 15, 2025) — proposed ADS-equipped Vehicle Safety, Transparency, and Evaluation Program; voluntary, with independent assessment of safety cases (Federal Register).
  • NHTSA AV Framework (April 24, 2025) — three principles: prioritize safety, remove barriers, enable commercial deployment (DOT).
  • California DMV / CPUC — testing + deployment permits + ride-service licensing. As of late 2025, only Waymo holds a full deployment + driverless paid passenger permit; Zoox has driverless pilot permits. CA DMV released a revised regulatory proposal Dec 3, 2025 (CNBC).
  • Texas SB 2807 (2025) — requires commercial AV operating authorization from Sept 1, 2025. Free, non-expiring.
  • Arizona — ADOT self-certification; permissive. Waymo's largest ODD (~315 sq mi metro Phoenix); Tesla supervised-Robotaxi approval Nov 2025.

D.5 EU AI Act and other frameworks

The EU AI Act (adopted 2024, most provisions live Aug 2026) classes AVs as high-risk AI, but Article 25(4) routes them through sectoral type-approval law (UN regs + EU 2018/858 + EU 2019/2144) rather than the standard high-risk obligations. OEMs therefore end up maintaining two documentation stacks: AI-Act-style artifacts (data governance, bias monitoring, transparency logs) layered onto the existing type-approval evidence (Squire Patton Boggs).

  • SAE J3016 (J3016_202104, Apr 2021) — the canonical six-level (L0–L5) taxonomy. Joint SAE / ISO TC204/WG14 effort. The reference for "what level is this?" debates.
  • PEGASUS (DE, completed mid-2019) — produced the scenario-based testing methodology now industry-standard (PEGASUS).
  • ENABLE-S3 (EU, 2016–2019) — earlier validation-of-automated-systems work that fed into PEGASUS.
  • Hi-Drive (EU Horizon, 2022–2026) — large-scale on-road testing across multiple EU countries; complements PEGASUS's sim focus with field data.
  • SET Level (DE) — successor to PEGASUS for L4/L5 simulation pipelines.


E. Deep open problems

The standards above describe the rules of the game. The problems below are why the game isn't won.

E.1 The long-tail / safety-validation problem

The RAND "Driving to Safety" (2016) result: to demonstrate with 95% confidence that an AV's fatality rate is within 20% of the human baseline (1.09 fatalities per 100M VMT) requires 8.8 billion miles of testing; to show better than human, hundreds of billions (RAND RR-1478). Waymo's 170M cumulative rider-only miles are two orders of magnitude short. You cannot drive your way to safety. This is the structural case for simulation, scenario-based testing (PEGASUS, NAVSIM, Bench2Drive, Waabi World), and synthetic data — and the macro case for Applied Intuition. Corollary: edge-case discovery is its own discipline. CODA's <12.8% mAR for SOTA detectors is the evidence that "pretrained on nuScenes" does not generalize.
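
The RAND arithmetic is worth reproducing because it anchors the whole section: under a Poisson model, estimating a rate to within ±20% at 95% confidence requires roughly (1.96/0.20)² ≈ 96 observed events, and at the human fatality baseline that converts to ~8.8 billion miles.

```python
# Reproduces the RAND "Driving to Safety" headline number from first principles.
z, rel_err = 1.96, 0.20                 # 95% confidence, +/-20% relative precision
rate = 1.09 / 100e6                     # human baseline: fatalities per mile
events_needed = (z / rel_err) ** 2      # ~96 observed fatalities
miles_needed = events_needed / rate     # ~8.8e9 miles
print(f"{events_needed:.0f} events -> {miles_needed:.2e} miles")
```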

E.2 The sim-to-real gap

Three sub-gaps: (1) sensor realism — does synthesized lidar/camera match physical noise statistics? (NVIDIA Cosmos, Waymo Sim, Waabi World, 3DGS reconstruction are converging on "yes for most, no for edge cases"); (2) behavior realism — do other agents drive like humans? This is the Sim Agents problem; 2025 winners hit ~0.785 realism against an unknown ceiling; (3) distribution realism — does the sim cover the long tail in the right proportion? Failure on any one collapses sim value.
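
Sub-gap (1) is the most directly testable of the three. A minimal sketch of the kind of check a sensor-sim team runs, using a two-sample KS test on a per-point noise statistic; the arrays here are synthetic stand-ins for real and simulated lidar range residuals.

```python
# Synthetic stand-ins: per-point lidar range residuals (measured minus true range)
# from real logs vs the sensor model. A calibrated simulator should match not
# just the mean but the full residual distribution.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
real_residuals = rng.normal(0.0, 0.020, 50_000)   # meters, "from real logs"
sim_residuals = rng.normal(0.0, 0.025, 50_000)    # slightly over-dispersed sensor model
stat, p = ks_2samp(real_residuals, sim_residuals)
print(f"KS statistic {stat:.4f}, p = {p:.3g}")    # at this n, even a 5 mm sigma gap rejects
```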

E.3 Open-loop vs closed-loop evaluation

Open-loop ("predict 3 s, compare to log") is cheap and reproducible but doesn't compound errors — a planner drifting 1 cm/step looks fine open-loop and crashes closed-loop. Closed-loop is faithful but expensive, hard to standardize, and requires credible sim agents (see E.2). Bench2Drive and NAVSIM v2 are current best attempts at the middle. Waymo's RFS is a different angle: anchor to human preference, not log replay.

E.4 Data quantity vs quality (curation > collection)

Post-2024, consensus shifted from "more" to "better." Tesla's shadow-mode triggers, Waymo's hard-example mining, DROID's deliberate diversity curation, Open X's per-source quality filtering — all point the same way. Applied Intuition's Data Explorer / Axion reflect this: the rate-limiter isn't log volume, it's finding the 0.01% of frames that change model behavior.
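
The core primitive behind that rate-limiter is rarity scoring in embedding space. A minimal sketch: score each fleet frame by its distance to the k-th nearest neighbor in the curated training set and surface the top tail (brute-force distances for clarity; production systems use ANN indices).

```python
import numpy as np

def rarity_scores(train_emb: np.ndarray, fleet_emb: np.ndarray, k: int = 5) -> np.ndarray:
    """k-NN distance from each fleet frame to the curated training set."""
    d2 = ((fleet_emb ** 2).sum(1)[:, None]
          + (train_emb ** 2).sum(1)[None, :]
          - 2.0 * fleet_emb @ train_emb.T)          # (n_fleet, n_train) squared dists
    d2 = np.maximum(d2, 0.0)                        # guard tiny negative round-off
    return np.sqrt(np.sort(d2, axis=1)[:, k - 1])

rng = np.random.default_rng(0)
train = rng.normal(size=(2_000, 64))                       # curated training embeddings
fleet = np.vstack([rng.normal(size=(9_990, 64)),           # ordinary fleet frames
                   rng.normal(loc=4.0, size=(10, 64))])    # 0.1% planted edge cases
scores = rarity_scores(train, fleet)
top = np.argsort(scores)[-10:]                             # frames to triage/label
print(sorted(top.tolist()))                                # recovers indices 9990..9999
```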

E.5 Multi-agent realism

Single-vehicle planning in benign conditions is largely solved. Multi-agent interactive behavior — merging, unprotected lefts, occluded pedestrians, construction zones — is where stacks fail. The Sim Agents Challenge exists because without realistic NPCs, closed-loop sim is unfalsifiable. Arguably the hardest current AV sub-problem.
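
A crude but useful proxy for behavior realism is distribution matching on kinematic features, in the spirit of (though far simpler than) the likelihood-based Sim Agents metrics. A sketch comparing logged and simulated speed distributions with Jensen–Shannon distance, on synthetic stand-in data:

```python
# Histogram-level realism check: do simulated agents drive at human-like speeds?
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
logged_speed = rng.gamma(shape=4.0, scale=3.0, size=100_000)   # m/s, "from logs"
sim_speed = rng.gamma(shape=4.0, scale=3.5, size=100_000)      # "from sim rollouts"

bins = np.linspace(0, 60, 61)
p, _ = np.histogram(logged_speed, bins=bins, density=True)
q, _ = np.histogram(sim_speed, bins=bins, density=True)
print(f"JSD = {jensenshannon(p, q):.3f}")    # 0 = indistinguishable distributions
```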

E.6 Generalization

Cross-city — a Phoenix stack doesn't work in Mumbai; Waymo's expansion recipe is HD map + ODD definition + fleet ops + weeks of fine-tuning per city. Cross-weather — fog, snow, and heavy rain remain perception failure modes. Cross-embodiment (robotics) — Franka-trained policies don't transfer to xArm without retraining; Open X, RoboMIND's 4-embodiment design, and AgiBot World are explicit attacks on this.

E.7 End-to-end vs modular: the pendulum

  • 2018–2022: modular wins (Waymo, Cruise, Aurora) — interpretable, debuggable, certifiable.
  • 2022–2024: end-to-end wins (Tesla FSD v12 deleted ~300k lines of C++; Wayve / Waabi built E2E from day one).
  • 2025–2026: convergence. Waymo's foundation model is "modular transformer-based" — separate transformers per stage, but each is a neural net. Tesla FSD v14 is fully E2E but uses auxiliary tasks for interpretability. Wayve's GAIA-2 (2025) pairs E2E with explicit safety filters. The dichotomy is dissolving toward "neural everywhere, modular interfaces for safety arguments." Applied Intuition's tooling survives the transition because the interfaces survive even as implementations go neural.


F. North-star goals across the field

Stand back far enough and every research direction above aims at one of five long-term outcomes:

  1. Robotaxi at scale. Driverless paid passenger service in arbitrary cities. Today: Waymo + Apollo Go + early Zoox / Tesla pilots. Open question: how fast can ODD expand?
  2. Trustable end-to-end AV. A single neural net, demonstrably safe under SOTIF, certified under R157+, deployable at consumer scale. Today: Tesla FSD v14 (un-certified), Wayve, Waabi (trucking).
  3. Generalist robot policy across embodiments. π0/π0.5, RT-2, OpenVLA — one policy controlling Franka, xArm, ALOHA, humanoids — given language.
  4. True digital-twin sim. Near-zero domain gap on sensors, behavior, and distribution — enough to replace a meaningful fraction of real miles for both training and validation. Waabi World and the Waymo World Model (Feb 2026, built on DeepMind Genie 3) are the most credible attempts.
  5. Data infrastructure that auto-converts drives/demos into training fuel. Closing the loop from "fleet log" → "labeled tensor" → "trained model" → "validated improvement" without human-in-the-loop bottlenecks.

(5) is a horizontal capability that the other four goals depend on. That is the Applied Intuition thesis.


G. Implications for Applied Intuition (and you)

Applied Intuition's portfolio (per the 2025 in-review post) covers:

  • Simulation — ASAM-OpenSCENARIO/OpenDRIVE/OSI compliant, scenario authoring, sensor sim. (D.1)
  • Data Explorer / Data Intelligence — fleet-log search, mining, curation, embedding retrieval. (E.4)
  • Axion Tooling & Infrastructure — all-domain data engine + sim toolkit. (F.5)
  • Acuity Ground / Air / Maritime — end-to-end autonomy stacks. (F.2 across non-passenger domains)
  • ADAS (expansion 2025), Foundation models / voice / GenAI — adjacencies.

Highest-leverage gaps for the Data Intelligence team:

  1. Closed-loop evaluation that's interpretable, auditable, and ASIL-mappable. Bench2Drive / NAVSIM v2 are research-grade; the productized OEM-safety-case version is wide open.
  2. Multi-agent / sim-agents realism. The gating problem for sim-to-validate. Whoever ships realistic NPCs — data-driven, not heuristic — owns closed-loop testing. (Sim Agents Challenge is the research analog.)
  3. Cross-embodiment data infrastructure for robotics. Applied Intuition is mostly AV today, but the same primitives (search, curation, embedding retrieval, scenario mining, replay) are arguably more valuable in robotics, where nobody has solved them. LeRobot is the OSS comparable; nobody has the enterprise version.
  4. Long-tail / corner-case mining at fleet scale. Tesla's shadow-mode and Waymo's hard-mining are internal. Productized edge-case mining is a real wedge — and CODA's <12.8% mAR quantifies the headline claim.
  5. Auto-labeling pipelines that survive SOTIF audit. Applied Intuition's work on mining massive autonomy datasets is already moving here — pretrained foundation models auto-labeling fleet logs at human-level quality per frame. The bridge from "auto-labeled" to "ISO-21448-defensible" is the unsolved piece.

Thesis: the field has converged on the diagnosis ("data and evaluation are the bottleneck") much faster than the cure. Most public datasets and benchmarks codify the old regime (open-loop, log-replay, perception-heavy). The opportunity at Applied Intuition Data Intelligence is to be in the room while the new regime — closed-loop, scenario-based, curation-first, multi-embodiment, SOTIF-defensible — is being defined and standardized.


Sources