Robotics & Embodied Foundation Models: A Landscape Survey
Last verified: 2026-05-08. This document was originally drafted from training-data knowledge (January 2026 cutoff) and has since been verified and enriched against primary web sources (company blogs, arXiv, IEEE Spectrum, TechCrunch, Bloomberg, etc.). Inline [Source] citations link to verifying sources. Claims that could not be independently verified are flagged [unverified].
This document is a learning-oriented map of the embodied-foundation-model and humanoid-robotics landscape, organized for someone joining Applied Intuition's Data Intelligence team. The lens throughout is: where does the data come from, how is it labeled, and where does synthetic data fit?
A. The VLA (Vision-Language-Action) Thread
A VLA model takes pixels + a natural-language instruction and outputs robot actions (joint targets, end-effector deltas, or discrete tokens). The genre crystallized at Google in 2022–2023 and has since become the dominant recipe for "generalist" manipulation policies.
A.1 Google RT-1 → RT-2 → RT-X / Open X-Embodiment
RT-1, late 2022: 35M-parameter Transformer trained on ~130k teleop episodes collected over 17 months across 13 Everyday Robots arms. Tokenized 7-DoF arm + base actions. Established that scaling laws hold for manipulation given enough demos. RT-1.
RT-2, mid-2023: fused a pretrained VLM (PaLI-X / PaLM-E backbones) with action tokens. Key trick — cast actions as text tokens and co-fine-tune on web VQA + robot trajectories — let RT-2 inherit semantic generalization (e.g., "pick up the extinct animal" → dinosaur toy). RT-2.
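The action-as-text trick is mechanically simple: discretize each continuous action dimension into a fixed number of bins and emit the bin indices as ordinary text tokens, so the VLM's existing vocabulary, tokenizer, and next-token loss apply unchanged. A minimal sketch (the 256-bin count and normalized action range are illustrative assumptions, not RT-2's exact scheme):

```python
# Minimal sketch of action-as-text tokenization in the spirit of RT-2.
# The 256-bin discretization and [-1, 1] range are illustrative assumptions.
import numpy as np

N_BINS = 256
LOW, HIGH = -1.0, 1.0  # normalized action range per dimension

def action_to_text(action: np.ndarray) -> str:
    """Map a continuous action vector to a space-separated string of bin indices."""
    frac = np.clip((action - LOW) / (HIGH - LOW), 0.0, 1.0)
    return " ".join(str(int(f * (N_BINS - 1))) for f in frac)

def text_to_action(text: str) -> np.ndarray:
    """Invert the mapping: token string -> bin indices -> continuous action."""
    idx = np.array([int(t) for t in text.split()], dtype=np.float64)
    return LOW + (idx / (N_BINS - 1)) * (HIGH - LOW)

a = np.array([0.12, -0.4, 0.9, 0.0, 0.3, -0.7, 1.0])   # 7-DoF action
print(action_to_text(a))                  # e.g. "142 76 242 ..."
print(text_to_action(action_to_text(a)))  # quantized round-trip
```

Co-fine-tuning then just mixes these action strings with web VQA text under the same autoregressive objective.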
Open X-Embodiment / RT-X, October 2023: a multi-institution dataset and model effort. The dataset aggregated ~1M+ real-robot trajectories from 22 embodiments across 21 institutions, pooling 60 pre-existing datasets covering 527 manipulation skills in a shared (RLDS) schema. RT-1-X and RT-2-X showed positive transfer across embodiments — a single policy improving on multiple lab-specific baselines. The field's biggest "ImageNet moment" claim for robotics. Open X-Embodiment paper (arXiv:2310.08864), project page.
A.2 OpenVLA (Stanford / Berkeley / TRI / MIT)
OpenVLA, June 2024 (arXiv:2406.09246): 7B-parameter open-source VLA on Llama-2-7B + DINOv2 + SigLIP visual encoders, trained on 970k OXE manipulation trajectories on a 64×A100 cluster for 15 days. Outperformed RT-2-X (55B) by +16.5% absolute task success rate despite being 7× smaller, and shipped weights + training code + LoRA fine-tuning recipes. The de-facto open baseline for VLA research. Authors span Stanford, UC Berkeley, TRI, Google DeepMind, Physical Intelligence, and MIT. OpenVLA project, HF weights.
A.3 Octo (Berkeley)
Octo, early 2024: a small (27M / 93M) Transformer trained from scratch on 800k OXE episodes with a diffusion-policy action head. Emphasizes flexible conditioning (language or goal images), modality dropout, and easy fine-tuning. Popular for academic groups that don't want a 7B VLM in the loop. Octo.
A.4 Physical Intelligence — π0, π0-FAST, π0.5
Physical Intelligence (PI) — the Karol Hausman / Sergey Levine / Chelsea Finn startup — has shipped four notable models.
- π0, paper posted October 31, 2024 (arXiv:2410.24164): flow-matching VLA on PaliGemma-3B + a separate "action expert" Transformer. Action chunks predicted via flow matching at 50 Hz. Trained on OXE + a large internal dexterous mobile manipulator dataset across multiple embodiments. Demos: laundry folding, table bussing, box assembly — long-horizon, contact-rich. Open-sourced via openpi GitHub. π0 blog.
- π0-FAST, weights released February 2025 (paper January 2025): replaces flow matching with frequency-space action tokenization ("FAST" tokens, based on the Discrete Cosine Transform), making the policy purely autoregressive — easier to train with standard VLM infra and up to 5× faster to train (see the tokenization sketch after this list). π0-FAST research page.
- π0.5, April 22, 2025 (arXiv:2504.16054): the "open-world" version, co-trained on heterogeneous data — web data + cross-embodiment robot data + high-level semantic prediction + verbal supervision, aimed at generalization to unseen homes. Demoed cleaning kitchens and bedrooms in houses the robot had never seen during training — arguably the strongest public evidence to date that VLA generalization extends out-of-distribution. π0.5 blog.
- π0.6, model card November 17, 2025 (PI06 model card PDF) — the latest in the family at time of verification.
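To make FAST concrete: take the DCT of an action chunk along the time axis, quantize the coefficients (for smooth trajectories, most high-frequency terms round to zero), and emit the integers as tokens. A minimal sketch, as promised above; the chunk shape, scale factor, and omission of the final byte-pair-encoding stage are illustrative simplifications, not the published recipe:

```python
# Minimal sketch of frequency-space action tokenization in the spirit of
# pi0-FAST. Scale, rounding, and the missing BPE stage are assumptions.
import numpy as np
from scipy.fft import dct, idct

def encode_chunk(actions: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """actions: (T, D) chunk of continuous actions -> integer coefficient tokens."""
    coeffs = dct(actions, axis=0, norm="ortho")       # per-dimension DCT over time
    return np.round(coeffs * scale).astype(np.int32)  # quantize; high-freq terms mostly -> 0

def decode_chunk(tokens: np.ndarray, scale: float = 10.0) -> np.ndarray:
    return idct(tokens.astype(np.float64) / scale, axis=0, norm="ortho")

chunk = np.cumsum(np.random.randn(50, 7) * 0.02, axis=0)  # smooth 50-step, 7-DoF chunk
tokens = encode_chunk(chunk)
recon = decode_chunk(tokens)
print("nonzero tokens:", np.count_nonzero(tokens), "/", tokens.size)
print("max reconstruction error:", np.abs(recon - chunk).max())
```

In the published method, the sparse quantized coefficients are further compressed with BPE, which is what keeps sequences short enough for standard autoregressive VLM training.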
What's distinctive about PI: flow-matching action heads at high frequency, deliberate cross-embodiment training (including bimanual mobile bases), open weights, and emphasis on long-horizon dexterous tasks rather than tabletop pick-and-place.
PI funding (verified): $70M seed (March 2024) → $400M Series A at $2.4B valuation (November 2024, Bezos / OpenAI / Thrive / Lux / Bond) → $600M Series B at $5.6B post-money (November 2025, led by CapitalG — Bloomberg, The Robot Report). A reported $1B round at ~$11B valuation is in talks as of March 2026 [unverified — reported, not closed].
A.5 Figure — Helix / Helix 02
Helix, February 20, 2025: Figure's in-house humanoid VLA, a two-system stack. "System 2" — an internet-pretrained VLM running at 7–9 Hz reasoning over scene + instruction. "System 1" — an 80M visuomotor Transformer at 200 Hz outputting continuous control of the entire upper body (wrists, torso, head, individual fingers). They communicate via a latent vector. Demoed two humanoids collaboratively putting groceries away from a single neural net. Runs entirely onboard low-power embedded GPUs. Trained on internal teleop data; full specifics unpublished. Figure Helix.
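The two-system pattern (a slow reasoner periodically refreshing a latent command that a fast controller consumes) is easy to sketch as an asynchronous loop. Everything below, rates, latent size, and stand-in models included, is an assumption about the general pattern, not Figure's implementation:

```python
# Sketch of a two-rate control stack: a slow "System 2" planner refreshes a
# latent command at ~8 Hz while a fast "System 1" controller reads the most
# recent latent at 200 Hz. Rates, sizes, and stand-in models are assumptions.
import threading, time
import numpy as np

latent = np.zeros(64)   # shared conditioning vector
lock = threading.Lock()

def system2_vlm(image, instruction):
    return np.random.randn(64)   # stand-in for slow VLM inference

def system1_policy(obs, z):
    return 0.01 * z[:7]          # stand-in for fast visuomotor net -> 7-DoF command

def slow_loop():
    global latent
    while True:
        z = system2_vlm(image=None, instruction="put away the groceries")
        with lock:
            latent = z
        time.sleep(1 / 8)        # ~8 Hz reasoning

def fast_loop(steps=1000):
    for _ in range(steps):
        with lock:
            z = latent.copy()
        action = system1_policy(obs=None, z=z)  # would be sent to motor control
        time.sleep(1 / 200)      # 200 Hz control

threading.Thread(target=slow_loop, daemon=True).start()
fast_loop()
```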
Helix 02 (announced January 27, 2026): adds full-body autonomy (locomotion + manipulation + balance as one continuous learned system) and introduces a "System 0" foundation layer that runs balance/coordination at kilohertz rates underneath System 1/2. Demoed end-to-end autonomous dishwasher loading and living-room cleanup. Figure Helix 02.
A.6 1X Technologies — World Model + Neural Stack
1X's NEO humanoid runs an end-to-end neural net. Their distinguishing public artifact is the 1X World Model (late 2024): an action-conditioned generative video model used for both evaluation (compare predicted vs. real video) and data augmentation. Trained on thousands of hours of EVE/NEO interaction video from homes via teleop and the employee fleet. NEO Beta was unveiled August 30, 2024, and the consumer NEO launched in October 2025 at a $20,000 price point — though early reporting noted that initial deployments rely heavily on remote human teleoperation rather than full autonomy (dronexl coverage). 1X raised a $100M Series B in January 2024 (EQT Ventures). 1X World Model.
A.7 DeepMind — RT-2-X, Gemini Robotics
RT-2-X was DeepMind's contribution to the Open X-Embodiment effort (RT-2 fine-tuned on the cross-embodiment mix).
Gemini Robotics & Gemini Robotics-ER, announced March 2025 (DeepMind blog): Gemini-2.0-based VLA models. The family has two members: a "Robotics" action model that emits robot actions, and an "Embodied Reasoning (ER)" model that emits 3D bounding boxes, grasps, trajectories, and code — a foundation for downstream policies. DeepMind partnered with Apptronik for humanoid integration and with other OEMs (Boston Dynamics, Agility, Agile Robots, Enchanted Tools) as trusted testers.
Subsequent releases:
- Gemini Robotics On-Device (June 24, 2025): a VLA optimized to run locally on the robot, adapts to new tasks with as few as 50–100 demos. Ships with a Gemini Robotics SDK. On-Device blog.
- Gemini Robotics 1.5 + Gemini Robotics-ER 1.5 (September 2025): the "thinking" generation. ER 1.5 is a high-level reasoner that orchestrates multi-step plans and hands off to the 1.5 action model. Cross-embodiment transfer demonstrated across ALOHA-2, Franka bi-arm, and Apptronik Apollo without retraining. Gemini Robotics 1.5 blog.
- Gemini Robotics-ER 1.6 (DeepMind blog) — incremental upgrade to the ER reasoner.
A.8 Tesla Optimus
Public information on Optimus's stack is fragmentary. Known/confirmed: Optimus uses an end-to-end neural net for hand control and walking, shares perception infrastructure with the FSD vision stack (HydraNets, occupancy networks), and is trained partially on mocap suits + VR teleop at Tesla's Fremont facility. Tesla has not published a formal VLA paper. Gen 2 launched late 2023; Gen 3 was teased on Tesla's Q3 2025 earnings call with a target unveil in Q1 2026 [unverified production-ready date]. Reported Gen 3 specs: ~173 cm, ~57 kg, FSD-v13 end-to-end NN, 22-DoF tendon-driven hands (up from 11 in Gen 2), 2.3 kWh battery, target retail $20–25k. Performance in many public demos has been disclosed as teleoperated (e.g., the October 2024 "We, Robot" event). Treat capability claims with skepticism until independently demonstrated. Tesla AI.
A.9 Skild AI — Skild Brain
Skild AI (Deepak Pathak, Abhinav Gupta — CMU spinout) raised a $300M Series A on July 9, 2024 at a $1.5B valuation (Lightspeed, Coatue, SoftBank, Bezos Expeditions — Skild blog), followed by a ~$500M SoftBank-led raise in April 2025, and a ~$1.4B SoftBank-led round in January 2026 at a $14B valuation (Crunchbase News, TechCrunch). Their pitch is a single "general-purpose robot brain" ("Skild Brain") that generalizes across hardware: quadrupeds, manipulators, humanoids. They claim to have trained on a dataset "1000× larger than competitors" combining real, sim, and human video — specifics not disclosed. Few public demos beyond promo videos. Capability claims remain largely unverified. Skild AI.
A.10 Covariant — RFM-1 (now Amazon)
Covariant built warehouse pick-and-place arms running on a multimodal foundation model called RFM-1 (Robotics Foundation Model 1), announced March 11, 2024. RFM-1 ingests text, images, video, robot states, and actions as a unified token stream — an autoregressive 8B model trained on years of in-warehouse pick data plus internet video. On August 30, 2024, Amazon hired the three Covariant founders (Pieter Abbeel, Peter Chen, Rocky Duan) plus ~25% of staff, and signed a non-exclusive license to Covariant's models — a "reverse acquihire" structure (TechCrunch, GeekWire). A 2025 whistleblower disclosure put the deal value at ~$380M plus a $20M follow-on licensing fee — below Covariant's prior $625M valuation. Ted Stinson now leads the residual Covariant. Covariant RFM-1.
A.11 Apptronik + Apollo (DeepMind partnership)
Apptronik's Apollo (a 1.73 m, 73 kg general-purpose humanoid) entered commercial pilots starting in 2024. The Mercedes-Benz agreement (announced March 2024 — PR Newswire) is testing Apollo at Mercedes' Berlin-Marienfelde Digital Factory and a Hungarian site for assembly-kit delivery; Mercedes also took a low-double-digit million-euro stake. GXO launched a multi-phase Apollo R&D program in 2024 (lab → distribution-center deployment). Both remain pilot/R&D phase, not full commercial production deployments. In December 2024, Apptronik announced a collaboration with Google DeepMind to build foundation models for Apollo, leveraging Gemini Robotics. Funding: $350M Series A in February 2025 (co-led by B Capital and Capital Factory, with Google as a participant — not lead — per TechCrunch); a Series A extension in February 2026 brought the total to $935M at a ~$5B valuation (CNBC). Apptronik + DeepMind.
B. Humanoid Hardware Companies & Data Strategies
| Company | Robot | Status (verified May 2026) | Notable Customers / Pilots | Data Strategy |
|---|---|---|---|---|
| Figure | Figure 02 / 03 | $1B+ Series C at $39B post-money valuation (Sept 2025, Figure announcement); BotQ factory targeting 12k/yr → 50k/yr; reportedly producing 1 robot/hr (The Robot Report) | BMW Spartanburg pilot (Jan 2024 commercial agreement; Figure 02 contributed to ~30k X3 production over 11 months and was retired Nov 2025; no current Figure robots on the line per BMW) (BMW press, WardsAuto); BMW Leipzig pilot beginning Dec 2025 → summer 2026 | In-house teleop + Helix / Helix 02 VLA; building proprietary humanoid dataset |
| 1X | NEO Beta / NEO | $100M Series B Jan 2024; consumer NEO launched Oct 2025 at $20k; deployed in "a few hundred to a few thousand" homes — largely teleoperated, not autonomous | Direct-to-consumer beta in employee/early-customer homes | Teleop in homes + 1X World Model for sim/eval |
| Tesla | Optimus Gen 2/3 | Internal Fremont deployment; Gen 3 production-intent prototype unveil targeted Q1 2026 [unverified] | Tesla factories (small numbers) | VR teleop, mocap suits, FSD vision stack reuse |
| Apptronik | Apollo | $935M Series A total at $5B (Feb 2026); pilot/R&D at Mercedes & GXO | Mercedes-Benz (Berlin Digital Factory + Hungary, 2024); GXO Logistics multi-phase R&D (2024); Jabil [unverified specifics] | Teleop + DeepMind Gemini Robotics integration |
| Agility Robotics | Digit | "RoboFab" in Salem OR ([unverified] ~10k/yr capacity) | GXO Logistics multi-year RaaS at SPANX Atlanta DC (June 27, 2024 — industry-first humanoid RaaS deal, Agility press release); Amazon trials | Teleop + scripted skills + sim; tighter task scope (tote handling) |
| Boston Dynamics | Atlas (electric) | Electric Atlas unveiled April 2024 replacing hydraulic; TRI partnership for Large Behavior Models, Oct 16, 2024 (TechCrunch); Atlas+LBM long-horizon manipulation demo Aug 2025 | Hyundai (parent) plant in Georgia; trusted-tester for Gemini Robotics On-Device | MPC + RL in sim; LBMs trained from teleop with TRI |
| Unitree | H1, G1, R1 | H1/G1 sold widely as research platforms; G1 starts at ~$16k, ranges to ~$74k for EDU configs (The Robot Report) | Universities, research labs worldwide (de-facto research humanoid) | Sells hardware; not pursuing in-house foundation model |
| Sanctuary AI | Phoenix Gen 7 | Total funding ~$140M (incl. $30M Canadian SIF + 2024 BDC/InBC tranche); Magna deal includes equity stake | Magna International pilot (April 2024); Mark's retail (BC) week-long pilot; small scale | "Carbon" cognitive architecture; teleop-driven dataset |
| Fourier Intelligence | GR-1, GR-2 | Sold as research / rehab platform; Shanghai-based | Research labs, rehab clinics | Hardware-led; partnerships for software stack |
| UBTech | Walker S / S1 / S Lite | Pilots in Chinese auto factories | NIO F2 (2024 — first humanoid on auto assembly line), BYD, Geely/Zeekr, Dongfeng, FAW-VW (2024–25) | Internal + partner; less public on data approach |
| XPENG | Iron (Gen 2 unveiled Nov 5, 2025 at AI Day) | Announced 2024; next-gen unveiled Nov 2025 with 22-DoF hands, 3 in-house Turing chips (~2,250 TOPS), powered by XPENG VLA 2.0; mass production target end of 2026 | XPENG factory pilots | Vertical integration with XPENG's auto AI stack |
| PNDbotics | Adam | Smaller Chinese player; commercial pilots claimed | Limited public deployments | (limited public info) |
Pattern. The humanoid market split into three camps: (1) vertically integrated AI + hardware (Figure, 1X, Tesla, XPENG), (2) hardware + foundation-model partnership (Apptronik–DeepMind, BD–TRI), and (3) hardware-only / research platforms (Unitree, Fourier). The first formal commercial humanoid RaaS contract — Agility's GXO Logistics deal at the SPANX Atlanta DC (June 27, 2024) — remains the strongest empirical anchor for "humanoids actually working" in real environments, though scope is narrow (tote handoffs to conveyors). Most "deployment" announcements in 2024–25 are still pilots, not steady-state commercial production: BMW publicly retired the Figure 02 pilot in November 2025 with no current robots on its line, and Mercedes' Apptronik partnership has not graduated past pilot.
C. Manipulation, Mobile, and Locomotion Datasets
C.1 Open X-Embodiment (OXE)
The biggest aggregated robot dataset to date. The 2023 release pooled >1M trajectories from 60 datasets across 22 embodiments and 21 institutions, covering 527 manipulation skills [verified: arXiv:2310.08864]. Subsequent expansions have appeared (e.g., OXE-AugE in 2025 augments the OXE seed to >4.4M trajectories via robot augmentation), but the original ~1M figure is the headline; the oft-cited "~2.4M" figure for an "OXE 1.5" release is [unverified]. Standardized to a common RLDS schema. Limitations: a very long tail (a few datasets dominate, and many embodiments have only a handful of episodes), and mostly tabletop manipulation. OXE project, GitHub.
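RLDS's standardized step layout is what lets 60 heterogeneous datasets coexist in one training mix. A sketch of one RLDS-style episode as plain Python dicts (the real format is TFDS/TensorFlow-backed, and exact observation keys vary per dataset):

```python
# Sketch of one RLDS-style episode. RLDS itself is TFDS/TensorFlow-based;
# plain dicts are used here only to show the key layout.
import numpy as np

episode = {
    "steps": [
        {
            "observation": {
                "image": np.zeros((256, 256, 3), dtype=np.uint8),  # camera frame
                "state": np.zeros(7, dtype=np.float32),            # proprioception
            },
            "action": np.zeros(7, dtype=np.float32),  # e.g. EEF delta + gripper
            "reward": 0.0,
            "is_first": t == 0,
            "is_last": t == 49,
            "is_terminal": False,
            "language_instruction": "pick up the red block",
        }
        for t in range(50)
    ],
    "episode_metadata": {"file_path": "demo_0001.tfrecord"},
}
```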
C.2 DROID
DROID, March 2024 (arXiv:2403.12945): 76k successful demonstration trajectories / ~350 hours / 564 scenes / 52 buildings / 86 tasks, collected by 50 data collectors across 13 institutions on 3 continents over 12 months. Hardware: Franka Panda 7-DoF + 2 adjustable Zed 2 stereo cameras + wrist-mounted Zed Mini + Oculus Quest 2 teleop. Unlike OXE's heterogeneity, DROID is one platform / one teleop interface — much cleaner for training and evaluation. Released CC-BY 4.0 with code and pretrained checkpoints. DROID project.
C.3 Bridge / BridgeData V2
BridgeData V2 (Berkeley): ~60k demos across 24 environments on a low-cost WidowX arm, 2022–2023. Designed for cheap reproducibility, used heavily in academic VLA papers. Bridge V2.
C.4 RoboNet
2019, ~15M video frames across 7 robot platforms — historically important as one of the first cross-embodiment efforts, largely superseded by OXE.
C.5 MimicGen / Synthetic Demos
MimicGen (Nvidia / UTexas, 2023) generates new demos from a small seed by adapting reference trajectories to new object poses in simulation — ~10 source demos → 1000s of synthetic demos. DexMimicGen (2024) extends to bimanual dexterous tasks. The practical answer to "we can't afford 100k human demos per task." MimicGen.
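The core transform is worth seeing: re-express an object-centric segment of a source demo relative to the new object pose, so the grasp geometry carries over to a new scene. A minimal sketch under a 4x4 homogeneous-pose convention; the function name is illustrative, not MimicGen's API:

```python
# Sketch of MimicGen's core idea: replay an object-centric segment of a
# source demo under a new object pose. The pose convention and function
# name are illustrative assumptions, not MimicGen's actual API.
import numpy as np

def retarget_segment(eef_poses_world, obj_pose_src, obj_pose_new):
    """eef_poses_world: list of 4x4 end-effector poses from the source demo.
    Express each pose in the source object's frame, then map it into the new
    object's frame: the grasp geometry is preserved, the location is not."""
    T = obj_pose_new @ np.linalg.inv(obj_pose_src)
    return [T @ p for p in eef_poses_world]
```

Generated demos are kept only if replaying them in sim succeeds; generation is cheap, so a low yield is acceptable.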
C.6 BEHAVIOR-1K, RoboCasa, RoboHive, Habitat
- BEHAVIOR-1K (Stanford, Fei-Fei Li) — 1000 long-horizon household activities in OmniGibson, grounded in human surveys of "what should a home robot do."
- RoboCasa (Nvidia / UTexas, 2024) — Robosuite-based sim framework with 100+ kitchen layouts and procedurally generated objects, built for VLA scaling.
- RoboHive — meta-benchmark unifying many manipulation envs under one API.
- Habitat (Meta) — navigation + embodied AI; Habitat 3.0 (2023) added humanoid avatars for HRI research.
C.7 Industrial / Warehouse Data
Most warehouse-scale data is proprietary: Amazon Robotics, Symbotic, Ocado, Covariant (now Amazon), GXO all run live picks at billions-per-year scale but do not publish. Public warehouse benchmarks remain weak — a real opportunity area.
D. Sim-to-Real & Data Augmentation for Robots
D.1 Domain Randomization (Classic)
OpenAI's 2017–2019 work on robotic hands (Rubik's cube) established the playbook: train in sim with randomized physics, lighting, textures, and camera poses; the real world becomes "just another sample." Still the workhorse for locomotion (e.g., ANYmal, Spot, and Unitree controllers were all trained this way, using Isaac Gym / MuJoCo MJX in massively parallel envs).
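In practice this amounts to sampling a fresh physics-and-visuals configuration per episode. A minimal sketch of a randomization spec; parameter names and ranges are illustrative, and real pipelines randomize dozens more knobs:

```python
# Sketch of per-episode domain randomization. Parameter names and ranges are
# illustrative; real pipelines (Isaac Lab, MJX) randomize many more knobs.
import random

def sample_env_params():
    return {
        "friction": random.uniform(0.5, 1.5),        # contact friction scale
        "mass_scale": random.uniform(0.8, 1.2),      # link mass multiplier
        "motor_strength": random.uniform(0.9, 1.1),  # actuator gain
        "latency_ms": random.uniform(0, 40),         # observation/action delay
        "light_intensity": random.uniform(0.3, 2.0),
        "camera_jitter_deg": random.uniform(0, 5.0),
    }

# Each episode runs under a fresh sample, so the real world's fixed
# parameters look like one more draw from the training distribution.
for episode in range(3):
    print(sample_env_params())
```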
D.2 Generative Video as World Model / Data Engine
A 2024–2025 wave: rather than randomize physics, generate plausible futures with video models.
- 1X World Model (2024) — action-conditioned video generation, used for both eval and training data. 1X.
- Genie / Genie 2 / Genie 3 (DeepMind) — interactive world models. Genie 2 (Dec 2024) generated 3D-consistent rollouts from a single image. Genie 3 (announced Aug 5, 2025 — DeepMind blog) is the first real-time interactive general-purpose world model: text-prompted dynamic worlds at 24 fps, 720p, with multi-minute consistency through emergent memory. Initially research-preview; February 2026 rolled into Project Genie for Google AI Ultra users in the US.
- RoboGen (CMU / MIT, 2023) — uses LLMs to propose tasks, scenes, and reward functions, then fills in 3D assets and trains policies in sim.
- UniSim / Sora-style approaches — universal action-conditioned video models as a substitute for hand-built simulators. Still research-stage; the open question is whether they're physics-accurate enough for contact-rich manipulation.
D.3 Nvidia Stack — Isaac Sim, Isaac Lab, GR00T
This is now the dominant simulation stack outside of academic MuJoCo.
- Isaac Sim — GPU-accelerated physics on PhysX, photoreal rendering on Omniverse. Workhorse for industrial digital twins.
- Isaac Lab — open-source RL training framework on top of Isaac Sim (replaces the older Isaac Gym). Supports massively parallel training (thousands of environments on one GPU) for locomotion and manipulation.
- Project GR00T — Nvidia's "general-purpose humanoid foundation model" initiative announced at GTC March 2024.
- GR00T N1 (announced March 18, 2025 at GTC, Nvidia press, arXiv:2503.14734) — first open release. Dual-system architecture (slow VLM "System 2" + fast diffusion-transformer action head "System 1"), trained on a mix of real teleop, OXE, human video (egocentric), and Isaac-generated synthetic. Open weights.
- GR00T N1.5 (June 2025, demoed live at automatica 2025 on Franka FR3 dual-arm — Franka news) — substantial upgrade: 38.3% success rate on 12 DreamGen tasks vs 13.1% for N1. GR00T N1.5 research page.
- GR00T N2 — not announced as of the verification date (May 2026); earlier rumors of a release were incorrect.
- GR00T-Mimic (announced GTC 2025) — Nvidia's pipeline for generating synthetic robot demonstrations from a handful of human demos using a combo of MimicGen-style replay + Cosmos generative video augmentation.
- GR00T-Gen (announced GTC 2025) — generative pipeline for scenes, layouts, and tasks in Isaac (procedural homes, warehouses) — analogous to RoboCasa but tied into the Nvidia stack.
- Cosmos (announced Jan 6, 2025 at CES, Nvidia press) — Nvidia's family of "world foundation models" (diffusion + autoregressive variants) released under an open model license. Trained on 9,000 trillion tokens drawn from 20M hours of real-world human, environment, industrial, robotics, and driving data. Three tiers: Nano (edge/low-latency), Super (baseline), Ultra (max quality, distillation source). Early adopters: 1X, Agile Robots, Agility, Figure, Foretellix, Fourier, Galbot, Hillbot, IntBot, Neura Robotics, Skild AI, Virtual Incision, Waabi, XPENG, Uber. Nvidia Cosmos.
D.4 3DGS / NeRF Capture for Sim
Increasingly common: scan a real scene with a phone or capture rig, reconstruct it as a 3D Gaussian Splatting or NeRF asset, and drop it into Isaac/MuJoCo for rendering. Capture tools like Polycam and Luma, and research projects like GauSim and SplatSim, have made this a near-commodity workflow. The hard part is recovering accurate geometry + materials + collision meshes: rendering is largely solved, but the physics side still takes manual work.
D.5 Real-to-Sim Digital Twins
- RoboCasa (UTexas / Nvidia) — 100+ kitchen scenes, procedurally generated, built specifically for VLA scaling experiments.
- Genesis (CMU, 2024) — a unified differentiable physics + rendering engine that pitches itself as a generative simulator.
- Industry: Applied Intuition's own simulator stack, Nvidia Drive Sim, dSPACE, etc., for AVs; analogous "robot digital twin" tooling is much less mature. This is one of the cleanest whitespace areas.
E. Whitespace and Bottlenecks
E.1 The Data Problem: No Internet for Robots
LLMs scaled because the internet is the dataset. Robotics has no equivalent. The OXE dataset is ~1M trajectories — orders of magnitude smaller than what LLM-style emergence would require. The community's three workarounds are:
- Teleoperation at scale. Cost is the binding constraint. A skilled teleoperator collects roughly 50–150 demos per hour per robot. At ~$30/hr fully loaded, 1M demos costs roughly $200k–$600k in operator time alone, plus many robot-years. Companies like Figure, PI, and 1X are all running private teleop farms in this range.
- Egocentric human video. Project Aria (Meta), Ego4D (3,670 hours), Ego-Exo4D (Meta + 15 universities, 2024) and HoloAssist all aim to harvest "what humans do with their hands" as a free supervisory signal. The catch: no action labels, no force feedback, no teleop calibration. Bridging the human-to-robot gap requires retargeting + sim or learned correspondence.
- Synthetic / generative. MimicGen-style replay, GR00T-Mimic, and video world models. The honest answer is: we don't yet know how much of the gap synthetic can close for contact-rich manipulation.
E.2 Teleop Economics
The economics matter for whitespace-spotting. If you assume:
- ~$50/hr loaded cost per teleoperator (incl. equipment)
- 100 demos/hr good demos
- Need ~10k demos per task to be reliable, ~100 tasks → 1M demos
- → ~$500k of pure teleop cost per "general-purpose" model, before failure/cleanup overhead
Scaling 10x gets prohibitive fast. This is why synthetic-data multipliers (50x–1000x from each real demo) and policy-conditioned scenario generation are existential.
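The arithmetic above is worth parametrizing, because the conclusion hinges on the synthetic-data multiplier. A small cost model under the list's assumptions:

```python
# Teleop cost model using the assumptions in the list above.
def teleop_cost(demos_needed, demos_per_hour=100, cost_per_hour=50.0,
                synthetic_multiplier=1):
    """Dollar cost of real teleop when each real demo is amplified
    synthetic_multiplier-fold (MimicGen-style) before training."""
    real_demos = demos_needed / synthetic_multiplier
    return real_demos / demos_per_hour * cost_per_hour

print(teleop_cost(1_000_000))                            # $500,000 all-real
print(teleop_cost(1_000_000, synthetic_multiplier=100))  # $5,000 at 100x
print(teleop_cost(10_000_000))                           # $5,000,000 at 10x scale
```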
E.3 Generalization Across Embodiments
OXE / RT-X showed positive transfer across similar embodiments (mostly 6–7 DoF arms with parallel grippers). Transfer to dexterous hands, mobile bases, humanoids is much weaker — the action spaces are too different. Cross-embodiment is partially solved for arms, mostly unsolved for legs and hands. The bet of Skild and PI is that with enough scale + the right tokenization, this works. Verdict pending.
E.4 Multi-Task vs Specialist
Empirically, in 2024–2025, specialist policies beat generalists on the specialist's task, but generalists win on novel tasks and on cost-of-ownership. This is the same pattern that played out in NLP from 2018–2022. The economics will likely favor a small number of generalist VLAs per industry vertical, fine-tuned per customer.
E.5 Evaluation Is Broken
This is the area where Applied Intuition–style tooling is most needed.
- No standard benchmark for generalist policies. Each lab reports on its own real-robot tasks, often with non-reproducible setups. RoboArena / RoboHive / SimplerEnv try to standardize evaluation, but the sim-real gap means sim numbers don't predict real performance well.
- Real robot eval is expensive. A serious eval is dozens of trials across many tasks with human raters. Companies do this internally but rarely publish.
- Long-horizon eval is even worse. No agreed metric for "how good was this 5-minute laundry-folding rollout?" Companies use partial-credit rubrics, time-to-completion, or task-success-at-step-N — all noisy (a rubric-scoring sketch follows this list).
- Coverage / scenario generation. Like AVs in 2018–2020, the field needs to move from "pass rate on a fixed test set" to "what's our coverage of meaningful scenarios?" This is exactly the AV-validation pattern Applied Intuition built its business on.
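As a concrete instance of the partial-credit idea, a long-horizon episode can be scored against a weighted rubric of sub-steps. The sub-steps and weights below are illustrative assumptions, not an established benchmark:

```python
# Sketch of rubric-based partial-credit scoring for a long-horizon rollout.
# Sub-steps and weights are illustrative, not an established standard.
RUBRIC = [  # (sub-step, weight)
    ("approached laundry basket", 0.1),
    ("grasped garment", 0.2),
    ("flattened garment on table", 0.3),
    ("completed fold", 0.3),
    ("stacked folded garment", 0.1),
]

def score_episode(completed_steps: set[str]) -> float:
    """Weighted fraction of rubric steps a human rater marked complete."""
    return sum(w for step, w in RUBRIC if step in completed_steps)

print(score_episode({"approached laundry basket", "grasped garment",
                     "flattened garment on table"}))  # 0.6 partial credit
```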
F. Implications for Applied Intuition / Data Intelligence
The thesis: the AV-data tooling pattern transfers to robotics, but is much less mature and more fragmented. Concrete opportunity areas:
F.1 Labeling Robot Demos
Teleop episodes need cleaning (failed demos, off-policy moments, gripper miscalibrations). Multi-modal alignment (video + force-torque + proprioception) is non-trivial. VLAs need hierarchical language labels (goal → sub-skill → primitive); most labs do this ad-hoc. Embodiment normalization — mapping multiple robots into a canonical action schema — remains rough (OXE's schema is a starting point but limited).
F.2 Scenario Coverage for Manipulation
AV analog: "tested left-turn-on-red in fog?" Robot analog: "tested grasp-from-cluttered-shelf with reflective objects under low light?" Requires scene parametrization (objects, layouts, lighting, distractors) plus a coverage metric over that space, tied to scenario generation — procedural (RoboCasa, GR00T-Gen) or generative (Cosmos / video world models).
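One simple way to make scenario coverage operational: discretize the scene-parameter space into cells and report the fraction exercised by the test set. The parametrization below is an illustrative assumption; real coverage models are richer (continuous densities, importance weighting):

```python
# Sketch of a scenario-coverage metric: discretize the scene-parameter space
# and report the fraction of cells hit. Parameter axes are illustrative.
from itertools import product

AXES = {
    "lighting": ["bright", "dim", "backlit"],
    "clutter": ["none", "light", "heavy"],
    "object_material": ["matte", "reflective", "transparent"],
    "grasp_height": ["floor", "table", "shelf"],
}

def coverage(tested_scenarios):
    """tested_scenarios: iterable of dicts mapping each axis to a value."""
    all_cells = set(product(*AXES.values()))
    hit = {tuple(s[a] for a in AXES) for s in tested_scenarios}
    return len(hit & all_cells) / len(all_cells)

tests = [{"lighting": "bright", "clutter": "none",
          "object_material": "matte", "grasp_height": "table"}]
print(f"{coverage(tests):.1%} of 81 cells covered")  # 1.2%
```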
F.3 Policy Validation in Sim
Run a candidate policy against a battery of scenarios and surface regressions. Catch sim-real gap with dual eval (same policy in sim + on a small real-robot fleet; learn the residual). Replay-based validation as a sanity check.
F.4 Data Curation
Active learning over teleop pools (which demos are most informative?), difficulty / diversity scoring (avoid yet-another-pick-from-table), de-duplication, drift detection, embodiment-specific filtering.
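Embedding-based near-duplicate filtering is the usual first pass for de-duplication. The sketch below greedily drops demos whose embedding sits within a cosine threshold of one already kept; the threshold and the embedding source are assumptions:

```python
# Sketch of greedy near-duplicate filtering over demo embeddings, e.g. from a
# pretrained video encoder. The 0.98 cosine threshold is an assumption.
import numpy as np

def dedup(embeddings: np.ndarray, threshold: float = 0.98) -> list[int]:
    """Return indices of demos to keep, greedily dropping near-duplicates."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, e in enumerate(normed):
        if all(e @ normed[j] < threshold for j in kept):
            kept.append(i)
    return kept

emb = np.random.randn(500, 512)  # stand-in for yet-another-pick-from-table demos
print(len(dedup(emb)), "of 500 demos kept")
```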
F.5 Synthetic Generation as a Product
MimicGen-style replay augmentation (10x–1000x per real demo), generative video augmentation (Cosmos / 1X-WM style) for visual robustness, procedural scene generation for sim-based RL. Most labs roll these tools themselves, badly. Productizing them with metadata, lineage, and policy-impact tracking is a clear gap.
F.6 Long-Horizon Eval Tooling
Rubric-based human eval pipelines (RLHF tooling, but for episodes), task graph decomposition with per-step success scoring, failure mining (cluster by mode — perception miss, slip, planning, recovery — and feed back into the curriculum).
Bottom line for Data Intelligence. The primitives that matured for AVs in 2017–2024 — scenario libraries, coverage metrics, synthetic generation tied to scenario gaps, sim-real transfer harnesses, data curation pipelines — are now in early demand for robotics. The field is roughly 5–7 years behind AVs in tooling maturity but moving faster because the foundation-model recipe transfers cleanly from LLMs. The dirty work of mapping kinematics, sensor configs, and action spaces across robots is currently per-paper; a canonical embodiment-translation layer + dataset registry is overdue.
Sources
Models & papers
- RT-1 — Robotics Transformer 1 (Google, 2022)
- RT-2 (Google DeepMind, 2023)
- Open X-Embodiment / RT-X paper (arXiv:2310.08864) · project page · GitHub
- OpenVLA (arXiv:2406.09246) · project page · HF weights
- Octo (Berkeley, 2024)
- π0 paper (arXiv:2410.24164) · PI blog
- π0-FAST research page
- π0.5 paper (arXiv:2504.16054) · PI blog
- openpi GitHub (PI open weights)
- Figure Helix · Figure Helix 02
- 1X World Model
- Gemini Robotics announcement (DeepMind, March 2025)
- Gemini Robotics On-Device (June 2025)
- Gemini Robotics 1.5 (Sept 2025)
- Gemini Robotics-ER 1.6
- Apptronik + Google DeepMind
- Covariant RFM-1 · Amazon-Covariant deal (TechCrunch)
- Skild AI homepage · Skild $14B raise (TechCrunch, Jan 2026)
- Tesla AI
- Boston Dynamics Atlas · BD–TRI partnership (TechCrunch, Oct 2024)
- Agility Robotics – GXO RaaS announcement (June 27, 2024)
- Sanctuary AI – Magna pilot (April 2024)
Datasets
- DROID paper (arXiv:2403.12945) · project
- BridgeData V2
- MimicGen
- BEHAVIOR-1K / Stanford
- RoboCasa
- Ego4D / Ego-Exo4D (Meta)
Sim & generative stack
- Nvidia Project GR00T / GR00T N1 announcement (March 2025) · GR00T N1 paper (arXiv:2503.14734) · GR00T N1.5 page
- Nvidia Cosmos announcement (Jan 2025) · product page
- Nvidia Isaac Lab / Sim
- DeepMind Genie 3 (Aug 2025)
- DeepMind Genie 2 (Dec 2024)
Funding / business news
- Figure Series C $1B+ at $39B (Sept 2025) · Figure BotQ
- BMW – Figure Spartanburg pilot results (BMW press, Nov 2025) · WardsAuto coverage
- Apptronik $350M Series A (TechCrunch, Feb 2025) · Apptronik $520M extension (CNBC, Feb 2026)
- Physical Intelligence $600M / $5.6B (Bloomberg, Nov 2025)
- Mercedes-Benz – Apptronik commercial agreement (March 2024)
- XPENG AI Day 2025 — next-gen Iron + VLA 2.0
- Unitree G1 launch (The Robot Report)