Physical AI
Research·Doc 04·~120 min

Labeling, Data Curation, and the Data Flywheel for Physical AI

Scope: the slice of Physical AI most directly relevant to Applied Intuition's Data Intelligence team — how labels get made, how datasets get curated, and how the resulting "data engine" becomes a competitive moat.

Last verified: 2026-05-08 — Funding rounds, headcounts, and recent industry events have been cross-checked against live web sources (TechCrunch, Bloomberg/CNBC, Fortune, official company press, arXiv). Items marked [unverified] could not be confirmed from public reporting and should be treated as analytical conjecture pending primary-source confirmation.


A. Labeling-platform players & status

The labeling market segmented hard between 2023 and 2025. Three camps emerged: (1) frontier-LLM data shops doing RLHF and expert reasoning data (Scale, Surge, Invisible, Mercor, Turing); (2) vision/AV labeling platforms with strong tooling for sensor data (Labelbox, Encord, Voxel51, V7, SuperAnnotate, Roboflow, Segments, Kognic); and (3) classical BPO-style labor providers (Sama, iMerit, Appen, CloudFactory). The lines blur — most of camp (2) now offers RLHF, and camp (1) increasingly competes for multimodal/agent training data — but the moats differ.

Scale AI — pivoted to defense and government after the Meta deal

Scale AI was the dominant frontier-data vendor through 2024. In June 2025, Meta invested $14.3 billion for a ~49% non-voting stake in Scale at a ~$29B post-money valuation, and Alexandr Wang left the CEO role to lead Meta Superintelligence Labs, alongside several Scale researchers (CNBC, June 12 2025, Fortune, June 13 2025, TechCrunch, Scale's own announcement). Jason Droege (CSO, ex-Uber Eats founder) became interim CEO.

The deal was structurally similar to Microsoft–Inflection and Amazon–Adept ("acqui-license-hire"): Meta got Wang + a chunk of Scale's research talent + a giant data contract; Scale stayed independent on paper. The fallout was immediate: OpenAI, Google, xAI, and Microsoft wound down or paused work with Scale, citing the conflict of interest of routing proprietary training-data demand through a Meta-affiliated vendor (TechCrunch — OpenAI drops Scale, Nasdaq — Why Google is fleeing Scale, TIME — How Meta's deal upended the data industry). Scale laid off ~200 employees in its data-labeling business in July 2025 (TechRepublic), and in a November CFO interview the company found itself insisting it was "not a zombie company" (CNBC, Nov 2025). That accelerated the rise of Surge, Mercor, Invisible, and Turing on the RLHF side.

Scale's product lines as of late 2025:

  • Scale Nucleus — dataset management / curation tool for vision and AV; effectively the in-house competitor to FiftyOne and Encord Active. Underrated piece of the company.
  • Scale Rapid — self-serve labeling for smaller teams.
  • Scale Donovan — defense/intelligence LLM-with-data product.
  • Defense growth narrative — Scale was awarded the DIU Thunderforge prime prototype contract in March 2025 (AI for joint warfighting / theater planning, deployed first to INDOPACOM and EUCOM, partnering with Anduril Lattice and Microsoft) (CNBC, Breaking Defense, Scale blog). Reports also cite a 5-year, $100M CDAO OTA agreement and a $99M Army R&D contract (Scale–DoD Army R&D announcement). Defense and federal are now the public growth narrative.
  • AV / robotics labeling services — still operates, still serves OEMs and AV stacks, but no longer the strategic center. (Specific 2025 customer mix is [unverified]; the AV business is rarely broken out post-Meta deal.)

TL;DR: Scale is no longer the unambiguous market leader in foundation-model data; the commercial frontier-LLM business has been hollowed out, and defense/government plus Nucleus and AV labeling are the remaining pillars.

Surge AI — the new RLHF leader

Surge was founded in 2020 by Edwin Chen (ex-Google/Facebook/Twitter ML), bootstrapped through mid-2025 with no announced VC round. They became the de-facto standard for expert-tier RLHF data at Anthropic, OpenAI, and (increasingly) Google after Scale's Meta-induced conflict. Reporting in 2025 put Surge revenue at ~$1.2B, with ~110 employees, making them larger than Scale on the frontier-data line (Inc — bootstrapped to $1B revenue, Yahoo Finance / Bloomberg). In July 2025 Surge began its first external fundraise — reportedly targeting ~$1B in primary + secondary capital at a $15B+ valuation (The AI Insider cited a $25B target) (The AI Insider, July 2025, PYMNTS). The final close terms are [unverified] as of May 2026.

Strengths: SME network (PhDs, doctors, lawyers, code reviewers), faster turnaround, less leakage, no Big-Tech investor on the cap table. Weakness: thin tooling — Surge is a service business with a workforce platform, not a data lifecycle product. Not a direct Applied Intuition competitor.

Mercor — the post-Scale RLHF challenger

Mercor is the other post-Scale winner. Founded by Brendan Foody, Adarsh Hiremath, and Surya Midha (Thiel Fellows), it raised a $100M Series B in February 2025 (Felicis-led, $2B post-money) and a $350M Series C in October 2025 at ~$10B (Felicis again, with Benchmark and General Catalyst) (CNBC, TechCrunch). Sacra estimates Mercor's annualized revenue went from $35M in late 2024 to ~$500M run-rate by late 2025, hitting ~$1B annualized in early 2026 (Sacra). Mercor's pitch is a vetted-expert marketplace (>300k credentialed professionals — doctors, lawyers, engineers, scientists) supplying RLHF/SFT/RL-environment data to OpenAI, Anthropic, Meta, and Google DeepMind. Less of an AV/robotics player than a frontier-LLM data shop, but the model — vetted SMEs, RL-environment curation — is being copied across the industry.

Turing and Invisible — the "alternative neutrals"

  • Turing raised $111M at a $2.2B valuation in March 2025, pivoting from its developer-recruiting roots into LLM training services and "RL environments" for advanced agents (SiliconANGLE).
  • Invisible Technologies raised $100M at $2B+ in September 2025, leaning on its "Expert Marketplace" plus an existing BPO-style operations business (SiliconANGLE). Invisible reports work with >80% of the leading frontier labs and ~$134M revenue in 2024.

The whole "neutral data provider" tier exists because of the Scale–Meta deal; expect consolidation by 2027.

Labelbox — platform play, pivoted to "Frontier"

Labelbox raised its Series D in January 2022 (last announced round; total funding ~$189M per PitchBook) and quietly retrenched after that. In 2024–2025 they relaunched around Labelbox Frontier, an SME-network + tooling product for LLM/VLM eval and RLHF, plus continued vision tooling. In February 2026 Labelbox acquired Upcraft to expand its expert-network capacity for frontier AI (Labelbox press, Feb 2026 — verify on press page). They serve a broad mid-market and have stayed a generalist. Customer details (Walmart, P&G) are [unverified] from public reporting. Less of an AV story than Encord or Voxel51.

Encord — multimodal curation + active learning

Encord (London, founded 2020 by Eric Landau and Ulrik Stig Hansen) raised a $30M Series B in August 2024 led by Next47 (TechCrunch, Encord blog), then a $60M Series C led by Wellington Management in February 2026 at a ~$550M post-money valuation (Encord blog, TFN). The Series C narrative was explicitly "physical AI data infrastructure" — Encord reports data under management grew from 1 PB to 5 PB and physical-AI revenue grew 10× year over year. Three-product structure:

  • Annotate — labeling tool (good 3D/sensor support added in 2024).
  • Active — data curation, embedding-driven search, error analysis. Closest commercial analog to FiftyOne Enterprise.
  • Index / Apollo — multimodal foundation-model–assisted curation.

Encord is the most "Applied-Intuition-relevant" company in this list other than Voxel51: same buyer, similar pitch, and explicit physical-AI positioning since the Series C. Specific customer names beyond what's on their site are [unverified].

Voxel51 — open-source dataset visualization

Voxel51 (Ann Arbor, founded by Brian Moore and Jason Corso ex-University of Michigan) maintains FiftyOne — the open-source dataset visualization and curation library that became the lingua franca for vision data engineers between 2022 and 2025. They closed a $30M Series B in May 2024 led by Bessemer Venture Partners (Voxel51 press, TechCrunch / SiliconANGLE, VentureBeat), bringing total funding to ~$45M. (No Series C has been announced as of May 2026.) The funding is small relative to peers but FiftyOne's organic adoption is the moat — the GitHub repo has ~10k+ stars (verify current count) and is integrated into pipelines at Bosch, Berkeley DeepDrive, LG, plus many AV teams. Specific customer rosters beyond what Voxel51 publicizes are [unverified].

FiftyOne Enterprise is the commercial product: hosted, RBAC, plugin marketplace, integration with annotation vendors. It is the de-facto curation layer customers reach for first because there's no commercial blocker to start. This is the company most directly in Applied Intuition's lane on Data Intelligence.

V7 — labeling + agentic automation

V7 Labs (London) raised a $33M Series A in November 2022 (co-led by Radical Ventures and Temasek; the "Series B" framing in some reports appears to be a labeling discrepancy) and pivoted hard in 2024 toward V7 Go, an "agentic" document/data automation tool aimed at private-equity / finance / legal / insurance (Fortune, April 2024, V7 Go blog). The original V7 Darwin image/video labeling product still ships and has medical-imaging traction. The pivot makes V7 less of a pure labeling player and more of a vertical-agent company.

Roboflow — vision-labeling-meets-deployment for the long tail

Roboflow is Y-Combinator-grown. They closed a $40M Series B led by GV (Google Ventures) in November 2024 with Craft Ventures and YC participating (Fortune, Nov 2024, Roboflow blog), bringing total funding to ~$63M. (Earlier "$200M valuation" / "$300M+ valuation" claims in the original draft are not borne out by public reporting; valuations on the Series B were not officially disclosed.) Their wedge is the long-tail of vision — small teams, hobbyists, SMBs, defense — wanting end-to-end (label → train → deploy → host inference). They host Roboflow Universe (very large public-dataset repository — exact count [unverified]) and the popular supervision library. Less strategic for AV/AI Labs but huge on developer mindshare.

SuperAnnotate

SuperAnnotate (Armenia/SF) closed a $36M Series B in November 2024 led by Socium Ventures with NVIDIA, Databricks Ventures, and Glynn Capital participating (SuperAnnotate blog, Crunchbase), then a $13.5M Series B extension led by Dell Technologies Capital in July 2025 that brought the round total to $50M (SiliconANGLE). The company reported 5× software-revenue growth in 2024. Strong on multimodal LLM training data and structured labeling for enterprise; named customers include Databricks, Canva, Motorola Solutions.

CVAT — the open-source default

CVAT was open-sourced by Intel in 2018, spun out as CVAT.ai Corporation in 2022. Still the most-deployed self-hosted annotation tool in the world. Many of the labeling vendors above secretly run CVAT under the hood for some workflows.

Label Studio / HumanSignal (formerly Heartex)

Label Studio (open source) is the multimodal counterpart to CVAT — supports text, audio, time series, video, image, plus arbitrary plugin-defined types. The company rebranded from Heartex to HumanSignal in mid-2023 (HumanSignal announcement). They raised a $25M Series A in May 2022 led by Redpoint (HumanSignal press), with no major round announced since. In November 2024 HumanSignal acquired Erud AI to expand into multimodal data services and "Frontier Data Labs" novel-data collection. Strong adoption in NLP and audio teams; weaker than CVAT for dense 3D point cloud work.

Snorkel AI — programmatic labeling / weak supervision

Snorkel AI commercializes the Stanford weak-supervision research (Ré, Ratner et al.). Their pitch — write labeling functions (heuristics, distant supervision, model votes) and let Snorkel resolve them into probabilistic labels — became more important, not less, after foundation models, because Snorkel + GPT-4/Claude-class judge models is now a credible auto-labeling stack. They raised an $85M Series C in 2021 at ~$1B (Snorkel blog) and a $100M Series D in May 2025 at $1.3B led by Addition with Greylock and Lightspeed participating (BusinessWire) — total funding ~$238M. Strong in financial services, government, healthcare; the Series D pitch leaned into expert data and SME-augmented programmatic labeling, mirroring the broader market. Worth understanding deeply for Applied Intuition: programmatic labels are the alternative to "more humans."

Segments.ai — 3D AV labeling

Segments.ai (Belgium) is small but technically excellent at 3D point-cloud and multi-sensor labeling. Their tool was one of the first to do click-to-segment on lidar with a SAM-style approach. Punches above its weight on AV deals.

Deepen AI

Deepen AI (Mountain View) is a sensor-focused labeling and calibration vendor. Their differentiator is multi-sensor calibration toolkit + 3D labeling + safety case tooling. ASAM OpenLABEL contributors. Real AV customer base.

Kognic — AV-focused, Sweden

Kognic (Gothenburg) raised a $24M Series A co-led by Metaplanet and NordicNinja in August 2024 (Kognic blog), bringing total raised to ~$40M per Tracxn. (The "$37M Series B" claim in earlier industry literature appears to conflate Kognic's earlier raises under its prior brand "Annotell"; the most recent verified round is the 2024 Series A.) Built specifically for fused multi-sensor AV labels (camera + lidar + radar). Volvo, Zenseact are widely associated AV customers; Kodiak Robotics publicly chose Kognic in December 2024 for autonomous-trucking annotation (Kognic blog). In 2025 Kognic expanded into Language Grounding (vision-language annotation for VLM/VLA models) — a notable convergence with the foundation-model-as-labeler trend. The closest direct AV competitor to Applied Intuition's labeling pipeline if AI takes that on internally.

Understand.ai — acquired by dSPACE (not SAE)

Understand.ai (Karlsruhe) was acquired by dSPACE (the HiL/test-bench company) in 2019. Now part of dSPACE's data-and-validation toolchain. It is sometimes misremembered as an SAE acquisition; dSPACE is the actual acquirer.

Mighty AI — Uber, 2019

Mighty AI was acquired by Uber ATG in 2019 for ~$70–80M. Folded into Uber's self-driving group; ATG was later sold to Aurora. The Mighty AI team and tech were largely absorbed.

Sama, iMerit, Appen — BPO-style labor

  • Sama — Kenya/India workforce, ethical-labor positioning, ~$70M Series B in 2022 (Sama Wikipedia). Sama exited content moderation entirely in 2023, refocusing on computer-vision data annotation, after a string of high-profile cases (Meta moderator lawsuit in Kenyan courts, OpenAI/ChatGPT toxic-content labeling exposé). In 2026 Meta terminated its Sama contract following disclosure that Sama employees viewed private content from Ray-Ban Meta AI glasses, leading to >1,000 layoffs in Kenya (Cambridge Forum on AI Law / IHRB).
  • iMerit — India/US, total disclosed funding $24M from Omidyar, Khosla, Dell.org, BII per Tracxn. (The "$80M total" claim in earlier drafts is not borne out by Crunchbase / PitchBook / Tracxn data and is [unverified].) Acquired Ango (Ankara) in October 2023 to bring the Ango Hub annotation tool in-house. Strong in medical imaging (launched the ANCOR Annotation Copilot for Radiology in December 2024) and AV (opened a 12th global center, an Automotive AI Center of Excellence, in Coimbatore in 2024). FY March-2025 revenue ₹278 Cr ($33M).
  • Appen — publicly traded (ASX:APX), historically a 1M+ crowdworker network, was the giant of the 2010s. Google terminated its $82.8M Appen contract in January 2024, set to expire March 19 2024. The contract had represented ~⅓ of Appen's revenue (CNBC, Search Engine Journal). Stock dropped ~40% on the news; cumulative drawdown from peak (Aug 2020 ~AU$42) is >99%. Classic disrupted incumbent; pivoting to LLM data services to survive.

Quick comparison table

| Company | Best at | Typical buyer | Recent status (2024–26) |
|---|---|---|---|
| Scale AI | Defense LLM data, Nucleus, AV labeling | DoD, AV OEMs (post-Meta) | Meta $14.3B / 49% non-voting June 2025; Wang→Meta SL; commercial frontier-LLM business hollowed out; Thunderforge + CDAO contracts |
| Surge AI | Expert-tier RLHF | Anthropic, OpenAI, Google | ~$1.2B revenue bootstrapped; $15B+ first fundraise initiated July 2025 |
| Mercor | RLHF + vetted-expert RL data | OpenAI, Anthropic, Meta, GDM | $350M Series C Oct 2025 at $10B; ~$1B annualized rev early 2026 |
| Turing | LLM training + RL environments | Frontier labs | $111M Mar 2025 at $2.2B |
| Invisible | Expert marketplace + ops | >80% of leading labs | $100M Sep 2025 at $2B+ |
| Labelbox | General-purpose platform | Mid-market enterprise | "Frontier" RLHF; acquired Upcraft Feb 2026 |
| Encord | Multimodal curation, active learning | Vision/medical/AV/physical-AI | $30M Series B 2024; $60M Series C Feb 2026 ($550M val); explicit physical-AI positioning |
| Voxel51 | Open-source dataset infra | Vision/AV ML eng | $30M Series B (May 2024, Bessemer-led); FiftyOne is the OSS standard |
| V7 | Medical imaging, agentic docs | Pharma + enterprise ops | Pivoted to V7 Go (agents) |
| Roboflow | Long-tail vision E2E | SMB, hobbyist, defense | $40M Series B Nov 2024 (GV-led); Universe + supervision lib |
| SuperAnnotate | Enterprise multimodal | Databricks, Canva, Motorola | $36M Series B Nov 2024 + $13.5M Dell extension Jul 2025 = $50M B total |
| CVAT | Open-source self-host | Anyone with a server | Default OSS image/video tool |
| Label Studio / HumanSignal | OSS multimodal | NLP/audio teams | Heartex→HumanSignal 2023; acquired Erud AI Nov 2024 |
| Snorkel AI | Weak supervision + LLM judges | FSI, gov, healthcare | $100M Series D May 2025 at $1.3B; "programmatic + LLM + SME" thesis vindicated |
| Segments.ai | 3D AV labeling | AV stacks | Niche but well-built (no recent funding round verified) |
| Deepen AI | Sensor calibration + label | AV OEMs | OpenLABEL contributor |
| Kognic | Fused AV labels + VLA grounding | Volvo, Zenseact, Kodiak | $24M Series A Aug 2024; ~$40M total |
| Understand.ai | AV labeling | Tier-1s via dSPACE | Owned by dSPACE since 2019 |
| Mighty AI | (historical) | (Uber ATG) | Uber ~$70–80M acquisition June 2019; tech later moved to Aurora via Uber ATG sale (Dec 2020) |
| Sama / iMerit / Appen | Labor at scale | Anyone | Sama exited content moderation 2023; iMerit acquired Ango 2023; Appen lost Google contract Jan 2024, stock −99% from peak |

B. Auto-labeling & model-assisted annotation

The 2023–2025 shift is that most labels are now machine-generated and human-verified, not human-drawn. The right framing for Applied Intuition is: human labeling is the backstop, auto-labeling is the front line, and the value is in the loop that converts disagreements into model improvements.

Tesla's auto-labeling pipeline

What Tesla has shown publicly (CVPR WAD 2021/2022, AI Day 2021/2022, AI Day "occupancy networks" segment, Karpathy talks):

  • Offline 4D auto-labeling — replay an event multi-pass, jointly solve detections + tracks + ego pose with non-causal future context, vastly higher accuracy than online perception. The output of this is the training label for the online model.
  • Multi-trip aggregation — many vehicles cross the same intersection; aggregate detections across trips, build a persistent map of static structure, use that as priors. Karpathy framed the label as "computed from the future."
  • Occupancy networks — replaced earlier voxel/cuboid representations; predict 3D occupancy + flow from 8 cameras, with offline labels from a fused-sensor reconstruction. The architecture (positional image encoding → attention → fused 4D occupancy volume → deconv to occupancy + flow) was detailed at AI Day 2022 (Think Autonomous deep dive).
  • Auto-labeling cluster scale — Tesla disclosed at AI Day 2022 that the auto-labeling pipeline replaces "~5M hours of manual labeling with ~12 hours on a cluster", trains the occupancy network on ~1.4B frames, and runs in-house on ~14k GPUs total (~4k for auto-labeling, ~10k for training). (These figures are from Tesla's own AI Day disclosure — useful directionally but not independently audited.)
  • Fleet-mining triggers — small on-vehicle classifiers detect rare/interesting events ("bicycle on a highway," "stopped emergency vehicle") and surface those clips for offline re-labeling. This is the single most valuable piece for the Data Intelligence pitch.
  • Label distillation — the offline 4D pipeline runs much heavier models (point-cloud transformers, multi-frame context); the online car runs a distilled student. Most production AV labels in 2025 are "model-as-teacher" outputs.

References: Tesla AI Day 2022 — occupancy networks, Karpathy CVPR WAD 2021 keynote, Awesome-Occupancy-Network paper list.

Waymo's auto-labeling research

Waymo publishes more than Tesla. Key papers to know:

  • Auto4D: Learning to Label 4D Objects from Sequential Point Clouds (Yang et al., 2021) — arXiv:2101.06586. Strictly an Uber ATG paper rather than a Waymo one, but it is the canonical academic reference for automatic 4D box auto-labeling.
  • Offboard 3D Object Detection from Point Cloud Sequences (Qi et al., CVPR 2021) — arXiv:2103.05073. Uses non-causal context to produce high-quality 3D boxes used as labels.
  • MultiPath++ (Varadarajan et al., 2021) — arXiv:2111.14973. Behavior prediction; relevant because behavior labels (intent, future trajectory) are themselves auto-extracted from logs.
  • MotionLM, EmerNet, Wayformer — Waymo's 2023–24 work pushing toward foundation models for driving, where the model itself becomes the label engine.
  • EMMA (Hwang et al., 2024) — end-to-end multimodal model for autonomous driving built on Gemini, arXiv:2410.23262. Implicitly: VLMs as drivers and as labelers.

Mobileye REM — crowdsourced HD-map flywheel

Mobileye's Road Experience Management (REM) is the cleanest large-scale data flywheel in AV. Mobileye disclosures cite ~65M+ Mobileye-equipped vehicles total (with REM-enabled crowdsourcing on a subset, often cited around 40M), each uploading ~10 KB/km of compressed semantic landmarks (lane edges, signs, traffic lights) to the cloud, aggregated into a self-healing HD map. Mobileye's CES 2021 disclosures cited ~25 km of data per day per active vehicle, totaling ~4B km in 2021 with ~9B km projected for 2022 and a stated long-term ambition to reach ~1B km/day globally (Mobileye REM page, Mobileye Wikipedia). The 2024 actual run-rate is [unverified] — Mobileye stopped reporting specific km-per-day numbers in recent earnings. Bandwidth is the trick — semantic compression instead of raw images. (The original draft's "~9M km of road per day" figure was an order-of-magnitude underestimate of Mobileye's claimed coverage and has been corrected.)

This is the "Applied Intuition can't easily replicate this" example to keep in mind: the moat is having the cars on the road.

Foundation models as labelers

This is the most active research area for Data Intelligence work. The tools you actually use in production circa 2025:

  • SAM and SAM 2 (Meta) — promptable segmentation. SAM 2 was released July 29, 2024 under Apache 2.0 with a unified image+video architecture, an 8× speedup for video annotations, and the SA-V dataset as the largest open video-segmentation corpus (Meta AI, Meta paper, GitHub). Meta also released SAM 2.1 (2024) and previewed SAM 3 (open-vocabulary, late 2025 — see Encord write-up and Edge AI Vision SAM3 coverage — the SAM3 release status as of May 2026 should be reverified before quoting). The default zero-shot mask labeler.
  • Grounding DINO — open-vocabulary detection from text prompts. Combined with SAM as Grounded-SAM 2, you get text → boxes → masks → temporal propagation in one pipeline.
  • OWL-ViT / OWLv2 — Google's open-vocabulary detector; faster than Grounding DINO at modest accuracy cost.
  • Florence-2 (Microsoft, released June 2024 under MIT) — unified vision foundation model in 232M and 771M sizes, doing captioning, detection, grounding, segmentation, and OCR via a single prompt-conditioned encoder-decoder; trained on FLD-5B (5.4B annotations, 126M images). Performs on par with much larger VLMs (VentureBeat coverage, Microsoft Research paper).
  • GPT-4o / GPT-5 / Gemini 1.5–2.5 / Claude 3.5–4 — VQA-style annotation: "is the pedestrian intending to cross? is the road wet? what is the truck carrying?" Behavior and semantic labels that used to require humans. The 2025 zero-shot quality on AV-style behavior labels is the practical reason a lot of labeling-team headcount is now flat or shrinking.
  • CLIP and successors (SigLIP / SigLIP-2, EVA-CLIP, DFN-CLIP) — embedding-based retrieval and weak labeling. Often the first auto-labeler in any pipeline.
  • DINOv2 and DINOv3 — self-supervised vision features that work especially well for embedding-based mining.
  • Cap3D (Luo et al., 2023) — automatic captioning of 3D assets; archetype of "language labels for non-language data."
  • OpenVLA / RT-2 / VoxPoser / VLMaps / Octo — VLA models that label robot demonstrations with language goals.
  • Whisper / SeamlessM4T — for any audio in the data lake (sirens, voice instructions, ambient).
  • NVIDIA SAL ("Better Call SAL") — recent ECCV 2024 work on lifting SAM into lidar segmentation directly (NVIDIA SAL project). Useful pointer for AV-specific zero-shot 3D segmentation.

The pattern: prompt the foundation model → produce a candidate label → route a small fraction to humans for verification/correction → fine-tune downstream models on the verified set + use the foundation model's confidence as a curation signal.
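
A minimal sketch of that loop, with `auto_label` and `human_review` as hypothetical stand-ins for whatever model endpoint and review tooling a real pipeline wires in:

```python
# Sketch of the foundation-model-as-labeler pattern described above.
# `auto_label`, `human_review`, and the thresholds are illustrative assumptions.
import random

CONF_THRESHOLD = 0.85   # below this, route to a human
SPOT_CHECK_RATE = 0.05  # random audit of "confident" auto-labels

def curate_batch(clips, auto_label, human_review):
    """Auto-label a batch, route low-confidence items to humans,
    and return (verified_labels, curation_signals)."""
    verified, signals = [], []
    for clip in clips:
        label, confidence = auto_label(clip)   # foundation-model pass
        if confidence < CONF_THRESHOLD or random.random() < SPOT_CHECK_RATE:
            label = human_review(clip, proposed=label)   # verify / correct
        verified.append((clip, label))
        # Low confidence is itself a curation signal: these clips are the
        # interesting ones to mine for the next training round.
        signals.append((clip, confidence))
    return verified, signals
```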

NVIDIA NeMo Curator and Cosmos

  • NVIDIA NeMo Curator is an open-source GPU-accelerated data-curation library originally for text LLM pretraining (dedup, quality filtering, PII removal, language ID). In 2024–25 it grew to handle multimodal: video shot detection, captioning, embedding, dedup at petabyte scale.
  • NVIDIA Cosmos (announced at CES on January 6, 2025 — NVIDIA blog, press release, research paper) is NVIDIA's "world foundation model" platform for physical AI. Components shipped or announced:
    • Cosmos Predict (Predict-1, Predict-2, Predict-2.5) — generative world models that produce future video states from text/image/video/sensor inputs (cosmos-predict2 GitHub, cosmos-predict2.5 GitHub, Predict-2 NVIDIA blog).
    • Cosmos Reason — open, customizable reasoning model with chain-of-thought spatiotemporal awareness, used to curate synthetic data and judge physical-AI rollouts (Cosmos Reason curating-synthetic blog) — released in early access at GTC March 18 2025 (NVIDIA press).
    • Cosmos Transfer / Tokenizer / Curate / Guardrails — supporting modules for video tokenization, post-training, curation, and safety. Early adopters cited: 1X, Agility Robotics, Figure AI, Foretellix, Skild AI, Waabi, XPENG, Uber. If you join Applied Intuition, expect to integrate with (and probably also compete against) this stack — Foretellix in particular overlaps with Applied's scenario business.

Active learning workflows

The simplest, most effective active-learning loops in 2025 (loops 1 and 2 are sketched in code below):

  1. Uncertainty sampling — train v0, run on the unlabeled pool, label the lowest-confidence examples first.
  2. Disagreement — run v0 and a foundation-model auto-labeler (e.g., Grounded-SAM); label everything they disagree on.
  3. Embedding-distance to errors — find every unlabeled clip nearest to known model failures in CLIP/DINO embedding space.
  4. Coreset / k-center — pick examples that maximize diversity of the labeled pool. Useful early.
  5. Counterfactual / perturbation — find clips where small input perturbations flip predictions.

References: BAAL library, modAL, Encord Active docs, FiftyOne Brain "compute_mistakenness".
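
A minimal sketch of loops 1 and 2, assuming numpy arrays of class probabilities from the current model and hard predictions from a foundation-model auto-labeler (names are illustrative, not any particular library's API):

```python
import numpy as np

def uncertainty_sample(probs: np.ndarray, budget: int) -> np.ndarray:
    """Loop 1: pick the `budget` pool examples the v0 model is least sure about.
    `probs` is (n_examples, n_classes) softmax output on the unlabeled pool."""
    sorted_p = np.sort(probs, axis=1)
    margin = sorted_p[:, -1] - sorted_p[:, -2]   # top-1 minus top-2 confidence
    return np.argsort(margin)[:budget]           # smallest margins first

def disagreement_sample(v0_preds: np.ndarray, auto_preds: np.ndarray) -> np.ndarray:
    """Loop 2: label everything where v0 and the auto-labeler
    (e.g., Grounded-SAM class outputs) disagree."""
    return np.nonzero(v0_preds != auto_preds)[0]
```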

Self-training, pseudo-labels, cross-modal label propagation

  • Self-training / noisy student — train, predict on unlabeled, keep high-confidence predictions as labels, retrain. The Noisy Student Training paper (Xie et al., 2019, arXiv:1911.04252) is still the cleanest reference.
  • Cross-modal propagation — label something in the easy modality (camera mask via SAM) and project it through extrinsics/intrinsics into the hard modality (lidar segmentation); see the sketch after this list. The dominant technique for "free" lidar segmentation labels in AV. References: Lidar SegmentAnything, Waymo/nuScenes 2D-to-3D label transfer papers.
  • Temporal propagation — label one frame, propagate by tracker (e.g., SAM 2) to N frames. Cuts video labeling cost ~10×.
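
A minimal sketch of the cross-modal projection step, assuming calibrated intrinsics K and a lidar-to-camera extrinsic T of the kind nuScenes or the Waymo Open Dataset provide:

```python
import numpy as np

def propagate_mask_to_lidar(points, mask, K, T):
    """points: (N, 3) lidar xyz; mask: (H, W) bool from e.g. SAM;
    K: (3, 3) camera intrinsics; T: (4, 4) lidar->camera extrinsic.
    Returns a (N,) bool array: which lidar points inherit the 2D mask label."""
    pts_h = np.hstack([points, np.ones((len(points), 1))])   # homogeneous coords
    cam = (T @ pts_h.T).T[:, :3]                             # into camera frame
    in_front = cam[:, 2] > 0.1                               # drop points behind camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                              # perspective divide
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    H, W = mask.shape
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    labels = np.zeros(len(points), dtype=bool)
    labels[valid] = mask[v[valid], u[valid]]                 # inherit mask class
    return labels
```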

C. Data curation, mining, and the "data engine"

Karpathy's "data engine"

Karpathy popularized the framing during his Tesla tenure (CVPR WAD 2021, "Software 2.0" essays). Three pieces, repeated forever:

  1. Deploy model into the fleet/sim/whatever.
  2. Mine the failures — automatically or via shadow-mode comparison.
  3. Re-label and retrain with the new examples included.

The insight that mattered wasn't the loop — everyone knew the loop. It was that the infrastructure to mine and curate the right 1% of incoming data is harder and more valuable than the infrastructure to train the model. Quote-paraphrase: "the dataset is the model." See his "Software 2.0" post and the Tesla CVPR talks.

Long-tail mining: error mining, hard-example mining, scenario discovery

Practical techniques, ordered by what teams actually run:

  • Shadow-mode A/B mining — run new model in shadow against deployed model, surface clips where they disagree above threshold.
  • Trigger-based mining — small, cheap classifiers ("emergency vehicle present", "construction zone", "child near road") run on-device or on ingest; flag clips for offline review. Tesla, Cruise, Waymo all do this.
  • Scenario taxonomies / ODD slices — score every clip against an Operational Design Domain taxonomy (weather, road type, density, ego maneuver, other-actor maneuver). Mining is "find the under-represented cells." See ASAM OpenSCENARIO DSL / OpenODD standards.
  • Embedding-space neighbors of known failures — given one labeled failure, find K-NN in CLIP/DINO/Hyperbolic-CLIP space. The most useful 5-line tool you will write (sketched after this list).
  • CLIP-text mining — "find me clips that look like 'a child running between two parked cars at dusk'." Surprisingly effective; productized in Voxel51 FiftyOne Brain, Nucleus, Encord Active.
  • Counterfactual mining via simulation — take a real clip, perturb it in sim (NVIDIA DRIVE Sim, Applied Intuition's own simulator), find what breaks the model. Closes the loop with synthetic data.
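
The "embedding-space neighbors of known failures" tool from the list above, in roughly the promised five lines of FAISS (assumes embeddings are already computed as float32 arrays):

```python
import faiss
import numpy as np

def neighbors_of_failure(emb_pool: np.ndarray, failure_emb: np.ndarray, k: int = 100):
    """emb_pool: (n_clips, d) float32; failure_emb: (1, d) float32.
    Returns indices of the k pool clips nearest the known failure."""
    faiss.normalize_L2(emb_pool)                  # cosine == inner product after this
    faiss.normalize_L2(failure_emb)
    index = faiss.IndexFlatIP(emb_pool.shape[1])
    index.add(emb_pool)
    scores, idx = index.search(failure_emb, k)
    return idx[0]
```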

Embedding-based search across petabytes

The standard architecture:

  1. Compute a small embedding (e.g., SigLIP or DINOv2 ViT-L patch features, pooled) per clip / per frame / per object track.
  2. Build an ANN index (FAISS, ScaNN, Vespa, Qdrant, Milvus) over billions of vectors.
  3. Expose two query modes — vector-by-example and text-by-CLIP. Optionally per-region queries.
  4. Layer metadata filters (sensor, geo, time, ODD bucket) for "find me X but only on rainy nights in San Jose."

This is what Nucleus, FiftyOne Brain, and Encord Active all are under the hood.
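
A sketch of query mode 2 (text-by-CLIP) layered with a metadata filter, assuming the FAISS index from the previous sketch was built over image embeddings from the same CLIP model, and that `meta` is a hypothetical per-clip metadata list:

```python
import torch
import open_clip

# Build the text embedding with the same CLIP model used for the image index.
model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

query = "a child running between two parked cars at dusk"
with torch.no_grad():
    text_emb = model.encode_text(tokenizer([query]))
text_emb = (text_emb / text_emb.norm(dim=-1, keepdim=True)).numpy().astype("float32")

# Over-fetch from the ANN index, then apply the metadata filter in post
# (`index` and `meta` are assumed from the surrounding pipeline).
_, nbrs = index.search(text_emb, 500)
hits = [i for i in nbrs[0] if meta[i]["weather"] == "rain" and meta[i]["geo"] == "San Jose"]
```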

Deduplication and balancing

Naïve dedup (file hash) is useless for sensor data. Real dedup uses perceptual / embedding-distance dedup with a tunable threshold; you typically end up dropping 20–60% of incoming hours from a fleet of bored vehicles (highway driving, parking-lot dwell). Once de-duped, balance by ODD bucket, geography, weather, time-of-day, and rare-class presence. NeMo Curator's semdedup and MinHash-LSH modes are good references.
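
A minimal greedy sketch of embedding-distance dedup (real systems like NeMo Curator's semantic dedup cluster first so this doesn't go quadratic):

```python
import numpy as np

def embedding_dedup(embs: np.ndarray, threshold: float = 0.95) -> list[int]:
    """embs: (n, d) L2-normalized clip embeddings, in ingest order.
    Keep a clip only if nothing already kept is within `threshold` cosine
    similarity; returns indices of survivors."""
    kept: list[int] = []
    for i, e in enumerate(embs):
        if not kept or np.max(embs[kept] @ e) < threshold:
            kept.append(i)
    return kept
```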

Data quality scoring

Per-clip score combining: sensor health, calibration drift, blur/exposure, geometric coverage, label confidence, model uncertainty, novelty (embedding distance to existing data). Cheap models on ingest, expensive checks at curation time. The 2024–25 best practice is to train a learned-quality model on a small human-graded sample and apply it to the firehose.

Schema unification across sensor stacks

The non-glamorous problem that ate everyone's roadmap. Standards to know (all of which recur elsewhere in this doc):

  • ASAM OpenLABEL — label/annotation exchange format for multi-sensor data.
  • ASAM OpenSCENARIO / OpenDRIVE — scenario and road-network description.
  • ASAM OpenODD — operational-design-domain specification.
  • MCAP — container format for robotics logs (the Foxglove ecosystem).
  • ROS 2 message schemas — the de-facto robot middleware types.
  • Open X-Embodiment episode format — the emerging convention for manipulation data.

If you can't translate cleanly between OpenLABEL and your customer's internal schema, you cannot sell into AV.

Why "curation > collection" became conventional wisdom 2023–25

Three converging facts:

  1. Diminishing returns on raw scale. Chinchilla-era scaling laws assumed clean, near-IID data. Once you saturate the easy distribution, doubling raw hours adds nothing; doubling long-tail diversity still adds a lot.
  2. Curation tools got 100× better. SAM, CLIP, SigLIP, DINOv2, foundation-model judges. What used to be human-only is now machine-tractable.
  3. Frontier-lab evidence. Microsoft's "Phi" line of work ("Textbooks Are All You Need"), Meta's LIMA paper, and the multi-institution DataComp-LM (DCLM) effort all showed small-but-curated > large-but-noisy. Same lesson holds in robotics (Open X-Embodiment, RT-X) and AV.

The slogan to internalize: the model architecture is mostly fixed; the dataset composition is the lever.


D. The labeling stack for different data modalities

Quick tour of what "a label" actually is per modality, and what is currently auto-labelable.

2D camera

  • Boxes (axis-aligned, oriented). Auto-labelable to high quality with Grounding DINO + class taxonomy.
  • Polygons / masks. Auto-labelable with SAM/SAM 2 from box prompts.
  • Keypoints (human pose, vehicle wheels). Auto-labelable via ViTPose / RTMPose; humans verify rare poses.
  • Polylines (lane lines). Mostly model-bootstrapped + human cleanup.
  • Attributes (color, occlusion, truncation, intent flags). VLM-labelable to ~human quality for coarse attributes; humans for fine intent.

3D lidar / radar

  • 3D cuboids, optionally with heading and velocity. Offline auto-labeling (Auto4D, Waymo's offboard detector) is now state-of-the-art, better than humans on most categories.
  • Point-cloud semantic segmentation. Cross-modal propagation from 2D masks (SAM → 3D) is dominant; legacy fully-3D human seg is reserved for rare classes.
  • Instance / panoptic in 3D. Same — 3D trackers + 2D mask propagation.
  • Tracking across frames. Auto via greedy / Hungarian / learned trackers (e.g., CenterPoint + ByteTrack, SimpleTrack); human only for ambiguous merges/splits.
  • Radar labels. Hard — sparse and noisy. Often labeled by association to lidar/camera labels in time-synced calibration.

Multimodal sensor fusion

  • Cross-sensor consistent IDs (the same vehicle in cam + lidar + radar). Achieved via offline 4D fused detector that gets all modalities; outputs become labels for each unimodal model.
  • Calibration/extrinsics labels — annoying to get right. Deepen and Kognic specialize here.

4D / temporal

  • Tracklets — per-object trajectory in time. Almost always from offline trackers, rarely human.
  • Future-trajectory / prediction labels — the future is the label (replay logs to extract ground-truth future motion). Auto by construction.
  • Causality / interaction labels — "actor A yielded to actor B." Harder; current approaches use VLMs over rendered top-down views, then humans verify.

HD map labels

  • Lane geometry, lane connectivity, traffic signs, traffic lights, road markings. Aggregated from many trips (Mobileye REM, Waymo offboard mappers). Labels are the output of a SLAM + semantic-association pipeline, with humans as final QA on conflicts.
  • Map versioning / change detection — automatic via persistent diff of crowd-aggregated maps.

Robot manipulation demos

  • Teleop demonstrations — recorded action streams from human pilots (gloves, VR, leader-follower arms). Labels = the actions themselves. Examples: ALOHA / Mobile ALOHA (Stanford, 50 demos for many tasks), DROID (76k demos across labs), Open X-Embodiment (1M+ episodes aggregated).
  • Mocap demonstrations — Vicon/OptiTrack ground truth; expensive but precise.
  • Language annotations — typically attached at the episode level: a one-sentence task description ("pick up the red mug, place on shelf"). Auto via VLM captioning of the first/last frame is now common.
  • Sub-task / skill segmentation — auto via VLMs over video, verified by humans.

References: Open X-Embodiment dataset, LeRobot, BridgeData V2, DROID.

Behavior / intent labels

Used for prediction and planner training. Examples: lane-change intent, yielding intent, parked vs stopped, jaywalking intent, occluded-actor presence priors. Two paths in practice (path 1 is sketched in code after the list):

  1. Replay-derived: extract from future logs ("the vehicle did change lanes 3s later, so the label is lane_change_intent=True"). Cheap but only available in retrospect.
  2. VLM-judged: prompt a VLM with the clip; this works for coarse labels, less well for fine-grained intent.
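
Path 1 in code — a hypothetical sketch assuming per-frame lane assignments for one tracked vehicle at 10 Hz:

```python
def lane_change_intent_label(lane_ids: list[int], t: int,
                             horizon_s: float = 3.0, hz: int = 10) -> bool:
    """True iff the vehicle's lane id changes within `horizon_s` seconds
    after frame t. The label comes from the log's own future, so it is
    only available in retrospect — the 'replay-derived' caveat above."""
    future = lane_ids[t : t + int(horizon_s * hz) + 1]
    return len(set(future)) > 1
```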

Free-form text labels (driver gaze, scene captions, VLA training)

  • Scene captions — VLMs at scale (GPT-4V, Gemini, InternVL). Used heavily for VLA pretraining and for retrievable text indices over the data lake.
  • Driver gaze / attention — eye-tracker hardware + model fits; used for cabin / driver-monitoring training.
  • Step-by-step instructions for VLA — autogenerated by VLMs from teleop video; humans correct ~10–30%.

Reward / preference labels (RLHF/RLAIF)

This is the bridge between LLM-style alignment and Physical AI:

  • Preference labels — pairs of trajectories or model outputs, "A or B is better." Used in driving (smoothness, comfort, social compliance) and in manipulation (which grasp succeeded better). The standard training objective for these labels is sketched after this list.
  • Reward labels — scalar quality scores per trajectory.
  • AI feedback (RLAIF) — VLMs judging robot/driving rollouts. Current practice — Anthropic's Constitutional AI paper (the conceptual progenitor of RLAIF), Google's RLAIF paper, and NVIDIA/UPenn's Eureka — shows model-generated rewards can match human ones for many tasks.
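
How preference labels typically become a training signal — a Bradley-Terry reward-model loss, sketched in PyTorch with an illustrative `reward_model` that scores whole trajectories:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, preferred, rejected):
    """-log sigmoid(r_A - r_B) for each labeled pair 'A is better than B'.
    Minimizing this pushes the scalar reward of the preferred trajectory
    above that of the rejected one."""
    r_a = reward_model(preferred)   # (batch,) scalar rewards
    r_b = reward_model(rejected)
    return -F.logsigmoid(r_a - r_b).mean()
```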

E. Quality control & human-in-the-loop economics

Inter-annotator agreement (IAA)

The standard metrics: Cohen's κ (two annotators, categorical), Fleiss' κ (multi-annotator), Krippendorff's α (any setting, missing data tolerated). For geometric labels: mean IoU between independent annotators on the same item. Production teams set a per-class IAA floor (e.g., κ ≥ 0.8) and gate annotators against it.
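
A minimal sketch of both metric families, using scikit-learn's cohen_kappa_score for categorical labels and a plain box IoU as the geometric building block:

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators' categorical labels for the same items (toy example)
ann_a = ["car", "car", "truck", "ped", "car"]
ann_b = ["car", "truck", "truck", "ped", "car"]
kappa = cohen_kappa_score(ann_a, ann_b)   # gate annotators on e.g. kappa >= 0.8

def box_iou(a, b):
    """Mean-IoU building block for geometric IAA: boxes as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)
```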

Multi-pass review

Common patterns:

  • 2+1: two independent annotators, third reviewer if they disagree.
  • N-of-K: N independent labelers, take majority or aggregate (CrowdLab, Snorkel-style).
  • Auto-labeler + 1 human verify: cheapest in 2025 for routine classes.
  • Auto-labeler + targeted human: route only the auto-labeler's low-confidence outputs to humans (active QA).

The economic shift: review is now the dominant human cost, not initial drawing.

Rough orders of magnitude (verify before quoting):

  • 2D box (single object, common class): historically $0.05–0.15; now <$0.01 with auto-label + spot-check.
  • 2D polygon mask (single object): historically $0.50–$3; now $0.05–$0.50 with SAM-assist.
  • 3D cuboid in lidar (single object, single frame): historically $1–$5; now $0.10–$1 with offline-detector–assist.
  • Point-cloud semantic seg (per scan): historically $50–$200; now $5–$30 with cross-modal propagation.
  • Manipulation teleop demo (one episode, one task): $5–$50 depending on hardware and task; not really a "label" in the traditional sense.
  • RLHF preference comparison (general): $1–$5; expert SME ≥ $20.

The trend everywhere is 10–50× reduction in human-touch cost between 2022 and 2025, almost entirely from foundation-model assistance.

Where human review will / won't be replaced

Will be replaced (already mostly is):

  • Common-class boxes/masks/keypoints in everyday camera data.
  • Lane line drawing on highway driving.
  • Generic captioning of routine scenes.
  • Audio transcription.
  • Most preference labels for "obvious-better" comparisons.

Won't be (or only partially):

  • Safety-critical edge cases — child near road, emergency vehicle, unusual cargo. Models cluster on training distribution; humans cluster on novelty. You want humans here forever.
  • Subtle social/causal judgments — yielding intent, eye contact, ambiguous right-of-way.
  • Domain-expert SME labels — medical, legal, defense, frontier-science RLHF.
  • Calibration / metadata — not a "label" but human judgment on whether a sensor recording is even trustworthy.
  • Fine-grained reward signals for novel robot tasks where no good pretraining exists.

The robust strategy: budget for 5–15% of all data to flow to humans, forever, and let the auto-label pipeline grow in volume without growing the human team proportionally. That ratio is the lever the Data Intelligence team owns.


F. The "data flywheel" as competitive moat

Companies whose flywheel is the moat

  • Tesla — fleet of millions of vehicles continuously triggering rare-event uploads, plus offline 4D auto-labeling and weekly retraining. Best-known data flywheel in transport.
  • Mobileye — REM aggregates compressed semantic landmarks from a crowdsourced fleet in the tens of millions of vehicles (see the REM discussion in Section B; the stated long-term ambition is ~1B km/day). Their HD-map advantage is purely the flywheel.
  • Waymo — fewer cars but vastly richer sensor stacks, well-organized scenario library, leading offboard auto-labeling. (Total cumulative rider-only miles by end of 2025 is [unverified] here — confirm from Waymo's most recent disclosures before quoting; mid-2025 reports were in the tens of millions of cumulative paid rider-only miles.) Different shape of flywheel — depth over breadth.
  • DeepMind / Google Robotics — RT-X / OpenX is partially a community move and partially an aggregation of Google's manipulation data plus 30+ partner labs.
  • Boston Dynamics / Toyota Research Institute / Physical Intelligence (PI) — fleet plus shared-data plays in manipulation. PI released π0 in October 2024 (π0 blog, arXiv 2410.24164) and π0.5 in April 2025 (π0.5 blog, arXiv 2504.16054) — π0.5 emphasizes open-world generalization via co-training on heterogeneous data (multiple robots, web data, high-level semantic prediction). PI also open-sourced π0 weights via the openpi repo in 2025 (Robot Report). PI has raised >$400M cumulatively as of mid-2025.
  • Anthropic / OpenAI on RLHF data — their internal HH/RLHF data is itself a moat; nobody else has years of preference labels at that scale.

Companies selling tools to enable a flywheel

  • Applied Intuition — sim + ADAS toolchain + (increasingly) data tooling for OEMs that want a flywheel but can't build one.
  • NVIDIA — Cosmos + NeMo Curator + Omniverse + DRIVE Sim + Isaac Sim. Selling picks and shovels at every layer.
  • Voxel51 — open-source dataset infrastructure as commoditized middleware.
  • Encord — multimodal curation as a hosted product.
  • Roboflow — long-tail vision flywheel-as-a-service.
  • Snorkel AI — programmatic-label flywheel for enterprise data.
  • Foxglove — log/visualization + MCAP, the "metadata layer" of a robotics flywheel.

The mid-2020s hypothesis

The data engine is the product.

i.e., most AV/robotics companies will fail not because their model architecture is wrong but because their data engine is too slow to close the loop on long-tail failures. Whoever sells the fastest, most opinionated data engine to the rest of the market becomes the equivalent of AWS for Physical AI training data.

Applied Intuition's bet is that they can be that company, combined with the simulator, by virtue of already being the toolchain inside many OEMs. That's what the Data Intelligence team is for.


What changed in 2024–26 (the meta-trend)

Three structural shifts in this market over the past 18 months are worth internalizing because they reshape who buys what:

  1. Neutrality is now a vendor differentiator. Post Meta–Scale, frontier labs explicitly choose data vendors that are not aligned with a competing AI lab. This is why Surge, Mercor, Turing, and Invisible all 10×'d in 2025, and why Labelbox doubled down on Frontier. Applied Intuition's pitch ("we're a toolchain vendor, not your AI competitor") is the same logic in AV/robotics. (TIME)
  2. Foundation models compress the human label budget. SAM 2, Grounding DINO, Florence-2, and frontier VLMs (GPT-4o, Gemini 2.5, Claude 4) now produce at near-human quality labels that cost $0.50–$3 apiece in 2022. The labor providers that didn't pivot (Appen) have been gutted; the platform vendors that built foundation-model–assisted tooling (Encord, SuperAnnotate, Voxel51) are growing. The new bottleneck is review and curation infrastructure, not drawing.
  3. "Physical AI data infrastructure" is the explicit category. Encord's Series C, NVIDIA Cosmos's CES launch, and Kognic's VLA grounding pivot are all bets on the same thesis: VLA/world-model training requires fundamentally different data tooling than 2D classification. Applied Intuition is unusually well-positioned here because the simulator and the data engine are the same data lifecycle from the customer's perspective.

G. Implications for the user (what to get hands-on with)

A pragmatic ramp for an Applied Intuition Data Intelligence role, ordered by ROI for first 90 days.

Tier 1 — must be fluent

  1. FiftyOne — install, load BDD100K or nuScenes, run compute_uniqueness, compute_mistakenness, build an embedding index, do a CLIP-text query, write a small plugin. This is the closest analog to whatever Applied has internally. (A compressed session is sketched after this list.)
  2. SAM 2 inference — run on a video, prompt with boxes, propagate masks. Understand temporal track quality and failure modes.
  3. Grounded-SAM 2 — text-prompt → boxes → masks pipeline. The default first-pass auto-labeler.
  4. nuScenes / Waymo Open Dataset / Argoverse 2 loaders — be comfortable with sensor calibration, ego-pose, time synchronization, OpenLABEL serialization.
  5. ASAM OpenSCENARIO and OpenDRIVE basics — read a concrete .xosc file end to end, write a simple cut-in scenario, understand the link to scenario-based testing.
  6. Embedding mining with FAISS or Qdrant — index ~1M vectors, do top-K, do range, do filtered queries with metadata.
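
A compressed version of item 1, covering uniqueness, mistakenness, an embedding index, and a CLIP-text query, on FiftyOne's built-in quickstart dataset (API names per FiftyOne's docs; verify against your installed version):

```python
import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")   # swap in BDD100K / nuScenes later

fob.compute_uniqueness(dataset)                # per-sample uniqueness score
fob.compute_mistakenness(dataset, "predictions", label_field="ground_truth")

# CLIP-backed similarity index: enables both by-example and by-text queries
fob.compute_similarity(dataset, model="clip-vit-base32-torch", brain_key="img_sim")
view = dataset.sort_by_similarity(
    "a child running between parked cars", k=25, brain_key="img_sim")

session = fo.launch_app(view)                  # inspect the hits in the App
```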

Tier 2 — strong working knowledge

  1. Open X-Embodiment loader + LeRobot — see how manipulation datasets are structured. Skim DROID and BridgeData V2.
  2. Snorkel weak supervision tutorial — write 5 labeling functions over a real text or image task, see how the LF aggregator resolves them.
  3. NVIDIA NeMo Curator — run the dedup + quality-filter pipeline on a small video corpus.
  4. NVIDIA Cosmos Curate — understand the inference graph (caption → tag → motion → embed).
  5. MCAP and Foxglove Studio — open a robot bag, write a custom message schema, sync multiple topics.
  6. Active learning — re-implement uncertainty + k-center coreset on CIFAR or KITTI to actually feel the curves. BAAL is fine for this.
  7. CLIP / SigLIP / DINOv2 embedding — compute and visualize a 1M-image manifold; know which embedding works for which task (CLIP for text-aligned, DINOv2 for fine-grained visual nearest-neighbors).

Tier 3 — read-once-and-internalize

Soft signals to develop

  • Opinions about cost-per-label — what should it cost, why, where will it go.
  • Opinions about flywheel speed — how fast does a customer's loop close right now, what's the bottleneck (ingest, mine, label, train, deploy), how do you cut it in half.
  • Comfort talking standards — OpenLABEL, OpenSCENARIO, OpenDRIVE, MCAP, ROS 2, Open X-Embodiment. Customers grade vendors on standards literacy.
  • Comfort talking simulator-data interplay — you're at Applied Intuition. The differentiator over Voxel51 / Encord is that the same tool builds scenarios and curates real data. Practice telling that story.

Sources

Primary references used above. Web-verified May 2026 unless marked [unverified].

Industry events and funding (2024–2026)

Companies (homepages and product pages)

Auto-labeling and foundation models

NVIDIA Cosmos / NeMo

Standards

Robotics datasets and policies

Curation methodology

Active-learning libraries