Physical AI
Research·Doc 08·~30 min

Portfolio Connections and Gap Analysis

A tech-expert audit of the PhysicalAI sandbox as a system. Where does each artifact connect? What's covered well? What's missing, and how do we close it? How does this set someone up for the Applied Intuition Data Intelligence role — and where will they still be exposed?

Last verified: 2026-05-08. Read this after you've gone through 00-overview.md and at least one of the deeper docs. It assumes the vocabulary.


A. Reading the portfolio as one system, not eight folders

The instinct when you see a /docs/ folder with ten files and a /projects/ folder with eighteen subfolders is to treat them as a reading list. That's the wrong frame. The right frame is: this is a single connected pipeline that mirrors, in miniature, what an Applied Intuition customer's data engine looks like end-to-end.

After the round-2 reorganization (see §H' below), folder numbers and the recommended learning order are aligned. Project 01 is the first project to do; project 18 is the last. No mental mapping required.

graph TD
    subgraph "Knowledge layer (docs/)"
        D00[00 Overview]
        D01[01 AV industry]
        D02[02 Robotics FMs]
        D03[03 Sim & synthetic]
        D04[04 Labeling & curation]
        D05[05 World models]
        D06[06 Datasets & standards]
        D07[07 Roadmap]
        D08[08 This doc]
        D09[09 Frontier outlook]
    end
 
    subgraph "Phase A — Data fluency"
        P01[01 FiftyOne mining]
        P02[02 MCAP/ROS plumbing]
    end
 
    subgraph "Phase B — Labeling fundamentals"
        P03[03 SAM2 auto-label]
        P04[04 BEVFormer 3D]
        P05[05 VLM zero-shot labeling]
    end
 
    subgraph "Phase C — Production hygiene"
        P06[06 Privacy + provenance]
    end
 
    subgraph "Phase D — Simulation & world models"
        P07[07 CARLA scenarios]
        P08[08 3D Gaussian Splatting]
        P09[09 CARLA + Cosmos Transfer]
        P10[10 Cosmos Predict rollout]
    end
 
    subgraph "Phase E — Robotics adjacency"
        P11[11 OpenVLA Bridge]
        P12[12 LIBERO eval]
    end
 
    subgraph "Phase F — Behavior + closed-loop"
        P13[13 Argoverse motion]
        P14[14 Waymax sim-agents]
        P15[15 Bench2Drive closed-loop]
    end
 
    subgraph "Phase G — Active learning + capstone"
        P16[16 Active learning]
        P17[17 BDD100K capstone]
    end
 
    subgraph "Phase H — Strategy"
        P18[18 Strategy memo]
    end
 
    D00 --> D07
    D01 --> P01
    D02 --> P11
    D03 --> P07
    D03 --> P09
    D04 --> P03
    D04 --> P05
    D04 --> P16
    D05 --> P10
    D05 --> P08
    D06 --> P04
    D06 --> P02
    D06 --> P15
    D06 --> P13
    D06 --> P14
    D07 --> P01
 
    P01 --> P02
    P02 --> P03
    P03 --> P04
    P04 --> P05
    P05 --> P06
    P06 --> P07
    P07 --> P08
    P08 --> P09
    P09 --> P10
    P10 --> P11
    P11 --> P12
    P12 --> P13
    P13 --> P14
    P14 --> P15
    P15 --> P16
    P16 --> P17
    P17 --> P18

The flow: the descriptive docs (01–06) define the field; the synthesis (00) gives you the four-loop frame; the roadmap (07) sequences the practice; the projects (01–18) are the practice itself; the strategy memo (project 18) is where technical work becomes judgment; this doc (08) audits the whole thing; and the frontier doc (09 in /docs/) tells you what's next.

The capstone (project 17) is not just the last project — it is the integration node. It deliberately reuses the curation primitives from project 01, the auto-labeling from project 03, the 3D detection from project 04 (new), the active-learning loop from project 16 (new), and feeds an artifact (the case study) into the strategy memo.

If you want to stress-test the portfolio: pick any two projects and ask "what would the data flow look like between them in production?" If you can answer concretely with sensor frames, label schemas, and ODD tags, the portfolio is doing its job.


B. What the portfolio covers well

Let me name these honestly, with limits.

B.1 2D perception and curation. Projects 01, 03, and 17 give the user a thorough loop: load real data into FiftyOne, embed it, mine by similarity, auto-label with foundation models, retrain, measure delta. This is the daily-job surface for an Applied Intuition data engineer working with camera-heavy customer logs. The vocabulary (CLIP-text query, mistakenness, ODD slice, embedding-distance mining) transfers directly to Data Explorer / Axion.
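
For a concrete flavor of that loop, here is a minimal mining sketch against FiftyOne's brain API (the dataset name is a placeholder, not one of the portfolio's actual datasets):

    import fiftyone as fo
    import fiftyone.brain as fob

    # Placeholder dataset name; any FiftyOne image dataset works here.
    dataset = fo.load_dataset("bdd100k-val")

    # Build a CLIP similarity index once; it backs both text and image queries.
    fob.compute_similarity(
        dataset, model="clip-vit-base32-torch", brain_key="clip_sim"
    )

    # Mine an ODD slice by natural language, then hand the view to labeling.
    view = dataset.sort_by_similarity(
        "rainy night intersection", k=64, brain_key="clip_sim"
    )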

B.2 Scenario-based simulation as a discipline. Project 07 forces the user to write .xosc files by hand, parameterize a cut-in, and feel the combinatorial nature of scenario coverage. The OpenSCENARIO 1.x XML / 2.x DSL distinction is internalized. This is the conceptual core of Simian.

B.3 The hybrid sim pattern that's actually winning. Project 09 (CARLA + Cosmos Transfer) is the exact pattern Applied Intuition's Neural Sim productizes. The user feels both halves — classical sim for ground truth + world model for photoreal — and measures the gap. This is the difference between someone who has read the Cosmos blog post and someone who has run the workflow.

B.4 World-model action-conditioning failure modes. Project 10 has the user measure lateral and longitudinal drift, not read about it. After this project the user can speak credibly about "Cosmos Predict drifts ~1.2m laterally at 5s with prompt-encoded trajectory" instead of repeating marketing.
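
The measurement itself is small once predicted and ground-truth ego trajectories are in hand; a sketch of the ego-frame decomposition (the array shapes are my assumption, not the project's exact interface):

    import numpy as np

    def trajectory_drift(pred_xy, gt_xy, headings):
        """Split prediction error into longitudinal/lateral components.

        pred_xy, gt_xy: (T, 2) world-frame positions; headings: (T,) GT yaw.
        """
        delta = pred_xy - gt_xy
        cos_h, sin_h = np.cos(headings), np.sin(headings)
        lon = delta[:, 0] * cos_h + delta[:, 1] * sin_h    # along heading
        lat = -delta[:, 0] * sin_h + delta[:, 1] * cos_h   # across heading
        return lon, lat    # e.g. report |lat| at the 5 s horizon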

B.5 Robotics policy fluency. Projects 11 and 12 cover the VLA recipe end-to-end: stream OXE-format data via LeRobotDataset v3.0, LoRA-fine-tune OpenVLA, evaluate on LIBERO, and surface the LIBERO-PRO critique. The user can hold a credible conversation with a humanoid OEM's data-platform team.

B.6 Strategic articulation. Project 18 is the only piece that forces the user to commit to an opinion. Three clean memo prompts; a template that disciplines the writing; honest constraint that "saying everything matters" is failure.

B.7 The descriptive docs as reference material. The 32k+ words in docs 01–06 are the kind of thing you'd want as a reference handbook in your first three months on the team. They survive the work itself because they're sourced and dated.


C. Critical gaps and how the new projects close them

Here's where I'm most useful. As a tech expert reviewing this portfolio, the holes I see are not minor — they would actually leave the user exposed in a Data Intelligence interview, and certainly in week-one customer conversations. Five gaps stand out and each is being filled by a new project.

C.1 Tier-1 gaps (the new projects close these)

Gap 1: 3D perception & sensor fusion. Every existing project is camera-only / 2D. Real AV perception is 3D multi-sensor — the bird's-eye-view (BEV) representation has been the dominant architectural choice since 2022. BEVFormer, BEVFusion, and Tesla's occupancy networks all live here. Without hands-on with this layer, the user cannot speak credibly about the bulk of perception work at any AV company.

→ Filled by Project 04 — BEVFormer 3D detection on nuScenes. This project also introduces sensor calibration (intrinsics, extrinsics, ego frame, lidar frame) — a foundational concept no other project covered.
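
To make "calibration" concrete, the canonical lidar-to-image projection looks like this, written against generic homogeneous extrinsics and pinhole intrinsics of the kind nuScenes ships per sample:

    import numpy as np

    def project_lidar_to_image(points_lidar, T_cam_from_lidar, K):
        """Project (N, 3) lidar points into pixel coordinates."""
        pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
        pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]   # lidar -> camera frame
        in_front = pts_cam[:, 2] > 0                      # drop points behind camera
        uvw = (K @ pts_cam[in_front].T).T
        return uvw[:, :2] / uvw[:, 2:3], in_front         # perspective divide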

Gap 2: The data-plumbing layer. The existing projects use HuggingFace abstractions, FiftyOne object models, and CARLA's Python client. Real AV/robotics logs are MCAP files or ROS 2 bags. Applied Intuition's Logfile Studio is fundamentally about reading + slicing + querying this layer. Without fluency here, every higher-level pitch is hand-waving.

→ Filled by Project 02 — MCAP / ROS 2 / Foxglove plumbing. The artifact is a small CLI tool (tag_mining.py) that does scenario-tag mining over a corpus of MCAP files — the kernel of Logfile Studio at industrial scale.
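
A sketch of what tag_mining.py's kernel might look like, using the mcap package's reader; the channel-inventory tag rules are illustrative stand-ins for real decoded-field checks:

    from pathlib import Path
    from mcap.reader import make_reader

    def tag_file(path):
        """Emit coarse tags from an MCAP file's channel inventory."""
        tags = set()
        with path.open("rb") as f:
            summary = make_reader(f).get_summary()
            if summary is None:      # file has no summary section; skip it
                return tags
            topics = {ch.topic for ch in summary.channels.values()}
            if any("camera" in t for t in topics):
                tags.add("has_camera")
            if any("lidar" in t for t in topics):
                tags.add("has_lidar")
        return tags

    for p in sorted(Path("logs").glob("*.mcap")):
        print(p.name, tag_file(p))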

Gap 3: Real-to-sim via 3D Gaussian Splatting. Project 09 covered the generative half of synthetic data (Cosmos Transfer photo-realizes a CARLA render). The other half — Wayve PRISM-1, Waymo + NVIDIA EmerNeRF, Tesla 4D-GS, the Street Gaussians lineage — is real-to-sim: take a real drive, reconstruct it, render from new viewpoints. Applied Intuition's Neural Sim is in this exact space. Missing this means missing half of the simulation matrix.

→ Filled by Project 08 — 3D Gaussian Splatting real-to-sim. Includes the actor-removal demo (counterfactual generation) that ties directly to the Neural Sim narrative.

Gap 4: Closed-loop planner evaluation. Project 07 ran scenario-based testing with CARLA's rule-based autopilot. The much harder, much more important problem is closed-loop eval with a learned planner — open-loop log-replay metrics consistently underestimate failure modes (Waymo's argument against nuPlan-as-only-eval), and Bench2Drive (NeurIPS 2024) is the open analog of what Validation Toolset does for OEMs.

→ Filled by Project 15 — Bench2Drive closed-loop eval. Reuses the CARLA install from project 07 and connects the learned planner from project 11's OpenVLA fine-tune as a stretch path.

Gap 5: Active learning with formal uncertainty. The capstone (project 17) does mining via CLIP-distance-to-known-failure — useful, but informal. Production data engines use uncertainty sampling, BALD, k-center coreset, and BADGE. Active learning is the inner loop of every data engine and the technique behind Tesla's trigger-conditioned uploads. Without this, the user can't speak credibly about labeled-data efficiency.

→ Filled by Project 16 — Active learning loop. The single artifact is a learning-curve plot comparing labeled-data efficiency across methods — directly translatable to a customer conversation about cost-per-label.
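
The simplest member of that family, predictive-entropy uncertainty sampling, fits in a few lines; BALD, coreset, and BADGE change the scoring, not the loop (a sketch with assumed shapes):

    import numpy as np

    def select_batch(probs, budget):
        """Pick the most uncertain pool samples by predictive entropy.

        probs: (N, C) softmax outputs of the current model on the unlabeled pool.
        """
        entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
        return np.argsort(-entropy)[:budget]   # indices to send to labeling

    # Each round: train -> score pool -> select -> label -> retrain,
    # logging labeled-set size vs. val metric to produce the learning curve.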

C.2 Tier-2 gaps (documented, deferred — a follow-up roadmap)

These matter, but the user already had 14 projects when this list was drawn up. Each is tractable as a 1–2 day extension or a future iteration. (This list is the Round-1 snapshot; four of the items below were since promoted to full projects, see §H'.2, and §H'.3 gives what remains deferred.)

Behavior prediction and motion forecasting. Argoverse Motion Forecasting and the Waymo Open Motion benchmark are top public datasets. None of the existing projects touch behavior prediction, which is half of AV planning. Recommended fill: a focused 2-day project on Argoverse 2 motion forecasting with MultiPath++ as the baseline.

Sim-agents realism. The Waymo Sim Agents Challenge is the canonical hard problem in closed-loop sim — without realistic NPC behavior, your sim eval is unfalsifiable. Recommended fill: implement a baseline reactive agent on Waymax (Waymo's official sim agents framework) and compare to log-replay agents.

Reward modeling / preference data for E2E driving. Per /docs/04 §D.4, this is the AV equivalent of RLHF. Active research, no productized standard. Recommended fill: a small project that has the user collect preference labels on driving trajectories (sim-generated or mined from real logs) and fits a reward model. Connects to the Constitutional-AI / RLAIF lineage.
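
The modeling core of that fill would be a Bradley-Terry objective over trajectory pairs; a minimal sketch, assuming scalar reward-model scores per trajectory:

    import numpy as np

    def bradley_terry_nll(r_a, r_b, pref_a):
        """Negative log-likelihood that trajectory A beats trajectory B.

        r_a, r_b: reward-model scores; pref_a: 1.0 if A was preferred, else 0.0.
        """
        p_a = 1.0 / (1.0 + np.exp(-(r_a - r_b)))      # P(A > B | rewards)
        return -(pref_a * np.log(p_a) + (1.0 - pref_a) * np.log(1.0 - p_a))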

Privacy / face-and-plate blurring. Every commercial AV log goes through PII removal before training. EU GDPR + state laws + AI-Act requirements make this non-optional. Recommended fill: a small notebook running a face-detector + plate-detector on BDD frames and emitting a redacted variant. 1 day of work, important for production credibility.

Data lineage / provenance / SOTIF artifacts. ISO 21448 SOTIF expects training-data provenance. With world-model-generated data flooding the curation pipeline, lineage tracking is a 2026–2027 wedge product. Recommended fill: a notebook that wraps each artifact in a provenance manifest (data lineage graph) using OpenLABEL extensions.

On-device inference, distillation, quantization. The gap between an offline auto-labeler (large) and an online perception network (small) is distillation + quantization — the FSD-style pipeline. Recommended fill: a notebook that distills a YOLOv8-large model into a YOLOv8-nano with quantization-aware training, measured on TensorRT.
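
The distillation half reduces to a soft-target loss between teacher and student logits; a minimal PyTorch sketch of the standard Hinton-style term (the temperature and weighting are illustrative choices, not the project's settings):

    import torch.nn.functional as F

    def kd_loss(student_logits, teacher_logits, T=2.0):
        """Soft-label distillation term between teacher and student."""
        return F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)   # rescale so gradients match the hard-label term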

Multi-agent sim integration. None of the projects use Waymax (Waymo's sim agents framework). For someone going deeper into closed-loop, this is the canonical sim. Recommended fill: port project 15's planner eval to Waymax for the comparison.

C.3 Tier-3 gaps (nice-to-have, beyond the role's core)

Foundation-model training infra (FSDP, DeepSpeed, Megatron-LM). Important if the user pivots to model-team work; less central to a Data Intelligence role.

Defense modalities. Applied Intuition Defense + EpiSci is a real pillar, but the lack of public-data analogs (there is no DROID equivalent for defense) makes a hands-on project hard. Read-only learning via the EpiSci publications + CDAO AEP RFI documents is more tractable.

Multimodal LLM eval harness (e.g., for VLA judging). As VLMs increasingly auto-label, evaluating the auto-labelers becomes a meta-problem. Defer.


D. Engineering hygiene gaps

These aren't research gaps; they're code-quality gaps in the portfolio.

D.1 Cross-project shared utilities

Every project today has its own setup script, its own embedding code, its own dataset loader. Across 18 projects this duplication should be refactored into a shared _lib/ library:

  • _lib/datasets/ — unified loaders for nuScenes, BDD100K, Argoverse 2, Bridge V2, LIBERO. Each emits a common Sample schema (a sketch follows this list).
  • _lib/sensors/ — sensor-frame transforms (intrinsics, extrinsics, ego-frame, lidar-frame), used heavily in projects 04, 07, 08, 09.
  • _lib/embeddings/ — CLIP, SigLIP, DINOv2 wrappers with caching.
  • _lib/openlabel/ — OpenLABEL schema serialization for cross-project label exchange.
  • _lib/odd/ — the ODD tag schema (weather × time × scene × road) used by projects 01 and 17, formalized.
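
A possible shape for that common Sample schema; every field name here is an assumption, not an existing _lib API:

    from dataclasses import dataclass, field

    @dataclass
    class Sample:
        """Unit every _lib/datasets loader would emit."""
        uri: str                 # path or URL to the frame/clip
        dataset: str             # "nuscenes", "bdd100k", "argoverse2", ...
        sensor: str              # "CAM_FRONT", "LIDAR_TOP", ...
        timestamp_us: int
        labels: list = field(default_factory=list)      # OpenLABEL-style objects
        odd: dict = field(default_factory=dict)         # weather/time/scene/road
        provenance: dict = field(default_factory=dict)  # lineage manifest refs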

Recommendation: do this after completing the projects, not before. Premature abstraction would slow you down. By the time you hit project 17 (capstone), the duplication is obvious and the right shape is clear.

D.2 Reproducibility

Different projects use different tooling — uv (project 09), pip-venv (most), conda-style references (project 11's optional Octo path). Three improvements would help:

  • Per-project Dockerfiles for the heavy ones (08, 09, 10, 15). Cosmos Transfer and 3DGS especially benefit from a known-good image.
  • A Makefile per project with standard targets: make install, make test-smoke, make run-notebook. Makes "I'll come back to this in 3 weeks" actually work.
  • A devcontainer.json for VSCode users so the environment comes up identically on a fresh clone.

D.3 Experiment tracking

Project 11 mentions Weights & Biases but doesn't enforce it. Recommendation: any project that trains anything should auto-log to W&B (or MLflow) in a project-namespaced workspace. Five minutes of setup per project; it pays for itself the first time you ask "what did I run two weeks ago?"
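
The enforcement is three lines at the top of each training script; the project-namespaced name and logged values below are illustrative:

    import wandb

    # One workspace per project keeps runs comparable within a project.
    run = wandb.init(project="physicalai-11-openvla", config={"lr": 2e-5})
    # ... training loop ...
    wandb.log({"train/loss": 0.42, "epoch": 1})
    run.finish()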

D.4 Smoke tests

A simple GitHub Actions config that runs the first 3 cells of each notebook with a 10-row data subset would catch regressions when libraries upgrade. Doesn't need to be elaborate. ~50 lines of YAML.

D.5 Notebook conversion convention

The project READMEs say to use jupytext --to notebook notebook.py to convert. A repo-level Makefile target — make notebooks — that bulk-converts all of them would help the user who wants .ipynb files for a more familiar workflow.
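
jupytext also exposes a Python API, so the bulk target can be a short script instead of a shell loop (assuming the repo's notebook.py-per-project convention):

    # make notebooks: bulk-convert every project's percent-format script.
    from pathlib import Path
    import jupytext

    for py_path in sorted(Path("projects").glob("*/notebook.py")):
        nb = jupytext.read(str(py_path))                   # parse the .py script
        jupytext.write(nb, str(py_path.with_suffix(".ipynb")))  # write .ipynb
        print("converted", py_path)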


E. The critical chain (what to do if you only have 6 weeks instead of 12)

Not every learner has 12 weeks. If you have to compress, here is the priority order I'd defend:

Tier 1 — non-negotiable (weeks 1–3):

  1. Project 01 — FiftyOne scenario mining (Phase A.1)
  2. Project 02 — MCAP plumbing (Phase A.2 — the data-plumbing fluency the role demands)
  3. Project 03 — SAM2 auto-labeling (Phase B.1)
  4. Project 04 — BEVFormer 3D detection (Phase B.2 — the perception layer the role demands)
  5. Project 06 — Privacy + provenance (Phase C.1 — production hygiene the role expects)

Tier 2 — strongly recommended (weeks 4–5):

  6. Project 07 — CARLA scenarios (Phase D.1)
  7. Project 15 — Bench2Drive closed-loop (Phase F.3 — skip Phase D.3/D.4 in compressed mode if forced; closed-loop is more important than hybrid sim)
  8. Project 16 — Active learning (Phase G.1)

Tier 3 — strongly recommended (week 6):

  9. Project 17 — BDD100K capstone (Phase G.2 — cannot skip; it integrates everything)
  10. Project 18 — Strategy memo (Phase H.1)

Optional under compression: projects 05, 08, 09, 10, 11, 12, 13, 14 — each is valuable, but each is also adjacent rather than central to the daily AV-data-engine surface area. If you have 16 weeks, do all of them. If you have 6, defer them until you've shipped the critical chain.

One thing not to skip even under extreme compression: Project 18 (the memo). It is the highest-LTV writing artifact in the portfolio and the one most likely to differentiate the user in a hiring conversation.


F. How to use this audit document

F.1 If you're starting today

Read this doc, then reread 00-overview.md with the system frame in mind. Pick your time budget (12 weeks vs 6 weeks compressed). Print the Mermaid diagram in §A (it doubles as the project dependency map) and put it on the wall.

F.2 If you're halfway through and want to course-correct

Look at the four-loop diagram in 00-overview.md. For each loop, name the project that gave you the strongest fluency. If any loop is empty or weak, prioritize a Tier-1 project that fills it.

F.3 If you've finished all 18 projects

Re-read 09-research-frontier-and-outlook.md. Pick 2 of the Tier-2 gaps to fill as 1–2 day side projects. Update the strategy memo (project 18) with what you've learned. Begin contributing to one of the OSS repos in your stack.

F.4 Before an Applied Intuition interview

Walk through this doc and the project list. For each project, prepare a 2-minute version: what you built, what you measured, what surprised you, what you'd build differently. The interviewers will follow up on the specifics — particularly the numerical findings (e.g., "Cosmos Predict drifted 1.2m at 5s", "BADGE beat random by 18% at 25% labels", "BEVFormer NDS dropped 14 points in fog at night"). Specificity is the single biggest signal.


G. What a future audit should check

Six months from now, this audit should be re-run. Things to look for:

  1. Cosmos Predict 3 / GAIA-4 release. Project 10's prompt-encoded action conditioning is a workaround; if Cosmos ships native trajectory inputs, project 10 needs a rewrite.
  2. SAM 3 / SAM 4 stabilization. Project 03's two-stage Grounding-DINO + SAM 2 pipeline collapses to one stage if SAM 3's open-vocabulary path is production-ready.
  3. mmdetection3d obsolescence. Project 04 uses BEVFormer/BEVFusion; if mmdet3d drift continues, the install dance gets worse, and a Hugging Face Transformers-based BEV implementation may be more durable.
  4. LIBERO saturation. Per LIBERO-PRO, the original splits are saturated. Project 12 may need to migrate to LIBERO-PRO or a successor benchmark by late 2026.
  5. CARLA 0.10 (UE5) maturity. Project 07 targets 0.9.15 (UE4); if 0.10 stabilizes, both 07 and 15 should migrate.
  6. A new dominant labeling primitive. If frontier VLMs (GPT-6, Gemini 3, Claude 5) collapse the labeling stack into "send the clip to the API and get every annotation back", the labeling-platform vendor map in /docs/04 reshapes meaningfully.

The portfolio is a living artifact. Every six months, run this audit, retire what's become obsolete, add what's become foundational.


H'. Round-2 audit (completed 2026-05-08)

After the Tier-1 gap-fills, a second evaluation pass identified two structural issues that needed addressing. Both have now been resolved.

H'.1 The numerical ordering originally didn't match the pedagogical ordering

The first 14 projects were created in the order they were scoped, not the order they should be learned. Several gap-fill projects (originally numbered 10–14) belonged earlier in the learning narrative than their creation-order numbers suggested. The MCAP / ROS data-plumbing project, originally numbered 11, is foundational and belongs right after the FiftyOne project. The 3D detection project, originally 10, is a peer to the 2D auto-labeling project, not a post-capstone topic. The closed-loop eval project, originally 13, belongs after CARLA scenarios but before the capstone. The active-learning project, originally 14, belongs immediately before the capstone as the inner-loop technique it integrates.

Resolution: all 18 project subfolders were renamed so that folder numbers match the recommended learning order. The bijective mapping:

Old folder                        New folder                        Theme
01-fiftyone-scenario-mining       01-fiftyone-scenario-mining       (unchanged)
11-mcap-ros2-plumbing             02-mcap-ros2-plumbing             data plumbing
02-sam2-auto-labeling             03-sam2-auto-labeling             2D auto-label
10-bevformer-3d-detection         04-bevformer-3d-detection         3D detection
18-vlm-zero-shot-labeling         05-vlm-zero-shot-labeling         VLM labeling
17-privacy-and-provenance         06-privacy-and-provenance         privacy + provenance
03-carla-scenarios                07-carla-scenarios                scenario sim
12-gaussian-splatting-real2sim    08-gaussian-splatting-real2sim    3DGS real-to-sim
04-cosmos-transfer-hybrid-sim     09-cosmos-transfer-hybrid-sim     hybrid sim
05-cosmos-predict-rollout         10-cosmos-predict-rollout         world-model rollout
06-openvla-bridge-finetune        11-openvla-bridge-finetune        VLA fine-tune
07-libero-eval                    12-libero-eval                    VLA eval
15-argoverse-motion-forecasting   13-argoverse-motion-forecasting   motion forecasting
16-waymax-sim-agents              14-waymax-sim-agents              sim agents
13-bench2drive-closed-loop        15-bench2drive-closed-loop        closed-loop eval
14-active-learning-loop           16-active-learning-loop           active learning
08-bdd100k-data-engine            17-bdd100k-data-engine            capstone
09-strategy-memo                  18-strategy-memo                  strategy

All cross-references inside notebooks, READMEs, and setup scripts were updated by automated transform (sentinel-token round-trip to avoid double-replacement). From this point forward, project 01 → project 18 reads in pedagogical order. No mental remapping required.
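
The round-trip matters because the mapping contains chains (02 → 03 while 10 → 04 is pending), so naive sequential find-and-replace would rewrite some references twice; a sketch, with the mapping excerpted from the table above:

    import re

    OLD_TO_NEW = {"11": "02", "02": "03", "10": "04", "03": "07"}  # excerpt

    def remap(text):
        # Pass 1: old number -> unique sentinel, so later rules can't
        # re-match an already-rewritten reference.
        for old in OLD_TO_NEW:
            text = re.sub(rf"project {old}\b", f"project @@{old}@@", text)
        # Pass 2: sentinel -> final number.
        for old, new in OLD_TO_NEW.items():
            text = text.replace(f"project @@{old}@@", f"project {new}")
        return text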

H'.2 Four Tier-2 gaps were worth filling now, not deferring

Four of the Tier-2 gaps were promoted to Tier-1 after weighing them against the role's daily surface area:

  • Behavior prediction / motion forecasting → Project 13 (Argoverse 2 motion forecasting). Half of AV planning was untouched; this fills it. Connects perception (project 04) to closed-loop planning (project 15). Connects to docs/04 §D.2 on "labels from the future."
  • Sim-agents realism (Waymax) → Project 14 (Waymax sim-agents). The gating problem for any closed-loop claim per docs/06 §E.5 — "without realistic NPCs, closed-loop sim is unfalsifiable." Project 15 (Bench2Drive) uses CARLA's traffic manager, which is rule-based; Waymax is data-driven and is the canonical hard problem.
  • Privacy + data provenance → Project 06 (Privacy + provenance pipeline). Required for any commercial AV log; SOTIF demands lineage; Applied Intuition customers will ask in week one. Bundles two related concerns (PII redaction + OpenLABEL provenance manifests) into one cohesive project; both are about treating data as auditable.
  • VLM-based zero-shot labeling → Project 05 (VLM zero-shot labeling). The labeling frontier in 2026 is no longer Grounded-SAM (project 03 covers that); it is Gemini/GPT/Claude doing zero-shot semantic + behavior labels at near-human quality. Without empirical exposure to where VLMs win and lose vs traditional pipelines, the user can't make production routing decisions credibly.

Resolution: four new projects (now in the 05, 06, 13, 14 slots above) created with full README + notebook + requirements + setup, slotted into phases B (project 05), C (project 06), and F (projects 13 and 14) of the canonical sequence.

H'.3 What's now still in the deferred bucket (and why deferring is right)

After Round 2, the remaining deferred items in §C.2 are:

  • On-device inference / distillation / quantization — important but more central to FSD-style deployment teams than to the Data Intelligence team specifically.
  • Reward modeling / preference data for E2E driving — active research, not yet productized broadly; less daily-surface relevant.
  • End-to-end auto-driving model implementation — UniAD-style; would heavily overlap with project 15 (Bench2Drive).
  • Multi-modal occupancy networks (Tesla pattern) — covered conceptually in docs/04 §D.1; projects 04 (BEVFormer) and 08 (3DGS) together cover the technical primitives.

These are all legitimate 1–2 day extensions when the user actually hits a customer use case requiring them. They don't need to be in the standing portfolio.

H'.4 Total portfolio after Round 2

18 projects across 8 phases, ~16,600 lines of notebook code, ~91,000 words combined across docs and project READMEs. The compressed critical path is 10 projects (~6 weeks); the full path is 18 projects (~16 weeks at 10 hr/week).

The portfolio covers, comprehensively: data plumbing (01, 02), labeling 2D-3D-frontier (03, 04, 05), production hygiene (06), classical and world-model simulation (07, 08, 09, 10), robotics adjacency (11, 12), behavior prediction (13), sim-agent realism (14), closed-loop eval (15), active learning (16), integration (17), and strategy (18).


H. The forward direction

Read 09-research-frontier-and-outlook.md for the full forecast. The TL;DR for what to track:

  • Generalist robot policy at scale — does OpenVLA-style scaling continue past π0.6 / Gemini Robotics 1.5? What's the next OXE?
  • World models as primary simulator — does Cosmos / GAIA close the action-conditioning gap to <10 cm at 5s by 2027?
  • SOTIF-defensible auto-labeling — does ISO ship a new ML-perception standard? Who builds the first certified ML-data lineage tool?
  • Real-to-sim digital twins via 4DGS at fleet scale — does someone productize PRISM-1 / EmerNeRF as turnkey?
  • The "neutral data tier" consolidation — does Mercor / Surge / Turing / Invisible consolidate, or does a sixth player emerge?

The bet for the user joining Applied Intuition's Data Intelligence team: the data engine as competitive moat thesis is right; the question is whose data engine wins, and whether Applied Intuition's is structurally separable from the foundation-model ecosystem (Nvidia, Meta, Google) selling around it.

That's the right question to walk in with.