Learning Roadmap with Mini-Projects
A sequenced 8-to-12-week plan for getting fluent enough in Physical AI to walk into Applied Intuition's Data Intelligence team and add value from week 1. Mixes reading, hands-on building, and writing. Each phase produces a tangible artifact; each project has a definite "done" criterion.
Last verified: 2026-05-08.
How to use this roadmap
Three principles up front:
- Build artifacts, don't just read. The roadmap is structured so every phase ends with code or writing you could put in a portfolio. Reading without artifacts is the most common way smart people stall on a new domain.
- Map your work to the four-loop diagram in 00-overview.md. Every project below is tagged COLLECT / CURATE / LABEL & TRAIN / EVAL so you know which part of Applied Intuition's customer surface you're touching.
- Time-box ruthlessly. The estimated weeks are upper bounds. If a project starts blowing past, write up what you tried, mark it as a known-incomplete, and move on. The point is breadth-first competence, then depth where the role demands it.
Default time budget: ~10 hours/week. Double if you can; halve and stretch the calendar if not.
Phase 0 — Orient (Week 1)
Goal: internalize the field's mental model and vocabulary.
0.1 Read all six research docs in order
- 00-overview.md — 30 min
- 01-av-industry-and-data.md — 90 min
- 02-robotics-foundation-models.md — 60 min
- 03-simulation-and-synthetic-data.md — 90 min (this is the competitor map)
- 04-labeling-and-data-curation.md — 120 min (the most important doc for the role)
- 05-world-models-and-generative.md — 60 min
- 06-open-problems-and-benchmarks.md — 60 min
0.2 Read the seminal external artifacts
- Karpathy, "Software 2.0" — 15 min
- Karpathy CVPR WAD 2021 keynote — 35 min
- Open X-Embodiment / RT-X paper (arXiv:2310.08864) — 60 min, skim sections 3 & 5
- Cosmos paper (arXiv:2501.03575) — 60 min, focus on data-curation appendix
- GAIA-2 paper (arXiv:2503.20523) — 30 min, skim
- π0.5 paper (arXiv:2504.16054) — 45 min
- RAND "Driving to Safety" RR-1478 — 20 min, just the executive summary; this is the "you can't drive your way to safety" paper
- Auto4D paper (arXiv:2101.06586) — 30 min
Phase 0 deliverable
Write a one-page mental-model document (pinned in your notes, not a public artifact) that answers, in your own words:
- What are the four loops in physical AI and what's slow about each?
- What is the difference between a "world model" in the Dreamer sense vs the Sora sense vs the GAIA sense?
- Where, specifically, does Applied Intuition compete with Nvidia and where do they complement each other?
- What is the one sentence I would use to describe the Data Intelligence team's job?
If you can't write each answer crisply in a paragraph, re-read the relevant doc section. That's the diagnostic.
Phase 1 — AV data fluency (Weeks 2–3)
Goal: be able to load, query, and curate AV sensor data the way Applied Intuition's customers do.
Project 1.1 — nuScenes in FiftyOne with embedding-driven scenario mining
Loops touched: COLLECT, CURATE.
What to build:
- Install FiftyOne and the nuScenes mini-split.
- Load the dataset into a FiftyOne `Dataset` with full sensor metadata (camera, lidar, radar, ego-pose, ODD attributes).
- Compute per-clip CLIP embeddings using `fiftyone.brain.compute_embeddings`; index them in a FAISS-backed similarity field (a minimal sketch of this pipeline follows the list).
- Run `compute_uniqueness` and `compute_mistakenness` on a small detection model's predictions.
- Issue text-CLIP queries against your index for at least three scenarios:
- "child near road at dusk"
- "emergency vehicle with flashing lights"
- "construction zone with cones and a worker"
- For each query, retrieve top-K, eyeball the precision, and write up what works and what doesn't.
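To make the pipeline concrete, here is a minimal sketch. `fiftyone.brain.compute_similarity` (which builds the embedding index in one call), `compute_uniqueness`, `compute_mistakenness`, and the `clip-vit-base32-torch` zoo model are real FiftyOne APIs; the dataset name and the `predictions` / `ground_truth` field names are assumptions about how you imported nuScenes.

```python
import fiftyone as fo
import fiftyone.brain as fob

# Assumes the nuScenes mini split is already imported under this name
dataset = fo.load_dataset("nuscenes-mini")

# Build a CLIP similarity index over the samples; the zoo model name is
# real, the brain_key is arbitrary
fob.compute_similarity(
    dataset,
    model="clip-vit-base32-torch",
    brain_key="clip_sim",
)

# Uniqueness needs only images; mistakenness needs a predictions field
# and ground truth (field names are assumptions about your import)
fob.compute_uniqueness(dataset)
fob.compute_mistakenness(dataset, "predictions", label_field="ground_truth")

# Natural-language scenario mining against the CLIP index
view = dataset.sort_by_similarity(
    "child near road at dusk", k=25, brain_key="clip_sim"
)
session = fo.launch_app(view)  # eyeball the precision of the top-K
```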
Done criterion: you can demo a 5-minute walkthrough — load → embed → CLIP-text query → cluster → mistakenness — without needing to look up commands. You can articulate why CLIP is a poor proxy for some scenarios (e.g., it doesn't understand "ambulance with siren on" since audio isn't in the embedding) and how you'd compose a better mining query.
Why it matters: this is the single most important hands-on for the role. Applied Intuition's Data Explorer / Axion is, at its core, this same pipeline at petabyte scale. The vocabulary you build here transfers directly.
Project 1.2 — Auto-label a clip with SAM 2 + Grounding DINO
Loops touched: LABEL & TRAIN.
What to build:
- Pick a 30-second nuScenes clip with multiple actors.
- Run Grounding DINO with prompts like `"car . pedestrian . cyclist . traffic light . sign"`.
- Pipe the resulting boxes into SAM 2 for masks and propagate across frames (a hedged sketch follows this list).
- Compare to ground-truth nuScenes annotations: report per-class precision/recall and false-positive rate.
- Identify three failure modes and characterize them (occlusion, distance, unusual class, lighting).
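A hedged sketch of the box-to-mask handoff, assuming you have dumped the clip's front-camera frames to a `frames/` directory. The Grounding DINO checkpoint and the `sam2` video-predictor calls are real, but argument names have drifted across releases (e.g. `box_threshold` vs. `threshold` in newer transformers versions), and the SAM 2 config/checkpoint paths follow the sam2 repo layout, so adjust to your install.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
from sam2.build_sam import build_sam2_video_predictor

# Grounding DINO: open-vocabulary boxes on frame 0
proc = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-tiny")
gdino = AutoModelForZeroShotObjectDetection.from_pretrained(
    "IDEA-Research/grounding-dino-tiny"
)
image = Image.open("frames/000000.jpg")
prompt = "car . pedestrian . cyclist . traffic light . sign ."
inputs = proc(images=image, text=prompt, return_tensors="pt")
with torch.no_grad():
    out = gdino(**inputs)
dets = proc.post_process_grounded_object_detection(
    out, inputs.input_ids, box_threshold=0.35, text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)[0]

# SAM 2: masks from the boxes, propagated across the clip
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_t.yaml", "checkpoints/sam2.1_hiera_tiny.pt"
)
state = predictor.init_state(video_path="frames")  # directory of JPEG frames
for obj_id, box in enumerate(dets["boxes"]):
    predictor.add_new_points_or_box(state, frame_idx=0, obj_id=obj_id, box=box.numpy())

masks_per_frame = {}
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    masks_per_frame[frame_idx] = (mask_logits > 0.0).cpu().numpy()
```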
Done criterion: a Jupyter notebook + a 1-page write-up of where auto-labeling beats human cost-per-label and where it still needs review. You should be able to give a credible answer to "should we still pay humans to draw 3D cuboids on cars in clean weather?" (Spoiler: largely no — but the word "largely" is doing a lot of work.)
Phase 1 reading
- SAM 2 paper — 30 min
- Tesla AI Day 2022 occupancy networks segment — 45 min
- 04 §B (Auto-labeling & model-assisted annotation) re-read with the projects in mind
Phase 2 — Simulation and scenario authoring (Weeks 4–5)
Goal: speak ASAM-OpenSCENARIO/OpenDRIVE fluently, run a closed-loop sim, and understand exactly where classical sim and world models meet.
Project 2.1 — Author and run an OpenSCENARIO cut-in scenario in CARLA
Loops touched: COLLECT (synthetic), EVAL.
What to build:
- Install CARLA 0.10+ (or use the Docker image).
- Write a small `.xosc` (OpenSCENARIO 1.x XML) for a cut-in: ego at 25 m/s, lead vehicle at 20 m/s, lateral cut-in at TTC = 1.5 s, on a straight 3-lane segment loaded from a CARLA `OpenDRIVE` map.
- Run it. Capture front-camera + lidar + ground-truth tracks.
- Vary one parameter (TTC, weather, lead-vehicle speed) over a 10×10 grid. Plot the failure surface for the default CARLA autopilot (a sweep-harness sketch follows this list).
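The `.xosc` itself is plain XML against the OpenSCENARIO 1.x schema; the part worth scripting is the sweep. A sketch, assuming a templated scenario file with `{ttc}` / `{lead_speed}` holes; ScenarioRunner's `--openscenario` flag is real, but the `--output` flag and the stdout parsing are assumptions to verify against your scenario_runner version.

```python
import subprocess
import numpy as np
import matplotlib.pyplot as plt

TEMPLATE = open("cut_in_template.xosc").read()  # your 2.1 scenario, parameterized

ttcs = np.linspace(0.5, 3.0, 10)           # cut-in time-to-collision [s]
lead_speeds = np.linspace(15.0, 25.0, 10)  # lead-vehicle speed [m/s]

def run_scenario(ttc: float, lead_speed: float) -> bool:
    """Render the .xosc, run it through scenario_runner, return True on failure."""
    with open("cut_in.xosc", "w") as f:
        f.write(TEMPLATE.format(ttc=ttc, lead_speed=lead_speed))
    result = subprocess.run(
        ["python", "scenario_runner.py", "--openscenario", "cut_in.xosc", "--output"],
        capture_output=True, text=True,
    )
    return "FAILURE" in result.stdout  # crude; parse the criteria table properly

failures = np.array(
    [[run_scenario(t, v) for v in lead_speeds] for t in ttcs], dtype=float
)

plt.imshow(failures, origin="lower", aspect="auto",
           extent=[lead_speeds[0], lead_speeds[-1], ttcs[0], ttcs[-1]])
plt.xlabel("lead-vehicle speed [m/s]")
plt.ylabel("cut-in TTC [s]")
plt.title("Cut-in failure surface, default CARLA autopilot")
plt.savefig("failure_surface.png", dpi=150)
```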
Done criterion: you have a parameter-swept failure surface as a heatmap and you can read it. You can explain why "scenario coverage" is qualitatively different from "test coverage" in software engineering.
Project 2.2 — Hybrid sim: CARLA scene rendered through Cosmos Transfer
Loops touched: COLLECT (synthetic), LABEL & TRAIN.
What to build:
- Render the scenario from 2.1 in CARLA with full ground-truth labels (boxes, masks, depth).
- Pull a Cosmos Transfer 2.5 checkpoint.
- Feed the CARLA render through Cosmos Transfer with a "rainy night, city" prompt. Save the photoreal output with the original CARLA labels propagated.
- Train a small detector (YOLOv8 or DETR) on three variants — pure CARLA, CARLA+Cosmos, real nuScenes — and benchmark each on the real nuScenes test set (a comparison-loop sketch follows this list).
- Write up the gap closure (or lack thereof).
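For the detector leg, the Ultralytics API keeps the comparison loop to a few lines. A sketch, assuming three dataset YAMLs (hypothetical names) that each point at one training variant while sharing the same real nuScenes validation images.

```python
from ultralytics import YOLO

# Each YAML points at one training variant; all share the real nuScenes val split
variants = ["pure_carla.yaml", "carla_cosmos.yaml", "real_nuscenes.yaml"]

for data_yaml in variants:
    model = YOLO("yolov8s.pt")  # small pretrained detector as the common baseline
    model.train(data=data_yaml, epochs=30, imgsz=640)
    metrics = model.val(data="nuscenes_test.yaml")  # held-out real test set
    print(data_yaml, "mAP50-95:", metrics.box.map, "mAP50:", metrics.box.map50)
```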
Done criterion: quantitative numbers — mAP on nuScenes test for each training set — plus your interpretation of whether Cosmos Transfer is a meaningful sim2real bridge or a marketing artifact. This is the workflow Applied Intuition customers will ask about; having a number in your back pocket is gold.
Optional extension: repeat with Parallel Domain's open Step assets or Anyverse Studio trial to compare classical-render-with-physics-noise vs. world-model-photorealization.
Phase 2 reading
- 03-simulation-and-synthetic-data.md re-read sections D and E
- Cosmos-Transfer1 paper (arXiv:2503.14492) — 30 min
- ASAM OpenSCENARIO XML 1.x intro and DSL 2.x intro — read the first 20 pages of each spec
- Bench2Drive paper (arXiv:2406.03877) — 30 min, this is the closest open analog to Applied Intuition's eval workflow
Phase 3 — World models hands-on (Week 6)
Goal: form a calibrated opinion on whether world models replace, augment, or extend classical sim — informed by running one yourself.
Project 3.1 — Cosmos Predict driving rollout on nuScenes
Loops touched: COLLECT (synthetic), EVAL.
What to build:
- Pull a small Cosmos Predict 2 checkpoint (the 2B variant runs on a single H100 / A100 80GB).
- Take a real nuScenes scene's first frame + ego trajectory.
- Generate 5 seconds of forward video conditioned on the action sequence.
- Overlay against the ground-truth video. Measure pixel-level drift (LPIPS, SSIM) and trajectory-following error (lateral and longitudinal); a metric sketch follows this list.
- Write up the failure modes you see — object permanence, counting, identity drift, action-conditioning fidelity.
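A sketch of the drift measurement, assuming you have the rollout and the ground-truth clip loaded as aligned (T, 3, H, W) tensors in [-1, 1]; the `.pt` file names are placeholders.

```python
import torch
import lpips
from torchmetrics.functional import structural_similarity_index_measure as ssim

gen = torch.load("cosmos_rollout.pt")   # (T, 3, H, W) in [-1, 1], placeholder path
real = torch.load("nuscenes_clip.pt")   # same shape, time-aligned

loss_fn = lpips.LPIPS(net="alex")  # LPIPS expects inputs in [-1, 1]
with torch.no_grad():
    drift_lpips = [loss_fn(g[None], r[None]).item() for g, r in zip(gen, real)]
    drift_ssim = [
        ssim((g[None] + 1) / 2, (r[None] + 1) / 2, data_range=1.0).item()
        for g, r in zip(gen, real)
    ]
# Plot both against frame index: drift should grow with rollout horizon
```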
Done criterion: you can show the side-by-side and articulate when you'd trust Cosmos Predict for data augmentation, when you'd trust it for closed-loop eval, and when you wouldn't.
Project 3.2 (optional) — 1X World Model Challenge sample
Loops touched: COLLECT (synthetic), LABEL & TRAIN.
1X opened a world-model challenge with real humanoid teleop data. Train a tiny next-frame predictor (~100M params) for a few hours; produce action-conditioned rollouts; compare to the held-out video. This is the cleanest way to feel why robotics WMs are harder than driving WMs (the camera moves with the body in 6DoF).
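If you attempt this, the predictor can start far smaller than ~100M parameters while you debug the loop. A toy action-conditioned next-frame model, just to show the shape; the 21-dim action vector is an assumption about the challenge's teleop format.

```python
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    """Toy action-conditioned next-frame predictor (illustrative only)."""

    def __init__(self, action_dim: int = 21):  # action dim is an assumption
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
        )
        self.act = nn.Linear(action_dim, 128)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1),
        )

    def forward(self, frame: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        z = self.enc(frame)                         # (B, 128, H/4, W/4)
        z = z + self.act(action)[:, :, None, None]  # additive action injection
        return self.dec(z)                          # predicted next frame

model = TinyWorldModel()
pred = model(torch.randn(1, 3, 64, 64), torch.randn(1, 21))  # smoke test
```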
Phase 3 reading
- 05-world-models-and-generative.md re-read sections D (the physics debate) and F (threat to classical sim)
- VideoPhy paper (arXiv:2406.03520) — 25 min, the reality-check on whether WMs understand physics
- Cosmos-Drive-Dreams paper (arXiv:2506.09042) — 30 min
Phase 4 — Robotics adjacency (Week 7)
Goal: know enough about robotics data and VLAs to be the AV-team person other AV-team people consult on the robotics question.
Project 4.1 — Open X-Embodiment loader and a tiny VLA
Loops touched: COLLECT, LABEL & TRAIN.
What to build:
- Set up the LeRobot environment with LeRobotDataset v3.0 streaming.
- Load BridgeData V2 (a single-embodiment subset of OXE). Visualize a few episodes — actions, language goals, camera streams.
- Fine-tune OpenVLA-7B with LoRA on Bridge V2 for 1–2k steps on a single GPU. (If 7B is too large, use Octo-Small, ~27M params, instead — same workflow, easier hardware.) See the setup sketch after this list.
- Evaluate on a held-out subset — language-conditioned task success rate.
- Write up: where does the policy generalize (objects), where does it not (novel scenes), and what would you do differently.
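Setup sketch below. The LeRobot import path has moved between releases (older versions used `lerobot.common.datasets.lerobot_dataset`), the Hub repo id for the Bridge conversion is an assumption to look up, and the fine-tune command mirrors the flags in the openvla repo's `vla-scripts/finetune.py`, which you should verify against the version you clone.

```python
# Import path as of recent LeRobot releases; older versions used
# lerobot.common.datasets.lerobot_dataset -- adjust to your install
from lerobot.datasets.lerobot_dataset import LeRobotDataset

# Repo id is an assumption: find the actual BridgeData V2 conversion on the Hub
ds = LeRobotDataset("lerobot/bridge_data_v2")
print(ds.meta)  # fps, camera keys, action/state dims, episode count

sample = ds[0]
for key, value in sample.items():
    print(key, getattr(value, "shape", value))  # images, action, language goal

# LoRA fine-tune via the openvla repo (flag names from its FinetuneConfig;
# verify against the version you clone):
#   torchrun --nproc-per-node 1 vla-scripts/finetune.py \
#     --vla_path openvla/openvla-7b --dataset_name bridge_orig \
#     --data_root_dir <your OXE root> --use_lora True --lora_rank 32 \
#     --max_steps 2000
```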
Done criterion: working notebook + numbers + a paragraph on what cross-embodiment transfer feels like in practice.
Project 4.2 — LIBERO eval
Loops touched: EVAL.
Run LIBERO with the OpenVLA or Octo checkpoint you fine-tuned. Report success rate across the four LIBERO suites (Spatial, Object, Goal, Long).
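A hedged eval harness: `benchmark.get_benchmark_dict()` and the suite keys are LIBERO's real entry points (`libero_10` is its key for the Long suite), while `evaluate_policy` and `policy` are hypothetical stand-ins for your checkpoint's rollout loop.

```python
from libero.libero import benchmark

# libero_10 is LIBERO's key for the Long suite
suites = ["libero_spatial", "libero_object", "libero_goal", "libero_10"]
results = {}
for key in suites:
    task_suite = benchmark.get_benchmark_dict()[key]()
    successes, trials = 0, 0
    for task_id in range(task_suite.n_tasks):
        task = task_suite.get_task(task_id)
        # evaluate_policy: hypothetical helper returning #successes over n episodes
        successes += evaluate_policy(policy, task, n_episodes=10)
        trials += 10
    results[key] = successes / trials
print(results)  # the four numbers for the write-up
```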
Done criterion: four success-rate numbers and a one-paragraph reflection on why "saturated benchmark" claims are usually premature (LIBERO-PRO makes this argument formally).
Phase 4 reading
- 02-robotics-foundation-models.md re-read sections E (whitespace and bottlenecks) and F (implications)
- V-JEPA 2 paper (arXiv:2506.09985) — 30 min, the LeCun camp's case against pixel WMs
Phase 5 — Capstone: build a mini data-flywheel (Weeks 8–9)
Goal: integrate everything into the smallest end-to-end loop that demonstrates the Applied Intuition pitch in miniature.
Project 5 — Mini data-engine on BDD100K (or any AV dataset of choice)
Loops touched: all four.
What to build:
- Collect. Start with BDD100K (~1,100 hours, mostly commercially usable). Subsample 10 hours.
- Curate. Build a FiftyOne dataset, compute embeddings, slice by ODD attributes (weather, time-of-day, road type), measure imbalance.
- Label. Run Grounded-SAM 2 to auto-label a held-out subset; compare to BDD's official labels; quantify cost-per-label saved.
- Train. Fine-tune a small detector (YOLOv8m or DETR-base) on the auto-labeled vs. ground-truth-labeled splits; benchmark on a real test set.
- Eval. Slice the eval by ODD bucket; identify the worst-performing slice (it will be a long-tail one — fog at night, or unusual signage).
- Mine. From the unlabeled BDD pool, retrieve clips closest to the worst-performing slice's failure cases (CLIP-embedding distance to known errors); see the sketch after this list.
- Re-label, retrain. Add ~100 mined clips, auto-label, retrain. Measure delta on the worst slice.
- Document the loop. Write up the iteration as a 4-page case study with numbers, charts, and a 1-figure architecture diagram.
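The mining step is the one people most often hand-wave, so here is a sketch with FiftyOne, assuming the dataset names, the `weather` attribute field, and the per-sample `mistakenness` scores were set up in the curate and eval steps above.

```python
import numpy as np
import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz
from fiftyone import ViewField as F

labeled = fo.load_dataset("bdd100k-train")        # names are assumptions
pool = fo.load_dataset("bdd100k-unlabeled-pool")

model = foz.load_zoo_model("clip-vit-base32-torch")
fob.compute_similarity(pool, model="clip-vit-base32-torch", brain_key="clip_sim")

# Worst slice from the eval step, ranked by mistakenness
failures = (
    labeled
    .match(F("weather") == "foggy")
    .sort_by("mistakenness", reverse=True)
    .limit(50)
)

# Mine: embed the failure cases, query the pool index with their centroid
query = np.mean(failures.compute_embeddings(model), axis=0)
mined = pool.sort_by_similarity(query, k=100, brain_key="clip_sim")
mined.tag_samples("mined-round-1")  # hand these to the auto-label step
```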
Done criterion: the case-study document. This is the artifact you mention in the first 1:1 with your manager. It is the Applied Intuition Data Intelligence pitch, run end-to-end, by you, in miniature. If you can do this on BDD, you can talk credibly about doing it on a customer's logs.
Phase 6 — Strategic literacy (Week 10+, ongoing)
Goal: be useful in product / strategy conversations, not just technical ones.
Project 6.1 — Write a 2-page strategy memo
Pick one of:
- "Where Applied Intuition's Data Intelligence team should focus 2026–2027." Position relative to Nvidia Cosmos, Foretellix, Voxel51, Encord. Use the four-loop frame from 00-overview.md.
- "How Cosmos changes the AI roadmap in the next 18 months." Be specific about which products are threatened (rendering layer of Spectral) vs. which are reinforced (Validation Toolset, scenario authoring, Defense).
- "The robotics opportunity for AI." How does the AV data-engine port to humanoids and AMRs? What's the analog of Simian for manipulation? What's missing in LeRobot that AI could productize?
Constraints: 2 pages, no longer; opinion in the first paragraph; specific recommendations, not vague gestures.
Done criterion: could you hand this to a director on day 1 of the job and have them learn something? If yes, ship it; if no, rewrite.
Project 6.2 — Track and engage with primary sources
Subscribe / set alerts:
- Applied Intuition newsroom and blog
- Nvidia developer blog: Robotics, Cosmos, DRIVE tags
- Wayve thinking, Waabi blog
- The Robot Report, TechCrunch transportation tag
- arXiv cs.RO and cs.CV daily-trending
- OpenDriveLab, Voxel51 blog, Encord blog
Goal: 30 minutes a day skimming, ~1 hour a week deep-reading one new paper or post. After 2 months, you'll have organic, current opinions.
Project 6.3 (optional, ongoing) — Open-source contribution
Make at least one merged PR to one of:
- Voxel51 FiftyOne
- Hugging Face LeRobot
- Open X-Embodiment or related dataset infrastructure
- Bench2Drive
- carla-simulator/carla
A documentation fix or a small dataset loader is fine — the goal is signaling fluency in the same OSS the team uses, plus building a footprint that's discoverable when teammates Google your name.
What "done" looks like overall
After 8–12 weeks you should have:
- One mental-model 1-pager (Phase 0 deliverable, private).
- A FiftyOne + nuScenes notebook demonstrating embedding-driven scenario mining.
- A SAM2 + Grounding DINO auto-labeling notebook with quantitative comparison to ground truth.
- A CARLA OpenSCENARIO scenario file plus a failure-surface heatmap.
- A Cosmos Transfer hybrid-sim experiment with mAP numbers on nuScenes.
- A Cosmos Predict rollout study with drift measurements.
- An OpenVLA fine-tuning run on Bridge V2 with LIBERO eval numbers.
- A mini data-engine case study on BDD100K showing one full flywheel turn.
- A 2-page strategy memo on a real Applied Intuition question.
- (Optional) one merged OSS PR.
That portfolio gets you well past "junior on the team" and lets you participate in design discussions from day 1. It also generates concrete examples you can discuss in interviews and at conferences.
Common failure modes to watch for
- Reading without building. This material rewards reading but punishes reading alone. If a project is hard, don't escape into more reading.
- Optimizing the wrong thing. Goal is breadth-then-depth, not chasing SOTA on any one benchmark. If your YOLO numbers are 2% below SOTA but your write-up explains the long-tail story crisply, you win.
- Underestimating standards. ASAM OpenSCENARIO/OpenDRIVE/OSI/OpenLABEL look like XML drudgery. They're the lingua franca with OEM customers. Read at least one spec end-to-end.
- Skipping the strategy memo. Engineers can be allergic to writing. The memo is the highest-ROI artifact of the entire roadmap because it forces you to have an opinion.
- Treating Nvidia as the enemy. It isn't. Many Applied Intuition customers run Cosmos and Simian. The right disposition is "Cosmos-aware tooling" — which you cannot build without using Cosmos.
Adjustments by background
Tune the roadmap based on where you start.
- Strong ML/CV background, new to AV: put most of your weight on Phases 1–2 and the standards reading. Skim Phase 4 (robotics) — you'll see VLAs as "RL with a VLM frontend" and the field will feel familiar quickly.
- Strong robotics/RL background, new to AV: invert the above. Phases 1–2 are where the new vocabulary is. The AV "data engine" is more elaborate than what most robotics labs run.
- Strong infra/data engineering background, new to ML: Phase 0 will be heavier; allocate 1.5–2 weeks. Once that's done, you'll have an advantage on Phase 5 (the flywheel project), since data-pipeline craft is the hardest part of the role and ML is in some sense the easy part.
- Strong product/strategy background, light on code: keep the depth on Phase 6 and the reading list; lean on collaborators or AI assistants for Phases 1–5 implementation but still write the notebooks yourself, because you need the muscle memory.