Physical AI
Project 17 · Phase G · Active learning + capstone · Hardware: Workstation GPU
Loops: COLLECT · CURATE · LABEL · EVAL

Project 17: BDD100K mini data-engine — CAPSTONE

What this project is. A complete miniature of the data-engine Applied Intuition's Data Intelligence team builds at industrial scale, implemented end-to-end on the public BDD100K driving dataset. You will collect, curate, label, train, evaluate by slice, mine the long tail, auto-label the mined data, retrain, and write up the delta in a 4-page case study. Time budget: roughly two weekends of focused work.

Why this is the capstone. Projects 01–07 each touched one sub-loop: FiftyOne curation, SAM-2 auto-labeling, CARLA scenarios, Cosmos sim2real, OpenVLA fine-tune, LIBERO eval. This project chains them. If you can run this notebook end-to-end and explain the numbers in the case study, you can credibly walk into a Data Intelligence interview and discuss Data Explorer, Axion, the customer flywheel, and the long-tail Pareto without hand-waving.

Goal

Run one full turn of the AI data flywheel — collect → curate → label → train → evaluate-by-slice → mine → re-label → retrain — on BDD100K, and produce a measurable delta in mAP@0.5 on the worst-performing operational slice between iteration 0 (initial training) and iteration 1 (after mining + retraining).

The capstone has one number you point to: the iter-0 → iter-1 delta on the worst slice. Everything else (the architecture diagram, the slice-distribution charts, the cost-per-label analysis, the discussion of how this scales) frames that number.

Loops touched

This is the only project in the roadmap that touches all four loops:

  1. COLLECT — subsample 10 hours-equivalent of BDD100K (~36K frames), stratified by ODD attributes so rare slices aren't lost.
  2. CURATE — load into FiftyOne, embed with CLIP ViT-B-32 (and optionally DINOv2 for comparison), tag every sample with weather / timeofday / scene from BDD's native annotations, and measure the imbalance.
  3. LABEL & TRAIN — fine-tune YOLOv8m on the curated training pool using BDD's ground-truth labels. Then auto-label a held-out subset with Grounded-SAM-2 and quantify the precision/recall against ground-truth — i.e. the "what fraction of human-labeling could you skip" number.
  4. EVAL — evaluate the detector on a slice-balanced eval pool, compute per-slice mAP, identify the worst slice, mine clips closest to its failure cases via CLIP-embedding distance, auto-label the mined clips, and retrain. Re-evaluate. Report the delta.

Why this matters for AI Data Intelligence

This project is the Applied Intuition Data Intelligence pitch in miniature. Applied Intuition's public products — Data Explorer (slice-aware data tooling) and Axion (closed-loop training) — are the industrial-scale version of exactly this notebook:

  • Data Explorer ↔ §2 of the notebook. Customers query their fleet data by ODD attribute, find a regression slice, and pull frames into a curated training set. That's dataset.match_tags("foggy") plus a similarity search (see the sketch after this list). We do this on 36K frames; Applied Intuition's customers do it on petabyte-scale logs.
  • Axion ↔ §6–7. Closed-loop training: pick worst slice, mine related data, auto-label, retrain, re-evaluate. That's the back half of the notebook. The fact that one iteration moves the slice metric is the entire product.
  • The customer flywheel. Every AV company's pitch deck has the same circular diagram: drive → log → mine → label → train → deploy → drive. This notebook is one clockwise revolution of that diagram.
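
A minimal sketch of that query pattern in FiftyOne — assuming the curated dataset from §1 is saved under the name bdd_mini and the ODD attributes are stored as sample tags; the dataset name and brain key are this sketch's assumptions, not necessarily what notebook.py uses:

```python
import fiftyone as fo
import fiftyone.brain as fob

# Load the curated dataset saved in §1 (name is an assumption).
dataset = fo.load_dataset("bdd_mini")

# One-time: build a CLIP similarity index over all frames.
fob.compute_similarity(
    dataset,
    model="clip-vit-base32-torch",  # FiftyOne model-zoo CLIP
    brain_key="clip_sim",
)

# "Data Explorer" move 1: filter the fleet by ODD tag.
foggy = dataset.match_tags("foggy")
print(f"{len(foggy)} foggy frames")

# "Data Explorer" move 2: text-prompted similarity search against the index.
candidates = dataset.sort_by_similarity(
    "foggy highway at dusk", brain_key="clip_sim", k=100
)

# Inspect the retrieved frames before promoting them to a training set.
session = fo.launch_app(candidates)
```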

If you finish this and can answer "how would you change this for a customer with 10 PB of footage?", you have working intuition for the job. (Section 8 of the case-study template is exactly that question.)

A subtler point: the value of a data engine only shows up in slice-aware metrics. The overall mAP barely moves between iterations because the worst slice is, by definition, rare. If you only watch the population number you would conclude "this didn't help" and stop investing. The internal habit a Data Intelligence team needs to instill in customers is to look at slice tables, not headline numbers — and the case study you produce here is a one-page version of that argument you can show in interviews.

What's deliberately out of scope

Three temptations to resist:

  • Beating SOTA on BDD. A YOLOv8m on 2.5K frames will not beat the bdd100k-models leaderboard. That's fine. The artifact is the workflow, not the absolute number.
  • Multi-model ensembles. One model, two iterations, one slice. If you find yourself reaching for ensembling, you're optimizing the wrong axis.
  • Production-grade data versioning. No DVC, no MLflow, no W&B unless you genuinely already use them. A clean outputs/ tree with numbered run folders is enough for the case study. Real customers do need DVC-equivalents — note that as a future-work bullet, then move on.

Prerequisites

Strictly required:

  • Project 01 complete. You should be comfortable opening a FiftyOne app, computing visualizations, and using compute_similarity to do k-NN retrieval. We will not re-explain that here.
  • Project 03 complete. You should have run Grounded-SAM-2 (or equivalent grounded-detection + segmentation) at least once and understand the prompt-tuning loop.
  • Phase 0 reading internalized. Especially:
    • docs/01-av-industry-and-data.md — for the long-tail framing.
    • docs/04-labeling-and-data-curation.md — for the labeling pyramid and the cost-per-label argument.
    • docs/06-open-problems-and-benchmarks.md — for what BDD100K is actually a benchmark for and what it's missing.

Recommended:

  • Skim docs/07-learning-roadmap.md Phase 5 — it justifies why this project sits where it does.
  • Project 12 (LIBERO eval) gives you eval-discipline reps; the per-slice eval here borrows that mindset.

Hardware

  • Workstation GPU, 12 GB VRAM minimum. RTX 4070, 4070 Ti, 4080, 4090, A4000+, or any single A/H100 all work. Apple-silicon Macs can run the FiftyOne curation and slicing sections, but YOLO training and Grounded-SAM-2 inference will be too slow to be practical (estimate 10–20× the times below).
  • ~150 GB free disk if you grab the full 100K-images bundle. ~50 GB if you stick to the 10K subset (the recommended fast path).
  • CPU + 32 GB RAM is plenty; FiftyOne is the heaviest CPU consumer and it's IO-bound on the embedding pass.

End-to-end wall-clock on a single RTX 4070 (12 GB):

  • Curation + embedding: ~20 min.
  • YOLOv8m fine-tune iter 0 (50 epochs, 2.5K imgs): ~2 h.
  • Grounded-SAM-2 auto-label of 100 mined frames: ~5 min.
  • YOLOv8m fine-tune iter 1: ~2 h.
  • Eval, mining, plotting: ~30 min total.

Plan for ~6 hours of compute, spread over two sittings.

Setup

cd projects/17-bdd100k-data-engine
chmod +x setup.sh
./setup.sh        # creates .venv, installs deps, clones Grounded-SAM-2,
                  # downloads SAM 2.1 + Grounding DINO + YOLOv8m weights
source .venv/bin/activate

setup.sh does not download BDD100K (license requires login + manual acceptance). You need to do that yourself:

  1. Make a free academic account at https://bdd-data.berkeley.edu/. The license permits research and not-for-profit use; commercial use requires contacting Berkeley OTL.
  2. Download the 10K Images bundle (~6 GB) from the portal — this is a labeled subset that is enough for the capstone walkthrough. If you want the full 10-hour video-equivalent, also download the 100K Images bundle (~70 GB).
  3. Download the Detection 2020 labels (bdd100k_det_20_labels_trainval.zip, ~80 MB).
  4. Unpack so the layout matches:
    data/bdd100k/
      images/
        10k/
          train/<id>.jpg
          val/<id>.jpg
          test/<id>.jpg
        100k/                # only if you grabbed the full bundle
          train/<id>.jpg
          val/<id>.jpg
      labels/
        det_20/
          det_train.json
          det_val.json

The notebook will refuse to run §1 if the labels are missing, with a helpful error pointing back here.

Steps (mapped to notebook sections)

The notebook (notebook.py) is organized as 8 numbered sections, each self-contained, each with a checkpoint at the end. Resuming from any section just requires having the previous section's outputs on disk.

  1. §1 — Collect. Load BDD100K via FiftyOne's loader, attach native weather / timeofday / scene tags, do a stratified subsample to ~36K frames (10 hours-equivalent at roughly one frame per second of source video — see "Subsampling strategy" below). Save the FiftyOne dataset to disk.
  2. §2 — Curate (embed + slice). Compute CLIP ViT-B-32 embeddings on every kept frame (with a --use-dinov2 flag for the alternative). Build a slice index: cartesian product of weather × timeofday × scene, with frame counts. Plot the long-tail distribution.
  3. §3 — Train pool / eval pool. Stratified split: 2500 frames for training, 500 frames for eval (balanced across the top 20 most common slices), the rest as the unlabeled mining pool. Convert BDD's det_20 JSON into YOLO format for the training pool. Emit the YOLO dataset YAML.
  4. §4 — Train iter 0. Fine-tune YOLOv8m for 50 epochs on the GT training pool. Save weights to outputs/runs/iter0/.
  5. §5 — Eval iter 0 (per-slice). Evaluate on the eval pool; compute per-slice mAP@0.5 by filtering predictions/GT by slice tag before calling Ultralytics' validator. Identify the worst slice (argmin mAP among slices with ≥15 eval frames). A per-slice eval sketch follows this list.
  6. §6 — Mine. Take the eval-set failures from the worst slice (frames where iter-0 mAP=0 or recall<0.3), average their CLIP embeddings into a "failure prototype", run k-NN against the unlabeled mining pool, return the top-100 mined frames.
  7. §7 — Auto-label + retrain (iter 1). Run Grounded-SAM-2 on the 100 mined frames with a BDD-class prompt list. Convert to YOLO format. Spot-check 20 frames against the original BDD labels (if the mined frame happened to be in the labeled split — often a few are) and report auto-label precision/recall. Retrain YOLOv8m on train ∪ mined; weights → outputs/runs/iter1/.
  8. §8 — Eval iter 1 + delta. Re-run the per-slice evaluation; diff against iter 0; produce the headline table and the outputs/figures/slice_delta.png chart. Pickle everything needed for the case study.

Each section ends with a # %% [markdown] ## Checkpoint cell that prints what was saved, what to verify visually, and what's safe to delete if you need to restart from that section.
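
To make §5 concrete, here is a minimal per-slice evaluation sketch in the spirit described above: write one image-list file and one temporary data YAML per slice, then hand each to Ultralytics' validator. The slice_index dict, output paths, and class order are assumptions for illustration — the notebook's own helper may differ:

```python
from pathlib import Path

import yaml
from ultralytics import YOLO

CLASSES = ["car", "truck", "bus", "pedestrian", "bicycle", "motorcycle",
           "traffic light", "traffic sign", "train", "rider"]

def per_slice_map50(weights: str, slice_index: dict[str, list[str]],
                    out_dir: Path, min_frames: int = 15) -> dict[str, float]:
    """slice_index maps a slice name ('foggy/daytime/highway') to the absolute
    paths of its eval images (YOLO .txt labels must sit in the usual labels/
    directory alongside each images/ directory). Returns slice -> mAP@0.5."""
    out_dir.mkdir(parents=True, exist_ok=True)
    model = YOLO(weights)
    results = {}
    for slice_name, images in slice_index.items():
        if len(images) < min_frames:
            continue  # too few frames for a stable estimate
        safe = slice_name.replace("/", "_").replace(" ", "-")
        img_list = out_dir / f"{safe}.txt"
        img_list.write_text("\n".join(images))
        data_yaml = out_dir / f"{safe}.yaml"
        data_yaml.write_text(yaml.safe_dump({
            "path": str(out_dir.resolve()),  # absolute path sidesteps pitfall 5
            "train": img_list.name,          # unused at val time, required key
            "val": img_list.name,
            "names": {i: c for i, c in enumerate(CLASSES)},
        }))
        metrics = model.val(data=str(data_yaml), verbose=False)
        results[slice_name] = float(metrics.box.map50)
    return results

# slice_maps = per_slice_map50("outputs/runs/iter0/weights/best.pt", slice_index, Path("outputs/eval"))
# worst_slice = min(slice_maps, key=slice_maps.get)
```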

Subsampling strategy (justified)

10 hours of BDD ≈ (10 × 3600 s) / (40 s per video) = 900 videos. Each video has one labeled keyframe, so naively that's 900 frames — too few for a credible detector training run. Instead, we treat "10 hours equivalent" as a budget on raw video time and target ~36K frames at roughly one frame per ~1 second of source video, sourced as follows:

  1. Anchor on the labeled 10K subset (10K frames, ~6 GB). This is the labeled split with detection annotations. Use it as both the training pool source (via stratified sampling, ~2.5K frames) and the eval pool source (slice-balanced, 500 frames).
  2. Optionally extend with the unlabeled 100K subset (one keyframe per video, 100K total) for the mining pool. This is the pool that "iter 1 mines from"; we don't need labels here because Grounded-SAM-2 will produce them. Take a 33K-frame stratified sample of the unlabeled-to-us frames (i.e. all 100K minus the 10K that overlap with labeled splits, then stratified by weather × timeofday).
  3. Stratification is non-negotiable. Pure-random subsampling of BDD lands you in a 50%+ clear/daytime/city street set and you'll never see fog or snow at training time. Stratify by the cartesian product of weather × timeofday (7 × 4 = 28 buckets, or 18 after dropping undefined); cap each bucket at its proportional share of the source pool, with a min-floor of 50 frames per slice so rare buckets aren't lost. See the sketch below.

Total: ~36K frames if you grab the unlabeled extension, ~10K frames if you only use the labeled subset. Both work — the unlabeled extension just gives the mining step in §6 more candidates to pull from.
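
A sketch of the stratified draw from point 3, assuming the per-frame attributes have already been flattened into a pandas DataFrame with weather and timeofday columns (the DataFrame and column names are assumptions; the notebook operates on the FiftyOne dataset directly):

```python
import pandas as pd

def stratified_subsample(df: pd.DataFrame, budget: int = 36_000,
                         min_floor: int = 50, seed: int = 42) -> pd.DataFrame:
    """Proportional sample per weather x timeofday bucket, with a per-bucket floor."""
    df = df[(df.weather != "undefined") & (df.timeofday != "undefined")]
    total = len(df)
    picks = []
    for _, bucket in df.groupby(["weather", "timeofday"]):
        share = round(budget * len(bucket) / total)    # proportional share
        n = min(len(bucket), max(share, min_floor))    # floor so rare buckets survive
        picks.append(bucket.sample(n=n, random_state=seed))
    return pd.concat(picks)

# sampled = stratified_subsample(frames_df)
# print(sampled.groupby(["weather", "timeofday"]).size().sort_values())
```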

Done criterion

You're done when all of the following exist:

  1. A 4-page case study at outputs/case_study/case_study_v1.md (or .pdf), filled in from case_study_template.md. Real numbers, real charts, no placeholders.
  2. A 1-figure architecture diagram at outputs/figures/architecture.png — a clean version of the Mermaid diagram in §2 of the case study.
  3. A measurable delta on the worst slice. The headline table in §7 of the case study has at least one slice with non-trivial positive Δ mAP after one full flywheel turn. "Non-trivial" means either (a) ≥+0.05 mAP@0.5 on a slice with ≥15 eval frames, or (b) a clearly-explained reason why the delta is small (e.g. auto-label quality dominated, or mined frames were off-distribution — both are publishable findings).
  4. An honest limitations section. §9 of the case study, six bullets minimum. Don't gold-plate; be candid about the toy-scale problems.

The bar is not "I trained a great detector". The bar is "I can show a DI hiring panel a clean before/after table on the worst slice and discuss what would change at 1000× scale."

Common pitfalls

  1. BDD label-format quirks. The det_20 JSON has objects with category strings that include "traffic light" (with space) and "traffic sign" (with space) — a naive category.replace(" ","_") will diverge from the YOLO names list. The notebook handles this in to_yolo_label(); if you reimplement, mirror that mapping exactly (a sketch of the mapping follows this list).
  2. ODD-tag schema choices. The temptation is to invent your own weather buckets ("clear+overcast as one"). Don't, on the first pass — use BDD's native attributes verbatim so you can defend the slice analysis without anyone questioning your taxonomy. Reserve schema redesign for a v2 of the project.
  3. CLIP vs. DINOv2 for mining. CLIP's text-image alignment is helpful when you can describe the failure mode in words ("foggy highway"), but it is style-biased — it can retrieve cinematic "foggy" content that isn't actually similar driving conditions. DINOv2 gives stronger pure-visual nearest neighbours but you lose text grounding. The notebook lets you toggle. Both are weaker than a domain-pretrained embedding would be at customer scale — note this in the case study.
  4. Train/test leakage by video. BDD's 100K-images set has one keyframe per 40-second video, so frame-level splits are reasonably safe. But the 10K subset has some near-duplicate scenes between train and val. If you use the 10K subset for both training pool and eval pool (which is the default fast path), spot-check that no eval frame's filename prefix shows up in the training pool. Better: hold out by video_id if you've parsed it from filenames.
  5. YOLOv8 dataset YAML paths. Ultralytics resolves train: / val: paths relative to the YAML's directory, not relative to cwd. The notebook writes the YAML next to the data and uses relative paths — replicate that, or you'll get cryptic "no labels found" errors at training time.
  6. Auto-label prompt drift. Grounded-SAM-2 with the prompt "car . truck . bus . pedestrian . bicycle . motorcycle . traffic light . traffic sign . train . rider" does well on the first 6 classes and poorly on train (rare in BDD anyway), rider (collides with pedestrian+bicycle), and traffic sign (over-fires on text and signage). Don't naively trust auto-label class IDs; the notebook reports per-class precision/recall against GT and you should expect 70–85% for the head classes.
  7. Grad-CAM / explanation visualizations. Optional but recruiter-impressive. The notebook includes a stub. Skip on the first pass; revisit when you're polishing the case study.
  8. Cost-per-label numbers are illustrative. Real human-labeling costs vary 5–20× across vendors and class complexity. Quote a range in the case study and cite the assumption.
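
For pitfalls 1 and 5, a minimal sketch of the det_20 → YOLO conversion and the dataset YAML. Class order, file names, and function names are illustrative — mirror the notebook's to_yolo_label() if you are reproducing its exact mapping:

```python
import json
from pathlib import Path

import yaml

# Keep BDD's category strings verbatim (spaces included) so they stay aligned
# with the names list written into the dataset YAML (pitfall 1).
CLASSES = ["car", "truck", "bus", "pedestrian", "bicycle", "motorcycle",
           "traffic light", "traffic sign", "train", "rider"]
CLASS_TO_ID = {c: i for i, c in enumerate(CLASSES)}

def convert_det20_to_yolo(det_json: Path, labels_dir: Path, img_w=1280, img_h=720):
    """Write one YOLO .txt per image from a BDD det_20 JSON file."""
    labels_dir.mkdir(parents=True, exist_ok=True)
    for frame in json.loads(det_json.read_text()):
        lines = []
        for obj in frame.get("labels") or []:
            if obj.get("category") not in CLASS_TO_ID or "box2d" not in obj:
                continue
            b = obj["box2d"]
            cx = (b["x1"] + b["x2"]) / 2 / img_w
            cy = (b["y1"] + b["y2"]) / 2 / img_h
            w = (b["x2"] - b["x1"]) / img_w
            h = (b["y2"] - b["y1"]) / img_h
            cls = CLASS_TO_ID[obj["category"]]
            lines.append(f"{cls} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
        (labels_dir / f"{Path(frame['name']).stem}.txt").write_text("\n".join(lines))

def write_dataset_yaml(root: Path):
    """Anchor the YAML with an absolute path so train/val resolution is
    unambiguous regardless of cwd (pitfall 5)."""
    (root / "bdd_yolo.yaml").write_text(yaml.safe_dump({
        "path": str(root.resolve()),
        "train": "images/train",
        "val": "images/val",
        "names": {i: c for i, c in enumerate(CLASSES)},
    }))
```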

When you finish: commit the case study, push, and move on to project 18 (strategy memo). The strategy memo will reference this project's numbers as concrete evidence. Keep the iter-0 → iter-1 table handy.

Files in this project

  • README.md
  • case_study_template.md
  • notebook.py
  • requirements.txt
  • setup.sh

The notebook (notebook.py) is in jupytext percent format — open it in VS Code or convert with jupytext --to notebook notebook.py.

Memo prompts

case_study_template.md

BDD100K Mini Data-Engine — Case Study

Fill this in as you work through notebook.py. Target length: ~4 pages (≈1500–2000 words) plus one architecture figure and 4–6 charts. Save the final version as outputs/case_study/case_study_v1.md and export to PDF.

The point of this document is not to claim a SOTA detector. The point is to walk through one full turn of the AI Data Intelligence flywheel — collect → curate → label → train → eval → mine → re-label → retrain — on a public driving dataset, and report the delta the loop produced on the worst-performing slice. A reader from Applied Intuition's Data Intelligence team should be able to skim this in five minutes and recognize the same workflow they sell to customers, just at 1/1000 scale.


1. Problem statement (≈200 words)

Setting. One paragraph: what the toy "AV team" you're playing the role of is trying to ship. Example: "An L2+ highway pilot for a customer operating in the Pacific Northwest. The customer's ODD spans clear, overcast, and rainy daytime conditions plus daytime fog (uncommon but critical) and nighttime urban driving. The detector under evaluation is a YOLOv8m-class model trained on a fixed budget of ~3000 labeled frames."

Hypothesis. One sentence: which slice you suspect will be the worst, and why. Reference Phase 0 reading on long-tail distributions.

Success metric. The number you'll point to at the end: mAP@0.5 on the worst slice, before vs. after one mining + retraining iteration. Per-class breakdowns are secondary; the headline is the slice delta.


2. Pipeline architecture

Replace the diagram source below with a real figure (outputs/figures/architecture.png) once you have the loop running. Mermaid source for editing:

flowchart LR
    A[BDD100K raw<br/>~100k frames] -->|stratified subsample<br/>10 hours ≈ 36k frames| B[FiftyOne dataset]
    B -->|CLIP ViT-B-32<br/>embeddings| C[Slice index<br/>weather × time × scene]
    C -->|hold-out by slice| D[Train pool<br/>~2.5k frames]
    C -->|hold-out by slice| E[Eval pool<br/>~500 frames<br/>balanced by slice]
    D -->|GT labels: BDD det_20<br/>auto-labels: GSAM-2| F[YOLOv8m fine-tune<br/>iter 0]
    F -->|val on E| G[Per-slice mAP table]
    G -->|argmin slice| H[Worst slice<br/>e.g. fog/night]
    H -->|CLIP nearest-neighbour<br/>in unlabeled pool| I[100 mined clips]
    I -->|GSAM-2 auto-label| J[YOLOv8m fine-tune<br/>iter 1<br/>train ∪ mined]
    J -->|val on E| K[Per-slice mAP table v2]
    G --> L[Δ mAP per slice]
    K --> L

Tip: the figure is the resume artifact. Make it clean. Export from Mermaid Live Editor as SVG, then PNG at 1200 px wide.


3. Numbers per stage

Fill this table from the print-outs in notebook.py. Numbers below are illustrative placeholders; replace them.

| Stage | Input | Output | Time | Disk |
| --- | --- | --- | --- | --- |
| Collect | 100K BDD frames | 36K frames (10 hr equiv) | 5 min | 6 GB |
| Curate (embed) | 36K frames | 36K × 512-dim CLIP vectors | 18 min on RTX 4070 | 75 MB |
| Slice | 36K frames | 60 (weather × time × scene) buckets | <1 min | — |
| Train pool selection | 36K | 2.5K (stratified) | <1 min | — |
| Eval pool selection | 36K | 500 (balanced 25/slice for top 20 slices) | <1 min | — |
| Auto-label (GSAM-2) | 1.0K mined frames | det labels (10 classes) | 28 min | 8 MB |
| Train iter 0 (YOLOv8m, GT labels) | 2.5K frames × 50 epochs | weights | 2 h 10 min | 200 MB |
| Eval iter 0 | 500-frame eval pool | per-slice mAP | 2 min | — |
| Mine | unlabeled 33K + worst-slice errors | 100 candidate clips | 4 min | — |
| Train iter 1 (YOLOv8m, GT + auto) | 2.6K frames × 50 epochs | weights | 2 h 15 min | 200 MB |
| Eval iter 1 | 500-frame eval pool | per-slice mAP | 2 min | — |

4. Curation findings — slice imbalance

Insert outputs/figures/slice_distribution.png (a stacked bar chart of weather × time-of-day from FiftyOne).

ODD attributes used. Following BDD's native annotation:

  • weather ∈ {clear, overcast, rainy, snowy, foggy, partly cloudy, undefined}
  • timeofday ∈ {daytime, night, dawn/dusk, undefined}
  • scene ∈ {city street, highway, residential, parking lot, gas stations, tunnel, undefined}

Imbalance observations (fill in your numbers):

  • Most common slice: e.g. clear / daytime / city street, e.g. 38% of frames.
  • Rarest operational slices (excluding undefined): e.g. foggy / daytime / highway (0.4%), snowy / night / residential (0.2%).
  • Number of slices with <50 frames: e.g. 9 of 60.

This is the long-tail Pareto that motivates the entire data-engine business. Reference: roadmap doc 04-labeling-and-data-curation §3.


5. Iteration 0 — baseline detector

Training config. YOLOv8m, 50 epochs, imgsz=640, batch=16, optimizer AdamW, cosine LR, warmup 3 epochs, COCO weights as init. A sketch of the equivalent Ultralytics call follows.
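
A minimal sketch of that training call under the config above (paths and the run name are assumptions; the notebook's wrapper may set additional options):

```python
from ultralytics import YOLO

model = YOLO("yolov8m.pt")  # COCO-pretrained init
model.train(
    data="data/bdd_yolo/bdd_yolo.yaml",  # dataset YAML path is an assumption
    epochs=50,
    imgsz=640,
    batch=16,
    optimizer="AdamW",
    cos_lr=True,          # cosine LR schedule
    warmup_epochs=3,
    project="outputs/runs",
    name="iter0",         # weights land in outputs/runs/iter0/weights/
    seed=42,
)
```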

Overall metrics. Fill in:

| metric | value |
| --- | --- |
| mAP@0.5 (overall) | e.g. 0.42 |
| mAP@0.5:0.95 (overall) | e.g. 0.24 |
| recall@IoU≥0.5 (car) | e.g. 0.71 |
| recall@IoU≥0.5 (pedestrian) | e.g. 0.55 |

Per-slice mAP@0.5 (top 8 slices, sorted ascending). This is the table that drives the rest of the case study.

| slice (weather / time / scene) | n_eval | mAP@0.5 |
| --- | --- | --- |
| foggy / daytime / highway | e.g. 18 | e.g. 0.11 |
| snowy / night / city street | e.g. 22 | e.g. 0.18 |
| rainy / night / city street | e.g. 25 | e.g. 0.21 |
| overcast / dawn-dusk / highway | e.g. 25 | e.g. 0.27 |
| clear / night / city street | e.g. 25 | e.g. 0.34 |
| rainy / daytime / city street | e.g. 25 | e.g. 0.39 |
| clear / daytime / highway | e.g. 25 | e.g. 0.51 |
| clear / daytime / city street | e.g. 25 | e.g. 0.58 |

The overall number hides everything. The worst slice is 3-5× worse than the best slice. This gap is the value proposition of a data-engine: a generic detector training run cannot see this gap; only a slice-aware eval can.


6. Worst-slice analysis

Slice chosen: fill in, e.g. foggy / daytime / highway.

Failure-mode hypothesis. One paragraph. Examples: "On 18 fog-highway frames the detector misses ~70% of vehicles >50 m away. Class confusion is minimal — the issue is recall, not precision. Backbone activations on these frames look diffuse (visualized via Grad-CAM in outputs/figures/gradcam_fog.png), suggesting the model never saw enough low-contrast vehicle silhouettes during training."

Visual evidence. Insert 4–6 thumbnails from outputs/figures/worst_slice_failures.png showing predictions vs. GT.

Mining strategy. We embed the failure cases themselves (the 18 fog frames) with CLIP ViT-B-32, average the embeddings to produce a "fog prototype" vector, and run k-NN against the 33K-frame unlabeled pool. Top-100 is what we mine.

Why this works at toy scale. CLIP captures coarse weather/lighting cues even though it was never trained on driving labels. At industrial scale, dedicated AV-pretrained embeddings (e.g. DINOv2 fine-tuned on internal video) close the gap further — but the workflow is identical.
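
A minimal sketch of this prototype-mining step, assuming the CLIP embeddings have been exported as float32 numpy arrays and FAISS-CPU is installed; array and variable names are illustrative:

```python
import numpy as np
import faiss  # the notebook's FAISS-CPU dependency; any exact k-NN index works

def mine_from_failures(failure_embs: np.ndarray, pool_embs: np.ndarray,
                       pool_ids: list, k: int = 100) -> list:
    """Average the failure-case embeddings into one prototype and return the
    k nearest frames in the unlabeled pool by cosine similarity."""
    proto = failure_embs.mean(axis=0, keepdims=True).astype("float32")
    pool = pool_embs.astype("float32")
    faiss.normalize_L2(proto)   # after L2-normalization, inner product == cosine
    faiss.normalize_L2(pool)
    index = faiss.IndexFlatIP(pool.shape[1])
    index.add(pool)
    _, idx = index.search(proto, k)
    return [pool_ids[i] for i in idx[0]]

# mined = mine_from_failures(fog_failure_embs, pool_embs, pool_filepaths, k=100)
```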


7. Mining + auto-label + retraining (iteration 1)

Mined frames inspected. Insert outputs/figures/mined_frames_grid.png (a 10×10 grid). Sanity check: how many of the 100 are actually fog/highway? e.g. 78/100 — the rest are overcast highway, which is fine; they're still informative.

Auto-label cost vs. ground-truth cost.

| labeling source | frames | wall-clock | $ at $0.50/frame human-equiv |
| --- | --- | --- | --- |
| Human (BDD baseline) | 100 | ~3 hours | $50 |
| Grounded-SAM-2 auto-label | 100 | ~3 minutes | <$0.10 (compute only) |

Auto-label quality (sampled 20 frames against human). Fill in.

| metric | value |
| --- | --- |
| precision (matched IoU≥0.5) | e.g. 0.84 |
| recall | e.g. 0.71 |
| classes most often missed | e.g. traffic light, bicycle |

Retraining config. Same as iter 0 but training set = 2500 GT + 100 auto-labeled mined frames. Validation set unchanged.

Per-slice mAP@0.5 — iter 1 vs iter 0. This is the headline table.

| slice | iter 0 | iter 1 | Δ |
| --- | --- | --- | --- |
| foggy / daytime / highway (mined) | e.g. 0.11 | e.g. 0.19 | +0.08 |
| snowy / night / city street | e.g. 0.18 | e.g. 0.18 | 0.00 |
| rainy / night / city street | e.g. 0.21 | e.g. 0.22 | +0.01 |
| overcast / dawn-dusk / highway | e.g. 0.27 | e.g. 0.29 | +0.02 |
| clear / night / city street | e.g. 0.34 | e.g. 0.34 | 0.00 |
| rainy / daytime / city street | e.g. 0.39 | e.g. 0.40 | +0.01 |
| clear / daytime / highway | e.g. 0.51 | e.g. 0.51 | 0.00 |
| clear / daytime / city street | e.g. 0.58 | e.g. 0.57 | -0.01 |
| overall | e.g. 0.42 | e.g. 0.43 | +0.01 |

Key reading of the table. Targeted mining lifts the worst slice without regressing the head of the distribution. Overall mAP barely moves (+0.01) because the worst slice is rare — and that is exactly the point: at population-level metrics, this work would look invisible. You only see the value in slice-aware eval. This is the central argument of the AI Data Intelligence pitch.


8. Discussion — what this looks like at Applied Intuition scale

The data-engine in this notebook is ~36K frames, one detector, one worst slice, one iteration. Applied Intuition's customers operate the same flywheel at a scale orders of magnitude larger. Here is what changes.

1. Scale of the unlabeled pool. At customer scale, the "unlabeled pool" is petabytes of fleet logs, not 33K frames. Embedding-based mining must run on a vector index that supports streaming inserts (LanceDB, Milvus, or a custom Spanner-backed index). The notebook's FAISS-CPU step would become a managed service.

2. Embedding choice. We used vanilla CLIP because it works on a laptop. At scale, customers train driving-specific video embeddings (e.g. self-supervised DINOv2 on raw fleet footage) so that "fog prototype" actually generalizes to their truck fleet, not the generic internet. The Phase 0 reading on representation learning becomes load-bearing here.

3. Slice schema. Our ODD schema (weather × time × scene) is hand-coded. At scale, ODD definitions are a contract between AV team and platform — typed schemas, versioned, tested. Applied Intuition's Data Explorer essentially exposes this contract as a queryable surface.

4. Auto-label quality gates. Our auto-labels were spot-checked on 20 frames. At scale, every auto-label flows through a confidence-based router: top-confidence labels go straight to training, mid-confidence go to human review, low-confidence are discarded. This is the labeling pyramid from doc 04 §5.

5. Retraining cadence. Our iter 0 → iter 1 was a Tuesday afternoon. At scale, retraining is a nightly or weekly job triggered by slice-regression alerts; the slice-mAP dashboard is a first-class product surface.

6. The customer's job-to-be-done. From their POV the workflow is: "my detector regressed on fog at night last week — show me the new fog-night clips, label them, retrain, ship." Everything else is plumbing. This notebook is that plumbing in miniature.


9. Limitations

  • Sample size. 36K frames is two orders of magnitude smaller than a realistic AV training set. The slice-mAP estimates have high variance; the iter 0 → iter 1 delta should be interpreted as direction, not magnitude.
  • One iteration. A real flywheel runs many turns; we stopped at one. Returns diminish but the workflow is the same.
  • Single random seed. Reproducing the train run with three seeds would tighten error bars; deferred for time.
  • Auto-label classes. Grounded-SAM-2 with a 10-class prompt list systematically under-detects train (rare, prompt ambiguous with "train tracks") and rider (category collision with pedestrian + bicycle). A real pipeline would tune the prompt + post-process.
  • Test-set leakage risk. We held out by frame, not by video. Frames from the same 40-second clip can land in train and val. Holding out by video_id would be more honest; flagged as future work.
  • CLIP ≠ driving expert. CLIP nearest-neighbour mining drifts toward visual style (color, lighting) more than semantic content (e.g. it retrieves snowy mountain landscapes as "foggy" if the lighting matches). A real customer would use a domain-trained embedding.

10. Reproduction notes

  • Code: notebook.py in this directory.
  • Dataset: BDD100K 10K subset, downloaded May 2026; SHA-256 of label archive recorded in outputs/case_study/data_provenance.txt.
  • Compute: single RTX 4070 (12 GB VRAM), ~6 hours wall-clock end-to-end.
  • Random seed: 42 throughout (set at the top of notebook.py).

Author: <your name>. Date: <YYYY-MM-DD>. Project 17 of the Physical AI roadmap. Part of preparation for Applied Intuition Data Intelligence.