Project 17: BDD100K mini data-engine — CAPSTONE
What this project is. A complete miniature of the data-engine Applied Intuition's Data Intelligence team builds at industrial scale, implemented end-to-end on the public BDD100K driving dataset. You will collect, curate, label, train, evaluate by slice, mine the long tail, auto-label the mined data, retrain, and write up the delta in a 4-page case study. Time budget: roughly two weekends of focused work.
Why this is the capstone. Projects 01–07 each touched one sub-loop: FiftyOne curation, SAM-2 auto-labeling, CARLA scenarios, Cosmos sim2real, OpenVLA fine-tune, LIBERO eval. This project chains them. If you can run this notebook end-to-end and explain the numbers in the case study, you can credibly walk into a Data Intelligence interview and discuss Data Explorer, Axion, the customer flywheel, and the long-tail Pareto without hand-waving.
Goal
Run one full turn of the AI data flywheel — collect → curate → label → train → evaluate-by-slice → mine → re-label → retrain — on BDD100K, and produce a measurable delta in mAP@0.5 on the worst-performing operational slice between iteration 0 (initial training) and iteration 1 (after mining + retraining).
The capstone has one number you point to: the iter-0 → iter-1 delta on the worst slice. Everything else (the architecture diagram, the slice-distribution charts, the cost-per-label analysis, the discussion of how this scales) frames that number.
Loops touched
This is the only project in the roadmap that touches all four loops:
- COLLECT — subsample 10 hours-equivalent of BDD100K (~36K frames), stratified by ODD attributes so rare slices aren't lost.
- CURATE — load into FiftyOne, embed with CLIP ViT-B-32 (and optionally DINOv2 for comparison), tag every sample with weather / timeofday / scene from BDD's native annotations, and measure the imbalance.
- LABEL & TRAIN — fine-tune YOLOv8m on the curated training pool using BDD's ground-truth labels. Then auto-label a held-out subset with Grounded-SAM-2 and quantify the precision/recall against ground-truth — i.e. the "what fraction of human-labeling could you skip" number.
- EVAL — evaluate the detector on a slice-balanced eval pool, compute per-slice mAP, identify the worst slice, mine clips closest to its failure cases via CLIP-embedding distance, auto-label the mined clips, and retrain. Re-evaluate. Report the delta.
Why this matters for AI Data Intelligence
This project is the AI Data Intelligence pitch in miniature. AI's public products — Data Explorer (slice-aware data tooling) and Axion (closed-loop training) — are the industrial-scale version of exactly this notebook:
- Data Explorer ↔ §2 of the notebook. Customers query their fleet data by ODD attribute, find a regression slice, and pull frames into a curated training set. That's `dataset.match_tags("foggy")` plus a similarity search. We do this on 36K frames; AI customers do it on petabyte-scale logs.
- Axion ↔ §6–7. Closed-loop training: pick the worst slice, mine related data, auto-label, retrain, re-evaluate. That's the back half of the notebook. The fact that one iteration moves the slice metric is the entire product.
- The customer flywheel. Every AV company's pitch deck has the same circular diagram: drive → log → mine → label → train → deploy → drive. This notebook is one clockwise revolution of that diagram.
If you finish this and can answer "how would you change this for a customer with 10 PB of footage?", you have working intuition for the job. (Section 8 of the case-study template is exactly that question.)
A subtler point: the value of a data engine only shows up in slice-aware metrics. The overall mAP barely moves between iterations because the worst slice is, by definition, rare. If you only watch the population number you would conclude "this didn't help" and stop investing. The internal habit a Data Intelligence team needs to instill in customers is to look at slice tables, not headline numbers — and the case study you produce here is a one-page version of that argument you can show in interviews.
What's deliberately out of scope
Three temptations to resist:
- Beating SOTA on BDD. A YOLOv8m on 2.5K frames will not beat the bdd100k-models leaderboard. That's fine. The artifact is the workflow, not the absolute number.
- Multi-model ensembles. One model, two iterations, one slice. If you find yourself reaching for ensembling, you're optimizing the wrong axis.
- Production-grade data versioning. No DVC, no MLflow, no W&B unless you genuinely already use them. A clean `outputs/` tree with numbered run folders is enough for the case study. Real customers do need DVC-equivalents — note that as a future-work bullet, then move on.
Prerequisites
Strictly required:
- Project 01 complete. You should be comfortable opening a FiftyOne app, computing visualizations, and using `compute_similarity` to do k-NN retrieval. We will not re-explain that here.
- Project 03 complete. You should have run Grounded-SAM-2 (or equivalent grounded-detection + segmentation) at least once and understand the prompt-tuning loop.
- Phase 0 reading internalized. Especially:
  - `docs/01-av-industry-and-data.md` — for the long-tail framing.
  - `docs/04-labeling-and-data-curation.md` — for the labeling pyramid and the cost-per-label argument.
  - `docs/06-open-problems-and-benchmarks.md` — for what BDD100K is actually a benchmark for and what it's missing.
Recommended:
- Skim `docs/07-learning-roadmap.md` Phase 5 — it justifies why this project sits where it does.
- Project 12 (LIBERO eval) gives you eval-discipline reps; the per-slice eval here borrows that mindset.
Hardware
- Workstation GPU, 12 GB VRAM minimum. RTX 4070, 4070 Ti, 4080, 4090, A4000+, or any single A/H100 all work. Apple-silicon Macs can run the FiftyOne curation and slicing sections, but YOLO training and Grounded-SAM-2 inference will be too slow to be practical (estimate 10–20× the times below).
- ~150 GB free disk if you grab the full 100K-images bundle. ~50 GB if you stick to the 10K subset (the recommended fast path).
- CPU + 32 GB RAM is plenty; FiftyOne is the heaviest CPU consumer and it's IO-bound on the embedding pass.
End-to-end wall-clock on a single RTX 4070 (12 GB):
- Curation + embedding: ~20 min.
- YOLOv8m fine-tune iter 0 (50 epochs, 2.5K imgs): ~2 h.
- Grounded-SAM-2 auto-label of 100 mined frames: ~5 min.
- YOLOv8m fine-tune iter 1: ~2 h.
- Eval, mining, plotting: ~30 min total.
Plan for ~6 hours of compute, spread over two sittings.
Setup
```bash
cd projects/17-bdd100k-data-engine
chmod +x setup.sh
./setup.sh   # creates .venv, installs deps, clones Grounded-SAM-2,
             # downloads SAM 2.1 + Grounding DINO + YOLOv8m weights
source .venv/bin/activate
```

`setup.sh` does not download BDD100K (the license requires login + manual acceptance). You need to do that yourself:
- Make a free academic account at https://bdd-data.berkeley.edu/. The license permits research and not-for-profit use; commercial use requires contacting Berkeley OTL.
- Download the 10K Images bundle (~6 GB) from the portal — this is a labeled subset that is enough for the capstone walkthrough. If you want the full 10-hour video-equivalent, also download the 100K Images bundle (~70 GB).
- Download the Detection 2020 labels (`bdd100k_det_20_labels_trainval.zip`, ~80 MB).
- Unpack so the layout matches:

```
data/bdd100k/
  images/
    10k/
      train/<id>.jpg
      val/<id>.jpg
      test/<id>.jpg
    100k/            # only if you grabbed the full bundle
      train/<id>.jpg
      val/<id>.jpg
  labels/
    det_20/
      det_train.json
      det_val.json
```
The notebook will refuse to run §1 if the labels are missing, with a helpful error pointing back here.
Steps (mapped to notebook sections)
The notebook (notebook.py) is organized as 8 numbered sections, each
self-contained, each with a checkpoint at the end. Resuming from any
section just requires having the previous section's outputs on disk.
- §1 — Collect. Load BDD100K via FiftyOne's loader, attach native weather / timeofday / scene tags, do a stratified subsample to ~36K frames (10 hours-equivalent at ~1 frame per video). Save the FiftyOne dataset to disk.
- §2 — Curate (embed + slice). Compute CLIP ViT-B-32 embeddings on every kept frame (with a `--use-dinov2` flag for the alternative). Build a slice index: the cartesian product of weather × timeofday × scene, with frame counts. Plot the long-tail distribution.
- §3 — Train pool / eval pool. Stratified split: 2500 frames for training, 500 frames for eval (balanced across the top 20 most common slices), the rest as the unlabeled mining pool. Convert BDD's `det_20` JSON into YOLO format for the training pool. Emit the YOLO dataset YAML.
- §4 — Train iter 0. Fine-tune YOLOv8m for 50 epochs on the GT training pool. Save weights to `outputs/runs/iter0/`.
- §5 — Eval iter 0 (per-slice). Evaluate on the eval pool; compute per-slice mAP@0.5 by filtering predictions/GT by slice tag before calling Ultralytics' validator. Identify the worst slice (argmin mAP among slices with ≥15 eval frames).
- §6 — Mine. Take the eval-set failures from the worst slice (frames where iter-0 mAP=0 or recall<0.3), average their CLIP embeddings into a "failure prototype", run k-NN against the unlabeled mining pool, return the top-100 mined frames.
- §7 — Auto-label + retrain (iter 1). Run Grounded-SAM-2 on the 100 mined frames with a BDD-class prompt list. Convert to YOLO format. Spot-check 20 frames against the original BDD labels (if the mined frame happened to be in the labeled split — often a few are) and report auto-label precision/recall. Retrain YOLOv8m on train ∪ mined; weights → `outputs/runs/iter1/`.
- §8 — Eval iter 1 + delta. Re-run the per-slice evaluation; diff against iter 0; produce the headline table and the `outputs/figures/slice_delta.png` chart. Pickle everything needed for the case study.
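The §5 worst-slice pick is mechanical once the per-slice table exists; a minimal sketch of the argmin-with-threshold rule (function and variable names are illustrative, not the notebook's):

```python
def worst_slice(per_slice_map, eval_counts, min_frames=15):
    """Return the slice with the lowest mAP@0.5, ignoring slices with
    too few eval frames to trust the estimate."""
    eligible = {s: m for s, m in per_slice_map.items()
                if eval_counts.get(s, 0) >= min_frames}
    if not eligible:
        raise ValueError("no slice meets the min-frame threshold")
    return min(eligible, key=eligible.get)

# Made-up numbers: the snowy slice is excluded (only 8 eval frames),
# so the argmin falls on the fog slice despite snowy's low score.
maps = {"foggy/day/highway": 0.11, "clear/day/city": 0.58, "snowy/night/city": 0.18}
counts = {"foggy/day/highway": 18, "clear/day/city": 25, "snowy/night/city": 8}
```

The threshold matters: without it, a slice with 3 eval frames and one unlucky miss would win the argmin every time.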
Each section ends with a `# %% [markdown] ## Checkpoint` cell that prints what was saved, what to verify visually, and what's safe to delete if you need to restart from that section.
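The §6 mining step is, at its core, a few lines of vector math: average the failure embeddings into a prototype and take cosine nearest neighbours over the unlabeled pool. A minimal numpy sketch (names are illustrative; the notebook's implementation may differ):

```python
import numpy as np

def mine_top_k(failure_embs, pool_embs, pool_ids, k=100):
    """failure_embs: (n_fail, d) CLIP embeddings of the worst-slice failures.
    pool_embs: (n_pool, d) embeddings of the unlabeled mining pool.
    Returns the ids of the k pool frames closest to the failure prototype."""
    proto = failure_embs.mean(axis=0)
    proto = proto / np.linalg.norm(proto)
    pool = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = pool @ proto               # cosine similarity to the prototype
    order = np.argsort(-sims)[:k]     # highest similarity first
    return [pool_ids[i] for i in order]
```

At 33K pool frames a brute-force dot product is instant; a FAISS index only becomes worthwhile at much larger scales.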
Subsampling strategy (justified)
10 hours of BDD ≈ (10 × 3600 s) / (40 s per video) = 900 videos. Each video has one labeled keyframe, so naively that's 900 frames — too few for a credible detector training run. Instead, we treat "10 hours equivalent" as a budget on raw video time and target ~36K frames at roughly one frame per ~1 second of source video, sourced as follows:
- Anchor on the labeled 10K subset (10K frames, ~6 GB). This is the labeled split with detection annotations. Use it as both the training pool source (via stratified sampling, ~2.5K frames) and the eval pool source (slice-balanced, 500 frames).
- Optionally extend with the unlabeled 100K subset (one keyframe per video, 100K total) for the mining pool. This is the pool that "iter 1 mines from"; we don't need labels here because Grounded-SAM-2 will produce them. Take a 33K-frame stratified sample of the unlabeled-to-us frames (i.e. all 100K minus the 10K that overlap with labeled splits, then stratified by weather × timeofday).
- Stratification is non-negotiable. Pure-random subsampling of BDD lands you in a 50%+ `clear / daytime / city street` set and you'll never see fog or snow at training time. Stratify by the cartesian product of weather × timeofday (18 buckets after dropping `undefined`); cap each bucket at its share of frames in the source pool, plus a min-floor of 50 frames per slice so rare buckets aren't lost.
Total: ~36K frames if you grab the unlabeled extension, ~10K frames if you only use the labeled subset. Both work — the unlabeled extension just gives the mining step in §6 more candidates to pull from.
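The cap-plus-floor rule above fits in a few lines of stdlib Python. A sketch under the stated assumptions (`bucket_of` is a hypothetical callable that maps a frame to its weather × timeofday bucket):

```python
import random
from collections import defaultdict

def stratified_sample(frames, bucket_of, budget, floor=50, seed=42):
    """Sample ~`budget` frames, giving each bucket its proportional share
    but never fewer than `floor` (or the whole bucket, if smaller)."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for f in frames:
        buckets[bucket_of(f)].append(f)
    total = len(frames)
    picked = []
    for items in buckets.values():
        share = round(budget * len(items) / total)   # proportional cap
        n = min(len(items), max(floor, share))       # min-floor for rare buckets
        picked.extend(rng.sample(items, n))
    return picked
```

Note the floors can push the total slightly above `budget`; a fussier implementation would renormalize the head buckets, but at this scale the overshoot is noise.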
Done criterion
You're done when all of the following exist:
- A 4-page case study at `outputs/case_study/case_study_v1.md` (or `.pdf`), filled in from `case_study_template.md`. Real numbers, real charts, no placeholders.
- A 1-figure architecture diagram at `outputs/figures/architecture.png` — a clean version of the Mermaid diagram in §2 of the case study.
- A measurable delta on the worst slice. The headline table in §7 of the case study has at least one slice with a non-trivial positive Δ mAP after one full flywheel turn. "Non-trivial" means either (a) ≥ +0.05 mAP@0.5 on a slice with ≥15 eval frames, or (b) a clearly explained reason why the delta is small (e.g. auto-label quality dominated, or the mined frames were off-distribution — both are publishable findings).
- An honest limitations section. §9 of the case study, six bullets minimum. Don't gold-plate; be candid about the toy-scale problems.
The bar is not "I trained a great detector". The bar is "I can show a DI hiring panel a clean before/after table on the worst slice and discuss what would change at 1000× scale."
Common pitfalls
- BDD label-format quirks. The det_20 JSON has objects with `category` strings that include `"traffic light"` (with space) and `"traffic sign"` (with space) — a naive `category.replace(" ","_")` will diverge from the YOLO names list. The notebook handles this in `to_yolo_label()`; if you reimplement, mirror that mapping exactly.
- ODD-tag schema choices. The temptation is to invent your own weather buckets ("clear+overcast as one"). Don't, on the first pass — use BDD's native attributes verbatim so you can defend the slice analysis without anyone questioning your taxonomy. Reserve schema redesign for a v2 of the project.
- CLIP vs. DINOv2 for mining. CLIP's text-image alignment is helpful when you can describe the failure mode in words ("foggy highway"), but it is style-biased — it can retrieve cinematic "foggy" content that isn't actually similar driving conditions. DINOv2 gives stronger pure-visual nearest neighbours but you lose text grounding. The notebook lets you toggle. Both are weaker than a domain-pretrained embedding would be at customer scale — note this in the case study.
- Train/test leakage by video. BDD's 100K-images set has one keyframe per 40-second video, so frame-level splits are reasonably safe. But the 10K subset has some near-duplicate scenes between train and val. If you use the 10K subset for both training pool and eval pool (which is the default fast path), spot-check that no eval frame's filename prefix shows up in the training pool. Better: hold out by `video_id` if you've parsed it from filenames.
- YOLOv8 dataset YAML paths. Ultralytics resolves `train:`/`val:` paths relative to the YAML's directory, not relative to cwd. The notebook writes the YAML next to the data and uses relative paths — replicate that, or you'll get cryptic "no labels found" errors at training time.
- Auto-label prompt drift. Grounded-SAM-2 with the prompt `"car . truck . bus . pedestrian . bicycle . motorcycle . traffic light . traffic sign . train . rider"` does well on the first 6 classes and poorly on `train` (rare in BDD anyway), `rider` (collides with pedestrian+bicycle), and `traffic sign` (over-fires on text and signage). Don't naively trust auto-label class IDs; the notebook reports per-class precision/recall against GT, and you should expect 70–85% for the head classes.
- Grad-CAM / explanation visualizations. Optional but recruiter-impressive. The notebook includes a stub. Skip on the first pass; revisit when you're polishing the case study.
- Cost-per-label numbers are illustrative. Real human-labeling costs vary 5–20× across vendors and class complexity. Quote a range in the case study and cite the assumption.
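To make the label-format pitfall concrete, here is a minimal sketch of the category mapping and box conversion. The class order shown is an assumption for illustration — whatever order you use must match the `names:` list in your dataset YAML exactly, and you should mirror the notebook's converter rather than this sketch:

```python
# Sketch only: class order here is an assumption, not the notebook's canon.
BDD_CLASSES = ["car", "truck", "bus", "pedestrian", "bicycle", "motorcycle",
               "traffic light", "traffic sign", "train", "rider"]
CLASS_ID = {name: i for i, name in enumerate(BDD_CLASSES)}

def det20_obj_to_yolo(obj, img_w, img_h):
    """Convert one det_20 object ({'category', 'box2d': {x1,y1,x2,y2}})
    to a YOLO txt line: 'cls cx cy w h', all normalized to [0, 1]."""
    cls = CLASS_ID[obj["category"]]   # keep the space in "traffic light"
    b = obj["box2d"]
    cx = (b["x1"] + b["x2"]) / 2 / img_w
    cy = (b["y1"] + b["y2"]) / 2 / img_h
    w = (b["x2"] - b["x1"]) / img_w
    h = (b["y2"] - b["y1"]) / img_h
    return f"{cls} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"
```

Looking categories up verbatim (spaces and all) is the whole point — the `KeyError` you get on an unexpected category is far easier to debug than a silently shifted class index.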
Further reading
- BDD100K original paper — Yu et al., "BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning" (CVPR 2020). https://arxiv.org/abs/1805.04687
- BDD100K format docs — https://doc.bdd100k.com/format.html
- FiftyOne BDD100K loader — https://docs.voxel51.com/dataset_zoo/datasets/bdd100k.html
- FiftyOne Brain `compute_similarity` — https://docs.voxel51.com/brain.html
- Ultralytics YOLO docs (training & val API) — https://docs.ultralytics.com/
- Grounded-SAM-2 repo — https://github.com/IDEA-Research/Grounded-SAM-2
- Grounded-SAM-2 auto-label tutorial — https://blog.roboflow.com/label-data-with-grounded-sam-2/
- DINOv2 paper — Oquab et al., 2023, https://arxiv.org/abs/2304.07193
- Voxel51 BDD blog — https://voxel51.com/blog/exploring-the-berkeley-deep-drive-autonomous-vehicle-dataset
- Long-tail mining papers worth skimming:
  - Sener & Savarese, "Active Learning for Convolutional Neural Networks: A Core-Set Approach", ICLR 2018.
  - Kirsch et al., "BatchBALD", NeurIPS 2019 (the principled version of what we do heuristically here).
- Roadmap docs in this repo:
  - `docs/01-av-industry-and-data.md` §3 (long-tail Pareto).
  - `docs/04-labeling-and-data-curation.md` §3, §5 (labeling pyramid, auto-label gates).
  - `docs/06-open-problems-and-benchmarks.md` (what BDD is and isn't).
When you finish: commit the case study, push, and move on to project 18 (strategy memo). The strategy memo will reference this project's numbers as concrete evidence. Keep the iter-0 → iter-1 table handy.
Files in this project
- README.md
- case_study_template.md
- notebook.py
- requirements.txt
- setup.sh
The notebook (`notebook.py`) is in jupytext percent format — open it in VS Code or convert with `jupytext --to notebook`.
Memo prompts
BDD100K Mini Data-Engine — Case Study
Fill this in as you work through `notebook.py`. Target length: ~4 pages (≈1500–2000 words) plus one architecture figure and 4–6 charts. Save the final version as `outputs/case_study/case_study_v1.md` and export to PDF.

The point of this document is not to claim a SOTA detector. The point is to walk through one full turn of the AI Data Intelligence flywheel — collect → curate → label → train → eval → mine → re-label → retrain — on a public driving dataset, and report the delta the loop produced on the worst-performing slice. A reader from Applied Intuition's Data Intelligence team should be able to skim this in five minutes and recognize the same workflow they sell to customers, just at 1/1000 scale.
1. Problem statement (≈200 words)
Setting. One paragraph: what the toy "AV team" you're playing the role of is trying to ship. Example: "An L2+ highway pilot for a customer operating in the Pacific Northwest. The customer's ODD spans clear, overcast, and rainy daytime conditions plus daytime fog (uncommon but critical) and nighttime urban driving. The detector under evaluation is a YOLOv8m-class model trained on a fixed budget of ~3000 labeled frames."
Hypothesis. One sentence: which slice you suspect will be the worst, and why. Reference Phase 0 reading on long-tail distributions.
Success metric. The number you'll point to at the end: mAP@0.5 on the worst slice, before vs. after one mining + retraining iteration. Per-class breakdowns are secondary; the headline is the slice delta.
2. Pipeline architecture
Replace the ASCII below with a real figure (outputs/figures/architecture.png)
once you have the loop running. Mermaid source for editing:
```mermaid
flowchart LR
    A[BDD100K raw<br/>~100k frames] -->|stratified subsample<br/>10 hours ≈ 36k frames| B[FiftyOne dataset]
    B -->|CLIP ViT-B-32<br/>embeddings| C[Slice index<br/>weather × time × scene]
    C -->|hold-out by slice| D[Train pool<br/>~2.5k frames]
    C -->|hold-out by slice| E[Eval pool<br/>~500 frames<br/>balanced by slice]
    D -->|GT labels: BDD det_20<br/>auto-labels: GSAM-2| F[YOLOv8m fine-tune<br/>iter 0]
    F -->|val on E| G[Per-slice mAP table]
    G -->|argmin slice| H[Worst slice<br/>e.g. fog/night]
    H -->|CLIP nearest-neighbour<br/>in unlabeled pool| I[100 mined clips]
    I -->|GSAM-2 auto-label| J[YOLOv8m fine-tune<br/>iter 1<br/>train ∪ mined]
    J -->|val on E| K[Per-slice mAP table v2]
    G --> L[Δ mAP per slice]
    K --> L
```

Tip: the figure is the resume artifact. Make it clean. Export from Mermaid Live Editor as SVG, then PNG at 1200 px wide.
3. Numbers per stage
Fill this table from the print-outs in notebook.py. Numbers below are
illustrative placeholders; replace them.
| Stage | Input | Output | Time | Disk |
|---|---|---|---|---|
| Collect | 100K BDD frames | 36K frames (10 hr equiv) | 5 min | 6 GB |
| Curate (embed) | 36K frames | 36K × 512-dim CLIP vectors | 18 min on RTX 4070 | 75 MB |
| Slice | 36K frames | 60 (weather × time × scene) buckets | <1 min | — |
| Train pool selection | 36K | 2.5K (stratified) | <1 min | — |
| Eval pool selection | 36K | 500 (balanced 25/slice for top 20 slices) | <1 min | — |
| Train iter 0 (YOLOv8m, GT labels) | 2.5K frames × 50 epochs | weights | 2 h 10 min | 200 MB |
| Eval iter 0 | 500-frame eval pool | per-slice mAP | 2 min | — |
| Mine | unlabeled 33K + worst-slice errors | 100 candidate clips | 4 min | — |
| Auto-label (GSAM-2) | 100 mined frames | det labels (10 classes) | ~3 min | 8 MB |
| Train iter 1 (YOLOv8m, GT + auto) | 2.6K frames × 50 epochs | weights | 2 h 15 min | 200 MB |
| Eval iter 1 | 500-frame eval pool | per-slice mAP | 2 min | — |
4. Curation findings — slice imbalance
Insert outputs/figures/slice_distribution.png (a stacked bar chart of
weather × time-of-day from FiftyOne).
ODD attributes used. Following BDD's native annotation:
- weather ∈ {clear, overcast, rainy, snowy, foggy, partly cloudy, undefined}
- timeofday ∈ {daytime, night, dawn/dusk, undefined}
- scene ∈ {city street, highway, residential, parking lot, gas stations, tunnel, undefined}
Imbalance observations (fill in your numbers):
- Most common slice: e.g. `clear / daytime / city street` — e.g. 38% of frames.
- Rarest operational slices (excluding `undefined`): e.g. `foggy / daytime / highway` (0.4%), `snowy / night / residential` (0.2%).
- Number of slices with <50 frames: e.g. 9 of 60.
This is the long-tail Pareto that motivates the entire data-engine business. Reference: roadmap doc 04-labeling-and-data-curation §3.
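The three imbalance numbers above fall out of a single counter over the native attributes. Field access here is schematic (the notebook reads these from FiftyOne sample fields; the dict shape is an assumption for illustration):

```python
from collections import Counter

def slice_table(samples):
    """samples: iterable of dicts carrying BDD's native 'weather',
    'timeofday', 'scene' attributes. Returns (counter, summary dict)."""
    counts = Counter((s["weather"], s["timeofday"], s["scene"]) for s in samples)
    total = sum(counts.values())
    head_slice, head_n = counts.most_common(1)[0]
    rare = [s for s, n in counts.items() if n < 50]   # the <50-frame tail
    return counts, {
        "head_slice": head_slice,
        "head_share": head_n / total,
        "n_rare_slices": len(rare),
    }
```

The same counter also feeds the stacked bar chart — plot `counts` grouped by weather, stacked by time-of-day.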
5. Iteration 0 — baseline detector
Training config. YOLOv8m, 50 epochs, imgsz=640, batch=16, optimizer AdamW, cosine LR, warmup 3 epochs, COCO weights as init.
Overall metrics. Fill in:
| metric | value |
|---|---|
| mAP@0.5 (overall) | e.g. 0.42 |
| mAP@0.5:0.95 (overall) | e.g. 0.24 |
| recall@IoU≥0.5 (car) | e.g. 0.71 |
| recall@IoU≥0.5 (pedestrian) | e.g. 0.55 |
Per-slice mAP@0.5 (top 8 slices, sorted ascending). This is the table that drives the rest of the case study.
| slice (weather / time / scene) | n_eval | mAP@0.5 |
|---|---|---|
| foggy / daytime / highway | e.g. 18 | e.g. 0.11 |
| snowy / night / city street | e.g. 22 | e.g. 0.18 |
| rainy / night / city street | e.g. 25 | e.g. 0.21 |
| overcast / dawn-dusk / highway | e.g. 25 | e.g. 0.27 |
| clear / night / city street | e.g. 25 | e.g. 0.34 |
| rainy / daytime / city street | e.g. 25 | e.g. 0.39 |
| clear / daytime / highway | e.g. 25 | e.g. 0.51 |
| clear / daytime / city street | e.g. 25 | e.g. 0.58 |
The overall number hides everything. The worst slice is 3-5× worse than the best slice. This gap is the value proposition of a data-engine: a generic detector training run cannot see this gap; only a slice-aware eval can.
6. Worst-slice analysis
Slice chosen: fill in, e.g. foggy / daytime / highway.
Failure-mode hypothesis. One paragraph. Examples:
"On 18 fog-highway frames the detector misses ~70% of vehicles >50 m
away. Class confusion is minimal — the issue is recall, not precision.
Backbone activations on these frames look diffuse (visualized via Grad-CAM
in outputs/figures/gradcam_fog.png), suggesting the model never saw
enough low-contrast vehicle silhouettes during training."
Visual evidence. Insert 4–6 thumbnails from
outputs/figures/worst_slice_failures.png showing predictions vs. GT.
Mining strategy. We embed the failure cases themselves (the 18 fog frames) with CLIP ViT-B-32, average the embeddings to produce a "fog prototype" vector, and run k-NN against the 33K-frame unlabeled pool. Top-100 is what we mine.
Why this works at toy scale. CLIP captures coarse weather/lighting cues even though it was never trained on driving labels. At industrial scale, dedicated AV-pretrained embeddings (e.g. DINOv2 fine-tuned on internal video) close the gap further — but the workflow is identical.
7. Mining + auto-label + retraining (iteration 1)
Mined frames inspected. Insert
outputs/figures/mined_frames_grid.png (a 10×10 grid). Sanity check:
how many of the 100 are actually fog/highway? e.g. 78/100 — the rest
are overcast highway, which is fine; they're still informative.
Auto-label cost vs. ground-truth cost.
| labeling source | frames | wall-clock | $ at $0.50/frame human-equiv |
|---|---|---|---|
| Human (BDD baseline) | 100 | ~3 hours | $50 |
| Grounded-SAM-2 auto-label | 100 | ~3 minutes | <$0.10 (compute only) |
Auto-label quality (sampled 20 frames against human). Fill in.
| metric | value |
|---|---|
| precision (matched IoU≥0.5) | e.g. 0.84 |
| recall | e.g. 0.71 |
| classes most often missed | e.g. traffic light, bicycle |
Retraining config. Same as iter 0 but training set = 2500 GT + 100 auto-labeled mined frames. Validation set unchanged.
Per-slice mAP@0.5 — iter 1 vs iter 0. This is the headline table.
| slice | iter 0 | iter 1 | Δ |
|---|---|---|---|
| foggy / daytime / highway (mined) | e.g. 0.11 | e.g. 0.19 | +0.08 |
| snowy / night / city street | e.g. 0.18 | e.g. 0.18 | 0.00 |
| rainy / night / city street | e.g. 0.21 | e.g. 0.22 | +0.01 |
| overcast / dawn-dusk / highway | e.g. 0.27 | e.g. 0.29 | +0.02 |
| clear / night / city street | e.g. 0.34 | e.g. 0.34 | 0.00 |
| rainy / daytime / city street | e.g. 0.39 | e.g. 0.40 | +0.01 |
| clear / daytime / highway | e.g. 0.51 | e.g. 0.51 | 0.00 |
| clear / daytime / city street | e.g. 0.58 | e.g. 0.57 | -0.01 |
| overall | e.g. 0.42 | e.g. 0.43 | +0.01 |
Key reading of the table. Targeted mining lifts the worst slice without regressing the head of the distribution. Overall mAP barely moves (+0.01) because the worst slice is rare — and that is exactly the point: at population-level metrics, this work would look invisible. You only see the value in slice-aware eval. This is the central argument of the AI Data Intelligence pitch.
8. Discussion — what this looks like at AI scale
The data-engine in this notebook is ~36K frames, one detector, one worst slice, one iteration. Applied Intuition's customers operate the same flywheel at orders of magnitude larger. Here is what changes.
1. Scale of the unlabeled pool. At customer scale, the "unlabeled pool" is petabytes of fleet logs, not 33K frames. Embedding-based mining must run on a vector index that supports streaming inserts (LanceDB, Milvus, or a custom Spanner-backed index). The notebook's FAISS-CPU step would become a managed service.
2. Embedding choice. We used vanilla CLIP because it works on a laptop. At scale, customers train driving-specific video embeddings (e.g. self-supervised DINOv2 on raw fleet footage) so that "fog prototype" actually generalizes to their truck fleet, not the generic internet. The Phase 0 reading on representation learning becomes load-bearing here.
3. Slice schema. Our ODD schema (weather × time × scene) is hand-coded. At scale, ODD definitions are a contract between AV team and platform — typed schemas, versioned, tested. AI's "Data Explorer" essentially exposes this contract as a queryable surface.
4. Auto-label quality gates. Our auto-labels were spot-checked on 20 frames. At scale, every auto-label flows through a confidence-based router: top-confidence labels go straight to training, mid-confidence go to human review, low-confidence are discarded. This is the labeling pyramid from doc 04 §5.
5. Retraining cadence. Our iter 0 → iter 1 was a Tuesday afternoon. At scale, retraining is a nightly or weekly job triggered by slice-regression alerts; the slice-mAP dashboard is a first-class product surface.
6. The customer's job-to-be-done. From their POV the workflow is: "my detector regressed on fog at night last week — show me the new fog-night clips, label them, retrain, ship." Everything else is plumbing. This notebook is that plumbing in miniature.
9. Limitations
- Sample size. 36K frames is two orders of magnitude smaller than a realistic AV training set. The slice-mAP estimates have high variance; the iter 0 → iter 1 delta should be interpreted as direction, not magnitude.
- One iteration. A real flywheel runs many turns; we stopped at one. Returns diminish but the workflow is the same.
- Single random seed. Reproducing the train run with three seeds would tighten error bars; deferred for time.
- Auto-label classes. Grounded-SAM-2 with a 10-class prompt list systematically under-detects `train` (rare, prompt ambiguous with "train tracks") and `rider` (category collision with pedestrian+bicycle). A real pipeline would tune the prompt + post-process.
- Test-set leakage risk. We held out by frame, not by video. Frames from the same 40-second clip can land in train and val. Holding out by `video_id` would be more honest; flagged as future work.
- CLIP ≠ driving expert. CLIP nearest-neighbour mining drifts toward visual style (color, lighting) more than semantic content (e.g. it retrieves snowy mountain landscapes as "foggy" if the lighting matches). A real customer would use a domain-trained embedding.
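The leakage fix flagged above is a group-wise holdout; a stdlib sketch (the `group_key` you pass for BDD filenames is an assumption — verify your filename parse actually recovers the video id before relying on it):

```python
import random

def split_by_group(filenames, group_key, val_frac=0.2, seed=42):
    """Hold out whole groups (e.g. video ids) so near-duplicate frames
    from one clip never straddle the train/val boundary."""
    rng = random.Random(seed)
    groups = sorted({group_key(f) for f in filenames})
    rng.shuffle(groups)
    n_val = max(1, int(len(groups) * val_frac))
    val_groups = set(groups[:n_val])
    train = [f for f in filenames if group_key(f) not in val_groups]
    val = [f for f in filenames if group_key(f) in val_groups]
    return train, val
```

The resulting val fraction is only approximately `val_frac` (groups have unequal sizes); that's the price of honesty about duplicates.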
10. Reproduction notes
- Code: `notebook.py` in this directory.
- Dataset: BDD100K 10K subset, downloaded May 2026; SHA-256 of the label archive recorded in `outputs/case_study/data_provenance.txt`.
- Compute: single RTX 4070 (12 GB VRAM), ~6 hours wall-clock end-to-end.
- Random seed: 42 throughout (set at the top of `notebook.py`).
Author: <your name>. Date: <YYYY-MM-DD>. Project 17 of the Physical AI roadmap. Part of preparation for Applied Intuition Data Intelligence.