
Project 16: Active-learning loop — labeled-data efficiency

What this project is. A focused, laptop-scale exploration of the inner loop of an AI data engine: given a fixed labeling budget, how do you pick the most informative frames to label next? You will implement five canonical active-learning (AL) techniques — uncertainty sampling, BALD, k-center coreset, BADGE, and a hybrid — against a small AV classification task, and produce one chart: labeled-data-efficiency curves for each method versus random sampling. The deliverable is the chart plus three numbers (the accuracy gain over random at 25%, 50%, 75% of the labeled budget).

Why this exists. Project 17's capstone mines hard examples by CLIP-distance to known failures. That works, but it's one heuristic among many. Production data engines — Tesla's "trigger-conditioned uploads", Waymo's hard-mining, every Applied Intuition Axion deployment — combine uncertainty and diversity signals in a formal loop. This project fills that gap. After it, you can answer the question "if you replaced project 17's CLIP-distance mining with BADGE, what would change?" with a chart, not a hand-wave.

Goal

Implement five active-learning sampling strategies against a small classification task and produce a single learning-curve figure that answers: for each AL method, how many labels do I need to match the accuracy of training on the full labeled set? The number you point to is the accuracy gain over random sampling at 50% labeled budget — the budget regime where the gap between methods is widest and the business case (halve your labeling spend) is most legible.

The chosen task is BDD100K weather classification — 7 classes (clear / partly-cloudy / overcast / rainy / snowy / foggy / undefined), roughly 70K training images, easy to iterate on a laptop GPU and directly tied to project 17's slice taxonomy. A CIFAR-10 fallback is provided in setup.sh for environments without BDD100K access; the code paths are identical.

Loops touched

This project lives in two of the four data-engine loops:

  1. CURATE — the embedding-space part of every AL method (k-center, BADGE's k-means++, the diversity term in the hybrid) uses the same DINOv2 / SigLIP embeddings you computed in projects 01 and 08. Active learning is curation with a model in the loop.
  2. LABEL & TRAIN — the entire point of AL is to reduce the LABEL cost while preserving TRAIN quality. Every iteration of the loop is a label-then-train round; the curve we produce is labels-vs-accuracy.

It deliberately does not touch COLLECT (we assume an unlabeled pool is given, as in project 17) or EVAL (we report accuracy on a held-out set; per-slice eval is project 17's job).

Why this matters for AI Data Intelligence

Active learning is the single technique with the highest direct ROI for a data-intelligence org. The pitch is short:

  • Labels are the dominant marginal cost. A 2D bbox is $0.05–$0.20 per object; a 3D cuboid with attributes is $0.50–$2.00; a video segment with multi-actor interaction can run $5–$20. At fleet scale — millions of frames per week — the labeling line item is in the tens of millions per year. Halving the labeling budget while matching model quality is a direct opex cut.
  • The "30% of labels for the same model" claim is real and measurable. On detection benchmarks, BADGE and coreset methods routinely match random-sampling-at-100% with 40–60% of the labels. This project shows you the curve on AV-flavoured data so you can cite the specific number in interviews.
  • It's the technique behind every "data engine" pitch you've read about. Tesla's trigger-conditioned uploads ("only send back frames where the shadow detector disagrees with the lane model") are uncertainty sampling at fleet scale. Waymo's hard-example mining is uncertainty + diversity. Applied Intuition's Axion is a productized active-learning loop wrapped around a customer's stack. If you can't reason about which AL strategy is appropriate for a given budget regime and data-shift profile, you can't build the product.
  • It connects directly to project 17. Project 17 mines by CLIP-distance to known failure clusters. That is a diversity-only strategy — it has no notion of which mined frames the current model is uncertain about. BADGE replaces the cluster-distance mining with a single sampling step that simultaneously prefers uncertain and diverse points; the only added cost is one pseudo-gradient computation per unlabeled example. The trade-off between "approximate coreset on a sample" (project 17) and "model-aware AL on a sample" (this project) is the architecture question every AL deployment ultimately faces.

The honest counterweight: AL is not magic, and on highly redundant fleet data its gains can shrink to near-zero unless you combine it with deduplication first. We will see this in the cold-start regime (§ "Common pitfalls" below).

Prerequisites

  • Project 17 finished (or at least its data step) — the BDD100K download is reused. If you don't have BDD100K, the setup script bootstraps a CIFAR-10 fallback so you can still run the loop.
  • Comfort with PyTorch training loops. We do not use a high-level trainer (Lightning / HF Trainer); the AL loop is easier to read when the train step is one explicit function.
  • ~30 GB free disk (BDD100K weather subset + DINOv2 embeddings + per-iteration model checkpoints).

Hardware

  • Minimum: Apple-silicon laptop (M1/M2/M3), 16 GB RAM. The classification model is a ResNet-18 or a linear head on frozen DINOv2 features; one full AL run (5 strategies × 6 budget points × 3 seeds) finishes in 3–5 hours on an M2 Pro.
  • Recommended: any consumer GPU with ≥6 GB VRAM. Same run finishes in under 1 hour on an RTX 3060.
  • The deliberate scale choice is "laptop GPU, hours not days". Active-learning research papers often run on CIFAR-10 / SVHN precisely because the AL signal is what you want to study, not the absolute accuracy. We mirror that.

Setup

cd projects/16-active-learning-loop
./setup.sh           # creates .venv, installs requirements, fetches data
source .venv/bin/activate
jupytext --to notebook notebook.py    # optional; or open notebook.py in VSCode

setup.sh is idempotent. It tries to symlink BDD100K from projects/17-bdd100k-data-engine/data/ if you've already downloaded it; otherwise it falls back to CIFAR-10 (auto-downloaded by torchvision). Both paths use the same notebook code.

Steps

The notebook walks through twelve numbered sections (the last two optional); each one ends in a Checkpoint cell that pickles the relevant arrays so you can resume. Hedged, illustrative code sketches of the main acquisition steps follow the list.

  1. Build the unlabeled pool. Load BDD100K weather labels (or CIFAR-10 fallback). Hold out 20% as a test set. Treat the remaining 80% as the unlabeled pool — we forget the labels and only allow the AL methods to "buy" them, one batch at a time.
  2. Embed the pool with DINOv2. Frozen ViT-S/14 features become the input to the diversity-based methods (k-center, BADGE) and a handy linear-probe baseline for sanity-checking the task. The embeddings are computed once and cached.
  3. Random-sampling baseline. Train at 5%, 10%, 25%, 50%, 75%, 100% labeled budgets. Three seeds per budget point. This is the curve every AL method must beat.
  4. Uncertainty sampling — three flavours. Implement Shannon entropy, margin (top-2 gap), and least-confidence acquisition functions. Each round, label the top-K most-uncertain unlabeled examples. Plot each variant against random.
  5. BALD via MC dropout. Add a dropout layer to the linear head, take K=20 stochastic forward passes, score each unlabeled example by mutual information between predictions and model parameters. Compare BALD with naïve entropy — they pick different examples and that's the point.
  6. k-center coreset. Greedy farthest-point-first selection in DINOv2 embedding space, seeded by the currently-labeled set. Use FAISS for the inner-loop nearest-neighbour query so it stays feasible at pool sizes >50K.
  7. BADGE. Compute the hallucinated last-layer gradient for each unlabeled example (uses the predicted hard label as a stand-in target), then run k-means++ seeding on those gradient embeddings to pick a batch that is both uncertain (large gradient norm) and diverse (well-separated in gradient space). One function, ~40 lines of NumPy.
  8. Hybrid: BADGE-lite. A simpler composition: shortlist the top-2K most-uncertain unlabeled examples by entropy, then apply k-center on their DINOv2 embeddings to pick the diverse subset. This is the version most useful in practice because it's cheap.
  9. One figure to rule them all. Plot all six curves (random + 5 AL methods) on a single labeled-data-efficiency chart. Tabulate the 25/50/75% gains. Save to outputs/figures/learning_curves.png.
  10. Reflection cells. Three short markdown discussions: which method dominates in which budget regime; cold-start failure modes we observed; the cost of computing uncertainty / coreset / BADGE at 10× and 100× pool size, and which approximations buy you back the cost.
  11. (Optional) Detection extension. A user-TODO cell that sketches how to swap classification → detection: the uncertainty signal becomes per-anchor entropy aggregated to per-image, the coreset/BADGE inputs become per-image embeddings, and the metric becomes mAP. The skeleton is there; running it is left as homework you can fold into project 17.
  12. (Optional) Connect to project 17. A second TODO cell: replace project 17's CLIP-distance mining of the worst slice with BADGE on the same pool, compare the slice-mAP delta. This is the moment the gap fill closes.
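
The sketches below are illustrative only — not the notebook's actual code — and every function and variable name in them is made up for exposition. First, step 2's embedding pass, assuming DINOv2's public torch.hub entry point (ViT-S/14, 384-dim features) and a standard ImageNet-normalised dataloader:

import torch
from torchvision import transforms

# DINOv2 ViT-S/14 via the public torch.hub entry point; frozen, eval mode
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

# 224x224 crops (a multiple of the 14-pixel patch size), ImageNet normalisation
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed_pool(loader, device="cpu"):
    # one frozen forward pass per image; step 2 caches the result to disk once
    dinov2.to(device)
    feats = [dinov2(x.to(device)).cpu() for x, _ in loader]
    return torch.cat(feats)        # (N, 384)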
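
Step 4's three uncertainty flavours, assuming probs is an (N, C) NumPy array of (temperature-scaled) softmax outputs over the unlabeled pool:

import numpy as np

def entropy_score(probs):
    # Shannon entropy of the predictive distribution; higher = more uncertain
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def margin_score(probs):
    # gap between the two most likely classes, negated so higher = more uncertain
    top2 = np.sort(probs, axis=1)[:, -2:]
    return -(top2[:, 1] - top2[:, 0])

def least_confidence_score(probs):
    # 1 minus the max class probability; higher = more uncertain
    return 1.0 - probs.max(axis=1)

def acquire_top_k(scores, labeled_mask, k):
    # naive batch-mode acquisition: top-k scores among still-unlabeled indices
    scores = np.where(labeled_mask, -np.inf, scores)
    return np.argsort(-scores)[:k]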
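
Step 5's BALD score under MC dropout, assuming the classifier head contains dropout and no batch-norm (so .train() only re-enables dropout): the mutual information is the entropy of the mean prediction minus the mean per-pass entropy.

import torch

@torch.no_grad()
def bald_scores(model, loader, n_mc=20, device="cpu"):
    model.train()          # keep dropout stochastic at inference time
    scores = []
    for x, _ in loader:
        x = x.to(device)
        # (n_mc, B, C): one softmax per stochastic forward pass
        p = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_mc)])
        mean_p = p.mean(dim=0)
        h_mean = -(mean_p * (mean_p + 1e-12).log()).sum(-1)     # H of the mean prediction
        mean_h = -(p * (p + 1e-12).log()).sum(-1).mean(0)       # mean per-pass entropy
        scores.append(h_mean - mean_h)                          # BALD = mutual information
    model.eval()
    return torch.cat(scores)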
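
Step 6's greedy farthest-point-first selection in its brute-force form (the notebook replaces the distance computation with a FAISS query for larger pools); emb is the cached (N, D) embedding matrix:

import numpy as np

def k_center_greedy(emb, labeled_idx, budget):
    emb = np.asarray(emb, dtype=np.float32)
    # distance from every pool point to its nearest already-labeled point
    min_dist = np.full(len(emb), np.inf)
    for j in labeled_idx:
        min_dist = np.minimum(min_dist, np.linalg.norm(emb - emb[j], axis=1))
    picked = []
    for _ in range(budget):
        i = int(np.argmax(min_dist))        # farthest point from all current centers
        picked.append(i)
        # adding i as a center can only shrink nearest-center distances
        min_dist = np.minimum(min_dist, np.linalg.norm(emb - emb[i], axis=1))
    return picked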
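
Step 7's BADGE: the cross-entropy gradient with respect to the last linear layer, using the predicted hard label as a stand-in target, is (p − onehot of the predicted class) outer-product the penultimate feature, so the gradient embedding can be formed without autograd. Memory is O(N·C·D), so shard it for very large pools; k-means++ seeding on those embeddings then gives a batch that is jointly uncertain (large gradient norm) and diverse:

import numpy as np

def badge_embeddings(probs, feats):
    # probs: (N, C) softmax outputs; feats: (N, D) penultimate-layer features
    n, c = probs.shape
    g = probs.copy()
    g[np.arange(n), probs.argmax(axis=1)] -= 1.0                 # p - onehot(predicted label)
    return (g[:, :, None] * feats[:, None, :]).reshape(n, -1)    # (N, C*D)

def kmeans_pp_seed(g, k, seed=0):
    # k-means++ seeding on the gradient embeddings = the BADGE batch
    rng = np.random.default_rng(seed)
    picked = [int(rng.integers(len(g)))]
    d2 = ((g - g[picked[0]]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        nxt = int(rng.choice(len(g), p=d2 / d2.sum()))   # sample proportional to squared distance
        picked.append(nxt)
        d2 = np.minimum(d2, ((g - g[nxt]) ** 2).sum(axis=1))
    return picked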
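
Step 8's hybrid composes the two previous sketches: an entropy shortlist followed by k-center on the shortlist's embeddings. Here the shortlist is twice the acquisition batch (one reading of "top-2K"); adjust if the notebook means a fixed 2,000:

import numpy as np
# reuses entropy_score() and k_center_greedy() from the sketches above

def badge_lite(probs, emb, labeled_mask, k, shortlist_mult=2):
    ent = np.where(labeled_mask, -np.inf, entropy_score(probs))
    shortlist = np.argsort(-ent)[: shortlist_mult * k]      # most-uncertain candidates
    # k-center over the shortlist's embeddings picks the diverse subset
    picked = k_center_greedy(emb[shortlist], labeled_idx=[], budget=k)
    return shortlist[picked]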

Done criterion

You are done when all of the following are true:

  • outputs/figures/learning_curves.png exists and shows six curves (random + uncertainty + BALD + k-center + BADGE + hybrid), averaged over 3 seeds, with shaded ±1 std bands.
  • outputs/tables/budget_gains.csv lists, for each AL method, the accuracy delta over random at the 25%, 50%, and 75% budget points, with seed-stddev.
  • At least one method shows a positive gain at the 25% and 50% points (failure to beat random at 75% is expected — the gap closes as labels become abundant). If no method beats random at any budget, your training pipeline has a bug; the reflection cells flag this case and what to check.
  • The reflection markdown cells answer, in your own words: (a) which method wins in low-budget vs mid-budget, (b) why coreset alone often loses to BADGE, (c) what changes if the unlabeled pool has near-duplicates.

Common pitfalls

Five (well, eight) failure modes that bite real AL deployments and will probably bite you in this notebook too:

  • Uncalibrated uncertainty. A neural net's softmax is not a calibrated probability. Entropy / margin / least-confidence scores from an uncalibrated model can be systematically biased — the model is overconfident on out-of-distribution inputs, exactly the inputs you'd most want AL to flag. Mitigation: temperature-scale the model on a held-out calibration set before computing acquisition scores (a minimal sketch follows this list). The notebook does this in §4.
  • Coreset cost at scale. Greedy k-center is O(N · M) where N is pool size and M is selected size. At fleet scale (N=10⁸) it's infeasible without approximations. The notebook uses FAISS to keep the inner NN query at O(log N), and the reflection cell discusses the standard production trick: run k-means++ on a 1% sample first.
  • Batch-mode vs sequential AL. True sequential AL (label one, retrain, repeat) is the gold-standard but unaffordable at any real scale. Batch-mode AL (label B, retrain) is what production uses. The methods are not equivalent — naïve "top-K by uncertainty" picks B near-duplicates of the same hard example. BADGE and coreset are explicitly batch-aware; entropy/BALD on their own are not. The notebook compares "top-K entropy" vs "k-center on the top-K-uncertain shortlist" to make this concrete.
  • Cold-start. The first AL round has no model, or a useless random-init model. Uncertainty signals are noise. TypiClust and BADGE-with-self-supervised-embeddings exist precisely for this regime; the notebook's hybrid uses DINOv2 embeddings (which need no labels) for the first 1–2 rounds, then switches to model-aware acquisition. Do not start uncertainty sampling from scratch.
  • Distribution shift mid-loop. As you add labels biased by your acquisition function, the labeled pool drifts away from the unlabeled pool. The classifier you train on round-K labels may be worse on the original test set than a random-K-labeled classifier. This is the most common reason AL underperforms random in published benchmarks. Mitigations: importance-weight the loss by inverse acquisition probability, or mix a small fraction of pure-random labels into every batch. The notebook reserves 10% of every AL batch for pure-random picks as a safeguard.
  • Per-class imbalance from acquisition bias. Active learning on a long-tailed dataset can collapse onto the head classes (which are uncertain on average just because there are more of them). Track per-class label counts in every round; if a class is starving, add a stratified-random fallback.
  • Test-set leakage via the unlabeled pool. If you embed your unlabeled and test pools with the same DINOv2 and use those embeddings for AL diversity, you are not leaking labels but you are tilting selection toward the test distribution. For an honest benchmark, embed only the unlabeled pool.
  • Reproducibility theatre. AL runs are noisy. A single-seed result that says "BADGE beats random by 4 points at 25%" is meaningless; a 3-seed run with std=2.5 might show no real difference. The notebook averages over 3 seeds and shows std bands. For a publication-quality result you would want 5+.
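
A minimal sketch of the temperature-scaling mitigation from the first pitfall, assuming you hold out calibration logits and integer labels as tensors (this is the standard single-parameter recipe, not necessarily the notebook's exact code):

import torch
import torch.nn.functional as F

def fit_temperature(logits_val, labels_val, iters=200, lr=0.01):
    # optimise log T so that T stays positive; minimise NLL on the calibration split
    log_t = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = F.cross_entropy(logits_val / log_t.exp(), labels_val)
        loss.backward()
        opt.step()
    return log_t.exp().item()

Acquisition scores are then computed from softmax(logits / T) rather than the raw softmax.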

Further reading

  • Sener & Savarese, Active Learning for CNNs: A Core-Set Approach, ICLR 2018 — the k-center paper. https://arxiv.org/abs/1708.00489
  • Ash et al., Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds (BADGE), ICLR 2020. https://arxiv.org/abs/1906.03671
  • Gal & Ghahramani, Dropout as a Bayesian Approximation, ICML 2016 — the foundation of MC-dropout BALD. https://arxiv.org/abs/1506.02142
  • Houlsby et al., Bayesian Active Learning for Classification and Preference Learning, arXiv 2011 — the paper that introduces the BALD acquisition function. https://arxiv.org/abs/1112.5745
  • Hacohen et al., Active Learning on a Budget — Opposite Strategies Suit High and Low Budgets (TypiClust), ICML 2022. The clearest modern statement of why low-budget AL needs typicality, not uncertainty. https://arxiv.org/abs/2202.02794
  • Ash et al., Gone Fishing: Neural Active Learning with Fisher Embeddings (BAIT), NeurIPS 2021 — the principled generalisation of BADGE. https://arxiv.org/abs/2106.09675
  • BAAL library, ServiceNow Research. v2.0+ ships an Experiment API and the EPIG acquisition function. Maintained as of 2026. https://github.com/baal-org/baal
  • modAL library. Last commit Feb 2024; effectively unmaintained as of mid-2026 — the notebook does not depend on it. Mentioned for archaeological context. https://github.com/modAL-python/modAL
  • Beck et al., Effective Evaluation of Deep Active Learning on Image Classification Tasks, 2021 — sober look at how often published AL gains evaporate under proper evaluation. https://arxiv.org/abs/2106.15324

Files in this project

  • README.md
  • notebook.py
  • requirements.txt
  • setup.sh

The notebook (notebook.py) is in jupytext percent format — open it in VS Code or convert with jupytext --to notebook.