Project 03 — Grounding DINO + SAM 2 auto-labeling
Goal
Take a 30-second nuScenes (or comparable) camera clip and auto-label it end-to-end: open-vocabulary 2D box detection with Grounding DINO, mask generation with SAM 2, and temporal mask propagation with the SAM 2 video predictor. Then compare the auto-labels against nuScenes' ground-truth 2D-projected boxes per class, catalogue three failure modes, and reason explicitly about cost-per-label economics.
The output is not just a working pipeline. It's a calibrated opinion — "auto-labels are reliable for X, mediocre for Y, and you should not let them near Z without a human in the loop." That opinion is the entire job for several roles on Applied Intuition's Data Intelligence team.
Loops touched
This project sits squarely in the LABEL & TRAIN loop of the four-loop data engine, with one foot in CURATE:
- LABEL & TRAIN. The whole notebook is an auto-labeling pipeline. By the end you've gone from raw frames to per-object 2D boxes, masks, and track IDs across time, plus a quantitative comparison to ground truth that tells you which classes the auto-labels are trustworthy on.
- CURATE (touched). The failure-mode analysis (§9 of the notebook) is exactly how you'd start an active-learning queue: collect the hardest cases, route them to humans for QA, retrain.
It does not touch COLLECT (no new data acquisition) or EVAL on a closed-loop driving task — those come in projects 05 and 07.
Why this matters for AI Data Intelligence
The auto-labeling pipeline is the dominant pattern of the 2025–2026 AV and robotics data stack. Every serious team has rebuilt theirs in the last 18 months around vision-language detectors + foundation segmenters:
- Tesla popularized the "offline 4D auto-label" / "label from the future" idea on AI Day 2022 and has been industrializing it since. The key insight: an offline labeling network has access to the future trajectory of every object — so an object that's occluded now but visible in 30 frames can be back-projected, giving a clean label even through the occluded window. The fleet captures the data; the offline pipeline labels it; the online network learns from those labels. SAM 2's video predictor is doing the temporal-consistency part of this, but at 2D mask level instead of 3D vector-space level — same idea, smaller scale.
- Waymo's Auto4D (Yang et al., CVPR 2021; productionized since) generates 4D — i.e. 3D + time — object trajectories from sequential LiDAR. They decompose the 4D label into 3D size (estimated once per object) and motion path (refined frame-by-frame). The two-stage decomposition is Waymo's contribution; the fact that it's all auto-generated is the cost-of-labels lesson everyone has internalized.
- Mobileye REM is the limit case: instead of auto-labeling captured fleet data, REM auto-extracts geometry (lane lines, signs, road boundaries) on-board, then crowdsources millions of low-bandwidth fragments to a cloud aggregator that builds the HD map without ever storing raw imagery. Different objective, same economics: labels at fleet scale only work if humans aren't in the per-object loop.
This project is a working laptop-scale version of the first stage of all three pipelines: a foundation-model detector + segmenter + temporal propagator that turns a clip into structured labels. Once you've built it and have visceral feel for where it works and where it doesn't, you can read Tesla / Waymo / Wayve papers with the right priors — and you can have informed conversations with Applied's customers about what fraction of their labeling backlog they should automate today.
This connects directly to Applied Intuition's product line. Roads / ADAS customers ship perception models; they need to label their own fleet data; they don't all have an in-house auto-labeling team. Applied's job is, in part, to package this kind of pipeline so a Toyota or a Stellantis can operate it without hiring 30 ML engineers. Knowing the pipeline cold means you can size the gap between "Grounding DINO + SAM 2 is good enough" and "customer needs a custom 3D auto-labeler" with confidence.
Prerequisites
- Working through Project 01 first is strongly recommended — you'll reuse the nuScenes mini download and the FiftyOne intuition for what per-class accuracy means in self-driving data.
- Comfortable with PyTorch and HuggingFace `transformers` at the `from_pretrained` level.
- Familiarity with bounding-box geometry: IoU, xyxy vs xywh, projecting 3D corners to image coordinates. The notebook does the math but doesn't re-derive it.
- ~30 minutes for setup + first download, then ~15 minutes per notebook run.
Hardware
| Tier | Works | Notes |
|---|---|---|
| Laptop GPU 8 GB+ (RTX 4060/4070, M3 Pro w/ MPS) | Yes — recommended | tiny SAM 2 + tiny Grounding DINO. Full notebook in ~5–10 min. |
| Laptop GPU 4–6 GB | With caveats | Drop MAX_FRAMES to 60 and use SAM 2 tiny only. Video propagation may need offload_video_to_cpu=True. |
| CPU only (Intel/Apple Silicon) | Yes, slowly | ~30× slower video propagation. Use MAX_FRAMES = 30, skip the propagation viz, focus on the §8 metrics cell. |
| Workstation GPU (24 GB+) | Yes — easy | Switch to MODEL_SIZE = "large" for noticeably better masks on distant objects. |
The repo's working assumption is the laptop tier. Everything above that gets you nicer numbers and shorter runtimes; nothing below the laptop tier has a fast path.
Setup
```bash
cd projects/03-sam2-auto-labeling/
chmod +x setup.sh
./setup.sh                  # creates .venv, installs deps,
                            # clones SAM 2, downloads checkpoints
source .venv/bin/activate
```

What setup.sh does (idempotent — safe to re-run):
- Creates `.venv` with the system Python (must be 3.10–3.12).
- Runs `pip install -r requirements.txt` (pinned versions — see the `transformers` pitfall below).
- Probes `torch.cuda.is_available()` and prints your GPU + VRAM.
- Clones `facebookresearch/sam2` at v2.1 into `third_party/sam2/`, then runs `pip install -e .` so the SAM 2 package is on `sys.path`. The `SAM2_BUILD_CUDA=0` env var skips the optional CUDA extension; the only cost is a slightly slower mask-postprocessing path.
- Downloads `sam2.1_hiera_tiny.pt` and `sam2.1_hiera_large.pt` to `checkpoints/` (~150 MB + ~860 MB). Skips if already present.
- Pre-warms the HuggingFace cache for `IDEA-Research/grounding-dino-tiny` so the notebook's cell 1 doesn't stall for 700 MB of weights.
- Creates `data/nuscenes/`, `data/clips/`, `outputs/masks/`, `outputs/videos/`.
You then need to bring your own data:
- Recommended: download nuScenes mini (~4 GB) from https://www.nuscenes.org/nuscenes#download and unpack into `data/nuscenes/` so you have `data/nuscenes/v1.0-mini/`, `data/nuscenes/samples/`, `data/nuscenes/sweeps/`. The notebook auto-detects this and runs the full GT-comparison cells.
- Smoke-test fallback: drop any 30-second `.mp4` into `data/clips/clip.mp4`. The pipeline still runs; you skip the GT cells.
ffmpeg is recommended for the propagation video output but not required — `imageio-ffmpeg` ships its own bundled binary as a fallback.
Steps
1. Configure (the `Config` dataclass at the top of the notebook). The knobs that matter most: `model_size`, `text_prompt`, `box_threshold`, `max_frames`.
2. Load Grounding DINO via HuggingFace (sketch after this list). Skip the IDEA-Research repo install — the HF port is ~5% slower but doesn't need `nvcc`.
3. Load SAM 2, both image and video predictors, from the v2.1 checkpoint.
4. Pull a 30-second clip from nuScenes scene-0103 (front camera, 12 Hz) or from the fallback mp4. Stage frames as zero-padded JPGs — the layout the SAM 2 video predictor expects.
5. Detect with Grounding DINO on every keyframe. Save raw boxes without segmenting yet — debugging is easier if you can isolate detector failures from segmenter failures.
6. Segment each detected box with `SAM2ImagePredictor` (sketch below). One Hiera forward pass per frame, then many cheap mask-decoder calls per frame.
7. Propagate masks across the full clip with `SAM2VideoPredictor` (sketch below). Seed with the keyframe that has the most detections; let the video predictor's memory bank carry the masks through occlusions.
8. Compare to nuScenes ground truth — per-class precision/recall at IoU 0.5, with greedy IoU matching (sketch below). Map nuScenes' 23 classes to our 7 prompt classes.
9. Failure-mode analysis — three categories: distance / small-object recall, occlusion (pedestrian-behind-car), class confusion (truck/car/bus). The notebook auto-finds and visualizes one example of each.
10. Cost-per-label calculation with concrete dollar numbers (sketch below): 2D box vs polygon vs 3D cuboid manual prices, vs auto-labeled GPU costs, extrapolated to fleet scale. This is the strategic punchline.
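To make steps 2 and 5 concrete, here is a minimal detection sketch using the HuggingFace port. It assumes the pre-4.55 `post_process_grounded_object_detection` kwargs that `requirements.txt` pins (see Common pitfalls); the prompt string and thresholds are illustrative, not the notebook's exact `Config` values.

```python
# Steps 2 & 5 (sketch): open-vocabulary detection with the HF Grounding DINO port.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_ID = "IDEA-Research/grounding-dino-tiny"

processor = AutoProcessor.from_pretrained(MODEL_ID)
detector = AutoModelForZeroShotObjectDetection.from_pretrained(MODEL_ID).to(DEVICE)

# Prompt-format gotcha: lowercase phrases, period-separated, trailing period.
TEXT_PROMPT = "car. truck. bus. pedestrian. cyclist. motorcycle. traffic cone."

image = Image.open("data/clips/frames/00000.jpg")
inputs = processor(images=image, text=TEXT_PROMPT, return_tensors="pt").to(DEVICE)
with torch.no_grad():
    outputs = detector(**inputs)

# Thresholds are per-domain knobs, not constants (see Common pitfalls).
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.3,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],  # PIL size is (W, H); targets want (H, W)
)[0]
# results["boxes"] are xyxy in absolute pixel coords. Keep them raw, per step 5,
# so detector failures stay separable from segmenter failures.
```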
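Steps 3 and 6, continuing from the sketch above (reusing its `image`, `results`, and `DEVICE`). The config-name / checkpoint pairing is the Hydra gotcha from Common pitfalls:

```python
# Steps 3 & 6 (sketch): build SAM 2 and segment each detected box.
# build_sam2() wants the Hydra config NAME, not the .pt path.
import numpy as np
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

sam2_model = build_sam2(
    "configs/sam2.1/sam2.1_hiera_t.yaml",   # must match the checkpoint variant
    "checkpoints/sam2.1_hiera_tiny.pt",
    device=DEVICE,
)
predictor = SAM2ImagePredictor(sam2_model)

predictor.set_image(np.array(image))         # one Hiera forward per frame
for box in results["boxes"].cpu().numpy():   # then many cheap decoder calls
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    # masks: (1, H, W) binary mask for this object on this frame
```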
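Step 7, sketched under the assumption that `seed_boxes` and `seed_idx` came out of step 5's keyframe detection; both names are placeholders, not the notebook's:

```python
# Step 7 (sketch): propagate seed-frame boxes through the whole clip.
# The frames dir must hold zero-padded JPEGs (00000.jpg, 00001.jpg, ...).
from sam2.build_sam import build_sam2_video_predictor

video_predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_t.yaml",
    "checkpoints/sam2.1_hiera_tiny.pt",
    device=DEVICE,
)
state = video_predictor.init_state(
    video_path="data/clips/frames/",
    offload_video_to_cpu=True,   # the 8 GB VRAM mitigation from Common pitfalls
)

# Seed: one object ID per detection on the keyframe with the most detections.
for obj_id, box in enumerate(seed_boxes):    # placeholders from step 5
    video_predictor.add_new_points_or_box(
        inference_state=state, frame_idx=seed_idx, obj_id=obj_id, box=box,
    )

# The memory bank carries each mask forward (and through short occlusions).
video_masks = {}
for frame_idx, obj_ids, mask_logits in video_predictor.propagate_in_video(state):
    video_masks[frame_idx] = {
        oid: (mask_logits[i] > 0.0).cpu().numpy()
        for i, oid in enumerate(obj_ids)
    }
```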
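Step 8's matching logic is small enough to show in full. This is a sketch with illustrative helper names (`iou_xyxy`, `greedy_match`), not the notebook's exact cell; it runs per class, after the 23-to-7 class mapping:

```python
# Step 8 (sketch): greedy IoU matching for per-class precision/recall.
import numpy as np

def iou_xyxy(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two boxes in (x1, y1, x2, y2) pixel coords."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def greedy_match(preds, scores, gts, iou_thr=0.5):
    """Greedily match predictions (descending score) to unmatched GT boxes."""
    order = np.argsort(-np.asarray(scores))
    matched_gt, tp = set(), 0
    for i in order:
        best_j, best_iou = -1, iou_thr
        for j, gt in enumerate(gts):
            if j not in matched_gt and iou_xyxy(preds[i], gt) >= best_iou:
                best_j, best_iou = j, iou_xyxy(preds[i], gt)
        if best_j >= 0:
            matched_gt.add(best_j)
            tp += 1
    fp, fn = len(preds) - tp, len(gts) - tp
    return tp, fp, fn   # precision = tp/(tp+fp), recall = tp/(tp+fn)
```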
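And the shape of step 10's arithmetic. Every dollar figure below is a hypothetical placeholder, chosen only to show the structure of the calculation; the notebook's §10 cell substitutes sourced vendor quotes:

```python
# Step 10 (sketch): cost-per-label arithmetic. All rates are PLACEHOLDERS.
MANUAL_USD_PER_2D_BOX = 0.05     # placeholder manual-labeling rate
GPU_USD_PER_HOUR = 1.50          # placeholder cloud GPU rate
FRAMES_PER_HOUR_AUTO = 2_000     # placeholder pipeline throughput
OBJECTS_PER_FRAME = 12           # placeholder scene density

auto_usd_per_label = GPU_USD_PER_HOUR / (FRAMES_PER_HOUR_AUTO * OBJECTS_PER_FRAME)
ratio = MANUAL_USD_PER_2D_BOX / auto_usd_per_label
print(f"auto: ${auto_usd_per_label:.6f}/label, manual/auto ratio ~ {ratio:,.0f}x")
```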
Done criterion
You're done when all of these are true:
- The notebook runs end-to-end with no errors.
- The §8 per-class P/R table prints actual non-zero numbers for `car`, `pedestrian`, `cyclist` (the classes nuScenes mini has decent counts of).
- `outputs/videos/propagation.mp4` exists and shows tracked masks following objects across the clip.
- You can articulate, in one paragraph each, the three failure modes you observed, with concrete examples from your run. Don't just paraphrase the notebook — point at the specific images.
- You have a calibrated answer to "should this customer auto-label their data with this stack?" — i.e. you can say "yes for X, no for Y, with QA for Z" and back it with numbers.
Common pitfalls
- `transformers` 4.55 broke Grounding DINO post-processing. The uniformization PR (huggingface/transformers#34853) renamed the keyword arguments of `post_process_grounded_object_detection` — `text_threshold` is gone, replaced by `text_labels`. Either pin to 4.54.x (what we do in `requirements.txt`) or update the call site in the notebook. Symptom: `TypeError: ... unexpected keyword argument` on the very first detection call.
- Grounding DINO box-score thresholds need per-domain tuning. The model's defaults (0.35 / 0.25) were tuned for COCO-ish images. On night-time or rainy nuScenes scenes, recall drops sharply — you may need `box_threshold=0.2` and accept noisier precision. There is no universal "right" threshold; it's a precision/recall trade-off you own.
- SAM 2 VRAM blows up on long videos. `SAM2VideoPredictor` keeps per-frame embeddings + per-object memory banks in GPU memory. On 8 GB GPUs, ~360 frames × 15 objects × tiny model is roughly the ceiling. Mitigations, in order of preference: (a) pass `offload_video_to_cpu=True` to `init_state`, (b) split the video into 60-frame chunks and stitch, (c) reduce `MODEL_SIZE` to `tiny`, (d) reduce `MAX_FRAMES`.
- SAM 2 model-format gotcha: `.pt` vs Hydra config name. SAM 2's `build_sam2()` takes a config name string (e.g. `"configs/sam2.1/sam2.1_hiera_t.yaml"`) that it resolves from inside the package, plus a checkpoint path. People routinely pass the checkpoint path as the config, get an opaque Hydra error, and get stuck. The config string must match the checkpoint variant exactly: `_hiera_t.yaml` ↔ `sam2.1_hiera_tiny.pt`, `_hiera_l.yaml` ↔ `sam2.1_hiera_large.pt`.
- Box prompts that overlap occluded objects pick the wrong surface. If a pedestrian is mostly behind a car and Grounding DINO's box covers both, SAM 2 will segment the car (the more salient surface inside the box). The fix is point prompts (`predictor.predict(point_coords=..., point_labels=...)`) on the visible part of the pedestrian, not box prompts. The notebook discusses this in §9; see the sketch after this list.
- nuScenes mini doesn't annotate traffic lights / signs. Only moving objects (vehicles, pedestrians, cyclists) get 3D boxes. Don't panic when those rows in the P/R table show 0/0.
- Grounding DINO's prompt format is finicky. Phrases must be lowercase, period-separated, with a trailing period. `"car, truck"` silently parses as one phrase; `"Car . Truck"` works, but the capitalization sometimes hurts recall. The HF processor docs are explicit; follow them.
Further reading
The pipeline papers:
- Ravi et al., SAM 2: Segment Anything in Images and Videos, 2024. https://arxiv.org/abs/2408.00714
- Liu et al., Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection, ECCV 2024.
- Meta AI, SAM 3: Segment Anything with Concepts, Nov 2025. Single-stage open-vocab detector + segmenter; the natural successor to the two-stage stack you build here. https://ai.meta.com/sam3/
- Meta AI, SAM 3.1, Mar 2026. Real-time video extension of SAM 3 with multiplexing and global reasoning.
Auto-labeling at scale:
- Tesla AI Day 2022 — the offline 4D auto-labeler walkthrough is the canonical "label from the future" reference. (YouTube; the relevant segment is around 1:35:00 in the 2022 stream.)
- Yang et al., Auto4D: Learning to Label 4D Objects from Sequential Point Clouds, 2021. The Waymo paper that popularized the size + motion-path decomposition.
- Qi et al., Offboard 3D Object Detection from Point Cloud Sequences, 2021. Waymo's offboard / pseudo-label paper that complements Auto4D.
- Mobileye REM technical overview: https://www.mobileye.com/technology/rem/
Engineering references:
- IDEA-Research / Grounded-SAM-2 repo — the canonical Grounding DINO + SAM 2 wiring, including a tracking demo we mirror in §7: https://github.com/IDEA-Research/Grounded-SAM-2
- Roboflow's `supervision` library — `Detections`, `BoxAnnotator`, `MaskAnnotator`. Used in this notebook for annotation. Worth reading the source of `Detections.from_transformers` to see how Grounding DINO outputs are normalized.
- HuggingFace Grounding DINO docs: https://huggingface.co/docs/transformers/model_doc/grounding-dino
For the strategy memo (Project 18):
- Cost-of-data analyses: Scale AI's, Voxel51's, Toloka's public reports on $/label by modality. The 100×–1000× ratio you'll see in §10 is consistent across all three.
- Applied Intuition's blog posts on the data engine for ADAS — read them with the priors you build here.
Files in this project
- README.md
- notebook.py
- requirements.txt
- setup.sh
The notebook (`notebook.py`) is in jupytext percent format — open it in VS Code or convert with `jupytext --to notebook`.