Project 03 — Grounding DINO + SAM 2 auto-labeling
Goal
Take a 30-second nuScenes (or comparable) camera clip and auto-label it end-to-end: open-vocabulary 2D box detection with Grounding DINO, mask generation with SAM 2, and temporal mask propagation with the SAM 2 video predictor. Then compare the auto-labels against nuScenes' ground-truth 2D-projected boxes per class, catalogue three failure modes, and reason explicitly about cost-per-label economics.
The output is not just a working pipeline. It's a calibrated opinion — "auto-labels are reliable for X, mediocre for Y, and you should not let them near Z without a human in the loop." That opinion is the entire job for several roles on Applied Intuition's Data Intelligence team.
Loops touched
This project sits squarely in the LABEL & TRAIN loop of the four-loop data engine, with one foot in CURATE:
- LABEL & TRAIN. The whole notebook is an auto-labeling pipeline. By the end you've gone from raw frames to per-object 2D boxes, masks, and track IDs across time, plus a quantitative comparison to ground truth that tells you which classes the auto-labels are trustworthy on.
- CURATE (touched). The failure-mode analysis (§9 of the notebook) is exactly how you'd start an active-learning queue: collect the hardest cases, route them to humans for QA, retrain.
It does not touch COLLECT (no new data acquisition) or EVAL on a closed-loop driving task — those come in projects 05 and 07.
Why this matters for AI Data Intelligence
The auto-labeling pipeline is the dominant pattern of the 2025–2026 AV and robotics data stack. Every serious team has rebuilt theirs in the last 18 months around vision-language detectors + foundation segmenters:
- Tesla popularized the "offline 4D auto-label" / "label from the future" idea on AI Day 2022 and has been industrializing it since. The key insight: an offline labeling network has access to the future trajectory of every object — so an object that's occluded now but visible in 30 frames can be back-projected, giving a clean label even through the occluded window. The fleet captures the data; the offline pipeline labels it; the online network learns from those labels. SAM 2's video predictor is doing the temporal-consistency part of this, but at 2D mask level instead of 3D vector-space level — same idea, smaller scale.
- Waymo's Auto4D (Yang et al., CVPR 2021; productionized since) generates 4D — i.e. 3D + time — object trajectories from sequential LiDAR. They decompose the 4D label into 3D size (estimated once per object) and motion path (refined frame-by-frame). The two-stage decomposition is Waymo's contribution; the fact that it's all auto-generated is the cost-of-labels lesson everyone has internalized.
- Mobileye REM is the limit case: instead of auto-labeling captured fleet data, REM auto-extracts geometry (lane lines, signs, road boundaries) on-board, then crowdsources millions of low-bandwidth fragments to a cloud aggregator that builds the HD map without ever storing raw imagery. Different objective, same economics: labels at fleet scale only work if humans aren't in the per-object loop.
This project is a working laptop-scale version of the first stage of all three pipelines: a foundation-model detector + segmenter + temporal propagator that turns a clip into structured labels. Once you've built it and have visceral feel for where it works and where it doesn't, you can read Tesla / Waymo / Wayve papers with the right priors — and you can have informed conversations with Applied's customers about what fraction of their labeling backlog they should automate today.
This connects directly to Applied Intuition's product line. Roads / ADAS customers ship perception models; they need to label their own fleet data; they don't all have an in-house auto-labeling team. Applied's job is, in part, to package this kind of pipeline so a Toyota or a Stellantis can operate it without hiring 30 ML engineers. Knowing the pipeline cold means you can size the gap between "Grounding DINO + SAM 2 is good enough" and "customer needs a custom 3D auto-labeler" with confidence.
Prerequisites
- Working through Project 01 first is strongly recommended — you'll reuse the nuScenes mini download and the FiftyOne intuition for what per-class accuracy means in self-driving data.
- Comfortable with PyTorch and HuggingFace `transformers` at the `from_pretrained` level.
- Familiarity with bounding-box geometry: IoU, xyxy vs xywh, projecting 3D corners to image coordinates. The notebook does the math but doesn't re-derive it.
- ~30 minutes for setup + first download, then ~15 minutes per notebook run.
Hardware
| Tier | Works | Notes |
|---|---|---|
| Laptop GPU 8 GB+ (RTX 4060/4070, M3 Pro w/ MPS) | Yes — recommended | tiny SAM 2 + tiny Grounding DINO. Full notebook in ~5–10 min. |
| Laptop GPU 4–6 GB | With caveats | Drop MAX_FRAMES to 60 and use SAM 2 tiny only. Video propagation may need offload_video_to_cpu=True. |
| CPU only (Intel/Apple Silicon) | Yes, slowly | ~30× slower video propagation. Use MAX_FRAMES = 30, skip the propagation viz, focus on the §8 metrics cell. |
| Workstation GPU (24 GB+) | Yes — easy | Switch to MODEL_SIZE = "large" for noticeably better masks on distant objects. |
The repo's working assumption is the laptop tier. Everything above that gets you nicer numbers and shorter runtimes; nothing below the laptop tier has a fast path.
Setup
```bash
cd projects/03-sam2-auto-labeling/
chmod +x setup.sh
./setup.sh                  # creates .venv, installs deps,
                            # clones SAM 2, downloads checkpoints
source .venv/bin/activate
```

What setup.sh does (idempotent — safe to re-run):
- Creates `.venv` with the system Python (must be 3.10–3.12).
- Runs `pip install -r requirements.txt` (pinned versions — see the `transformers` pitfall below).
- Probes `torch.cuda.is_available()` and prints your GPU + VRAM.
- Clones `facebookresearch/sam2` at v2.1 into `third_party/sam2/`, then runs `pip install -e .` so the SAM 2 package is on `sys.path`. The `SAM2_BUILD_CUDA=0` env var skips the optional CUDA extension; the only cost is a slightly slower mask-postprocessing path.
- Downloads `sam2.1_hiera_tiny.pt` and `sam2.1_hiera_large.pt` to `checkpoints/` (~150 MB + ~860 MB). Skips if already present.
- Pre-warms the HuggingFace cache for `IDEA-Research/grounding-dino-tiny` so the notebook's cell 1 doesn't stall for 700 MB of weights.
- Creates `data/nuscenes/`, `data/clips/`, `outputs/masks/`, `outputs/videos/`.
You then need to bring your own data:
- Recommended: download nuScenes mini (~4 GB) from https://www.nuscenes.org/nuscenes#download and unpack into `data/nuscenes/` so you have `data/nuscenes/v1.0-mini/`, `data/nuscenes/samples/`, `data/nuscenes/sweeps/`. The notebook auto-detects this and runs the full GT-comparison cells.
- Smoke-test fallback: drop any 30-second `.mp4` into `data/clips/clip.mp4`. The pipeline still runs; you skip the GT cells.
ffmpeg is recommended for the propagation video output but not required — `imageio-ffmpeg` ships its own bundled binary as a fallback.
Steps
1. Configure (the `Config` dataclass at the top of the notebook). The knobs that matter most: `model_size`, `text_prompt`, `box_threshold`, `max_frames`.
2. Load Grounding DINO via HuggingFace (sketch after this list). Skip the IDEA-Research repo install — the HF port is ~5% slower but doesn't need `nvcc`.
3. Load SAM 2, both image and video predictors, from the v2.1 checkpoint.
4. Pull a 30-second clip from nuScenes scene-0103 (front camera, 12 Hz) or from the fallback mp4. Stage frames as zero-padded JPGs — the layout the SAM 2 video predictor expects.
5. Detect with Grounding DINO on every keyframe. Save raw boxes without segmenting yet — debugging is easier if you can isolate detector failures from segmenter failures.
6. Segment each detected box with `SAM2ImagePredictor` (sketch below). One Hiera forward pass per frame, then many cheap mask-decoder calls per frame.
7. Propagate masks across the full clip with `SAM2VideoPredictor` (sketch below). Seed with the keyframe that has the most detections; let the video predictor's memory bank carry the masks through occlusions.
8. Compare to nuScenes ground truth — per-class precision/recall at IoU 0.5, with greedy IoU matching (sketch below). Map nuScenes' 23 classes to our 7 prompt classes.
9. Failure-mode analysis — three categories: distance / small-object recall, occlusion (pedestrian-behind-car), class confusion (truck/car/bus). The notebook auto-finds and visualizes one example of each.
10. Cost-per-label calculation with concrete dollar numbers (sketch below): 2D box vs polygon vs 3D cuboid manual prices, vs auto-labeled GPU costs, extrapolated to fleet scale. This is the strategic punchline.
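To make steps 2 and 5 concrete, here is a minimal detection sketch using the HuggingFace port. It assumes the pre-4.55 `post_process_grounded_object_detection` kwargs that `requirements.txt` pins (see Common pitfalls); the prompt string and thresholds are illustrative, not the notebook's exact `Config` values.

```python
# Steps 2 & 5 (sketch): open-vocabulary detection with the HF Grounding DINO port.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_ID = "IDEA-Research/grounding-dino-tiny"

processor = AutoProcessor.from_pretrained(MODEL_ID)
detector = AutoModelForZeroShotObjectDetection.from_pretrained(MODEL_ID).to(DEVICE)

# Prompt-format gotcha: lowercase phrases, period-separated, trailing period.
TEXT_PROMPT = "car. truck. bus. pedestrian. cyclist. motorcycle. traffic cone."

image = Image.open("data/clips/frames/00000.jpg")
inputs = processor(images=image, text=TEXT_PROMPT, return_tensors="pt").to(DEVICE)
with torch.no_grad():
    outputs = detector(**inputs)

# Thresholds are per-domain knobs, not constants (see Common pitfalls).
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.3,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],  # PIL size is (W, H); targets want (H, W)
)[0]
# results["boxes"] are xyxy in absolute pixel coords. Keep them raw, per step 5,
# so detector failures stay separable from segmenter failures.
```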
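Steps 3 and 6, continuing from the sketch above (reusing its `image`, `results`, and `DEVICE`). The config-name / checkpoint pairing is the Hydra gotcha from Common pitfalls:

```python
# Steps 3 & 6 (sketch): build SAM 2 and segment each detected box.
# build_sam2() wants the Hydra config NAME, not the .pt path.
import numpy as np
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

sam2_model = build_sam2(
    "configs/sam2.1/sam2.1_hiera_t.yaml",   # must match the checkpoint variant
    "checkpoints/sam2.1_hiera_tiny.pt",
    device=DEVICE,
)
predictor = SAM2ImagePredictor(sam2_model)

predictor.set_image(np.array(image))         # one Hiera forward per frame
for box in results["boxes"].cpu().numpy():   # then many cheap decoder calls
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    # masks: (1, H, W) binary mask for this object on this frame
```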
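Step 7, sketched under the assumption that `seed_boxes` and `seed_idx` came out of step 5's keyframe detection; both names are placeholders, not the notebook's:

```python
# Step 7 (sketch): propagate seed-frame boxes through the whole clip.
# The frames dir must hold zero-padded JPEGs (00000.jpg, 00001.jpg, ...).
from sam2.build_sam import build_sam2_video_predictor

video_predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_t.yaml",
    "checkpoints/sam2.1_hiera_tiny.pt",
    device=DEVICE,
)
state = video_predictor.init_state(
    video_path="data/clips/frames/",
    offload_video_to_cpu=True,   # the 8 GB VRAM mitigation from Common pitfalls
)

# Seed: one object ID per detection on the keyframe with the most detections.
for obj_id, box in enumerate(seed_boxes):    # placeholders from step 5
    video_predictor.add_new_points_or_box(
        inference_state=state, frame_idx=seed_idx, obj_id=obj_id, box=box,
    )

# The memory bank carries each mask forward (and through short occlusions).
video_masks = {}
for frame_idx, obj_ids, mask_logits in video_predictor.propagate_in_video(state):
    video_masks[frame_idx] = {
        oid: (mask_logits[i] > 0.0).cpu().numpy()
        for i, oid in enumerate(obj_ids)
    }
```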
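Step 8's matching logic is small enough to show in full. This is a sketch with illustrative helper names (`iou_xyxy`, `greedy_match`), not the notebook's exact cell; it runs per class, after the 23-to-7 class mapping:

```python
# Step 8 (sketch): greedy IoU matching for per-class precision/recall.
import numpy as np

def iou_xyxy(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two boxes in (x1, y1, x2, y2) pixel coords."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def greedy_match(preds, scores, gts, iou_thr=0.5):
    """Greedily match predictions (descending score) to unmatched GT boxes."""
    order = np.argsort(-np.asarray(scores))
    matched_gt, tp = set(), 0
    for i in order:
        best_j, best_iou = -1, iou_thr
        for j, gt in enumerate(gts):
            if j not in matched_gt and iou_xyxy(preds[i], gt) >= best_iou:
                best_j, best_iou = j, iou_xyxy(preds[i], gt)
        if best_j >= 0:
            matched_gt.add(best_j)
            tp += 1
    fp, fn = len(preds) - tp, len(gts) - tp
    return tp, fp, fn   # precision = tp/(tp+fp), recall = tp/(tp+fn)
```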
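And the shape of step 10's arithmetic. Every dollar figure below is a hypothetical placeholder, chosen only to show the structure of the calculation; the notebook's §10 cell substitutes sourced vendor quotes:

```python
# Step 10 (sketch): cost-per-label arithmetic. All rates are PLACEHOLDERS.
MANUAL_USD_PER_2D_BOX = 0.05     # placeholder manual-labeling rate
GPU_USD_PER_HOUR = 1.50          # placeholder cloud GPU rate
FRAMES_PER_HOUR_AUTO = 2_000     # placeholder pipeline throughput
OBJECTS_PER_FRAME = 12           # placeholder scene density

auto_usd_per_label = GPU_USD_PER_HOUR / (FRAMES_PER_HOUR_AUTO * OBJECTS_PER_FRAME)
ratio = MANUAL_USD_PER_2D_BOX / auto_usd_per_label
print(f"auto: ${auto_usd_per_label:.6f}/label, manual/auto ratio ~ {ratio:,.0f}x")
```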
Done criterion
You're done when all of these are true:
- The notebook runs end-to-end with no errors.
- The §8 per-class P/R table prints actual non-zero numbers for `car`, `pedestrian`, `cyclist` (the classes nuScenes mini has decent counts of).
- `outputs/videos/propagation.mp4` exists and shows tracked masks following objects across the clip.
- You can articulate, in one paragraph each, the three failure modes you observed, with concrete examples from your run. Don't just paraphrase the notebook — point at the specific images.
- You have a calibrated answer to "should this customer auto-label their data with this stack?" — i.e. you can say "yes for X, no for Y, with QA for Z" and back it with numbers.
Common pitfalls
- `transformers` 4.55 broke Grounding DINO post-processing. The uniformization PR (huggingface/transformers#34853) renamed the keyword arguments of `post_process_grounded_object_detection` — `text_threshold` is gone, replaced by `text_labels`. Either pin to 4.54.x (what we do in `requirements.txt`) or update the call site in the notebook. Symptom: `TypeError: ... unexpected keyword argument` on the very first detection call.
- Grounding DINO box-score thresholds need per-domain tuning. The model's defaults (0.35 / 0.25) were tuned for COCO-ish images. On night-time or rainy nuScenes scenes, recall drops sharply — you may need `box_threshold=0.2` and accept noisier precision. There is no universal "right" threshold; it's a precision/recall trade-off you own.
- SAM 2 VRAM blows up on long videos. `SAM2VideoPredictor` keeps per-frame embeddings + per-object memory banks in GPU memory. On 8 GB GPUs, ~360 frames × 15 objects × tiny model is roughly the ceiling. Mitigations, in order of preference: (a) pass `offload_video_to_cpu=True` to `init_state`, (b) split the video into 60-frame chunks and stitch, (c) reduce `MODEL_SIZE` to `tiny`, (d) reduce `MAX_FRAMES`.
- SAM 2 model-format gotcha: `.pt` vs Hydra config name. SAM 2's `build_sam2()` takes a config name string (e.g. `"configs/sam2.1/sam2.1_hiera_t.yaml"`) that it resolves from inside the package, plus a checkpoint path. People routinely pass the checkpoint path as the config, get an opaque Hydra error, and get stuck. The config string must match the checkpoint variant exactly: `_hiera_t.yaml` ↔ `sam2.1_hiera_tiny.pt`, `_hiera_l.yaml` ↔ `sam2.1_hiera_large.pt`.
- Box prompts that overlap occluded objects pick the wrong surface. If a pedestrian is mostly behind a car and Grounding DINO's box covers both, SAM 2 will segment the car (the more salient surface inside the box). The fix is point prompts (`predictor.predict(point_coords=..., point_labels=...)`) on the visible part of the pedestrian, not box prompts. The notebook discusses this in §9; see the sketch after this list.
- nuScenes mini doesn't annotate traffic lights / signs. Only moving objects (vehicles, pedestrians, cyclists) get 3D boxes. Don't panic when those rows in the P/R table show 0/0.
- Grounding DINO's prompt format is finicky. Phrases must be lowercase, period-separated, with a trailing period. `"car, truck"` silently parses as one phrase; `"Car . Truck"` works, but the capitalization sometimes hurts recall. The HF processor docs are explicit; follow them.
Further reading
The pipeline papers:
- Ravi et al., SAM 2: Segment Anything in Images and Videos, 2024. https://arxiv.org/abs/2408.00714
- Liu et al., Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection, ECCV 2024.
- Meta AI, SAM 3: Segment Anything with Concepts, Nov 2025. Single-stage open-vocab detector + segmenter; the natural successor to the two-stage stack you build here. https://ai.meta.com/sam3/
- Meta AI, SAM 3.1, Mar 2026. Real-time video extension of SAM 3 with multiplexing and global reasoning.
Auto-labeling at scale:
- Tesla AI Day 2022 — the offline 4D auto-labeler walkthrough is the canonical "label from the future" reference. (YouTube; the relevant segment is around 1:35:00 in the 2022 stream.)
- Yang et al., Auto4D: Learning to Label 4D Objects from Sequential Point Clouds, 2021. The Waymo paper that popularized the size + motion-path decomposition.
- Qi et al., Offboard 3D Object Detection from Point Cloud Sequences, 2021. Waymo's offboard / pseudo-label paper that complements Auto4D.
- Mobileye REM technical overview: https://www.mobileye.com/technology/rem/
Engineering references:
- IDEA-Research / Grounded-SAM-2 repo — the canonical Grounding DINO + SAM 2 wiring, including a tracking demo we mirror in §7: https://github.com/IDEA-Research/Grounded-SAM-2
- Roboflow's `supervision` library — `Detections`, `BoxAnnotator`, `MaskAnnotator`. Used in this notebook for annotation. Worth reading the source of `Detections.from_transformers` to see how Grounding DINO outputs are normalized.
- HuggingFace Grounding DINO docs: https://huggingface.co/docs/transformers/model_doc/grounding-dino
For the strategy memo (Project 18):
- Cost-of-data analyses: Scale AI's, Voxel51's, Toloka's public reports on $/label by modality. The 100×–1000× ratio you'll see in §10 is consistent across all three.
- Applied Intuition's blog posts on the data engine for ADAS — read them with the priors you build here.
Files in this project
- README.md
- notebook.py
- requirements.txt
- setup.sh
The notebook (`notebook.py`) is in jupytext percent format — open it in VS Code or convert with `jupytext --to notebook`.