Project 05 · Phase B · Labeling fundamentals · Hardware: Laptop · Loops: LABEL, CURATE

Project 05 — Frontier VLMs as Zero-Shot Labelers

Goal

Take a small corpus of autonomous-driving camera frames (BDD100K mini or nuScenes mini, 100–500 frames) and zero-shot label every frame with a frontier vision-language model — Gemini 2.5 Pro is the default; Claude Sonnet 4.5 and GPT-5 are pluggable second/third opinions. Generate six label tracks per frame:

  1. Object presence per class — car / pedestrian / cyclist / traffic-light / sign.
  2. Weather — clear / cloudy / rain / snow / fog.
  3. Time of day — day / dawn-or-dusk / night.
  4. Road type — highway / city / residential / parking-lot.
  5. Free-text scene caption — one sentence; what's happening here.
  6. Rare-event flag — emergency vehicle, school zone, construction, unusual cargo, animal in roadway, debris.

Then run the same six tasks through the traditional pipeline from project 03 (Grounding DINO + SAM 2 for objects, CLIP-zero-shot for weather, EXIF + sun-angle heuristics for time-of-day) and produce one comparison table that an Applied Intuition product manager could put on a slide: per-task accuracy, per-frame cost in dollars, per-frame latency p50/p99, agreement rate VLM-vs-traditional, and a typology of the failure modes each approach is prone to.

The artifact is not the pipeline. It is the 2x2 decision matrix at the end of the notebook — when to use a VLM, when to use a traditional pipeline, when to ensemble them, and when to bring in a human. That matrix is the thing the role asks you to have an opinion on.

Loops touched

This project sits on the seam between two loops of the data engine:

  • LABEL & TRAIN. The whole notebook is two competing label-generation pipelines, scored against ground truth on the same frames. By the end you have a per-task quality/cost frontier — for every label kind, you know which approach Pareto-dominates and where the frontier crosses.
  • CURATE (heavily touched). Section 11 implements the VLM-as-judge pattern: feed the traditional pipeline's outputs back into a VLM and ask "is this label correct?". The disagreements become the active-learning queue. This is exactly the Cosmos-Reason-as-judge pattern NVIDIA shipped in 2025–2026, in 30 lines of Python.

It does not touch COLLECT (no fleet) or EVAL on closed-loop driving (those are projects 13, 16). It connects forward to project 06 (provenance): every VLM-generated label comes with a provenance manifest recording model ID, prompt hash, response, cost, and timestamp — exactly the schema project 06 establishes for label-data lineage.

Why this matters for AI Data Intelligence

The auto-labeling stack is reshaping fast in 2025–2026, and the role of an AI Data Intelligence engineer at Applied Intuition (or any comparable team) is to know — with calibrated numbers, not vibes — which tool to use for which signal. Three forces are colliding:

  1. Frontier VLMs do tasks at $0.005–$0.05 per frame that previously cost $1–$10 per frame from human labelers. A 1024×1024 frame costs about 1,290 input tokens on Gemini 2.5 Pro, plus a few hundred output tokens for a structured response — call it $0.003–$0.005 per frame end-to-end (a back-of-envelope check follows this list). For "what's the weather in this image?" that's a 200–2,000× cost reduction over the human baseline, at quality close-but-not-equal to human. Two years ago this category did not exist.

  2. Traditional CV pipelines (Grounding DINO + SAM 2, the project 03 stack) still dominate on geometric tasks — pixel-accurate masks, tight bounding boxes, instance tracking across video — at roughly $0.001 per frame in GPU-amortized cost. They're an order of magnitude cheaper than VLMs and more accurate on geometry. But they are roughly useless on semantic attributes ("is this an emergency vehicle"), free-text captions, and behavior intent.

  3. Humans remain the gold standard for ambiguous edge cases, novel classes, and anything safety-critical, at $1–$10 per frame and measured-in-days latency.
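As a sanity check on the figures in items 1 and 2, here is the back-of-envelope arithmetic. The per-million-token rates, the L4 spot price, and the throughput figure are assumptions to re-check against current pricing and your own project 03 logs:

# Back-of-envelope per-frame cost: VLM vs. GPU-amortized traditional pipeline.
# All rates below are assumed list prices -- re-check before trusting them.
GEMINI_INPUT_PER_M = 1.25    # $ per 1M input tokens (assumption)
GEMINI_OUTPUT_PER_M = 10.00  # $ per 1M output tokens (assumption)

input_tokens = 1290          # ~one 1024x1024 frame plus a short prompt
output_tokens = 250          # structured JSON response

vlm_cost = (input_tokens / 1e6) * GEMINI_INPUT_PER_M + (output_tokens / 1e6) * GEMINI_OUTPUT_PER_M
print(f"VLM per-frame cost:         ${vlm_cost:.4f}")   # ~ $0.004, inside the $0.003-$0.005 range

L4_SPOT_PER_HOUR = 0.50      # assumed spot price for an L4
frames_per_hour = 500        # hypothetical throughput from project 03's logs
print(f"Traditional per-frame cost: ${L4_SPOT_PER_HOUR / frames_per_hour:.4f}")  # ~ $0.001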

These three options span 4–5 orders of magnitude in cost and 1–2 orders of magnitude in quality, in different ways for different tasks. The pricing is non-uniform: VLMs are cheap on "easy semantic" labels and expensive on "label every box on a 30-Hz fleet". Traditional pipelines have the inverse curve. There is no universal winner — only routing decisions.

This project produces the cost/quality intuition required to make those calls credibly. Applied Intuition's customers (Toyota, Stellantis, Volvo, defense primes) ship perception models trained on their fleet data. They need labels at fleet scale and they cannot afford an in-house team to re-derive these tradeoffs from scratch every quarter. Applied's job is to package the routing layer — "send geometry tasks to the SAM 2 pipeline, semantic tasks to a VLM with structured output, edge cases to humans, and provenance-stamp every label so the customer can audit it later." Knowing the curve cold means you can size the cost of any labeling backlog within a 2× factor in a 5-minute conversation.

The VLM-as-judge twist (the Cosmos Reason pattern) is the self-bootstrapping feedback loop that closes this. Once you have two labelers — VLM and traditional — you get a free third signal: their disagreements. Disagreements are exactly the training-data candidates human labelers should see. This is how you bend the labeling-cost curve down further every month without losing quality on the parts you care about.

Prerequisites

  • Project 03 — the Grounding DINO + SAM 2 baseline. We import its auto_label() function (rather than re-implementing) and assume its outputs/ are already on disk. If you have not run project 03, do that first and come back.
  • API keys for at least one of: Google AI Studio (Gemini 2.5 Pro is the default and the cheapest), Anthropic (Claude Sonnet 4.5/4.6), or OpenAI (GPT-5). One key is enough; two is recommended for the cross-provider agreement section; three lets you run the full Section 10 ensemble and is the only way to get the "majority vote of three frontier VLMs vs. traditional" comparison.
  • Comfortable with Pydantic (we lean on it for response schemas) and basic pandas for the comparison tables.
  • ~15 minutes setup, then ~5 minutes per smoke-test run, ~30 minutes per full 500-frame run.

Hardware

  • Anything that runs Python 3.11 — works; recommended. API-bound; a CPU-only laptop is fine.
  • Laptop GPU (8 GB+) — recommended only if you are also running project 03's traditional pipeline locally for comparison. Section 9 imports project 03 outputs from disk, so the GPU is only needed if you re-run that.
  • No GPU at all — works, with one caveat: either (a) re-use project 03's saved outputs from a prior GPU run, or (b) skip the traditional-pipeline comparison cells and run the VLM half only. The decision matrix still works.

This project is API-bound, not compute-bound. Cost dominates: budget ~$10–$30 in API calls for a serious run on 500 frames across 2 providers, or ~$0.50–$2 for a 100-frame smoke test on Gemini alone. Section 0 of the notebook has a cost-estimator that prints the projected bill before any API call goes out. Read its output before answering "y".

Setup

cd projects/05-vlm-zero-shot-labeling/
chmod +x setup.sh
./setup.sh                        # creates .venv, installs deps, copies .env.template
source .venv/bin/activate
cp .env.template .env             # then fill in YOUR keys; .env is git-ignored

What setup.sh does (idempotent):

  1. Creates .venv with Python 3.10–3.12 (3.11 recommended).
  2. pip install -r requirements.txt. The big new deps are google-genai, anthropic, openai, pydantic, and instructor.
  3. Drops a .env.template with placeholders for GOOGLE_API_KEY, ANTHROPIC_API_KEY, OPENAI_API_KEY. Adds .env to .gitignore if not already there. The notebook will refuse to call any API whose key is missing, rather than silently picking a random provider. This is on purpose.
  4. Verifies the existence of ../03-sam2-auto-labeling/outputs/ (the project 03 outputs we'll compare against). Prints a friendly warning if missing — the notebook degrades gracefully but the comparison table will be VLM-only.
  5. Stages a 100-frame BDD100K mini sample under data/bdd100k_mini/ if you've placed the BDD100K download archive in data/. Otherwise prints download instructions; BDD100K requires a free account at https://bdd-data.berkeley.edu/.

You bring your own data. Recommended sources, in order:

  • BDD100K mini (~1 GB, free account required). Has weather, time-of-day, and scene labels per image — gold for our task suite. Download bdd100k_images_10k.zip and bdd100k_labels_release.zip (the small versions); unpack into data/bdd100k_mini/ so you have data/bdd100k_mini/images/100k/{train,val}/ and data/bdd100k_mini/labels/.
  • nuScenes mini (already downloaded if you did project 03). Has object detections but no weather labels; the notebook has a --corpus nuscenes-mini flag that runs only task 1 (object presence).

Steps

  1. Configure the Config dataclass at the top: corpus, n_frames, providers (which VLM(s) to call), dry_run (estimates cost without making API calls). The dry-run cost estimate is what you read before you spend money. (A sketch of this dataclass follows the step list.)
  2. Load the corpus: 100–500 frames with ground-truth labels for at least 4 of our 6 tasks (BDD100K is the sweet spot — has all of weather / time-of-day / scene plus 2D boxes for objects).
  3. Define the response schema as a Pydantic model. This same model is fed to Gemini's response_schema, Anthropic's tool-use input_schema, and OpenAI's response_format — one schema, three providers. The schema enforces enums for weather/time/road-type so we can score with exact-match. (Sketched after the step list.)
  4. Define the prompt. One paragraph of instructions plus the JSON schema, tested across providers. The prompt itself is checked into the repo (prompts/v1.md) so we have provenance.
  5. Run the VLM pipeline. Loop over frames, call each configured provider, parse the structured output, save to a Parquet file with columns for cost, latency, raw response, and parsed labels. The cost-tracking decorator in Section 0 captures usage tokens from each provider's response.usage and converts to $. (A bookkeeping sketch follows the step list.)
  6. Load the traditional-pipeline outputs from project 03 (outputs/detections.json, outputs/masks/). Augment with CLIP-zero-shot classification for weather (one CLIP forward per frame is ~$0.0001 on a laptop GPU at amortized cost); a CLIP sketch follows the step list. Time-of-day comes from a simple sun-elevation heuristic on the frame timestamp + nuScenes/BDD ego pose; for BDD without GPS, we fall back to a brightness-histogram classifier.
  7. Score both pipelines against ground truth. Per-task accuracy / F1 for the categorical labels; mean BLEU + a VLM-as-judge LLM-eval score for the free-text caption (since BLEU on captions is noisy).
  8. Cost & latency aggregation. Per-frame cost in dollars (VLM API cost from token usage; traditional pipeline cost from a GPU-amortized model: $0.50/hour for an L4 spot instance, divided by frames-per-second throughput from project 03's logs). p50 and p99 latency.
  9. Slice analysis. Where does the VLM dominate? Where does the traditional pipeline dominate? We make four scatter plots: cost-vs-accuracy, with one point per (task, pipeline) pair, color-coded by pipeline. The Pareto frontier is the deliverable.
  10. VLM-as-judge (Section 11). Take the traditional pipeline's outputs, send them back to Gemini with the prompt "here is an automated label for this image; is it correct? if not, what's wrong?". Score the judge against ground truth. This is the Cosmos Reason pattern in 30 lines, and it gives you a free quality-monitoring loop on the cheap pipeline. (A sketch of the judge call follows the step list.)
  11. Failure-mode taxonomy. Auto-categorize disagreements into five buckets: (a) hallucinated objects (VLM says "fire truck" on a red SUV), (b) missed-rare-class (VLM correct, traditional missed), (c) off-by-one counting (VLM says 3 cars, true is 4), (d) geometry failure (VLM says "car center-frame" but the actual box is on the edge), (e) prompt sensitivity (changing one word flips the answer).
  12. The 2x2 decision matrix. A markdown table at the end with two axes — signal type (geometric vs semantic) on one axis, volume (one-off vs fleet-scale) on the other — and one of {VLM, traditional, ensemble, human} in each cell, with a one-line justification grounded in the numbers we just generated.
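The sketches below illustrate steps 1, 3, 5, 6, and 10 under stated assumptions; names, defaults, and model IDs are illustrative rather than the notebook's exact code. First, a minimal version of the step-1 Config dataclass:

from dataclasses import dataclass, field

@dataclass
class Config:
    corpus: str = "bdd100k-mini"       # or "nuscenes-mini"
    n_frames: int = 100                # 100 for a smoke test, up to 500 for a full run
    providers: list[str] = field(default_factory=lambda: ["gemini"])  # which VLM(s) to call
    dry_run: bool = True               # estimate the bill only; make no API calls

cfg = Config(n_frames=500, providers=["gemini", "anthropic"], dry_run=False)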
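Step 3's response schema, sketched with Pydantic. The Literal enums make weather / time-of-day / road-type exactly scorable, and evidence_bbox is the field pitfall 2 leans on; class and field names here are assumptions:

from typing import Literal, Optional
from pydantic import BaseModel, Field

class FrameLabels(BaseModel):
    # Task 1: object presence per class
    has_car: bool
    has_pedestrian: bool
    has_cyclist: bool
    has_traffic_light: bool
    has_sign: bool
    # Tasks 2-4: enums enforced so we can score with exact match
    weather: Literal["clear", "cloudy", "rain", "snow", "fog"]
    time_of_day: Literal["day", "dawn_or_dusk", "night"]
    road_type: Literal["highway", "city", "residential", "parking_lot"]
    # Task 5: one-sentence caption
    caption: str = Field(description="One sentence: what is happening in this scene?")
    # Task 6: rare-event flag plus an evidence box (see pitfall 2)
    rare_event: Optional[str] = None
    evidence_bbox: Optional[list[float]] = None  # [x0, y0, x1, y1] in pixels

The same class can be handed, directly or via its JSON schema, to each provider's structured-output mechanism — or through the Instructor wrapper listed under Further reading — so there is a single parse path.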
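For step 5, a sketch of the per-frame bookkeeping behind the Section 0 decorator: token counts come from the provider's usage object (pitfall 8), get converted to dollars with an assumed rate table, and land as one Parquet row per (frame, provider). The rate table and field names are placeholders:

import json
import hashlib
import pandas as pd

# Assumed $-per-million-token rates; replace with the current price sheet.
RATES = {"gemini-2.5-pro": {"input": 1.25, "output": 10.0}}

def usage_to_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    r = RATES[model]
    return (input_tokens / 1e6) * r["input"] + (output_tokens / 1e6) * r["output"]

rows = []

def record(frame_id: str, model: str, prompt: str, raw_text: str,
           parsed: dict, input_tokens: int, output_tokens: int, latency_ms: float):
    # One row per (frame, provider); the prompt hash feeds the project 06 provenance record.
    rows.append({
        "frame_id": frame_id,
        "model": model,
        "prompt_hash": "sha256:" + hashlib.sha256(prompt.encode()).hexdigest(),
        "labels": json.dumps(parsed),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": usage_to_usd(model, input_tokens, output_tokens),
        "latency_ms": latency_ms,
        "raw_response": raw_text,
    })

# after the loop over frames:
# pd.DataFrame(rows).to_parquet("outputs/vlm_labels.parquet")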
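Step 6's CLIP zero-shot weather classifier, sketched with the Hugging Face transformers API; the checkpoint and prompt templates are assumptions, not necessarily what the notebook uses:

from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

WEATHER = ["clear", "cloudy", "rain", "snow", "fog"]
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_weather(image_path: str) -> str:
    # One CLIP forward per frame: score the image against one text prompt per weather class.
    image = Image.open(image_path).convert("RGB")
    prompts = [f"a dashcam photo of a road in {w} weather" for w in WEATHER]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, len(WEATHER))
    return WEATHER[int(logits.argmax())]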
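And step 10's judge call, sketched against the google-genai SDK with a small Pydantic verdict schema. The prompt wording, model ID, and field names are illustrative, not the notebook's frozen prompt:

from typing import Literal
from pydantic import BaseModel
from google import genai
from google.genai import types

class Verdict(BaseModel):
    correct: bool
    confidence: Literal["low", "medium", "high"]
    what_is_wrong: str | None = None

client = genai.Client()  # reads the API key from the environment

def judge(image_bytes: bytes, task: str, proposed_label: str) -> Verdict:
    # Feed the traditional pipeline's label back to the VLM and ask for a structured verdict.
    prompt = (
        "Here is an automated label for this image.\n"
        f"Task: {task}\nProposed label: {proposed_label}\n"
        "Is this label correct? If not, what is wrong?"
    )
    resp = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=[types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"), prompt],
        config={"response_mime_type": "application/json", "response_schema": Verdict},
    )
    return resp.parsed  # a Verdict instance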

Done criterion

You're done when all of these are true:

  • The notebook runs end-to-end without errors against at least 100 frames of BDD100K mini using at least one frontier VLM provider.
  • Section 8 prints a comparison table with concrete non-zero numbers for VLM and traditional on at least 5 of the 6 task categories (the 6th may be N/A — e.g. nuScenes has no weather GT).
  • Section 9's Pareto-frontier scatter plot saves to outputs/figures/pareto.png and visibly shows the crossover — VLM upper-right on semantic tasks, traditional lower-left on geometric tasks.
  • Section 11's VLM-as-judge agreement-with-GT number is within 5% of the direct VLM number on the same task. (If wildly off, something's wrong with the judge prompt.)
  • You can articulate the 2x2 decision matrix in your own words — not just paraphrase the cell — with at least one war story per quadrant grounded in your run's failures. This is the artifact the role wants you to be able to defend in a whiteboard conversation.
  • outputs/cost_report.json exists and shows total $ spent matches what your provider's usage dashboard shows within 5%. (If they disagree by more, your token-counting decorator has a bug.)

Common pitfalls

  1. Prompt-engineering sensitivity is the #1 failure mode. Frontier VLMs in 2026 are less sensitive than they were in 2024, but a single phrasing change ("describe the weather" vs "what is the weather condition: clear, rain, snow, fog, or cloudy?") still flips ~5–10% of labels on the same image. Mitigation: enforce the response schema with enums (Pydantic Literal["clear","rain","snow","fog", "cloudy"]), keep the prompt frozen and version-controlled, and run a prompt-stability ablation in Section 4 by re-running the same prompt 3 times on a subsample and reporting the disagreement rate (~2–3% on a good prompt; 10%+ means rewrite).

  2. VLMs hallucinate rare objects. Ask "is there an emergency vehicle in this scene?" on 100 random night-time city frames and you will get 3–5 false positives — usually a red SUV, a delivery truck with a light bar that isn't emergency lighting, or a car with hazards on. Mitigation: require the model to cite the box (output evidence_bbox: [x0, y0, x1, y1]) and post-filter responses where the cited box doesn't contain anything box-like (overlap with Grounding DINO output > 0.3). A sketch of this post-filter follows the pitfall list.

  3. Bounding-box accuracy is poor with frontier VLMs. Even Gemini 2.5 Pro and GPT-5 produce boxes that are 50–150 px off on standard driving frames in 2026. Don't ask a VLM for tight boxes; ask for yes/no presence and let the SAM 2 pipeline draw the actual rectangle. This is the load-bearing reason the decision matrix routes geometric tasks to the traditional pipeline.

  4. Latency variance under load is severe. A clean Gemini call is ~1.5 seconds p50, but p99 can spike to 30+ seconds during US-east peak hours. If you're computing p99 on 100 frames, sample at random times of day, not in one back-to-back batch. Use the providers' batch APIs (Anthropic's Message Batches, Gemini Batch, OpenAI Batch) for any >100-frame run — they're 50% cheaper and more predictable.

  5. Structured-output enforcement is not 100% even with response_schema. Gemini will occasionally return JSON with extra keys; GPT-5 will occasionally truncate output and produce invalid JSON; Anthropic's tool-use is the most reliable but won't catch enum violations on free-text-typed fields. The notebook wraps every parse in a try / except with a re-prompt-with-error-message fallback, and logs parse failures separately. Expect ~1% raw-parse failure on 500 frames.

  6. Cost variance across providers is large and is the strategic surprise. GPT-5 lists a lower text rate than Gemini at $0.625/M input tokens, but image token accounting differs by provider and mode — roughly 765 tokens for a 1024×1024 frame depending on "low/high" detail mode, versus Gemini's flat 1,290 tokens per image — and output-token rates differ again, so for image-heavy workloads Gemini 2.5 Pro can still come out cheaper end-to-end depending on the prompt. Re-do the math for your actual prompt before assuming the cheapest-text-rate provider is the cheapest-overall provider.

  7. Image-input rate limits are tighter than text rate limits. Gemini 2.5 Pro's image-input limit on Tier 1 was 60 RPM as of May 2026. 500 frames × 1 provider = ~9 minutes of wall-clock at the rate limit; 3 providers in parallel needs 3× the rate-limit budget or sequential calls. Notebook uses asyncio.Semaphore(max_rpm/60) to throttle. (A throttle sketch follows the pitfall list.)

  8. Token usage from provider responses is the source of truth for cost; do NOT count tokens locally. tiktoken and friends will undercount image tokens by 5–20% because the per-image overhead varies by model and detail-mode. Always read response.usage from the provider response and bill from there. Section 0's cost decorator does this; resist the urge to "estimate" upfront.
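The post-filter from pitfall 2, sketched below. "Overlap" is interpreted here as IoU against the Grounding DINO boxes from project 03, and the 0.3 threshold is the one quoted above; the helper names are illustrative:

def iou(a, b):
    # a, b are [x0, y0, x1, y1] boxes in pixels
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def keep_rare_event(evidence_bbox, dino_boxes, threshold=0.3):
    # Keep the VLM's rare-event claim only if its cited box lands on something
    # the traditional detector also saw.
    if evidence_bbox is None:
        return False
    return any(iou(evidence_bbox, d) > threshold for d in dino_boxes)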
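And one way to implement the pitfall-7 throttle: an asyncio semaphore caps in-flight requests while a minimum spacing between launches keeps the sustained rate under the RPM limit. label_frame is a hypothetical per-frame labeling coroutine:

import asyncio
import time

MAX_RPM = 60
semaphore = asyncio.Semaphore(4)     # cap concurrent in-flight requests
_min_interval = 60.0 / MAX_RPM       # seconds between request launches
_last_launch = 0.0
_launch_lock = asyncio.Lock()

async def throttled(call, *args, **kwargs):
    global _last_launch
    async with semaphore:
        async with _launch_lock:
            wait = _min_interval - (time.monotonic() - _last_launch)
            if wait > 0:
                await asyncio.sleep(wait)
            _last_launch = time.monotonic()
        return await call(*args, **kwargs)

# usage: results = await asyncio.gather(*[throttled(label_frame, f) for f in frames])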

Further reading

  • NVIDIA Cosmos Reason 2 (Dec 2025; updated Feb 2026) — the open reasoning VLM specifically targeted at physical AI judging. 8B parameters, runs on a single H100 (or two L4s), available on Hugging Face as nvidia/Cosmos-Reason2-8B and via NVIDIA NIM. The "VLM as data-quality judge" pattern in Section 11 is exactly what Cosmos Reason is designed for; if you want to industrialize that pattern, swap the Gemini judge call for a Cosmos Reason call and you've removed your per-call API cost. https://huggingface.co/nvidia/Cosmos-Reason2-8B
  • DriveBench (Xie et al., ICCV 2025) — 19,200 frames, 20,498 QA pairs, evaluates 12 VLMs on driving tasks across clean / corrupted / text-only inputs. The headline finding — top VLMs reach ~57% on sequential-scene reasoning — is the strongest current evidence that VLMs are not yet a drop-in replacement for closed-loop perception. They are however excellent for the labeling tasks in this project. https://drive-bench.github.io/
  • AutoDriDM (arXiv 2601.14702) — explainable benchmark for VLM decision-making on AV scenes built from nuScenes, KITTI, BDD100K with a three-tier protocol (Object → Scene → Decision). Cite this if anyone asks why frontier VLMs are not yet doing planning.
  • VLM-AD (Pan et al., Dec 2024 / arXiv 2412.14446) — using VLM supervision (not direct inference) to train a smaller end-to-end driving policy. Relevant because the labels you're producing in this notebook are exactly the kind of supervision signal VLM-AD consumes.
  • CARScenes (arXiv 2511.10701) — 5,192 images consolidated from Argoverse 1 / Cityscapes / KITTI / nuScenes under a 28-key category knowledge base with 350+ leaf attributes. A natural follow-on evaluation set for the schema in this notebook.
  • Gemini API structured-output docs — https://ai.google.dev/gemini-api/docs/structured-output. Read the "limitations" section before you trust the schema enforcement.
  • Anthropic vision pricing & limits — https://platform.claude.com/docs/en/about-claude/pricing. The Batch API at 50% off is the only sane way to label more than 1,000 frames.
  • Instructor library — https://python.useinstructor.com/integrations/genai/. The unified Pydantic-schema-as-response-format wrapper across providers; we use it in the notebook to avoid maintaining three parallel parse branches.

Connection to project 06 (provenance)

Every VLM-generated label in this project carries a provenance record that is exactly the schema project 06 establishes for label-data lineage:

{
  "label_id": "...",
  "frame_id": "bdd_val_b1c81faa-3df17267_00135",
  "task": "weather",
  "value": "rain",
  "source": {
    "type": "vlm",
    "model": "gemini-2.5-pro-002",
    "prompt_hash": "sha256:0fa1...c2",
    "prompt_version": "v1",
    "raw_response_path": "outputs/raw/bdd_val_..._gemini.json",
    "input_tokens": 1342,
    "output_tokens": 87,
    "cost_usd": 0.00257,
    "latency_ms": 1843,
    "timestamp": "2026-05-08T14:33:21Z"
  }
}

This means: a year from now, a customer auditor can ask "why does this training example say it's raining?" and get an answer back to the specific model snapshot, the exact prompt, and the cost the provider billed for that label. Without provenance, VLM-generated labels are unauditable training data — which is a non-starter for any safety-critical customer. Project 06 makes this schema enforceable across the whole stack; project 05 produces the labels that schema describes.

Files in this project

  • README.md
  • notebook.py
  • requirements.txt
  • setup.sh

Notebook (notebook.py) is in jupytext percent format — open in VS Code or convert with jupytext --to notebook.