Physical AI
Project 01 · Phase A · Data fluency · Hardware: Laptop
CURATE

01 — FiftyOne + nuScenes scenario mining

One-line goal: Use FiftyOne 1.15 to mine rare driving scenarios from nuScenes mini with CLIP embeddings, uniqueness, mistakenness, and natural-language search — and write up where CLIP-based mining works and where it falls down.


Goal

By the end of this project you will have:

  1. A FiftyOne grouped dataset that wraps nuScenes mini (10 scenes, ~2,400 keyframe-aligned camera images, plus LIDAR/RADAR for each keyframe).
  2. CLIP image embeddings stored on every camera sample, plus a FiftyOne Brain similarity index (brain_key="clip_sim") that supports both image-to-image and text-to-image queries.
  3. A per-sample uniqueness score so you can see which frames the rest of the dataset already covers and which are visually rare.
  4. A mistakenness score on the ground-truth 2D detections (or your own model's predictions if you wire one up) so you can see which annotations CLIP+model-disagreement flag as suspect.
  5. Three concrete text queries executed against the index — "child near road at dusk", "emergency vehicle with flashing lights", "construction zone with cones" — with the top-K results reviewed in the FiftyOne App and a short precision/recall write-up of where CLIP succeeded and where it failed.
  6. A short failure-mode list in the notebook that you can paste straight into your study journal.

The whole loop should fit in 60–90 minutes on a laptop after the dataset is downloaded.


Loops touched

This is a COLLECT + CURATE project in the four-loop AV data flywheel (collect → label → train → evaluate, then curate the failures back into collect).

  • COLLECT — you ingest a real public dataset and discover what's actually inside it (scene diversity, sensor coverage, annotation quality).
  • CURATE — you compute embeddings, run uniqueness and mistakenness, and drive natural-language scenario search to surface "interesting" subsets. This is the exact muscle that AV data engineers use every day.

You do not train or evaluate a model here — that's projects 02, 06, 07, 08.


Why this matters for AI Data Intelligence (the Applied Intuition framing)

Applied Intuition's Data Intelligence team builds Data Explorer and the products around Axion for AV/ADAS customers. The core user job-to-be-done is: "out of 50 hours of fleet logs we collected last week, find me the 200 clips where a pedestrian was in the road during dusk transitions, and rank them by how rare they are versus the rest of the corpus." That is exactly what this project does in miniature, on a 10-scene public dataset.

The pipeline you build here — load → embed → index → text-query → triage → export curated subset — is the same conceptual pipeline that powers a production curation tool. The differences in production are scale (millions of clips, multi-camera + LIDAR + radar joined on time), governance (PII redaction, dataset versioning, reviewer workflows), and the ML models behind the embeddings (often domain-fine-tuned CLIP variants, plus task-specific heads). The shape is identical.

You will also feel, first-hand, why CLIP alone is not enough: it confuses construction barriers with police lights, it has no notion of time-of-day unless lighting is dramatic, and it has no notion of behavior (a child standing still on a sidewalk vs. a child stepping into the road look identical to a frame-level encoder). The "failure modes" section of your write-up is the most valuable artifact of this project — those are the gaps that justify investments in temporal models, multi-modal fusion, and human-in-the-loop tooling. That is the conversation you want to be able to have in interviews.


Prerequisites

  • Python 3.9 – 3.12. FiftyOne 1.15 dropped 3.8 and does not yet support 3.13. 3.11 is the safest pick.
  • OS: macOS (Apple Silicon or Intel), Linux. Windows works but FiftyOne recommends WSL2.
  • Disk: ~6 GB free. nuScenes mini is ~4 GB unpacked, plus ~1 GB for embeddings/index/cache.
  • RAM: 8 GB is the floor; 16 GB is comfortable. The FiftyOne App keeps the dataset's MongoDB process resident.
  • Network: ~4 GB download of nuScenes mini, ~600 MB of model weights (CLIP ViT-B/32) on first run.
  • GPU: optional. CPU embedding of the ~2,400 camera frames in nuScenes mini takes 5–8 minutes on a modern laptop CPU; an RTX 3060 or M-series GPU does it in under a minute.
  • Background: comfortable Python, you've used numpy/pandas, you've at least seen PyTorch. No prior FiftyOne experience required.

Hardware

Laptop is fine. The bottleneck is CLIP embedding: ~2,400 images × ViT-B/32 ≈ a few minutes on CPU. If you only have a CPU, leave it running while you read the README — that's the right beat.

If you have a CUDA GPU, FiftyOne picks it up automatically once torch.cuda.is_available() is true. On Apple Silicon, the CLIP model will run on MPS via PyTorch with no extra config.
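
To see which device CLIP will land on before kicking off the embedding run, a small standalone check works. The `pick_device` helper below is illustrative, not a FiftyOne API — FiftyOne does equivalent detection internally:

```python
import torch

# Illustrative device check -- FiftyOne performs equivalent detection
# internally when it loads the CLIP model.
def pick_device() -> str:
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"   # Apple Silicon
    return "cpu"

print(f"CLIP will run on: {pick_device()}")
```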


Setup

# from the repo root
cd projects/01-fiftyone-scenario-mining
 
# 1. Create venv + install pinned deps. Idempotent.
bash setup.sh
 
# 2. Activate
source .venv/bin/activate
 
# 3. Download nuScenes mini.
#    Sign up (free) at https://www.nuscenes.org/nuscenes#download
#    Download "Mini" split (v1.0-mini.tgz, plus the matching maps and CAN bus archives if listed).
#    Unpack into ./data/nuscenes/ so you end up with:
#       data/nuscenes/v1.0-mini/
#       data/nuscenes/samples/
#       data/nuscenes/sweeps/
#       data/nuscenes/maps/
 
# 4. Open the notebook.
jupytext --to notebook notebook.py        # one-time conversion to .ipynb
jupyter lab notebook.ipynb
# OR: open notebook.py directly in VSCode and use the "Run Cell" gutter

If you skip setup.sh and install manually, the only non-obvious detail is that PyTorch CUDA wheels are platform-specific: install torch for your platform from https://pytorch.org/get-started/locally/ before running pip install -r requirements.txt, and pip will treat that requirement as already satisfied.


Steps

Each step maps to a labeled section in notebook.py. The notebook is jupytext percent-format — # %% cells are runnable, # %% [markdown] cells are narrative.

  1. Load nuScenes mini into a FiftyOne grouped dataset (# %% Step 1). Initialize NuScenes(version='v1.0-mini', dataroot=...) from the devkit, create a FiftyOne Dataset, call dataset.add_group_field("group", default="CAM_FRONT"), then iterate the 10 mini scenes and add one fo.Group per keyframe with one slice per sensor (CAM_FRONT, CAM_FRONT_LEFT, …, LIDAR_TOP). After this step dataset.group_slices should list all 7 sensor slices.

  2. Inspect the dataset and launch the App (# %% Step 2). Print dataset.stats(include_media=True), view a single group with dataset.first(), and call fo.launch_app(dataset) to see the multi-camera + LIDAR group view in the browser. This is where the "FiftyOne is the GUI for your dataset" mental model clicks.

  3. Filter to the camera slice (# %% Step 3). Embeddings and CLIP queries operate on images, so build a view: cam_view = dataset.select_group_slices(["CAM_FRONT","CAM_FRONT_LEFT","CAM_FRONT_RIGHT","CAM_BACK","CAM_BACK_LEFT","CAM_BACK_RIGHT"]). Confirm you have ~2,400 images.

  4. Compute CLIP embeddings + similarity index (# %% Step 4). Call fob.compute_similarity(cam_view, model="clip-vit-base32-torch", brain_key="clip_sim", embeddings="clip_embedding"). This (a) downloads CLIP ViT-B/32 the first time, (b) runs it on every camera frame, (c) stores the per-sample 512-dim embedding in clip_embedding, and (d) builds an in-memory similarity index that supports text prompts (verify with index.config.supports_prompts is True).

  5. Compute uniqueness (# %% Step 5). fob.compute_uniqueness(cam_view, embeddings="clip_embedding") reuses the embeddings you just computed to add a uniqueness float ∈ [0,1] per sample. Sort descending and eyeball the top 20 — you'll see the rare frames (tunnel exits, weird crops, motion blur) bubble up.

  6. Run the three text-CLIP queries (# %% Step 6). For each of "child near road at dusk", "emergency vehicle with flashing lights", "construction zone with cones", call cam_view.sort_by_similarity(query, k=20, brain_key="clip_sim") and inspect the top 20 in the App. Mark each result as a true positive or false positive. Log raw counts in the notebook's results-table cell.

  7. Compute mistakenness on the ground-truth 2D boxes (# %% Step 7). nuScenes annotations are 3D, but the loader projects them to 2D per camera. The notebook walks you through running a small object detector (YOLO via the FiftyOne model zoo) to produce predictions, then fob.compute_mistakenness(cam_view, pred_field="predictions", label_field="ground_truth") to flag suspect labels. This is the slowest step (~5 min on CPU) and is optional for the done criterion — skip if you're tight on time.

  8. Write up precision/recall + failure modes (# %% Step 8). The notebook has a markdown cell with a results table you fill in (TP/FP counts at K=20 for each query, your one-paragraph failure-mode summary). This is the artifact that goes into your study journal and that you'll reference in interviews.

  9. Extend the queries (TODO cell) (# %% Step 9). The notebook ends with an empty list of your own queries. Pick three more — e.g., "car pulling out of a parking spot", "cyclist in a bike lane next to a truck", "rain on the windshield at night" — and run them. Compare against the seeded three. Note which kinds of scenarios CLIP can and cannot retrieve.

  10. Export the curated subset (# %% Step 10). Tag the true-positive results from your six queries with a "rare_scenarios" tag in the App, then export with dataset.match_tags("rare_scenarios").export(export_dir="outputs/rare_scenarios", dataset_type=fo.types.FiftyOneDataset). This artifact is what you'd hand off to a labeling vendor or a model-training pipeline in a production curation flow.
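
Under the hood, sort_by_similarity (Step 6) is essentially cosine-similarity ranking in the shared CLIP embedding space: embed the text prompt, embed each frame, rank frames by the angle between the two vectors. A toy numpy sketch of that ranking, with random vectors standing in for real CLIP embeddings:

```python
import numpy as np

# Toy stand-ins for real CLIP outputs: one 512-dim row per camera frame,
# and one 512-dim vector for the embedded text query.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(2400, 512))
text_emb = rng.normal(size=512)

def top_k(image_embs: np.ndarray, text_emb: np.ndarray, k: int = 20) -> np.ndarray:
    # Normalize so that a dot product equals cosine similarity
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb)
    sims = imgs @ txt                  # one similarity score per frame
    return np.argsort(-sims)[:k]       # indices of the k most similar frames

best = top_k(image_embs, text_emb)
print(best.shape)  # (20,)
```

This is why frame-level text queries are cheap once the index exists — each new prompt costs one text-encoder forward pass plus a dot product per frame — and also why they inherit every blind spot of the frame-level image encoder.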


Done criterion

You are done when all four of these are true:

  1. dataset.stats() shows ~400 groups (10 scenes × ~40 keyframes per scene at 2 Hz, give or take) and 7 group slices.
  2. cam_view.first().clip_embedding returns a 512-dim numpy array.
  3. The three seeded text queries each produced a top-20 view that you reviewed in the FiftyOne App, and the markdown results table in notebook.py (Step 8) is filled in with TP-at-20 counts for each query.
  4. Your "Failure modes" markdown paragraph (Step 8) lists at least three concrete CLIP failure patterns you observed (e.g., "construction cones retrieves traffic-light poles when no cones are in view").

If any of those four are missing, you're not done.


Common pitfalls

  1. No module named 'fiftyone.utils.nuscenes' — the FiftyOne nuScenes loader is hand-written by you using the nuscenes-devkit; FiftyOne does not ship a turnkey importer. Follow the pattern in notebook.py Step 1: instantiate NuScenes(version='v1.0-mini', dataroot=...) and walk nusc.scene yourself.

  2. dataroot mismatch — the devkit expects the path containing v1.0-mini/, not v1.0-mini/ itself. Symptom: AssertionError: Database version not found: .... Fix: set dataroot="data/nuscenes", not dataroot="data/nuscenes/v1.0-mini".

  3. CLIP model download stalls behind a corporate proxy. The first call to compute_similarity(model="clip-vit-base32-torch") pulls weights from dl.fbaipublicfiles.com or HuggingFace. If you're behind a proxy, export HTTPS_PROXY/HTTP_PROXY before launching Python. Cached weights live under ~/.cache/torch/hub/ and ~/.fiftyone/.

  4. App doesn't render LIDAR. The 3D viewer needs a WebGL2-capable browser tab. Safari sometimes hangs; Chrome or Firefox is fine. Also confirm your LIDAR_TOP slice samples have filepath pointing at the .pcd.bin file from data/nuscenes/samples/LIDAR_TOP/.

  5. sort_by_similarity(text_query) raises "this index does not support prompts." This means the brain key you're querying was built without a prompt-capable model. Check with dataset.load_brain_results("clip_sim").config.supports_prompts. If False, rebuild the index with model="clip-vit-base32-torch" (or "open-clip-torch") — generic vision-only encoders like ResNet won't accept text.

  6. Mistakenness fails with pred_field has no logits. compute_mistakenness works on confidence by default but is much sharper with logits. The default YOLO predictions in the FiftyOne model zoo store confidence, not logits — that's fine; just leave use_logits=False (the default). If you want logits, run a HuggingFace transformers detector and store the raw scores yourself.

  7. MongoDB process lingers after you Ctrl-C. FiftyOne starts a local MongoDB. If your laptop fan kicks on after closing the notebook, run fo.delete_non_persistent_datasets() and fo.core.service.DatabaseService().stop(), or just pkill -f "mongod.*fiftyone".
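
Pitfall 2 is cheap to guard against up front. A small illustrative check (not part of the devkit or FiftyOne) that confirms the dataroot is the directory containing v1.0-mini/, rather than v1.0-mini/ itself:

```python
from pathlib import Path

# The nuScenes devkit expects dataroot to CONTAIN v1.0-mini/ and samples/;
# passing v1.0-mini/ itself triggers the AssertionError from pitfall 2.
def check_dataroot(dataroot: str, version: str = "v1.0-mini") -> bool:
    root = Path(dataroot)
    return (root / version).is_dir() and (root / "samples").is_dir()

# After Setup step 3, check_dataroot("data/nuscenes") should be True,
# while check_dataroot("data/nuscenes/v1.0-mini") will be False.
```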


Files in this project

  • README.md
  • notebook.py
  • requirements.txt
  • setup.sh

Notebook (notebook.py) is in jupytext percent format — open in VS Code or convert with jupytext --to notebook.