01 — FiftyOne + nuScenes scenario mining
One-line goal: Use FiftyOne 1.15 to mine rare driving scenarios from nuScenes mini with CLIP embeddings, uniqueness, mistakenness, and natural-language search — and write up where CLIP-based mining works and where it falls down.
Goal
By the end of this project you will have:
- A FiftyOne grouped dataset that wraps nuScenes mini (10 scenes, ~2,400 keyframe-aligned camera images plus LIDAR/RADAR per scene).
- CLIP image embeddings stored on every camera sample, plus a FiftyOne Brain similarity index (`brain_key="clip_sim"`) that supports both image-to-image and text-to-image queries.
- A per-sample `uniqueness` score so you can see which frames the rest of the dataset already covers and which are visually rare.
- A `mistakenness` score on the ground-truth 2D detections (or your own model's predictions if you wire one up) so you can see which annotations CLIP + model disagreement flags as suspect.
- Three concrete text queries executed against the index — `"child near road at dusk"`, `"emergency vehicle with flashing lights"`, `"construction zone with cones"` — with the top-K results reviewed in the FiftyOne App and a short precision/recall write-up of where CLIP succeeded and where it failed.
- A short failure-mode list in the notebook that you can paste straight into your study journal.
The whole loop should fit in 60–90 minutes on a laptop after the dataset is downloaded.
Loops touched
This is a COLLECT + CURATE project in the four-loop AV data flywheel (collect → label → train → evaluate, then curate the failures back into collect).
- COLLECT — you ingest a real public dataset and discover what's actually inside it (scene diversity, sensor coverage, annotation quality).
- CURATE — you compute embeddings, run uniqueness and mistakenness, and drive natural-language scenario search to surface "interesting" subsets. This is the exact muscle that AV data engineers use every day.
You do not train or evaluate a model here — that's projects 02, 06, 07, 08.
Why this matters for AI Data Intelligence (the Applied Intuition framing)
Applied Intuition's Data Intelligence team builds Data Explorer and the products around Axion for AV/ADAS customers. The core user job-to-be-done is: "out of 50 hours of fleet logs we collected last week, find me the 200 clips where a pedestrian was in the road during dusk transitions, and rank them by how rare they are versus the rest of the corpus." That is exactly what this project does in miniature, on a 10-scene public dataset.
The pipeline you build here — load → embed → index → text-query → triage → export curated subset — is the same conceptual pipeline that powers a production curation tool. The differences in production are scale (millions of clips, multi-camera + LIDAR + radar joined on time), governance (PII redaction, dataset versioning, reviewer workflows), and the ML models behind the embeddings (often domain-fine-tuned CLIP variants, plus task-specific heads). The shape is identical.
You will also feel, first-hand, why CLIP alone is not enough: it confuses construction barriers with police lights, it has no notion of time-of-day unless lighting is dramatic, and it has no notion of behavior (a child standing still on a sidewalk vs. a child stepping into the road look identical to a frame-level encoder). The "failure modes" section of your write-up is the most valuable artifact of this project — those are the gaps that justify investments in temporal models, multi-modal fusion, and human-in-the-loop tooling. That is the conversation you want to be able to have in interviews.
Prerequisites
- Python 3.9 – 3.12. FiftyOne 1.15 dropped 3.8 and does not yet support 3.13. 3.11 is the safest pick.
- OS: macOS (Apple Silicon or Intel), Linux. Windows works but FiftyOne recommends WSL2.
- Disk: ~6 GB free. nuScenes mini is ~4 GB unpacked, plus ~1 GB for embeddings/index/cache.
- RAM: 8 GB is the floor; 16 GB is comfortable. The FiftyOne App keeps the dataset's MongoDB process resident.
- Network: ~4 GB download of nuScenes mini, ~600 MB of model weights (CLIP ViT-B/32) on first run.
- GPU: optional. CPU embedding of the ~2,400 camera frames in nuScenes mini takes 5–8 minutes on a modern laptop CPU; an RTX 3060 or M-series GPU does it in under a minute.
- Background: comfortable Python, you've used numpy/pandas, you've at least seen PyTorch. No prior FiftyOne experience required.
Hardware
Laptop is fine. The bottleneck is CLIP embedding: ~2,400 images × ViT-B/32 ≈ a few minutes on CPU. If you only have a CPU, leave it running while you read the README — that's the right beat.
If you have a CUDA GPU, FiftyOne picks it up automatically once torch.cuda.is_available() is true. On Apple Silicon, the CLIP model will run on MPS via PyTorch with no extra config.
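If you want to confirm which device the embedding pass will use before kicking it off, a quick plain-PyTorch check (nothing FiftyOne-specific) is enough:

```python
import torch

# Device PyTorch (and therefore the CLIP forward passes) will pick
if torch.cuda.is_available():
    device = "cuda"   # NVIDIA GPU: embedding pass finishes in well under a minute
elif torch.backends.mps.is_available():
    device = "mps"    # Apple Silicon GPU via Metal
else:
    device = "cpu"    # expect roughly 5-8 minutes for ~2,400 frames
print(f"CLIP embeddings will run on: {device}")
```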
Setup
# from the repo root
cd projects/01-fiftyone-scenario-mining
# 1. Create venv + install pinned deps. Idempotent.
bash setup.sh
# 2. Activate
source .venv/bin/activate
# 3. Download nuScenes mini.
# Sign up (free) at https://www.nuscenes.org/nuscenes#download
# Download "Mini" split (v1.0-mini.tgz, plus the matching maps and CAN bus archives if listed).
# Unpack into ./data/nuscenes/ so you end up with:
# data/nuscenes/v1.0-mini/
# data/nuscenes/samples/
# data/nuscenes/sweeps/
# data/nuscenes/maps/
# 4. Open the notebook.
jupytext --to notebook notebook.py # one-time conversion to .ipynb
jupyter lab notebook.ipynb
# OR: open notebook.py directly in VSCode and use the "Run Cell" gutter
If you skip setup.sh and install manually, the only non-obvious detail is that PyTorch CUDA wheels are platform-specific — install torch from https://pytorch.org/get-started/locally/ for your platform before running `pip install -r requirements.txt`; pip will then treat that requirement as already satisfied.
Steps
Each step maps to a labeled section in `notebook.py`. The notebook is jupytext percent format — `# %%` cells are runnable, `# %% [markdown]` cells are narrative.
- Load nuScenes mini into a FiftyOne grouped dataset (`# %% Step 1`). Initialize `NuScenes(version='v1.0-mini', dataroot=...)` from the devkit, create a FiftyOne `Dataset`, call `dataset.add_group_field("group", default="CAM_FRONT")`, then iterate the 10 mini scenes and add one `fo.Group` per keyframe with one slice per sensor (`CAM_FRONT`, `CAM_FRONT_LEFT`, …, `LIDAR_TOP`). After this step `dataset.group_slices` should list all 7 sensor slices. (Steps 1–6 are condensed into the code sketch after this list.)
- Inspect the dataset and launch the App (`# %% Step 2`). Print `dataset.stats(include_media=True)`, view a single group with `dataset.first()`, and call `fo.launch_app(dataset)` to see the multi-camera + LIDAR group view in the browser. This is where the "FiftyOne is the GUI for your dataset" mental model clicks.
- Filter to the camera slice (`# %% Step 3`). Embeddings and CLIP queries operate on images, so build a view: `cam_view = dataset.select_group_slices(["CAM_FRONT", "CAM_FRONT_LEFT", "CAM_FRONT_RIGHT", "CAM_BACK", "CAM_BACK_LEFT", "CAM_BACK_RIGHT"])`. Confirm you have ~2,400 images.
- Compute CLIP embeddings + similarity index (`# %% Step 4`). Call `fob.compute_similarity(cam_view, model="clip-vit-base32-torch", brain_key="clip_sim", embeddings="clip_embedding")`. This (a) downloads CLIP ViT-B/32 the first time, (b) runs it on every camera frame, (c) stores the per-sample 512-dim embedding in `clip_embedding`, and (d) builds an in-memory similarity index that supports text prompts (verify with `index.config.supports_prompts is True`).
- Compute uniqueness (`# %% Step 5`). `fob.compute_uniqueness(cam_view, embeddings="clip_embedding")` reuses the embeddings you just computed to add a `uniqueness` float ∈ [0, 1] per sample. Sort descending and eyeball the top 20 — you'll see the rare frames (tunnel exits, weird crops, motion blur) bubble up.
- Run the three text-CLIP queries (`# %% Step 6`). For each of `"child near road at dusk"`, `"emergency vehicle with flashing lights"`, `"construction zone with cones"`, call `cam_view.sort_by_similarity(query, k=20, brain_key="clip_sim")` and inspect the top 20 in the App. Mark each result as a true positive or false positive. Log raw counts in the notebook's results-table cell.
- Compute mistakenness on the ground-truth 2D boxes (`# %% Step 7`). nuScenes annotations are 3D, but the loader projects them to 2D per camera. The notebook walks you through running a small object detector (YOLO via the FiftyOne model zoo) to produce predictions, then `fob.compute_mistakenness(cam_view, pred_field="predictions", label_field="ground_truth")` to flag suspect labels. This is the slowest step (~5 min on CPU) and is optional for the done criterion — skip it if you're tight on time.
- Write up precision/recall + failure modes (`# %% Step 8`). The notebook has a markdown cell with a results table you fill in (TP/FP counts at K=20 for each query, your one-paragraph failure-mode summary). This is the artifact that goes into your study journal and that you'll reference in interviews.
- Extend the queries (TODO cell) (`# %% Step 9`). The notebook ends with an empty list of your own queries. Pick three more — e.g., `"car pulling out of a parking spot"`, `"cyclist in a bike lane next to a truck"`, `"rain on the windshield at night"` — and run them. Compare against the seeded three. Note which kinds of scenarios CLIP can and cannot retrieve.
- Export the curated subset (`# %% Step 10`). Tag the true-positive results from your 6 queries with the tag `rare_scenarios` in the App, then export with `dataset.match_tags("rare_scenarios").export(export_dir="outputs/rare_scenarios", dataset_type=fo.types.FiftyOneDataset)`. This artifact is what you'd hand off to a labeling vendor or a model-training pipeline in a production curation flow.
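To make the shape of Steps 1 through 6 concrete, here is a condensed sketch. It assumes the `data/nuscenes/` layout from Setup; the dataset name, the keyframe loop, and the review loop at the end are illustrative, and the projected 2D ground truth plus Steps 7 to 10 live only in `notebook.py`.

```python
# Condensed sketch of Steps 1-6. Assumes nuScenes mini is unpacked under
# data/nuscenes/ as shown in Setup. The dataset name "nuscenes-mini-mining"
# is illustrative; the notebook's version also attaches projected 2D
# ground truth and contains the Step 7-10 cells.
import fiftyone as fo
import fiftyone.brain as fob
from nuscenes.nuscenes import NuScenes

CAMERAS = [
    "CAM_FRONT", "CAM_FRONT_LEFT", "CAM_FRONT_RIGHT",
    "CAM_BACK", "CAM_BACK_LEFT", "CAM_BACK_RIGHT",
]
SENSORS = CAMERAS + ["LIDAR_TOP"]

nusc = NuScenes(version="v1.0-mini", dataroot="data/nuscenes")

# Step 1: one fo.Group per keyframe, one slice per sensor
dataset = fo.Dataset("nuscenes-mini-mining", overwrite=True)
dataset.add_group_field("group", default="CAM_FRONT")

samples = []
for scene in nusc.scene:
    token = scene["first_sample_token"]
    while token:
        keyframe = nusc.get("sample", token)
        group = fo.Group()
        for sensor in SENSORS:
            # For LIDAR_TOP this is the raw .pcd.bin path (see Common pitfalls)
            filepath = nusc.get_sample_data_path(keyframe["data"][sensor])
            samples.append(fo.Sample(filepath=filepath, group=group.element(sensor)))
        token = keyframe["next"]  # empty string on the last keyframe of a scene

dataset.add_samples(samples)
print(dataset.group_slices)  # expect the 6 cameras + LIDAR_TOP

# Step 3: camera-only view, since CLIP operates on images
cam_view = dataset.select_group_slices(CAMERAS)
print(len(cam_view))  # ~2,400 images

# Step 4: CLIP embeddings + prompt-capable similarity index
index = fob.compute_similarity(
    cam_view,
    model="clip-vit-base32-torch",
    brain_key="clip_sim",
    embeddings="clip_embedding",
)
assert index.config.supports_prompts

# Step 5: uniqueness reuses the stored embeddings
fob.compute_uniqueness(cam_view, embeddings="clip_embedding")

# Step 6: the three seeded text queries, reviewed in the App
session = fo.launch_app(dataset)
for query in (
    "child near road at dusk",
    "emergency vehicle with flashing lights",
    "construction zone with cones",
):
    session.view = cam_view.sort_by_similarity(query, k=20, brain_key="clip_sim")
    input(f"Showing top-20 for {query!r}; press Enter for the next query")
```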
Done criterion
You are done when all four of these are true:
- `dataset.stats()` shows ~400 groups (10 scenes × ~40 keyframes per scene at 2 Hz, give or take) and 7 group slices.
- `cam_view.first().clip_embedding` returns a 512-dim numpy array.
- The three seeded text queries each produced a top-20 view that you reviewed in the FiftyOne App, and the markdown results table in `notebook.py` (Step 8) is filled in with TP-at-20 counts for each query.
- Your "Failure modes" markdown paragraph (Step 8) lists at least three concrete CLIP failure patterns you observed (e.g., "construction cones retrieves traffic-light poles when no cones are in view").
If any of those four are missing, you're not done.
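A quick self-check against the first two criteria; the dataset name and camera list mirror the sketch above, so adjust them to whatever you used in Step 1:

```python
import fiftyone as fo

dataset = fo.load_dataset("nuscenes-mini-mining")  # your Step 1 dataset name
print(dataset.stats(include_media=True))
print(dataset.group_slices)  # expect the 6 cameras + LIDAR_TOP

cam_view = dataset.select_group_slices([
    "CAM_FRONT", "CAM_FRONT_LEFT", "CAM_FRONT_RIGHT",
    "CAM_BACK", "CAM_BACK_LEFT", "CAM_BACK_RIGHT",
])
emb = cam_view.first().clip_embedding
print(len(emb))  # expect 512
```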
Common pitfalls
- `No module named 'fiftyone.utils.nuscenes'` — the FiftyOne nuScenes loader is hand-written by you using the `nuscenes-devkit`; FiftyOne does not ship a turnkey importer. Follow the pattern in `notebook.py` Step 1: instantiate `NuScenes(version='v1.0-mini', dataroot=...)` and walk `nusc.scene` yourself.
- `dataroot` mismatch — the devkit expects the path containing `v1.0-mini/`, not `v1.0-mini/` itself. Symptom: `AssertionError: Database version not found: ...`. Fix: set `dataroot="data/nuscenes"`, not `dataroot="data/nuscenes/v1.0-mini"`.
- CLIP model download stalls behind a corporate proxy. The first call to `compute_similarity(model="clip-vit-base32-torch")` pulls weights from `dl.fbaipublicfiles.com` or HuggingFace. If you're behind a proxy, set `HF_HUB_ENABLE_HF_TRANSFER=1` and `HTTPS_PROXY`/`HTTP_PROXY` before launching Python. Cached weights live under `~/.cache/torch/hub/` and `~/.fiftyone/`.
- App doesn't render LIDAR. The 3D viewer needs a WebGL2-capable browser tab. Safari sometimes hangs; Chrome or Firefox is fine. Also confirm your `LIDAR_TOP` slice samples have `filepath` pointing at the `.pcd.bin` file from `data/nuscenes/samples/LIDAR_TOP/`.
- `sort_by_similarity(text_query)` raises "this index does not support prompts." This means the brain key you're querying was built without a prompt-capable model. Check with `dataset.load_brain_results("clip_sim").config.supports_prompts`. If `False`, rebuild the index with `model="clip-vit-base32-torch"` (or `"open-clip-torch"`) — generic vision-only encoders like ResNet won't accept text. See the sketch after this list.
- Mistakenness fails with `pred_field has no logits`. `compute_mistakenness` works on confidence by default but is much sharper with logits. The default YOLO predictions in the FiftyOne model zoo store `confidence`, not `logits` — that's fine; just leave `use_logits=False` (the default). If you want logits, run a HuggingFace `transformers` detector and store the raw scores yourself.
- MongoDB process lingers after you Ctrl-C. FiftyOne starts a local MongoDB. If your laptop fan kicks on after closing the notebook, run `fo.delete_non_persistent_datasets()` and `fo.core.service.DatabaseService().stop()`, or just `pkill -f "mongod.*fiftyone"`.
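For the prompts pitfall specifically, here is a minimal check-and-rebuild sketch, assuming the dataset and brain-key names used above:

```python
import fiftyone as fo
import fiftyone.brain as fob

dataset = fo.load_dataset("nuscenes-mini-mining")  # your Step 1 dataset name
index = dataset.load_brain_results("clip_sim")

if not index.config.supports_prompts:
    # Index was built with a vision-only encoder; rebuild with CLIP so that
    # text prompts work in sort_by_similarity()
    dataset.delete_brain_run("clip_sim")
    cam_view = dataset.select_group_slices([
        "CAM_FRONT", "CAM_FRONT_LEFT", "CAM_FRONT_RIGHT",
        "CAM_BACK", "CAM_BACK_LEFT", "CAM_BACK_RIGHT",
    ])
    fob.compute_similarity(
        cam_view,
        model="clip-vit-base32-torch",
        brain_key="clip_sim",
        embeddings="clip_embedding",
    )
```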
Further reading
- FiftyOne 1.15 release notes — https://docs.voxel51.com/release-notes.html
- FiftyOne self-driving guide (loading nuScenes step-by-step) — https://docs.voxel51.com/getting_started/self_driving/01_loading_datasets.html
- `fiftyone.brain` API reference (signatures for `compute_uniqueness`, `compute_mistakenness`, `compute_similarity`) — https://docs.voxel51.com/api/fiftyone.brain.html
- Grouped datasets user guide — https://docs.voxel51.com/user_guide/groups.html
- Voxel51 blog — "Navigating the Road Ahead: nuScenes" — https://voxel51.com/blog/nuscenes-dataset-navigating-the-road-ahead
- nuScenes-devkit on GitHub — https://github.com/nutonomy/nuscenes-devkit
- nuScenes dataset paper — Caesar et al., 2020 — https://arxiv.org/abs/1903.11027
- OpenAI CLIP paper (the encoder powering the text queries) — Radford et al., 2021 — https://arxiv.org/abs/2103.00020
- FiftyOne Model Zoo `clip-vit-base32-torch` — https://docs.voxel51.com/model_zoo/models/clip_vit_base32_torch.html
- Applied Intuition Data Explorer overview (so you can see the production analog) — https://www.appliedintuition.com/products/data-explorer
Files in this project
- README.md
- notebook.py
- requirements.txt
- setup.sh
Notebook (notebook.py) is in jupytext percent format — open in VS Code or convert with jupytext --to notebook.