Project 04 — BEVFormer 3D detection on nuScenes
Goal
Run the BEVFormer camera-only 3D object detector on nuScenes, evaluate it on the official 3D-detection metrics (NDS, mAP, plus the five true-positive errors ATE / ASE / AOE / AVE / AAE), and inspect failure modes by operational-design-domain (ODD) slice — day vs night, clear vs rain, easy vs occluded. Then write down, in your own words, why BEV (bird's-eye view) became the dominant representation for AV perception between 2022 and 2026, with the lift-splat-shoot → BEVFormer → BEVFusion lineage as scaffolding.
The deliverable is a notebook that loads pretrained weights, computes real numbers on nuScenes, and produces three artifacts: an NDS / mAP score table, a per-ODD-slice breakdown, and a calibrated paragraph about BEV's place in the perception stack. Training BEVFormer from scratch costs ~8 GPU-days on 8×A100; we explicitly default to inference plus an optional light fine-tune so this project is feasible on a single workstation GPU.
The output is not "I ran a 3D detector." It is intuition: a feel for how multi-camera surround attention lifts pixels into a metric BEV grid, why occlusion handling is easier in BEV than in the image plane, where the 3D failure modes live, and how 3D auto-labels actually get made at Tesla, Waymo, and Wayve.
Loops touched
This project sits in the LABEL & TRAIN loop of the four-loop data engine, with a strong reach into CURATE:
- LABEL & TRAIN. BEVFormer trains on 3D-cuboid labels. Running its eval shows what nuScenes' 7-metric eval actually rewards — i.e. what any auto-labeler downstream has to match. The ODD-slice analysis is the bridge: production teams don't ship a single NDS number; they ship per-slice numbers and gate on the worst slice.
- CURATE (touched). The per-class and per-ODD-slice breakdowns produce mineable failure sets (e.g. pedestrians >40 m at night) that feed Project 16's active-learning queues. Project 03 does this at 2D-mask level; Project 04 at 3D-cuboid level.
It does not touch COLLECT (somebody else's fleet) or closed-loop EVAL (Project 15). And it is not SAM-2 — SAM 2 is for 2D; BEV is the 3D analog: a unified spatial representation downstream tasks plug into. Project 03 was the 2D version of "everything talks to one representation"; Project 04 is the 3D version.
Why this matters for AI Data Intelligence
3D perception in BEV is the AV-native representation. A 2D image-plane box is useful for screenshots; a 3D box in metric BEV is what prediction, planning, and control actually consume. Every serious AV team in 2024–2026 ships a perception model whose head produces 3D outputs in BEV space. To work on Applied Intuition's Data Intelligence team you should be able to draw the BEV pipeline on a napkin and defend every box.
The lineage that matters:
- Lift-Splat-Shoot (LSS) — Philion & Fidler, ECCV 2020. Each camera pixel is "lifted" into a frustum of depths (a soft depth distribution), per-camera feature volumes are "splatted" onto a shared BEV grid via known calibration, and a small BEV CNN "shoots" the predictions. This is when multi-camera fusion stopped being heuristic late fusion and became a single differentiable network. Every modern BEV detector descends from LSS.
- BEVDet / BEVDepth / BEVFormer — 2021–2022. Three branches refine LSS. BEVDet (Huang 2021) productionizes LSS with stronger backbones. BEVDepth (Li, AAAI 2023) shows the lift is depth-bottlenecked and adds explicit LiDAR depth supervision. BEVFormer (Li, ECCV 2022) replaces per-pixel lift with attention queries: a fixed grid of BEV query tokens does spatial cross-attention (each query samples the cameras at the geometric points it projects to) and temporal self-attention (each query attends to the previous timestep's BEV features, ego-motion compensated). The temporal step is what handles occlusion — the network "remembers" what was visible a frame ago. This is what we run in the notebook; a sketch of the spatial cross-attention geometry follows this list.
- BEVFusion — 2022–2023. Two papers (mit-han-lab and ADLab-AutoDrive, both titled "BEVFusion") cement the next move: a unified BEV representation for cameras AND LiDAR. Camera features reach BEV via LSS; LiDAR features reach BEV via voxelization + PointPillars. The two maps are concatenated and decoded jointly. This is what makes auto-labeling actually work in production: cameras give class and lane semantics; LiDAR gives metric depth and free-space; BEV is the shared space where the two finally agree.
- Occupancy networks → world models — 2023–2026. Tesla's AI Day 2022 unveiled "occupancy networks" — a per-voxel occupancy/semantic predictor in BEV — essentially BEVFormer with a denser output head. Wayve's GAIA-2 (2025) and Tesla's world model (2024–2026) extend this generatively: predict the future occupancy grid, not just the current one.
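To make the spatial cross-attention step concrete, here is a minimal NumPy sketch of its geometric core: where a single BEV query looks in the surround cameras. The function name, reference heights, and the per-camera matrices you would pass in are illustrative, not BEVFormer's actual code; the real model additionally learns deformable sampling offsets and attention weights around these reference pixels.

```python
# Illustrative sketch, not BEVFormer source: project one BEV query's metric location
# into every surround camera and keep the views where it lands inside the image.
import numpy as np

def bev_query_reference_pixels(x, y, cam_intrinsics, cam_from_ego, z_refs=(-1.0, 0.0, 1.0, 2.0)):
    """x, y: BEV cell center in the ego frame (meters).
    cam_intrinsics: {cam_name: 3x3 K}; cam_from_ego: {cam_name: 4x4 T_ego->cam} (assumed inputs).
    Returns {cam_name: [(u, v), ...]} for the cameras that actually see this BEV cell."""
    hits = {}
    for name, K in cam_intrinsics.items():
        T = cam_from_ego[name]
        for z in z_refs:                              # lift the 2D BEV cell to a few 3D heights
            p_cam = T @ np.array([x, y, z, 1.0])      # homogeneous point in camera coordinates
            if p_cam[2] <= 0.1:                       # behind (or grazing) the camera plane
                continue
            uvw = K @ p_cam[:3]
            u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]   # perspective divide
            if 0 <= u < 1600 and 0 <= v < 900:        # nuScenes image resolution
                hits.setdefault(name, []).append((u, v))
    return hits

# BEVFormer samples image features (deformably) around these reference pixels and
# aggregates them into the query; queries visible in no camera fall back to their prior.
```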
The connection to multi-sensor fusion and auto-labeling: BEV is the only place in the stack where it is geometrically clean to fuse multiple cameras, LiDARs, and radar. Image-plane fusion is hopeless (each camera lives in its own 2D coords); BEV is metric, sensor-agnostic, and extensible to time. That's why BEV won. Every 3D auto-labeler in production today is BEV-native — including the ones at Applied Intuition's Roads / Aurora customers, who evaluate on something close to nuScenes' eval and gate releases on per-ODD-slice numbers. The notebook in this project is a laptop-scale facsimile of the eval cell every such team runs every night.
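Once both modalities live on that shared grid, the fusion step itself is tiny. A hedged PyTorch sketch of the BEVFusion-style idea, with made-up channel counts and a generic conv block rather than either paper's exact head:

```python
# Sketch under stated assumptions: camera and LiDAR branches each emit a BEV feature map
# over the SAME 200x200 metric grid, so fusion reduces to channel concat + a small conv.
import torch
import torch.nn as nn

bev_cam   = torch.randn(1, 80, 200, 200)   # camera features lifted to BEV (e.g. via LSS)
bev_lidar = torch.randn(1, 80, 200, 200)   # LiDAR features voxelized / pillarized to BEV

fuse = nn.Sequential(                       # illustrative fuser, not a specific paper's layer
    nn.Conv2d(160, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
)
bev_fused = fuse(torch.cat([bev_cam, bev_lidar], dim=1))
print(bev_fused.shape)                      # torch.Size([1, 128, 200, 200]) -> shared detection head
```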
Prerequisites
- Project 01 (FiftyOne / nuScenes scenario mining) recommended. You'll reuse the nuScenes mini download and the intuition for per-class evaluation.
- Project 03 (SAM 2 auto-labeling) recommended for the 2D ↔ 3D parallel — same data engine pattern, different output dimension.
- Comfortable with PyTorch and 3D geometry (rotations, homogeneous transforms, projecting between sensor frames). §3 walks through the math but assumes you've seen rotations before.
- Basic OpenMMLab familiarity helps but is not required.
- ~45 min setup; ~30 min per inference-only notebook run.
Hardware
| Tier | Path | Notes |
|---|---|---|
| RTX 4090 / A6000 / A100 (24 GB+) | Full inference + light fine-tune | BEVFormer-base, full v1.0-trainval val split, ~25 min eval. |
| RTX 4070 / 4080 (12–16 GB) | Inference only | BEVFormer-tiny on nuScenes mini. ~15 min. |
| RTX 4060 / 3060 (8 GB) | Mini-only | BEVFormer-tiny + reduced batch. 81 val samples. Numbers not paper-comparable. |
| Mac (M-series) | §3 calibration only | MSDA CUDA op has no MPS/CPU fallback. §4+ needs a remote NVIDIA box. |
| CPU only | Not supported | Same reason. |
Working assumption: RTX 4070+ for inference. Training from scratch is out of scope (~8 GPU-days on 8×A100 for 51.7 NDS). The §11 cell shows a 2-epoch fine-tune for feel, not numbers.
Setup
```bash
cd projects/04-bevformer-3d-detection/
chmod +x setup.sh
./setup.sh                 # creates .venv (Python 3.8), installs the legacy OpenMMLab 1.x stack,
                           # clones BEVFormer, builds MSDA CUDA op, downloads tiny checkpoint
source .venv/bin/activate
```

What `setup.sh` does (idempotent — safe to re-run):
- Python 3.8 sanity. mmcv-full 1.4.0 only ships wheels for 3.6–3.9; 3.8 is the cleanest. If you have only 3.10+, install 3.8 via pyenv.
- CUDA / nvcc check. BEVFormer's `MultiScaleDeformableAttention` has no CPU fallback. Driver ≥450.80.02 + CUDA 11.1 toolkit is the tested combo.
- Pin pip + setuptools. `setuptools<60` is required because mmdet3d 0.17.1's `python setup.py install` predates PEP 660.
- Install torch 1.9.1 + cu111 from the PyTorch wheel index. This must happen before mmcv-full so the OpenMMLab index can pick the right wheel.
- Install mmcv-full 1.4.0 from the OpenMMLab cu111/torch1.9.0 wheel index. The URL encodes torch and cuda — mismatching either triggers a doomed source build.
- Install mmdet 2.14.0 + mmsegmentation 0.14.1 from PyPI.
- Clone mmdetection3d v0.17.1, `pip install -e .` (compiles ~1 min of C++ ops).
- Install the rest of `requirements.txt` — nuscenes-devkit, numpy 1.19, scipy 1.7, etc.
- Clone fundamentalvision/BEVFormer, build its custom `MultiScaleDeformableAttention` CUDA op via `python setup.py build install` from `projects/mmdet3d_plugin/ops/`.
- Pre-create `data/nuscenes/`, `ckpts/`, `outputs/`.
- Download `bevformer_tiny_epoch_24.pth` (~250 MB) from the BEVFormer release.
- Final import sanity check — imports torch, mmcv, mmdet, mmdet3d, mmseg, nuscenes and prints versions.
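That final check amounts to something like the sketch below. This is an approximation of what the script does, not its literal code; the `mmcv.ops` import is used here as a convenient smoke test for the compiled deformable-attention kernels.

```python
# Hedged approximation of setup.sh's final sanity check: import the pinned stack,
# print versions, and confirm the CUDA-dependent ops actually import.
import torch, mmcv, mmdet, mmseg, mmdet3d
import nuscenes  # nuscenes-devkit; importing it is the check

print("torch    ", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("mmcv-full", mmcv.__version__)
print("mmdet    ", mmdet.__version__)
print("mmseg    ", mmseg.__version__)
print("mmdet3d  ", mmdet3d.__version__)

# If this fails, the mmcv-full wheel does not match your torch/CUDA combination.
from mmcv.ops import MultiScaleDeformableAttention  # noqa: F401
print("MultiScaleDeformableAttention import OK")
```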
You then need to bring your own data:
- Recommended for inference: download nuScenes mini (~4 GB) from https://www.nuscenes.org/nuscenes#download. Unpack into `data/nuscenes/` so you have `data/nuscenes/v1.0-mini/`, `data/nuscenes/samples/`, `data/nuscenes/sweeps/`. The notebook auto-detects this (see the sketch after this list) and runs the inference + eval cells.
- Required for paper-comparable numbers: v1.0-trainval (~350 GB). This is unavoidable for the 6019-sample val split that the paper reports on; mini's 81 samples have far too much variance to reproduce the headline 0.517 NDS.
- CAN-bus expansion: also needed by BEVFormer for ego-motion. Drop `nuScenes-can_bus.zip` next to the unpacked dataset.
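The auto-detection mentioned in the first bullet is nothing more exotic than a path check; a small sketch with illustrative variable names:

```python
# Sketch of the dataset auto-detection: prefer trainval if present, fall back to mini.
from pathlib import Path

data_root = Path("data/nuscenes")
if (data_root / "v1.0-trainval").exists():
    nuscenes_version = "v1.0-trainval"    # paper-comparable 6019-sample val split
elif (data_root / "v1.0-mini").exists():
    nuscenes_version = "v1.0-mini"        # 81 val samples: plumbing, not benchmarking
else:
    raise FileNotFoundError(f"No nuScenes metadata folder found under {data_root}")
print("Using", nuscenes_version)
```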
After data is in place, run BEVFormer's data-prep script to build the pickled info files (one-time, ~5 min on mini):
```bash
cd third_party/BEVFormer
python tools/create_data.py nuscenes \
    --root-path ../../data/nuscenes \
    --out-dir   ../../data/nuscenes \
    --extra-tag nuscenes \
    --version   v1.0-mini \
    --canbus    ../../data/nuscenes
```

Steps
1. Configure. Edit the `Config` dataclass at the top of the notebook. Knobs: `model_size` (tiny / small / base), `nuscenes_version` (v1.0-mini / v1.0-trainval), `eval_subset_size` (cap eval samples for laptop runs).
2. Hardware sanity. Print torch + CUDA + GPU + VRAM. Bail loudly if no CUDA — the MSDA op has no fallback.
3. Sensor calibration walkthrough. Load one nuScenes sample, visualize the 6 surround cameras and the LiDAR sweep, and walk through the ego frame ↔ sensor frame transforms. Show the intrinsic matrix `K` for one camera, the extrinsics `T_lidar→ego` and `T_camera→ego`, and project a 3D LiDAR point onto a camera image with the math written out (a sketch of this projection appears after this list). Foundational for everything that follows.
4. Load BEVFormer. Build the model from `projects/configs/bevformer/bevformer_tiny.py`, load the pretrained checkpoint, push it to the GPU, and assert the MultiScaleDeformableAttention op imports cleanly.
5. Run inference on the val subset. Each sample produces a list of 3D boxes with class, score, and a 9-DOF box (xyz, lwh, yaw, vel_x, vel_y).
6. Compute official nuScenes detection metrics. Use the official `nuscenes.eval.detection.evaluate.DetectionEval` class to produce NDS, mAP, ATE, ASE, AOE, AVE, AAE. Tabulate (see the evaluator sketch after this list).
7. Per-class breakdown. Same metrics decomposed by the 10 nuScenes detection classes (car, truck, bus, trailer, construction_vehicle, pedestrian, motorcycle, bicycle, traffic_cone, barrier). The class imbalance is enormous — barriers and trailers are rare; cars and pedestrians dominate.
8. Per-ODD-slice analysis. Split the val set on the `description` field of each scene (day/night, clear/rain) and recompute mAP per slice (see the slicing sketch after this list). The night-rain slice is where every AV stack's worst numbers live.
9. Visualize. Render BEV ground truth (orange boxes) vs predictions (blue boxes) on the BEV grid, AND project the predicted 3D boxes onto the 6 camera images. Save to `outputs/figures/`.
10. Reflection cell. Why BEV won — the LSS lineage in your own words. Three paragraphs, with the notebook providing prompts.
11. (Optional) Light fine-tune. Two-epoch fine-tune on mini for the curious. Will not improve numbers — there's not enough data — but it lets you feel the train loop.
12. Compare to BEVFusion (TODO for the user). Pointer to the BEVFusion install path and a paragraph on what the camera-vs-camera+LiDAR delta looks like in numbers.
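Three sketches to anchor the steps above. First, the §3 projection walkthrough. The nuscenes-devkit calls are real; which sample you pick and what you do with the projected points are up to you. Note that steps 3 and 4 invert the stored sensor→ego transforms, which is exactly the direction trap flagged in the pitfalls below.

```python
# Sketch: project the LiDAR sweep of one keyframe into CAM_FRONT using the devkit records.
import numpy as np
from pyquaternion import Quaternion
from nuscenes.nuscenes import NuScenes
from nuscenes.utils.data_classes import LidarPointCloud
from nuscenes.utils.geometry_utils import view_points

nusc = NuScenes(version="v1.0-mini", dataroot="data/nuscenes", verbose=False)
sample   = nusc.sample[0]
lidar_sd = nusc.get("sample_data", sample["data"]["LIDAR_TOP"])
cam_sd   = nusc.get("sample_data", sample["data"]["CAM_FRONT"])
pc = LidarPointCloud.from_file(nusc.get_sample_data_path(lidar_sd["token"]))

# 1) LiDAR frame -> ego frame: calibrated_sensor stores T_sensor->ego (rotate, then translate).
cs = nusc.get("calibrated_sensor", lidar_sd["calibrated_sensor_token"])
pc.rotate(Quaternion(cs["rotation"]).rotation_matrix)
pc.translate(np.array(cs["translation"]))

# 2) ego -> global at the LiDAR timestamp.
ep = nusc.get("ego_pose", lidar_sd["ego_pose_token"])
pc.rotate(Quaternion(ep["rotation"]).rotation_matrix)
pc.translate(np.array(ep["translation"]))

# 3) global -> ego at the camera timestamp (the INVERSE: translate back, then rotate back).
ep_cam = nusc.get("ego_pose", cam_sd["ego_pose_token"])
pc.translate(-np.array(ep_cam["translation"]))
pc.rotate(Quaternion(ep_cam["rotation"]).rotation_matrix.T)

# 4) ego -> camera frame (again the inverse of the stored T_camera->ego).
cs_cam = nusc.get("calibrated_sensor", cam_sd["calibrated_sensor_token"])
pc.translate(-np.array(cs_cam["translation"]))
pc.rotate(Quaternion(cs_cam["rotation"]).rotation_matrix.T)

# 5) perspective projection with the camera intrinsics K.
K = np.array(cs_cam["camera_intrinsic"])
depths = pc.points[2, :]
uv = view_points(pc.points[:3, :], K, normalize=True)   # 3 x N pixel coordinates
print("points in front of CAM_FRONT:", int((depths > 1.0).sum()))
```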
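Second, the §6 metric computation with the official devkit evaluator. The results-JSON path and output directory are placeholders for wherever your inference cell writes the nuScenes-format submission file; the evaluator class and config name are the real source of truth for NDS.

```python
# Sketch: score a nuScenes-format results file with the official DetectionEval.
from nuscenes.nuscenes import NuScenes
from nuscenes.eval.detection.config import config_factory
from nuscenes.eval.detection.evaluate import DetectionEval

nusc = NuScenes(version="v1.0-mini", dataroot="data/nuscenes", verbose=False)
cfg  = config_factory("detection_cvpr_2019")            # the official detection-eval config

nusc_eval = DetectionEval(
    nusc,
    config=cfg,
    result_path="outputs/results_nusc.json",            # placeholder: your predictions
    eval_set="mini_val",                                 # "val" on v1.0-trainval
    output_dir="outputs/eval",
    verbose=True,
)
metrics = nusc_eval.main(plot_examples=0, render_curves=False)
print("NDS:", metrics["nd_score"], "| mAP:", metrics["mean_ap"])
for name, err in metrics["tp_errors"].items():           # ATE / ASE / AOE / AVE / AAE
    print(f"  {name}: {err:.3f}")
```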
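Third, the §8 ODD slicing. Keyword matching on the free-text scene description is a pragmatic starting point, not an official taxonomy; to score a slice, restrict predictions and ground truth to its sample tokens and re-run the evaluator.

```python
# Sketch: bucket val samples into ODD slices using the scene description field.
from collections import defaultdict
from nuscenes.nuscenes import NuScenes

nusc = NuScenes(version="v1.0-mini", dataroot="data/nuscenes", verbose=False)

def odd_slice(scene):
    desc = (scene["description"] + " " + scene["name"]).lower()
    tod  = "night" if "night" in desc else "day"
    wx   = "rain" if "rain" in desc else "clear"
    return f"{tod}-{wx}"

samples_by_slice = defaultdict(list)
for scene in nusc.scene:
    key, tok = odd_slice(scene), scene["first_sample_token"]
    while tok:                                            # walk the scene's sample linked list
        samples_by_slice[key].append(tok)
        tok = nusc.get("sample", tok)["next"]

for key, toks in sorted(samples_by_slice.items()):
    print(f"{key:12s} {len(toks):4d} samples")
```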
Done criterion
You're done when all of these are true:
- Notebook runs end-to-end with no errors.
- §6 prints NDS and mAP as actual numbers. On the v1.0-trainval val split with BEVFormer-base you should be within ~0.01 of the paper's 0.517 NDS / 0.416 mAP. On mini the numbers will be much noisier (≈0.30 NDS) — that is expected and the right finding: mini is for plumbing, not for benchmarking.
- §7 prints a per-class table with all 10 classes; you can explain why barrier and traffic_cone numbers are noisy (low support) and why pedestrian AOE is high (yaw is poorly defined for pedestrians).
- §8 prints day/night/rain slice numbers; you have written down which slice is worst and a hypothesis why.
- §9 produced both a BEV figure and per-camera reprojection figure; both saved to disk.
- You can answer, in 3 sentences each, on a whiteboard: (a) why BEV; (b) what spatial cross-attention does in BEVFormer; (c) what BEVFusion adds beyond BEVFormer.
Common pitfalls
- mmcv–mmdet–mmdet3d version mismatch. The biggest source of pain. The pinned combo is `mmcv-full 1.4.0` + `mmdet 2.14.0` + `mmsegmentation 0.14.1` + `mmdet3d 0.17.1`. Do not try to modernize to OpenMMLab 2.x for BEVFormer — the registry rename and the new `mmengine` runner are not backwards compatible, and BEVFormer's plugin code imports private 1.x APIs that are simply gone in 2.x. The clean modernization path is to switch detectors entirely (use BEVFusion via mmdet3d 1.4 instead); keep BEVFormer on the 1.x stack.
- Torch + CUDA + mmcv wheel URL alignment. The OpenMMLab wheel URL encodes both torch and cuda: https://download.openmmlab.com/mmcv/dist/cu111/torch1.9.0/index.html. If you install torch 1.10 and try to fetch mmcv-full 1.4.0 from the cu111/torch1.9.0 index, pip will silently fall back to a source build, which then takes ~25 minutes and fails on glibc. `setup.sh` does this for you; if you're poking by hand, double-check the URL.
- `MultiScaleDeformableAttention` build failure. The custom CUDA op under `BEVFormer/projects/mmdet3d_plugin/ops/` builds with `python setup.py build install`. On nvcc-version mismatches the build emits ~50 lines of warnings, then errors with `nvcc fatal: unsupported gpu architecture`. Fix: `export TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6"` (Turing + Ampere) or the specific compute capability of your GPU (RTX 4090 is 8.9, A100 is 8.0, V100 is 7.0).
- Sensor extrinsics — the `T_sensor→ego` direction trap. nuScenes stores `calibrated_sensor` records as `translation` + `rotation` (quaternion). These are `T_sensor→ego` — the transform that maps a point in the sensor frame to the ego frame. To go the other way (project an ego-frame box into the camera) you invert. Half of all BEVFormer-related Stack Overflow questions are people who got this direction wrong and got either mirrored detections or detections way out in space. The §3 calibration walkthrough writes the transforms out explicitly to immunize you.
- BEV grid resolution tradeoffs. BEVFormer's BEV grid is 200×200 cells covering [-51.2, 51.2] m × [-51.2, 51.2] m, i.e. 0.512 m per cell. Halving the cell size to 0.256 m doubles each side of the grid and quadruples the BEV-attention compute, for marginal gains on the small-object metrics. Doubling the range to ±100 m for highway scenarios is a more impactful knob than halving the resolution. Most teams default to the nuScenes-tuned grid because nuScenes' own labels stop at ±50 m; if you're working with Waymo or Argoverse 2 (which extend to ±150 m) the BEV grid must be re-tuned.
- CAN-bus data is mandatory for BEVFormer. The temporal self-attention takes ego-motion as input. nuScenes ships ego motion in a separate `nuScenes-can_bus.zip` archive; if you skip it, BEVFormer's data pipeline crashes with a `FileNotFoundError` on `can_bus_info_path`. The data-prep `--canbus` flag points to the unzipped CAN-bus root.
- `numpy>=1.20` deprecation breakage. mmdet 2.14 still uses the `np.float` / `np.int` aliases that numpy 1.20 deprecated and 1.24 removed. Pin `numpy==1.19.5`. If you see `AttributeError: module 'numpy' has no attribute 'float'`, you've drifted off the pin.
- `setuptools>=60` breaks `python setup.py install`. PEP 660 removed the legacy path mmdet3d 0.17 uses. Pin `setuptools==59.5.0` before installing mmdet3d, not after.
Further reading
The pipeline papers (in lineage order):
- Philion & Fidler, Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D, ECCV 2020. https://arxiv.org/abs/2008.05711
- Huang et al., BEVDet: High-Performance Multi-Camera 3D Object Detection in Bird-Eye-View, 2021. https://arxiv.org/abs/2112.11790
- Li et al., BEVDepth: Acquisition of Reliable Depth for Multi-View 3D Object Detection, AAAI 2023. https://arxiv.org/abs/2206.10092
- Li et al., BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers, ECCV 2022. https://arxiv.org/abs/2203.17270
- Yang et al., BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision, CVPR 2023. https://arxiv.org/abs/2211.10439
Multi-sensor fusion in BEV:
- Liu et al., BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation, ICRA 2023 (mit-han-lab). https://arxiv.org/abs/2205.13542
- Liang et al., BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework, NeurIPS 2022 (ADLab-AutoDrive). Different paper, same name — read both. https://arxiv.org/abs/2205.13790
- Liu et al., PETR (ECCV 2022) / PETRv2 (ICCV 2023). The 3D-positional-embedding alternative to BEVFormer's geometric-projection sampling.
Production-relevant context:
- Tesla AI Day 2022 — the occupancy-network walkthrough is the natural sequel to BEVFormer. (YouTube; the relevant segment starts ~50:00 into the 2022 stream.)
- Caesar et al., nuScenes: A Multimodal Dataset for Autonomous Driving, CVPR 2020. The dataset paper — read sections 4 (sensor setup) and 6 (eval metrics) carefully. https://arxiv.org/abs/1903.11027
- nuScenes detection-eval reference implementation (the source of truth for NDS): https://github.com/nutonomy/nuscenes-devkit/tree/master/python-sdk/nuscenes/eval/detection
- mmdetection3d's `projects/BEVFusion/` — the cleanest modern reference implementation if you want to stay on OpenMMLab 2.x: https://github.com/open-mmlab/mmdetection3d/tree/main/projects/BEVFusion
For the strategy memo (Project 18):
- Wayve's GAIA-2 (2025) and Tesla's world-model talks are legible through the BEV-as-substrate lens — predicting future BEV occupancy is the natural extension of predicting current BEV.
- Yang et al., Auto4D (Uber ATG, 2021) and Qi et al., Offboard 3D Object Detection from Point Cloud Sequences (Waymo, 2021) — both 3D-auto-labeler papers, both BEV-native.
Files in this project
- README.md
- notebook.py
- requirements.txt
- setup.sh
Notebook (`notebook.py`) is in jupytext percent format — open it in VS Code or convert with `jupytext --to notebook`.