Project 04 — BEVFormer 3D detection on nuScenes
Goal
Run the BEVFormer camera-only 3D object detector on nuScenes, evaluate it on the official 3D-detection metrics (NDS, mAP, plus the five true-positive errors ATE / ASE / AOE / AVE / AAE), and inspect failure modes by operational-design-domain (ODD) slice — day vs night, clear vs rain, easy vs occluded. Then write down, in your own words, why BEV (bird's-eye view) became the dominant representation for AV perception between 2022 and 2026, with the lift-splat-shoot → BEVFormer → BEVFusion lineage as scaffolding.
The deliverable is a notebook that loads pretrained weights, computes real numbers on nuScenes, and produces three artifacts: an NDS / mAP score table, a per-ODD-slice breakdown, and a calibrated paragraph about BEV's place in the perception stack. Training BEVFormer from scratch costs ~8 GPU-days on 8×A100; we explicitly default to inference plus an optional light fine-tune so this project is feasible on a single workstation GPU.
The output is not "I ran a 3D detector." It is intuition: a feel for how multi-camera surround attention lifts pixels into a metric BEV grid, why occlusion handling is easier in BEV than in the image plane, where the 3D failure modes live, and how 3D auto-labels actually get made at Tesla, Waymo, and Wayve.
Loops touched
This project sits in the LABEL & TRAIN loop of the four-loop data engine, with a strong reach into CURATE:
- LABEL & TRAIN. BEVFormer trains on 3D-cuboid labels. Running its eval shows what nuScenes' 7-metric eval actually rewards — i.e. what any auto-labeler downstream has to match. The ODD-slice analysis is the bridge: production teams don't ship a single NDS number; they ship per-slice numbers and gate on the worst slice.
- CURATE (touched). The per-class and per-ODD-slice breakdowns produce mineable failure sets (e.g. pedestrians >40 m at night) that feed Project 16's active-learning queues. Project 03 does this at 2D-mask level; Project 04 at 3D-cuboid level.
It does not touch COLLECT (somebody else's fleet) or closed-loop EVAL (Project 15). And it is not SAM-2 — SAM 2 is for 2D; BEV is the 3D analog: a unified spatial representation downstream tasks plug into. Project 03 was the 2D version of "everything talks to one representation"; Project 04 is the 3D version.
Why this matters for AI Data Intelligence
3D perception in BEV is the AV-native representation. A 2D image-plane box is useful for screenshots; a 3D box in metric BEV is what prediction, planning, and control actually consume. Every serious AV team in 2024–2026 ships a perception model whose head produces 3D outputs in BEV space. To work on Applied Intuition's Data Intelligence team you should be able to draw the BEV pipeline on a napkin and defend every box.
The lineage that matters:
- Lift-Splat-Shoot (LSS) — Philion & Fidler, ECCV 2020. Each camera pixel is "lifted" into a frustum of depths (a soft depth distribution), per-camera feature volumes are "splatted" onto a shared BEV grid via known calibration, and a small BEV CNN "shoots" the predictions. This is when multi-camera fusion stopped being heuristic late fusion and became a single differentiable network. Every modern BEV detector descends from LSS.
- BEVDet / BEVDepth / BEVFormer — 2021–2022. Three branches refine LSS. BEVDet (Huang 2021) productionizes LSS with stronger backbones. BEVDepth (Li, AAAI 2023) shows the lift is depth-bottlenecked and adds explicit LiDAR depth supervision. BEVFormer (Li, ECCV 2022) replaces per-pixel lift with attention queries: a fixed grid of BEV query tokens does spatial cross-attention (each query samples the cameras at the geometric points it projects to) and temporal self-attention (each query attends to the previous timestep's BEV features, ego-motion compensated). The temporal step is what handles occlusion — the network "remembers" what was visible a frame ago. This is what we run in the notebook; a sketch of the spatial cross-attention geometry follows this list.
- BEVFusion — 2022–2023. Two papers (mit-han-lab and ADLab-AutoDrive, both titled "BEVFusion") cement the next move: a unified BEV representation for cameras AND LiDAR. Camera features reach BEV via LSS; LiDAR features reach BEV via voxelization + PointPillars. The two maps are concatenated and decoded jointly. This is what makes auto-labeling actually work in production: cameras give class and lane semantics; LiDAR gives metric depth and free-space; BEV is the shared space where the two finally agree.
- Occupancy networks → world models — 2023–2026. Tesla's AI Day 2022 unveiled "occupancy networks" — a per-voxel occupancy/semantic predictor in BEV — essentially BEVFormer with a denser output head. Wayve's GAIA-2 (2025) and Tesla's world model (2024–2026) extend this generatively: predict the future occupancy grid, not just the current one.
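To make the spatial cross-attention step concrete, here is a minimal NumPy sketch of its geometric core: where a single BEV query looks in the surround cameras. The function name, reference heights, and the per-camera matrices you would pass in are illustrative, not BEVFormer's actual code; the real model additionally learns deformable sampling offsets and attention weights around these reference pixels.

```python
# Illustrative sketch, not BEVFormer source: project one BEV query's metric location
# into every surround camera and keep the views where it lands inside the image.
import numpy as np

def bev_query_reference_pixels(x, y, cam_intrinsics, cam_from_ego, z_refs=(-1.0, 0.0, 1.0, 2.0)):
    """x, y: BEV cell center in the ego frame (meters).
    cam_intrinsics: {cam_name: 3x3 K}; cam_from_ego: {cam_name: 4x4 T_ego->cam} (assumed inputs).
    Returns {cam_name: [(u, v), ...]} for the cameras that actually see this BEV cell."""
    hits = {}
    for name, K in cam_intrinsics.items():
        T = cam_from_ego[name]
        for z in z_refs:                              # lift the 2D BEV cell to a few 3D heights
            p_cam = T @ np.array([x, y, z, 1.0])      # homogeneous point in camera coordinates
            if p_cam[2] <= 0.1:                       # behind (or grazing) the camera plane
                continue
            uvw = K @ p_cam[:3]
            u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]   # perspective divide
            if 0 <= u < 1600 and 0 <= v < 900:        # nuScenes image resolution
                hits.setdefault(name, []).append((u, v))
    return hits

# BEVFormer samples image features (deformably) around these reference pixels and
# aggregates them into the query; queries visible in no camera fall back to their prior.
```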
The connection to multi-sensor fusion and auto-labeling: BEV is the only place in the stack where it is geometrically clean to fuse multiple cameras, LiDARs, and radar. Image-plane fusion is hopeless (each camera lives in its own 2D coords); BEV is metric, sensor-agnostic, and extensible to time. That's why BEV won. Every 3D auto-labeler in production today is BEV-native — including the ones at Applied Intuition's Roads / Aurora customers, who evaluate on something close to nuScenes' eval and gate releases on per-ODD-slice numbers. The notebook in this project is a laptop-scale facsimile of the eval cell every such team runs every night.
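Once both modalities live on that shared grid, the fusion step itself is tiny. A hedged PyTorch sketch of the BEVFusion-style idea, with made-up channel counts and a generic conv block rather than either paper's exact head:

```python
# Sketch under stated assumptions: camera and LiDAR branches each emit a BEV feature map
# over the SAME 200x200 metric grid, so fusion reduces to channel concat + a small conv.
import torch
import torch.nn as nn

bev_cam   = torch.randn(1, 80, 200, 200)   # camera features lifted to BEV (e.g. via LSS)
bev_lidar = torch.randn(1, 80, 200, 200)   # LiDAR features voxelized / pillarized to BEV

fuse = nn.Sequential(                       # illustrative fuser, not a specific paper's layer
    nn.Conv2d(160, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
)
bev_fused = fuse(torch.cat([bev_cam, bev_lidar], dim=1))
print(bev_fused.shape)                      # torch.Size([1, 128, 200, 200]) -> shared detection head
```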
Prerequisites
- Project 01 (FiftyOne / nuScenes scenario mining) recommended. You'll reuse the nuScenes mini download and the intuition for per-class evaluation.
- Project 03 (SAM 2 auto-labeling) recommended for the 2D ↔ 3D parallel — same data engine pattern, different output dimension.
- Comfortable with PyTorch and 3D geometry (rotations, homogeneous transforms, projecting between sensor frames). §3 walks through the math but assumes you've seen rotations before.
- Basic OpenMMLab familiarity helps but is not required.
- ~45 min setup; ~30 min per inference-only notebook run.
Hardware
| Tier | Path | Notes |
|---|---|---|
| RTX 4090 / A6000 / A100 (24 GB+) | Full inference + light fine-tune | BEVFormer-base, full v1.0-trainval val split, ~25 min eval. |
| RTX 4070 / 4080 (12–16 GB) | Inference only | BEVFormer-tiny on nuScenes mini. ~15 min. |
| RTX 4060 / 3060 (8 GB) | Mini-only | BEVFormer-tiny + reduced batch. 81 val samples. Numbers not paper-comparable. |
| Mac (M-series) | §3 calibration only | MSDA CUDA op has no MPS/CPU fallback. §4+ needs a remote NVIDIA box. |
| CPU only | Not supported | Same reason. |
Working assumption: RTX 4070+ for inference. Training from scratch is out of scope (~8 GPU-days on 8×A100 for 51.7 NDS). The §11 cell shows a 2-epoch fine-tune for feel, not numbers.
Setup
```bash
cd projects/04-bevformer-3d-detection/
chmod +x setup.sh
./setup.sh                 # creates .venv (Python 3.8), installs the legacy OpenMMLab 1.x stack,
                           # clones BEVFormer, builds MSDA CUDA op, downloads tiny checkpoint
source .venv/bin/activate
```

What `setup.sh` does (idempotent — safe to re-run):
- Python 3.8 sanity. mmcv-full 1.4.0 only ships wheels for 3.6–3.9; 3.8 is the cleanest. If you have only 3.10+, install 3.8 via pyenv.
- CUDA / nvcc check. BEVFormer's `MultiScaleDeformableAttention` has no CPU fallback. Driver ≥450.80.02 + CUDA 11.1 toolkit is the tested combo.
- Pin pip + setuptools. `setuptools<60` is required because mmdet3d 0.17.1's `python setup.py install` predates PEP 660.
- Install torch 1.9.1 + cu111 from the PyTorch wheel index. This must happen before mmcv-full so the OpenMMLab index can pick the right wheel.
- Install mmcv-full 1.4.0 from the OpenMMLab cu111/torch1.9.0 wheel index. The URL encodes torch and cuda — mismatching either triggers a doomed source build.
- Install mmdet 2.14.0 + mmsegmentation 0.14.1 from PyPI.
- Clone mmdetection3d v0.17.1, `pip install -e .` (compiles ~1 min of C++ ops).
- Install the rest of `requirements.txt` — nuscenes-devkit, numpy 1.19, scipy 1.7, etc.
- Clone fundamentalvision/BEVFormer, build its custom `MultiScaleDeformableAttention` CUDA op via `python setup.py build install` from `projects/mmdet3d_plugin/ops/`.
- Pre-create `data/nuscenes/`, `ckpts/`, `outputs/`.
- Download `bevformer_tiny_epoch_24.pth` (~250 MB) from the BEVFormer release.
- Final import sanity check — imports torch, mmcv, mmdet, mmdet3d, mmseg, nuscenes and prints versions.
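That final check amounts to something like the sketch below. This is an approximation of what the script does, not its literal code; the `mmcv.ops` import is used here as a convenient smoke test for the compiled deformable-attention kernels.

```python
# Hedged approximation of setup.sh's final sanity check: import the pinned stack,
# print versions, and confirm the CUDA-dependent ops actually import.
import torch, mmcv, mmdet, mmseg, mmdet3d
import nuscenes  # nuscenes-devkit; importing it is the check

print("torch    ", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("mmcv-full", mmcv.__version__)
print("mmdet    ", mmdet.__version__)
print("mmseg    ", mmseg.__version__)
print("mmdet3d  ", mmdet3d.__version__)

# If this fails, the mmcv-full wheel does not match your torch/CUDA combination.
from mmcv.ops import MultiScaleDeformableAttention  # noqa: F401
print("MultiScaleDeformableAttention import OK")
```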
You then need to bring your own data:
- Recommended for inference: download nuScenes mini (~4 GB) from https://www.nuscenes.org/nuscenes#download. Unpack into `data/nuscenes/` so you have `data/nuscenes/v1.0-mini/`, `data/nuscenes/samples/`, `data/nuscenes/sweeps/`. The notebook auto-detects this (see the sketch after this list) and runs the inference + eval cells.
- Required for paper-comparable numbers: v1.0-trainval (~350 GB). This is unavoidable for the 6019-sample val split that the paper reports on; mini's 81 samples have far too much variance to reproduce the headline 0.517 NDS.
- CAN-bus expansion: also needed by BEVFormer for ego-motion. Drop `nuScenes-can_bus.zip` next to the unpacked dataset.
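The auto-detection mentioned in the first bullet is nothing more exotic than a path check; a small sketch with illustrative variable names:

```python
# Sketch of the dataset auto-detection: prefer trainval if present, fall back to mini.
from pathlib import Path

data_root = Path("data/nuscenes")
if (data_root / "v1.0-trainval").exists():
    nuscenes_version = "v1.0-trainval"    # paper-comparable 6019-sample val split
elif (data_root / "v1.0-mini").exists():
    nuscenes_version = "v1.0-mini"        # 81 val samples: plumbing, not benchmarking
else:
    raise FileNotFoundError(f"No nuScenes metadata folder found under {data_root}")
print("Using", nuscenes_version)
```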
After data is in place, run BEVFormer's data-prep script to build the pickled info files (one-time, ~5 min on mini):
```bash
cd third_party/BEVFormer
python tools/create_data.py nuscenes \
    --root-path ../../data/nuscenes \
    --out-dir   ../../data/nuscenes \
    --extra-tag nuscenes \
    --version   v1.0-mini \
    --canbus    ../../data/nuscenes
```

Steps
1. Configure. Edit the `Config` dataclass at the top of the notebook. Knobs: `model_size` (tiny / small / base), `nuscenes_version` (v1.0-mini / v1.0-trainval), `eval_subset_size` (cap eval samples for laptop runs).
2. Hardware sanity. Print torch + CUDA + GPU + VRAM. Bail loudly if no CUDA — the MSDA op has no fallback.
3. Sensor calibration walkthrough. Load one nuScenes sample, visualize the 6 surround cameras and the LiDAR sweep, and walk through the ego frame ↔ sensor frame transforms. Show the intrinsic matrix `K` for one camera, the extrinsics `T_lidar→ego` and `T_camera→ego`, and project a 3D LiDAR point onto a camera image with the math written out (a sketch of this projection appears after this list). Foundational for everything that follows.
4. Load BEVFormer. Build the model from `projects/configs/bevformer/bevformer_tiny.py`, load the pretrained checkpoint, push it to the GPU, and assert the MultiScaleDeformableAttention op imports cleanly.
5. Run inference on the val subset. Each sample produces a list of 3D boxes with class, score, and a 9-DOF box (xyz, lwh, yaw, vel_x, vel_y).
6. Compute official nuScenes detection metrics. Use the official `nuscenes.eval.detection.evaluate.DetectionEval` class to produce NDS, mAP, ATE, ASE, AOE, AVE, AAE. Tabulate (see the evaluator sketch after this list).
7. Per-class breakdown. Same metrics decomposed by the 10 nuScenes detection classes (car, truck, bus, trailer, construction_vehicle, pedestrian, motorcycle, bicycle, traffic_cone, barrier). The class imbalance is enormous — barriers and trailers are rare; cars and pedestrians dominate.
8. Per-ODD-slice analysis. Split the val set on the `description` field of each scene (day/night, clear/rain) and recompute mAP per slice (see the slicing sketch after this list). The night-rain slice is where every AV stack's worst numbers live.
9. Visualize. Render BEV ground truth (orange boxes) vs predictions (blue boxes) on the BEV grid, AND project the predicted 3D boxes onto the 6 camera images. Save to `outputs/figures/`.
10. Reflection cell. Why BEV won — the LSS lineage in your own words. Three paragraphs, with the notebook providing prompts.
11. (Optional) Light fine-tune. Two-epoch fine-tune on mini for the curious. Will not improve numbers — there's not enough data — but it lets you feel the train loop.
12. Compare to BEVFusion (TODO for the user). Pointer to the BEVFusion install path and a paragraph on what the camera-vs-camera+LiDAR delta looks like in numbers.
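Three sketches to anchor the steps above. First, the §3 projection walkthrough. The nuscenes-devkit calls are real; which sample you pick and what you do with the projected points are up to you. Note that steps 3 and 4 invert the stored sensor→ego transforms, which is exactly the direction trap flagged in the pitfalls below.

```python
# Sketch: project the LiDAR sweep of one keyframe into CAM_FRONT using the devkit records.
import numpy as np
from pyquaternion import Quaternion
from nuscenes.nuscenes import NuScenes
from nuscenes.utils.data_classes import LidarPointCloud
from nuscenes.utils.geometry_utils import view_points

nusc = NuScenes(version="v1.0-mini", dataroot="data/nuscenes", verbose=False)
sample   = nusc.sample[0]
lidar_sd = nusc.get("sample_data", sample["data"]["LIDAR_TOP"])
cam_sd   = nusc.get("sample_data", sample["data"]["CAM_FRONT"])
pc = LidarPointCloud.from_file(nusc.get_sample_data_path(lidar_sd["token"]))

# 1) LiDAR frame -> ego frame: calibrated_sensor stores T_sensor->ego (rotate, then translate).
cs = nusc.get("calibrated_sensor", lidar_sd["calibrated_sensor_token"])
pc.rotate(Quaternion(cs["rotation"]).rotation_matrix)
pc.translate(np.array(cs["translation"]))

# 2) ego -> global at the LiDAR timestamp.
ep = nusc.get("ego_pose", lidar_sd["ego_pose_token"])
pc.rotate(Quaternion(ep["rotation"]).rotation_matrix)
pc.translate(np.array(ep["translation"]))

# 3) global -> ego at the camera timestamp (the INVERSE: translate back, then rotate back).
ep_cam = nusc.get("ego_pose", cam_sd["ego_pose_token"])
pc.translate(-np.array(ep_cam["translation"]))
pc.rotate(Quaternion(ep_cam["rotation"]).rotation_matrix.T)

# 4) ego -> camera frame (again the inverse of the stored T_camera->ego).
cs_cam = nusc.get("calibrated_sensor", cam_sd["calibrated_sensor_token"])
pc.translate(-np.array(cs_cam["translation"]))
pc.rotate(Quaternion(cs_cam["rotation"]).rotation_matrix.T)

# 5) perspective projection with the camera intrinsics K.
K = np.array(cs_cam["camera_intrinsic"])
depths = pc.points[2, :]
uv = view_points(pc.points[:3, :], K, normalize=True)   # 3 x N pixel coordinates
print("points in front of CAM_FRONT:", int((depths > 1.0).sum()))
```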
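Second, the §6 metric computation with the official devkit evaluator. The results-JSON path and output directory are placeholders for wherever your inference cell writes the nuScenes-format submission file; the evaluator class and config name are the real source of truth for NDS.

```python
# Sketch: score a nuScenes-format results file with the official DetectionEval.
from nuscenes.nuscenes import NuScenes
from nuscenes.eval.detection.config import config_factory
from nuscenes.eval.detection.evaluate import DetectionEval

nusc = NuScenes(version="v1.0-mini", dataroot="data/nuscenes", verbose=False)
cfg  = config_factory("detection_cvpr_2019")            # the official detection-eval config

nusc_eval = DetectionEval(
    nusc,
    config=cfg,
    result_path="outputs/results_nusc.json",            # placeholder: your predictions
    eval_set="mini_val",                                 # "val" on v1.0-trainval
    output_dir="outputs/eval",
    verbose=True,
)
metrics = nusc_eval.main(plot_examples=0, render_curves=False)
print("NDS:", metrics["nd_score"], "| mAP:", metrics["mean_ap"])
for name, err in metrics["tp_errors"].items():           # ATE / ASE / AOE / AVE / AAE
    print(f"  {name}: {err:.3f}")
```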
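Third, the §8 ODD slicing. Keyword matching on the free-text scene description is a pragmatic starting point, not an official taxonomy; to score a slice, restrict predictions and ground truth to its sample tokens and re-run the evaluator.

```python
# Sketch: bucket val samples into ODD slices using the scene description field.
from collections import defaultdict
from nuscenes.nuscenes import NuScenes

nusc = NuScenes(version="v1.0-mini", dataroot="data/nuscenes", verbose=False)

def odd_slice(scene):
    desc = (scene["description"] + " " + scene["name"]).lower()
    tod  = "night" if "night" in desc else "day"
    wx   = "rain" if "rain" in desc else "clear"
    return f"{tod}-{wx}"

samples_by_slice = defaultdict(list)
for scene in nusc.scene:
    key, tok = odd_slice(scene), scene["first_sample_token"]
    while tok:                                            # walk the scene's sample linked list
        samples_by_slice[key].append(tok)
        tok = nusc.get("sample", tok)["next"]

for key, toks in sorted(samples_by_slice.items()):
    print(f"{key:12s} {len(toks):4d} samples")
```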
Done criterion
You're done when all of these are true:
- Notebook runs end-to-end with no errors.
- §6 prints NDS and mAP as actual numbers. On the v1.0-trainval val split with BEVFormer-base you should be within ~0.01 of the paper's 0.517 NDS / 0.416 mAP. On mini the numbers will be much noisier (≈0.30 NDS) — that is expected and the right finding: mini is for plumbing, not for benchmarking.
- §7 prints a per-class table with all 10 classes; you can explain why barrier and traffic_cone numbers are noisy (low support) and why pedestrian AOE is high (yaw is poorly defined for pedestrians).
- §8 prints day/night/rain slice numbers; you have written down which slice is worst and a hypothesis why.
- §9 produced both a BEV figure and per-camera reprojection figure; both saved to disk.
- You can answer, in 3 sentences each, on a whiteboard: (a) why BEV; (b) what spatial cross-attention does in BEVFormer; (c) what BEVFusion adds beyond BEVFormer.
Common pitfalls
- mmcv–mmdet–mmdet3d version mismatch. The biggest source of pain. The pinned combo is `mmcv-full 1.4.0` + `mmdet 2.14.0` + `mmsegmentation 0.14.1` + `mmdet3d 0.17.1`. Do not try to modernize to OpenMMLab 2.x for BEVFormer — the registry rename and the new `mmengine` runner are not backwards compatible, and BEVFormer's plugin code imports private 1.x APIs that are simply gone in 2.x. The clean modernization path is to switch detectors entirely (use BEVFusion via mmdet3d 1.4 instead); keep BEVFormer on the 1.x stack.
- Torch + CUDA + mmcv wheel URL alignment. The OpenMMLab wheel URL encodes both torch and cuda: https://download.openmmlab.com/mmcv/dist/cu111/torch1.9.0/index.html. If you install torch 1.10 and try to fetch mmcv-full 1.4.0 from the cu111/torch1.9.0 index, pip will silently fall back to a source build, which then takes ~25 minutes and fails on glibc. `setup.sh` does this for you; if you're poking by hand, double-check the URL.
- `MultiScaleDeformableAttention` build failure. The custom CUDA op under `BEVFormer/projects/mmdet3d_plugin/ops/` builds with `python setup.py build install`. On nvcc-version mismatches the build emits ~50 lines of warnings, then errors with `nvcc fatal: unsupported gpu architecture`. Fix: `export TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6"` (Turing + Ampere) or the specific compute capability of your GPU (RTX 4090 is 8.9, A100 is 8.0, V100 is 7.0).
- Sensor extrinsics — the `T_sensor→ego` direction trap. nuScenes stores `calibrated_sensor` records as `translation` + `rotation` (quaternion). These are `T_sensor→ego` — the transform that maps a point in the sensor frame to the ego frame. To go the other way (project an ego-frame box into the camera) you invert. Half of all BEVFormer-related Stack Overflow questions are people who got this direction wrong and got either mirrored detections or detections way out in space. The §3 calibration walkthrough writes the transforms out explicitly to immunize you.
- BEV grid resolution tradeoffs. BEVFormer's BEV grid is 200×200 cells covering [-51.2, 51.2] m × [-51.2, 51.2] m, i.e. 0.512 m per cell. Halving the cell size to 0.256 m doubles each side of the grid and quadruples the BEV-attention compute, for marginal gains on the small-object metrics. Doubling the range to ±100 m for highway scenarios is a more impactful knob than halving the resolution. Most teams default to the nuScenes-tuned grid because nuScenes' own labels stop at ±50 m; if you're working with Waymo or Argoverse 2 (which extend to ±150 m) the BEV grid must be re-tuned.
- CAN-bus data is mandatory for BEVFormer. The temporal self-attention takes ego-motion as input. nuScenes ships ego motion in a separate `nuScenes-can_bus.zip` archive; if you skip it, BEVFormer's data pipeline crashes with a `FileNotFoundError` on `can_bus_info_path`. The data-prep `--canbus` flag points to the unzipped CAN-bus root.
- `numpy>=1.20` deprecation breakage. mmdet 2.14 still uses the `np.float` / `np.int` aliases that numpy 1.20 deprecated and 1.24 removed. Pin `numpy==1.19.5`. If you see `AttributeError: module 'numpy' has no attribute 'float'`, you've drifted off the pin.
- `setuptools>=60` breaks `python setup.py install`. PEP 660 removed the legacy path mmdet3d 0.17 uses. Pin `setuptools==59.5.0` before installing mmdet3d, not after.
Further reading
The pipeline papers (in lineage order):
- Philion & Fidler, Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D, ECCV 2020. https://arxiv.org/abs/2008.05711
- Huang et al., BEVDet: High-Performance Multi-Camera 3D Object Detection in Bird-Eye-View, 2021. https://arxiv.org/abs/2112.11790
- Li et al., BEVDepth: Acquisition of Reliable Depth for Multi-View 3D Object Detection, AAAI 2023. https://arxiv.org/abs/2206.10092
- Li et al., BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers, ECCV 2022. https://arxiv.org/abs/2203.17270
- Yang et al., BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision, CVPR 2023. https://arxiv.org/abs/2211.10439
Multi-sensor fusion in BEV:
- Liu et al., BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation, ICRA 2023 (mit-han-lab). https://arxiv.org/abs/2205.13542
- Liang et al., BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework, NeurIPS 2022 (ADLab-AutoDrive). Different paper, same name — read both. https://arxiv.org/abs/2205.13790
- Liu et al., PETR (ECCV 2022) / PETRv2 (ICCV 2023). The 3D-positional-embedding alternative to BEVFormer's geometric-projection sampling.
Production-relevant context:
- Tesla AI Day 2022 — the occupancy-network walkthrough is the natural sequel to BEVFormer. (YouTube; the relevant segment starts ~50:00 into the 2022 stream.)
- Caesar et al., nuScenes: A Multimodal Dataset for Autonomous Driving, CVPR 2020. The dataset paper — read sections 4 (sensor setup) and 6 (eval metrics) carefully. https://arxiv.org/abs/1903.11027
- nuScenes detection-eval reference implementation (the source of truth for NDS): https://github.com/nutonomy/nuscenes-devkit/tree/master/python-sdk/nuscenes/eval/detection
- mmdetection3d's `projects/BEVFusion/` — the cleanest modern reference implementation if you want to stay on OpenMMLab 2.x: https://github.com/open-mmlab/mmdetection3d/tree/main/projects/BEVFusion
For the strategy memo (Project 18):
- Wayve's GAIA-2 (2025) and Tesla's world-model talks are legible through the BEV-as-substrate lens — predicting future BEV occupancy is the natural extension of predicting current BEV.
- Yang et al., Auto4D (Uber ATG, 2021) and Qi et al., Offboard 3D Object Detection from Point Cloud Sequences (Waymo, 2021) — both 3D-auto-labeler papers, both BEV-native.
Files in this project
- README.md
- notebook.py
- requirements.txt
- setup.sh
Notebook (`notebook.py`) is in jupytext percent format — open it in VS Code or convert with `jupytext --to notebook`.