Project 13 — Argoverse 2 motion forecasting
What this project is. A focused, workstation-scale exploration of the behavior-prediction half of an AV stack. Where projects 04 (BEVFormer 3D detection) and 15 (Bench2Drive closed-loop) live on the perception and planning ends respectively, this project plants you in the middle: given the past 5 seconds of an actor's motion plus an HD map, predict the next 6 seconds as a multi-modal distribution of K=6 trajectories with confidences. You will load the Argoverse 2 Motion Forecasting dataset via `av2-api`, run a published QCNet checkpoint on the validation split, and report minADE_6 / minFDE_6 / MR_6 / brier-FDE_6 numbers next to the published baseline. Then you will slice the validation set by interaction type — straight-through, intersection turns, lane changes, merges, and "cut-in / yielding" two-actor cases — and produce the per-slice metrics breakdown that is the actual pedagogical centerpiece.
Why this exists. Of the 14 projects already in this learning roadmap, not one touches motion prediction. That is a gap: behavior prediction is half of AV planning (docs/06 §C.1) and the canonical leaderboard most planning interviewers will assume you have shipped against. It is also the modality where the labeling story is fundamentally different from camera or lidar: behavior labels come "from the future" — you replay logs to extract the ground-truth future motion (docs/04 §D.2 on temporal labels, "the future is the label"). After this project you can answer the question "if a Data Intelligence customer asks for a curated set of cut-in scenarios with high model uncertainty, what exactly do you need to mine and how?" with a working pipeline rather than a hand-wave.
Goal
Run a published trajectory predictor (QCNet — CVPR 2023, same author as HiVT, AV2 checkpoints publicly released July 2023) on the AV2 motion-forecasting validation split, and report:
- Aggregate metrics — minADE_6, minFDE_6, MR_6, brier-FDE_6 — next to the QCNet paper's published numbers. The expected ballpark on AV2 val is minADE_6 ≈ 0.72, minFDE_6 ≈ 1.25, MR_6 ≈ 0.16 (QCNet repo). Your numbers should match within rounding when running the official checkpoint on the same split. If they don't, something is wrong with your coordinate frame or your scoring loop, and the README's "common pitfalls" section is your first stop.
- Per-interaction-type breakdown — partition val scenarios by a geometric interaction tag (we derive tags from the focal track's trajectory + the local map, since AV2 does not ship explicit interaction labels) and re-compute the four metrics per slice. Concretely: `straight_through`, `intersection_turn`, `lane_change`, `merge`, `cut_in_yield` (two-actor interaction inferred from relative-motion geometry). Expect minFDE_6 to be 1.5–3× worse on `lane_change` and `cut_in_yield` than on `straight_through`. Quantifying that gap is the deliverable.
- Reflection — one page on (a) why the open-loop ADE/FDE numbers are sharper than the perception-side counterparts but also more misleading in a closed-loop sense, connecting back to project 15; (b) what a "behavior-label data engine" looks like when the labels are auto-extracted from log futures; (c) how the per-slice gap you measured implies which scenarios a curation team should oversample.
The outcome artifact is `outputs/metrics/aggregate.json` (the four official numbers) plus `outputs/metrics/per_slice.csv` (the slice breakdown) plus a single `outputs/figures/topk_overlay.png` showing top-K predictions on the HD map for one well-predicted and one badly-predicted scenario from each slice. The learning artifact is the reflection paragraph at the bottom of the notebook.
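To make the four numbers concrete, here is a minimal NumPy sketch of the per-scenario metric definitions, assuming predictions of shape (K, T, 2), softmax-normalized mode probabilities of shape (K,), and ground truth of shape (T, 2). The official av2 evaluation code is the authority; this just shows the shape of the computation:

```python
import numpy as np

def forecast_metrics(pred, probs, gt, miss_threshold=2.0):
    """Per-scenario open-loop forecasting metrics.

    pred:  (K, T, 2) predicted trajectories (same frame as gt)
    probs: (K,) softmax-normalized mode confidences
    gt:    (T, 2) ground-truth future
    """
    disp = np.linalg.norm(pred - gt[None], axis=-1)  # (K, T) pointwise errors
    ade = disp.mean(axis=1)                          # (K,) average displacement per mode
    fde = disp[:, -1]                                # (K,) endpoint displacement per mode

    best = int(fde.argmin())   # "min" metrics use the oracle mode: best *endpoint* error
    min_ade = float(ade[best])
    min_fde = float(fde[best])
    miss = float(min_fde > miss_threshold)                  # MR: best endpoint > 2 m off
    brier_fde = min_fde + (1.0 - float(probs[best])) ** 2   # calibration penalty
    return min_ade, min_fde, miss, brier_fde
```

The numbers in `aggregate.json` are the means of these over all val scenarios; `per_slice.csv` holds the same means restricted to each interaction tag.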
Loops touched
This project sits primarily in the EVAL loop (docs/04) — you take a frozen model and characterize its error structure on a held-out distribution. But the per-interaction-type slicing pulls in LABEL & TRAIN thinking too, because the whole point of the slice analysis is to argue for which scenarios the next labeling sweep should oversample. Crucially, the labels themselves — 6-second future trajectories — are auto-extracted from log futures in the AV2 pipeline. There is no human in the labeling loop for the future position of a tracked vehicle. This is what docs/04 §D.2 means by "the future is the label": replay the log past time T and the actor's position at T+δ is the supervision signal, no annotator required. The cost of a behavior label is the cost of a tracker plus a disk read, which is why behavior datasets can be 100–1000× larger than hand-annotated 3D detection datasets at comparable quality. (The expensive part of behavior labeling is intent and causality — "actor A yielded to actor B" — which is exactly the part that humans and VLMs still cover in 2025-26.)
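Mechanically, "the future is the label" is a slicing operation. A sketch, assuming a hypothetical `track` array of tracked world-frame positions at 10 Hz:

```python
import numpy as np

OBS_STEPS = 50  # 5 s of observed history at 10 Hz (AV2 convention)
FUT_STEPS = 60  # 6 s of future at 10 Hz

def make_behavior_label(track: np.ndarray, t: int):
    """Cut one self-supervised (past, future) pair out of a replayed log.

    track: (N, 2) tracked world-frame positions of one actor at 10 Hz
    t:     index of the "current" timestep; everything after it is supervision
    """
    assert t >= OBS_STEPS and t + FUT_STEPS <= len(track)
    past = track[t - OBS_STEPS : t]    # model input
    future = track[t : t + FUT_STEPS]  # the label, read straight off the log
    return past, future
```

Every valid `t` in every logged track yields a training example at the cost of a tracker plus a disk read, which is the scaling argument the paragraph above makes.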
Why this matters for AI Data Intelligence
Three reasons, in increasing order of how often they show up in interviews:
1. Motion forecasting is half of AV planning. Every learned planner consumes a multi-modal future-trajectory distribution either explicitly (as a separate module — UniAD, VAD, PARA-Drive) or implicitly (an end-to-end transformer that emits both predictions and plans — VADv2, EMMA, DriveTransformer). If you can't talk fluently about minADE / minFDE / MR / brier-FDE and what trades them off, you can't talk fluently about planner training data. The Applied Intuition Data Intelligence team curates exactly this kind of data for OEM customers — they need engineers who understand the metric they are optimizing for at the dataset level.
2. Behavior labels are the AV equivalent of next-token prediction. Both are self-supervised by replay: in language, you mask the next token and predict it from the past; in driving, you mask the next 6 seconds and predict them from the past. The implication is huge: behavior-prediction training data scales like crawled internet scales — with logs, not with annotators. This is precisely why MultiPath++, Wayformer, and the Waymo Open Motion Dataset paper all explicitly draw the analogy to language-model pretraining. Auto-labeling for behavior is essentially solved; the remaining question is curation — which 6-second windows are worth training on, and which are near-duplicates of yesterday's training set. That curation question is exactly the one Applied Intuition's data engine sells to OEMs.
3. The Data Intelligence challenges live on the slice axis, not the aggregate. A trajectory predictor with a great aggregate minFDE_6 can still be 3× worse on lane-change scenarios than on straight-through ones. Your customer (an OEM) cares about the lane-change number because that is where their planner crashes. So the deliverable that matters in production is not "we got 1.25 minFDE_6"; it is "we got 1.25 minFDE_6, but here is the slice table, here is the slice the customer should fund the next labeling sweep on, and here is the projected lift from that sweep." This project produces a miniature version of exactly that artifact.
The framing connects directly to docs/06 §C.1 ("AV leaderboards"), which lists Argoverse Motion Forecasting alongside nuScenes detection and Waymo Open as the three canonical benchmarks every AV-stack engineer is expected to have a working opinion on.
Connection to project 15 (closed-loop)
Closed-loop evaluation in project 15 implicitly depends on behavior prediction: the simulator's other actors must move plausibly, and the planner's own internal model of other actors must be accurate enough that its plans don't crash. Two concrete handoffs:
- A motion forecaster trained here can serve as a sim-agent in project 15's CARLA / Bench2Drive setup — replacing the rule-based Traffic Manager with learned, log-conditioned behavior. (This is exactly what Waymax does with the Waymo Open Motion Dataset, and what project 14 will explore.)
- The same forecaster can serve as a planner module — its top-1 trajectory becomes a candidate ego plan, scored against safety constraints. UniAD does this; you can do it too, in 100 lines.
The intellectual payoff is that you start to see why the open-loop ADE/FDE leaderboards are sharper but also more misleading than the closed-loop driving-score leaderboards: they measure prediction quality on the expert's state distribution, not the planner's own. This is the same covariate-shift argument from project 15's README, applied one level up.
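As a toy illustration of the second handoff (this is not UniAD's actual mechanism, and every name here is hypothetical): pick the most confident forecast mode for ego that keeps clearance from all predicted futures of other actors.

```python
import numpy as np

def plan_from_forecast(ego_modes, ego_probs, other_futures, min_clearance=2.0):
    """Naive planner: most confident ego mode that keeps clearance
    from every predicted other-actor future at matching timesteps.

    ego_modes:     (K, T, 2) forecast modes for the ego vehicle
    ego_probs:     (K,) mode confidences
    other_futures: list of (T, 2) predicted futures of nearby actors
    """
    for k in np.argsort(-ego_probs):  # most confident first
        clear = all(
            np.linalg.norm(ego_modes[k] - other, axis=-1).min() > min_clearance
            for other in other_futures
        )
        if clear:
            return ego_modes[k]
    return ego_modes[int(np.argmax(ego_probs))]  # no safe mode: fall back to top-1
```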
Prerequisites
- Project 07 (CARLA scenarios) understood — you should know what an OpenSCENARIO clip looks like and how a focal actor differs from background traffic.
- Project 04 (BEVFormer) finished — you should be comfortable loading large multi-sensor AV datasets and writing a PyTorch eval loop that matches a published baseline.
- Project 15 (Bench2Drive) read — at minimum the "Why this matters" and open-loop-vs-closed-loop sections. Ideally run.
- Project 16 (active learning) read — the slice / curation framing in this project is its mirror image on the prediction side.
- Comfort with `pytorch_lightning` (QCNet's training framework) and `torch_geometric` (its GNN backbone). If you have not used either, budget 2–3 hours of orientation before starting.
Hardware
- GPU: RTX 4070 / 4080 / 4090 / A6000 / 3090 — 12 GB VRAM minimum for inference at the published batch size; 24 GB if you want to fine-tune on a subset.
- Disk: ~70 GB free for the AV2 motion-forecasting train+val+test archive (~58 GB downloaded + ~12 GB unpacked working copy). You can get away with val-only (~6 GB) if you skip the from-scratch stretch.
- RAM: 32 GB recommended; 16 GB workable if you stream scenarios from disk rather than caching pre-processed graphs in memory.
- CPU: 8+ cores helps the dataloader; AV2 scenario parsing is CPU-bound.
Setup
```bash
bash setup.sh                # creates .venv, installs av2-api, torch, torch-geometric, QCNet deps
source .venv/bin/activate

# Optional: download just the val split (~6 GB) for fast iteration
bash setup.sh --val-only
# Or full train + val + test (~58 GB) for the from-scratch stretch
bash setup.sh --full
```
The download uses s5cmd against the public S3 bucket `s3://argoverse/datasets/av2/motion-forecasting/`, which the Argoverse user guide recommends as the canonical fast path (it saturates a 40 Gbps link; expect ~10 minutes for val on a fast home connection, 1–2 hours for the full 58 GB).
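To verify the download before touching any model code, here is a minimal sketch using the av2-api loaders (module paths follow the av2-api repo; the scenario directory name is a placeholder):

```python
from pathlib import Path

from av2.datasets.motion_forecasting import scenario_serialization
from av2.map.map_api import ArgoverseStaticMap

# Placeholder scenario id; substitute any directory under data/val/.
scenario_dir = Path("data/val/0000b0f9-0000-0000-0000-000000000000")
scenario_path = next(scenario_dir.glob("scenario_*.parquet"))
map_path = next(scenario_dir.glob("log_map_archive_*.json"))

scenario = scenario_serialization.load_argoverse_scenario_parquet(scenario_path)
static_map = ArgoverseStaticMap.from_json(map_path)

# 11 s at 10 Hz: the focal track must be fully observed for all 110 steps.
focal = next(t for t in scenario.tracks if t.track_id == scenario.focal_track_id)
assert len(focal.object_states) == 110
print(scenario.scenario_id, len(scenario.tracks), "tracks,",
      len(static_map.vector_lane_segments), "lane segments")
```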
Steps
- Hardware sanity check — print CUDA device, VRAM, free disk. Fail loudly if you don't have at least 12 GB VRAM and 70 GB disk.
- Install av2-api & verify — `pip install av2`, then load one scenario from the val split and assert its structure (11 seconds at 10 Hz, focal track marked, map present).
- Visualize a scenario — render the focal track's past+future plus all `SCORED` tracks plus the local lane graph as a static PNG. Sanity-check: ego-frame vs world-frame conventions match what the model expects. (See pitfall #1.)
- Frame the prediction problem — markdown cell walking through 5 seconds of past → 6 seconds of future, multi-modal output (K=6 trajectories with softmax-normalized confidences), and the distinction between single-agent (`FOCAL_TRACK` only) and multi-agent (all `SCORED_TRACK`s) prediction. We do single-agent.
- Load the QCNet AV2 checkpoint — clone the QCNet repo, download the released checkpoint, instantiate the model, and verify that the parameter count matches the published one (~7.7 M).
- Run inference on val — single-pass over the entire val set (~25 K scenarios). Cache predictions to disk so you can re-score without re-running the model. Expect 30–90 minutes on an RTX 4070.
- Compute aggregate metrics — minADE_6, minFDE_6, MR_6 (miss threshold = 2 m at 6 s), brier-FDE_6. Compare against the QCNet paper's published numbers; flag anything off by more than 5%.
- Derive interaction-type tags — for each val scenario, classify the focal track into one of `{straight_through, intersection_turn, lane_change, merge, cut_in_yield}` using a small set of geometric rules: heading change over the 6 s future, lateral displacement relative to the local lane centerline, and (for `cut_in_yield`) the existence of a nearby `SCORED` actor whose trajectory crosses the focal track's predicted path within 2 s. (A sketch of these rules follows this list.)
- Compute per-slice metrics — re-score the cached predictions per slice. Output the table in `outputs/metrics/per_slice.csv`.
- Visualize top-K overlays — for one good and one bad scenario per slice, render the top-K predictions on the HD map next to the ground-truth future. The good/bad cut is by per-scenario brier-FDE rank within the slice (5th vs 95th percentile).
- Reflection — markdown cell answering the three questions framed in the Goal section.
- Stretch (optional, 4–8 h) — train a tiny MLP-only baseline from scratch on a 10K-scenario subset. Output its aggregate numbers next to QCNet's. The point is not to beat QCNet — the point is to internalize how brutal the gap is between a map-unaware MLP and a query-centric Transformer.
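The tagging rules from the "Derive interaction-type tags" step, as a minimal sketch. All inputs are hypothetical helpers you would compute from the scenario and map (`heading_change` in radians over the 6 s future, `lateral_offset` of the endpoint from the t=0 lane centerline in meters), and the thresholds are starting points to tune against overlays, not ground truth:

```python
import numpy as np

def tag_scenario(future, heading_change, lateral_offset, nearby_futures,
                 turn_thresh=np.pi / 6, lane_width=3.7):
    """Assign one geometric interaction tag to a focal track.

    future:          (T, 2) ground-truth future positions of the focal track
    heading_change:  signed heading delta (rad) over the 6 s future
    lateral_offset:  endpoint offset (m) from the lane centerline at t=0
    nearby_futures:  list of (T, 2) futures of nearby SCORED actors
    """
    # Two-actor case first: another actor's future crosses ours within 2 s (20 steps).
    for other in nearby_futures:
        if paths_cross(future[:20], other[:20]):
            return "cut_in_yield"
    # Large heading change suggests turning through an intersection.
    if abs(heading_change) > turn_thresh:
        return "intersection_turn"
    # Roughly half a lane width of lateral drift with little heading change.
    if abs(lateral_offset) > 0.5 * lane_width:
        return "lane_change"  # or "merge", if the source lane ends (map lookup)
    return "straight_through"

def paths_cross(a, b, radius=2.0):
    """Crude proximity test: do two futures come within `radius` m of each other?"""
    d = np.linalg.norm(a[:, None] - b[None, :], axis=-1)  # (Ta, Tb) pairwise distances
    return bool((d < radius).any())
```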
Done criterion
You are done when:
- `outputs/metrics/aggregate.json` contains four numbers within 5% of the QCNet-paper reference (minADE_6 ≈ 0.72, minFDE_6 ≈ 1.25, MR_6 ≈ 0.16, brier-FDE_6 ≈ 1.85).
- `outputs/metrics/per_slice.csv` shows minFDE_6 broken down across at least 5 interaction-type slices, with the per-slice scenario count and the absolute and relative gap to the aggregate number.
- `outputs/figures/topk_overlay.png` shows top-6 predictions on the HD map for at least 10 hand-picked scenarios (one good, one bad per slice).
- The reflection markdown at the bottom of `notebook.py` is filled in with your own words, naming at least one slice that you would oversample if you were the Data Intelligence PM, and why.
Common pitfalls
- Ego-frame vs world-frame conventions. AV2 stores trajectories in world frame (UTM-like local coordinates). QCNet expects each actor's history to be re-centered and re-rotated to that actor's agent frame at t=0 (the last observed timestep). If you skip this, your minFDE_6 will be 50–100 m wrong. The official `av2.datasets.motion_forecasting` parser plus QCNet's preprocessor handle this; if you write your own loader, match it exactly. (See the transform sketch after this list.)
- Multi-modal output handling. A predictor outputs K=6 trajectories and K confidences. minADE picks the trajectory with minimum endpoint error against ground truth (the "oracle" mode); brier-FDE penalizes by `(1 - p_best)^2` to reward calibrated confidences. Common bug: averaging across K modes instead of taking the min, which gives meaningless inflated numbers.
- Argoverse 1 vs Argoverse 2 schema confusion. AV1 used CSV format with 5-second total clips (2 s observed + 3 s future); AV2 uses Parquet with 11-second clips (5 s observed + 6 s future). Several public repos (including older HiVT branches) target AV1 only. Verify the dataset version in the path before you start debugging.
- Scenario-format gotchas. AV2 scenarios contain four track categories: `FOCAL_TRACK`, `SCORED_TRACK`, `UNSCORED_TRACK`, `TRACK_FRAGMENT`. The leaderboard scores only the `FOCAL_TRACK` (single-agent) or all `SCORED_TRACK`s (multi-agent). Don't accidentally include `UNSCORED_TRACK` in your prediction loss / eval — your numbers will look better than they should.
- Training-data subsampling. If you take the from-scratch stretch, do not uniformly subsample the train set to 10K — the resulting model will severely underperform on the rare interaction slices because they are already rare in the full set. Either stratify by interaction type or accept that your subsampled baseline will be worse on lane-change than QCNet by more than the train-set-size ratio implies.
- Map-coordinate confusion. AV2 maps are stored in city frame (UTM-aligned) but QCNet ingests lane segments in the focal-track agent frame. Two coordinate transforms in a row are easy to get wrong; debug by overlaying lanes on a known scenario before running any model.
- Mode collapse on lane changes. Many predictors (including QCNet at lower training budgets) emit 6 nearly-identical modes on lane-change scenarios because the expected lane change has ~30% prior probability and the model regresses to the mean. The per-slice analysis will surface this; do not treat it as a bug in your eval.
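The agent-frame transform from pitfall #1, as a minimal sketch (QCNet's own preprocessor is the reference; this just shows the two operations that are easy to get wrong):

```python
import numpy as np

def to_agent_frame(points, origin, heading):
    """Re-center and re-rotate world-frame points into an actor's frame at t=0.

    points:  (..., 2) world-frame xy positions (history, future, or lane polylines)
    origin:  (2,) actor position at the last observed timestep
    heading: actor heading (rad) at the last observed timestep
    """
    c, s = np.cos(-heading), np.sin(-heading)
    rot = np.array([[c, -s], [s, c]])  # rotation by -heading
    return (points - origin) @ rot.T   # translate first, then rotate

def to_world_frame(points, origin, heading):
    """Invert BOTH steps before scoring agent-frame predictions against
    world-frame ground truth (or score everything in agent frame)."""
    c, s = np.cos(heading), np.sin(heading)
    rot = np.array([[c, -s], [s, c]])
    return points @ rot.T + origin
```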
Further reading
- HiVT — Zhou et al., CVPR 2022. The author's earlier work; AV1-only checkpoint, succeeded by QCNet.
- QCNet — Zhou et al., CVPR 2023. Query-centric architecture; AV2 checkpoint released July 2023.
- MultiPath++ — Varadarajan et al., ICRA 2022. The Waymo classic; not open-source but well-documented.
- Wayformer — Nayakanti et al., ICRA 2023. Pure-transformer scene encoding, the architectural ancestor of most 2024-26 AV behavior models.
- EMMA — Hwang et al., 2024. Waymo's end-to-end multimodal driver — uses VLM features for prediction + planning jointly.
- Waymo Open Motion Dataset — Ettinger et al., ICCV 2021. The other canonical motion-forecasting benchmark; project 14 will cover its sim-agent variant via Waymax.
- Argoverse 2 paper — Wilson et al., NeurIPS 2021 D&B. Read §4.2 specifically for how focal tracks were curated to maximize interaction richness.
- Beyond Behavior Cloning — Waymo survey, 2024. The reference for why open-loop forecasting metrics under-represent closed-loop failure modes.
- AV2 user guide — argoverse.github.io/user-guide. The authoritative reference for dataset structure, leaderboard metrics, and the official evaluation code.
Files in this project
- README.md
- notebook.py
- requirements.txt
- setup.sh
Notebook (`notebook.py`) is in jupytext percent format — open in VS Code or convert with `jupytext --to notebook`.