Project 06 · Phase C · Production hygiene · Hardware: Laptop GPU
Loops: COLLECT · CURATE · LABEL

17 — Privacy & Provenance: a redaction + lineage pipeline for AV logs

A focused project that fills two related Tier-2 gaps in this portfolio: PII redaction (face & license-plate blurring on camera frames) and data provenance / lineage (an OpenLABEL-extended audit manifest that tracks every artifact from raw log to training sample). Both concerns share a single operational theme — treating data as audit-able — and they compose into one production-hygiene layer that sits on top of the MCAP plumbing from project 02.

This is the layer that turns a research notebook into a pipeline a real fleet operator can stand behind in a regulator audit, an incident review, or an EU AI Act compliance interview.


Goal

By the end of this project you will be able to:

  1. Detect faces and license plates on a corpus of camera frames using open, production-grade detectors (Meta EgoBlur Gen2, with a RetinaFace + open-source YOLOv8-plate fallback path).
  2. Blur the detected regions in a way that is reversible under key escrow — the redacted frame is what data scientists see; an investigator with the escrow key can recover the original pixels for a specific incident, on demand, with an audit trail.
  3. Wrap every artifact (raw frame, redacted frame, label JSON, derived training sample) in an OpenLABEL-extended provenance manifest that tracks its source, the chain of transformations applied to it, the actor and timestamp for each transformation, the trust level (raw / redacted / synthetic / world-model-generated), the license, the expiration, and the contact-for-dispute.
  4. Walk lineage backwards — given any training sample, list every transformation back to its source raw log, and prove a clip is or is not in the training set.
  5. Compute precision / recall on a labeled subset for both the face detector and the plate detector, so you can defend the numbers under adversarial questioning.
  6. Emit the manifest as a sidecar to MCAP files from project 02 — the provenance layer composes with, rather than replaces, the log layer.

The single non-obvious deliverable is the reversible-under-escrow path. Most blurring you see in the wild is irreversible (Gaussian blur, pixelate, solid box). That's bad for incident investigation: when a regulator asks "who was in the crosswalk at the moment of the disengagement?", an irreversibly-redacted log cannot answer. We make the trade-off explicit by encrypting the original pixels of each blur region under an escrowed key before the visible redaction, so the data scientist's view is privacy-safe and a legitimate investigator can recover the original under the right ceremony.
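
A minimal sketch of that encrypt-then-blur path, using the cryptography package's AESGCM with an HKDF-derived per-frame key. Everything here (function names, the HKDF info string, the demo file path and box) is illustrative, assuming OpenCV-style uint8 BGR frames:

import os
import cv2
import numpy as np
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

def derive_frame_key(master_key: bytes, frame_salt: bytes) -> bytes:
    # Key separation: per-frame keys derived from the escrowed master key.
    return HKDF(algorithm=hashes.SHA256(), length=32, salt=frame_salt,
                info=b"escrow-demo-v0").derive(master_key)

def redact_box(frame: np.ndarray, box: tuple, key: bytes) -> dict:
    # Encrypt the pristine pixels FIRST, then apply the visible blur.
    x1, y1, x2, y2 = box
    patch = frame[y1:y2, x1:x2].copy()
    nonce = os.urandom(12)  # 96-bit AES-GCM nonce, never reused per key
    ciphertext = AESGCM(key).encrypt(nonce, patch.tobytes(), None)
    frame[y1:y2, x1:x2] = cv2.GaussianBlur(patch, (51, 51), 0)
    return {"box": list(box), "shape": patch.shape,
            "nonce": nonce.hex(), "ciphertext": ciphertext.hex()}

def recover_box(frame: np.ndarray, record: dict, key: bytes) -> None:
    # Investigator path: decrypt the escrow record, restore the originals.
    x1, y1, x2, y2 = record["box"]
    plaintext = AESGCM(key).decrypt(bytes.fromhex(record["nonce"]),
                                    bytes.fromhex(record["ciphertext"]), None)
    frame[y1:y2, x1:x2] = np.frombuffer(plaintext,
                                        dtype=np.uint8).reshape(record["shape"])

# Round trip. The generated master key stands in for the escrowed one.
frame = cv2.imread("data/synthetic_frames/frame_0000.png")
master_key = AESGCM.generate_key(bit_length=256)
key = derive_frame_key(master_key, os.urandom(16))
record = redact_box(frame, (120, 80, 220, 200), key)
recover_box(frame, record, key)

The ordering is the whole trick: the ciphertext is taken from the pristine patch before the blur touches the frame, and the master key never appears in the sidecar next to the ciphertext.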


Loops touched

This project explicitly touches three of the five canonical data-engine loops, because provenance is intrinsically cross-cutting:

  • COLLECT. Every frame is wrapped in a manifest recording where it came from (source MCAP, channel, log time, vehicle id, firmware) and what was done to it on ingest. Redaction is itself a recorded transformation: actor, timestamp, model version, checksum, parameters. Without that record, redaction is a black hole — you have a redacted frame and no way to prove what it used to be.
  • CURATE. Curation has to ask "is this clip PII-clean?" and "is any of it from a contributor who has exercised GDPR Article 17?" Both are graph queries over the manifest. Project 02's tag-mining CLI is the entry point; this manifest is what makes privacy-aware filtering possible at all.
  • LABEL & TRAIN. Training-set hygiene requires every sample carry its lineage. When the customer asks "can you prove this training set was clean and traceable?" — they will, in the first week — the only defensible answer is a query against this manifest.

SIM and EVAL are downstream consumers: when project 09 emits a synthetic frame from Cosmos Transfer, the manifest schema below is what flags it as trust_level: WM-generated rather than trust_level: raw. Auditors increasingly care about the difference.


Why this matters for AI Data Intelligence

Three reasons, in order of how often they will come up in an interview.

1. Every commercial AV log goes through PII redaction before training. Not as an afterthought — as a blocking ingest step. Tesla, Waymo, Cruise, Mobileye, Wayve, Zoox, and Aurora all run something EgoBlur-shaped at the edge of their data lake. If your pipeline cannot redact, your pipeline cannot ingest. This is plumbing the way water mains are plumbing.

2. ISO 21448 SOTIF and the EU AI Act both require data provenance, and the August 2026 enforcement deadline is the load-bearing date. SOTIF calls for traceability of the training data behind any safety-critical perception function — not the model weights, the data. The EU AI Act's high-risk-system requirements become fully applicable on August 2, 2026, with an explicit obligation (Article 10) to document training-data sources, labeling procedures, cleaning methods, and augmentation. A manifest like the one below is the minimum artifact that answers those obligations.

3. This is the difference between a research notebook and a production pipeline, and customers will ask in the first week. Applied Intuition's customers are not buying "can you mine scenarios?" — they are buying "can I, the OEM legal officer, explain to a regulator how a clip got into the training set?" That's a provenance question. The load-bearing question for any 2026–2027 V&V pitch reduces to: "can you prove this training set was clean and traceable?" This manifest is what "yes" looks like.

A fourth, smaller reason: there is no unified open standard for AV-data provenance as of mid-2026. ASAM OpenLABEL covers labels, W3C PROV covers generic provenance, and commercial data platforms are the de facto answer. Knowing where the gap is — and having an opinion about how to fill it — is differentiating.


Prerequisites

  • Project 02 (MCAP & ROS 2 plumbing). You should be comfortable opening an MCAP file, iterating channels, and reading metadata. The provenance manifest in this project is designed to sit as a sidecar to the MCAP files project 02 produces.
  • Project 03 (SAM 2 auto-labeling). You don't need the SAM 2 stack itself, but the foundation-model-as-labeler intuition — models doing the high-volume work while humans check only the edge cases — is exactly the shape of the redaction loop here.
  • Python 3.10+, comfort reading PyTorch detection model output, basic cryptography mental model (symmetric AES-GCM, key separation).

Hardware

  • Laptop with GPU strongly preferred. EgoBlur Gen2 face and plate detectors are FasterRCNN-shaped, ~104M parameters each, and run at comfortable interactive rates on a single recent laptop GPU (RTX 4070 / M3 Pro and up). On CPU, expect ~1–3 fps per detector, which is fine for a toy corpus but not for the 1000-frame benchmark.
  • ~3 GB free disk for detector checkpoints and the demo corpus.
  • No GPU? Run with MAX_FRAMES = 60 and MODEL_SIZE = "small" in the notebook; everything still completes end-to-end, and the precision/recall numbers just shift slightly because the detectors aren't quantized.

Setup

cd projects/06-privacy-and-provenance
bash setup.sh
source .venv/bin/activate

setup.sh creates a venv, installs pinned deps, and downloads the EgoBlur checkpoints (face + plate, ~800 MB total). The download URLs are in the script; you can pre-cache them under models/ to avoid the network.

Verified-working components (May 2026):

Component       Pinned version   Purpose
egoblur         2.0.0a2          Meta's face + plate detector (Apache 2.0)
ultralytics     8.3.x            Optional YOLOv8-plate fallback
torch           2.4.x            EgoBlur is FasterRCNN-via-TorchScript
opencv-python   4.9+             Frame I/O, blur kernel, visualization
cryptography    43.x             AES-GCM for reversible redaction
jsonschema      4.23+            OpenLABEL-extension validation
mcap            1.3.1            Sidecar manifest emission

The detectors are bundled by Meta as TorchScript files (.jit / .pt) you can load with torch.jit.load. If the egoblur PyPI package install fails (it's an alpha at time of writing), the notebook has a pip install git+https://github.com/facebookresearch/EgoBlur.git fallback, and ultimately a minimal local wrapper that just calls torch.jit.load on the published checkpoints — that path is in section 3 of the notebook.
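
For orientation, a sketch of what that minimal wrapper can look like, assuming the TorchScript interface from Meta's published demo script (a CHW uint8 BGR tensor in, a (boxes, labels, scores, ...) tuple out). The checkpoint path is illustrative; verify the calling convention against the demo shipped with the checkpoints you actually downloaded:

import cv2
import numpy as np
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
detector = torch.jit.load("models/ego_blur_face.jit", map_location="cpu").to(device)
detector.eval()

def detect_boxes(bgr_frame: np.ndarray, score_threshold: float = 0.9):
    # CHW tensor, BGR channel order, per the EgoBlur demo script.
    tensor = torch.from_numpy(np.transpose(bgr_frame, (2, 0, 1))).to(device)
    with torch.no_grad():
        boxes, _, scores, _ = detector(tensor)
    keep = scores > score_threshold
    return boxes[keep].cpu().numpy().tolist()

frame = cv2.imread("data/synthetic_frames/frame_0000.png")
print(detect_boxes(frame))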


Steps

  1. Run setup.sh. Activate the venv. Confirm python -c "import torch, cv2, cryptography, jsonschema; print('ok')" prints ok.
  2. Download a small camera-frame corpus. The notebook supports two sources: nuScenes-mini front-camera frames (preferred — has real PII) or a synthetic corpus shipped in data/synthetic_frames/ (fallback for air-gapped runs; faces are stock-photo CC0). ~200 frames is enough for the precision/recall step.
  3. Load the EgoBlur face detector. Section 3 of the notebook. Run on the corpus, visualize before/after for 5 frames, save to outputs/face_demo/.
  4. Load the EgoBlur plate detector. Section 4. Same shape: detect, visualize, save.
  5. Implement reversible redaction with key escrow. Section 5. For each detected box, encrypt the original pixel block under an AES-GCM key derived from a master escrow key + per-frame salt. Store (box_id, ciphertext, nonce, key_id) in a sidecar .escrow.json. Apply visible Gaussian blur to the redacted frame. Demonstrate the recovery path: investigator-with-key script that reads escrow + redacted frame and restores the originals.
  6. Compute precision and recall. Section 6. Hand-label or use the provided data/labeled_frames.json with ground-truth boxes. IoU >= 0.5, per-class P/R, plus a confusion table for the over-blur / under-blur failure modes (a minimal matching sketch follows this list). Save the numbers to outputs/metrics.json.
  7. Define the OpenLABEL-extension provenance manifest schema. Section 7. The schema lives in schema/openlabel_provenance_v0.json and validates against jsonschema. Top-level keys mirror OpenLABEL's metadata block and add a provenance extension under a vendor-prefixed namespace (x-provenance per the OpenLABEL extension convention); a sketch of a conforming manifest appears after this list.
  8. Apply the manifest to one clip end-to-end. Section 8. Take a single 30-second clip, generate manifests for: (a) raw frames, (b) redacted frames, (c) detector boxes (as labels), (d) the derived training shard. Emit them as MCAP Metadata records and as standalone JSON sidecars.
  9. Demonstrate lineage queries. Section 9. Build a small networkx graph over the manifests in outputs/. Pick one training sample. Walk the graph back to its source MCAP. Print every transformation along the way (actor, timestamp, model checksum, parameters). This is the artifact that answers "can you prove this training set was clean?" (see the lineage sketch after this list).
  10. GDPR Article 17 demo. Section 10. Simulate a right-to-erasure request: given a subject_id, list every artifact in the corpus that contains pixels traceable to that subject, and produce the deletion plan. The provenance graph makes this O(query) instead of O(corpus); the lineage sketch after this list includes the erasure walk.
  11. Reflect on scaling. Section 11. At fleet scale, redaction + provenance are batch-distributed jobs running on top of project 02's MCAP plumbing. The manifest is queryable as a graph (Iceberg + a relations table, or Neo4j for the 99th-percentile query). SOTIF audits become tractable because the graph is the audit.
  12. (User TODO.) Extend the manifest to track world-model-generated synthetic data — i.e., Cosmos Transfer outputs from project 09. Add a synthesis block that records world_model: cosmos-transfer-2.5, prompt_hash, seed, control_inputs[] (the CARLA log + label refs that conditioned the generation), and parent_clip_id. This is increasingly the audit headache for 2026–2027 because synthetic data has no obvious owner and no GDPR subject, but still needs lineage for reproducibility and dataset-version pinning.
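
Three sketches for the steps above. First, the step-6 matching logic for a single frame and class: greedy one-to-one matching at a fixed IoU threshold, with confidence sorting and the per-size-bucket split left to the notebook:

def iou(a, b):
    # a, b: [x1, y1, x2, y2]
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / max(union, 1e-9)

def precision_recall(pred, gt, iou_thresh=0.5):
    matched, tp = set(), 0
    for p in pred:
        best_iou, best_j = max(((iou(p, g), j) for j, g in enumerate(gt)
                                if j not in matched), default=(0.0, None))
        if best_iou >= iou_thresh:
            matched.add(best_j)
            tp += 1
    fp = len(pred) - tp  # over-blur: fired on non-PII
    fn = len(gt) - tp    # under-blur: missed PII
    return tp / max(len(pred), 1), tp / max(len(gt), 1), (tp, fp, fn)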
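
Second, a sketch of what a conforming step-7/8 manifest and its emission might look like. Every field under x-provenance is an assumption about the schema you will write in schema/openlabel_provenance_v0.json; the emission uses the mcap package's Writer.add_metadata:

import json
import jsonschema
from mcap.writer import Writer

manifest = {"openlabel": {"metadata": {
    "schema_version": "1.0.0",
    "name": "clip_0001/frame_000042/redacted",
    "x-provenance": {                       # vendor-prefixed extension block
        "source": {"mcap": "logs/run_0001.mcap", "channel": "/cam_front"},
        "trust_level": "redacted",          # raw | redacted | synthetic | WM-generated
        "license": "internal-fleet-v2",
        "expires": "2031-05-01",
        "contact_for_dispute": "privacy@example.com",
        "transformations": [{               # append-only, never replaced
            "op": "pii_redaction",
            "actor": "ingest-bot",
            "timestamp": "2026-05-01T12:00:00Z",
            "model": "egoblur-face",
            "model_checksum": "sha256:<checksum of the .jit file>",
            "params": {"blur_kernel": 51},
        }],
    },
}}}

with open("schema/openlabel_provenance_v0.json") as f:
    jsonschema.validate(manifest, json.load(f))  # fail before writing, not after

with open("outputs/clip_0001.mcap", "wb") as f:
    writer = Writer(f)
    writer.start()
    writer.add_metadata("x-provenance/frame_000042",
                        {"manifest": json.dumps(manifest)})
    writer.finish()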
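
Third, the step-9/10 graph walk over toy nodes. Artifact IDs and the subject_ids attribute are invented for illustration; the real graph is built from the manifests in outputs/:

import networkx as nx

G = nx.DiGraph()
G.add_edge("mcap:run_0001", "frame:000042/raw", op="extract_frame")
G.add_edge("frame:000042/raw", "frame:000042/redacted", op="pii_redaction")
G.add_edge("frame:000042/redacted", "shard:train_0007/sample_913", op="reshard")
G.nodes["frame:000042/raw"]["subject_ids"] = ["subject_17"]

def lineage(sample):
    # Done criterion #5: walk a training sample back to its source log.
    for ancestor in nx.ancestors(G, sample):
        print(ancestor)

def erasure_plan(subject_id):
    # GDPR Art. 17: the subject's frames plus everything derived from them.
    doomed = set()
    for node, attrs in G.nodes(data=True):
        if subject_id in attrs.get("subject_ids", []):
            doomed |= {node} | nx.descendants(G, node)
    return sorted(doomed)

lineage("shard:train_0007/sample_913")  # prints redacted frame, raw frame, MCAP
print(erasure_plan("subject_17"))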

Done criterion

You can answer all six of the following questions, end-to-end, with artifacts in this folder:

  1. "What is the precision and recall of your face redaction on the labeled subset?" — Number, with confusion table, in outputs/metrics.json.
  2. "What is the precision and recall of your plate redaction?" — Same.
  3. "Show me a redacted frame and demonstrate the original is recoverable under key escrow."outputs/escrow_demo/before.png, outputs/escrow_demo/redacted.png, recovery script that closes the loop.
  4. "Show me the OpenLABEL provenance manifest for one clip."outputs/clip_0001_manifest.json, validates against the schema.
  5. "Given this training sample, walk back to its source raw log."python lineage_query.py --sample <id> prints the chain.
  6. "How would you handle a GDPR Article 17 request?"python erasure_plan.py --subject &lt;id> prints the deletion plan.

If those six artifacts exist and the numbers are non-trivial, the project has done its job.


Common pitfalls

  1. Over-blurring (false positives on hands, signs, billboards). EgoBlur was trained on egocentric Project Aria data; on automotive forward-facing camera frames it occasionally fires on hand reflections in the instrument cluster, on photos of faces inside billboards, and on anthropomorphic crash-test-dummy stickers. A small post-filter on box aspect ratio and absolute pixel size catches most of these (sketched after this list). Document the failure mode in your precision number — don't suppress it.
  2. Under-blurring (occluded faces, reflections, distant pedestrians). The harder failure mode. Faces seen through a windshield with strong glare, in low-light parking-garage scenes, or below ~32 px on the long axis are routinely missed. The mitigation is not lowering the detection threshold (that explodes false positives); the mitigation is a second-pass detector on hard mining. Track recall separately by bounding-box size bucket.
  3. Reversibility key management. It is tempting to store the escrow key in the same JSON file as the ciphertext. Don't. The point of key escrow is separation of duties: the data scientist has access to the redacted frames and the ciphertext; only the legal/compliance role has the master key. In production, the master key lives in an HSM or a cloud KMS; in this project we simulate that by writing keys to ~/.egoblur_keys/ with mode 0600 and the ciphertext to data/. Audit yourself: if your data-scientist user can cat the master key, your threat model is broken.
  4. OpenLABEL extension validation. OpenLABEL's metadata block has defined fields (schema_version, name, annotator, etc.) and an open extension mechanism (vendor-prefixed keys). It is easy to drop a provenance field under metadata and have it pass one validator but fail another that enforces strict key sets. The fix: namespace the extension under x-provenance (or the vendor prefix you'd use in production), mirror the convention in the schema, and validate every manifest with jsonschema before writing it. Don't trust "it loaded fine."
  5. GDPR vs CCPA vs jurisdictional nuance. GDPR Article 17 ("right to be forgotten") gives EU data subjects the right to demand deletion. CCPA gives a similar but materially different right to California residents. Both treat biometric data (a face image is biometric) differently from generic PII, and both have carve-outs for "scientific research" that AV training arguably qualifies for — but the carve-out is jurisdiction-specific and time-bounded. Don't pretend one redaction policy satisfies both. The manifest's jurisdiction field is the hook for routing; document the policy explicitly.
  6. Dropping provenance when re-encoding. A common production bug: an ingest job decompresses an MCAP, re-shards it for training, and forgets to forward the Metadata records. The new shard has no lineage. Forward provenance aggressively — every transformation step appends to the manifest, never replaces it. The manifest is append-only, like a git log; lossy operations on it are bugs.
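
A sketch of the pitfall-1 geometry filter. Thresholds are illustrative; tune them against the labeled subset and report the P/R impact, remembering pitfall 2: a minimum-size gate trades recall on distant pedestrians for precision.

def plausible_face_box(box, min_side_px=12, max_aspect=2.5):
    # Reject geometry implausible for a face on a forward-facing camera:
    # tiny specks (usually reflections) and extreme aspect ratios
    # (usually signs, billboards, instrument-cluster edges).
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    if min(w, h) < min_side_px:
        return False
    return max(w, h) / max(min(w, h), 1) <= max_aspect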


Interview prep — questions this project should let me answer

  • "Walk me through your PII redaction pipeline." — EgoBlur Gen2 face + plate detectors run as the first step on ingest; reversible-under-escrow visible blur with AES-GCM-encrypted originals; P/R tracked per bounding-box-size bucket so the recall floor on small faces is explicit, not hidden.
  • "GDPR Article 17 request — walk me through it." — Subject ID maps to artifact IDs via the manifest; lineage query expands to derived shards and model training-data refs; deletion plan is the union, minus jurisdiction-registered research carve-outs.
  • "Why reversible redaction?" — Without it, post-incident investigation is blind. With it, data scientists see redacted frames; investigators recover originals under a logged ceremony. Escrow is an attack surface, but for safety-critical perception the alternative is operating in the dark.
  • "OpenLABEL vs your manifest?" — OpenLABEL is what is in the data; provenance is who touched what when. Orthogonal; we extend OpenLABEL's metadata block via the vendor-prefix convention so labels and lineage live in the same parser pass.

Three sentences to be able to say cold:

  1. "Redaction without provenance is a black hole; provenance without redaction is a leak. Same hygiene problem, two ends."
  2. "The load-bearing artifact in any 2026 V&V pitch is not the model card, it's the data manifest. Models change quarterly; the manifest outlives them."
  3. "OpenLABEL solves what is in the data; this manifest solves where the data came from and what was done to it. You need both."

Files in this project

  • README.md
  • notebook.py
  • requirements.txt
  • setup.sh

Notebook (notebook.py) is in jupytext percent format — open in VS Code or convert with jupytext --to notebook.