Physical AI
Project 08 · Phase D: Simulation and world models · Hardware: Workstation 24 GB+
Loops: COLLECT · EVAL

Project 08 — Real-to-sim with 3D Gaussian Splatting (gsplat on a nuScenes drive)

TL;DR. Take a multi-camera nuScenes scene, fit a 3D Gaussian Splat with gsplat, render from off-trajectory novel viewpoints, and measure PSNR / SSIM / LPIPS on held-out frames. Then do the wow demo: zero out the Gaussians inside a moving car's bounding box and re-render — that car is gone, the rest of the scene is intact. This is the real-to-sim half of the simulation matrix that Cosmos Transfer (project 09) does not cover; it is the lineage running through Wayve PRISM-1, Waymo + NVIDIA EmerNeRF, and Applied Intuition Neural Sim. The pedagogical punchline is the dynamic-scene failure mode: vanilla 3DGS assumes a static world, so moving actors smear into ghosts. We visualize the ghosts, name the fixes (Street Gaussians, DrivingGaussian, EmerNeRF, SplatAD), and leave 4DGS as a TODO.

Goal

The single deliverable is one results table and one before/after pair:

Metric on held-out front-camera frames      Value
PSNR ↑                                      ?
SSIM ↑                                      ?
LPIPS (AlexNet) ↓                           ?
Gaussians at convergence                    ?
Train wall-clock (RTX 4090)                 ?

…plus two rendered videos: one along the original ego trajectory (sanity check) and one along a novel trajectory offset 2 m laterally (the real-to-sim use case — you cannot do this with a recorded drive). And an actor-removal pair: the original render with a moving car visible, and the same render after we mask out the Gaussians inside that car's 3D bounding box. The pair is the demo that closes a loop with the Applied Intuition Neural Sim narrative: every recorded drive becomes an editable simulator scene.

This is a measurement project, not a leaderboard chase. The PSNR number matters less than the gap between the static-region PSNR and the moving-vehicle PSNR — that gap is the entire reason the field invented 4DGS variants in the first place.

Loops touched

This lives in two of the seven AV/robotics data loops we use as a mental map (see /docs/03-simulation-and-synthetic-data.md §D for the full matrix):

  • COLLECT (synthetic, real-to-sim half) — every recorded fleet drive becomes a replayable, editable, novel-view-generating simulator scene. This is the half of the simulation matrix that Cosmos Transfer (a generative resynthesizer) does not cover. Cosmos rephotographs a classical render under arbitrary visual conditions; 3DGS / 4DGS take a real recording and turn it back into geometry you can fly through. They are complementary tools, not competitors. Wayve's PRISM-1 is the productized version of this idea for an AV stack; Applied Intuition's Neural Sim is the productized version for a sim platform.
  • EVAL — once the recorded drive is a splat, the same scenario can be replayed under counterfactual conditions: "what if a cyclist had been there?" "what if the bus had not braked?" That is a closed-loop eval asset, not a logged-data eval asset. The actor-removal demo at the end of the notebook is a one-step toy version.

We do not touch DEPLOY, MEASURE-IN-FIELD, LABEL & TRAIN, or CURATE in this project. The output is an asset (a .ply of Gaussians plus a camera trajectory), not a model.

Why this matters for AI Data Intelligence

Every AV company eventually faces the same fork in the simulation matrix:

                        Generates from scratch              Reconstructs from real
Classical               CARLA, Spectral                     (no good answer pre-NeRF)
Neural / generative     Cosmos Predict, Cosmos Transfer     3DGS / 4DGS / EmerNeRF / PRISM-1 / Neural Sim

Project 09 covered the lower-left cell (generative photoreal augmentation on top of a classical renderer). This project covers the lower-right cell — the half of the simulation matrix that turns real fleet logs into simulator content. Without this cell, "every drive becomes a scenario" is a slide; with it, it is an asset pipeline.

A few productization signals worth tracking, all in this lineage:

  • Wayve PRISM-1 — Wayve's reconstruction-based simulator that re-renders real drives. Public posts emphasize 4D scene decomposition and dynamic-actor handling. The "Ghost Gym" pitch is downstream.
  • Waymo + NVIDIA EmerNeRF (arXiv 2311.02077) — self-supervised static/dynamic decomposition with an emergent flow field, evaluated on the NOTR Waymo subsample. NeRF-backed; the field has since migrated this decomposition into 3DGS-backed methods.
  • Tesla 4D-GS — closed-source, but the CVPR-talk hints align with the academic 4DGS line.
  • DrivingGaussian (CVPR 2024), Street Gaussians (ECCV 2024), SplatAD (CVPR 2025) — the academic 4DGS-for-driving trio. Each decomposes the scene into a static background and per-actor Gaussian sets. SplatAD additionally renders LiDAR returns (intensity + ray-drop) — the closest open-source proxy to a productized AV reconstruction tool.
  • Applied Intuition Neural Sim / Spectral — the productization of exactly this matrix: Spectral is the classical-deterministic side, Neural Sim is the neural-photoreal side. A candidate who can say "Cosmos generates, Street Gaussians reconstructs, here are the failure modes I measured on a nuScenes drive" is exactly who the Data Intelligence team needs for both halves of the matrix.

If you can articulate, with numbers, "vanilla 3DGS gives me X dB PSNR on the static pixels of a nuScenes log but drops to Y dB on the moving-vehicle pixels — here is where the dynamic-scene literature picks up," you are doing science on the tool, not marketing it. That distinction is the entire job.

Prerequisites

  • Project 01 must be done first. This project assumes you have already downloaded the nuScenes mini split (v1.0-mini, ~3.9 GB) to ~/data/nuscenes/v1.0-mini/ and that the nuscenes-devkit import works inside your venv. The mini split has ten 20-second scenes, which is exactly the size 3DGS expects (6 cameras × ~40 keyframes × 10 scenes ≈ 2400 images). Pick scene 0061 or 0103 for the notebook walkthrough — both have visible moving vehicles, which makes the dynamic-failure punchline obvious.
  • Optional: project 09. Not required, but the contrast in the README ("Cosmos Transfer rephotographs; 3DGS reconstructs") only lands if you have run a Cosmos Transfer pass and seen its output.

Hardware

Setting                                        Min VRAM   Train time   Notes
Mini, 1 scene, front cam only, 1k iters        8 GB       ~5 min       Sanity check; will look bad
Mini, 1 scene, front cam, 30k iters            12 GB      ~25 min      Reduced default for the notebook
Mini, 1 scene, all 6 cams, 30k iters           16 GB      ~45 min      Honest novel-view setting
Full split, 1 scene, all 6 cams, 30k iters     24 GB      ~60 min      RTX 3090 / 4090 / A6000
Full split, all cams, 4DGS variant             40 GB+     hours        A100 / H100 territory; out of scope

The notebook defaults to the 12 GB row so a 3090 / 4090 / A6000 can run it end-to-end without OOM. We document the bigger settings, but we do not run them by default — the pedagogical points (PSNR baseline, dynamic ghosts, actor-removal demo) all show up at the reduced setting.

CPU-only execution is not supported. gsplat's CUDA rasterizer is the whole point of the library. macOS users: rent a Vast.ai or Lambda Labs RTX 4090 for an hour ($0.40–0.80) or use a Colab Pro+ A100. The notebook checks for CUDA at the top and refuses to start without it.
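
A minimal sketch of that guard, assuming it lives in the notebook's first cell (thresholds and wording here are illustrative, not copied from the notebook):

import sys
import torch

# Hard requirement: gsplat's rasterizer is CUDA-only.
if not torch.cuda.is_available():
    sys.exit("No CUDA device visible - rent a GPU (Vast.ai / Lambda) or use Colab.")

free_b, total_b = torch.cuda.mem_get_info()        # free / total VRAM on device 0, in bytes
free_gb = free_b / 1024**3
print(f"{torch.cuda.get_device_name(0)}: {free_gb:.1f} GiB free of {total_b / 1024**3:.1f} GiB")

# The 12 GB default config needs roughly 8 GB of headroom (step 1 refuses below that).
if free_gb < 8:
    sys.exit("Under 8 GiB of free VRAM - close other GPU processes or drop to a smaller config.")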

Setup

cd projects/08-gaussian-splatting-real2sim
./setup.sh                     # creates .venv, installs torch+cu121, gsplat, lpips, etc.
source .venv/bin/activate
python -c "import gsplat; print(gsplat.__version__)"   # smoke test

setup.sh is idempotent. If tinycudann or the gsplat CUDA build fails on the first run (the most common failure mode — see Common pitfalls below), the script prints a diagnostic block pointing at the right log file. Do not try to fix tinycudann errors by upgrading PyTorch in requirements.txt; the version pins are matched to the gsplat wheels at https://docs.gsplat.studio/whl/. If you must change the CUDA version, also change the --index-url in setup.sh.

The notebook is in jupytext percent format. Open it directly in VSCode, or convert with jupytext --to notebook notebook.py && jupyter lab notebook.ipynb.

Steps

The notebook has 10 numbered sections that match this list one-for-one:

  1. Hardware + CUDA verify. nvidia-smi, torch.cuda.is_available(), gsplat import, free-VRAM probe. Refuse to start on <8 GB free.
  2. Install gsplat + Nerfstudio (or just gsplat). We pin gsplat to a 2026-current release that builds against torch 2.4 + CUDA 12.1. The notebook uses the lower-level gsplat rasterizer directly so the training loop is visible; we do not depend on the full Nerfstudio CLI.
  3. Load a nuScenes scene. Pick scene 0061 or 0103. Pull the 6-camera keyframes, the ego pose at each keyframe, and (optionally) the LiDAR sweep nearest each keyframe. Save the camera intrinsics from cs_record["camera_intrinsic"] and the per-image camera-to-world extrinsics built from ego_pose and calibrated_sensor (see the pose-loading sketch after this list).
  4. Decide on poses: nuScenes-provided vs COLMAP. nuScenes ships high-quality intrinsics and IMU-fused ego poses. We use them. We document when COLMAP would be needed (a non-nuScenes log, or a scene where the IMU drift visibly breaks the splat) and include a fallback COLMAP-on-keyframes call as a code reference, but skip running it.
  5. Build the parser dict gsplat expects. A dict with Ks (3×3 per-image), camtoworlds (4×4 per-image), the image paths, and an initial sparse point cloud (we seed from the LiDAR sweep — this is the trick that makes nuScenes-mini converge in 30k iterations). Save to data/parser_<scene>.pkl.
  6. Train a vanilla 3DGS on the scene. 30k iterations of SGD on Gaussian positions, scales, rotations, opacities, and SH colors. Default config in the notebook is the gsplat default_strategy from simple_trainer.py. Wall clock ≈ 25 min on a 4090. Log to TensorBoard. (A stripped-down training step is sketched after this list.)
  7. Render from novel viewpoints. Two trajectories: (a) the original ego path (sanity check), (b) the same path offset 2 meters laterally (the real-to-sim use case). Render both at 1080p, write to MP4. (The lateral-offset pose edit is sketched after this list.)
  8. Compute PSNR / SSIM / LPIPS on held-out frames. We hold out every 8th front-camera keyframe from the train set, render the splat at those poses, and compare to the held-out RGBs. Report all three metrics, plus a per-region breakdown: static pixels (sky + road + buildings) vs dynamic pixels (the front-camera segmentation mask of vehicle.car at that timestep). The static-vs-dynamic gap is the pedagogical punchline. (The masked-metric computation is sketched after this list.)
  9. Visualize dynamic-scene failure. Render a frame where a moving car is in the scene. Show the 4-panel: ground truth | vanilla-3DGS render | absolute error map | the smear of "ghost Gaussians" along the vehicle's path. Discuss why this happens (3DGS optimizes a static set of Gaussians against time-stamped views; a moving object cannot be fit by any static Gaussian, so the optimizer compromises with a smeared low-opacity blob). Name the four academic fixes (DrivingGaussian, Street Gaussians, EmerNeRF, SplatAD) and what each adds.
  10. Actor-removal demo + export. Pick the moving car from step 9. Look up its 3D bounding box from the nuScenes annotation at the relevant keyframe. Mask all Gaussians whose centers fall inside the box (in world coordinates; see the box-mask sketch after this list). Re-render. The car is gone; the rest of the scene is intact. Export the (full) splat to PLY for use as an Isaac Sim asset. Log the User TODO: replace this script-level mask with a learned static/dynamic decomposition (Street Gaussians) and re-evaluate.
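
To make steps 3–5 concrete, here is a hedged sketch of pulling per-image intrinsics and camera-to-world matrices out of the devkit. The scene name, camera list, and dataroot are assumptions, and the three lists feed straight into the step-5 parser dict:

import os
import numpy as np
from nuscenes.nuscenes import NuScenes
from pyquaternion import Quaternion

nusc = NuScenes(version="v1.0-mini", dataroot=os.path.expanduser("~/data/nuscenes/v1.0-mini"))

def se3(rotation, translation):
    """4x4 transform from a nuScenes record's quaternion + translation."""
    T = np.eye(4)
    T[:3, :3] = Quaternion(rotation).rotation_matrix
    T[:3, 3] = translation
    return T

Ks, camtoworlds, image_paths = [], [], []
scene = next(s for s in nusc.scene if s["name"] == "scene-0061")
sample_token = scene["first_sample_token"]
while sample_token:
    sample = nusc.get("sample", sample_token)
    for cam in ["CAM_FRONT"]:                       # extend to all 6 cameras for the 16 GB setting
        sd  = nusc.get("sample_data", sample["data"][cam])
        cs  = nusc.get("calibrated_sensor", sd["calibrated_sensor_token"])
        ego = nusc.get("ego_pose", sd["ego_pose_token"])
        # Per-image pose: world <- ego (at this timestamp) <- camera.
        # Never reuse one extrinsic per camera across timestamps (pitfall 4).
        camtoworlds.append(se3(ego["rotation"], ego["translation"]) @ se3(cs["rotation"], cs["translation"]))
        Ks.append(np.array(cs["camera_intrinsic"]))
        image_paths.append(nusc.get_sample_data_path(sd["token"]))
    sample_token = sample["next"]                   # empty string ends the scene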
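
A stripped-down version of the step-6 inner loop, assuming gsplat's rasterization() entry point, plain RGB colors instead of SH, and a single optimizer; the notebook follows simple_trainer.py, which adds the densification strategy, per-group learning rates, and the D-SSIM term:

import torch
import gsplat

def train_step(params, optimizer, camtoworld, K, gt_image):
    """One optimization step on one image. gt_image: [H, W, 3] float in [0, 1] on CUDA."""
    viewmat = torch.linalg.inv(camtoworld)[None]              # gsplat wants world-to-camera
    render, _, _ = gsplat.rasterization(
        means=params["means"],                                # [N, 3] Gaussian centers
        quats=params["quats"],                                # [N, 4] rotations
        scales=torch.exp(params["log_scales"]),               # [N, 3] per-axis scales
        opacities=torch.sigmoid(params["logit_opacities"]),   # [N]
        colors=torch.sigmoid(params["rgb"]),                  # [N, 3]; no SH in this sketch
        viewmats=viewmat,
        Ks=K[None],
        width=gt_image.shape[1],
        height=gt_image.shape[0],
    )
    loss = (render[0] - gt_image).abs().mean()                # plain L1
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()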
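
The step-7 novel trajectory is just the recorded poses shifted along each camera's own lateral axis; a sketch, with the choice of axis as an assumption about the pose convention (flip the sign or axis if yours differs):

import numpy as np

def offset_trajectory(camtoworlds, lateral_m=2.0):
    """Return a copy of each 4x4 camera-to-world pose shifted sideways by lateral_m metres."""
    shifted = []
    for c2w in camtoworlds:
        right = c2w[:3, 0]                  # camera x-axis expressed in world coordinates
        new = c2w.copy()
        new[:3, 3] = c2w[:3, 3] + lateral_m * right
        shifted.append(new)
    return shifted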
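
For step 8, the per-region numbers come from restricting the squared error to a mask before averaging; LPIPS comes from the lpips package and SSIM (not shown) from skimage.metrics.structural_similarity. Array names are illustrative:

import numpy as np
import torch
import lpips

lpips_alex = lpips.LPIPS(net="alex").cuda()

def masked_psnr(pred, gt, mask):
    """PSNR over masked pixels only. pred/gt: [H, W, 3] float in [0, 1]; mask: [H, W] bool."""
    mse = ((pred - gt) ** 2)[mask].mean()
    return float(10 * np.log10(1.0 / mse))

def lpips_score(pred, gt):
    """LPIPS (AlexNet) on full frames; lpips expects NCHW tensors scaled to [-1, 1]."""
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float().cuda() * 2 - 1
    return float(lpips_alex(to_t(pred), to_t(gt)))

# Report static and dynamic PSNR separately on every held-out frame, e.g.
#   static_psnr  = masked_psnr(render, gt, ~vehicle_mask)
#   dynamic_psnr = masked_psnr(render, gt, vehicle_mask)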
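
For step 10, the script-level removal is a point-in-oriented-box test against the annotation's Box in global coordinates; a sketch, with the padding margin as an assumption:

import numpy as np

def inside_box_mask(means_world, box, margin=0.2):
    """Boolean mask over Gaussian centers that fall inside a nuScenes Box (global frame).

    box is what nusc.get_box(annotation_token) returns; margin (metres) pads the
    box so low-opacity ghosts hugging the surface are caught too.
    """
    # world -> box frame: subtract the center, then rotate by the inverse of the box orientation
    local = (means_world - box.center) @ box.orientation.rotation_matrix
    w, l, h = box.wlh                              # nuScenes order: width (y), length (x), height (z)
    half = np.array([l / 2, w / 2, h / 2]) + margin
    return np.all(np.abs(local) <= half, axis=1)

# keep = ~inside_box_mask(gaussian_means, nusc.get_box(ann_token))
# re-rendering with only the kept Gaussians produces outputs/without_car.png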

Done criterion

You are done when all four of the following are true:

  1. The notebook runs top-to-bottom on your GPU without manual intervention, in under 90 minutes wall-clock on a 4090.
  2. You have written PSNR / SSIM / LPIPS for the held-out front-camera frames into outputs/metrics.json. The numbers are sane: PSNR > 22 dB on the static-only mask, PSNR < 18 dB on the dynamic-vehicle mask. The gap between those two numbers is the result you are looking for. (Report it. The gap is more important than the absolute level.)
  3. You have rendered two MP4s — original-trajectory and offset-2 m — and eyeballed them. The novel-view render is recognizably the same street, with no catastrophic artifacts in the static regions.
  4. The actor-removal pair (outputs/with_car.png and outputs/without_car.png) shows the moving car gone in the second image, with no large hole or visible mask edge. If the hole is obviously large, your bounding box was too tight or the dynamic ghosts are bleeding outside the box — note that as a finding.

Common pitfalls

  1. COLMAP fails on driving sequences. COLMAP's SfM expects a lot of feature overlap between views. A forward-facing AV camera moving straight down a road provides almost no parallax for distant features (sky, road horizon, far buildings), and frame-to-frame baselines are small relative to scene depth. COLMAP either fails to register or produces a degenerate point cloud. Fix: use the dataset's provided poses (nuScenes, Waymo, KITTI-360 all ship them) and seed the splat from the LiDAR sweep, not COLMAP points.
  2. Vanilla 3DGS assumes the world is static — moving cars become ghosts. This is the most important pitfall and the one you should visualize, not paper over. Vanilla 3DGS optimizes a single set of Gaussians against time-stamped views; a vehicle that moves between views cannot be fit by any static Gaussian, so the optimizer settles on low-opacity smears along the vehicle's trajectory. The fixes are architectural: per-actor Gaussian sets in their own object frame (Street Gaussians, DrivingGaussian), a deformation field (4DGS / Spacetime Gaussians), or a learned static/dynamic decomposition (EmerNeRF). Pick one for follow-up.
  3. Sky and unbounded regions blow up. AV scenes have a sky that is effectively at infinity. Without a fix, 3DGS will allocate many high-opacity Gaussians to the sky and never converge cleanly. Use one of: (a) a sky mask from a semantic segmenter applied at training time, (b) a separate environment-map / cubemap branch (Street Gaussians does this), (c) the Mip-NeRF 360 unbounded contraction trick if you must keep everything in one representation.
  4. Multi-camera consistency is hard. nuScenes' six cameras have tight intrinsics but their extrinsics drift mm-scale across the sweep. If you do not respect the per-image camtoworld (i.e. you reuse one extrinsic across all timestamps of a given camera), the splat will look fine on the front camera and ghostly on the side cameras. Build the camtoworld from ego_pose × calibrated_sensor for each sample data, not once per camera.
  5. Memory blowup at urban-scene scale. A naive default_strategy densifies aggressively, and an urban scene with dense LiDAR seeding can grow past 5 M Gaussians and OOM a 24 GB card. Use the MCMC strategy or cap the Gaussian count at 1.5 M for the notebook. Document this in the config; do not let it bite the user with a 25-minute training run that crashes at iter 18k.
  6. LPIPS install pulls a wrong torch. The lpips PyPI package installs a CPU torch as a transitive dep on some platforms, silently shadowing the CUDA torch. Install lpips with pip install lpips --no-deps, then make sure torchvision is present from the cu121 index. The setup.sh script does this in the right order.
  7. Held-out frames must be temporally separated. Holding out every 8th frame is fine for a typical 3DGS scene, but on driving data the ego is moving, so frame N+1 is spatially close to frame N. If your held-out frame is one keyframe away from a train frame, you are measuring near-trivial novel-view synthesis. Hold out a contiguous block (e.g. the last 4 keyframes of the 40-keyframe scene) to get a meaningful number.

Further reading

  • Wayve PRISM-1 — Wayve's reconstruction-based AV simulator. Blog posts are the public surface; the technique is 4DGS-shaped. Search "Wayve PRISM-1 reconstruction simulator".
  • EmerNeRF (Yang et al., arXiv 2311.02077, NVIDIA AVG) — self-supervised static/dynamic decomposition with emergent scene flow. NeRF-backed; the decomposition idea is what later 3DGS methods inherit.
  • DrivingGaussian (Zhou et al., CVPR 2024) — composite Gaussian Splatting for surrounding dynamic AV scenes. Incremental static background + composite dynamic Gaussian graph. Public code, runnable on Waymo and nuScenes.
  • Street Gaussians (Yan et al., ECCV 2024) — a static background splat in world frame plus per-vehicle splats in object frames. Probably the cleanest pedagogical version of the idea; the one we recommend trying first as a 4DGS extension.
  • SplatAD (Hess et al., CVPR 2025) — adds LiDAR rendering (intensity + ray-drop) to the 3DGS pipeline. The closest open-source proxy to a productized AV reconstruction tool.
  • Cosmos-Drive-Dreams (NVIDIA Toronto, arXiv 2506.09042) — the "real-vs-synthetic" comparison from the generative side of the matrix. Read it next to a 4DGS paper to internalize the generate-vs-reconstruct split.
  • gsplat docs — https://docs.gsplat.studio/, including the wheel index at /whl/ and the simple_trainer.py reference implementation we follow in step 6.
  • /docs/03-simulation-and-synthetic-data.md §D — the classical-vs-neural × generate-vs-reconstruct matrix that frames this whole project.

Files in this project

  • README.md
  • notebook.py
  • requirements.txt
  • setup.sh

Notebook (notebook.py) is in jupytext percent format — open in VS Code or convert with jupytext --to notebook.