PROSE: Training-Free Egocentric Scene Registration
with Vision-Language Models

* Equal contribution  ·  † Equal advising
TL;DR: A training-free, RGB-only pipeline that registers two egocentric captures of the same scene taken at different times: a single pretrained VLM both builds an object-level 3D scene graph per capture and matches instances across them, outperforming geometric and learned scene-graph baselines on Aria Digital Twin & Aria Everyday Activities.

PROSE teaser: scene representation, cross-scan instance correspondence, and registered scene

Given two egocentric RGB sequences of the same indoor space captured at different times, PROSE recovers the rigid transform aligning them — and produces an open-vocabulary 3D scene graph for each capture along the way.

Abstract

Registering two captures of the same indoor space taken at different times underpins persistent spatial memory for robots and AR systems, yet the realistic version of this task is egocentric and its most scalable form is RGB-only. Head-mounted cameras yield blurry, fast-moving, partially overlapping views from which dense geometry is hard to recover. Classical registration leans on exactly the clean point clouds this setting lacks, while learned scene-graph methods require a pre-built or annotated graph and a trained matcher that we find brittle under egocentric data. We take a different route, using a pretrained vision-language model as the source of both scene understanding and cross-scan matching. Our method, PROSE (Prompted Scene rEgistration), lifts each RGB sequence into an object-level 3D scene graph using off-the-shelf foundation models for geometry, segmentation, and language, then prompts the same VLM to match object instances across the two RGB sequences. To make this matching tractable and reliable, we leverage object heights as a prior and verify each proposed match with a paired same/different query, then solve for the rigid transform by hypothesizing a candidate per matched object and selecting the one with the strongest geometric consensus. PROSE adds no learned parameters and requires no depth sensor, training, or annotated graph. On the egocentric Aria Digital Twin and Aria Everyday Activities benchmarks, it outperforms both geometric and learned scene-graph baselines in registration accuracy, on ground-truth and RGB-reconstructed point clouds alike, and the scene graph it produces transfers directly to downstream tasks.

Source scan · egocentric RGB at t0
Reference scan · egocentric RGB at t1

The actual input: head-mounted egocentric RGB views of the same scene, captured at different times (Aria Digital Twin).

0
Learned Parameters
Entirely training-free: geometry, segmentation, and matching all come from off-the-shelf foundation models prompted at inference time.
89.2%
Registration Recall on ADT
Best registration recall on ground-truth clouds, improving over the strongest geometric baseline (80.3%) and far ahead of SG-Reg (46.9%).
+22.6 pp
RR Gain on AEA, RGB-only
When point clouds must be reconstructed from RGB, PROSE reaches 65.5% recall vs. 42.9% for TEASER++ — the lead grows as geometry degrades.
3.4×
More Precise Correspondences
70.4% node precision on ADT instance matching, vs. 20.8% for SG-Reg and 14.2% for GDino+CLIP embedding matching.

How PROSE Works

A single pretrained VLM drives both scene understanding and cross-scan matching — no fine-tuning anywhere in the pipeline.

PROSE pipeline: scene parsing, height-binned correspondence and verification, instance-level scene graph matching, pose hypothesis and voting

From two egocentric RGB scans of the same scene at different times, PROSE parses each into a per-scan 3D scene graph, matches instances across scans, and estimates the rigid transform by generating one candidate per matched pair and selecting the highest-inlier-ratio hypothesis.

1. Scene Parsing

Each RGB sequence is lifted to per-frame depth and poses by a geometric foundation model. A VLM lists the scene's landmark objects, SAM 3 turns the names into temporally consistent instance masks, and a voxel-revote fusion step consolidates everything into an object-level 3D scene graph per scan.

VGGT-Ω · Qwen3.6-27B · SAM 3
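
The lifting step can be illustrated with standard pinhole back-projection: given one frame's depth map, an instance mask, intrinsics, and a camera-to-world pose, the masked pixels become 3D points in the scan's frame. This is a minimal sketch under stated assumptions, not the PROSE code: the function name, the pinhole model, z-axis depth, and the camera-to-world pose convention are all assumptions, and the actual outputs of the geometry model may be organized differently.

```python
import numpy as np

def backproject_mask(depth, mask, K, cam_to_world):
    """Lift the masked pixels of one frame to 3D world points.

    depth:        (H, W) depth along the camera z-axis (assumed convention)
    mask:         (H, W) boolean instance mask
    K:            (3, 3) pinhole intrinsics (assumed camera model)
    cam_to_world: (4, 4) camera-to-world pose (assumed convention)
    """
    v, u = np.nonzero(mask & (depth > 0))        # pixel rows (v) and columns (u)
    z = depth[v, u]
    pixels = np.stack([u, v, np.ones_like(u)], axis=0).astype(float)  # (3, N)
    rays = np.linalg.inv(K) @ pixels             # camera rays with z = 1
    pts_cam = rays * z                           # scale each ray by its depth
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return (R @ pts_cam).T + t                   # (N, 3) points in the world frame
```

Accumulating these points per instance across frames gives the raw material that a fusion step would then consolidate into scene-graph nodes.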

2. Height-Binned Correspondence

Instances are split into quantile bins along the gravity axis, so a ceiling lamp never competes with a floor rug. Within each bin, the VLM matches crops labeled with shared-namespace Set-of-Marks markers, and every candidate is re-checked with paired same?/different? prompts to expose hallucinated matches.

K = 5 height bins · Set-of-Marks · 2× verification
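
The two ideas in this step, quantile height bins and paired verification, can be sketched as follows. This is an illustration only: whether the bin edges are computed per scan or jointly is not stated above, `ask_vlm` is a hypothetical callable that returns a boolean, and the prompt wording is invented for the example.

```python
import numpy as np

def height_bins(heights, k=5):
    """Assign each instance to one of k quantile bins along the gravity axis."""
    heights = np.asarray(heights, dtype=float)
    # k bins need k - 1 interior cut points.
    edges = np.quantile(heights, np.linspace(0, 1, k + 1)[1:-1])
    return np.searchsorted(edges, heights, side="right")  # bin ids in [0, k - 1]

def verify_match(ask_vlm, crop_a, crop_b):
    """Keep a proposed match only if both phrasings of the question agree.

    ask_vlm(prompt, crop_a, crop_b) -> bool is a hypothetical wrapper
    around the VLM; the prompts below are placeholders.
    """
    says_same = ask_vlm("Are these the same object? Answer yes or no.", crop_a, crop_b)
    says_diff = ask_vlm("Are these different objects? Answer yes or no.", crop_a, crop_b)
    return says_same and not says_diff
```

Asking the question both ways means a model that answers "yes" regardless of the prompt contradicts itself and the match is dropped.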

3. Pose Hypothesis & Voting

Each matched pair yields its own rigid-transform hypothesis via per-instance RANSAC on descriptor correspondences. Hypotheses are scored by scene-wide inlier ratio and the strongest geometric consensus wins — so a minority of bad matches cannot corrupt the final transform.

FCGF / FPFH / GeoTrans · RANSAC · Inlier-ratio voting
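
Inlier-ratio voting can be sketched as follows. This is a minimal illustration, not the PROSE implementation: the distance threshold `tau`, the brute-force nearest-neighbour search, and the function names are assumptions, and the per-instance RANSAC that produces each hypothesis is omitted.

```python
import numpy as np

def inlier_ratio(src, ref, R, t, tau=0.1):
    """Fraction of transformed source points within tau of some reference point."""
    moved = src @ R.T + t
    # Brute-force nearest neighbour; a KD-tree would be needed for real clouds.
    d2 = ((moved[:, None, :] - ref[None, :, :]) ** 2).sum(axis=-1)
    return float((d2.min(axis=1) <= tau ** 2).mean())

def vote(src, ref, hypotheses, tau=0.1):
    """Return the (R, t) hypothesis with the highest scene-wide inlier ratio."""
    scores = [inlier_ratio(src, ref, R, t, tau) for R, t in hypotheses]
    best = int(np.argmax(scores))
    return hypotheses[best], scores
```

Because every hypothesis is scored against the whole scene, a transform derived from a wrong match tends to score poorly and is outvoted by hypotheses from correct matches.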

Registration: Ahead on Clean Clouds, Far Ahead on RGB-Only

PROSE vs. the strongest scene-level baseline (TEASER++) in each setting. Higher RR ↑ is better; lower RRE ↓ and RTE ↓ are better. The lead widens exactly where it matters: clouds reconstructed from RGB.
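
For readers unfamiliar with the three metrics, here is a sketch of how they are commonly computed. The success thresholds behind RR are not given on this page, so `max_deg` and `max_m` are left as required arguments, and whether RRE and RTE are averaged over all pairs or only successful ones is likewise an open assumption.

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle between two rotation matrices, in degrees (RRE)."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def translation_error(t_est, t_gt):
    """Euclidean distance between estimated and true translations (RTE)."""
    return float(np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt)))

def registration_recall(errors, max_deg, max_m):
    """Percentage of pairs whose RRE and RTE are both under threshold (RR)."""
    ok = [rre <= max_deg and rte <= max_m for rre, rte in errors]
    return 100.0 * sum(ok) / len(ok)
```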

ADT · ground-truth clouds
Aria Digital Twin, GT point clouds (Ours w/ FCGF)

Metric                  TEASER++   Ours      Change
Registration Recall ↑   80.3%      89.2%     +8.9 pp
Rotation Error ↓        8.83°      4.16°     53% ↓
Translation Error ↓     0.40 m     0.17 m    58% ↓

ADT · RGB-reconstructed clouds
Sensor-free setting, VGGT-Ω predicted clouds (Ours w/ GeoTrans)

Metric                  TEASER++   Ours      Change
Registration Recall ↑   44.4%      56.2%     +11.8 pp
Rotation Error ↓        17.11°     12.55°    27% ↓
Translation Error ↓     0.97 m     0.59 m    39% ↓

AEA · RGB-reconstructed clouds
Aria Everyday Activities, VGGT-Ω predicted clouds (Ours w/ GeoTrans)

Metric                  TEASER++   Ours      Change
Registration Recall ↑   42.9%      65.5%     +22.6 pp
Rotation Error ↓        18.87°     12.71°    33% ↓
Translation Error ↓     2.55 m     1.39 m    45% ↓

Qualitative Examples

Interactive 3D comparison: registrations from TEASER++, SG-Reg, and Ours alongside the ground truth (GT). Each panel shows the reference scan and the registered source scan.

Full Results

Total-split results on ADT and AEA. Bold = best, underline = second best within each cloud setting.

Method           ADT · GT clouds                  ADT · VGGT-Ω clouds              AEA · VGGT-Ω clouds
                 RR (%) ↑   RRE (°) ↓  RTE (m) ↓  RR (%) ↑   RRE (°) ↓  RTE (m) ↓  RR (%) ↑   RRE (°) ↓  RTE (m) ↓
TEASER++ FPFH    80.3       8.83       0.40       44.4       17.11      0.97       42.9       18.87      2.55
TEASER++ FCGF    79.4       9.63       0.44       Fail to converge                 Fail to converge
GeoTransformer   Out of Memory                    Out of Memory                    Out of Memory
BUFFER-X         61.7       10.57      0.59       37.1       16.95      1.03       34.5       20.35      2.44
SG-Reg           46.9       19.76      2.12       19.1       29.58      3.27       19.4       27.06      3.52
Ours FPFH        82.5       7.44       0.30       49.9       15.71      0.70       47.9       17.91      1.84
Ours FCGF        89.2       4.16       0.17       50.0       13.21      0.63       58.5       9.04       0.99
Ours GeoTrans    78.8       8.85       0.36       56.2       12.55      0.59       65.5       12.71      1.39

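No single descriptor backend dominates across the three "Ours" rows: FCGF gives the best recall on ADT ground-truth clouds (89.2%), while GeoTrans gives the best recall on both RGB-reconstructed settings (56.2% on ADT, 65.5% on AEA). The gains come from the VLM correspondence stage, not the descriptor. GeoTransformer runs out of memory at scene level but works as a PROSE backend, since per-instance registration keeps each problem small.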

Downstream: The Scene Graph Plans Paths

The same open-vocabulary scene graph that drives registration supports object-goal path planning with RRT on a real iPad capture — emulating a quadruped navigating 50 cm above the floor.
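
To make the planning step concrete, here is a minimal 2D RRT sketch. It is not the planner used in the demo: the circular obstacles stand in for whatever scene-graph instances intersect a horizontal slice at the robot's height, and the step size, goal bias, tolerance, and all function names are assumptions chosen for illustration.

```python
import math
import random

def collision_free(p, q, obstacles, step=0.05):
    """Check the segment p->q against circular obstacles by dense sampling."""
    n = max(1, int(math.dist(p, q) / step))
    for i in range(n + 1):
        x = p[0] + (q[0] - p[0]) * i / n
        y = p[1] + (q[1] - p[1]) * i / n
        if any(math.dist((x, y), c) <= r for c, r in obstacles):
            return False
    return True

def rrt(start, goal, obstacles, bounds, step=0.5, goal_tol=0.5,
        goal_bias=0.2, max_iters=5000, seed=0):
    """Grow a tree from start; return a list of waypoints, or None on failure."""
    rng = random.Random(seed)
    nodes, parent = [start], {0: None}
    (xmin, xmax), (ymin, ymax) = bounds
    for _ in range(max_iters):
        # Sample the goal occasionally to pull the tree toward it.
        target = goal if rng.random() < goal_bias else (
            rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))
        i = min(range(len(nodes)), key=lambda k: math.dist(nodes[k], target))
        d = math.dist(nodes[i], target)
        if d == 0.0:
            continue
        s = min(step, d) / d  # move at most one step toward the target
        new = (nodes[i][0] + s * (target[0] - nodes[i][0]),
               nodes[i][1] + s * (target[1] - nodes[i][1]))
        if not collision_free(nodes[i], new, obstacles):
            continue
        nodes.append(new)
        parent[len(nodes) - 1] = i
        if math.dist(new, goal) <= goal_tol:
            # Walk parent links back to the root to recover the path.
            path, k = [], len(nodes) - 1
            while k is not None:
                path.append(nodes[k])
                k = parent[k]
            return path[::-1]
    return None
```

In an object-goal setting, `goal` would be a free point near the target instance's footprint, and the obstacle set would be derived from the remaining instances in the graph.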

Robotic demo. Real-world assessment with a Spot quadruped robot navigating the room.

Bird's-eye view of the reconstructed open-plan room with the labeled scene graph (in-room vs. out-of-room instances).

Method  SR_room  SR_near  SR_far  SR_cross  N_inst  N_pairs
FM-Fusion*100*100*100*59
ConceptGraphs46.247.146.838.133379
Ours49.662.147.140.4192500

*FM-Fusion recovers valid graphs for only 2 rooms — its perfect rate reflects negligible coverage, not planning quality.

BibTeX

@article{chen2026prose,
  title     = {PROSE: Training-Free Egocentric Scene Registration with Vision-Language Models},
  author    = {Chen, Zhiang and Lee, Nahyuk and Sun, Boyang and Kwon, Taein and Pollefeys, Marc and Bauer, Zuria and Hong, Sunghwan},
  journal   = {arXiv preprint arXiv:XXXX.XXXXX},
  year      = {2026}
}