PROSE: Training-Free Egocentric Scene Registration with Vision-Language Models

Video

PROSE teaser: scene representation, cross-scan instance correspondence, and registered scene

Given two egocentric RGB sequences of the same indoor space captured at different times, PROSE recovers the rigid transform aligning them — and produces an open-vocabulary 3D scene graph for each capture along the way.

Abstract

Registering two captures of the same indoor space taken at different times underpins persistent spatial memory for robots and AR systems, yet the realistic version of this task is egocentric and its most scalable form is RGB-only. Head-mounted cameras yield blurry, fast-moving, partially overlapping views from which dense geometry is hard to recover. Classical registration leans on exactly the clean point clouds this setting lacks, while learned scene-graph methods require a pre-built or annotated graph and a trained matcher that we find brittle under egocentric data. We take a different route, using a pretrained vision-language model as the source of both scene understanding and cross-scan matching. Our method, PROSE (Prompted Scene rEgistration), lifts each RGB sequence into an object-level 3D scene graph using off-the-shelf foundation models for geometry, segmentation, and language, then prompts the same VLM to match object instances across the two RGB sequences. To make this matching tractable and reliable, we leverage object heights as a prior and verify each proposed match with a paired same/different query, then solve for the rigid transform by hypothesizing a candidate per matched object and selecting the one with the strongest geometric consensus. PROSE adds no learned parameters and requires no depth sensor, training, or annotated graph. On the egocentric Aria Digital Twin and Aria Everyday Activities benchmarks, it outperforms both geometric and learned scene-graph baselines in registration accuracy, on ground-truth and RGB-reconstructed point clouds alike, and the scene graph it produces transfers directly to downstream tasks.

Source scan · two egocentric RGB views at t₀

Reference scan · two egocentric RGB views of the same scene at t₁

The actual input: head-mounted egocentric RGB views of the same scene, captured at different times (Aria Digital Twin).

0

Learned Parameters

Entirely training-free: geometry, segmentation, and matching all come from off-the-shelf foundation models prompted at inference time.

89.2%

Registration Recall on ADT

Best registration recall on ground-truth clouds, improving over the strongest geometric baseline (80.3%) and far ahead of SG-Reg (46.9%).

+22.6 pts

RR Gain on AEA, RGB-only

When point clouds must be reconstructed from RGB, PROSE reaches 65.5% recall vs. 42.9% for TEASER++ — the lead grows as geometry degrades.

3.4×

More Precise Correspondences

70.4% node precision on ADT instance matching, vs. 20.8% for SG-Reg and 14.2% for GDino+CLIP embedding matching.

How PROSE Works

A single pretrained VLM drives both scene understanding and cross-scan matching — no fine-tuning anywhere in the pipeline.

PROSE pipeline: scene parsing, height-binned correspondence and verification, instance-level scene graph matching, pose hypothesis and voting

From two egocentric RGB scans of the same scene at different times, PROSE parses each into a per-scan 3D scene graph, matches instances across scans, and estimates the rigid transform by generating one candidate per matched pair and selecting the highest-inlier-ratio hypothesis.

1

Scene Parsing

Each RGB sequence is lifted to per-frame depth and poses by a geometric foundation model. A VLM lists the scene's landmark objects, SAM 3 turns the names into temporally consistent instance masks, and a voxel-revote fusion step consolidates everything into an object-level 3D scene graph per scan.

VGGT-Ω · Qwen3.6-27B · SAM 3

2

Height-Binned Correspondence

Instances are split into quantile bins along the gravity axis, so a ceiling lamp never competes with a floor rug. Within each bin, the VLM matches crops labeled with shared-namespace Set-of-Marks markers, and every candidate is re-checked with paired same?/different? prompts to expose hallucinated matches.

K = 5 height bins · Set-of-Marks · 2× verification
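The binning and verification steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `height_bins` and `verify_match`, and the `ask` callable standing in for a VLM query, are hypothetical. The acceptance rule (keep a match only when "same?" is answered yes and "different?" is answered no) is an assumption; the source states only that each candidate is re-checked with the paired prompts.

```python
import numpy as np
from typing import Callable


def height_bins(heights, k: int = 5) -> np.ndarray:
    """Assign each instance to one of k quantile bins along the gravity axis."""
    heights = np.asarray(heights, dtype=float)
    # k bins need k - 1 interior cut points, taken at evenly spaced quantiles.
    edges = np.quantile(heights, np.linspace(0.0, 1.0, k + 1)[1:-1])
    # Each height maps to the index (0 .. k-1) of the bin it falls into.
    return np.searchsorted(edges, heights, side="right")


def verify_match(ask: Callable[[str, int, int], bool], src_id: int, ref_id: int) -> bool:
    """Keep a proposed match only if the paired answers agree with each other.

    `ask(question, src_id, ref_id)` is a hypothetical wrapper around a VLM call
    that returns True when the model answers "yes" to the given question.
    """
    says_same = ask("same", src_id, ref_id)
    says_different = ask("different", src_id, ref_id)
    # Assumed rule: a model that says yes to both questions is not trusted.
    return says_same and not says_different
```

Quantile edges give each bin roughly the same number of instances, so no single VLM prompt has to compare a disproportionately large set of crops.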

3

Pose Hypothesis & Voting

Each matched pair yields its own rigid-transform hypothesis via per-instance RANSAC on descriptor correspondences. Hypotheses are scored by scene-wide inlier ratio and the strongest geometric consensus wins — so a minority of bad matches cannot corrupt the final transform.

FCGF / FPFH / GeoTrans · RANSAC · Inlier-ratio voting
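The hypothesize-and-vote idea can be sketched in a few lines of NumPy. Two simplifications are assumed here and are not taken from the source: each per-instance hypothesis is fitted with a plain least-squares (Kabsch) solve rather than RANSAC, and "scene-wide inlier ratio" is interpreted as the fraction of putative scene-level point correspondences whose residual falls below a threshold `tau`. The names `kabsch`, `inlier_ratio`, `vote`, and the value of `tau` are illustrative.

```python
import numpy as np


def kabsch(src: np.ndarray, dst: np.ndarray):
    """Least-squares rigid transform (R, t) such that dst ~= R @ src + t."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def inlier_ratio(R, t, src, dst, tau: float = 0.1) -> float:
    """Fraction of scene-wide correspondences aligned to within tau (metres)."""
    residual = np.linalg.norm(src @ R.T + t - dst, axis=1)
    return float(np.mean(residual < tau))


def vote(hypotheses, src, dst, tau: float = 0.1):
    """Return the hypothesis with the highest scene-wide inlier ratio."""
    scores = [inlier_ratio(R, t, src, dst, tau) for R, t in hypotheses]
    best = int(np.argmax(scores))
    return hypotheses[best], scores[best]
```

Because every hypothesis is scored against the whole scene, a transform derived from a wrong match scores poorly and loses the vote instead of being averaged into the result.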

Registration: Ahead on Clean Clouds, Far Ahead on RGB-Only

PROSE vs. the strongest scene-level baseline (TEASER++) in each setting. Higher RR ↑ is better; lower RRE ↓ and RTE ↓ are better. The lead widens exactly where it matters: clouds reconstructed from RGB.

ADT · ground-truth clouds

Aria Digital Twin, GT point clouds (Ours w/ FCGF)

Registration Recall ↑

TEASER++

80.3%

Ours

89.2% (+8.9 pts)

Rotation Error ↓

TEASER++

8.83°

Ours

4.16° 53% ↓

Translation Error ↓

TEASER++

0.40 m

Ours

0.17 m 58% ↓

ADT · RGB-reconstructed clouds

Sensor-free setting, VGGT-Ω predicted clouds (Ours w/ GeoTrans)

Registration Recall ↑

TEASER++

44.4%

Ours

56.2% (+11.8 pts)

Rotation Error ↓

TEASER++

17.11°

Ours

12.55° 27% ↓

Translation Error ↓

TEASER++

0.97 m

Ours

0.59 m 39% ↓

AEA · RGB-reconstructed clouds

Aria Everyday Activities, VGGT-Ω predicted clouds (Ours w/ GeoTrans)

Registration Recall ↑

TEASER++

42.9%

Ours

65.5% (+22.6 pts)

Rotation Error ↓

TEASER++

18.87°

Ours

12.71° 33% ↓

Translation Error ↓

TEASER++

2.55 m

Ours

1.39 m 45% ↓
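For readers unfamiliar with the three metrics, the sketch below shows how they are commonly computed. It is a generic illustration rather than the benchmark's evaluation code: the success thresholds `max_rre_deg` and `max_rte_m` are left as parameters because the page does not state them, and the function names are hypothetical.

```python
import numpy as np


def rotation_error_deg(R_est: np.ndarray, R_gt: np.ndarray) -> float:
    """Relative rotation error (RRE): geodesic angle between two rotations."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    # Clip to absorb floating-point drift just outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def translation_error(t_est, t_gt) -> float:
    """Relative translation error (RTE): Euclidean distance in metres."""
    return float(np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt)))


def registration_recall(errors, max_rre_deg: float, max_rte_m: float) -> float:
    """Registration recall (RR): percentage of pairs within both thresholds.

    `errors` is a list of (rre_deg, rte_m) tuples, one per scan pair.
    The thresholds are assumptions supplied by the caller.
    """
    ok = [rre <= max_rre_deg and rte <= max_rte_m for rre, rte in errors]
    return 100.0 * sum(ok) / len(ok)
```

The gains shown next to the recall bars are absolute differences in percentage points (for example 65.5 − 42.9 = 22.6), while the error reductions are relative (for example (8.83 − 4.16) / 8.83 ≈ 53%).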

Qualitative Examples

[Interactive 3D comparison. Panels: TEASER++, SG-Reg, Ours, GT. Each panel shows the reference scan and the registered source scan.]

Full Results

Total-split results on ADT and AEA. Bold = best, underline = second best within each cloud setting.

	ADT · GT clouds			ADT · VGGT-Ω clouds			AEA · VGGT-Ω clouds
Method	RR (%) ↑	RRE (°) ↓	RTE (m) ↓	RR (%) ↑	RRE (°) ↓	RTE (m) ↓	RR (%) ↑	RRE (°) ↓	RTE (m) ↓
TEASER++ (FPFH)	80.3	8.83	0.40	44.4	17.11	0.97	42.9	18.87	2.55
TEASER++ (FCGF)	79.4	9.63	0.44	Fail to converge			Fail to converge
GeoTransformer	Out of Memory			Out of Memory			Out of Memory
BUFFER-X	61.7	10.57	0.59	37.1	16.95	1.03	34.5	20.35	2.44
SG-Reg	46.9	19.76	2.12	19.1	29.58	3.27	19.4	27.06	3.52
Ours (FPFH)	82.5	7.44	0.30	49.9	15.71	0.70	47.9	17.91	1.84
Ours (FCGF)	89.2	4.16	0.17	50.0	13.21	0.63	58.5	9.04	0.99
Ours (GeoTrans)	78.8	8.85	0.36	56.2	12.55	0.59	65.5	12.71	1.39

No single descriptor backend dominates across the three "Ours" rows: the gains come from the VLM correspondence stage, not the descriptor. GeoTransformer runs out of memory at scene level but works as a PROSE backend, since per-instance registration keeps each problem small.

Downstream: The Scene Graph Plans Paths

The same open-vocabulary scene graph that drives registration supports object-goal path planning with RRT on a real iPad capture — emulating a quadruped navigating 50 cm above the floor.
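As a rough picture of what RRT planning over such a scene graph involves, here is a minimal 2D sketch. It is not the planner used in the demo. It assumes, purely for illustration, that each scene-graph instance intersecting the robot's height slice is approximated by a circular obstacle `(cx, cy, r)` on the floor plane; the function names, step size, goal tolerance, and goal bias are all hypothetical choices.

```python
import math
import random


def collision_free(p, q, obstacles, n_checks=20):
    """True if the segment p-q stays outside every circular obstacle (cx, cy, r)."""
    for i in range(n_checks + 1):
        s = i / n_checks
        x, y = p[0] + s * (q[0] - p[0]), p[1] + s * (q[1] - p[1])
        if any(math.hypot(x - cx, y - cy) <= r for cx, cy, r in obstacles):
            return False
    return True


def rrt(start, goal, obstacles, bounds, step=0.3, goal_tol=0.3,
        goal_bias=0.1, max_iters=5000, seed=0):
    """Grow a tree from start; return a list of waypoints to the goal, or None."""
    rng = random.Random(seed)
    (xmin, xmax), (ymin, ymax) = bounds
    nodes, parent = [start], {0: None}
    for _ in range(max_iters):
        # Occasionally sample the goal itself to pull the tree toward it.
        if rng.random() < goal_bias:
            target = goal
        else:
            target = (rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))
        i_near = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], target))
        near = nodes[i_near]
        d = math.dist(near, target)
        if d == 0.0:
            continue
        # Move at most one step from the nearest node toward the sample.
        s = min(step, d) / d
        new = (near[0] + s * (target[0] - near[0]),
               near[1] + s * (target[1] - near[1]))
        if not collision_free(near, new, obstacles):
            continue
        nodes.append(new)
        parent[len(nodes) - 1] = i_near
        if math.dist(new, goal) <= goal_tol:
            # Walk parent links back to the root to recover the path.
            path, i = [], len(nodes) - 1
            while i is not None:
                path.append(nodes[i])
                i = parent[i]
            return path[::-1]
    return None
```

A call such as `rrt((0.5, 0.5), (4.5, 4.5), [(2.5, 2.5, 1.0)], ((0, 5), (0, 5)))` plans around a single obstacle in a 5 m × 5 m room. In an object-goal setting, the goal point would be taken from the position of the named instance in the scene graph.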

Photographs of the robotic demo with a Spot quadruped navigating the room

Robotic demo. Real-world assessment with a quadruped robot.

Bird's-eye view of the reconstructed open-plan room with the labeled scene graph (in-room vs. out-of-room instances).

Method	SR_room ↑	SR_near ↑	SR_far ↑	SR_cross ↑	N_inst	N_pairs
FM-Fusion*	100*	100*	100*	–	5	9
ConceptGraphs	46.2	47.1	46.8	38.1	33	379
Ours	49.6	62.1	47.1	40.4	192	500

*FM-Fusion recovers valid graphs for only 2 rooms — its perfect rate reflects negligible coverage, not planning quality.

BibTeX

@article{chen2026prose,
  title     = {PROSE: Training-Free Egocentric Scene Registration with Vision-Language Models},
  author    = {Chen, Zhiang and Lee, Nahyuk and Sun, Boyang and Kwon, Taein and Pollefeys, Marc and Bauer, Zuria and Hong, Sunghwan},
  journal   = {arXiv preprint arXiv:2606.16569},
  year      = {2026}
}
