TRACE: High-Fidelity 3D Scene Editing via Tangible Reconstruction and Geometry-Aligned Contextual Video Masking

¹ReLER Lab, CCAI, Zhejiang University  ²DBMI, HMS, Harvard University

TRACE enables high-fidelity 3D scene editing through geometry-aware multi-view anchoring, precise asset alignment, and temporally stable video-based refinement across diverse structural and appearance edits.

Abstract

We present TRACE, a mesh-guided 3DGS editing framework that achieves automated, high-fidelity scene transformation. By anchoring video diffusion with explicit 3D geometry, TRACE enables fine-grained, part-level manipulation, such as local pose shifting or component replacement, while preserving the structural integrity of the central subject.

Our approach consists of three stages: Multi-view 3D-Anchor Synthesis, which uses the MV-TRACE dataset to generate spatially consistent anchors; Tangible Geometry Anchoring (TGA), which aligns inserted meshes with the 3DGS scene through two-phase registration; and Contextual Video Masking (CVM), which integrates 3D projections into an autoregressive video pipeline for temporally stable and physically grounded rendering.

Extensive experiments show that TRACE consistently outperforms existing methods, especially in editing versatility and structural integrity.

Results

Quantitative

Benchmark Comparison

We evaluate on a benchmark of 8 scenes collected from IN2N, BlendedMVS, and Mip-NeRF 360, covering both geometry transformation and style transfer tasks with roughly 6 editing cases per scene. TRACE improves semantic alignment, structural consistency, and perceived visual quality while keeping runtime competitive with feed-forward baselines.

Method          Pub.      CLIPdir ↑   CLIPsim ↑   DINO ↑   Aesthetic ↑   Time ↓
DGE             ECCV'24   0.0655      0.2371      0.8849   5.7974        10 min
GaussCtrl       ECCV'24   0.0446      0.2126      0.8962   5.5164        20 min
TIP-Editor      SIG'24    0.1013      0.2316      0.8534   5.3854        45 min
GaussianEditor  CVPR'24   0.0680      0.2254      0.8671   5.4307        16 min
EditSplat       CVPR'25   0.0762      0.2299      0.8834   5.8071        18 min
Vipe3dedit      AAAI'25   0.0331      0.2154      0.8831   5.6823        10 min
TRACE           ---       0.1514      0.2465      0.9058   6.1035        10 min
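The directional CLIP score (CLIPdir) reported above measures how well the change from source to edited image matches the change from source to target prompt. A minimal sketch of the standard computation on precomputed embeddings (the function name, and the assumption that the four embeddings are already extracted by a CLIP encoder, are ours, not from the paper):

```python
import numpy as np

def clip_directional_similarity(src_img_emb, edit_img_emb, src_txt_emb, edit_txt_emb):
    """Cosine similarity between the image-space and text-space edit directions."""
    d_img = edit_img_emb - src_img_emb      # how the image embedding moved
    d_txt = edit_txt_emb - src_txt_emb      # how the prompt embedding moved
    d_img = d_img / np.linalg.norm(d_img)
    d_txt = d_txt / np.linalg.norm(d_txt)
    return float(d_img @ d_txt)             # in [-1, 1]; higher is better
```

CLIPsim, by contrast, is typically the plain cosine similarity between the edited image embedding and the target prompt embedding.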

Qualitative

Main Visual Comparisons

Across shape transformation, object insertion, and structural deformation, TRACE produces stronger geometry fidelity and cleaner textures than prior 3D scene editing baselines.

Qualitative comparisons of TRACE with baselines
Main comparison. TRACE preserves local structure and visual detail across a wide range of editing cases.
Comparison with direct video editing methods
Direct video editing comparison. TRACE preserves static backgrounds and achieves more stable 3D-consistent asset placement than direct video editing baselines.

Method Overview

TRACE method overview
Pipeline overview. TRACE first synthesizes geometry-aligned multi-view anchors, then aligns inserted meshes with the 3DGS scene through TGA, and finally uses CVM to propagate edits with temporally stable video repainting before reconstructing the edited 3D scene.

Module 1

MV-TRACE And Multi-view Anchoring

MV-TRACE is a multi-view consistent dataset for scene-coherent object addition and modification. Built on top of this supervision, Multi-view Anchoring replaces unstable single-view guidance with geometry-aware anchors, giving TRACE reliable cross-view placement and stronger background preservation during editing.

MV-TRACE dataset curation pipeline
MV-TRACE dataset curation. The pipeline constructs 3D-aware editing pairs through asset creation or retrieval, spatial alignment, dense view sampling, pair filtering, and refinement, providing reliable supervision for multi-view consistent editing.
Multiview editing ablation for TRACE
Multi-view anchoring. Compared with no LoRA and multi-angle LoRA baselines, TRACE's 3D LoRA achieves more faithful object placement, stronger viewpoint consistency, and better preservation of scene background structure.

Module 2

Tangible Geometry Anchoring (TGA)

Tangible Geometry Anchoring (TGA) resolves pose, scale, and coordinate mismatch between generated assets and the target scene. Its progressive alignment strategy moves from rough initialization to precise scene-consistent registration, ensuring the inserted geometry is ready for stable downstream rendering.

TRACE alignment pipeline
TGA alignment. The two-stage alignment pipeline progressively transforms a severely misaligned initialization into accurate scene-consistent asset placement, resolving orientation ambiguity and stabilizing geometric registration.
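The text above does not spell out TGA's registration math, so as an illustrative sketch only: a coarse first phase could recover scale, rotation, and translation in closed form from corresponding asset/scene points via the Umeyama similarity transform (all names and the choice of algorithm are our assumption, not the paper's method):

```python
import numpy as np

def umeyama(src, dst):
    """Closed-form similarity transform (scale s, rotation R, translation t)
    mapping src points onto dst points: dst ≈ s * src @ R.T + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)        # cross-covariance of centered sets
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))      # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    var_s = (src_c ** 2).sum() / len(src)   # variance of the source cloud
    s = np.trace(np.diag(S) @ D) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t
```

A fine second phase would then refine this initialization against the 3DGS geometry, e.g. with ICP-style iteration, which matches the coarse-to-fine spirit of the two-stage pipeline described above.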

Module 3

Contextual Video Masking (CVM)

Contextual Video Masking (CVM) performs geometry-aware video repainting after anchoring. It refines the boundary between inserted content and the original scene, leading to sharper backgrounds, more plausible lighting, and temporally stable visual synthesis.

TRACE contextual video masking ablation
CVM refinement. Contextual Video Masking improves local object fidelity, surrounding illumination, and temporal stability, while keeping the rendered sequence sharp and physically coherent across the trajectory.
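CVM's repaint regions come from 3D projections; as a toy sketch of that idea, an inserted asset's geometry can be rasterized into a per-frame binary mask under a pinhole camera (intrinsics `K`, world-to-camera extrinsics `w2c`; every detail here is an assumption for illustration, not TRACE's actual masking code):

```python
import numpy as np

def projection_mask(points_w, K, w2c, h, w, radius=2):
    """Splat 3D points of an inserted asset into a binary image mask.

    points_w: (N, 3) world-space points; K: (3, 3) intrinsics;
    w2c: (4, 4) world-to-camera matrix. Hypothetical stand-ins for
    the real mesh projection.
    """
    pts_h = np.hstack([points_w, np.ones((len(points_w), 1))])
    cam = (w2c @ pts_h.T).T[:, :3]              # world -> camera frame
    cam = cam[cam[:, 2] > 1e-6]                 # drop points behind the camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                 # perspective divide
    mask = np.zeros((h, w), dtype=bool)
    for u, v in np.round(uv).astype(int):
        if 0 <= u < w and 0 <= v < h:           # splat a small square per point
            mask[max(v - radius, 0):min(v + radius + 1, h),
                 max(u - radius, 0):min(u + radius + 1, w)] = True
    return mask
```

Repainting only inside such a mask (optionally dilated for a soft boundary) is what lets the surrounding background stay untouched while the diffusion model harmonizes the inserted content.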

Comprehensive Editing Capabilities

Comprehensive TRACE capability overview
Comprehensive editing capabilities. TRACE supports addition and removal, stylization, object manipulation, and both local and global shape modification within one unified editing framework.

More Examples

BibTeX

@misc{hu2026tracehighfidelity3dscene,
  title={TRACE: High-Fidelity 3D Scene Editing via Tangible Reconstruction and Geometry-Aligned Contextual Video Masking},
  author={Jiyuan Hu and Zechuan Zhang and Zongxin Yang and Yi Yang},
  year={2026},
  eprint={2604.01207},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.01207},
}