TRACE: High-Fidelity 3D Scene Editing via Tangible Reconstruction and Geometry-Aligned Contextual Video Masking

¹ReLER Lab, CCAI, Zhejiang University  ²DBMI, HMS, Harvard University

TRACE enables high-fidelity 3D scene editing through geometry-aware multi-view anchoring, precise asset alignment, and temporally stable video-based refinement across diverse structural and appearance edits.

Abstract

We present TRACE, a mesh-guided 3DGS editing framework that achieves automated, high-fidelity scene transformation. By anchoring video diffusion with explicit 3D geometry, TRACE enables fine-grained, part-level manipulation, such as local pose shifting or component replacement, while preserving the structural integrity of the central subject.

Our approach consists of three stages: Multi-view 3D-Anchor Synthesis, which uses the MV-TRACE dataset to generate spatially consistent anchors; Tangible Geometry Anchoring (TGA), which aligns inserted meshes with the 3DGS scene through two-phase registration; and Contextual Video Masking (CVM), which integrates 3D projections into an autoregressive video pipeline for temporally stable and physically grounded rendering.

Extensive experiments show that TRACE consistently outperforms existing methods, especially in editing versatility and structural integrity.

Results

Quantitative

Benchmark Comparison

We evaluate on a benchmark of 8 scenes collected from IN2N, BlendedMVS, and Mip-NeRF 360, covering both geometry transformation and style transfer tasks with roughly 6 editing cases per scene. TRACE improves semantic alignment, structural consistency, and perceived visual quality while keeping runtime competitive with feed-forward baselines.

Method          Pub.      CLIPdir ↑   CLIPsim ↑   DINO ↑   Aesthetic ↑   Time ↓
DGE             ECCV'24   0.0655      0.2371      0.8849   5.7974        10 min
GaussCtrl       ECCV'24   0.0446      0.2126      0.8962   5.5164        20 min
TIP-Editor      SIG'24    0.1013      0.2316      0.8534   5.3854        45 min
GaussianEditor  CVPR'24   0.0680      0.2254      0.8671   5.4307        16 min
EditSplat       CVPR'25   0.0762      0.2299      0.8834   5.8071        18 min
Vipe3dedit      AAAI'25   0.0331      0.2154      0.8831   5.6823        10 min
TRACE           ---       0.1514      0.2465      0.9058   6.1035        10 min
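The directional CLIP score (CLIPdir) reported above measures how well the change from source to edited image matches the change from source to target prompt. A minimal sketch of the standard computation on precomputed embeddings (the function name, and the assumption that the four embeddings are already extracted by a CLIP encoder, are ours, not from the paper):

```python
import numpy as np

def clip_directional_similarity(src_img_emb, edit_img_emb, src_txt_emb, edit_txt_emb):
    """Cosine similarity between the image-space and text-space edit directions."""
    d_img = edit_img_emb - src_img_emb      # how the image embedding moved
    d_txt = edit_txt_emb - src_txt_emb      # how the prompt embedding moved
    d_img = d_img / np.linalg.norm(d_img)
    d_txt = d_txt / np.linalg.norm(d_txt)
    return float(d_img @ d_txt)             # in [-1, 1]; higher is better
```

CLIPsim, by contrast, is typically the plain cosine similarity between the edited image embedding and the target prompt embedding.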

Qualitative

Main Visual Comparisons

Across shape transformation, object insertion, and structural deformation, TRACE produces stronger geometry fidelity and cleaner textures than prior 3D scene editing baselines.

Qualitative comparisons of TRACE with baselines
Main comparison. TRACE preserves local structure and visual detail across a wide range of editing cases.
Comparison with direct video editing methods
Direct video editing comparison. TRACE preserves static backgrounds and achieves more stable 3D-consistent asset placement than direct video editing baselines.

Method Overview

TRACE method overview
Pipeline overview. TRACE first synthesizes geometry-aligned multi-view anchors, then aligns inserted meshes with the 3DGS scene through TGA, and finally uses CVM to propagate edits with temporally stable video repainting before reconstructing the edited 3D scene.

Module 1

MV-TRACE And Multi-view Anchoring

MV-TRACE is a multi-view consistent dataset for scene-coherent object addition and modification. Built on top of this supervision, Multi-view Anchoring replaces unstable single-view guidance with geometry-aware anchors, giving TRACE reliable cross-view placement and stronger background preservation during editing.

MV-TRACE dataset curation pipeline
MV-TRACE dataset curation. The pipeline constructs 3D-aware editing pairs through asset creation or retrieval, spatial alignment, dense view sampling, pair filtering, and refinement, providing reliable supervision for multi-view consistent editing.
Multiview editing ablation for TRACE
Multi-view anchoring. Compared with no LoRA and multi-angle LoRA baselines, TRACE's 3D LoRA achieves more faithful object placement, stronger viewpoint consistency, and better preservation of scene background structure.

Module 2

Tangible Geometry Anchoring (TGA)

Tangible Geometry Anchoring (TGA) resolves pose, scale, and coordinate mismatch between generated assets and the target scene. Its progressive alignment strategy moves from rough initialization to precise scene-consistent registration, ensuring the inserted geometry is ready for stable downstream rendering.

TRACE alignment pipeline
TGA alignment. The two-stage alignment pipeline progressively transforms a severely misaligned initialization into accurate scene-consistent asset placement, resolving orientation ambiguity and stabilizing geometric registration.
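The text above does not spell out TGA's registration math, so as an illustrative sketch only: a coarse first phase could recover scale, rotation, and translation in closed form from corresponding asset/scene points via the Umeyama similarity transform (all names and the choice of algorithm are our assumption, not the paper's method):

```python
import numpy as np

def umeyama(src, dst):
    """Closed-form similarity transform (scale s, rotation R, translation t)
    mapping src points onto dst points: dst ≈ s * src @ R.T + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)        # cross-covariance of centered sets
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))      # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    var_s = (src_c ** 2).sum() / len(src)   # variance of the source cloud
    s = np.trace(np.diag(S) @ D) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t
```

A fine second phase would then refine this initialization against the 3DGS geometry, e.g. with ICP-style iteration, which matches the coarse-to-fine spirit of the two-stage pipeline described above.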

Module 3

Contextual Video Masking (CVM)

Contextual Video Masking (CVM) performs geometry-aware video repainting after anchoring. It refines the boundary between inserted content and the original scene, leading to sharper backgrounds, more plausible lighting, and temporally stable visual synthesis.

TRACE contextual video masking ablation
CVM refinement. Contextual Video Masking improves local object fidelity, surrounding illumination, and temporal stability, while keeping the rendered sequence sharp and physically coherent across the trajectory.
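CVM's repaint regions come from 3D projections; as a toy sketch of that idea, an inserted asset's geometry can be rasterized into a per-frame binary mask under a pinhole camera (intrinsics `K`, world-to-camera extrinsics `w2c`; every detail here is an assumption for illustration, not TRACE's actual masking code):

```python
import numpy as np

def projection_mask(points_w, K, w2c, h, w, radius=2):
    """Splat 3D points of an inserted asset into a binary image mask.

    points_w: (N, 3) world-space points; K: (3, 3) intrinsics;
    w2c: (4, 4) world-to-camera matrix. Hypothetical stand-ins for
    the real mesh projection.
    """
    pts_h = np.hstack([points_w, np.ones((len(points_w), 1))])
    cam = (w2c @ pts_h.T).T[:, :3]              # world -> camera frame
    cam = cam[cam[:, 2] > 1e-6]                 # drop points behind the camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                 # perspective divide
    mask = np.zeros((h, w), dtype=bool)
    for u, v in np.round(uv).astype(int):
        if 0 <= u < w and 0 <= v < h:           # splat a small square per point
            mask[max(v - radius, 0):min(v + radius + 1, h),
                 max(u - radius, 0):min(u + radius + 1, w)] = True
    return mask
```

Repainting only inside such a mask (optionally dilated for a soft boundary) is what lets the surrounding background stay untouched while the diffusion model harmonizes the inserted content.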

Comprehensive Editing Capabilities

Comprehensive TRACE capability overview
Comprehensive editing capabilities. TRACE supports addition and removal, stylization, object manipulation, and both local and global shape modification within one unified editing framework.

More Examples

BibTeX

@misc{hu2026tracehighfidelity3dscene,
  title={TRACE: High-Fidelity 3D Scene Editing via Tangible Reconstruction and Geometry-Aligned Contextual Video Masking},
  author={Jiyuan Hu and Zechuan Zhang and Zongxin Yang and Yi Yang},
  year={2026},
  eprint={2604.01207},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.01207},
}