Autonomous driving systems rely heavily on multimodal perception data to understand complex environments. However, the long-tailed distribution of real-world data hinders generalization, especially for rare but safety-critical vehicle categories. To address this challenge, we propose MultiEditor, a dual-branch latent diffusion framework designed to jointly edit images and LiDAR point clouds in driving scenarios. At the core of our approach is the introduction of 3D Gaussian Splatting (3DGS) as a structural and appearance prior for target objects. Leveraging this prior, we design a multi-level appearance control mechanism, comprising pixel-level pasting, semantic-level guidance, and multi-branch refinement, to achieve high-fidelity reconstruction across modalities. We further propose a depth-guided deformable cross-modality condition module that uses 3DGS-rendered depth to adaptively enable mutual guidance between modalities, significantly enhancing cross-modality consistency. Extensive experiments demonstrate that MultiEditor achieves superior performance in visual and geometric fidelity, editing controllability, and cross-modality consistency. Furthermore, generating rare-category vehicle data with MultiEditor substantially improves the detection accuracy of perception models on underrepresented classes.
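
To make the depth-guided deformable cross-modality conditioning idea more concrete, here is a minimal PyTorch sketch of one plausible realization. It is not the paper's implementation: the module name `DepthGuidedDeformableCondition`, its layers, and the offset scaling are hypothetical, and the sketch only illustrates the general pattern of predicting sampling offsets from a 3DGS-rendered depth map and using them to deformably sample the other modality's features as a condition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DepthGuidedDeformableCondition(nn.Module):
    """Illustrative sketch (not the paper's code): predict per-pixel sampling
    offsets from a 3DGS-rendered depth map, use them to deformably sample the
    other modality's feature map, and fuse the result with the target features."""

    def __init__(self, channels: int):
        super().__init__()
        # Small conv net mapping rendered depth to 2D sampling offsets (dx, dy).
        self.offset_net = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2, kernel_size=3, padding=1),
        )
        # Fuses target features with the deformably sampled source features.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, target_feat, source_feat, rendered_depth):
        # target_feat, source_feat: (B, C, H, W); rendered_depth: (B, 1, H, W)
        b, _, h, w = target_feat.shape

        # Base identity sampling grid in normalized [-1, 1] coordinates.
        ys, xs = torch.meshgrid(
            torch.linspace(-1.0, 1.0, h, device=target_feat.device),
            torch.linspace(-1.0, 1.0, w, device=target_feat.device),
            indexing="ij",
        )
        base_grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)

        # Depth-conditioned offsets, kept small via tanh scaling (assumed choice).
        offsets = 0.1 * torch.tanh(self.offset_net(rendered_depth))   # (B, 2, H, W)
        grid = base_grid + offsets.permute(0, 2, 3, 1)                # (B, H, W, 2)

        # Deformable sampling of the other modality's features.
        sampled = F.grid_sample(source_feat, grid, align_corners=True)

        # Return cross-modality conditioned features.
        return self.fuse(torch.cat([target_feat, sampled], dim=1))


if __name__ == "__main__":
    # Toy usage with random tensors, purely to show the interface.
    module = DepthGuidedDeformableCondition(channels=64)
    img_feat = torch.randn(2, 64, 32, 32)
    lidar_feat = torch.randn(2, 64, 32, 32)
    depth = torch.rand(2, 1, 32, 32)
    out = module(img_feat, lidar_feat, depth)
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```

In a dual-branch setup such as the one described above, a module of this kind would presumably be applied symmetrically, so that the image branch is conditioned on LiDAR features and vice versa, with the shared 3DGS-rendered depth providing the geometric alignment between the two.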
@misc{lu2025multieditorcontrollablemultimodalobject,
  title={MultiEditor: Controllable Multimodal Object Editing for Driving Scenarios Using 3D Gaussian Splatting Priors},
  author={Shouyi Lu and Zihan Lin and Chao Lu and Huanran Wang and Guirong Zhuo and Lianqing Zheng},
  year={2025},
  eprint={2507.21872},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2507.21872}
}