Segment Any 4D Gaussians

Abstract

Modeling, understanding, and reconstructing the real world are crucial in XR/VR. Recently, 3D Gaussian Splatting (3D-GS) methods have shown remarkable success in modeling and understanding 3D scenes. Similarly, various 4D representations have demonstrated the ability to capture the dynamics of the 4D world. However, there is a dearth of research focusing on segmentation within 4D representations. In this paper, we propose Segment Any 4D Gaussians (SA4D), the first framework to segment anything in the 4D digital world based on 4D Gaussians. In SA4D, an efficient temporal identity feature field is introduced to handle Gaussian drifting, with the potential to learn precise identity features from noisy and sparse input. Additionally, a 4D segmentation refinement process is proposed to remove artifacts. SA4D achieves precise, high-quality segmentation of 4D Gaussians within seconds and supports object removal, recoloring, composition, and rendering of high-quality anything masks from novel views.

Given pre-trained 4D Gaussians, SA4D can decompose the Gaussians at the object level and supports removal, recoloring, object composition, and rendering of anything masks.

Method

Overview of our training pipeline. Given a timestamp $t$ and canonical 3D Gaussians $\mathcal{G}$, the ID encoding $e$ and the deformed 3D Gaussians $\mathcal{G}'$ are predicted by an optimizable temporal identity field network $\phi_{\theta}$ and a frozen deformation field network $\mathcal{F}$, respectively. The ID encodings $e$ are then splatted into the image-space encoding $E$, and a convolutional classifier $\phi_c$ predicts each pixel's ID $f$. The whole training pipeline is supervised with $\mathcal{L}_{loss}$ against $I_{seg}$, the segmentation predicted by a video tracker.
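The last two stages of the pipeline (splatting $e$ to $E$, then classifying each pixel) can be illustrated for a single pixel with a minimal sketch. It is not the SA4D implementation and rests on several assumptions: splatting is taken to be the standard front-to-back alpha blending used in 3D-GS, the convolutional classifier $\phi_c$ is reduced to a 1x1 convolution (a per-pixel linear layer), and $\mathcal{L}_{loss}$ is assumed to be a cross-entropy against the tracker label. The function names `splat_pixel`, `classify_pixel`, and `cross_entropy` are hypothetical.

```python
import math


def splat_pixel(encodings, alphas):
    """Blend per-Gaussian ID encodings e into one pixel value of E.

    Assumption: Gaussians are already sorted front to back and `alphas`
    holds each Gaussian's opacity contribution at this pixel.
    """
    dim = len(encodings[0])
    blended = [0.0] * dim
    transmittance = 1.0  # fraction of light not yet absorbed
    for e, a in zip(encodings, alphas):
        weight = a * transmittance
        for k in range(dim):
            blended[k] += weight * e[k]
        transmittance *= 1.0 - a
    return blended


def classify_pixel(feature, weights, bias):
    """Stand-in for phi_c: a 1x1 convolution followed by a softmax."""
    logits = [
        sum(w * x for w, x in zip(row, feature)) + b
        for row, b in zip(weights, bias)
    ]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def cross_entropy(probs, label):
    """Assumed form of L_loss for one pixel; `label` comes from I_seg."""
    return -math.log(probs[label])


# Two Gaussians on one ray, each carrying a 2-dimensional ID encoding.
pixel_E = splat_pixel([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
pixel_probs = classify_pixel(pixel_E, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
pixel_id = pixel_probs.index(max(pixel_probs))  # predicted ID f
pixel_loss = cross_entropy(pixel_probs, 0)      # tracker label assumed to be 0
```

In this toy case the front Gaussian dominates the blend, so the pixel is assigned its ID. In the described pipeline only $\phi_{\theta}$ and $\phi_c$ would receive gradients from such a loss, since the deformation field $\mathcal{F}$ is frozen.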