MLP Splatting: Object-Centric Neural Fields

Shinjeong Kim*, Yuzhou Cheng*, Xin Kong*, Paul H. J. Kelly, Andrew J. Davison
Imperial College London
* Denotes equal contribution


Teaser

TL;DR: Primitives composed of local light fields, each centered on and weighted by an anisotropic Gaussian, learn to decompose the scene at the object or part level.

Decomposed objects

MLP-Splatting efficiently reconstructs even the most complex objects with just a few hundred primitives. It matches Feature-3DGS in scene quality (+0.6 dB PSNR) while using 1/63 the primitives and 1/6 the memory, and it renders faster.

Primitive counts for the objects shown: 51, 483, 97, 44, 46, and 143 MLPs.

Method figures

MLP-Splatting offers the compactness of a neural scene representation while keeping each primitive’s influence independent, which makes downstream tasks such as scene editing straightforward. For example, adding semantic guidance on top of the proposed method enables language-guided editing (see examples).

Figures: the 3-axis local primitive frame and the MLP-Splatting teaser.

Abstract

3D representations are fundamental to scene rendering, understanding, and interaction. Recent approaches, such as 3D Gaussian Splatting and Neural Radiance Fields, achieve impressive photorealistic novel-view synthesis, but they cannot easily decompose scene elements into a few primitives and therefore require additional segmentation or grouping for object-level manipulation. We present MLP-Splatting, a method that enables scene decomposition via a few expressive light-field primitives while providing photorealistic novel-view synthesis.

MLP-Splatting models each primitive as an independent compact MLP with localized spatial support that predicts radiance and opacity. In contrast to low-level Gaussian primitives or a single global radiance field, our neural primitives provide greater expressive capacity while remaining spatially localized. Rendering is performed through efficient sparse volumetric compositing over ray–primitive interactions.
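
The compositing idea described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. Several things here are assumptions made for illustration only: the dictionary layout of a primitive, the names `gaussian_weight` and `composite_ray`, an axis-aligned (unrotated) Gaussian, evaluating each primitive once at the ray point closest to its center, and a plain function standing in for the per-primitive MLP.

```python
import math

def gaussian_weight(point, center, inv_scales):
    """Unnormalized axis-aligned anisotropic Gaussian falloff in (0, 1].

    Assumption: no rotation; inv_scales holds 1/sigma for each axis.
    """
    m = sum(((p - c) * s) ** 2 for p, c, s in zip(point, center, inv_scales))
    return math.exp(-0.5 * m)

def composite_ray(origin, direction, primitives, min_weight=1e-3):
    """Front-to-back alpha compositing over the primitives a ray touches.

    Assumption: `direction` is unit length and each primitive contributes
    a single sample at the ray point closest to its center.
    """
    hits = []
    for prim in primitives:
        # Depth along the ray of the point closest to the primitive center.
        t = sum((c - o) * d for c, o, d in zip(prim["center"], origin, direction))
        if t <= 0.0:
            continue  # primitive center lies behind the ray origin
        point = [o + t * d for o, d in zip(origin, direction)]
        w = gaussian_weight(point, prim["center"], prim["inv_scales"])
        if w < min_weight:
            continue  # sparse: ignore primitives with negligible support here
        # Stand-in for the per-primitive MLP: returns (rgb, opacity).
        rgb, opacity = prim["field"](point, direction)
        hits.append((t, rgb, opacity * w))

    hits.sort(key=lambda h: h[0])  # nearest primitive first
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for _, rgb, alpha in hits:
        for i in range(3):
            color[i] += transmittance * alpha * rgb[i]
        transmittance *= 1.0 - alpha
    return color, transmittance

# Two toy primitives on the optical axis: a half-transparent red one in
# front of an opaque blue one.
red = {"center": (0.0, 0.0, 2.0), "inv_scales": (2.0, 2.0, 2.0),
       "field": lambda p, d: ((1.0, 0.0, 0.0), 0.5)}
blue = {"center": (0.0, 0.0, 4.0), "inv_scales": (1.0, 1.0, 1.0),
        "field": lambda p, d: ((0.0, 0.0, 1.0), 1.0)}

both_color, both_t = composite_ray((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), [blue, red])
edited_color, edited_t = composite_ray((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), [blue])
```

In this toy setup the ray blends red over blue, and dropping the red primitive from the list leaves only the blue contribution. Because each primitive's contribution is independent, deleting one from the list amounts to a scene edit in this sketch.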

Our primitives are trained with RGB supervision alone, which yields primitives that represent local scene regions, often corresponding to objects or object parts. This enables interactive object-level editing without segmentation masks, simply by selecting a handful of primitives. Augmented with optional semantic feature distillation, our method also enables open-vocabulary scene interaction and open-set instant segmentation. Compared to state-of-the-art semantic 3DGS methods, our experiments show substantially lower memory usage ($1/15\times$) and faster rendering ($3\times$).
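
As a rough illustration of how language-guided selection over primitives could work, the sketch below picks primitives whose feature is similar to a query embedding and then deletes them. It rests on assumptions that the text above does not confirm: that each primitive can be summarized by a single distilled feature vector, that a text encoder supplies a query embedding in the same space, and that cosine similarity with a fixed threshold is the selection rule. The names `cosine`, `select_primitives`, and `delete_primitives`, and the toy vectors, are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def select_primitives(features, query, threshold=0.8):
    """Indices of primitives whose feature matches the query embedding."""
    return [i for i, f in enumerate(features) if cosine(f, query) >= threshold]

def delete_primitives(primitives, selected):
    """Return the scene with the selected primitives removed."""
    drop = set(selected)
    return [p for i, p in enumerate(primitives) if i not in drop]

# Toy 3-D features standing in for distilled semantic embeddings.
features = [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0), (0.0, 1.0, 0.0)]
primitives = ["chair_part_a", "chair_part_b", "table"]
query = (1.0, 0.0, 0.0)  # hypothetical text embedding for "chair"

selected = select_primitives(features, query)
remaining = delete_primitives(primitives, selected)
```

Here the first two primitives are selected and removed, leaving only the third. A real system would obtain `query` from a text encoder and would likely need a tuned threshold.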

Interactive Demo

Below is an interactive demo of the object selection, editing, and language-guided editing enabled by our object-level primitives.

Language-guided segmentation and deletion.

Results

Table 1. Novel-view synthesis evaluation on the Replica [Straub et al. 2019] and ScanNet [Dai et al. 2017] datasets.

| Dataset | Method       | PSNR  | SSIM  | LPIPS |
|---------|--------------|-------|-------|-------|
| Replica | Feature-3DGS | 36.18 | 0.964 | 0.079 |
| Replica | Ours         | 36.25 | 0.971 | 0.090 |
| ScanNet | Feature-3DGS | 23.32 | 0.817 | 0.362 |
| ScanNet | Ours         | 25.35 | 0.830 | 0.403 |
Figure 4. 3D semantic segmentation with LSeg-guided embeddings rendered on novel views, comparing Feature-3DGS and Ours against ground truth. Columns: Ground Truth, Rendered Image, Rendered Feature, Segmentation. The top two rows are from the Replica dataset [Straub et al. 2019], and the bottom two rows are from the ScanNet dataset [Dai et al. 2017].

BibTeX

@article{kim2026mlpsplatting,
  title   = {MLP Splatting: Object-Centric Neural Fields},
  author  = {Kim, Shinjeong and Cheng, Yuzhou and Kong, Xin and Kelly, Paul H. J. and Davison, Andrew J.},
  journal = {arXiv preprint arXiv:2606.03877},
  year    = {2026}
}