We present latentSplat, a method to predict semantic Gaussians in a 3D latent space that can be splatted and decoded by a light-weight generative 2D architecture. Existing methods for generalizable 3D reconstruction either do not enable fast inference of high-resolution novel views due to slow volume rendering, or are limited to interpolating between close input views, even in simpler settings with a single central object where 360-degree generalization would be possible. In this work, we combine a regression-based approach with a generative model, moving towards both of these capabilities within the same method, trained purely on readily available real video data. The core of our method is variational 3D Gaussians, a representation that efficiently encodes varying uncertainty within a latent space consisting of 3D feature Gaussians. From these Gaussians, specific instances can be sampled and rendered via efficient Gaussian splatting and a fast, generative decoder network. We show that latentSplat outperforms previous works in reconstruction quality and generalization, while being fast and scalable to high-resolution data.
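As a hedged illustration of the sampling step, the snippet below shows the standard reparameterization trick that such variational Gaussians would use to draw a concrete feature instance; all names, shapes, and the feature dimension are illustrative assumptions, not the released code.

import torch

def sample_gaussian_features(feat_mean: torch.Tensor,
                             feat_std: torch.Tensor) -> torch.Tensor:
    # feat_mean, feat_std: (num_gaussians, feat_dim) predicted by the encoder.
    # Returns one differentiable sample of per-Gaussian latent features.
    eps = torch.randn_like(feat_std)   # standard normal noise
    return feat_mean + feat_std * eps  # reparameterized sample

# Example with illustrative sizes: 10,000 Gaussians, 32 feature coefficients.
mean = torch.zeros(10_000, 32, requires_grad=True)
std = torch.ones(10_000, 32, requires_grad=True)
features = sample_gaussian_features(mean, std)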
We present latentSplat, a method for scalable, generalizable 3D reconstruction from two views. The architecture follows an autoencoder structure. (Left) Two input reference views are encoded into a 3D variational Gaussian representation using an epipolar transformer and a Gaussian sampling head. (Center) Variational Gaussians allow sampling of spherical harmonics feature coefficients that determine a specific instance of semantic Gaussians. (Right) The sampled instance can be rendered efficiently via Gaussian splatting and a light-weight VAE-GAN decoder.
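The following structural sketch mirrors this three-stage pipeline. The module and function names are hypothetical stand-ins chosen to illustrate the data flow (encode, sample, splat, decode), not the actual implementation.

import torch
import torch.nn as nn

class LatentSplatSketch(nn.Module):
    # Assembles user-supplied stages: an epipolar encoder over both views,
    # a head predicting Gaussian geometry plus feature mean/std, a
    # differentiable splatting function, and a light-weight 2D decoder.
    def __init__(self, encoder, head, splat_fn, decoder):
        super().__init__()
        self.encoder = encoder
        self.head = head
        self.splat_fn = splat_fn
        self.decoder = decoder

    def forward(self, view_a, view_b, target_camera):
        tokens = self.encoder(view_a, view_b)              # cross-view features
        geometry, feat_mean, feat_std = self.head(tokens)  # variational Gaussians
        # Sample one instance of the latent feature coefficients.
        features = feat_mean + feat_std * torch.randn_like(feat_std)
        # Rasterize the feature Gaussians into a 2D feature image.
        feature_image = self.splat_fn(geometry, features, target_camera)
        return self.decoder(feature_image)                 # decoded RGB novel view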
latentSplat is able to synthesize full 360° novel views of object-centric scenes without obvious geometric inconsistencies.
Videos on RealEstate10k appear realistic, without flickering from pixel-level differences in generation between nearby frames. Note that we sample Gaussian features only once, independent of the target views, resulting in consistent renderings even in cases of uncertainty (see the sketch below).
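A minimal sketch of this design choice, assuming a hypothetical render_view callable that wraps the splatting and decoding stage: the latent features are drawn a single time and reused for every frame of the trajectory.

import torch

def render_trajectory(geometry, feat_mean, feat_std, cameras, render_view):
    # One feature sample for the whole trajectory, not one per frame.
    features = feat_mean + feat_std * torch.randn_like(feat_std)
    # Reusing the same sampled instance for every camera keeps nearby
    # frames consistent even in uncertain regions.
    return [render_view(geometry, features, cam) for cam in cameras]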
Auxiliary renderings (4th column) suffer from blur in regions of high uncertainty, as well as artifacts like floating Gaussians. PCA visualizations of the intermediate features (5th column) reveal that different parts of the objects are encoded by different latent features, while the background is clearly separated and filled with noise in areas of low density. We illustrate the uncertainty of our variational Gaussians directly by rendering their standard deviation. The resulting images (6th column) show generally higher uncertainty (dark) for the background, which is either completely invisible or only partly visible in the reference views, than for the main object, for which the model learns a category-level prior. For the main object, the model is less certain about details like edges or the fur of teddy bears than about plain uniform surfaces, which explains the advantage of the generative decoder in producing a higher level of detail.
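The standard-deviation renderings could be produced along the following lines; splat_scalars is a hypothetical stand-in for the same differentiable rasterizer used for features, and reducing the per-feature deviations by their mean is one plausible choice, not necessarily the paper's.

import torch

def render_uncertainty(geometry, feat_std, camera, splat_scalars):
    # Collapse per-feature standard deviations to one scalar per Gaussian,
    # here via the mean over the feature dimension.
    sigma = feat_std.mean(dim=-1, keepdim=True)    # (num_gaussians, 1)
    # Splat the scalars instead of sampled features: the rendered intensity
    # then reflects the model's uncertainty at each point in space.
    return splat_scalars(geometry, sigma, camera)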
@inproceedings{wewer24latentsplat,
title = {latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction},
author = {Wewer, Christopher and Raj, Kevin and Ilg, Eddy and Schiele, Bernt and Lenssen, Jan Eric},
booktitle = {arXiv},
year = {2024},
}