Method Overview. (a) The SceneTok autoencoder encodes view sets into a set of compressed, unstructured scene tokens by chaining a VA-VAE image compressor and a perceiver module. The tokens can be rendered from novel views with a generative decoder based on rectified flows. (b) A latent diffusion transformer performs scene generation by generating compressed scene tokens. Generation can be conditioned on one or a few images and a set of anchor poses that define the spatial extent of the scene.
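Below is a minimal PyTorch-style sketch of this two-stage design. The module names (image_compressor, perceiver, flow_decoder) and method signatures are illustrative placeholders for the VA-VAE compressor, perceiver module, and rectified-flow renderer, not the actual SceneTok interfaces.

```python
# Minimal sketch of the first-stage autoencoder described above.
# All module names and signatures are hypothetical placeholders.
import torch.nn as nn

class SceneTokAutoencoder(nn.Module):
    def __init__(self, image_compressor, perceiver, flow_decoder):
        super().__init__()
        self.image_compressor = image_compressor  # VA-VAE: per-view image latents
        self.perceiver = perceiver                # cross-attends views into a fixed token set
        self.flow_decoder = flow_decoder          # rectified-flow generative renderer

    def encode(self, views, poses):
        # views: (B, V, 3, H, W) input images, poses: camera parameters per view
        latents = self.image_compressor(views)
        return self.perceiver(latents, poses)     # (B, N, D) compressed, unstructured scene tokens

    def render(self, tokens, target_pose, num_steps=25):
        # Sample a rendering from a novel viewpoint by integrating the
        # rectified flow conditioned on the scene tokens and the target pose.
        return self.flow_decoder.sample(tokens, target_pose, num_steps=num_steps)
```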
Qualitative comparisons of SceneTok with baselines including MVSplat, MVSplat360, DepthSplat, and LVSM. Our method achieves superior qualitative results under higher compression and renders smooth trajectories, view-dependent appearance, and small scene dynamics with minimal artifacts.
SceneTok outperforms LVSM and the baselines with explicit representations on most metrics on RealEstate10K, DL3DV-140, and zero-shot ACID. Long-LRM achieves superior performance on DL3DV in terms of PSNR, SSIM, and LPIPS, but requires a much larger representation size and uses higher-resolution images. MVSplat360 and DepthSplat on DL3DV also use higher-resolution images (and thus a larger representation size) as inputs and for rendering. For a fair comparison, we center-crop and resize their renderings to 256×256.
Token Sweep: Progressively unmasking tokens increases information content and reduces overall variance in the output renderings. The decoder then samples (renders) from a much narrower distribution. Top row: RealEstate10K example; bottom row: DL3DV example.
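A short sketch of how such a sweep can be run, assuming the hypothetical SceneTokAutoencoder interface sketched above; the actual masking scheme used by SceneTok may differ.

```python
# Render the same target view while progressively unmasking more scene tokens.
# Keeping the first k tokens stands in for the real masking scheme (assumption).
import torch

@torch.no_grad()
def token_sweep(model, views, poses, target_pose, fractions=(0.125, 0.25, 0.5, 1.0), seed=0):
    tokens = model.encode(views, poses)            # (B, N, D)
    n = tokens.shape[1]
    renders = []
    for f in fractions:
        torch.manual_seed(seed)                    # fix decoder noise so renders are comparable
        k = max(1, int(f * n))
        visible = tokens[:, :k]                    # unmask only a fraction of the tokens
        renders.append(model.render(visible, target_pose))
    return renders                                 # variance across renders shrinks as f -> 1
```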
Depth Decoding on Frozen SceneTok: We finetune our SceneTok decoder to render depth instead of RGB on the frozen token space. We train using pseudo ground truth obtained from Video Depth Anything and observe that the latent space inherently encodes geometric cues without any additional supervision. Surprisingly, SceneTok learns to smooth out the inconsistencies and flickering present in the pseudo-GT. Alongside depth renderings, we also show the uncertainty in the depth predictions and the RGB renderings.
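A minimal sketch of one training step for this setup, assuming the decoder is trained with a standard rectified-flow (flow-matching) objective on depth targets; the exact loss, interpolation convention, and interfaces are assumptions.

```python
# One finetuning step of a depth decoder on the frozen token space,
# supervised by pseudo-GT depth (e.g., from Video Depth Anything).
import torch
import torch.nn.functional as F

def depth_finetune_step(model, depth_decoder, optimizer, views, poses, target_pose, pseudo_depth):
    with torch.no_grad():
        tokens = model.encode(views, poses)                 # token space stays frozen
    # Rectified-flow objective: predict the velocity from data to noise
    # at a random interpolation time t (t = 1 is pure noise here).
    noise = torch.randn_like(pseudo_depth)
    t = torch.rand(pseudo_depth.shape[0], device=pseudo_depth.device)
    t_ = t.view(-1, 1, 1, 1)
    x_t = (1 - t_) * pseudo_depth + t_ * noise
    v_pred = depth_decoder(x_t, t, tokens, target_pose)     # conditioned on tokens + pose
    loss = F.mse_loss(v_pred, noise - pseudo_depth)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```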
Compared to some previous works, such as RayZer, SceneTok generalizes to novel trajectories. Here, we render encoded SceneTok tokens from a camera trajectory taken from a different video, showing that SceneTok performs true NVS and can be considered a genuine scene representation. Rendered views become more uncertain and deteriorate as soon as we move outside the regions observed by the input views, which is expected, since the first-stage autoencoder is not trained for extrapolation. For extrapolation, our second-stage SceneGen can generate full novel scenes in latent space.
Note: We render the source scene with the transfer trajectory taken from another scene.
Here, RayZer fails to follow the novel trajectory and renders the original one instead; LVSM follows the novel trajectory to some degree but degrades very quickly. Meanwhile, SceneTok follows the novel trajectory and produces reasonable output for in-context regions.
We use the True Pose Similarity (TPS) metric, which measures the pose accuracy of the renderings against the ground-truth images (of the transfer scene) by predicting camera poses for both with VGGT and computing accuracy under different error thresholds.
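A sketch of how such a pose-accuracy-under-threshold score could be computed once poses have been predicted for both sets of frames and aligned to a common frame and scale; the thresholds and the exact error definitions used by TPS are assumptions here.

```python
# Fraction of frames whose predicted pose matches the ground-truth pose
# within translation/rotation thresholds (illustrative values).
import numpy as np

def pose_accuracy(pred_poses, gt_poses, trans_thresholds=(0.05, 0.1, 0.2), rot_threshold_deg=5.0):
    # pred_poses, gt_poses: (F, 4, 4) camera matrices, already aligned
    # (e.g., by a similarity transform) to the same frame and scale.
    t_err, r_err = [], []
    for pred, gt in zip(pred_poses, gt_poses):
        t_err.append(np.linalg.norm(pred[:3, 3] - gt[:3, 3]))
        cos = (np.trace(pred[:3, :3].T @ gt[:3, :3]) - 1.0) / 2.0
        r_err.append(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
    t_err, r_err = np.asarray(t_err), np.asarray(r_err)
    return {thr: float(((t_err <= thr) & (r_err <= rot_threshold_deg)).mean())
            for thr in trans_thresholds}
```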
Qualitative comparisons of SceneGen with baselines including DFM, DFoT, and SEVA. We show that generation in the token space is efficient, allowing for faster inference and comparable generation quality. Note that SEVA is trained on a large corpus of closed-source data whereas our method and the other baselines are trained purely on RealEstate10K.
We observe an overall improvement in global structure when shifting the sampled timesteps toward higher noise levels during training and inference. Additionally, in several cases a higher guidance scale also improves overall visual quality, though not consistently.
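A sketch of these two inference-time knobs, written against a hypothetical denoiser interface: shifting the timestep schedule toward higher noise levels and applying classifier-free guidance. The shift formula below is the common rectified-flow reparameterization t' = s·t / (1 + (s−1)·t); whether SceneGen uses exactly this form is an assumption.

```python
# Timestep shifting and classifier-free guidance during token sampling (sketch).
import torch

def shifted_timesteps(num_steps, shift=3.0, device="cpu"):
    t = torch.linspace(1.0, 0.0, num_steps + 1, device=device)  # t = 1 is pure noise
    return shift * t / (1.0 + (shift - 1.0) * t)                 # spend more steps at high noise

@torch.no_grad()
def sample_tokens(denoiser, cond, shape, num_steps=50, shift=3.0, guidance=2.0):
    x = torch.randn(shape)                                        # start from Gaussian noise
    ts = shifted_timesteps(num_steps, shift)
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        v_cond = denoiser(x, t, cond)                             # conditional velocity
        v_uncond = denoiser(x, t, None)                           # unconditional velocity
        v = v_uncond + guidance * (v_cond - v_uncond)             # classifier-free guidance
        x = x + (t_next - t) * v                                  # Euler step of the flow ODE
    return x
```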
Sampling multiple token sets and rendering from the same view
Note: The same seed is used for all settings.
Our approach exhibits minor artifacts at regular intervals (every 16th frame), which stem from the tile-based video decoding strategy employed by VideoDCAE. Each chunk of consecutive frames is decoded independently, and overlapping regions are blended to ensure smooth transitions. This blending process can occasionally introduce slight blur.
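A sketch of this chunked decoding with overlap blending; decode_chunk stands in for the VideoDCAE chunk decoder, and the chunk/overlap sizes are illustrative.

```python
# Decode per-frame latents chunk by chunk and blend overlapping frames.
import torch

def decode_video(latents, decode_chunk, chunk=16, overlap=4):
    # latents: (T, C, h, w) per-frame latents; returns (T, 3, H, W) frames.
    T = latents.shape[0]
    out, weight = None, None
    start = 0
    while start < T:
        end = min(start + chunk, T)
        frames = decode_chunk(latents[start:end])          # decode one chunk independently
        if out is None:
            _, C, H, W = frames.shape
            out = torch.zeros(T, C, H, W)
            weight = torch.zeros(T, 1, 1, 1)
        # Ramp up the weight over the overlapping frames so adjacent chunks
        # cross-fade; this blending is what can introduce slight blur.
        w = torch.ones(end - start)
        if start > 0:
            n = min(overlap, end - start)
            w[:n] = torch.linspace(0.0, 1.0, steps=n)
        out[start:end] += w.view(-1, 1, 1, 1) * frames
        weight[start:end] += w.view(-1, 1, 1, 1)
        if end == T:
            break
        start = end - overlap                               # next chunk starts inside the overlap
    return out / weight.clamp(min=1e-8)                     # normalized weighted blend
```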
Due to the strong compression applied in our representation, high-frequency details are not always faithfully preserved. This may lead to minor visual inconsistencies and can also impact downstream generation quality, as fine-grained information is partially absent from the latent space.
Addressing these limitations remains an important direction for future improvements to SceneTok.
@inproceedings{asim26scenetok,
title = {SceneTok: A Compressed, Diffusable Token Space for 3D Scenes},
author = {Asim, Mohammad and Wewer, Christopher and Lenssen, Jan Eric},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition ({CVPR})},
year = {2026},
}