Method Overview. (a) The SceneTok autoencoder encodes view sets into a set of compressed, unstructured scene tokens by chaining a VA-VAE image compressor and a perceiver module. The tokens can be rendered from novel views with a generative decoder based on rectified flows. (b) A latent diffusion transformer performs scene generation by generating compressed scene tokens. Generation can be conditioned on one or a few images and a set of anchor poses that define the spatial extent of the scene.
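Below is a minimal PyTorch-style sketch of this two-stage design. The module names (image_compressor, perceiver, flow_decoder) and method signatures are illustrative placeholders for the VA-VAE compressor, perceiver module, and rectified-flow renderer, not the actual SceneTok interfaces.

```python
# Minimal sketch of the first-stage autoencoder described above.
# All module names and signatures are hypothetical placeholders.
import torch.nn as nn

class SceneTokAutoencoder(nn.Module):
    def __init__(self, image_compressor, perceiver, flow_decoder):
        super().__init__()
        self.image_compressor = image_compressor  # VA-VAE: per-view image latents
        self.perceiver = perceiver                # cross-attends views into a fixed token set
        self.flow_decoder = flow_decoder          # rectified-flow generative renderer

    def encode(self, views, poses):
        # views: (B, V, 3, H, W) input images, poses: camera parameters per view
        latents = self.image_compressor(views)
        return self.perceiver(latents, poses)     # (B, N, D) compressed, unstructured scene tokens

    def render(self, tokens, target_pose, num_steps=25):
        # Sample a rendering from a novel viewpoint by integrating the
        # rectified flow conditioned on the scene tokens and the target pose.
        return self.flow_decoder.sample(tokens, target_pose, num_steps=num_steps)
```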
Qualitative comparisons of SceneTok with baselines including MVSplat, MVSplat360, DepthSplat, and LVSM. Our method achieves superior qualitative results under higher compression and renders smooth trajectories, view-dependent appearance, and small scene dynamics with minimal artifacts.
SceneTok outperforms LVSM and the baselines with explicit representations on most metrics on RealEstate10K, DL3DV-140, and zero-shot ACID. Long-LRM achieves superior performance on DL3DV in terms of PSNR, SSIM, and LPIPS, but requires a much larger representation size and uses higher-resolution images. MVSplat360 and DepthSplat on DL3DV also use higher-resolution images (and thus a larger representation size) as inputs and for rendering. For a fair comparison, we center-crop and resize their renderings to 256×256.
Token Sweep: Progressively unmasking tokens increases information content and reduces overall variance in the output renderings. The decoder then samples (renders) from a much narrower distribution. Top row: RealEstate10K example; bottom row: DL3DV example.
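A short sketch of how such a sweep can be run, assuming the hypothetical SceneTokAutoencoder interface sketched above; the actual masking scheme used by SceneTok may differ.

```python
# Render the same target view while progressively unmasking more scene tokens.
# Keeping the first k tokens stands in for the real masking scheme (assumption).
import torch

@torch.no_grad()
def token_sweep(model, views, poses, target_pose, fractions=(0.125, 0.25, 0.5, 1.0), seed=0):
    tokens = model.encode(views, poses)            # (B, N, D)
    n = tokens.shape[1]
    renders = []
    for f in fractions:
        torch.manual_seed(seed)                    # fix decoder noise so renders are comparable
        k = max(1, int(f * n))
        visible = tokens[:, :k]                    # unmask only a fraction of the tokens
        renders.append(model.render(visible, target_pose))
    return renders                                 # variance across renders shrinks as f -> 1
```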
Depth Decoding on Frozen SceneTok: We finetune our SceneTok decoder to render depth instead of RGB on the frozen token space. We train using pseudo ground truth obtained from Video Depth Anything and observe that the latent space inherently encodes geometric cues without any additional supervision. Surprisingly, SceneTok learns to smooth out the inconsistencies and flickering present in the pseudo-GT. Alongside depth renderings, we also show the uncertainty in the depth predictions and the RGB renderings.
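A minimal sketch of one training step for this setup, assuming the decoder is trained with a standard rectified-flow (flow-matching) objective on depth targets; the exact loss, interpolation convention, and interfaces are assumptions.

```python
# One finetuning step of a depth decoder on the frozen token space,
# supervised by pseudo-GT depth (e.g., from Video Depth Anything).
import torch
import torch.nn.functional as F

def depth_finetune_step(model, depth_decoder, optimizer, views, poses, target_pose, pseudo_depth):
    with torch.no_grad():
        tokens = model.encode(views, poses)                 # token space stays frozen
    # Rectified-flow objective: predict the velocity from data to noise
    # at a random interpolation time t (t = 1 is pure noise here).
    noise = torch.randn_like(pseudo_depth)
    t = torch.rand(pseudo_depth.shape[0], device=pseudo_depth.device)
    t_ = t.view(-1, 1, 1, 1)
    x_t = (1 - t_) * pseudo_depth + t_ * noise
    v_pred = depth_decoder(x_t, t, tokens, target_pose)     # conditioned on tokens + pose
    loss = F.mse_loss(v_pred, noise - pseudo_depth)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```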
Compared to some previous works, such as RayZer, SceneTok generalizes to novel trajectories. Here, we render encoded SceneTok tokens from a camera trajectory taken from a different video, showing that SceneTok performs true NVS and can be considered a genuine scene representation. Rendered views become more uncertain and deteriorate as soon as we move outside the regions observed by the input views, which is expected, since the first-stage autoencoder is not trained for extrapolation. For extrapolation, our second-stage SceneGen can generate full novel scenes in latent space.
Note: We render the source scene with the transfer trajectory taken from another scene.
Here, RayZer fails to follow the novel trajectory and renders the original one instead; LVSM follows the novel trajectory to some degree but degrades very quickly. Meanwhile, SceneTok follows the novel trajectory and produces reasonable output for in-context regions.
We use the True Pose Similarity (TPS) metric, which measures the pose accuracy of the renderings against the ground-truth images (of the transfer scene) by predicting camera poses for both with VGGT and computing accuracy under different error thresholds.
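A sketch of how such a pose-accuracy-under-threshold score could be computed once poses have been predicted for both sets of frames and aligned to a common frame and scale; the thresholds and the exact error definitions used by TPS are assumptions here.

```python
# Fraction of frames whose predicted pose matches the ground-truth pose
# within translation/rotation thresholds (illustrative values).
import numpy as np

def pose_accuracy(pred_poses, gt_poses, trans_thresholds=(0.05, 0.1, 0.2), rot_threshold_deg=5.0):
    # pred_poses, gt_poses: (F, 4, 4) camera matrices, already aligned
    # (e.g., by a similarity transform) to the same frame and scale.
    t_err, r_err = [], []
    for pred, gt in zip(pred_poses, gt_poses):
        t_err.append(np.linalg.norm(pred[:3, 3] - gt[:3, 3]))
        cos = (np.trace(pred[:3, :3].T @ gt[:3, :3]) - 1.0) / 2.0
        r_err.append(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
    t_err, r_err = np.asarray(t_err), np.asarray(r_err)
    return {thr: float(((t_err <= thr) & (r_err <= rot_threshold_deg)).mean())
            for thr in trans_thresholds}
```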
Qualitative comparisons of SceneGen with baselines including DFM, DFoT, and SEVA. We show that generation in the token space is efficient, allowing for faster inference and comparable generation quality. Note that SEVA is trained on a large corpus of closed-source data whereas our method and the other baselines are trained purely on RealEstate10K.
We observe an overall improvement in global structure when shifting the sampled timesteps toward higher noise levels during training and inference. Additionally, in several cases a higher guidance scale also improves overall visual quality, though not consistently.
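A sketch of these two inference-time knobs, written against a hypothetical denoiser interface: shifting the timestep schedule toward higher noise levels and applying classifier-free guidance. The shift formula below is the common rectified-flow reparameterization t' = s·t / (1 + (s−1)·t); whether SceneGen uses exactly this form is an assumption.

```python
# Timestep shifting and classifier-free guidance during token sampling (sketch).
import torch

def shifted_timesteps(num_steps, shift=3.0, device="cpu"):
    t = torch.linspace(1.0, 0.0, num_steps + 1, device=device)  # t = 1 is pure noise
    return shift * t / (1.0 + (shift - 1.0) * t)                 # spend more steps at high noise

@torch.no_grad()
def sample_tokens(denoiser, cond, shape, num_steps=50, shift=3.0, guidance=2.0):
    x = torch.randn(shape)                                        # start from Gaussian noise
    ts = shifted_timesteps(num_steps, shift)
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        v_cond = denoiser(x, t, cond)                             # conditional velocity
        v_uncond = denoiser(x, t, None)                           # unconditional velocity
        v = v_uncond + guidance * (v_cond - v_uncond)             # classifier-free guidance
        x = x + (t_next - t) * v                                  # Euler step of the flow ODE
    return x
```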
Sampling multiple token sets and rendering from the same view
Note: The same seed is used for all settings.
Our approach exhibits minor artifacts at regular intervals (every 16th frame), which stem from the tile-based video decoding strategy employed by VideoDCAE. Each chunk of consecutive frames is decoded independently, and overlapping regions are blended to ensure smooth transitions. This blending process can occasionally introduce slight blur.
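A sketch of this chunked decoding with overlap blending; decode_chunk stands in for the VideoDCAE chunk decoder, and the chunk/overlap sizes are illustrative.

```python
# Decode per-frame latents chunk by chunk and blend overlapping frames.
import torch

def decode_video(latents, decode_chunk, chunk=16, overlap=4):
    # latents: (T, C, h, w) per-frame latents; returns (T, 3, H, W) frames.
    T = latents.shape[0]
    out, weight = None, None
    start = 0
    while start < T:
        end = min(start + chunk, T)
        frames = decode_chunk(latents[start:end])          # decode one chunk independently
        if out is None:
            _, C, H, W = frames.shape
            out = torch.zeros(T, C, H, W)
            weight = torch.zeros(T, 1, 1, 1)
        # Ramp up the weight over the overlapping frames so adjacent chunks
        # cross-fade; this blending is what can introduce slight blur.
        w = torch.ones(end - start)
        if start > 0:
            n = min(overlap, end - start)
            w[:n] = torch.linspace(0.0, 1.0, steps=n)
        out[start:end] += w.view(-1, 1, 1, 1) * frames
        weight[start:end] += w.view(-1, 1, 1, 1)
        if end == T:
            break
        start = end - overlap                               # next chunk starts inside the overlap
    return out / weight.clamp(min=1e-8)                     # normalized weighted blend
```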
Due to the strong compression applied in our representation, high-frequency details are not always faithfully preserved. This may lead to minor visual inconsistencies and can also impact downstream generation quality, as fine-grained information is partially absent from the latent space.
Addressing these limitations remains an important direction for future improvements to SceneTok.
@inproceedings{asim26scenetok,
title = {SceneTok: A Compressed, Diffusable Token Space for 3D Scenes},
author = {Asim, Mohammad and Wewer, Christopher and Lenssen, Jan Eric},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition ({CVPR})},
year = {2026},
}