Spatial Reasoning with Denoising Models

Abstract

We introduce Spatial Reasoning Models (SRMs), a framework for reasoning over sets of continuous variables with denoising generative models. SRMs infer continuous representations of a set of unobserved variables, given observations of the observed variables. Current generative models on spatial domains, such as diffusion and flow matching models, often collapse to hallucination when the target distribution is complex. To measure this, we introduce a set of benchmark tasks that test the quality of complex reasoning in generative models and quantify hallucination. The SRM framework allows us to report key findings about the importance of sequentialization in generation, the associated order, and the sampling strategies used during training. It demonstrates, for the first time, that the order of generation can be predicted successfully by the denoising network itself. Using these findings, we increase the accuracy of specific reasoning tasks from <1% to >50%.

Motivation

Method

General Framework

We define reasoning over a set of continuous random variables as sampling, where x_i^{t_i} denotes the variable x_i at its own individual noise level t_i during the denoising process and t_i ≤ t_i'. Choosing the noise levels t_i gives explicit control over the amount and order of sequentialization.

Amount of Sequentialization

The SRM framework supports different amounts of sequentialization, i.e., parallel generation, autoregressive generation, or mixtures with varying overlap, modeled by assigning different noise levels to individual variables at the same time.
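As an illustration of how per-variable noise levels can encode the amount of sequentialization, the following is a minimal sketch and not the authors' implementation. The function name `noise_levels`, the `overlap` parameter, the linear denoising window, and the convention that t = 1 means pure noise and t = 0 means clean are all assumptions made for this example.

```python
import numpy as np


def noise_levels(progress, ranks, overlap):
    """Per-variable noise levels at a given global sampling progress.

    Hypothetical helper (not from the paper).
    progress: float in [0, 1], 0 = start of sampling, 1 = end.
    ranks:    position of each variable in the generation order (0 = first).
    overlap:  1.0 = fully parallel, 0.0 = strictly one variable after another.
    Assumed convention: t = 1 is pure noise, t = 0 is fully denoised.
    """
    ranks = np.asarray(ranks, dtype=float)
    n = ranks.size
    # Each variable is denoised linearly inside its own window of this width.
    width = 1.0 / (1.0 + (n - 1) * (1.0 - overlap))
    # Windows are staggered by rank; with overlap = 1 they all start at 0.
    starts = ranks * (1.0 - overlap) * width
    return 1.0 - np.clip((progress - starts) / width, 0.0, 1.0)
```

With `overlap=1.0` every variable shares the same noise level, which corresponds to parallel generation. With `overlap=0.0` the first variable is fully denoised before the second one starts, which corresponds to autoregressive generation. Values in between give mixtures in which several variables are partially noisy at once.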

Order of Sequentialization

Further, it provides several options for defining the order of sequentialization, including random ordering, a greedy heuristic based on predicted uncertainty, and manually defined graphs.
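The three ordering options can be sketched as small helpers. This is a hedged illustration only: the function names are hypothetical, the uncertainty values are assumed to come from the denoising network as one scalar per variable, and picking the least uncertain variable first is an assumption about the direction of the greedy heuristic.

```python
from graphlib import TopologicalSorter

import numpy as np


def random_order(n, rng):
    """Random permutation of the n variable indices."""
    return [int(i) for i in rng.permutation(n)]


def greedy_next(uncertainty, done):
    """Index of the not-yet-generated variable with the lowest uncertainty.

    uncertainty: one predicted scalar per variable (assumed model output).
    done:        boolean mask of variables that are already generated.
    """
    masked = np.where(done, np.inf, uncertainty)
    return int(np.argmin(masked))


def graph_order(predecessors):
    """Order from a manually defined dependency graph.

    predecessors maps each variable to the set of variables that must be
    generated before it.
    """
    return list(TopologicalSorter(predecessors).static_order())
```

The greedy variant is called once per step, because the predicted uncertainties change as more variables become clean, whereas the random and graph-based orders can be fixed before sampling starts.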

Noise Levels During Training

We train a denoising model with individual noise levels t_i for spatial variables such as image patches. To ensure that the model sees all combinations of noise levels that are important for sampling, we propose a novel two-stage noise level sampling scheme for training that guarantees that the mean t̄ over all variables is uniformly distributed.
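One way to obtain a uniformly distributed mean is sketched below. It only illustrates the two-stage idea: the first stage draws the mean t̄ uniformly, and the second stage draws per-variable levels whose average equals t̄. The rescaling used in the second stage is an assumption chosen for simplicity and is not necessarily the distribution used by the authors. The function name `sample_noise_levels` is hypothetical.

```python
import numpy as np


def sample_noise_levels(n, rng):
    """Two-stage sampling of n noise levels in [0, 1] (illustrative sketch).

    Stage 1: draw the target mean uniformly, so the mean is uniform by
             construction.
    Stage 2: draw raw levels and rescale them so their mean hits the target
             while every level stays inside [0, 1].
    """
    t_mean = rng.uniform(0.0, 1.0)        # stage 1: uniform mean
    u = rng.uniform(0.0, 1.0, size=n)     # stage 2: raw per-variable levels
    u_mean = u.mean()
    if u_mean > t_mean:
        # Shrink toward 0; the scaling factor is below 1, so t stays in [0, 1].
        t = u * (t_mean / u_mean)
    else:
        # Shrink toward 1 by scaling the distances to 1 instead.
        t = 1.0 - (1.0 - u) * ((1.0 - t_mean) / (1.0 - u_mean))
    return t, t_mean
```

Sampling each t_i independently and uniformly would instead concentrate the mean around 0.5 as the number of variables grows, so states where almost all variables are clean or almost all are noisy would rarely be seen during training. Drawing the mean first avoids this.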

Evaluation

Acknowledgments

This project was partially funded by the Saarland/Intel Joint Program on the Future of Graphics and Media. We thank Thomas Wimmer for proofreading and helpful discussions.

BibTeX

@inproceedings{wewer25srm,
      title     = {Spatial Reasoning with Denoising Models},
      author    = {Wewer, Christopher and Pogodzinski, Bartlomiej and Schiele, Bernt and Lenssen, Jan Eric},
      booktitle = {International Conference on Machine Learning ({ICML})},
      year      = {2025},
}
