(1.) Spatial Reasoning Models (SRMs) iteratively solve visual Sudoku (2.) via denoising with individual noise schedules, (3.) in predicted order based on uncertainty. (4.) Over the reasoning process, uncertainty in missing variables decreases.
We introduce Spatial Reasoning Models (SRMs), a framework for reasoning over sets of continuous variables via denoising generative models. SRMs infer continuous representations of a set of unobserved variables, given observations of the observed variables. Current generative models on spatial domains, such as diffusion and flow matching models, often collapse to hallucination when the underlying distribution is complex. To measure this, we introduce a set of benchmark tasks that test the quality of complex reasoning in generative models and can quantify hallucination. The SRM framework lets us report key findings about the importance of sequentialization in generation, the associated order, and the sampling strategies used during training. It demonstrates, for the first time, that the order of generation can be successfully predicted by the denoising network itself. Using these findings, we can increase the accuracy on specific reasoning tasks from <1% to >50%.
We define reasoning over a set of continuous random variables as sampling, where x_i^{t_i} denotes the variable x_i at its own individual noise level t_i during the denoising process, and t_i ≤ t_i'. Choosing the noise levels t_i gives explicit control over the amount and order of sequentialization.
The SRM framework supports different amounts of sequentialization, i.e., parallel generation, autoregressive generation, or mixtures with varying overlap, modeled by assigning different noise levels to individual variables at the same time.
Further, it provides different options for defining the order of sequentialization: random ordering, a greedy heuristic based on predicted uncertainty, and manually defined graphs.
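The two controls described above, amount and order of sequentialization, can be illustrated with a minimal Python sketch. It is not taken from the SRM code base: the function names `greedy_order` and `noise_levels`, the `overlap` parameter, and the linear windowing are all assumptions made for illustration. It also assumes that the greedy heuristic generates the least uncertain variable first, and that t = 1 means pure noise while t = 0 means clean.

```python
from typing import List, Sequence


def greedy_order(uncertainties: Sequence[float]) -> List[int]:
    """Return variable indices sorted from least to most uncertain.

    Assumption: the most certain variable is generated first.
    """
    return sorted(range(len(uncertainties)), key=lambda i: uncertainties[i])


def noise_levels(progress: float, order: Sequence[int], overlap: float) -> List[float]:
    """Per-variable noise levels t_i in [0, 1] (1 = pure noise, 0 = clean).

    progress: global sampling progress in [0, 1].
    order:    variable indices in generation order.
    overlap:  1.0 = fully parallel, 0.0 = strictly autoregressive.
    """
    n = len(order)
    # Each variable is denoised inside a window of this width. Consecutive
    # windows are shifted by (1 - overlap) * width, so the last window ends at 1.
    width = 1.0 / (1.0 + (n - 1) * (1.0 - overlap))
    levels = [1.0] * n
    for rank, var in enumerate(order):
        start = rank * (1.0 - overlap) * width
        # Fraction of this variable's window that has already elapsed.
        done = min(max((progress - start) / width, 0.0), 1.0)
        levels[var] = 1.0 - done
    return levels


if __name__ == "__main__":
    order = greedy_order([0.9, 0.1, 0.5])         # hypothetical uncertainties
    print(order)
    print(noise_levels(0.5, order, overlap=0.0))  # autoregressive
    print(noise_levels(0.5, order, overlap=1.0))  # parallel
```

With `overlap=1.0` every variable shares the same noise level (parallel generation); with `overlap=0.0` one variable is fully denoised before the next one starts (autoregressive generation); values in between give partially overlapping schedules. A random or manually defined order can be passed in place of `greedy_order`.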
We train a denoising model with individual noise levels t_i for spatial variables such as image patches. To ensure that the model sees all combinations of noise levels that matter for sampling, we propose a novel two-stage noise level sampling scheme for training that guarantees that the mean t̄ over all variables is uniformly distributed.
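One way to obtain per-variable noise levels whose mean is exactly uniform is sketched below. This is an illustrative construction under stated assumptions, not the scheme from the paper: the first stage draws the mean t̄ uniformly, and the second stage (an assumption made here) draws raw levels and rescales them toward 0 or toward 1 so that their mean equals t̄ while every level stays in [0, 1]. The name `sample_noise_levels` is hypothetical.

```python
import random
from typing import List, Optional, Tuple


def sample_noise_levels(
    n: int, rng: Optional[random.Random] = None
) -> Tuple[float, List[float]]:
    """Return (t_mean, levels) with mean(levels) == t_mean and levels in [0, 1]."""
    rng = rng or random.Random()
    # Stage 1: draw the mean noise level uniformly in [0, 1).
    t_mean = rng.random()
    # Stage 2 (illustrative assumption): draw raw levels, then rescale them
    # so that their mean matches t_mean without leaving [0, 1].
    raw = [rng.random() for _ in range(n)]
    m = sum(raw) / n
    eps = 1e-12  # guards against division by zero in degenerate draws
    if t_mean <= m:
        # Shrink toward 0: scale <= 1, so values stay in [0, 1].
        scale = t_mean / max(m, eps)
        levels = [u * scale for u in raw]
    else:
        # Shrink toward 1: scale < 1, so values stay in [0, 1].
        scale = (1.0 - t_mean) / max(1.0 - m, eps)
        levels = [1.0 - (1.0 - u) * scale for u in raw]
    return t_mean, levels


if __name__ == "__main__":
    t_mean, levels = sample_noise_levels(9, random.Random(0))
    print(t_mean, sum(levels) / len(levels))
```

Sampling each t_i independently and uniformly would instead concentrate the mean around 0.5 as the number of variables grows, so states in which almost all variables are noisy or almost all are clean would rarely be seen during training. Fixing the mean first avoids that concentration.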
We visualize the sampling process with the current sample x_t, the noise level t, the estimated uncertainty σ_θ(x_t) (darker = lower), and the result x̂_0 of a single step to t = 0. SRMs are able to reason over spatial variables by capturing complex dependencies.
Given an incomplete observation, SRMs sample different correct solutions in the case of uncertainty.
Besides the MNIST Sudoku dataset, we provide further benchmarks for visual reasoning in the form of the Even Pixels and Counting Polygons FFHQ datasets. Check out our paper for more details and our code for training your own SRMs for these tasks!
@inproceedings{wewer25srm,
title = {Spatial Reasoning with Denoising Models},
author = {Wewer, Christopher and Pogodzinski, Bartlomiej and Schiele, Bernt and Lenssen, Jan Eric},
booktitle = {arXiv},
year = {2025},
}