2025-01-22
arXiv

Accelerate High-Quality Diffusion Models with Inner Loop Feedback

Matthew Gwilliam , Han Cai , Di Wu , Abhinav Shrivastava , Zhiyu Cheng
The paper introduces Inner Loop Feedback (ILF), a method that speeds up diffusion model inference by using a lightweight module to predict future features in the denoising process. This approach reduces runtime while maintaining high-quality results, and it is effective for both class-to-image and text-to-image generation. The performance of ILF is validated with FID, CLIP score, and qualitative comparisons.
We propose Inner Loop Feedback (ILF), a novel approach to accelerate diffusion models' inference. ILF trains a lightweight module to predict future features in the denoising process by leveraging the outputs from a chosen diffusion backbone block at a given time step. This approach exploits two key intuitions: (1) the outputs of a given block at adjacent time steps are similar, and (2) performing partial computations for a step imposes a lower burden on the model than skipping the step entirely. Our method is highly flexible, since we find that the feedback module itself can simply be a block from the diffusion backbone, with all settings copied. Its influence on the diffusion forward pass can be tempered by a learnable scaling factor initialized to zero. We train this module using distillation losses; however, unlike some prior work where a full diffusion backbone serves as the student, our model freezes the backbone and trains only the feedback module. While many efforts to optimize diffusion models focus on achieving acceptable image quality in extremely few steps (1-4 steps), our emphasis is on matching best-case results (typically achieved in 20 steps) while significantly reducing runtime. ILF achieves this balance effectively, demonstrating strong performance for both class-to-image generation with diffusion transformers (DiT) and text-to-image generation with the DiT-based PixArt-alpha and PixArt-sigma. The quality of ILF's 1.7x-1.8x speedups is confirmed by FID, CLIP score, CLIP Image Quality Assessment, ImageReward, and qualitative comparisons. Project information is available at https://mgwillia.github.io/ilf.
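To make the mechanism concrete, here is a minimal PyTorch-style sketch of the ideas the abstract describes: a feedback module built by copying a backbone block, a zero-initialized learnable scale that tempers its contribution, and a distillation-style loss trained with the backbone frozen. All names (FeedbackModule, feat_t, the MSE teacher target) are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the Inner Loop Feedback idea, assuming a residual update
# gated by a zero-initialized learnable scale. Illustrative only.
import copy
import torch
import torch.nn as nn

class FeedbackModule(nn.Module):
    """Predicts features at a future denoising step from the current block output."""

    def __init__(self, backbone_block: nn.Module):
        super().__init__()
        # The abstract notes the feedback module can simply be a copy of a
        # backbone block, with all settings copied.
        self.block = copy.deepcopy(backbone_block)
        # Learnable scaling factor, zero-initialized so the module starts as a no-op.
        self.scale = nn.Parameter(torch.zeros(1))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Residual update tempered by the learned scale.
        return feat + self.scale * self.block(feat)

backbone_block = nn.Linear(64, 64)           # stand-in for a diffusion transformer block
feedback = FeedbackModule(backbone_block)    # copies the block's settings and weights

# Per the abstract, the backbone stays frozen; only the feedback module is trained.
for p in backbone_block.parameters():
    p.requires_grad_(False)

feat_t = torch.randn(2, 64)                  # block output at the current time step
feat_next_pred = feedback(feat_t)            # predicted features for a future step

# Distillation-style objective against features from a full (teacher) denoising pass.
teacher_feat_next = torch.randn(2, 64)       # placeholder for teacher features
loss = nn.functional.mse_loss(feat_next_pred, teacher_feat_next)
loss.backward()
```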
2025-01-21
arXiv

GPS as a Control Signal for Image Generation

Chao Feng , Ziyang Chen , Aleksander Holynski , Alexei A. Efros , Andrew Owens
The paper demonstrates that GPS tags in photo metadata can serve as a control signal for image generation, allowing models to generate images that reflect the unique characteristics of specific locations. The model, trained on both GPS and text, captures the distinct appearance of different areas within a city. Additionally, GPS conditioning enhances the accuracy of 3D structure reconstruction.
We show that the GPS tags contained in photo metadata provide a useful control signal for image generation. We train GPS-to-image models and use them for tasks that require a fine-grained understanding of how images vary within a city. In particular, we train a diffusion model to generate images conditioned on both GPS and text. The learned model generates images that capture the distinctive appearance of different neighborhoods, parks, and landmarks. We also extract 3D models from 2D GPS-to-image models through score distillation sampling, using GPS conditioning to constrain the appearance of the reconstruction from each viewpoint. Our evaluations suggest that our GPS-conditioned models successfully learn to generate images that vary based on location, and that GPS conditioning improves estimated 3D structure.
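As a rough illustration of GPS conditioning, the sketch below encodes (latitude, longitude) into an embedding and concatenates it with a text embedding to form the conditioning signal for a diffusion model. The Fourier-feature encoding, embedding sizes, and module names (GPSEncoder, cond) are assumptions for illustration; the paper's actual conditioning scheme may differ.

```python
# Sketch of turning GPS coordinates into a conditioning embedding alongside text.
# The encoding choices here are assumptions, not the paper's implementation.
import math
import torch
import torch.nn as nn

class GPSEncoder(nn.Module):
    """Maps (latitude, longitude) to an embedding via Fourier features + MLP."""

    def __init__(self, num_freqs: int = 16, embed_dim: int = 256):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs) * math.pi)
        self.mlp = nn.Sequential(
            nn.Linear(2 * 2 * num_freqs, embed_dim),
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, latlon: torch.Tensor) -> torch.Tensor:
        # latlon: (batch, 2) with latitude/longitude normalized to [-1, 1].
        angles = latlon.unsqueeze(-1) * self.freqs               # (batch, 2, num_freqs)
        feats = torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)
        return self.mlp(feats)

# Toy usage: combine the GPS embedding with a (placeholder) text embedding to form
# the conditioning vector that a GPS-and-text-conditioned diffusion model would consume.
gps_encoder = GPSEncoder()
latlon = torch.tensor([[0.42, -0.13]])                  # normalized GPS coordinates
text_emb = torch.randn(1, 256)                          # stand-in for a text-encoder output
cond = torch.cat([gps_encoder(latlon), text_emb], dim=-1)   # (1, 512) conditioning vector
```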