2025-01-23
arXiv

One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt

Tao Liu , Kai Wang , Senmao Li , Joost van de Weijer , Fahad Shahbaz Khan
This paper introduces a training-free method, One-Prompt-One-Story, for consistent text-to-image generation that maintains character identity using a single prompt. The method concatenates all prompts into one input and refines the process with Singular-Value Reweighting and Identity-Preserving Cross-Attention. Experiments show its effectiveness compared to existing approaches.
Text-to-image generation models can create high-quality images from input prompts. However, they struggle to consistently preserve character identity across the frames of a story. Existing approaches to this problem typically require extensive training on large datasets or additional modifications to the original model architectures. This limits their applicability across different domains and diverse diffusion model configurations. In this paper, we first observe the inherent capability of language models, coined context consistency, to comprehend identity through context with a single prompt. Drawing inspiration from this inherent context consistency, we propose a novel training-free method for consistent text-to-image (T2I) generation, termed "One-Prompt-One-Story" (1Prompt1Story). Our approach 1Prompt1Story concatenates all prompts into a single input for T2I diffusion models, initially preserving character identities. We then refine the generation process using two novel techniques: Singular-Value Reweighting and Identity-Preserving Cross-Attention, ensuring better alignment with the input description for each frame. In our experiments, we compare our method against various existing consistent T2I generation approaches to demonstrate its effectiveness through quantitative metrics and qualitative assessments. Code is available at https://github.com/byliutao/1Prompt1Story.
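The Singular-Value Reweighting idea can be illustrated with a toy sketch: decompose a sub-prompt's token-embedding matrix with SVD and rescale its singular values to amplify or suppress that frame's semantic contribution. The function name and the uniform rescaling rule here are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def singular_value_reweight(embeddings, scale=1.5):
    """Illustrative sketch: rescale the singular values of a token-embedding
    matrix to strengthen (scale > 1) or weaken (scale < 1) the semantics of
    one frame's sub-prompt. Assumed form, not the released 1Prompt1Story code."""
    U, S, Vt = np.linalg.svd(embeddings, full_matrices=False)
    return U @ np.diag(S * scale) @ Vt

# Toy token-embedding matrix (seq_len=4, dim=8) standing in for one sub-prompt.
rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 8))
boosted = singular_value_reweight(emb, scale=2.0)
# Doubling every singular value doubles the Frobenius norm of the matrix.
print(np.allclose(np.linalg.norm(boosted), 2 * np.linalg.norm(emb)))
```

The singular values act as per-direction "volume knobs" on the embedding, which is why rescaling them can emphasise one frame's description while the concatenated prompt keeps identity shared across frames.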
2025-01-22
arXiv

Robust Representation Consistency Model via Contrastive Denoising

Jiachen Lei , Julius Berner , Jiongxiao Wang , Zhongzhu Chen , Zhongjia Ba
The paper introduces a new method for robust representation consistency via contrastive denoising, which improves the robustness of deep neural networks against adversarial perturbations and reduces computational overhead during inference. The method reformulates the generative modeling task as a discriminative task in the latent space, enabling implicit denoising-then-classification with a single prediction, and achieves state-of-the-art performance on various datasets.
Robustness is essential for deep neural networks, especially in security-sensitive applications. To this end, randomized smoothing provides theoretical guarantees for certifying robustness against adversarial perturbations. Recently, diffusion models have been successfully employed for randomized smoothing to purify noise-perturbed samples before making predictions with a standard classifier. While these methods excel at small perturbation radii, they struggle with larger perturbations and incur a significant computational overhead during inference compared to classical methods. To address this, we reformulate the generative modeling task along the diffusion trajectories in pixel space as a discriminative task in the latent space. Specifically, we use instance discrimination to achieve consistent representations along the trajectories by aligning temporally adjacent points. After fine-tuning based on the learned representations, our model enables implicit denoising-then-classification via a single prediction, substantially reducing inference costs. We conduct extensive experiments on various datasets and achieve state-of-the-art performance with minimal computation budget during inference. For example, our method outperforms the certified accuracy of diffusion-based methods on ImageNet across all perturbation radii by 5.3% on average, with up to 11.6% at larger radii, while reducing inference costs by 85× on average. Code is available at: https://github.com/jiachenlei/rRCM.
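The alignment of temporally adjacent trajectory points via instance discrimination can be sketched with a standard InfoNCE-style loss, where matching rows of two embedding batches are positive pairs. This is an assumed form of the objective for illustration, not the paper's exact loss.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Sketch of an instance-discrimination loss between representations of
    temporally adjacent points on a diffusion trajectory (assumed InfoNCE
    form). Rows are L2-normalised embeddings; matching rows are positives."""
    logits = (z_a @ z_b.T) / temperature                  # pairwise similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()                     # -log p(positive)

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 32))
z /= np.linalg.norm(z, axis=1, keepdims=True)
# A slightly perturbed copy stands in for the next point on the trajectory.
z_next = z + 0.01 * rng.standard_normal(z.shape)
z_next /= np.linalg.norm(z_next, axis=1, keepdims=True)
loss_aligned = info_nce(z, z_next)                 # adjacent points: low loss
loss_mismatch = info_nce(z, np.roll(z, 1, axis=0)) # shuffled pairs: high loss
print(loss_aligned < loss_mismatch)
```

Pulling adjacent points together along the whole trajectory is what lets a single encoder prediction stand in for the explicit denoise-then-classify pipeline at inference time.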
2025-01-22
arXiv

Accelerate High-Quality Diffusion Models with Inner Loop Feedback

Matthew Gwilliam , Han Cai , Di Wu , Abhinav Shrivastava , Zhiyu Cheng
The paper introduces Inner Loop Feedback (ILF), a method to speed up diffusion models' inference by predicting future features using a lightweight module. This approach reduces runtime while maintaining high-quality results, and it is effective for both class-to-image and text-to-image generation. The performance of ILF is validated through various metrics including FID, CLIP score, and qualitative comparisons.
We propose Inner Loop Feedback (ILF), a novel approach to accelerate diffusion models' inference. ILF trains a lightweight module to predict future features in the denoising process by leveraging the outputs from a chosen diffusion backbone block at a given time step. This approach exploits two key intuitions: (1) the outputs of a given block at adjacent time steps are similar, and (2) performing partial computations for a step imposes a lower burden on the model than skipping the step entirely. Our method is highly flexible, since we find that the feedback module itself can simply be a block from the diffusion backbone, with all settings copied. Its influence on the diffusion forward pass can be tempered with a learnable scaling factor initialized at zero. We train this module using distillation losses; however, unlike some prior work where a full diffusion backbone serves as the student, our model freezes the backbone, training only the feedback module. While many efforts to optimize diffusion models focus on achieving acceptable image quality in extremely few steps (1-4 steps), our emphasis is on matching best-case results (typically achieved in 20 steps) while significantly reducing runtime. ILF achieves this balance effectively, demonstrating strong performance for both class-to-image generation with a diffusion transformer (DiT) and text-to-image generation with the DiT-based PixArt-alpha and PixArt-sigma. The quality of ILF's 1.7x-1.8x speedups is confirmed by FID, CLIP score, CLIP Image Quality Assessment, ImageReward, and qualitative comparisons. Project information is available at https://mgwillia.github.io/ilf.
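The zero-initialised learnable scale described above can be sketched as a residual wrapper around a copied backbone block: at initialization the scale is 0, so the diffusion forward pass is exactly the frozen backbone's, and training gradually lets the feedback in. The class name and plain-Python structure are assumptions for illustration, not the released ILF code.

```python
import numpy as np

class InnerLoopFeedback:
    """Sketch of ILF's zero-initialised residual feedback (assumed structure):
    a copied block whose output is scaled by a learnable factor starting at 0,
    so the model initially behaves exactly like the frozen backbone."""
    def __init__(self, block_fn):
        self.block_fn = block_fn   # e.g. a copy of one backbone block
        self.scale = 0.0           # learnable scalar; zero initialization

    def __call__(self, h):
        return h + self.scale * self.block_fn(h)

ilf = InnerLoopFeedback(lambda h: 2.0 * h)   # stand-in for a DiT block
h = np.ones(4)
print(np.allclose(ilf(h), h))   # True at init: scale=0 leaves features unchanged
```

Zero initialization is a common trick (also used in adapters such as ControlNet) to make fine-tuning start from the pretrained model's behaviour rather than perturbing it from step one.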
2025-01-21
arXiv

GPS as a Control Signal for Image Generation

Chao Feng , Ziyang Chen , Aleksander Holynski , Alexei A. Efros , Andrew Owens
The paper demonstrates that GPS tags in photo metadata can serve as a control signal for image generation, allowing models to generate images that reflect the unique characteristics of specific locations. The model, trained on both GPS and text, captures the distinct appearance of different areas within a city. Additionally, GPS conditioning enhances the accuracy of 3D structure reconstruction.
We show that the GPS tags contained in photo metadata provide a useful control signal for image generation. We train GPS-to-image models and use them for tasks that require a fine-grained understanding of how images vary within a city. In particular, we train a diffusion model to generate images conditioned on both GPS and text. The learned model generates images that capture the distinctive appearance of different neighborhoods, parks, and landmarks. We also extract 3D models from 2D GPS-to-image models through score distillation sampling, using GPS conditioning to constrain the appearance of the reconstruction from each viewpoint. Our evaluations suggest that our GPS-conditioned models successfully learn to generate images that vary based on location, and that GPS conditioning improves estimated 3D structure.
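One simple way to turn a GPS tag into a conditioning signal, sketched below, is to map normalized latitude/longitude to sinusoidal Fourier features, a common encoding for feeding low-dimensional coordinates to a diffusion model. The encoding here is an assumption for illustration, not necessarily the paper's.

```python
import numpy as np

def gps_fourier_embedding(lat, lon, num_freqs=4):
    """Sketch of a GPS conditioning vector (assumed encoding): normalise
    (lat, lon) and expand each coordinate into sin/cos features at
    geometrically spaced frequencies."""
    coords = np.array([lat / 90.0, lon / 180.0])       # normalise to [-1, 1]
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    angles = coords[:, None] * freqs[None, :]          # shape (2, num_freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)]).ravel()

emb = gps_fourier_embedding(40.7580, -73.9855)         # a Manhattan coordinate
print(emb.shape)  # (16,)
```

Such an embedding could then be concatenated with (or cross-attended alongside) the text embedding, letting the model distinguish nearby neighborhoods that share similar text captions.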
2025-01-21
arXiv

Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

Hao Zhang , Zibo Zhao , Zeqiang Lai , Qingxiang Lin , Yunfei Zhao
Hunyuan3D 2.0 is an advanced system for generating high-resolution textured 3D assets, consisting of a shape generation model and a texture synthesis model. It outperforms previous state-of-the-art models in geometry details, condition alignment, and texture quality. The system also includes a user-friendly production platform, Hunyuan3D-Studio, for efficient 3D asset creation.
We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. This system includes two foundation components: a large-scale shape generation model -- Hunyuan3D-DiT, and a large-scale texture synthesis model -- Hunyuan3D-Paint. The shape generative model, built on a scalable flow-based diffusion transformer, aims to create geometry that properly aligns with a given condition image, laying a solid foundation for downstream applications. The texture synthesis model, benefiting from strong geometric and diffusion priors, produces high-resolution and vibrant texture maps for either generated or hand-crafted meshes. Furthermore, we build Hunyuan3D-Studio -- a versatile, user-friendly production platform that simplifies the re-creation process of 3D assets. It allows both professional and amateur users to manipulate or even animate their meshes efficiently. We systematically evaluate our models, showing that Hunyuan3D 2.0 outperforms previous state-of-the-art models, both open-source and closed-source, in geometry details, condition alignment, texture quality, and more. Hunyuan3D 2.0 is publicly released in order to fill the gaps in the open-source 3D community for large-scale foundation generative models. The code and pre-trained weights of our models are available at: https://github.com/Tencent/Hunyuan3D-2
2025-01-21
arXiv

TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space

Daniel Garibi , Shahar Yadin , Roni Paiss , Omer Tov , Shiran Zada
TokenVerse is a method for multi-concept personalization using a pre-trained text-to-image diffusion model, capable of disentangling and combining complex visual elements from multiple images. It leverages the semantic modulation space to enable localized control over various concepts, including objects, accessories, materials, pose, and lighting. The effectiveness of TokenVerse is demonstrated in challenging personalization settings, outperforming existing methods.
We present TokenVerse -- a method for multi-concept personalization, leveraging a pre-trained text-to-image diffusion model. Our framework can disentangle complex visual elements and attributes from as little as a single image, while enabling seamless plug-and-play generation of combinations of concepts extracted from multiple images. As opposed to existing works, TokenVerse can handle multiple images with multiple concepts each, and supports a wide range of concepts, including objects, accessories, materials, pose, and lighting. Our work exploits a DiT-based text-to-image model, in which the input text affects the generation through both attention and modulation (shift and scale). We observe that the modulation space is semantic and enables localized control over complex concepts. Building on this insight, we devise an optimization-based framework that takes as input an image and a text description, and finds for each word a distinct direction in the modulation space. These directions can then be used to generate new images that combine the learned concepts in a desired configuration. We demonstrate the effectiveness of TokenVerse in challenging personalization settings, and showcase its advantages over existing methods. Project webpage: https://token-verse.github.io/
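Steering generation through the modulation (shift and scale) parameters can be sketched as follows: a learned per-word direction is split into a shift component and a scale component and added to the neutral modulation before the usual AdaLN-style feature transform. The function and the additive rule are assumptions for illustration, not TokenVerse's implementation.

```python
import numpy as np

def apply_concept_direction(shift, scale, direction, strength=1.0):
    """Sketch of steering a DiT-style modulation with a learned per-word
    direction (assumed form): the direction's two halves are added to the
    shift and scale parameters, biasing generation toward the concept."""
    d_shift, d_scale = np.split(direction, 2)
    return shift + strength * d_shift, scale + strength * d_scale

dim = 8
x = np.ones(dim)                              # a feature vector in the DiT
shift, scale = np.zeros(dim), np.zeros(dim)   # neutral modulation
direction = np.full(2 * dim, 0.1)             # stand-in for a learned direction
new_shift, new_scale = apply_concept_direction(shift, scale, direction)
y = x * (1.0 + new_scale) + new_shift         # AdaLN-style modulation
print(np.allclose(y, 1.2))                    # 1 * 1.1 + 0.1 = 1.2 per element
```

Because modulation is applied per layer and per token stream, adding directions there gives the localized, composable control over individual concepts that the abstract describes.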