Latest Research Papers
2025-01-23
arXiv
One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt
This paper introduces a training-free method, One-Prompt-One-Story, for consistent text-to-image generation that maintains character identity using a single prompt. The method concatenates all prompts into one input and refines the process with Singular-Value Reweighting and Identity-Preserving Cross-Attention. Experiments show its effectiveness compared to existing approaches.
Text-to-image generation models can create high-quality images from input
prompts. However, they struggle to consistently preserve character identity across the multiple images required for storytelling. Existing approaches to this problem typically require extensive training on large datasets or additional
modifications to the original model architectures. This limits their
applicability across different domains and diverse diffusion model
configurations. In this paper, we first observe an inherent capability of language models, which we term context consistency, to comprehend identity through context within a single prompt. Drawing inspiration from this context
consistency, we propose a novel training-free method for consistent
text-to-image (T2I) generation, termed "One-Prompt-One-Story" (1Prompt1Story).
Our approach 1Prompt1Story concatenates all prompts into a single input for T2I
diffusion models, initially preserving character identities. We then refine the
generation process using two novel techniques: Singular-Value Reweighting and
Identity-Preserving Cross-Attention, ensuring better alignment with the input
description for each frame. In our experiments, we compare our method against
various existing consistent T2I generation approaches to demonstrate its
effectiveness through quantitative metrics and qualitative assessments. Code is
available at https://github.com/byliutao/1Prompt1Story.
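To make the idea concrete, below is a minimal PyTorch sketch of the two ingredients the abstract names: a single concatenated prompt that provides one shared context, and a singular-value reweighting step that emphasizes the active frame's tokens while attenuating the others. The toy encode stub, the embedding sizes, and the reweighting exponents are illustrative assumptions, not the paper's exact recipe (see the linked repository for that).

```python
import torch

def encode(text: str, dim: int = 64) -> torch.Tensor:
    """Stand-in for a T2I text encoder: one pseudo-random row per word."""
    g = torch.Generator().manual_seed(abs(hash(text)) % (2**31))
    return torch.randn(len(text.split()), dim, generator=g)

def sv_reweight(tokens: torch.Tensor, power: float) -> torch.Tensor:
    """Rescale singular values: power > 1 amplifies a span, power < 1 attenuates it."""
    U, S, Vh = torch.linalg.svd(tokens, full_matrices=False)
    return U @ torch.diag(S.pow(power)) @ Vh

identity = "a watercolor fox with a blue scarf"
frames = ["walks through a snowy forest",
          "naps under a maple tree",
          "watches fireflies at night"]

# One concatenated prompt -> one shared text context -> one preserved identity.
spans = [identity] + frames
embeddings = [encode(s) for s in spans]

# For frame i: keep the identity tokens, express frame i, suppress the other frames.
for i in range(len(frames)):
    parts = [embeddings[0]]                       # identity tokens, left untouched
    for j, frame_emb in enumerate(embeddings[1:]):
        parts.append(sv_reweight(frame_emb, 1.5 if j == i else 0.3))
    frame_context = torch.cat(parts, dim=0)       # would be fed to the T2I diffusion model
    print(f"frame {i}: context shape {tuple(frame_context.shape)}")
```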
2025-01-22
arXiv
Robust Representation Consistency Model via Contrastive Denoising
The paper introduces a new method for robust representation consistency via contrastive denoising, which improves the robustness of deep neural networks against adversarial perturbations and reduces computational overhead during inference. The method reformulates the generative modeling task as a discriminative task in the latent space, enabling implicit denoising-then-classification with a single prediction, and achieves state-of-the-art performance on various datasets.
Robustness is essential for deep neural networks, especially in
security-sensitive applications. To this end, randomized smoothing provides
theoretical guarantees for certifying robustness against adversarial
perturbations. Recently, diffusion models have been successfully employed for
randomized smoothing to purify noise-perturbed samples before making
predictions with a standard classifier. While these methods excel at small
perturbation radii, they struggle with larger perturbations and incur a
significant computational overhead during inference compared to classical
methods. To address this, we reformulate the generative modeling task along the
diffusion trajectories in pixel space as a discriminative task in the latent
space. Specifically, we use instance discrimination to achieve consistent
representations along the trajectories by aligning temporally adjacent points.
After fine-tuning based on the learned representations, our model enables
implicit denoising-then-classification via a single prediction, substantially
reducing inference costs. We conduct extensive experiments on various datasets and achieve state-of-the-art performance with a minimal computation budget during inference. For example, our method improves on the certified accuracy of diffusion-based methods on ImageNet across all perturbation radii, by 5.3% on average and by up to 11.6% at larger radii, while reducing inference costs by 85$\times$ on average. Code is available at:
https://github.com/jiachenlei/rRCM.
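A rough sketch of the training signal described above: instance discrimination (InfoNCE) that aligns representations of temporally adjacent points on a diffusion trajectory, so two noise levels of the same image form a positive pair. The tiny encoder, the noise schedule, and the temperature are placeholder assumptions rather than the rRCM architecture.

```python
import torch
import torch.nn.functional as F

encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))

def add_noise(x: torch.Tensor, t: float) -> torch.Tensor:
    """Toy variance-preserving forward process: x_t = sqrt(1-t)*x + sqrt(t)*noise."""
    return (1 - t) ** 0.5 * x + t ** 0.5 * torch.randn_like(x)

def consistency_loss(x: torch.Tensor, t: float, dt: float = 0.05, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE over the batch: x_t and x_{t+dt} of the same image are the positive
    pair; every other image in the batch acts as a negative."""
    z_a = F.normalize(encoder(add_noise(x, t)), dim=-1)        # (B, D)
    z_b = F.normalize(encoder(add_noise(x, t + dt)), dim=-1)   # (B, D)
    logits = z_a @ z_b.T / tau                                 # (B, B) similarities
    targets = torch.arange(x.shape[0])                         # positives on the diagonal
    return F.cross_entropy(logits, targets)

x = torch.randn(8, 3, 32, 32)          # toy image batch
loss = consistency_loss(x, t=0.3)
loss.backward()                        # trains the encoder only; at inference a single
print(float(loss))                     # encoder pass plus a linear head replaces denoise-then-classify
```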
2025-01-22
arXiv
Accelerate High-Quality Diffusion Models with Inner Loop Feedback
The paper introduces Inner Loop Feedback (ILF), a method to speed up diffusion models' inference by predicting future features using a lightweight module. This approach reduces runtime while maintaining high-quality results, and it is effective for both class-to-image and text-to-image generation. The performance of ILF is validated through various metrics including FID, CLIP score, and qualitative comparisons.
We propose Inner Loop Feedback (ILF), a novel approach to accelerate
diffusion models' inference. ILF trains a lightweight module to predict future
features in the denoising process by leveraging the outputs from a chosen
diffusion backbone block at a given time step. This approach exploits two key intuitions: (1) the outputs of a given block at adjacent time steps are
similar, and (2) performing partial computations for a step imposes a lower
burden on the model than skipping the step entirely. Our method is highly
flexible, since we find that the feedback module itself can simply be a block
from the diffusion backbone, with all settings copied. Its influence on the diffusion forward pass can be tempered with a learnable scaling factor initialized at zero. We train this module using distillation losses; however, unlike
some prior work where a full diffusion backbone serves as the student, our
model freezes the backbone, training only the feedback module. While many
efforts to optimize diffusion models focus on achieving acceptable image
quality in extremely few steps (1-4 steps), our emphasis is on matching best
case results (typically achieved in 20 steps) while significantly reducing
runtime. ILF achieves this balance effectively, demonstrating strong
performance for both class-to-image generation with diffusion transformer (DiT)
and text-to-image generation with DiT-based PixArt-alpha and PixArt-sigma. The
quality of ILF's 1.7x-1.8x speedups is confirmed by FID, CLIP score, CLIP
Image Quality Assessment, ImageReward, and qualitative comparisons. Project
information is available at https://mgwillia.github.io/ilf.
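The core mechanism, a small trainable module that predicts the next step's features from the current ones while the backbone stays frozen, can be sketched as below. The block sizes, the stand-in backbone, and the MSE distillation target are assumptions for illustration only, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class FeedbackBlock(nn.Module):
    """Predicts a future feature from the current one via a gated residual correction."""
    def __init__(self, dim: int):
        super().__init__()
        self.predictor = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))
        self.scale = nn.Parameter(torch.zeros(1))   # zero init: no influence at the start

    def forward(self, feat_t: torch.Tensor) -> torch.Tensor:
        return feat_t + self.scale * self.predictor(feat_t)

dim, batch, tokens = 256, 4, 64
backbone_block = nn.Linear(dim, dim).requires_grad_(False)    # stand-in for a frozen backbone block
feedback = FeedbackBlock(dim)
opt = torch.optim.AdamW(feedback.parameters(), lr=1e-4)

# One distillation step: only the feedback module receives gradients.
h = torch.randn(batch, tokens, dim)
feat_t = backbone_block(h)                                    # features at the current step
feat_next = backbone_block(h + 0.01 * torch.randn_like(h))    # proxy for the next step's features
opt.zero_grad()
loss = nn.functional.mse_loss(feedback(feat_t), feat_next)
loss.backward()
opt.step()
```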
2025-01-21
arXiv
GPS as a Control Signal for Image Generation
The paper demonstrates that GPS tags in photo metadata can serve as a control signal for image generation, allowing models to generate images that reflect the unique characteristics of specific locations. The model, trained on both GPS and text, captures the distinct appearance of different areas within a city. Additionally, GPS conditioning enhances the accuracy of 3D structure reconstruction.
We show that the GPS tags contained in photo metadata provide a useful
control signal for image generation. We train GPS-to-image models and use them
for tasks that require a fine-grained understanding of how images vary within a
city. In particular, we train a diffusion model to generate images conditioned
on both GPS and text. The learned model generates images that capture the
distinctive appearance of different neighborhoods, parks, and landmarks. We
also extract 3D models from 2D GPS-to-image models through score distillation
sampling, using GPS conditioning to constrain the appearance of the
reconstruction from each viewpoint. Our evaluations suggest that our
GPS-conditioned models successfully learn to generate images that vary based on
location, and that GPS conditioning improves estimated 3D structure.
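One plausible way to wire GPS into a text-to-image diffusion model, sketched under the assumption of a Fourier-feature coordinate encoding and a learned projection into the text-conditioning space (the paper's exact conditioning scheme may differ):

```python
import math
import torch
import torch.nn as nn

def gps_fourier_features(lat: float, lon: float, n_freqs: int = 8) -> torch.Tensor:
    """Map (lat, lon) to sin/cos features at several frequencies."""
    coords = torch.tensor([lat / 90.0, lon / 180.0])            # roughly normalize to [-1, 1]
    freqs = (2.0 ** torch.arange(n_freqs)) * math.pi
    angles = coords[:, None] * freqs[None, :]                   # (2, n_freqs)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten()   # (4 * n_freqs,)

text_dim = 768
to_token = nn.Linear(4 * 8, text_dim)                           # project GPS features to one token

text_tokens = torch.randn(1, 77, text_dim)                      # stand-in for the encoded text prompt
gps_token = to_token(gps_fourier_features(40.7580, -73.9855))   # example coordinates
conditioning = torch.cat([text_tokens, gps_token.view(1, 1, -1)], dim=1)   # (1, 78, text_dim)
print(conditioning.shape)               # used as cross-attention context by the diffusion model
```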
2025-01-21
arXiv
Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
Hunyuan3D 2.0 is an advanced system for generating high-resolution textured 3D assets, consisting of a shape generation model and a texture synthesis model. It outperforms previous state-of-the-art models in geometry details, condition alignment, and texture quality. The system also includes a user-friendly production platform, Hunyuan3D-Studio, for efficient 3D asset creation.
We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for
generating high-resolution textured 3D assets. This system includes two
foundation components: a large-scale shape generation model -- Hunyuan3D-DiT,
and a large-scale texture synthesis model -- Hunyuan3D-Paint. The shape
generative model, built on a scalable flow-based diffusion transformer, aims to
create geometry that properly aligns with a given condition image, laying a
solid foundation for downstream applications. The texture synthesis model,
benefiting from strong geometric and diffusion priors, produces high-resolution
and vibrant texture maps for either generated or hand-crafted meshes.
Furthermore, we build Hunyuan3D-Studio -- a versatile, user-friendly production
platform that simplifies the re-creation process of 3D assets. It allows both
professional and amateur users to manipulate or even animate their meshes
efficiently. We systematically evaluate our models, showing that Hunyuan3D 2.0 outperforms previous state-of-the-art models, both open-source and closed-source, in geometry detail, condition alignment, and texture quality. Hunyuan3D 2.0 is publicly released to help fill the gap in large-scale foundation generative models within the open-source 3D community.
The code and pre-trained weights of our models are available at:
https://github.com/Tencent/Hunyuan3D-2
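Schematically, the system is a two-stage cascade: image-conditioned shape generation followed by texture synthesis on the resulting (or a hand-crafted) mesh. The sketch below only mirrors that data flow with placeholder callables; it is not the Hunyuan3D-2 API, which lives in the linked repository.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Asset3D:
    mesh: str                       # geometry from the shape stage (placeholder type)
    texture: Optional[str] = None   # texture map from the paint stage

def generate_asset(condition_image: str,
                   shape_model: Callable[[str], str],
                   paint_model: Callable[[str, str], str]) -> Asset3D:
    """Two-stage cascade: image-conditioned shape generation, then texture
    synthesis on the resulting (or a hand-crafted) mesh."""
    mesh = shape_model(condition_image)            # the paper's shape stage is a flow-based DiT
    texture = paint_model(mesh, condition_image)   # texture stage uses geometric and diffusion priors
    return Asset3D(mesh=mesh, texture=texture)

# Toy stand-ins just to make the sketch executable.
asset = generate_asset("condition.png",
                       shape_model=lambda img: f"mesh({img})",
                       paint_model=lambda mesh, img: f"texture({mesh}, {img})")
print(asset)
```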
2025-01-21
arXiv
TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space
TokenVerse is a method for multi-concept personalization using a pre-trained text-to-image diffusion model, capable of disentangling and combining complex visual elements from multiple images. It leverages the semantic modulation space to enable localized control over various concepts, including objects, accessories, materials, pose, and lighting. The effectiveness of TokenVerse is demonstrated in challenging personalization settings, outperforming existing methods.
We present TokenVerse -- a method for multi-concept personalization,
leveraging a pre-trained text-to-image diffusion model. Our framework can
disentangle complex visual elements and attributes from as little as a single
image, while enabling seamless plug-and-play generation of combinations of
concepts extracted from multiple images. As opposed to existing works,
TokenVerse can handle multiple images with multiple concepts each, and supports
a wide range of concepts, including objects, accessories, materials, pose, and
lighting. Our work exploits a DiT-based text-to-image model, in which the input
text affects the generation through both attention and modulation (shift and
scale). We observe that the modulation space is semantic and enables localized
control over complex concepts. Building on this insight, we devise an
optimization-based framework that takes as input an image and a text
description, and finds for each word a distinct direction in the modulation
space. These directions can then be used to generate new images that combine
the learned concepts in a desired configuration. We demonstrate the
effectiveness of TokenVerse in challenging personalization settings, and
showcase its advantages over existing methods. Project webpage: https://token-verse.github.io/
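The key idea, one learnable direction per word in a DiT block's modulation (shift and scale) space, optimized from an image and its description, can be sketched as follows. The toy AdaLN-style block, the reconstruction loss, and all dimensions are illustrative assumptions rather than the TokenVerse training recipe.

```python
import torch
import torch.nn as nn

dim = 128
norm = nn.LayerNorm(dim, elementwise_affine=False)

def modulate(x: torch.Tensor, shift: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """AdaLN-style modulation as used in DiT blocks: scale and shift the normalized features."""
    return norm(x) * (1 + scale) + shift

# One learnable direction per personalized word, stored as concatenated [shift | scale] offsets.
words = ["scarf", "lighting"]
directions = {w: nn.Parameter(torch.zeros(2 * dim)) for w in words}
opt = torch.optim.Adam(directions.values(), lr=1e-2)

base_shift = torch.zeros(dim)          # normally derived from the text embedding
base_scale = torch.zeros(dim)
x = torch.randn(1, 16, dim)            # toy image tokens
target = torch.randn(1, 16, dim)       # stand-in reconstruction target for the concept image

for _ in range(10):                    # per-concept optimization loop
    d = directions["scarf"]
    out = modulate(x, base_shift + d[:dim], base_scale + d[dim:])
    loss = nn.functional.mse_loss(out, target)
    opt.zero_grad()
    loss.backward()
    opt.step()

# At generation time, the learned directions for different words can be combined.
```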