Latest Research Papers
2025-01-22
arXiv
Evolution and The Knightian Blindspot of Machine Learning
The paper highlights a critical blind spot in machine learning, specifically its inability to handle Knightian uncertainty, and contrasts this with the robustness of biological evolution. It argues for the importance of addressing this gap to create more robust AI, especially in open-world scenarios.
This paper claims that machine learning (ML) largely overlooks an important
facet of general intelligence: robustness to a qualitatively unknown future in
an open world. Such robustness relates to Knightian uncertainty (KU) in
economics, i.e. uncertainty that cannot be quantified, which is excluded from
consideration in ML's key formalisms. This paper aims to identify this blind
spot, argue its importance, and catalyze research into addressing it, which we
believe is necessary to create truly robust open-world AI. To help illuminate
the blind spot, we contrast one area of ML, reinforcement learning (RL), with
the process of biological evolution. Despite staggering ongoing progress, RL
still struggles in open-world situations, often failing under unforeseen
situations. For example, the idea of zero-shot transferring a self-driving car
policy trained only in the US to the UK currently seems exceedingly ambitious.
In dramatic contrast, biological evolution routinely produces agents that
thrive within an open world, sometimes even in situations that are remarkably
out-of-distribution (e.g. invasive species; or humans, who do undertake such
zero-shot international driving). Interestingly, evolution achieves such
robustness without explicit theory, formalisms, or mathematical gradients. We
explore the assumptions underlying RL's typical formalisms, showing how they
limit RL's engagement with the unknown unknowns characteristic of an
ever-changing complex world. Further, we identify mechanisms through which
evolutionary processes foster robustness to novel and unpredictable challenges,
and discuss potential pathways to algorithmically embody them. The conclusion
is that the intriguing remaining fragility of ML may result from blind spots in
its formalisms, and that significant gains may result from direct confrontation
with the challenge of KU.
2025-01-22
arXiv
SRMT: Shared Memory for Multi-agent Lifelong Pathfinding
The paper introduces SRMT, a method that enhances coordination in multi-agent systems by sharing and broadcasting working memories. SRMT outperforms various baselines in partially observable pathfinding tasks, particularly under sparse rewards. The results show that shared recurrent memory can improve cooperation in decentralized multi-agent settings.
Multi-agent reinforcement learning (MARL) demonstrates significant progress
in solving cooperative and competitive multi-agent problems in various
environments. One of the principal challenges in MARL is the need for explicit
prediction of the agents' behavior to achieve cooperation. To resolve this
issue, we propose the Shared Recurrent Memory Transformer (SRMT) which extends
memory transformers to multi-agent settings by pooling and globally
broadcasting individual working memories, enabling agents to exchange
information implicitly and coordinate their actions. We evaluate SRMT on the
Partially Observable Multi-Agent Pathfinding problem in a toy Bottleneck
navigation task that requires agents to pass through a narrow corridor and on the
POGEMA benchmark set of tasks. In the Bottleneck task, SRMT consistently
outperforms a variety of reinforcement learning baselines, especially under
sparse rewards, and generalizes effectively to longer corridors than those seen
during training. On POGEMA maps, including Mazes, Random, and MovingAI, SRMT is
competitive with recent MARL, hybrid, and planning-based algorithms. These
results suggest that incorporating shared recurrent memory into
transformer-based architectures can enhance coordination in decentralized
multi-agent systems. The source code for training and evaluation is available
on GitHub: https://github.com/Aloriosa/srmt.
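The pooling-and-broadcast idea described in the abstract can be illustrated with a minimal sketch. This is not the SRMT implementation: it assumes each agent's working memory is a plain vector, omits all learned projections and the recurrent transformer core, and uses the agent's own memory as the attention query. The names `read_shared_memory` and `broadcast_step` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def read_shared_memory(query, shared_memory):
    """Scaled dot-product attention of one agent's query over the pooled memories."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, mem) / scale for mem in shared_memory])
    dim = len(shared_memory[0])
    # Weighted average of the pooled memory vectors.
    return [sum(w * mem[i] for w, mem in zip(weights, shared_memory)) for i in range(dim)]


def broadcast_step(memories):
    """Pool every agent's working memory, then let each agent read the whole pool."""
    shared = list(memories)  # assumption: the pool is simply all individual memories
    return [read_shared_memory(own, shared) for own in memories]


# Three agents with 2-dimensional working memories (toy values).
memories = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
updated = broadcast_step(memories)
for agent_id, vec in enumerate(updated):
    print(agent_id, vec)
```

Each agent's updated vector is a convex combination of all agents' memories, which is the sense in which information is exchanged implicitly rather than through explicit messages or behavior prediction.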
2025-01-22
arXiv
Robust Representation Consistency Model via Contrastive Denoising
The paper introduces a new method for robust representation consistency via contrastive denoising, which improves the robustness of deep neural networks against adversarial perturbations and reduces computational overhead during inference. The method reformulates the generative modeling task as a discriminative task in the latent space, enabling implicit denoising-then-classification with a single prediction, and achieves state-of-the-art performance on various datasets.
Robustness is essential for deep neural networks, especially in
security-sensitive applications. To this end, randomized smoothing provides
theoretical guarantees for certifying robustness against adversarial
perturbations. Recently, diffusion models have been successfully employed for
randomized smoothing to purify noise-perturbed samples before making
predictions with a standard classifier. While these methods excel at small
perturbation radii, they struggle with larger perturbations and incur a
significant computational overhead during inference compared to classical
methods. To address this, we reformulate the generative modeling task along the
diffusion trajectories in pixel space as a discriminative task in the latent
space. Specifically, we use instance discrimination to achieve consistent
representations along the trajectories by aligning temporally adjacent points.
After fine-tuning based on the learned representations, our model enables
implicit denoising-then-classification via a single prediction, substantially
reducing inference costs. We conduct extensive experiments on various datasets
and achieve state-of-the-art performance with minimal computation budget during
inference. For example, our method outperforms the certified accuracy of
diffusion-based methods on ImageNet across all perturbation radii by 5.3% on
average, with up to 11.6% at larger radii, while reducing inference costs by
85$\times$ on average. Code is available at:
https://github.com/jiachenlei/rRCM.
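For readers unfamiliar with randomized smoothing, the sketch below shows the basic mechanism the abstract builds on: classify many Gaussian-perturbed copies of an input, take the majority vote, and convert the vote share into an L2 radius with the standard bound R = sigma * Phi^{-1}(p). It is not the paper's method. The classifier `toy_classifier` is a hypothetical stand-in, and a real certificate would use a statistical lower confidence bound on p rather than the raw vote share used here.

```python
import random
from collections import Counter
from statistics import NormalDist


def toy_classifier(x):
    """Hypothetical base classifier: class 1 if the coordinates sum to a positive value."""
    return 1 if sum(x) > 0 else 0


def smoothed_predict(classifier, x, sigma, n_samples, rng):
    """Majority vote of the classifier over Gaussian-perturbed copies of x."""
    votes = Counter()
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[classifier(noisy)] += 1
    top_class, top_count = votes.most_common(1)[0]
    return top_class, top_count / n_samples


def certified_radius(p_lower, sigma):
    """L2 radius sigma * Phi^{-1}(p_lower); zero when the top class has no majority."""
    if p_lower <= 0.5:
        return 0.0
    # Clamp below 1 because the inverse normal CDF is undefined at exactly 1.
    p = min(p_lower, 1.0 - 1e-12)
    return sigma * NormalDist().inv_cdf(p)


rng = random.Random(0)
label, share = smoothed_predict(toy_classifier, [5.0, 5.0], sigma=0.1, n_samples=200, rng=rng)
print(label, share, certified_radius(0.9, sigma=0.1))
```

Every noisy copy requires a forward pass, and diffusion-based variants add a denoising pass on top, which is the inference overhead the abstract aims to reduce with a single prediction.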
2025-01-21
arXiv
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
UI-TARS is a native GUI agent model that performs human-like interactions based on screenshots, outperforming existing frameworks in benchmarks. It incorporates enhanced perception, unified action modeling, system-2 reasoning, and iterative training with reflective online traces. The model continuously learns and adapts with minimal human intervention.
This paper introduces UI-TARS, a native GUI agent model that perceives only
screenshots as input and performs human-like interactions (e.g., keyboard
and mouse operations). Unlike prevailing agent frameworks that depend on
heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts
and workflows, UI-TARS is an end-to-end model that outperforms these
sophisticated frameworks. Experiments demonstrate its superior performance:
UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating
perception, grounding, and GUI task execution. Notably, in the OSWorld
benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15
steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld,
UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several
key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of
GUI screenshots for context-aware understanding of UI elements and precise
captioning; (2) Unified Action Modeling, which standardizes actions into a
unified space across platforms and achieves precise grounding and interaction
through large-scale action traces; (3) System-2 Reasoning, which incorporates
deliberate reasoning into multi-step decision making, involving multiple
reasoning patterns such as task decomposition, reflection thinking, and milestone
recognition; and (4) Iterative Training with Reflective Online Traces, which
addresses the data bottleneck by automatically collecting, filtering, and
reflectively refining new interaction traces on hundreds of virtual machines.
Through iterative training and reflection tuning, UI-TARS continuously learns
from its mistakes and adapts to unforeseen situations with minimal human
intervention. We also analyze the evolution path of GUI agents to guide the
further development of this domain.
2025-01-21
arXiv
Physics of Skill Learning
The paper explores the physics of skill learning in neural networks, introducing three models to understand the sequential learning process and its dynamics. These models provide insights into neural scaling laws, learning dynamics, and the benefits of modularity. The models also inspire practical algorithmic changes that can improve the training efficiency of deep learning models.
We aim to understand the physics of skill learning, i.e., how skills are learned
in neural networks during training. We start by observing the Domino effect,
i.e., skills are learned sequentially, and notably, some skills kick off
learning right after others complete learning, similar to the sequential fall
of domino cards. To understand the Domino effect and relevant behaviors of
skill learning, we take a physicist's approach of abstraction and simplification.
We propose three models with varying complexities -- the Geometry model, the
Resource model, and the Domino model, trading between reality and simplicity.
The Domino effect can be reproduced in the Geometry model, whose resource
interpretation inspires the Resource model, which can be further simplified to
the Domino model. These models present different levels of abstraction and
simplification; each is useful to study some aspects of skill learning. The
Geometry model provides interesting insights into neural scaling laws and
optimizers; the Resource model sheds light on the learning dynamics of
compositional tasks; the Domino model reveals the benefits of modularity. These
models are not only conceptually interesting -- e.g., we show how Chinchilla
scaling laws can emerge from the Geometry model, but also are useful in
practice by inspiring algorithmic development -- e.g., we show how simple
algorithmic changes, motivated by these toy models, can speed up the training
of deep learning models.
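For reference, the Chinchilla scaling law mentioned in the abstract is usually written in the following parametric form, where $N$ is the number of model parameters, $D$ the number of training tokens, $E$ the irreducible loss, and $A$, $B$, $\alpha$, $\beta$ fitted constants. How the Geometry model gives rise to this form is the paper's contribution and is not reproduced here.

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```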
2025-01-21
arXiv
Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
Hunyuan3D 2.0 is an advanced system for generating high-resolution textured 3D assets, consisting of a shape generation model and a texture synthesis model. It outperforms previous state-of-the-art models in geometry details, condition alignment, and texture quality. The system also includes a user-friendly production platform, Hunyuan3D-Studio, for efficient 3D asset creation.
We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for
generating high-resolution textured 3D assets. This system includes two
foundation components: a large-scale shape generation model -- Hunyuan3D-DiT,
and a large-scale texture synthesis model -- Hunyuan3D-Paint. The shape
generative model, built on a scalable flow-based diffusion transformer, aims to
create geometry that properly aligns with a given condition image, laying a
solid foundation for downstream applications. The texture synthesis model,
benefiting from strong geometric and diffusion priors, produces high-resolution
and vibrant texture maps for either generated or hand-crafted meshes.
Furthermore, we build Hunyuan3D-Studio -- a versatile, user-friendly production
platform that simplifies the re-creation process of 3D assets. It allows both
professional and amateur users to manipulate or even animate their meshes
efficiently. We systematically evaluate our models, showing that Hunyuan3D 2.0
outperforms previous state-of-the-art models, including both open-source and
closed-source models, in geometry details, condition alignment, texture
quality, and more. Hunyuan3D 2.0 is publicly released in order to fill the gaps
in the open-source 3D community for large-scale foundation generative models.
The code and pre-trained weights of our models are available at:
https://github.com/Tencent/Hunyuan3D-2
2025-01-21
arXiv
Evaluating many-body stabilizer Rényi entropy by sampling reduced Pauli strings: singularities, volume law, and nonlocal magic
A new quantum Monte Carlo method for evaluating the α-stabilizer Rényi entropy (SRE) is introduced, allowing for efficient computation of SRE and its derivatives. The method separates the free energy contribution from the characteristic function, revealing that α-SRE does not always peak at quantum critical points. Volume-law corrections to ground-state magic are also studied and are found to be stronger diagnostics for criticalities than full-state magic.
We present a novel quantum Monte Carlo scheme for evaluating the
$\alpha$-stabilizer Rényi entropy (SRE) with any integer $\alpha\ge 2$. By
interpreting $\alpha$-SRE as a ratio of generalized partition functions, we
prove that it can be simulated by sampling reduced Pauli strings within a
reduced configuration space. This allows for straightforward computation of the
values and derivatives of $\alpha$-SRE using techniques such as
reweight-annealing and thermodynamic integration. Moreover, our approach
separates the free energy contribution in $\alpha$-SRE, so that the contribution
solely from the characteristic function, which is directly tied to magic, can be
studied. In our applications to the ground states of the 1D and 2D transverse
field Ising (TFI) models, we reveal that the behavior of $2$-SRE is governed by
the interplay between the characteristic function and the free energy
contributions, with singularities hidden in both of their derivatives at
quantum critical points. This indicates that $\alpha$-SRE does not necessarily
exhibit a peak at the quantum critical point for a general many-body system. We
also study the volume-law corrections to the ground-state magic. These
corrections slightly violate the strict volume law and suggest discontinuity at
quantum critical points, which we attribute to the abrupt change of the
ground-state magical structure. Our findings suggest that volume-law
corrections of magic are stronger diagnostics for criticalities than the
full-state magic. Lastly, we study the finite-temperature phase transition of
the 2D TFI model, where the $2$-SRE is not a well-defined magic measure. The
nonphysical results we obtain also prove the ineffectiveness of $2$-SRE for
mixed states. Our method enables scalable and efficient evaluation of
$\alpha$-SRE in large-scale quantum systems, providing a powerful tool for
exploring the roles of magic in many-body systems.
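As background, the $\alpha$-SRE of a pure $N$-qubit state $|\psi\rangle$ is commonly defined as below, with $\mathcal{P}_N$ the set of $4^N$ Pauli strings. This is the standard pure-state definition; the paper's own conventions (for example the logarithm base, or its treatment of thermal states) may differ.

```latex
M_\alpha(|\psi\rangle)
  = \frac{1}{1-\alpha}
    \log\!\left( \frac{1}{2^N} \sum_{P \in \mathcal{P}_N}
      \langle \psi | P | \psi \rangle^{2\alpha} \right)
```

For a stabilizer state, exactly $2^N$ Pauli strings have expectation value $\pm 1$ and the rest vanish, so the sum equals $2^N$ and $M_\alpha = 0$, consistent with stabilizer states carrying no magic.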
2025-01-21
arXiv
GPS as a Control Signal for Image Generation
The paper demonstrates that GPS tags in photo metadata can serve as a control signal for image generation, allowing models to generate images that reflect the unique characteristics of specific locations. The model, trained on both GPS and text, captures the distinct appearance of different areas within a city. Additionally, GPS conditioning enhances the accuracy of 3D structure reconstruction.
We show that the GPS tags contained in photo metadata provide a useful
control signal for image generation. We train GPS-to-image models and use them
for tasks that require a fine-grained understanding of how images vary within a
city. In particular, we train a diffusion model to generate images conditioned
on both GPS and text. The learned model generates images that capture the
distinctive appearance of different neighborhoods, parks, and landmarks. We
also extract 3D models from 2D GPS-to-image models through score distillation
sampling, using GPS conditioning to constrain the appearance of the
reconstruction from each viewpoint. Our evaluations suggest that our
GPS-conditioned models successfully learn to generate images that vary based on
location, and that GPS conditioning improves estimated 3D structure.
2025-01-21
arXiv
TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space
TokenVerse is a method for multi-concept personalization using a pre-trained text-to-image diffusion model, capable of disentangling and combining complex visual elements from multiple images. It leverages the semantic modulation space to enable localized control over various concepts, including objects, accessories, materials, pose, and lighting. The effectiveness of TokenVerse is demonstrated in challenging personalization settings, outperforming existing methods.
We present TokenVerse -- a method for multi-concept personalization,
leveraging a pre-trained text-to-image diffusion model. Our framework can
disentangle complex visual elements and attributes from as little as a single
image, while enabling seamless plug-and-play generation of combinations of
concepts extracted from multiple images. As opposed to existing works,
TokenVerse can handle multiple images with multiple concepts each, and supports
a wide range of concepts, including objects, accessories, materials, pose, and
lighting. Our work exploits a DiT-based text-to-image model, in which the input
text affects the generation through both attention and modulation (shift and
scale). We observe that the modulation space is semantic and enables localized
control over complex concepts. Building on this insight, we devise an
optimization-based framework that takes as input an image and a text
description, and finds for each word a distinct direction in the modulation
space. These directions can then be used to generate new images that combine
the learned concepts in a desired configuration. We demonstrate the
effectiveness of TokenVerse in challenging personalization settings, and
showcase its advantages over existing methods. Project webpage:
https://token-verse.github.io/
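The shift-and-scale modulation mentioned in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the TokenVerse code: it uses the `x * (1 + scale) + shift` convention common in DiT-style adaptive layer norm, treats embeddings as plain lists, and models a per-word "direction" as an additive offset to the shared modulation vectors. The function names are hypothetical.

```python
def modulate(tokens, shift, scale):
    """Apply one shared shift-and-scale modulation to every token embedding."""
    return [[x * (1.0 + s) + b for x, s, b in zip(tok, scale, shift)] for tok in tokens]


def modulate_per_token(tokens, shift, scale, shift_offsets, scale_offsets):
    """Like modulate, but each token adds its own offset to the shared vectors."""
    out = []
    for tok, d_shift, d_scale in zip(tokens, shift_offsets, scale_offsets):
        out.append([
            x * (1.0 + s + ds) + b + db
            for x, s, b, ds, db in zip(tok, scale, shift, d_scale, d_shift)
        ])
    return out


tokens = [[1.0, 2.0], [3.0, 4.0]]  # two toy token embeddings
shift = [0.5, 0.5]                 # shared shift vector
scale = [0.0, 1.0]                 # shared scale vector

shared = modulate(tokens, shift, scale)

# Only the second token receives a learned offset (assumed values).
shift_offsets = [[0.0, 0.0], [1.0, 0.0]]
scale_offsets = [[0.0, 0.0], [0.0, 0.0]]
personalized = modulate_per_token(tokens, shift, scale, shift_offsets, scale_offsets)
print(shared)
print(personalized)
```

Because the offset is attached to a single token, only that token's modulation changes, which is one way to picture the localized control the abstract describes.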
2025-01-21
arXiv
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
The paper introduces InternVideo2.5, which enhances video MLLMs by incorporating long and rich context (LRC) modeling, improving their ability to handle fine-grained details and long-form temporal structures. This approach significantly boosts performance on video understanding benchmarks and enables the model to process much longer video inputs. The work underscores the importance of multimodal context richness for advancing MLLM capabilities.
This paper aims to improve the performance of video multimodal large language
models (MLLMs) via long and rich context (LRC) modeling. To this end, we develop
a new version of InternVideo2.5 with a focus on enhancing the original MLLMs'
ability to perceive fine-grained details and capture long-form temporal
structure in videos. Specifically, our approach incorporates dense vision task
annotations into MLLMs using direct preference optimization and develops
compact spatiotemporal representations through adaptive hierarchical token
compression. Experimental results demonstrate that this unique design of LRC greatly
improves the results of video MLLMs on mainstream video understanding benchmarks
(short & long), enabling the MLLM to memorize significantly longer video inputs
(at least 6x longer than the original), and master specialized vision
capabilities like object tracking and segmentation. Our work highlights the
importance of multimodal context richness (length and fineness) in empowering
MLLMs' innate abilities (focus and memory), providing new insights for future
research on video MLLMs. Code and models are available at
https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5