Latest Research Papers from arXiv

2025-01-22

arXiv

Evolution and The Knightian Blindspot of Machine Learning

Joel Lehman, Elliot Meyerson, Tarek El-Gaaly, Kenneth O. Stanley, Tarin Ziyaee

The paper highlights a critical blind spot in machine learning, specifically its inability to handle Knightian uncertainty, and contrasts this with the robustness of biological evolution. It argues for the importance of addressing this gap to create more robust AI, especially in open-world scenarios.

This paper claims that machine learning (ML) largely overlooks an important facet of general intelligence: robustness to a qualitatively unknown future in an open world. Such robustness relates to Knightian uncertainty (KU) in economics, i.e. uncertainty that cannot be quantified, which is excluded from consideration in ML's key formalisms. This paper aims to identify this blind spot, argue its importance, and catalyze research into addressing it, which we believe is necessary to create truly robust open-world AI. To help illuminate the blind spot, we contrast one area of ML, reinforcement learning (RL), with the process of biological evolution. Despite staggering ongoing progress, RL still struggles in open-world situations, often failing in unforeseen situations. For example, the idea of zero-shot transferring a self-driving car policy trained only in the US to the UK currently seems exceedingly ambitious. In dramatic contrast, biological evolution routinely produces agents that thrive within an open world, sometimes even in situations that are remarkably out-of-distribution (e.g. invasive species; or humans, who do undertake such zero-shot international driving). Interestingly, evolution achieves such robustness without explicit theory, formalisms, or mathematical gradients. We explore the assumptions underlying RL's typical formalisms, showing how they limit RL's engagement with the unknown unknowns characteristic of an ever-changing complex world. Further, we identify mechanisms through which evolutionary processes foster robustness to novel and unpredictable challenges, and discuss potential pathways to algorithmically embody them. The conclusion is that the intriguing remaining fragility of ML may result from blind spots in its formalisms, and that significant gains may result from direct confrontation with the challenge of KU.

2025-01-22

arXiv

SRMT: Shared Memory for Multi-agent Lifelong Pathfinding

Alsu Sagirova, Yuri Kuratov, Mikhail Burtsev

The paper introduces SRMT, a method that enhances coordination in multi-agent systems by sharing and broadcasting working memories. SRMT outperforms various baselines in partially observable pathfinding tasks, particularly under sparse rewards. The results show that shared recurrent memory can improve cooperation in decentralized multi-agent settings.

Multi-agent reinforcement learning (MARL) demonstrates significant progress in solving cooperative and competitive multi-agent problems in various environments. One of the principal challenges in MARL is the need for explicit prediction of the agents' behavior to achieve cooperation. To resolve this issue, we propose the Shared Recurrent Memory Transformer (SRMT) which extends memory transformers to multi-agent settings by pooling and globally broadcasting individual working memories, enabling agents to exchange information implicitly and coordinate their actions. We evaluate SRMT on the Partially Observable Multi-Agent Pathfinding problem in a toy Bottleneck navigation task that requires agents to pass through a narrow corridor and on a POGEMA benchmark set of tasks. In the Bottleneck task, SRMT consistently outperforms a variety of reinforcement learning baselines, especially under sparse rewards, and generalizes effectively to longer corridors than those seen during training. On POGEMA maps, including Mazes, Random, and MovingAI, SRMT is competitive with recent MARL, hybrid, and planning-based algorithms. These results suggest that incorporating shared recurrent memory into the transformer-based architectures can enhance coordination in decentralized multi-agent systems. The source code for training and evaluation is available on GitHub: https://github.com/Aloriosa/srmt.

Decentralized Coordination, Multi-agent Reinforcement Learning, Pathfinding, Shared Recurrent Memory, Transformers
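
The pooling-and-broadcasting idea described in the SRMT abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that): it assumes each agent holds a single memory vector, that the shared memory is simply the list of all agents' vectors, and that each agent reads from it with plain dot-product attention. The function names `read_shared_memory` and `broadcast_step` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_shared_memory(query, shared_memory):
    """Attend over the pooled memories using one agent's memory as the query.

    Assumption: scaled dot-product attention with no learned projections.
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, mem)) / math.sqrt(dim)
        for mem in shared_memory
    ]
    weights = softmax(scores)
    return [
        sum(w * mem[i] for w, mem in zip(weights, shared_memory))
        for i in range(dim)
    ]


def broadcast_step(agent_memories):
    """Pool every agent's memory, then let each agent read the pooled set."""
    shared = list(agent_memories)  # pooling: one slot per agent
    return [read_shared_memory(m, shared) for m in agent_memories]


if __name__ == "__main__":
    memories = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for updated in broadcast_step(memories):
        print(updated)
```

Each agent's updated vector is a weighted mix of all agents' memories, which is one simple way information could be exchanged implicitly without any agent predicting the others' behavior.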

2025-01-22

arXiv

Robust Representation Consistency Model via Contrastive Denoising

Jiachen Lei, Julius Berner, Jiongxiao Wang, Zhongzhu Chen, Zhongjia Ba

The paper introduces a new method for robust representation consistency via contrastive denoising, which improves the robustness of deep neural networks against adversarial perturbations and reduces computational overhead during inference. The method reformulates the generative modeling task as a discriminative task in the latent space, enabling implicit denoising-then-classification with a single prediction, and achieves state-of-the-art performance on various datasets.

Robustness is essential for deep neural networks, especially in security-sensitive applications. To this end, randomized smoothing provides theoretical guarantees for certifying robustness against adversarial perturbations. Recently, diffusion models have been successfully employed for randomized smoothing to purify noise-perturbed samples before making predictions with a standard classifier. While these methods excel at small perturbation radii, they struggle with larger perturbations and incur a significant computational overhead during inference compared to classical methods. To address this, we reformulate the generative modeling task along the diffusion trajectories in pixel space as a discriminative task in the latent space. Specifically, we use instance discrimination to achieve consistent representations along the trajectories by aligning temporally adjacent points. After fine-tuning based on the learned representations, our model enables implicit denoising-then-classification via a single prediction, substantially reducing inference costs. We conduct extensive experiments on various datasets and achieve state-of-the-art performance with a minimal computation budget during inference. For example, our method outperforms the certified accuracy of diffusion-based methods on ImageNet across all perturbation radii by 5.3% on average, with up to 11.6% at larger radii, while reducing inference costs by 85× on average. Code is available at: https://github.com/jiachenlei/rRCM.

Adversarial Perturbations, Contrastive Learning, Diffusion Models, Efficient Inference, Instance Discrimination, Randomized Smoothing, Robustness in Deep Neural Networks
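
For readers unfamiliar with randomized smoothing, which the abstract above builds on, the prediction step can be sketched as a majority vote over Gaussian-perturbed copies of the input. This is a generic illustration, not the paper's model: the base classifier here is a toy stand-in, the names `smoothed_predict` and `toy_classifier` are hypothetical, and the statistical certification of a robustness radius is omitted.

```python
import random
from collections import Counter


def smoothed_predict(base_classifier, x, sigma, n_samples=1000, seed=0):
    """Majority-vote prediction of the smoothed classifier.

    Adds i.i.d. Gaussian noise N(0, sigma^2) to every coordinate of x,
    classifies each noisy copy, and returns the most frequent label
    together with its empirical vote share.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[base_classifier(noisy)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_samples


def toy_classifier(x):
    """Hypothetical base classifier: label 1 if the coordinates sum to > 0."""
    return 1 if sum(x) > 0 else 0


if __name__ == "__main__":
    label, share = smoothed_predict(toy_classifier, [2.0, 2.0], sigma=0.25)
    print(label, share)
```

Note that every noisy sample costs one forward pass of the base classifier, so any per-sample denoising step multiplies inference cost. That is the overhead the abstract says a single-prediction model reduces.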

2025-01-21

arXiv

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Xin Liu, Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang

UI-TARS is a native GUI agent model that performs human-like interactions based on screenshots, outperforming existing frameworks in benchmarks. It incorporates enhanced perception, unified action modeling, system-2 reasoning, and iterative training with reflective online traces. The model continuously learns and adapts with minimal human intervention.

This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception, which leverages a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, and milestone recognition; and (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.

Benchmark Performance, End-to-End Models, Enhanced Perception, GUI Agents, Human-like Interactions, Iterative Training, Reflective Online Traces, System-2 Reasoning, Unified Action Modeling

2025-01-21

arXiv

Physics of Skill Learning

Ziming Liu, Yizhou Liu, Eric J. Michaud, Jeff Gore, Max Tegmark

The paper explores the physics of skill learning in neural networks, introducing three models to understand the sequential learning process and its dynamics. These models provide insights into neural scaling laws, learning dynamics, and the benefits of modularity. The models also inspire practical algorithmic changes that can improve the training efficiency of deep learning models.

We aim to understand the physics of skill learning, i.e., how skills are learned in neural networks during training. We start by observing the Domino effect, i.e., skills are learned sequentially, and notably, some skills kick off learning right after others complete learning, similar to the sequential fall of dominoes. To understand the Domino effect and relevant behaviors of skill learning, we take a physicist's approach of abstraction and simplification. We propose three models with varying complexities -- the Geometry model, the Resource model, and the Domino model, trading between reality and simplicity. The Domino effect can be reproduced in the Geometry model, whose resource interpretation inspires the Resource model, which can be further simplified to the Domino model. These models present different levels of abstraction and simplification; each is useful to study some aspects of skill learning. The Geometry model provides interesting insights into neural scaling laws and optimizers; the Resource model sheds light on the learning dynamics of compositional tasks; the Domino model reveals the benefits of modularity. These models are not only conceptually interesting (e.g., we show how Chinchilla scaling laws can emerge from the Geometry model) but also useful in practice by inspiring algorithmic development (e.g., we show how simple algorithmic changes, motivated by these toy models, can speed up the training of deep learning models).

Algorithmic Development, Modularity in Neural Networks, Neural Networks, Neural Scaling Laws, Physics of Learning, Sequential Learning, Skill Learning

2025-01-21

arXiv

Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

Hao Zhang, Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao

Hunyuan3D 2.0 is an advanced system for generating high-resolution textured 3D assets, consisting of a shape generation model and a texture synthesis model. It outperforms previous state-of-the-art models in geometry details, condition alignment, and texture quality. The system also includes a user-friendly production platform, Hunyuan3D-Studio, for efficient 3D asset creation.

We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. This system includes two foundation components: a large-scale shape generation model -- Hunyuan3D-DiT, and a large-scale texture synthesis model -- Hunyuan3D-Paint. The shape generative model, built on a scalable flow-based diffusion transformer, aims to create geometry that properly aligns with a given condition image, laying a solid foundation for downstream applications. The texture synthesis model, benefiting from strong geometric and diffusion priors, produces high-resolution and vibrant texture maps for either generated or hand-crafted meshes. Furthermore, we build Hunyuan3D-Studio -- a versatile, user-friendly production platform that simplifies the re-creation process of 3D assets. It allows both professional and amateur users to manipulate or even animate their meshes efficiently. We systematically evaluate our models, showing that Hunyuan3D 2.0 outperforms previous state-of-the-art models, both open-source and closed-source, in geometry details, condition alignment, texture quality, and more. Hunyuan3D 2.0 is publicly released in order to fill the gaps in the open-source 3D community for large-scale foundation generative models. The code and pre-trained weights of our models are available at: https://github.com/Tencent/Hunyuan3D-2

3D Asset Generation, 3D Production Platform, Diffusion Models, Diffusion Priors, Flow-based Diffusion Transformer, Geometric Priors, Shape Generation, Texture Synthesis

2025-01-21

arXiv

Evaluating many-body stabilizer Rényi entropy by sampling reduced Pauli strings: singularities, volume law, and nonlocal magic

Yi-Ming Ding, Zhe Wang, Zheng Yan

A new quantum Monte Carlo method for evaluating the α-stabilizer Rényi entropy (SRE) is introduced, allowing for efficient computation of SRE and its derivatives. The method separates the free energy contribution from the characteristic function, revealing that α-SRE does not always peak at quantum critical points. Volume-law corrections to ground-state magic are also studied, showing stronger diagnostics for criticalities than full-state magic.

We present a novel quantum Monte Carlo scheme for evaluating the $\alpha$-stabilizer Rényi entropy (SRE) with any integer $\alpha\ge 2$. By interpreting $\alpha$-SRE as a ratio of generalized partition functions, we prove that it can be simulated by sampling reduced Pauli strings within a reduced configuration space. This allows for straightforward computation of the values and derivatives of $\alpha$-SRE using techniques such as reweight-annealing and thermodynamic integration. Moreover, our approach separates the free energy contribution in $\alpha$-SRE, so that the contribution solely from the characteristic function, which is directly tied to magic, can be studied. In our applications to the ground states of the 1D and 2D transverse field Ising (TFI) models, we reveal that the behavior of $2$-SRE is governed by the interplay between the characteristic function and the free energy contributions, with singularities hidden in both of their derivatives at quantum critical points. This indicates that $\alpha$-SRE does not necessarily exhibit a peak at the quantum critical point for a general many-body system. We also study the volume-law corrections to the ground-state magic. These corrections slightly violate the strict volume law and suggest discontinuity at quantum critical points, which we attribute to the abrupt change of the ground-state magical structure. Our findings suggest that volume-law corrections of magic are stronger diagnostics for criticalities than the full-state magic. Lastly, we study the finite-temperature phase transition of the 2D TFI model, where the $2$-SRE is not a well-defined magic measure. The nonphysical results we obtain also prove the ineffectiveness of $2$-SRE for mixed states. Our method enables scalable and efficient evaluation of $\alpha$-SRE in large-scale quantum systems, providing a powerful tool for exploring the roles of magic in many-body systems.

Ground-State Magic, Many-Body Systems, Quantum Critical Points, Quantum Monte Carlo, Stabilizer Rényi Entropy, Volume Law
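
As a point of reference for the quantity discussed above, the stabilizer Rényi entropy of an N-qubit pure state is commonly defined as $M_\alpha = (1-\alpha)^{-1}\log\big(2^{-N}\sum_P \langle\psi|P|\psi\rangle^{2\alpha}\big)$, with the sum running over all Pauli strings $P$. The sketch below assumes the abstract follows this standard definition and evaluates it by brute-force enumeration for a single qubit (N = 1). It is not the paper's Monte Carlo scheme, which exists precisely because such enumeration does not scale. The function names are hypothetical.

```python
import cmath
import math

# Single-qubit Pauli matrices as nested tuples.
PAULIS = {
    "I": ((1, 0), (0, 1)),
    "X": ((0, 1), (1, 0)),
    "Y": ((0, -1j), (1j, 0)),
    "Z": ((1, 0), (0, -1)),
}


def expectation(state, op):
    """Return <psi|P|psi> for a normalized single-qubit state (a, b)."""
    a, b = state
    pa = op[0][0] * a + op[0][1] * b
    pb = op[1][0] * a + op[1][1] * b
    # Pauli operators are Hermitian, so the expectation value is real.
    return (a.conjugate() * pa + b.conjugate() * pb).real


def stabilizer_renyi_entropy(state, alpha=2):
    """Brute-force alpha-SRE for one qubit (the 2^N factor is 2 for N = 1)."""
    total = sum(expectation(state, p) ** (2 * alpha) for p in PAULIS.values())
    return math.log(total / 2) / (1 - alpha)


if __name__ == "__main__":
    zero = (1.0, 0.0)  # stabilizer state |0>
    t_state = (1 / math.sqrt(2), cmath.exp(1j * math.pi / 4) / math.sqrt(2))
    print(stabilizer_renyi_entropy(zero))     # zero magic, up to rounding
    print(stabilizer_renyi_entropy(t_state))  # log(4/3), about 0.2877
```

The stabilizer state |0> has zero magic, while the T-type state gives $M_2 = \log(4/3)$, since its Pauli expectation values are 1, $1/\sqrt{2}$, $1/\sqrt{2}$, and 0.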

2025-01-21

arXiv

GPS as a Control Signal for Image Generation

Chao Feng, Ziyang Chen, Aleksander Holynski, Alexei A. Efros, Andrew Owens

The paper demonstrates that GPS tags in photo metadata can serve as a control signal for image generation, allowing models to generate images that reflect the unique characteristics of specific locations. The model, trained on both GPS and text, captures the distinct appearance of different areas within a city. Additionally, GPS conditioning enhances the accuracy of 3D structure reconstruction.

We show that the GPS tags contained in photo metadata provide a useful control signal for image generation. We train GPS-to-image models and use them for tasks that require a fine-grained understanding of how images vary within a city. In particular, we train a diffusion model to generate images conditioned on both GPS and text. The learned model generates images that capture the distinctive appearance of different neighborhoods, parks, and landmarks. We also extract 3D models from 2D GPS-to-image models through score distillation sampling, using GPS conditioning to constrain the appearance of the reconstruction from each viewpoint. Our evaluations suggest that our GPS-conditioned models successfully learn to generate images that vary based on location, and that GPS conditioning improves estimated 3D structure.

3D Reconstruction, Diffusion Models, GPS Metadata, Image Generation, Score Distillation Sampling, Text Conditioning

2025-01-21

arXiv

TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space

Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada

TokenVerse is a method for multi-concept personalization using a pre-trained text-to-image diffusion model, capable of disentangling and combining complex visual elements from multiple images. It leverages the semantic modulation space to enable localized control over various concepts, including objects, accessories, materials, pose, and lighting. The effectiveness of TokenVerse is demonstrated in challenging personalization settings, outperforming existing methods.

We present TokenVerse -- a method for multi-concept personalization, leveraging a pre-trained text-to-image diffusion model. Our framework can disentangle complex visual elements and attributes from as little as a single image, while enabling seamless plug-and-play generation of combinations of concepts extracted from multiple images. As opposed to existing works, TokenVerse can handle multiple images with multiple concepts each, and supports a wide range of concepts, including objects, accessories, materials, pose, and lighting. Our work exploits a DiT-based text-to-image model, in which the input text affects the generation through both attention and modulation (shift and scale). We observe that the modulation space is semantic and enables localized control over complex concepts. Building on this insight, we devise an optimization-based framework that takes as input an image and a text description, and finds for each word a distinct direction in the modulation space. These directions can then be used to generate new images that combine the learned concepts in a desired configuration. We demonstrate the effectiveness of TokenVerse in challenging personalization settings, and showcase its advantages over existing methods. Project webpage: https://token-verse.github.io/

Diffusion Models, Multi-concept Learning, Personalization, Semantic Modulation, Text-to-Image Generation
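
The "modulation (shift and scale)" mentioned in the TokenVerse abstract can be illustrated with a small sketch. It rests on simplifying assumptions that are not taken from the paper: modulation is applied elementwise as `x * (1 + scale) + shift`, and a learned per-word direction is represented directly as an offset to the shift and scale vectors. The helper names and the `directions` dictionary are hypothetical.

```python
def modulate(token, shift, scale):
    """Apply an elementwise shift-and-scale modulation to one token vector."""
    return [x * (1.0 + s) + b for x, s, b in zip(token, scale, shift)]


def modulate_per_token(tokens, words, shift, scale, directions):
    """Modulate each text token, adding a per-word offset where one exists.

    `directions` maps a word to a (shift_offset, scale_offset) pair; words
    without an entry receive only the global modulation. Representing a
    learned direction this way is an assumption made for illustration.
    """
    zero = [0.0] * len(shift)
    out = []
    for token, word in zip(tokens, words):
        d_shift, d_scale = directions.get(word, (zero, zero))
        tok_shift = [a + b for a, b in zip(shift, d_shift)]
        tok_scale = [a + b for a, b in zip(scale, d_scale)]
        out.append(modulate(token, tok_shift, tok_scale))
    return out


if __name__ == "__main__":
    tokens = [[1.0, 1.0], [1.0, 1.0]]
    words = ["a", "dog"]
    directions = {"dog": ([1.0, 0.0], [0.0, 1.0])}  # hypothetical direction
    print(modulate_per_token(tokens, words, [0.0, 0.0], [0.0, 0.0], directions))
    # prints [[1.0, 1.0], [2.0, 2.0]]
```

Because only the token for "dog" receives an offset, the change stays local to that word, which mirrors the localized control the abstract attributes to the modulation space.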

2025-01-21

arXiv

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu

The paper introduces InternVideo2.5, which enhances video MLLMs by incorporating long and rich context (LRC) modeling, improving their ability to handle fine-grained details and long-form temporal structures. This approach significantly boosts performance on video understanding benchmarks and enables the model to process much longer video inputs. The work underscores the importance of multimodal context richness for advancing MLLM capabilities.

This paper aims to improve the performance of video multimodal large language models (MLLMs) via long and rich context (LRC) modeling. As a result, we develop a new version of InternVideo2.5 with a focus on enhancing the original MLLMs' ability to perceive fine-grained details and capture long-form temporal structure in videos. Specifically, our approach incorporates dense vision task annotations into MLLMs using direct preference optimization and develops compact spatiotemporal representations through adaptive hierarchical token compression. Experimental results demonstrate that this unique design of LRC greatly improves the results of video MLLMs on mainstream video understanding benchmarks (short & long), enabling the MLLM to memorize significantly longer video inputs (at least 6x longer than the original), and master specialized vision capabilities like object tracking and segmentation. Our work highlights the importance of multimodal context richness (length and fineness) in empowering MLLMs' innate abilities (focus and memory), providing new insights for future research on video MLLMs. Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5

Adaptive Hierarchical Token Compression, Dense Vision Task Annotations, Long and Rich Context Modeling, Object Tracking and Segmentation, Video Multimodal Large Language Models, Video Understanding Benchmarks
