Latest Research Papers
2025-01-21
arXiv
TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space
TokenVerse is a method for multi-concept personalization using a pre-trained text-to-image diffusion model, capable of disentangling and combining complex visual elements from multiple images. It leverages the semantic modulation space to enable localized control over various concepts, including objects, accessories, materials, pose, and lighting. The effectiveness of TokenVerse is demonstrated in challenging personalization settings, outperforming existing methods.
We present TokenVerse -- a method for multi-concept personalization,
leveraging a pre-trained text-to-image diffusion model. Our framework can
disentangle complex visual elements and attributes from as little as a single
image, while enabling seamless plug-and-play generation of combinations of
concepts extracted from multiple images. As opposed to existing works,
TokenVerse can handle multiple images with multiple concepts each, and supports
a wide range of concepts, including objects, accessories, materials, pose, and
lighting. Our work exploits a DiT-based text-to-image model, in which the input
text affects the generation through both attention and modulation (shift and
scale). We observe that the modulation space is semantic and enables localized
control over complex concepts. Building on this insight, we devise an
optimization-based framework that takes as input an image and a text
description, and finds for each word a distinct direction in the modulation
space. These directions can then be used to generate new images that combine
the learned concepts in a desired configuration. We demonstrate the
effectiveness of TokenVerse in challenging personalization settings, and
showcase its advantages over existing methods. Project webpage:
https://token-verse.github.io/
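To make the modulation-direction idea above more concrete, here is a minimal, hypothetical sketch of how per-word directions might be optimized against a frozen DiT-style denoiser. The class name `ModulationDirections`, the `add_noise` helper, and the `modulation_offset` argument are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch: learn one direction in modulation space per concept word
# by optimizing a standard denoising loss on the concept image, with the DiT
# denoiser frozen. `add_noise` and the `modulation_offset` keyword are assumed
# hooks on the frozen model, not real TokenVerse APIs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulationDirections(nn.Module):
    """One learnable modulation-space direction per concept word."""
    def __init__(self, concept_words, mod_dim):
        super().__init__()
        self.directions = nn.ParameterDict(
            {w: nn.Parameter(torch.zeros(mod_dim)) for w in concept_words}
        )

    def offset(self, active_words):
        # Combine the directions of the concepts to inject into generation.
        return sum(self.directions[w] for w in active_words)

def training_step(frozen_denoiser, directions, image_latents, text_emb,
                  active_words, optimizer):
    noise = torch.randn_like(image_latents)
    t = torch.randint(0, 1000, (image_latents.shape[0],))
    noisy = frozen_denoiser.add_noise(image_latents, noise, t)  # assumed helper
    # Shift the model's (shift, scale) modulation input by the learned offset.
    pred = frozen_denoiser(noisy, t, text_emb,
                           modulation_offset=directions.offset(active_words))
    loss = F.mse_loss(pred, noise)  # only the per-word directions get gradients
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```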
2025-01-20
arXiv
Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training
This paper introduces Agent-R, an iterative self-training framework that enables language model agents to reflect and recover from errors in real-time. It uses MCTS to construct training data from erroneous trajectories and a model-guided critique mechanism for timely revision. Experiments show that Agent-R improves error recovery and outperforms baseline methods.
Large Language Model (LLM) agents are increasingly pivotal for addressing
complex tasks in interactive environments. Existing work mainly focuses on
enhancing performance through behavior cloning from stronger experts, yet such
approaches often falter in real-world applications, mainly because they cannot
recover from errors. Step-level critique data could teach such recovery, but it
is difficult and expensive to collect. Automating and dynamically constructing
self-critique
datasets is thus crucial to empowering models with intelligent agent
capabilities. In this work, we propose an iterative self-training framework,
Agent-R, that enables language agents to reflect on the fly. Unlike traditional
methods that reward or penalize actions based on correctness, Agent-R leverages
MCTS to construct training data that recovers correct trajectories from
erroneous ones. A key challenge of agent reflection lies in the necessity for
timely revision rather than waiting until the end of a rollout. To address
this, we introduce a model-guided critique construction mechanism: the actor
model identifies the first error step (within its current capability) in a
failed trajectory. Starting from that step, we splice the failed prefix with an
adjacent correct path that shares the same parent node in the tree. This strategy
enables the model to learn reflection based on its current policy, thereby yielding
better learning efficiency. To further explore the scalability of this
self-improvement paradigm, we investigate iterative refinement of both error
correction capabilities and dataset construction. Our findings demonstrate that
Agent-R continuously improves the model's ability to recover from errors and
enables timely error correction. Experiments on three interactive environments
show that Agent-R effectively equips agents to correct erroneous actions while
avoiding loops, achieving superior performance compared to baseline methods
(+5.59%).
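The splicing step lends itself to a small illustration. The sketch below is a toy reconstruction of the idea under stated assumptions (a simple tree of actions, with the actor's error index supplied as input); it is not the released Agent-R code.

```python
# Toy reconstruction of trajectory splicing (not the released Agent-R code):
# keep the failed prefix up to the first error step flagged by the actor,
# insert a reflection marker, then continue along a correct path that branches
# from the same parent node in the search tree.
from dataclasses import dataclass, field

@dataclass(eq=False)
class Node:
    action: str
    parent: "Node | None" = None
    children: list = field(default_factory=list)

def path_from_root(leaf):
    path = []
    while leaf is not None:
        path.append(leaf)
        leaf = leaf.parent
    return list(reversed(path))

def splice_revision(failed_leaf, first_error_index, correct_leaf,
                    reflection="[reflect] the previous action was wrong; revising."):
    failed_path = path_from_root(failed_leaf)
    correct_path = path_from_root(correct_leaf)
    prefix = [n.action for n in failed_path[:first_error_index + 1]]
    # The correct continuation must share the error step's parent in the tree.
    shared_parent = failed_path[first_error_index].parent
    cut = next(i for i, n in enumerate(correct_path) if n is shared_parent) + 1
    suffix = [n.action for n in correct_path[cut:]]
    return prefix + [reflection] + suffix

# Tiny usage example: a failed branch and a correct branch under the same parent.
root = Node("start")
obs = Node("observe", parent=root); root.children.append(obs)
bad = Node("wrong action", parent=obs); obs.children.append(bad)
fix = Node("right action", parent=obs); obs.children.append(fix)
done = Node("finish", parent=fix); fix.children.append(done)
print(splice_revision(bad, first_error_index=2, correct_leaf=done))
# ['start', 'observe', 'wrong action', '[reflect] ...', 'right action', 'finish']
```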
2025-01-17
arXiv
Evolving Deeper LLM Thinking
The paper introduces Mind Evolution, an evolutionary search strategy for scaling inference in Large Language Models, which outperforms other methods like Best-of-N and Sequential Revision in natural language planning tasks without the need for a formal solver.
We explore an evolutionary search strategy for scaling inference time compute
in Large Language Models. The proposed approach, Mind Evolution, uses a
language model to generate, recombine, and refine candidate responses. It
avoids the need to formalize the underlying inference problem
whenever a solution evaluator is available. Controlling for inference cost, we
find that Mind Evolution significantly outperforms other inference strategies
such as Best-of-N and Sequential Revision in natural language planning tasks.
In the TravelPlanner and Natural Plan benchmarks, Mind Evolution solves more
than 98% of the problem instances using Gemini 1.5 Pro without the use of a
formal solver.
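As a rough illustration of the generate/recombine/refine loop, here is a minimal sketch assuming an `llm` callable that maps a prompt to text and a task-specific `evaluate` scoring function; both are placeholders rather than the paper's implementation.

```python
# Minimal sketch of the generate/recombine/refine loop, assuming `llm(prompt)`
# returns text and `evaluate(candidate)` returns a numeric score. Both are
# placeholders for the purpose of illustration.
import random

def mind_evolution_sketch(task, llm, evaluate, population=8, generations=5):
    # Generation 0: independently proposed candidate solutions.
    candidates = [llm(f"Propose a solution to: {task}") for _ in range(population)]
    for _ in range(generations):
        parents = sorted(candidates, key=evaluate, reverse=True)[: population // 2]
        children = []
        while len(parents) + len(children) < population:
            a, b = random.sample(parents, 2)
            # Recombine two parents, then refine the child with a critique pass.
            child = llm(
                f"Combine the best parts of these two solutions to: {task}\n"
                f"Solution A:\n{a}\nSolution B:\n{b}"
            )
            child = llm(f"Critique and improve this solution to: {task}\n{child}")
            children.append(child)
        candidates = parents + children
    return max(candidates, key=evaluate)
```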
2025-01-16
arXiv
Foundations of Large Language Models
The book focuses on foundational concepts of large language models, covering pre-training, generative models, prompting techniques, and alignment methods. It is designed for college students, professionals, and practitioners in NLP and related fields.
This is a book about large language models. As indicated by the title, it
primarily focuses on foundational concepts rather than comprehensive coverage
of all cutting-edge technologies. The book is structured into four main
chapters, each exploring a key area: pre-training, generative models, prompting
techniques, and alignment methods. It is intended for college students,
professionals, and practitioners in natural language processing and related
fields, and can serve as a reference for anyone interested in large language
models.
2025-01-08
arXiv
FinSphere: A Conversational Stock Analysis Agent Equipped with Quantitative Tools based on Real-Time Database
This paper introduces FinSphere, a conversational stock analysis agent, along with a curated dataset and an evaluation framework to improve the quality of stock analysis. The system demonstrates superior performance compared to existing LLMs and agent-based systems.
Current financial Large Language Models (LLMs) struggle with two critical
limitations: a lack of depth in stock analysis, which impedes their ability to
generate professional-grade insights, and the absence of objective evaluation
metrics to assess the quality of stock analysis reports. To address these
challenges, this paper introduces FinSphere, a conversational stock analysis
agent, along with three major contributions: (1) Stocksis, a dataset curated by
industry experts to enhance LLMs' stock analysis capabilities, (2) AnalyScore,
a systematic evaluation framework for assessing stock analysis quality, and (3)
FinSphere, an AI agent that can generate high-quality stock analysis reports in
response to user queries. Experiments demonstrate that FinSphere achieves
superior performance compared to both general and domain-specific LLMs, as well
as existing agent-based systems, even when they are enhanced with real-time
data access and few-shot guidance. The integrated framework, which combines
real-time data feeds, quantitative tools, and an instruction-tuned LLM, yields
substantial improvements in both analytical quality and practical applicability
for real-world stock analysis.
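A hypothetical sketch of such an integrated pipeline is shown below; the function names (`fetch_quotes`, `compute_indicators`, `llm`) are placeholders and do not correspond to FinSphere's actual components.

```python
# Placeholder pipeline: real-time quotes -> quantitative indicators -> report
# drafted by an instruction-tuned LLM. Nothing here reflects FinSphere's actual
# interfaces; `fetch_quotes`, `compute_indicators`, and `llm` are assumptions.
def analyze_stock(ticker, fetch_quotes, compute_indicators, llm):
    quotes = fetch_quotes(ticker)              # real-time database query
    indicators = compute_indicators(quotes)    # quantitative tools (e.g. RSI, MACD)
    prompt = (
        "You are a stock analyst. Using the data below, write a structured "
        f"analysis report for {ticker}.\n"
        f"Recent quotes: {quotes}\n"
        f"Indicators: {indicators}\n"
    )
    return llm(prompt)                         # instruction-tuned LLM drafts the report
```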
2025-01-02
arXiv
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
FlashInfer is a customizable and efficient attention engine for LLM serving, which optimizes memory access and reduces redundancy using block-sparse and composable formats. It supports Just-In-Time (JIT) compilation for flexibility and includes a load-balanced scheduling algorithm. FlashInfer significantly improves performance, reducing inter-token latency by 29-69% and long-context inference latency by 28-30%.
Transformers, driven by attention mechanisms, form the foundation of large
language models (LLMs). As these models scale up, efficient GPU attention
kernels become essential for high-throughput and low-latency inference. Diverse
LLM applications demand flexible and high-performance attention solutions. We
present FlashInfer: a customizable and efficient attention engine for LLM
serving. FlashInfer tackles KV-cache storage heterogeneity using block-sparse
format and composable formats to optimize memory access and reduce redundancy.
It also offers a customizable attention template, enabling adaptation to
various settings through Just-In-Time (JIT) compilation. Additionally,
FlashInfer's load-balanced scheduling algorithm adjusts to the dynamism of user
requests while maintaining compatibility with CUDAGraph, which requires static
configuration. FlashInfer has been integrated into leading LLM serving
frameworks like SGLang, vLLM and MLC-Engine. Comprehensive kernel-level and
end-to-end evaluations demonstrate FlashInfer's ability to significantly boost
kernel performance across diverse inference scenarios: compared to
state-of-the-art LLM serving solutions, FlashInfer achieves a 29-69%
inter-token latency reduction over compiler backends on an LLM serving
benchmark, a 28-30% latency reduction for long-context inference, and a 13-17%
speedup for LLM serving with parallel generation.
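To illustrate the block-sparse KV-cache idea in general terms, the toy layout below stores KV pages in a shared pool and indexes each request's pages with CSR-style arrays. This mirrors the paged/block-sparse approach described above but is not FlashInfer's actual data structure or API.

```python
# Toy block-sparse (paged) KV-cache layout: all pages live in one pool and each
# request references its pages through CSR-style (indptr, indices) arrays. This
# is a generic illustration, not FlashInfer's data structures or API.
import numpy as np

page_size, num_pages, num_heads, head_dim = 16, 64, 8, 64
kv_pool = np.zeros((num_pages, 2, page_size, num_heads, head_dim), dtype=np.float16)

# Request 0 owns pages [3, 7, 12]; request 1 owns pages [0, 5].
indices = np.array([3, 7, 12, 0, 5], dtype=np.int32)
indptr = np.array([0, 3, 5], dtype=np.int32)       # request i -> indices[indptr[i]:indptr[i+1]]
last_page_len = np.array([9, 16], dtype=np.int32)  # valid tokens in each request's last page

def gather_kv(request_id):
    """Gather one request's keys and values from the shared page pool."""
    pages = indices[indptr[request_id]:indptr[request_id + 1]]
    kv = kv_pool[pages]                             # (n_pages, 2, page, heads, dim)
    k = kv[:, 0].reshape(-1, num_heads, head_dim)
    v = kv[:, 1].reshape(-1, num_heads, head_dim)
    valid = (len(pages) - 1) * page_size + last_page_len[request_id]
    return k[:valid], v[:valid]

k0, v0 = gather_kv(0)   # 2 * 16 + 9 = 41 tokens of K/V for request 0
```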
2024-12-30
arXiv
Distributed Mixture-of-Agents for Edge Inference with Large Language Models
This paper explores a distributed Mixture-of-Agents (MoA) architecture for edge inference with large language models, using decentralized gossip algorithms to enable collaboration among edge devices. It ensures queuing stability and demonstrates that certain MoA configurations produce higher-quality responses, as evaluated on the AlpacaEval 2.0 benchmark.
Mixture-of-Agents (MoA) has recently been proposed as a method to enhance
performance of large language models (LLMs), enabling multiple individual LLMs
to work together for collaborative inference. This collaborative approach
results in improved responses to user prompts compared to relying on a single
LLM. In this paper, we consider such an MoA architecture in a distributed
setting, where LLMs operate on individual edge devices, each uniquely
associated with a user and equipped with its own distributed computing power.
These devices exchange information using decentralized gossip algorithms,
allowing different device nodes to communicate without the supervision of a
centralized server. In the considered setup, different users have their own LLM
models to address user prompts. Additionally, the devices gossip either their
own user-specific prompts or augmented prompts to generate more refined answers
to certain queries. User prompts are temporarily stored in the device queues
when their corresponding LLMs are busy. Given the memory limitations of edge
devices, it is crucial to ensure that the average queue sizes in the system
remain bounded. In this paper, we address this by theoretically calculating the
queuing stability conditions for the device queues under reasonable
assumptions, which we validate experimentally as well. Further, we demonstrate
through experiments, leveraging open-source LLMs for the implementation of
distributed MoA, that certain MoA configurations produce higher-quality
responses compared to others, as evaluated on the AlpacaEval 2.0 benchmark. The
implementation is available at:
https://github.com/purbeshmitra/distributed_moa.
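The following toy simulation sketches the described setup under simplifying assumptions: each device runs a stub `respond` function instead of a real LLM, keeps a prompt queue, and gossips one pending prompt with a random peer per time slot. It is only meant to make the queuing picture concrete, not to reproduce the paper's implementation.

```python
# Toy simulation of the described setup: each device holds a stub "LLM", a
# prompt queue, and gossips one pending prompt with a random peer per time slot.
# The stub `respond` function and the arrival/gossip rates are illustrative only.
import random
from collections import deque

class Device:
    def __init__(self, name, respond):
        self.name, self.respond = name, respond
        self.queue = deque()                 # prompts waiting for the local LLM

    def gossip_with(self, other):
        if self.queue:                       # share the oldest pending prompt
            other.queue.append(self.queue[0])

    def step(self):
        if self.queue:                       # serve one prompt per time slot
            return self.respond(self.name, self.queue.popleft())
        return None

devices = [Device(f"dev{i}", lambda name, p: f"[{name}] answer to: {p}") for i in range(4)]
for t in range(50):
    for d in devices:                        # new user prompts arrive at random
        if random.random() < 0.3:
            d.queue.append(f"prompt@{t}")
    a, b = random.sample(devices, 2)         # decentralized gossip, no central server
    a.gossip_with(b)
    for d in devices:
        d.step()
print([len(d.queue) for d in devices])       # queues stay bounded under light load
```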
2024-12-30
arXiv
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation
The paper introduces self-invoking code generation, a new task to evaluate LLMs' progressive reasoning and problem-solving capabilities. It proposes three new benchmarks (HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro) and finds that while most LLMs perform well on traditional benchmarks, their performance drops significantly on self-invoking tasks. The study also identifies failure modes in the evaluation results, highlighting the need for further research in this area.
We introduce self-invoking code generation, a new task designed to evaluate
the progressive reasoning and problem-solving capabilities of LLMs. In this
task, models are presented with a base problem and a related, more complex
problem. They must solve the base problem and then utilize its solution to
address the more complex one. This work features three key contributions.
First, we propose a general recipe for generating more challenging versions of
existing benchmarks, resulting in three new benchmarks: HumanEval Pro, MBPP
Pro, and BigCodeBench-Lite Pro, specifically designed to assess LLMs on
self-invoking code generation. Second, from the analysis of experimental
results over twenty LLMs on our benchmarks, we have two important observations:
(i) Most LLMs excel in traditional code generation benchmarks like HumanEval
and MBPP, but their performance declines on self-invoking tasks. For example,
o1-mini achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro.
(ii) On self-invoking code generation tasks, instruction-tuned models
demonstrate only marginal improvements compared to the base models. Third, we
disclose the types of failure modes that exist in our evaluation results. All
these results underscore the need for further advancements in self-invoking
code generation tasks and provide a new direction for future research on
enhancing LLMs' code reasoning capabilities.
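To illustrate what a self-invoking pair looks like, here is a made-up example in the spirit of the task (not drawn from HumanEval Pro or MBPP Pro): the second function must reuse the solution to the base problem.

```python
# Made-up base/self-invoking problem pair (illustrative, not from the benchmarks):
# the second problem must call the model's own solution to the first.

def unique_words(sentence: str) -> list[str]:
    """Base problem: sorted list of unique lowercase words in a sentence."""
    return sorted(set(sentence.lower().split()))

def common_vocabulary(sentences: list[str]) -> list[str]:
    """Self-invoking problem: words shared by all sentences, built on unique_words."""
    vocabularies = [set(unique_words(s)) for s in sentences]
    return sorted(set.intersection(*vocabularies)) if vocabularies else []

assert common_vocabulary(["the cat sat", "the dog sat"]) == ["sat", "the"]
```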
2024-12-30
arXiv
Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism
This paper introduces adaptive batch size schedules for the distributed training of language models, which improve both training efficiency and generalization performance. The proposed methods are compatible with data and model parallelism and have been empirically validated on Llama family models. Theoretical convergence guarantees are also provided for these adaptive schedules.
An appropriate choice of batch sizes in large-scale model training is
crucial, yet it involves an intrinsic dilemma: large-batch
training improves training efficiency in terms of memory utilization, while
generalization performance often deteriorates because larger batches reduce
gradient noise. Despite this dilemma, the common practice of choosing batch sizes in
language model training often prioritizes training efficiency -- employing
either constant large sizes with data parallelism or implementing batch size
warmup schedules. However, such batch size schedule designs remain heuristic
and often fail to adapt to training dynamics, presenting the challenge of
designing adaptive batch size schedules. Given the abundance of available
datasets and the data-hungry nature of language models, data parallelism has
become an indispensable distributed training paradigm, enabling the use of
larger batch sizes for gradient computation. However, vanilla data parallelism
requires replicas of model parameters, gradients, and optimizer states at each
worker, which prohibits training larger models with billions of parameters. To
optimize memory usage, more advanced parallelism strategies must be employed.
In this work, we propose general-purpose and theoretically principled adaptive
batch size schedules compatible with data parallelism and model parallelism. We
develop a practical implementation with PyTorch Fully Sharded Data Parallel,
facilitating the pretraining of language models of different sizes. We
empirically demonstrate that our proposed approaches outperform constant batch
sizes and heuristic batch size warmup schedules in the pretraining of models in
the Llama family, with particular focus on smaller models with up to 3 billion
parameters. We also establish theoretical convergence guarantees for such
adaptive batch size schedules with Adam for general smooth nonconvex
objectives.
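As a hedged sketch of one family of adaptive batch-size rules (not necessarily the paper's exact schedule), the heuristic below grows the batch while gradients estimated from different microbatches disagree strongly, i.e. while averaging over a larger batch would still deliver real variance reduction.

```python
# Hedged sketch of a gradient-noise-style rule (not necessarily the paper's
# schedule): estimate how much gradients from different microbatches disagree,
# and double the batch while that disagreement dominates the signal.
import torch

def suggest_batch_size(per_microbatch_grads, current_bs, max_bs=4096):
    """per_microbatch_grads: list of flattened gradient tensors, one per microbatch."""
    grads = torch.stack(per_microbatch_grads)             # (num_microbatches, dim)
    mean_grad = grads.mean(dim=0)
    noise = (grads - mean_grad).pow(2).sum(dim=1).mean()  # spread across microbatches
    signal = mean_grad.pow(2).sum()
    noise_to_signal = (noise / (signal + 1e-12)).item()
    # While noise dominates, averaging over a larger batch still buys real
    # variance reduction; once the signal dominates, keep the batch size fixed.
    return min(current_bs * 2, max_bs) if noise_to_signal > 1.0 else current_bs
```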
2024-12-27
arXiv
A Survey on Large Language Model Acceleration based on KV Cache Management
This survey provides a comprehensive overview of Key-Value (KV) cache management strategies for accelerating Large Language Model (LLM) inference, categorizing them into token-level, model-level, and system-level optimizations. It aims to offer insights and support the development of efficient and scalable KV cache management techniques for practical LLM deployment.
Large Language Models (LLMs) have revolutionized a wide range of domains such
as natural language processing, computer vision, and multi-modal tasks due to
their ability to comprehend context and perform logical reasoning. However, the
computational and memory demands of LLMs, particularly during inference, pose
significant challenges when scaling them to real-world, long-context, and
real-time applications. Key-Value (KV) cache management has emerged as a
critical optimization technique for accelerating LLM inference by reducing
redundant computations and improving memory utilization. This survey provides a
comprehensive overview of KV cache management strategies for LLM acceleration,
categorizing them into token-level, model-level, and system-level
optimizations. Token-level strategies include KV cache selection, budget
allocation, merging, quantization, and low-rank decomposition, while
model-level optimizations focus on architectural innovations and attention
mechanisms to enhance KV reuse. System-level approaches address memory
management, scheduling, and hardware-aware designs to improve efficiency across
diverse computing environments. Additionally, the survey provides an overview
of both text and multimodal datasets and benchmarks used to evaluate these
strategies. By presenting detailed taxonomies and comparative analyses, this
work aims to offer useful insights for researchers and practitioners to support
the development of efficient and scalable KV cache management techniques,
contributing to the practical deployment of LLMs in real-world applications.
The curated paper list for KV cache management is available at
https://github.com/TreeAI-Lab/Awesome-KV-Cache-Management.
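To ground one of the token-level strategies, here is a toy symmetric int8 quantization of cached keys/values; the surveyed methods use finer per-channel or per-token schemes, so this is illustrative only.

```python
# Toy symmetric int8 quantization of cached keys/values (one fp32 scale per
# tensor); the surveyed methods use finer per-channel/per-token schemes.
import numpy as np

def quantize_kv(x: np.ndarray):
    scale = float(np.abs(x).max()) / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

keys = np.random.randn(128, 8, 64).astype(np.float32)    # (tokens, heads, head_dim)
q, s = quantize_kv(keys)
print("int8 cache bytes:", q.nbytes, "vs fp32:", keys.nbytes)   # ~4x smaller
print("max abs error:", np.abs(dequantize_kv(q, s) - keys).max())
```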