Latest Research Papers
2025-01-21
arXiv
TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space
TokenVerse is a method for multi-concept personalization using a pre-trained text-to-image diffusion model, capable of disentangling and combining complex visual elements from multiple images. It leverages the semantic modulation space to enable localized control over various concepts, including objects, accessories, materials, pose, and lighting. The effectiveness of TokenVerse is demonstrated in challenging personalization settings, outperforming existing methods.
We present TokenVerse -- a method for multi-concept personalization,
leveraging a pre-trained text-to-image diffusion model. Our framework can
disentangle complex visual elements and attributes from as little as a single
image, while enabling seamless plug-and-play generation of combinations of
concepts extracted from multiple images. As opposed to existing works,
TokenVerse can handle multiple images with multiple concepts each, and supports
a wide range of concepts, including objects, accessories, materials, pose, and
lighting. Our work exploits a DiT-based text-to-image model, in which the input
text affects the generation through both attention and modulation (shift and
scale). We observe that the modulation space is semantic and enables localized
control over complex concepts. Building on this insight, we devise an
optimization-based framework that takes as input an image and a text
description, and finds for each word a distinct direction in the modulation
space. These directions can then be used to generate new images that combine
the learned concepts in a desired configuration. We demonstrate the
effectiveness of TokenVerse in challenging personalization settings, and
showcase its advantages over existing methods. Project webpage:
https://token-verse.github.io/
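To make the modulation-direction idea above more concrete, here is a minimal, hypothetical sketch of how per-word directions might be optimized against a frozen DiT-style denoiser. The class name `ModulationDirections`, the `add_noise` helper, and the `modulation_offset` argument are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch: learn one direction in modulation space per concept word
# by optimizing a standard denoising loss on the concept image, with the DiT
# denoiser frozen. `add_noise` and the `modulation_offset` keyword are assumed
# hooks on the frozen model, not real TokenVerse APIs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulationDirections(nn.Module):
    """One learnable modulation-space direction per concept word."""
    def __init__(self, concept_words, mod_dim):
        super().__init__()
        self.directions = nn.ParameterDict(
            {w: nn.Parameter(torch.zeros(mod_dim)) for w in concept_words}
        )

    def offset(self, active_words):
        # Combine the directions of the concepts to inject into generation.
        return sum(self.directions[w] for w in active_words)

def training_step(frozen_denoiser, directions, image_latents, text_emb,
                  active_words, optimizer):
    noise = torch.randn_like(image_latents)
    t = torch.randint(0, 1000, (image_latents.shape[0],))
    noisy = frozen_denoiser.add_noise(image_latents, noise, t)  # assumed helper
    # Shift the model's (shift, scale) modulation input by the learned offset.
    pred = frozen_denoiser(noisy, t, text_emb,
                           modulation_offset=directions.offset(active_words))
    loss = F.mse_loss(pred, noise)  # only the per-word directions get gradients
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```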
2025-01-20
arXiv
Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training
This paper introduces Agent-R, an iterative self-training framework that enables language model agents to reflect and recover from errors in real-time. It uses MCTS to construct training data from erroneous trajectories and a model-guided critique mechanism for timely revision. Experiments show that Agent-R improves error recovery and outperforms baseline methods.
Large Language Model (LLM) agents are increasingly pivotal for addressing
complex tasks in interactive environments. Existing work mainly focuses on
enhancing performance through behavior cloning from stronger experts, yet such
approaches often falter in real-world applications, mainly because they cannot
recover from errors. Step-level critique data could teach such recovery, but it
is difficult and expensive to collect. Automating and dynamically constructing
self-critique
datasets is thus crucial to empowering models with intelligent agent
capabilities. In this work, we propose an iterative self-training framework,
Agent-R, that enables language agents to reflect on the fly. Unlike traditional
methods that reward or penalize actions based on correctness, Agent-R leverages
MCTS to construct training data that recovers correct trajectories from
erroneous ones. A key challenge of agent reflection lies in the necessity for
timely revision rather than waiting until the end of a rollout. To address
this, we introduce a model-guided critique construction mechanism: the actor
model identifies the first error step (within its current capability) in a
failed trajectory. Starting from that step, we splice the failed prefix with an
adjacent correct path that shares the same parent node in the tree. This strategy
enables the model to learn reflection based on its current policy, thereby yielding
better learning efficiency. To further explore the scalability of this
self-improvement paradigm, we investigate iterative refinement of both error
correction capabilities and dataset construction. Our findings demonstrate that
Agent-R continuously improves the model's ability to recover from errors and
enables timely error correction. Experiments on three interactive environments
show that Agent-R effectively equips agents to correct erroneous actions while
avoiding loops, achieving superior performance compared to baseline methods
(+5.59%).
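The splicing step lends itself to a small illustration. The sketch below is a toy reconstruction of the idea under stated assumptions (a simple tree of actions, with the actor's error index supplied as input); it is not the released Agent-R code.

```python
# Toy reconstruction of trajectory splicing (not the released Agent-R code):
# keep the failed prefix up to the first error step flagged by the actor,
# insert a reflection marker, then continue along a correct path that branches
# from the same parent node in the search tree.
from dataclasses import dataclass, field

@dataclass(eq=False)
class Node:
    action: str
    parent: "Node | None" = None
    children: list = field(default_factory=list)

def path_from_root(leaf):
    path = []
    while leaf is not None:
        path.append(leaf)
        leaf = leaf.parent
    return list(reversed(path))

def splice_revision(failed_leaf, first_error_index, correct_leaf,
                    reflection="[reflect] the previous action was wrong; revising."):
    failed_path = path_from_root(failed_leaf)
    correct_path = path_from_root(correct_leaf)
    prefix = [n.action for n in failed_path[:first_error_index + 1]]
    # The correct continuation must share the error step's parent in the tree.
    shared_parent = failed_path[first_error_index].parent
    cut = next(i for i, n in enumerate(correct_path) if n is shared_parent) + 1
    suffix = [n.action for n in correct_path[cut:]]
    return prefix + [reflection] + suffix

# Tiny usage example: a failed branch and a correct branch under the same parent.
root = Node("start")
obs = Node("observe", parent=root); root.children.append(obs)
bad = Node("wrong action", parent=obs); obs.children.append(bad)
fix = Node("right action", parent=obs); obs.children.append(fix)
done = Node("finish", parent=fix); fix.children.append(done)
print(splice_revision(bad, first_error_index=2, correct_leaf=done))
# ['start', 'observe', 'wrong action', '[reflect] ...', 'right action', 'finish']
```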
2025-01-17
arXiv
Evolving Deeper LLM Thinking
The paper introduces Mind Evolution, an evolutionary search strategy for scaling inference in Large Language Models, which outperforms other methods like Best-of-N and Sequential Revision in natural language planning tasks without the need for a formal solver.
We explore an evolutionary search strategy for scaling inference time compute
in Large Language Models. The proposed approach, Mind Evolution, uses a
language model to generate, recombine, and refine candidate responses. It
avoids the need to formalize the underlying inference problem
whenever a solution evaluator is available. Controlling for inference cost, we
find that Mind Evolution significantly outperforms other inference strategies
such as Best-of-N and Sequential Revision in natural language planning tasks.
In the TravelPlanner and Natural Plan benchmarks, Mind Evolution solves more
than 98% of the problem instances using Gemini 1.5 Pro without the use of a
formal solver.
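As a rough illustration of the generate/recombine/refine loop, here is a minimal sketch assuming an `llm` callable that maps a prompt to text and a task-specific `evaluate` scoring function; both are placeholders rather than the paper's implementation.

```python
# Minimal sketch of the generate/recombine/refine loop, assuming `llm(prompt)`
# returns text and `evaluate(candidate)` returns a numeric score. Both are
# placeholders for the purpose of illustration.
import random

def mind_evolution_sketch(task, llm, evaluate, population=8, generations=5):
    # Generation 0: independently proposed candidate solutions.
    candidates = [llm(f"Propose a solution to: {task}") for _ in range(population)]
    for _ in range(generations):
        parents = sorted(candidates, key=evaluate, reverse=True)[: population // 2]
        children = []
        while len(parents) + len(children) < population:
            a, b = random.sample(parents, 2)
            # Recombine two parents, then refine the child with a critique pass.
            child = llm(
                f"Combine the best parts of these two solutions to: {task}\n"
                f"Solution A:\n{a}\nSolution B:\n{b}"
            )
            child = llm(f"Critique and improve this solution to: {task}\n{child}")
            children.append(child)
        candidates = parents + children
    return max(candidates, key=evaluate)
```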
2025-01-16
arXiv
Foundations of Large Language Models
The book focuses on foundational concepts of large language models, covering pre-training, generative models, prompting techniques, and alignment methods. It is designed for college students, professionals, and practitioners in NLP and related fields.
This is a book about large language models. As indicated by the title, it
primarily focuses on foundational concepts rather than comprehensive coverage
of all cutting-edge technologies. The book is structured into four main
chapters, each exploring a key area: pre-training, generative models, prompting
techniques, and alignment methods. It is intended for college students,
professionals, and practitioners in natural language processing and related
fields, and can serve as a reference for anyone interested in large language
models.
2025-01-08
arXiv
FinSphere: A Conversational Stock Analysis Agent Equipped with Quantitative Tools based on Real-Time Database
This paper introduces FinSphere, a conversational stock analysis agent, along with a curated dataset and an evaluation framework to improve the quality of stock analysis. The system demonstrates superior performance compared to existing LLMs and agent-based systems.
Current financial Large Language Models (LLMs) struggle with two critical
limitations: a lack of depth in stock analysis, which impedes their ability to
generate professional-grade insights, and the absence of objective evaluation
metrics to assess the quality of stock analysis reports. To address these
challenges, this paper introduces FinSphere, a conversational stock analysis
agent, along with three major contributions: (1) Stocksis, a dataset curated by
industry experts to enhance LLMs' stock analysis capabilities, (2) AnalyScore,
a systematic evaluation framework for assessing stock analysis quality, and (3)
FinSphere, an AI agent that can generate high-quality stock analysis reports in
response to user queries. Experiments demonstrate that FinSphere achieves
superior performance compared to both general and domain-specific LLMs, as well
as existing agent-based systems, even when they are enhanced with real-time
data access and few-shot guidance. The integrated framework, which combines
real-time data feeds, quantitative tools, and an instruction-tuned LLM, yields
substantial improvements in both analytical quality and practical applicability
for real-world stock analysis.
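A hypothetical sketch of such an integrated pipeline is shown below; the function names (`fetch_quotes`, `compute_indicators`, `llm`) are placeholders and do not correspond to FinSphere's actual components.

```python
# Placeholder pipeline: real-time quotes -> quantitative indicators -> report
# drafted by an instruction-tuned LLM. Nothing here reflects FinSphere's actual
# interfaces; `fetch_quotes`, `compute_indicators`, and `llm` are assumptions.
def analyze_stock(ticker, fetch_quotes, compute_indicators, llm):
    quotes = fetch_quotes(ticker)              # real-time database query
    indicators = compute_indicators(quotes)    # quantitative tools (e.g. RSI, MACD)
    prompt = (
        "You are a stock analyst. Using the data below, write a structured "
        f"analysis report for {ticker}.\n"
        f"Recent quotes: {quotes}\n"
        f"Indicators: {indicators}\n"
    )
    return llm(prompt)                         # instruction-tuned LLM drafts the report
```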
2025-01-02
arXiv
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
FlashInfer is a customizable and efficient attention engine for LLM serving, which optimizes memory access and reduces redundancy using block-sparse and composable formats. It supports Just-In-Time (JIT) compilation for flexibility and includes a load-balanced scheduling algorithm. FlashInfer significantly improves performance, reducing inter-token latency by 29-69% and long-context inference latency by 28-30%.
Transformers, driven by attention mechanisms, form the foundation of large
language models (LLMs). As these models scale up, efficient GPU attention
kernels become essential for high-throughput and low-latency inference. Diverse
LLM applications demand flexible and high-performance attention solutions. We
present FlashInfer: a customizable and efficient attention engine for LLM
serving. FlashInfer tackles KV-cache storage heterogeneity using block-sparse
format and composable formats to optimize memory access and reduce redundancy.
It also offers a customizable attention template, enabling adaptation to
various settings through Just-In-Time (JIT) compilation. Additionally,
FlashInfer's load-balanced scheduling algorithm adjusts to the dynamism of user
requests while maintaining compatibility with CUDAGraph, which requires static
configuration. FlashInfer has been integrated into leading LLM serving
frameworks like SGLang, vLLM and MLC-Engine. Comprehensive kernel-level and
end-to-end evaluations demonstrate FlashInfer's ability to significantly boost
kernel performance across diverse inference scenarios: compared to
state-of-the-art LLM serving solutions, FlashInfer achieves a 29-69%
inter-token latency reduction over compiler backends on an LLM serving
benchmark, a 28-30% latency reduction for long-context inference, and a 13-17%
speedup for LLM serving with parallel generation.
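To illustrate the block-sparse KV-cache idea in general terms, the toy layout below stores KV pages in a shared pool and indexes each request's pages with CSR-style arrays. This mirrors the paged/block-sparse approach described above but is not FlashInfer's actual data structure or API.

```python
# Toy block-sparse (paged) KV-cache layout: all pages live in one pool and each
# request references its pages through CSR-style (indptr, indices) arrays. This
# is a generic illustration, not FlashInfer's data structures or API.
import numpy as np

page_size, num_pages, num_heads, head_dim = 16, 64, 8, 64
kv_pool = np.zeros((num_pages, 2, page_size, num_heads, head_dim), dtype=np.float16)

# Request 0 owns pages [3, 7, 12]; request 1 owns pages [0, 5].
indices = np.array([3, 7, 12, 0, 5], dtype=np.int32)
indptr = np.array([0, 3, 5], dtype=np.int32)       # request i -> indices[indptr[i]:indptr[i+1]]
last_page_len = np.array([9, 16], dtype=np.int32)  # valid tokens in each request's last page

def gather_kv(request_id):
    """Gather one request's keys and values from the shared page pool."""
    pages = indices[indptr[request_id]:indptr[request_id + 1]]
    kv = kv_pool[pages]                             # (n_pages, 2, page, heads, dim)
    k = kv[:, 0].reshape(-1, num_heads, head_dim)
    v = kv[:, 1].reshape(-1, num_heads, head_dim)
    valid = (len(pages) - 1) * page_size + last_page_len[request_id]
    return k[:valid], v[:valid]

k0, v0 = gather_kv(0)   # 2 * 16 + 9 = 41 tokens of K/V for request 0
```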
2024-12-30
arXiv
Distributed Mixture-of-Agents for Edge Inference with Large Language Models
This paper explores a distributed Mixture-of-Agents (MoA) architecture for edge inference with large language models, using decentralized gossip algorithms to enable collaboration among edge devices. It ensures queuing stability and demonstrates that certain MoA configurations produce higher-quality responses, as evaluated on the AlpacaEval 2.0 benchmark.
Mixture-of-Agents (MoA) has recently been proposed as a method to enhance
performance of large language models (LLMs), enabling multiple individual LLMs
to work together for collaborative inference. This collaborative approach
results in improved responses to user prompts compared to relying on a single
LLM. In this paper, we consider such an MoA architecture in a distributed
setting, where LLMs operate on individual edge devices, each uniquely
associated with a user and equipped with its own distributed computing power.
These devices exchange information using decentralized gossip algorithms,
allowing different device nodes to communicate without the supervision of a
centralized server. In the considered setup, different users have their own LLM
models to address user prompts. Additionally, the devices gossip either their
own user-specific prompts or augmented prompts to generate more refined answers
to certain queries. User prompts are temporarily stored in the device queues
when their corresponding LLMs are busy. Given the memory limitations of edge
devices, it is crucial to ensure that the average queue sizes in the system
remain bounded. In this paper, we address this by theoretically calculating the
queuing stability conditions for the device queues under reasonable
assumptions, which we validate experimentally as well. Further, we demonstrate
through experiments, leveraging open-source LLMs for the implementation of
distributed MoA, that certain MoA configurations produce higher-quality
responses compared to others, as evaluated on the AlpacaEval 2.0 benchmark. The
implementation is available at:
https://github.com/purbeshmitra/distributed_moa.
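The following toy simulation sketches the described setup under simplifying assumptions: each device runs a stub `respond` function instead of a real LLM, keeps a prompt queue, and gossips one pending prompt with a random peer per time slot. It is only meant to make the queuing picture concrete, not to reproduce the paper's implementation.

```python
# Toy simulation of the described setup: each device holds a stub "LLM", a
# prompt queue, and gossips one pending prompt with a random peer per time slot.
# The stub `respond` function and the arrival/gossip rates are illustrative only.
import random
from collections import deque

class Device:
    def __init__(self, name, respond):
        self.name, self.respond = name, respond
        self.queue = deque()                 # prompts waiting for the local LLM

    def gossip_with(self, other):
        if self.queue:                       # share the oldest pending prompt
            other.queue.append(self.queue[0])

    def step(self):
        if self.queue:                       # serve one prompt per time slot
            return self.respond(self.name, self.queue.popleft())
        return None

devices = [Device(f"dev{i}", lambda name, p: f"[{name}] answer to: {p}") for i in range(4)]
for t in range(50):
    for d in devices:                        # new user prompts arrive at random
        if random.random() < 0.3:
            d.queue.append(f"prompt@{t}")
    a, b = random.sample(devices, 2)         # decentralized gossip, no central server
    a.gossip_with(b)
    for d in devices:
        d.step()
print([len(d.queue) for d in devices])       # queues stay bounded under light load
```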
2024-12-30
arXiv
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation
The paper introduces self-invoking code generation, a new task to evaluate LLMs' progressive reasoning and problem-solving capabilities. It proposes three new benchmarks (HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro) and finds that while most LLMs perform well on traditional benchmarks, their performance drops significantly on self-invoking tasks. The study also identifies failure modes in the evaluation results, highlighting the need for further research in this area.
We introduce self-invoking code generation, a new task designed to evaluate
the progressive reasoning and problem-solving capabilities of LLMs. In this
task, models are presented with a base problem and a related, more complex
problem. They must solve the base problem and then utilize its solution to
address the more complex one. This work features three key contributions.
First, we propose a general recipe for generating more challenging versions of
existing benchmarks, resulting in three new benchmarks: HumanEval Pro, MBPP
Pro, and BigCodeBench-Lite Pro, specifically designed to assess LLMs on
self-invoking code generation. Second, from the analysis of experimental
results over twenty LLMs on our benchmarks, we have two important observations:
(i) Most LLMs excel in traditional code generation benchmarks like HumanEval
and MBPP, but their performance declines on self-invoking tasks. For example,
o1-mini achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro.
(ii) On self-invoking code generation tasks, instruction-tuned models
demonstrate only marginal improvements compared to the base models. Third, we
disclose the types of failure modes that exist in our evaluation results. All
these results underscore the need for further advancements in self-invoking
code generation tasks and provide a new direction for future research on
enhancing LLMs' code reasoning capabilities.
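To illustrate what a self-invoking pair looks like, here is a made-up example in the spirit of the task (not drawn from HumanEval Pro or MBPP Pro): the second function must reuse the solution to the base problem.

```python
# Made-up base/self-invoking problem pair (illustrative, not from the benchmarks):
# the second problem must call the model's own solution to the first.

def unique_words(sentence: str) -> list[str]:
    """Base problem: sorted list of unique lowercase words in a sentence."""
    return sorted(set(sentence.lower().split()))

def common_vocabulary(sentences: list[str]) -> list[str]:
    """Self-invoking problem: words shared by all sentences, built on unique_words."""
    vocabularies = [set(unique_words(s)) for s in sentences]
    return sorted(set.intersection(*vocabularies)) if vocabularies else []

assert common_vocabulary(["the cat sat", "the dog sat"]) == ["sat", "the"]
```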
2024-12-30
arXiv
Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism
This paper introduces adaptive batch size schedules for the distributed training of language models, which improve both training efficiency and generalization performance. The proposed methods are compatible with data and model parallelism and have been empirically validated on Llama family models. Theoretical convergence guarantees are also provided for these adaptive schedules.
An appropriate choice of batch sizes in large-scale model training is
crucial, yet it involves an intrinsic dilemma: large-batch
training improves training efficiency in terms of memory utilization, while
generalization performance often deteriorates because larger batches reduce
gradient noise. Despite this dilemma, the common practice of choosing batch sizes in
language model training often prioritizes training efficiency -- employing
either constant large sizes with data parallelism or implementing batch size
warmup schedules. However, such batch size schedule designs remain heuristic
and often fail to adapt to training dynamics, presenting the challenge of
designing adaptive batch size schedules. Given the abundance of available
datasets and the data-hungry nature of language models, data parallelism has
become an indispensable distributed training paradigm, enabling the use of
larger batch sizes for gradient computation. However, vanilla data parallelism
requires replicas of model parameters, gradients, and optimizer states at each
worker, which prohibits training larger models with billions of parameters. To
optimize memory usage, more advanced parallelism strategies must be employed.
In this work, we propose general-purpose and theoretically principled adaptive
batch size schedules compatible with data parallelism and model parallelism. We
develop a practical implementation with PyTorch Fully Sharded Data Parallel,
facilitating the pretraining of language models of different sizes. We
empirically demonstrate that our proposed approaches outperform constant batch
sizes and heuristic batch size warmup schedules in the pretraining of models in
the Llama family, with particular focus on smaller models with up to 3 billion
parameters. We also establish theoretical convergence guarantees for such
adaptive batch size schedules with Adam for general smooth nonconvex
objectives.
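As a hedged sketch of one family of adaptive batch-size rules (not necessarily the paper's exact schedule), the heuristic below grows the batch while gradients estimated from different microbatches disagree strongly, i.e. while averaging over a larger batch would still deliver real variance reduction.

```python
# Hedged sketch of a gradient-noise-style rule (not necessarily the paper's
# schedule): estimate how much gradients from different microbatches disagree,
# and double the batch while that disagreement dominates the signal.
import torch

def suggest_batch_size(per_microbatch_grads, current_bs, max_bs=4096):
    """per_microbatch_grads: list of flattened gradient tensors, one per microbatch."""
    grads = torch.stack(per_microbatch_grads)             # (num_microbatches, dim)
    mean_grad = grads.mean(dim=0)
    noise = (grads - mean_grad).pow(2).sum(dim=1).mean()  # spread across microbatches
    signal = mean_grad.pow(2).sum()
    noise_to_signal = (noise / (signal + 1e-12)).item()
    # While noise dominates, averaging over a larger batch still buys real
    # variance reduction; once the signal dominates, keep the batch size fixed.
    return min(current_bs * 2, max_bs) if noise_to_signal > 1.0 else current_bs
```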
2024-12-27
arXiv
A Survey on Large Language Model Acceleration based on KV Cache Management
This survey provides a comprehensive overview of Key-Value (KV) cache management strategies for accelerating Large Language Model (LLM) inference, categorizing them into token-level, model-level, and system-level optimizations. It aims to offer insights and support the development of efficient and scalable KV cache management techniques for practical LLM deployment.
Large Language Models (LLMs) have revolutionized a wide range of domains such
as natural language processing, computer vision, and multi-modal tasks due to
their ability to comprehend context and perform logical reasoning. However, the
computational and memory demands of LLMs, particularly during inference, pose
significant challenges when scaling them to real-world, long-context, and
real-time applications. Key-Value (KV) cache management has emerged as a
critical optimization technique for accelerating LLM inference by reducing
redundant computations and improving memory utilization. This survey provides a
comprehensive overview of KV cache management strategies for LLM acceleration,
categorizing them into token-level, model-level, and system-level
optimizations. Token-level strategies include KV cache selection, budget
allocation, merging, quantization, and low-rank decomposition, while
model-level optimizations focus on architectural innovations and attention
mechanisms to enhance KV reuse. System-level approaches address memory
management, scheduling, and hardware-aware designs to improve efficiency across
diverse computing environments. Additionally, the survey provides an overview
of both text and multimodal datasets and benchmarks used to evaluate these
strategies. By presenting detailed taxonomies and comparative analyses, this
work aims to offer useful insights for researchers and practitioners to support
the development of efficient and scalable KV cache management techniques,
contributing to the practical deployment of LLMs in real-world applications.
The curated paper list for KV cache management is available at
https://github.com/TreeAI-Lab/Awesome-KV-Cache-Management.
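To ground one of the token-level strategies, here is a toy symmetric int8 quantization of cached keys/values; the surveyed methods use finer per-channel or per-token schemes, so this is illustrative only.

```python
# Toy symmetric int8 quantization of cached keys/values (one fp32 scale per
# tensor); the surveyed methods use finer per-channel/per-token schemes.
import numpy as np

def quantize_kv(x: np.ndarray):
    scale = float(np.abs(x).max()) / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

keys = np.random.randn(128, 8, 64).astype(np.float32)    # (tokens, heads, head_dim)
q, s = quantize_kv(keys)
print("int8 cache bytes:", q.nbytes, "vs fp32:", keys.nbytes)   # ~4x smaller
print("max abs error:", np.abs(dequantize_kv(q, s) - keys).max())
```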