Latest Research Papers
2025-01-28
arXiv
Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies
The paper discusses the limitations of using Reinforcement Learning (RL) to ensure safety in advanced LLMs like DeepSeek-R1 and proposes a hybrid approach combining RL and Supervised Fine-Tuning (SFT) to mitigate harmful outputs.
Large Language Models (LLMs) have achieved remarkable progress in reasoning,
alignment, and task-specific performance. However, ensuring harmlessness in
these systems remains a critical challenge, particularly in advanced models
like DeepSeek-R1. This paper examines the limitations of Reinforcement Learning
(RL) as the primary approach for reducing harmful outputs in DeepSeek-R1 and
compares it with Supervised Fine-Tuning (SFT). While RL improves reasoning
capabilities, it faces challenges such as reward hacking, generalization
failures, language mixing, and high computational costs. We propose hybrid
training approaches that combine RL and SFT to reduce harmful outputs more
robustly. Usage recommendations and future directions for deploying
DeepSeek-R1 responsibly are also presented.
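The hybrid recipe is described only at a high level in the abstract. As a rough, hedged sketch of what alternating supervised and RL-style updates could look like on a toy one-step policy (the reward, the 50/50 alternation, and the REINFORCE-style estimator are illustrative assumptions, not the paper's method):

```python
# Toy sketch of alternating SFT and RL updates on a one-step "policy" over a
# tiny vocabulary. The reward, the alternation schedule, and the REINFORCE-style
# estimator are illustrative assumptions, not the paper's training recipe.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["safe_refusal", "helpful_answer", "harmful_answer"]
logits = rng.normal(size=len(vocab))              # parameters of the toy policy

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def sft_grad(logits, target_idx):
    """Cross-entropy gradient toward a curated supervised target."""
    p = softmax(logits)
    g = p.copy()
    g[target_idx] -= 1.0                          # d(CE)/d(logits) = p - onehot
    return g

def rl_grad(logits, reward_fn, n_samples=256):
    """REINFORCE estimate: push probability mass toward high-reward samples."""
    p = softmax(logits)
    g = np.zeros_like(logits)
    for _ in range(n_samples):
        a = rng.choice(len(logits), p=p)
        g += -reward_fn(a) * (np.eye(len(logits))[a] - p)   # descent direction
    return g / n_samples

def reward(a):
    return -2.0 if vocab[a] == "harmful_answer" else 1.0    # assumed harmlessness reward

lr, sft_target = 0.5, vocab.index("safe_refusal")
for step in range(200):                           # alternate SFT and RL updates
    g = sft_grad(logits, sft_target) if step % 2 == 0 else rl_grad(logits, reward)
    logits -= lr * g

print({v: round(float(p), 3) for v, p in zip(vocab, softmax(logits))})
```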
2025-01-23
arXiv
On the Reasoning Capacity of AI Models and How to Quantify It
The paper proposes a new method to evaluate the reasoning capabilities of AI models, using positional bias and two phenomenological models to decompose model responses into reasoning, memorization, and guessing. It shows that current models often rely on memorization and pattern matching rather than true logical reasoning.
Recent advances in Large Language Models (LLMs) have intensified the debate
surrounding the fundamental nature of their reasoning capabilities. While
achieving high performance on benchmarks such as GPQA and MMLU, these models
exhibit limitations in more complex reasoning tasks, highlighting the need for
more rigorous evaluation methodologies. We propose a novel phenomenological
approach that goes beyond traditional accuracy metrics to probe the underlying
mechanisms of model behavior, establishing a framework that could broadly
impact how we analyze and understand AI systems. Using positional bias in
multiple-choice reasoning tasks as a case study, we demonstrate how systematic
perturbations can reveal fundamental aspects of model decision-making. To
analyze these behaviors, we develop two complementary phenomenological models:
a Probabilistic Mixture Model (PMM) that decomposes model responses into
reasoning, memorization, and guessing components, and an Information-Theoretic
Consistency (ITC) analysis that quantifies the relationship between model
confidence and strategy selection. Through controlled experiments on reasoning
benchmarks, we show that true reasoning remains challenging for current models,
with apparent success often relying on sophisticated combinations of
memorization and pattern matching rather than genuine logical deduction. More
fundamentally, we demonstrate that accuracy alone often overstates a model's
reasoning abilities, as model behavior can be characterized through underlying
mechanisms in the phase space of cognitive strategies, revealing how models
dynamically balance different approaches when responding to queries. This
framework enables quantitative criteria for real-world deployments, allowing
applications to specify reliability thresholds based on strategy distributions
rather than aggregate performance metrics.
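The abstract does not spell out the PMM's exact form; the sketch below shows one way a three-component mixture could be fit with EM to answers collected under positional shuffling. The component likelihoods (content-tracking "reasoning", label-tracking "memorization", uniform "guessing") are toy assumptions, not the paper's definitions.

```python
# Toy probabilistic mixture over answer strategies. The component definitions
# (content-tracking "reasoning", label-tracking "memorization", uniform
# "guessing") are illustrative assumptions, not the paper's exact likelihoods.
import numpy as np

rng = np.random.default_rng(1)
n_options, n_trials = 4, 2000

# Simulate shuffled multiple-choice presentations: the correct option lands at
# correct_pos, while the label remembered from the canonical ordering would
# point at orig_pos. A hidden strategy mix generates the observed answers.
true_w = np.array([0.5, 0.3, 0.2])                 # reasoning, memorization, guessing
correct_pos = rng.integers(n_options, size=n_trials)
orig_pos = rng.integers(n_options, size=n_trials)
strategy = rng.choice(3, size=n_trials, p=true_w)
answer = np.where(strategy == 0, correct_pos,
         np.where(strategy == 1, orig_pos,
                  rng.integers(n_options, size=n_trials)))

# Likelihood of each observed answer under each pure strategy.
lik = np.stack([
    (answer == correct_pos).astype(float),          # reasoning
    (answer == orig_pos).astype(float),             # memorization
    np.full(n_trials, 1.0 / n_options),             # guessing
], axis=1)

# EM on the mixture weights; should land near the simulated (0.5, 0.3, 0.2).
w = np.ones(3) / 3
for _ in range(100):
    resp = w * lik                                  # responsibilities (E-step)
    resp /= resp.sum(axis=1, keepdims=True)
    w = resp.mean(axis=0)                           # weight update (M-step)

print("estimated weights (reasoning, memorization, guessing):", w.round(3))
```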
2025-01-22
arXiv
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
The paper introduces DeepSeek-R1-Zero, a model trained with reinforcement learning that exhibits strong reasoning capabilities but faces readability and language mixing issues. To improve these aspects, DeepSeek-R1 is developed, which uses multi-stage training and cold-start data, achieving performance on par with OpenAI-o1-1217. The models and additional resources are open-sourced.
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and
DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement
learning (RL) without supervised fine-tuning (SFT) as a preliminary step,
demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero
naturally emerges with numerous powerful and intriguing reasoning behaviors.
However, it encounters challenges such as poor readability and language
mixing. To address these issues and further enhance reasoning performance, we
introduce DeepSeek-R1, which incorporates multi-stage training and cold-start
data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217
on reasoning tasks. To support the research community, we open-source
DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B,
70B) distilled from DeepSeek-R1 based on Qwen and Llama.
2025-01-22
arXiv
Kimi k1.5: Scaling Reinforcement Learning with LLMs
The paper introduces Kimi k1.5, a multi-modal LLM trained with reinforcement learning (RL), which achieves state-of-the-art reasoning performance across multiple benchmarks without relying on complex techniques. It also presents effective long2short methods that improve short-CoT models, significantly outperforming existing models.
Language model pretraining with next token prediction has proved effective
for scaling compute but is limited by the amount of available training data.
Scaling reinforcement learning (RL) unlocks a new axis for the continued
improvement of artificial intelligence, with the promise that large language
models (LLMs) can scale their training data by learning to explore with
rewards. However, prior published work has not produced competitive results. In
light of this, we report on the training practice of Kimi k1.5, our latest
multi-modal LLM trained with RL, including its RL training techniques,
multi-modal data recipes, and infrastructure optimization. Long context scaling
and improved policy optimization methods are key ingredients of our approach,
which establishes a simple yet effective RL framework without relying on more
complex techniques such as Monte Carlo tree search, value functions, and
process reward models. Notably, our system achieves state-of-the-art reasoning
performance across multiple benchmarks and modalities -- e.g., 77.5 on AIME,
96.2 on MATH 500, 94th percentile on Codeforces, 74.9 on MathVista -- matching
OpenAI's o1. Moreover, we present effective long2short methods that use
long-CoT techniques to improve short-CoT models, yielding state-of-the-art
short-CoT reasoning results -- e.g., 60.8 on AIME, 94.6 on MATH 500, 47.3 on
LiveCodeBench -- outperforming existing short-CoT models such as GPT-4o and
Claude Sonnet 3.5 by a large margin (up to +550%).
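The abstract does not detail the long2short methods; one plausible recipe (an assumption for illustration, not confirmed by the abstract) is to sample several long-CoT solutions and keep the shortest one that verifies as a fine-tuning target for a short-CoT model. The stubs below stand in for real model and verifier calls:

```python
# Hedged sketch of one possible long2short data pipeline: sample several
# long-CoT solutions, keep the shortest one that verifies, and use it as a
# fine-tuning target for a short-CoT model. sample_long_cot and is_correct
# are hypothetical stubs, not APIs from the paper.
import random

random.seed(0)

def sample_long_cot(question: str) -> str:
    """Stand-in for a long-CoT model; returns a rationale of random length."""
    steps = random.randint(3, 12)
    return " ".join(f"step{i}: ..." for i in range(steps)) + " answer: 42"

def is_correct(question: str, solution: str) -> bool:
    """Stand-in verifier, e.g. exact match against a reference answer."""
    return solution.strip().endswith("answer: 42")

def long2short_pair(question: str, n_samples: int = 8):
    """Return (question, shortest correct rationale), or None if all samples fail."""
    candidates = [sample_long_cot(question) for _ in range(n_samples)]
    correct = [c for c in candidates if is_correct(question, c)]
    return (question, min(correct, key=len)) if correct else None

print(long2short_pair("What is 6 * 7?"))
```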
2025-01-20
arXiv
Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training
This paper introduces Agent-R, an iterative self-training framework that enables language model agents to reflect and recover from errors in real-time. It uses MCTS to construct training data from erroneous trajectories and a model-guided critique mechanism for timely revision. Experiments show that Agent-R improves error recovery and outperforms baseline methods.
Large Language Model (LLM) agents are increasingly pivotal for addressing
complex tasks in interactive environments. Existing work mainly focuses on
enhancing performance through behavior cloning from stronger experts, yet such
approaches often falter in real-world applications, mainly because the agents
cannot recover from their own errors. Step-level critique data could teach
this kind of recovery, but it is difficult and
expensive to collect. Automating and dynamically constructing self-critique
datasets is thus crucial to empowering models with intelligent agent
capabilities. In this work, we propose an iterative self-training framework,
Agent-R, that enables language agents to reflect on the fly. Unlike traditional
methods that reward or penalize actions based on correctness, Agent-R leverages
Monte Carlo Tree Search (MCTS) to construct training data that recovers
correct trajectories from
erroneous ones. A key challenge of agent reflection lies in the necessity for
timely revision rather than waiting until the end of a rollout. To address
this, we introduce a model-guided critique construction mechanism: the actor
model identifies the first error step (within its current capability) in a
failed trajectory. Starting from that step, we splice the failed prefix with an
adjacent correct path that shares the same parent node in the tree. This
strategy enables the model to learn reflection based on its current policy,
thereby yielding
better learning efficiency. To further explore the scalability of this
self-improvement paradigm, we investigate iterative refinement of both error
correction capabilities and dataset construction. Our findings demonstrate that
Agent-R continuously improves the model's ability to recover from errors and
enables timely error correction. Experiments on three interactive environments
show that Agent-R effectively equips agents to correct erroneous actions while
avoiding loops, achieving superior performance compared to baseline methods
(+5.59%).
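The splice step can be pictured with plain data structures: keep the failed prefix up to the first identified error, then continue with a correct path that hangs off the same parent node. The trajectory format, reflection marker, and tree lookup below are illustrative assumptions, not Agent-R's implementation:

```python
# Toy splice of a revision trajectory: failed prefix up to the first error,
# a reflection marker, then a correct sibling path under the same parent node.
# The data structures and the hard-coded error index are illustrative only.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Node:
    action: str
    children: list[Node] = field(default_factory=list)
    is_correct_leaf: bool = False

def correct_continuation(node: Node) -> list[str] | None:
    """Depth-first search for any action sequence below `node` that ends well."""
    if node.is_correct_leaf:
        return []
    for child in node.children:
        tail = correct_continuation(child)
        if tail is not None:
            return [child.action] + tail
    return None

def splice(failed_actions: list[str], parents: list[Node], first_error: int) -> list[str]:
    """Failed prefix before the error + reflection + a correct sibling path."""
    tail = correct_continuation(parents[first_error])
    assert tail is not None, "no correct sibling path under this parent"
    return failed_actions[:first_error] + ["<reflect: previous step was wrong>"] + tail

# Tiny example tree: after "open_drawer", "grab_fork" fails but "grab_key" succeeds.
good = Node("grab_key", is_correct_leaf=True)
bad = Node("grab_fork", children=[Node("give_up")])
root = Node("start", children=[Node("open_drawer", children=[bad, good])])

failed = ["open_drawer", "grab_fork", "give_up"]
parents = [root, root.children[0], bad]          # parent node of each failed action
print(splice(failed, parents, first_error=1))
# -> ['open_drawer', '<reflect: previous step was wrong>', 'grab_key']
```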
2025-01-17
arXiv
Evolving Deeper LLM Thinking
The paper introduces Mind Evolution, an evolutionary search strategy for scaling inference in Large Language Models, which outperforms other methods like Best-of-N and Sequential Revision in natural language planning tasks without the need for a formal solver.
We explore an evolutionary search strategy for scaling inference time compute
in Large Language Models. The proposed approach, Mind Evolution, uses a
language model to generate, recombine and refine candidate responses. The
proposed approach avoids the need to formalize the underlying inference problem
whenever a solution evaluator is available. Controlling for inference cost, we
find that Mind Evolution significantly outperforms other inference strategies
such as Best-of-N and Sequential Revision in natural language planning tasks.
In the TravelPlanner and Natural Plan benchmarks, Mind Evolution solves more
than 98% of the problem instances using Gemini 1.5 Pro without the use of a
formal solver.
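A minimal sketch of the generate, recombine, and refine loop against a black-box evaluator follows; the LLM stubs, fitness function, and truncation selection are assumptions chosen so the example runs, not Mind Evolution's actual configuration:

```python
# Toy evolutionary search over candidate "plans". The llm_* stubs stand in for
# real language-model calls; here they just manipulate strings so the loop runs.
# Selection, population size, and the fitness function are illustrative choices.
import random

random.seed(0)
TARGET = "visit museum, lunch, park, dinner"      # toy plan the evaluator prefers

def evaluate(plan: str) -> float:
    """Black-box solution evaluator: fraction of target items present."""
    want = set(TARGET.split(", "))
    return len(want & set(plan.split(", "))) / len(want)

def llm_generate() -> str:
    pool = ["visit museum", "lunch", "park", "dinner", "shopping", "nap"]
    return ", ".join(random.sample(pool, k=3))

def llm_recombine(a: str, b: str) -> str:
    items = list(dict.fromkeys(a.split(", ") + b.split(", ")))   # dedupe, keep order
    return ", ".join(random.sample(items, k=min(4, len(items))))

def llm_refine(plan: str) -> str:
    items = plan.split(", ")
    if "dinner" not in items:
        items.append("dinner")                    # stand-in for critique-driven edits
    return ", ".join(items)

population = [llm_generate() for _ in range(8)]
for generation in range(10):
    parents = sorted(population, key=evaluate, reverse=True)[:4]   # truncation selection
    children = [llm_refine(llm_recombine(random.choice(parents), random.choice(parents)))
                for _ in range(8)]
    population = parents + children

best = max(population, key=evaluate)
print(best, evaluate(best))
```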
2025-01-16
arXiv
Foundations of Large Language Models
The book focuses on foundational concepts of large language models, covering pre-training, generative models, prompting techniques, and alignment methods. It is designed for college students, professionals, and practitioners in NLP and related fields.
This is a book about large language models. As indicated by the title, it
primarily focuses on foundational concepts rather than comprehensive coverage
of all cutting-edge technologies. The book is structured into four main
chapters, each exploring a key area: pre-training, generative models, prompting
techniques, and alignment methods. It is intended for college students,
professionals, and practitioners in natural language processing and related
fields, and can serve as a reference for anyone interested in large language
models.
2025-01-02
arXiv
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
FlashInfer is a customizable and efficient attention engine for LLM serving, which optimizes memory access and reduces redundancy using block-sparse format and composable formats. It supports Just-In-Time (JIT) compilation for flexibility and includes a load-balanced scheduling algorithm. FlashInfer significantly improves performance, reducing inter-token latency by 29-69% and latency for long-context inference by 28-30%.
Transformers, driven by attention mechanisms, form the foundation of large
language models (LLMs). As these models scale up, efficient GPU attention
kernels become essential for high-throughput and low-latency inference. Diverse
LLM applications demand flexible and high-performance attention solutions. We
present FlashInfer: a customizable and efficient attention engine for LLM
serving. FlashInfer tackles KV-cache storage heterogeneity using block-sparse
format and composable formats to optimize memory access and reduce redundancy.
It also offers a customizable attention template, enabling adaptation to
various settings through Just-In-Time (JIT) compilation. Additionally,
FlashInfer's load-balanced scheduling algorithm adjusts to the dynamism of user
requests while maintaining compatibility with CUDAGraph, which requires static
configuration. FlashInfer has been integrated into leading LLM serving
frameworks like SGLang, vLLM and MLC-Engine. Comprehensive kernel-level and
end-to-end evaluations demonstrate FlashInfer's ability to significantly boost
kernel performance across diverse inference scenarios: compared to
state-of-the-art LLM serving solutions, FlashInfer achieves a 29-69%
inter-token latency reduction over compiler backends on LLM serving benchmarks,
a 28-30% latency reduction for long-context inference, and a 13-17% speedup for
LLM serving with parallel generation.
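The abstract describes the KV-cache handling at a high level; the numpy sketch below makes the paged (block-sparse) access pattern concrete, with each request gathering keys and values through a page table into a shared pool. Page size, pool layout, and names are assumptions, not FlashInfer's API:

```python
# Toy paged-KV attention in numpy: keys/values live in a shared pool of
# fixed-size pages; each request references its pages through a page table,
# so requests of different lengths share one contiguous buffer. Shapes and
# names are illustrative, not FlashInfer's actual kernel interface.
import numpy as np

rng = np.random.default_rng(0)
page_size, head_dim, n_pages = 4, 8, 16

# Shared KV pool: [n_pages, page_size, head_dim]
k_pool = rng.normal(size=(n_pages, page_size, head_dim)).astype(np.float32)
v_pool = rng.normal(size=(n_pages, page_size, head_dim)).astype(np.float32)

def paged_attention(q, page_table, last_page_len):
    """Single-query attention over the pages owned by one request."""
    k = k_pool[page_table].reshape(-1, head_dim)         # gather this request's pages
    v = v_pool[page_table].reshape(-1, head_dim)
    n_tokens = (len(page_table) - 1) * page_size + last_page_len
    k, v = k[:n_tokens], v[:n_tokens]                    # trim the partial last page
    scores = (k @ q) / np.sqrt(head_dim)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v                                   # [head_dim]

q = rng.normal(size=head_dim).astype(np.float32)
out = paged_attention(q, page_table=[3, 7, 2], last_page_len=2)   # 4 + 4 + 2 = 10 tokens
print(out.shape)                                         # (8,)
```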
2024-12-30
arXiv
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation
The paper introduces self-invoking code generation, a new task to evaluate LLMs' progressive reasoning and problem-solving capabilities. It proposes three new benchmarks (HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro) and finds that while most LLMs perform well on traditional benchmarks, their performance drops significantly on self-invoking tasks. The study also identifies failure modes in the evaluation results, highlighting the need for further research in this area.
We introduce self-invoking code generation, a new task designed to evaluate
the progressive reasoning and problem-solving capabilities of LLMs. In this
task, models are presented with a base problem and a related, more complex
problem. They must solve the base problem and then utilize its solution to
address the more complex one. This work features three key contributions.
First, we propose a general recipe for generating more challenging versions of
existing benchmarks, resulting in three new benchmarks: HumanEval Pro, MBPP
Pro, and BigCodeBench-Lite Pro, specifically designed to assess LLMs on
self-invoking code generation. Second, from the analysis of experimental
results over twenty LLMs on our benchmarks, we have two important observations:
(i) Most LLMs excel in traditional code generation benchmarks like HumanEval
and MBPP, but their performance declines on self-invoking tasks. For example,
o1-mini achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro.
(ii) On self-invoking code generation tasks, instruction-tuned models
demonstrate only marginal improvements compared to the base models. Third, we
disclose the types of failure modes that exist in our evaluation results. All
these results underscore the need for further advancements in self-invoking
code generation tasks and provide a new direction for future research on
enhancing LLMs' code reasoning capabilities.
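To make the task format concrete, here is a small invented base/self-invoking pair (not taken from HumanEval Pro or MBPP Pro): the second problem is solved cleanly only by reusing the solution to the first.

```python
# Illustrative base + self-invoking problem pair (invented, not from the benchmarks).
# The model must first solve the base problem, then reuse that solution inside
# the more complex one.

# Base problem: return the prime factors of n, in ascending order.
def prime_factors(n: int) -> list[int]:
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

# Self-invoking problem: count how many numbers in nums are squarefree,
# i.e. no prime factor repeats -- solved by calling the base solution.
def count_squarefree(nums: list[int]) -> int:
    def squarefree(n: int) -> bool:
        f = prime_factors(n)
        return len(set(f)) == len(f)
    return sum(1 for n in nums if squarefree(n))

assert prime_factors(12) == [2, 2, 3]
assert count_squarefree([10, 12, 15, 18]) == 2   # 10 and 15 are squarefree
```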
2024-12-30
arXiv
Distributed Mixture-of-Agents for Edge Inference with Large Language Models
This paper explores a distributed Mixture-of-Agents (MoA) architecture for edge inference with large language models, using decentralized gossip algorithms to enable collaboration among edge devices. It ensures queuing stability and demonstrates that certain MoA configurations produce higher-quality responses, as evaluated on the AlpacaEval 2.0 benchmark.
Mixture-of-Agents (MoA) has recently been proposed as a method to enhance
performance of large language models (LLMs), enabling multiple individual LLMs
to work together for collaborative inference. This collaborative approach
results in improved responses to user prompts compared to relying on a single
LLM. In this paper, we consider such an MoA architecture in a distributed
setting, where LLMs operate on individual edge devices, each uniquely
associated with a user and equipped with its own distributed computing power.
These devices exchange information using decentralized gossip algorithms,
allowing different device nodes to talk without the supervision of a
centralized server. In the considered setup, each user has their own LLM to
address their prompts. Additionally, the devices gossip either their
own user-specific prompts or augmented prompts to generate more refined answers
to certain queries. User prompts are temporarily stored in the device queues
when their corresponding LLMs are busy. Given the memory limitations of edge
devices, it is crucial to ensure that the average queue sizes in the system
remain bounded. In this paper, we address this by theoretically calculating the
queuing stability conditions for the device queues under reasonable
assumptions, which we validate experimentally as well. Further, we demonstrate
through experiments, leveraging open-source LLMs for the implementation of
distributed MoA, that certain MoA configurations produce higher-quality
responses compared to others, as evaluated on the AlpacaEval 2.0 benchmark. The
implementation is available at:
https://github.com/purbeshmitra/distributed_moa.
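The queuing-stability condition can be illustrated with a toy discrete-time simulation of device queues; the Bernoulli arrival/service model and the rates below are assumptions chosen for illustration, not the paper's analysis:

```python
# Toy discrete-time simulation of prompt queues on edge devices. Each step,
# a prompt arrives at each device with probability arrival_p, and one queued
# prompt is served (gossiped to peers and answered) with probability service_p.
# In this toy model, arrival_p < service_p keeps the average queue bounded.
import random

random.seed(0)
n_devices, steps = 4, 20_000
arrival_p, service_p = 0.3, 0.5                  # illustrative per-step probabilities

queues = [0] * n_devices
total_len = 0.0
for t in range(steps):
    for i in range(n_devices):
        if random.random() < arrival_p:          # new user prompt arrives
            queues[i] += 1
        if queues[i] > 0 and random.random() < service_p:
            queues[i] -= 1                       # prompt processed by the MoA round
    total_len += sum(queues) / n_devices

print("time-averaged queue length per device:", round(total_len / steps, 2))
```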