Latest Research Papers
2025-01-23
arXiv
On the Reasoning Capacity of AI Models and How to Quantify It
The paper proposes a new method for evaluating the reasoning capabilities of AI models, using positional bias in multiple-choice tasks as a probe and two phenomenological models that decompose model responses into reasoning, memorization, and guessing components. It shows that current models often rely on memorization and pattern matching rather than genuine logical reasoning.
Recent advances in Large Language Models (LLMs) have intensified the debate
surrounding the fundamental nature of their reasoning capabilities. While
achieving high performance on benchmarks such as GPQA and MMLU, these models
exhibit limitations in more complex reasoning tasks, highlighting the need for
more rigorous evaluation methodologies. We propose a novel phenomenological
approach that goes beyond traditional accuracy metrics to probe the underlying
mechanisms of model behavior, establishing a framework that could broadly
impact how we analyze and understand AI systems. Using positional bias in
multiple-choice reasoning tasks as a case study, we demonstrate how systematic
perturbations can reveal fundamental aspects of model decision-making. To
analyze these behaviors, we develop two complementary phenomenological models:
a Probabilistic Mixture Model (PMM) that decomposes model responses into
reasoning, memorization, and guessing components, and an Information-Theoretic
Consistency (ITC) analysis that quantifies the relationship between model
confidence and strategy selection. Through controlled experiments on reasoning
benchmarks, we show that true reasoning remains challenging for current models,
with apparent success often relying on sophisticated combinations of
memorization and pattern matching rather than genuine logical deduction. More
fundamentally, we demonstrate that accuracy alone often overstates a model's
reasoning abilities, as model behavior can be characterized through underlying
mechanisms in the phase space of cognitive strategies, revealing how models
dynamically balance different approaches when responding to queries. This
framework yields quantitative criteria for real-world deployments, allowing
applications to specify reliability thresholds based on strategy distributions
rather than aggregate performance metrics.
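For intuition, here is a minimal sketch of the kind of mixture fit the abstract describes: multiple-choice answers are collected under positional perturbations, a "reasoning" component tracks the correct option wherever it lands, a "positional" component tracks a fixed answer slot (a memorization-like bias), and a "guessing" component is uniform. The three-strategy likelihood, the function names, and the toy data below are illustrative assumptions, not the authors' implementation.

import numpy as np
from scipy.optimize import minimize

def strategy_likelihoods(chosen_pos, correct_pos, biased_pos, n_options):
    """Per-strategy probability of the observed choice on one shuffled trial."""
    p_reason = 1.0 if chosen_pos == correct_pos else 0.0    # tracks the content
    p_position = 1.0 if chosen_pos == biased_pos else 0.0   # tracks a fixed slot
    p_guess = 1.0 / n_options                                # uniform guessing
    return np.array([p_reason, p_position, p_guess])

def fit_pmm(trials, n_options=4, biased_pos=0):
    """trials: iterable of (chosen_pos, correct_pos) over shuffled orderings."""
    L = np.array([strategy_likelihoods(c, k, biased_pos, n_options)
                  for c, k in trials])

    def neg_log_likelihood(theta):
        w = np.exp(theta) / np.exp(theta).sum()  # softmax keeps weights on the simplex
        return -np.log(L @ w + 1e-12).sum()

    res = minimize(neg_log_likelihood, x0=np.zeros(3), method="Nelder-Mead")
    w = np.exp(res.x) / np.exp(res.x).sum()
    return dict(zip(["reasoning", "positional", "guessing"], w))

# Toy usage: the "model" answers correctly 60% of the time, otherwise at random.
rng = np.random.default_rng(0)
trials = [(int(k) if rng.random() < 0.6 else int(rng.integers(4)), int(k))
          for k in rng.integers(0, 4, size=500)]
print(fit_pmm(trials))

The fitted weights give the kind of strategy distribution the abstract proposes as a deployment criterion, rather than a single aggregate accuracy number.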
2025-01-22
arXiv
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
The paper introduces DeepSeek-R1-Zero, a model trained with pure reinforcement learning (no supervised fine-tuning) that exhibits strong reasoning capabilities but suffers from poor readability and language mixing. To address these issues, the authors develop DeepSeek-R1, which adds multi-stage training and cold-start data before RL and achieves performance on par with OpenAI-o1-1217 on reasoning tasks. The models, along with six distilled dense models, are open-sourced.
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and
DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement
learning (RL) without supervised fine-tuning (SFT) as a preliminary step,
demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero
naturally develops numerous powerful and intriguing reasoning behaviors.
However, it encounters challenges such as poor readability and language
mixing. To address these issues and further enhance reasoning performance, we
introduce DeepSeek-R1, which incorporates multi-stage training and cold-start
data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217
on reasoning tasks. To support the research community, we open-source
DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B,
70B) distilled from DeepSeek-R1 based on Qwen and Llama.
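As a rough illustration of the distillation step mentioned above: a smaller dense student is fine-tuned with a standard causal-LM loss on reasoning traces sampled from the teacher. This is a minimal sketch under assumptions, not the released recipe; the base checkpoint name, the toy trace, and the hyperparameters are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# In practice these would be prompts plus chain-of-thought responses sampled
# from the teacher (DeepSeek-R1); a single toy example stands in here.
teacher_traces = [
    {"prompt": "What is 12 * 13?",
     "response": "<think>12 * 13 = 12 * 10 + 12 * 3 = 156</think> 156"},
]

base = "Qwen/Qwen2.5-1.5B"  # assumed student base; the released checkpoints may differ
tokenizer = AutoTokenizer.from_pretrained(base)
student = AutoModelForCausalLM.from_pretrained(base)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

student.train()
for example in teacher_traces:
    text = example["prompt"] + "\n" + example["response"] + (tokenizer.eos_token or "")
    batch = tokenizer(text, return_tensors="pt")
    # Standard causal-LM loss: the student learns to reproduce the teacher's trace.
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()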