Latest Research Papers
2025-01-28
arXiv
Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies
The paper discusses the limitations of using Reinforcement Learning (RL) to ensure safety in advanced LLMs like DeepSeek-R1 and proposes a hybrid approach combining RL and Supervised Fine-Tuning (SFT) to mitigate harmful outputs.
Large Language Models (LLMs) have achieved remarkable progress in reasoning,
alignment, and task-specific performance. However, ensuring harmlessness in
these systems remains a critical challenge, particularly in advanced models
like DeepSeek-R1. This paper examines the limitations of Reinforcement Learning
(RL) as the primary approach for reducing harmful outputs in DeepSeek-R1 and
compares it with Supervised Fine-Tuning (SFT). While RL improves reasoning
capabilities, it faces challenges such as reward hacking, generalization
failures, language mixing, and high computational costs. We propose hybrid
training approaches combining RL and SFT to achieve robust harmlessness
reduction. Usage recommendations and future directions for deploying
DeepSeek-R1 responsibly are also presented.
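As a rough illustration of the proposed hybrid direction (not code from the paper), the sketch below interleaves supervised steps on curated safe demonstrations with REINFORCE-style steps on a harmlessness reward; the `sample_with_logprob` and `safety_reward` interfaces are hypothetical placeholders.

```python
import torch.nn.functional as F

def sft_step(model, optimizer, tokens):
    """Supervised step on a curated safe demonstration (next-token loss)."""
    logits = model(tokens[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def rl_step(model, optimizer, prompt, safety_reward, baseline=0.0):
    """REINFORCE-style step on a harmlessness reward (e.g. a safety classifier)."""
    response, logprob = model.sample_with_logprob(prompt)  # assumed helper
    reward = safety_reward(prompt, response)               # assumed reward function
    loss = -(reward - baseline) * logprob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward

def hybrid_epoch(model, optimizer, safe_demos, prompts, safety_reward):
    """Interleave SFT on safe demonstrations with RL on a safety reward."""
    for demo_tokens, prompt in zip(safe_demos, prompts):
        sft_step(model, optimizer, demo_tokens)
        rl_step(model, optimizer, prompt, safety_reward)
```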
2025-01-23
arXiv
Improving Video Generation with Human Feedback
This paper introduces a pipeline that uses human feedback to improve video generation, including a new reward model and alignment algorithms. The proposed methods, particularly Flow-DPO and Flow-NRG, show significant improvements over existing techniques.
Video generation has achieved significant advances through rectified flow
techniques, but issues like unsmooth motion and misalignment between videos and
prompts persist. In this work, we develop a systematic pipeline that harnesses
human feedback to mitigate these problems and refine the video generation
model. Specifically, we begin by constructing a large-scale human preference
dataset focused on modern video generation models, incorporating pairwise
annotations across multi-dimensions. We then introduce VideoReward, a
multi-dimensional video reward model, and examine how annotations and various
design choices impact its rewarding efficacy. From a unified reinforcement
learning perspective aimed at maximizing reward with KL regularization, we
introduce three alignment algorithms for flow-based models by extending those
from diffusion models. These include two training-time strategies: direct
preference optimization for flow (Flow-DPO) and reward weighted regression for
flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies
reward guidance directly to noisy videos. Experimental results indicate that
VideoReward significantly outperforms existing reward models, and Flow-DPO
demonstrates superior performance compared to both Flow-RWR and standard
supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom
weights to multiple objectives during inference, meeting personalized video
quality needs. Project page: https://gongyeliu.github.io/videoalign.
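The abstract does not spell out Flow-DPO's objective; for orientation, the sketch below shows the standard KL-regularized DPO pairwise loss that such flow-based variants are described as extending, with dummy per-sample log-probabilities. The `beta` coefficient plays the role of the KL-regularization strength in the unified reward-maximization view the authors describe.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Pairwise preference loss: push the policy to favor the preferred video
    relative to a frozen reference model (illustrative, not the paper's exact
    objective over flow trajectories)."""
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    return -F.logsigmoid(beta * margin).mean()

# Usage with dummy log-probabilities for a preferred/rejected video pair:
lw, ll = torch.tensor([-12.0]), torch.tensor([-14.0])
rw, rl = torch.tensor([-12.5]), torch.tensor([-13.0])
print(dpo_loss(lw, ll, rw, rl))
```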
2025-01-22
arXiv
Kimi k1.5: Scaling Reinforcement Learning with LLMs
The paper introduces Kimi k1.5, a multi-modal LLM trained with reinforcement learning (RL), which achieves state-of-the-art reasoning performance across multiple benchmarks without relying on complex techniques. It also presents effective long2short methods that improve short-CoT models, significantly outperforming existing models.
Language model pretraining with next token prediction has proved effective
for scaling compute but is limited by the amount of available training data.
Scaling reinforcement learning (RL) unlocks a new axis for the continued
improvement of artificial intelligence, with the promise that large language
models (LLMs) can scale their training data by learning to explore with
rewards. However, prior published work has not produced competitive results. In
light of this, we report on the training practice of Kimi k1.5, our latest
multi-modal LLM trained with RL, including its RL training techniques,
multi-modal data recipes, and infrastructure optimization. Long context scaling
and improved policy optimization methods are key ingredients of our approach,
which establishes a simple yet effective RL framework without relying on more
complex techniques such as Monte Carlo tree search, value functions, and
process reward models. Notably, our system achieves state-of-the-art reasoning
performance across multiple benchmarks and modalities -- e.g., 77.5 on AIME,
96.2 on MATH 500, 94th percentile on Codeforces, 74.9 on MathVista -- matching
OpenAI's o1. Moreover, we present effective long2short methods that use
long-CoT techniques to improve short-CoT models, yielding state-of-the-art
short-CoT reasoning results -- e.g., 60.8 on AIME, 94.6 on MATH 500, 47.3 on
LiveCodeBench -- outperforming existing short-CoT models such as GPT-4o and
Claude Sonnet 3.5 by a large margin (up to +550%).
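To make the "no value functions, no MCTS, no process reward models" claim concrete, here is a hedged sketch of a value-function-free policy update that uses the mean reward of k sampled responses as its baseline; the paper's actual objective and sampling interface (`sample_with_logprob` below) are not given in the abstract and are assumed.

```python
import torch

def policy_update(model, optimizer, prompt, reward_fn, k=8):
    """Sample k responses and use their mean reward as the baseline (no critic)."""
    samples = [model.sample_with_logprob(prompt) for _ in range(k)]   # assumed helper
    rewards = torch.tensor([reward_fn(prompt, resp) for resp, _ in samples])
    logprobs = torch.stack([lp for _, lp in samples])
    advantage = rewards - rewards.mean()   # mean-reward baseline replaces a value function
    loss = -(advantage * logprobs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()
```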
2025-01-22
arXiv
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
The paper introduces DeepSeek-R1-Zero, a model trained with reinforcement learning that exhibits strong reasoning capabilities but faces readability and language mixing issues. To improve these aspects, DeepSeek-R1 is developed, which uses multi-stage training and cold-start data, achieving performance on par with OpenAI-o1-1217. The models and additional resources are open-sourced.
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and
DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement
learning (RL) without supervised fine-tuning (SFT) as a preliminary step,
demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero
naturally develops numerous powerful and intriguing reasoning behaviors.
However, it encounters challenges such as poor readability and language
mixing. To address these issues and further enhance reasoning performance, we
introduce DeepSeek-R1, which incorporates multi-stage training and cold-start
data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217
on reasoning tasks. To support the research community, we open-source
DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B,
70B) distilled from DeepSeek-R1 based on Qwen and Llama.
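The distillation step mentioned at the end of the abstract could, in its simplest form, look like the sketch below: sample a reasoning trace from the large teacher and fine-tune a smaller dense student on it with a next-token loss. The interfaces (`teacher.generate`, `student.tokenize`) are placeholders, not DeepSeek's code.

```python
import torch.nn.functional as F

def distill_step(teacher, student, optimizer, prompt):
    """One distillation step: fine-tune the student on a teacher-generated trace."""
    trace = teacher.generate(prompt)            # reasoning chain + final answer (assumed API)
    tokens = student.tokenize(prompt + trace)   # assumed tokenizer helper, shape [1, T]
    logits = student(tokens[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```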
2025-01-22
arXiv
Evolution and The Knightian Blindspot of Machine Learning
The paper highlights a critical blind spot in machine learning, specifically its inability to handle Knightian uncertainty, and contrasts this with the robustness of biological evolution. It argues for the importance of addressing this gap to create more robust AI, especially in open-world scenarios.
This paper claims that machine learning (ML) largely overlooks an important
facet of general intelligence: robustness to a qualitatively unknown future in
an open world. Such robustness relates to Knightian uncertainty (KU) in
economics, i.e. uncertainty that cannot be quantified, which is excluded from
consideration in ML's key formalisms. This paper aims to identify this blind
spot, argue its importance, and catalyze research into addressing it, which we
believe is necessary to create truly robust open-world AI. To help illuminate
the blind spot, we contrast one area of ML, reinforcement learning (RL), with
the process of biological evolution. Despite staggering ongoing progress, RL
still struggles in open-world settings, often failing under unforeseen
conditions. For example, the idea of zero-shot transferring a self-driving car
policy trained only in the US to the UK currently seems exceedingly ambitious.
In dramatic contrast, biological evolution routinely produces agents that
thrive within an open world, sometimes even in situations that are remarkably
out-of-distribution (e.g. invasive species; or humans, who do undertake such
zero-shot international driving). Interestingly, evolution achieves such
robustness without explicit theory, formalisms, or mathematical gradients. We
explore the assumptions underlying RL's typical formalisms, showing how they
limit RL's engagement with the unknown unknowns characteristic of an
ever-changing complex world. Further, we identify mechanisms through which
evolutionary processes foster robustness to novel and unpredictable challenges,
and discuss potential pathways to algorithmically embody them. The conclusion
is that the intriguing remaining fragility of ML may result from blind spots in
its formalisms, and that significant gains may result from direct confrontation
with the challenge of KU.
2024-12-18
arXiv
Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective
This paper outlines a roadmap to reproducing OpenAI o1 from a reinforcement learning perspective, emphasizing four key components: policy initialization, reward design, search, and learning. These components enable the model to develop human-like reasoning, generate high-quality solutions, and improve performance with more data and parameters. The analysis provides insights into how learning and search drive the advancement of large language models.
OpenAI o1 represents a significant milestone in Artificial Intelligence,
achieving expert-level performance on many challenging tasks that require
strong reasoning ability. OpenAI has stated that the main technique behind o1
is reinforcement learning. Recent works use alternative approaches such as
knowledge distillation to imitate o1's reasoning style, but their effectiveness
is limited by the capability ceiling of the teacher model. Therefore, this
paper analyzes the roadmap to achieving o1 from the perspective of
reinforcement learning, focusing on four key components: policy initialization,
reward design, search, and learning. Policy initialization enables models to
develop human-like reasoning behaviors, equipping them with the ability to
effectively explore solution spaces for complex problems. Reward design
provides dense and effective signals via reward shaping or reward modeling,
which guide both search and learning. Search plays a crucial role in
generating high-quality solutions during both training and testing, and can
produce better solutions with more computation. Learning uses the data
generated by search to improve the policy, achieving better performance with
more parameters and more searched data. Existing open-source projects that
attempt to reproduce o1 can be seen as parts or variants of this roadmap.
Collectively, these components underscore how learning and search drive o1's
advancement and make meaningful contributions to the development of LLMs.
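A minimal sketch of how the roadmap's search and learning components might interact, assuming placeholder `generate`, `reward_model`, and `finetune` interfaces (not code from the paper): sample candidate solutions (search), score them with a reward model (reward design), and improve the policy on the best ones (learning).

```python
def search_and_learn(policy, reward_model, prompts, n_samples=16, n_iters=3):
    """Alternate search (sampling + scoring) with learning on search-generated data."""
    for _ in range(n_iters):
        training_set = []
        for prompt in prompts:
            # Search: sample many candidate solutions from the current policy.
            candidates = [policy.generate(prompt) for _ in range(n_samples)]
            # Reward design: score each candidate with a reward model.
            scores = [reward_model(prompt, c) for c in candidates]
            # Keep the highest-scoring solution found by search.
            best = candidates[max(range(n_samples), key=lambda i: scores[i])]
            training_set.append((prompt, best))
        # Learning: improve the policy on the data produced by search.
        policy = policy.finetune(training_set)   # assumed helper (e.g. SFT or RL)
    return policy
```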