Latest Research Papers
2025-01-28
arXiv
Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies
The paper discusses the limitations of using Reinforcement Learning (RL) to ensure safety in advanced LLMs like DeepSeek-R1 and proposes a hybrid approach combining RL and Supervised Fine-Tuning (SFT) to mitigate harmful outputs.
Large Language Models (LLMs) have achieved remarkable progress in reasoning,
alignment, and task-specific performance. However, ensuring harmlessness in
these systems remains a critical challenge, particularly in advanced models
like DeepSeek-R1. This paper examines the limitations of Reinforcement Learning
(RL) as the primary approach for reducing harmful outputs in DeepSeek-R1 and
compares it with Supervised Fine-Tuning (SFT). While RL improves reasoning
capabilities, it faces challenges such as reward hacking, generalization
failures, language mixing, and high computational costs. We propose hybrid
training approaches combining RL and SFT to achieve robust harmlessness
reduction. Usage recommendations and future directions for deploying
DeepSeek-R1 responsibly are also presented.
2025-01-23
arXiv
A RAG-Based Institutional Assistant
This paper introduces a RAG-based virtual assistant for the University of São Paulo, which integrates relevant document fragments to improve LLM performance. The system's accuracy significantly increases when provided with correct document chunks, highlighting the importance of database access and the limitations of current semantic search methods.
Although large language models (LLMs) demonstrate strong text generation
capabilities, they struggle in scenarios requiring access to structured
knowledge bases or specific documents, limiting their effectiveness in
knowledge-intensive tasks. To address this limitation, retrieval-augmented
generation (RAG) models have been developed, enabling generative models to
incorporate relevant document fragments into their inputs. In this paper, we
design and evaluate a RAG-based virtual assistant specifically tailored for the
University of São Paulo. Our system architecture comprises two key modules: a
retriever and a generative model. We experiment with different types of models
for both components, adjusting hyperparameters such as chunk size and the
number of retrieved documents. Our optimal retriever model achieves a Top-5
accuracy of 30%, while our most effective generative model scores 22.04%
against ground truth answers. Notably, when the correct document chunks are
supplied to the LLMs, accuracy significantly improves to 54.02%, an increase of
over 30 percentage points. Conversely, without contextual input, performance
declines to 13.68%. These findings highlight the critical role of database
access in enhancing LLM performance. They also reveal the limitations of
current semantic search methods in accurately identifying relevant documents
and underscore the ongoing challenges LLMs face in generating precise
responses.
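To make the retriever-plus-generator architecture concrete, here is a minimal
sketch of such a pipeline, not the authors' implementation: embed() and
generate() are hypothetical stand-ins for whatever embedding model and LLM the
assistant uses, and the chunk size and top-k value (5, matching the Top-5
metric above) are the kind of hyperparameters the paper tunes.

```python
import numpy as np

# Hypothetical stand-ins for the system's two modules; any sentence-embedding
# model and any chat LLM could fill these roles.
def embed(texts: list[str]) -> np.ndarray:
    raise NotImplementedError("plug in a sentence-embedding model here")

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in the generative model here")

def chunk(document: str, chunk_size: int = 512) -> list[str]:
    # Chunk size is one of the hyperparameters the paper varies.
    return [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]

def retrieve(query: str, chunks: list[str], chunk_vecs: np.ndarray, k: int = 5) -> list[str]:
    # Top-k retrieval by cosine similarity between query and chunk embeddings.
    q = embed([query])[0]
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    return [chunks[i] for i in np.argsort(-sims)[:k]]

def answer(query: str, documents: list[str]) -> str:
    chunks = [c for d in documents for c in chunk(d)]
    chunk_vecs = embed(chunks)
    context = "\n\n".join(retrieve(query, chunks, chunk_vecs))
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```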
2025-01-23
arXiv
On the Reasoning Capacity of AI Models and How to Quantify It
The paper proposes a new method to evaluate the reasoning capabilities of AI models, using positional bias and two phenomenological models to decompose model responses into reasoning, memorization, and guessing. It shows that current models often rely on memorization and pattern matching rather than true logical reasoning.
Recent advances in Large Language Models (LLMs) have intensified the debate
surrounding the fundamental nature of their reasoning capabilities. While
achieving high performance on benchmarks such as GPQA and MMLU, these models
exhibit limitations in more complex reasoning tasks, highlighting the need for
more rigorous evaluation methodologies. We propose a novel phenomenological
approach that goes beyond traditional accuracy metrics to probe the underlying
mechanisms of model behavior, establishing a framework that could broadly
impact how we analyze and understand AI systems. Using positional bias in
multiple-choice reasoning tasks as a case study, we demonstrate how systematic
perturbations can reveal fundamental aspects of model decision-making. To
analyze these behaviors, we develop two complementary phenomenological models:
a Probabilistic Mixture Model (PMM) that decomposes model responses into
reasoning, memorization, and guessing components, and an Information-Theoretic
Consistency (ITC) analysis that quantifies the relationship between model
confidence and strategy selection. Through controlled experiments on reasoning
benchmarks, we show that true reasoning remains challenging for current models,
with apparent success often relying on sophisticated combinations of
memorization and pattern matching rather than genuine logical deduction. More
fundamentally, we demonstrate that accuracy alone often overstates a model's
reasoning abilities, as model behavior can be characterized through underlying
mechanisms in the phase space of cognitive strategies, revealing how models
dynamically balance different approaches when responding to queries. This
framework enables quantitative criteria for real-world deployments, allowing
applications to specify reliability thresholds based on strategy distributions
rather than aggregate performance metrics.
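As a rough worked illustration of the mixture idea (the paper's actual
parameterization and fitting procedure may differ), the probability of each
answer choice can be written as a weighted combination of a reasoning
component, a memorization component, and uniform guessing:

```python
import numpy as np

def mixture_likelihood(weights, p_reason, p_memo, n_choices):
    """Answer-choice probabilities under a three-component mixture:
    reasoning, memorization, and uniform guessing.

    weights   : (w_reason, w_memo, w_guess), summing to 1
    p_reason  : answer distribution if the model genuinely reasons
    p_memo    : answer distribution if the model recalls a memorized pattern
    n_choices : number of answer options (guessing is uniform over them)
    """
    w_r, w_m, w_g = weights
    p_guess = np.full(n_choices, 1.0 / n_choices)
    return w_r * np.asarray(p_reason) + w_m * np.asarray(p_memo) + w_g * p_guess

# Illustrative numbers only: a model that reasons toward answer A, has
# memorized answer B for this item, and otherwise guesses.
probs = mixture_likelihood(
    weights=(0.5, 0.3, 0.2),
    p_reason=[0.9, 0.05, 0.03, 0.02],  # correct answer is A
    p_memo=[0.0, 1.0, 0.0, 0.0],       # memorized answer is B
    n_choices=4,
)
print(probs)  # answer frequencies the mixture would predict for this item
```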
2025-01-23
arXiv
Improving Video Generation with Human Feedback
This paper introduces a pipeline that uses human feedback to improve video generation, including a new reward model and alignment algorithms. The proposed methods, particularly Flow-DPO and Flow-NRG, show significant improvements over existing techniques.
Video generation has achieved significant advances through rectified flow
techniques, but issues like unsmooth motion and misalignment between videos and
prompts persist. In this work, we develop a systematic pipeline that harnesses
human feedback to mitigate these problems and refine the video generation
model. Specifically, we begin by constructing a large-scale human preference
dataset focused on modern video generation models, incorporating pairwise
annotations across multiple dimensions. We then introduce VideoReward, a
multi-dimensional video reward model, and examine how annotations and various
design choices impact its rewarding efficacy. From a unified reinforcement
learning perspective aimed at maximizing reward with KL regularization, we
introduce three alignment algorithms for flow-based models by extending those
from diffusion models. These include two training-time strategies: direct
preference optimization for flow (Flow-DPO) and reward weighted regression for
flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies
reward guidance directly to noisy videos. Experimental results indicate that
VideoReward significantly outperforms existing reward models, and Flow-DPO
demonstrates superior performance compared to both Flow-RWR and standard
supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom
weights to multiple objectives during inference, meeting personalized video
quality needs. Project page: https://gongyeliu.github.io/videoalign.
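To give a sense of what a DPO-style objective for a rectified-flow model can
look like, here is an illustrative sketch in which the implicit reward of a
video is its flow-matching error measured against a frozen reference model;
the paper's exact objective, conditioning, and weighting may differ.

```python
import torch
import torch.nn.functional as F

def paired_fm_errors(model, ref_model, x0):
    """Flow-matching error of the policy and the frozen reference on the same
    noised sample. x0 are video latents of shape (B, C, T, H, W); model and
    ref_model are callables v(x_t, t) predicting the flow velocity."""
    t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1, 1, 1)
    x1 = torch.randn_like(x0)
    xt = (1 - t) * x1 + t * x0            # rectified-flow interpolation
    target = x0 - x1                       # target velocity
    err = ((model(xt, t) - target) ** 2).mean(dim=(1, 2, 3, 4))
    with torch.no_grad():
        ref_err = ((ref_model(xt, t) - target) ** 2).mean(dim=(1, 2, 3, 4))
    return err, ref_err

def flow_dpo_loss(model, ref_model, x_win, x_lose, beta=500.0):
    """DPO-style preference loss over a (preferred, rejected) video pair."""
    err_w, ref_w = paired_fm_errors(model, ref_model, x_win)
    err_l, ref_l = paired_fm_errors(model, ref_model, x_lose)
    # Push the policy to improve on the preferred video more than on the
    # rejected one, relative to the reference model.
    logits = -beta * ((err_w - ref_w) - (err_l - ref_l))
    return -F.logsigmoid(logits).mean()
```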
2025-01-23
arXiv
One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt
This paper introduces a training-free method, One-Prompt-One-Story, for consistent text-to-image generation that maintains character identity using a single prompt. The method concatenates all prompts into one input and refines the process with Singular-Value Reweighting and Identity-Preserving Cross-Attention. Experiments show its effectiveness compared to existing approaches.
Text-to-image generation models can create high-quality images from input
prompts. However, they struggle to consistently preserve character identity
across generations, a requirement for storytelling. Existing approaches to this
problem typically require extensive training on large datasets or additional
modifications to the original model architectures. This limits their
applicability across different domains and diverse diffusion model
configurations. In this paper, we first observe the inherent capability of
language models, which we term context consistency, to comprehend identity through
context with a single prompt. Drawing inspiration from the inherent context
consistency, we propose a novel training-free method for consistent
text-to-image (T2I) generation, termed "One-Prompt-One-Story" (1Prompt1Story).
Our approach 1Prompt1Story concatenates all prompts into a single input for T2I
diffusion models, initially preserving character identities. We then refine the
generation process using two novel techniques: Singular-Value Reweighting and
Identity-Preserving Cross-Attention, ensuring better alignment with the input
description for each frame. In our experiments, we compare our method against
various existing consistent T2I generation approaches to demonstrate its
effectiveness through quantitative metrics and qualitative assessments. Code is
available at https://github.com/byliutao/1Prompt1Story.
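The prompt-construction step at the core of the method is easy to illustrate;
the sketch below uses invented example prompts and omits the Singular-Value
Reweighting and Identity-Preserving Cross-Attention refinements, which operate
inside the diffusion model.

```python
# Invented example prompts; only the single-prompt concatenation idea is shown.
identity_prompt = "a watercolor illustration of a little fox wearing a red scarf"
frame_prompts = [
    "walking through a snowy forest",
    "warming its paws by a campfire",
    "sleeping under the northern lights",
]

# All frame descriptions share one input, so the T2I diffusion model sees the
# character description in the same context for every frame, which is what
# initially preserves identity across the generated images.
combined_prompt = identity_prompt + ", " + ", ".join(frame_prompts)

def prompt_for_frame(i: int) -> tuple[str, str]:
    """Return the full shared prompt plus the frame's active description,
    which the refinement steps then emphasize for that frame."""
    return combined_prompt, frame_prompts[i]
```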
2025-01-22
arXiv
Kimi k1.5: Scaling Reinforcement Learning with LLMs
The paper introduces Kimi k1.5, a multi-modal LLM trained with reinforcement learning (RL), which achieves state-of-the-art reasoning performance across multiple benchmarks without relying on complex techniques. It also presents effective long2short methods that improve short-CoT models, significantly outperforming existing models.
Language model pretraining with next token prediction has proved effective
for scaling compute but is limited by the amount of available training data.
Scaling reinforcement learning (RL) unlocks a new axis for the continued
improvement of artificial intelligence, with the promise that large language
models (LLMs) can scale their training data by learning to explore with
rewards. However, prior published work has not produced competitive results. In
light of this, we report on the training practice of Kimi k1.5, our latest
multi-modal LLM trained with RL, including its RL training techniques,
multi-modal data recipes, and infrastructure optimization. Long context scaling
and improved policy optimization methods are key ingredients of our approach,
which establishes a simple yet effective RL framework without relying on more
complex techniques such as Monte Carlo tree search, value functions, and
process reward models. Notably, our system achieves state-of-the-art reasoning
performance across multiple benchmarks and modalities -- e.g., 77.5 on AIME,
96.2 on MATH 500, 94th percentile on Codeforces, 74.9 on MathVista -- matching
OpenAI's o1. Moreover, we present effective long2short methods that use
long-CoT techniques to improve short-CoT models, yielding state-of-the-art
short-CoT reasoning results -- e.g., 60.8 on AIME, 94.6 on MATH 500, 47.3 on
LiveCodeBench -- outperforming existing short-CoT models such as GPT-4o and
Claude Sonnet 3.5 by a large margin (up to +550%).
2025-01-22
arXiv
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
The paper introduces DeepSeek-R1-Zero, a model trained with reinforcement learning that exhibits strong reasoning capabilities but faces readability and language mixing issues. To improve these aspects, DeepSeek-R1 is developed, which uses multi-stage training and cold-start data, achieving performance on par with OpenAI-o1-1217. The models and additional resources are open-sourced.
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and
DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement
learning (RL) without supervised fine-tuning (SFT) as a preliminary step,
demonstrates remarkable reasoning capabilities. Through RL, numerous powerful
and intriguing reasoning behaviors emerge naturally in DeepSeek-R1-Zero.
However, it encounters challenges such as poor readability and language
mixing. To address these issues and further enhance reasoning performance, we
introduce DeepSeek-R1, which incorporates multi-stage training and cold-start
data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217
on reasoning tasks. To support the research community, we open-source
DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B,
70B) distilled from DeepSeek-R1 based on Qwen and Llama.
2025-01-22
arXiv
SRMT: Shared Memory for Multi-agent Lifelong Pathfinding
The paper introduces SRMT, a method that enhances coordination in multi-agent systems by sharing and broadcasting working memories. SRMT outperforms various baselines in partially observable pathfinding tasks, particularly under sparse rewards. The results show that shared recurrent memory can improve cooperation in decentralized multi-agent settings.
Multi-agent reinforcement learning (MARL) demonstrates significant progress
in solving cooperative and competitive multi-agent problems in various
environments. One of the principal challenges in MARL is the need for explicit
prediction of the agents' behavior to achieve cooperation. To resolve this
issue, we propose the Shared Recurrent Memory Transformer (SRMT) which extends
memory transformers to multi-agent settings by pooling and globally
broadcasting individual working memories, enabling agents to exchange
information implicitly and coordinate their actions. We evaluate SRMT on the
Partially Observable Multi-Agent Pathfinding problem in a toy Bottleneck
navigation task that requires agents to pass through a narrow corridor and on a
POGEMA benchmark set of tasks. In the Bottleneck task, SRMT consistently
outperforms a variety of reinforcement learning baselines, especially under
sparse rewards, and generalizes effectively to longer corridors than those seen
during training. On POGEMA maps, including Mazes, Random, and MovingAI, SRMT is
competitive with recent MARL, hybrid, and planning-based algorithms. These
results suggest that incorporating shared recurrent memory into
transformer-based architectures can enhance coordination in decentralized
multi-agent systems. The source code for training and evaluation is available
on GitHub: https://github.com/Aloriosa/srmt.
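As a toy sketch of the shared-memory idea, assume each agent carries a
recurrent memory vector that is pooled with the others and read back through
cross-attention; dimensions and update rules here are illustrative, not the
paper's exact architecture.

```python
import torch
import torch.nn as nn

class SharedRecurrentMemory(nn.Module):
    """Each agent keeps a memory vector; all memories are pooled and broadcast,
    and every agent cross-attends to the pool before updating its own memory,
    so information is exchanged implicitly rather than by predicting others."""

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.update = nn.GRUCell(d_model, d_model)

    def forward(self, obs_emb: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        """
        obs_emb : (n_agents, d_model) per-agent observation embeddings
        memory  : (n_agents, d_model) per-agent working memories
        returns : (n_agents, d_model) updated memories
        """
        pool = memory.unsqueeze(0)                    # (1, n_agents, d), shared by all agents
        query = obs_emb.unsqueeze(0)                  # each agent queries the shared pool
        read, _ = self.attn(query, pool, pool)        # implicit information exchange
        return self.update(read.squeeze(0), memory)   # recurrent memory update

# Usage: 4 agents over a few environment steps with random observations.
srm = SharedRecurrentMemory()
mem = torch.zeros(4, 64)
for _ in range(3):
    obs = torch.randn(4, 64)
    mem = srm(obs, mem)
```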
2025-01-22
arXiv
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3, a vision-centric multimodal foundation model, enhances image and video understanding through a four-stage training process that leverages high-quality image-text data. The model's design allows for the encoding of variable-resolution images and compact representation of videos, leading to superior performance in benchmarks.
In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation
model for image and video understanding. The core design philosophy of
VideoLLaMA3 is vision-centric. The meaning of "vision-centric" is two-fold: the
vision-centric training paradigm and vision-centric framework design. The key
insight of our vision-centric training paradigm is that high-quality image-text
data is crucial for both image and video understanding. Instead of preparing
massive video-text datasets, we focus on constructing large-scale and
high-quality image-text datasets. VideoLLaMA3 has four training stages: 1)
Vision Encoder Adaptation, which enables the vision encoder to accept images of
variable resolutions as input; 2) Vision-Language Alignment, which jointly
tunes the vision encoder, projector, and LLM with large-scale image-text data
covering multiple types (including scene images, documents, and charts) as well
as text-only data; 3) Multi-task Fine-tuning, which incorporates image-text SFT
data for downstream tasks and video-text data to establish a foundation for
video understanding; and 4) Video-centric Fine-tuning, which further improves the
model's capability in video understanding. As for the framework design, to
better capture fine-grained details in images, the pretrained vision encoder is
adapted to encode images of varying sizes into a correspondingly varying number
of vision tokens, rather than a fixed number of tokens. For video inputs, we
reduce the
number of vision tokens according to their similarity so that the
representation of videos is more precise and compact. Benefiting from its
vision-centric design, VideoLLaMA3 achieves compelling performance on both
image and video understanding benchmarks.
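One simple way to realize similarity-based token reduction is to drop a video
token whenever it is nearly identical to the previous kept token; this is an
illustrative sketch, and the model's actual pruning rule may differ.

```python
import torch
import torch.nn.functional as F

def reduce_video_tokens(tokens: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Keep a vision token only if it differs enough from the last kept token.

    tokens    : (n_tokens, d) vision tokens in temporal order
    threshold : cosine similarity above which a token is considered redundant
    Returns the kept tokens; illustrative only.
    """
    kept = [tokens[0]]
    for tok in tokens[1:]:
        sim = F.cosine_similarity(tok, kept[-1], dim=0)
        if sim < threshold:          # keep only tokens that add new information
            kept.append(tok)
    return torch.stack(kept)

# Usage: a nearly static clip collapses to very few tokens.
static_clip = torch.randn(1, 256).repeat(64, 1) + 0.01 * torch.randn(64, 256)
print(reduce_video_tokens(static_clip).shape)   # far fewer than 64 tokens
```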
2025-01-22
arXiv
Robust Representation Consistency Model via Contrastive Denoising
The paper introduces a new method for robust representation consistency via contrastive denoising, which improves the robustness of deep neural networks against adversarial perturbations and reduces computational overhead during inference. The method reformulates the generative modeling task as a discriminative task in the latent space, enabling implicit denoising-then-classification with a single prediction, and achieves state-of-the-art performance on various datasets.
Robustness is essential for deep neural networks, especially in
security-sensitive applications. To this end, randomized smoothing provides
theoretical guarantees for certifying robustness against adversarial
perturbations. Recently, diffusion models have been successfully employed for
randomized smoothing to purify noise-perturbed samples before making
predictions with a standard classifier. While these methods excel at small
perturbation radii, they struggle with larger perturbations and incur a
significant computational overhead during inference compared to classical
methods. To address this, we reformulate the generative modeling task along the
diffusion trajectories in pixel space as a discriminative task in the latent
space. Specifically, we use instance discrimination to achieve consistent
representations along the trajectories by aligning temporally adjacent points.
After fine-tuning based on the learned representations, our model enables
implicit denoising-then-classification via a single prediction, substantially
reducing inference costs. We conduct extensive experiments on various datasets
and achieve state-of-the-art performance with minimal computation budget during
inference. For example, our method outperforms the certified accuracy of
diffusion-based methods on ImageNet across all perturbation radii by 5.3% on
average, with up to 11.6% at larger radii, while reducing inference costs by
85× on average. Code is available at:
https://github.com/jiachenlei/rRCM.
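As an illustration of instance discrimination along a noise trajectory, the
sketch below applies an InfoNCE loss between embeddings of the same image at
two adjacent noise levels; the noise schedule, encoder, and loss weighting are
assumptions here, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def trajectory_contrastive_loss(encoder, x0, sigma_a=0.3, sigma_b=0.4, temperature=0.1):
    """Instance-discrimination sketch: embeddings of the same image at two
    adjacent noise levels on the diffusion trajectory should match each other
    and not other images in the batch (InfoNCE over the batch).

    encoder : network mapping noisy images (B, C, H, W) to features (B, d)
    x0      : clean images (B, C, H, W)
    """
    z_a = F.normalize(encoder(x0 + sigma_a * torch.randn_like(x0)), dim=-1)
    z_b = F.normalize(encoder(x0 + sigma_b * torch.randn_like(x0)), dim=-1)
    logits = z_a @ z_b.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(x0.shape[0], device=x0.device)
    # Each sample's positive is its own counterpart at the adjacent noise level.
    return F.cross_entropy(logits, labels)
```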