Latest Research Papers
2025-01-28
arXiv
Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies
The paper discusses the limitations of using Reinforcement Learning (RL) to ensure safety in advanced LLMs like DeepSeek-R1 and proposes a hybrid approach combining RL and Supervised Fine-Tuning (SFT) to mitigate harmful outputs.
Large Language Models (LLMs) have achieved remarkable progress in reasoning,
alignment, and task-specific performance. However, ensuring harmlessness in
these systems remains a critical challenge, particularly in advanced models
like DeepSeek-R1. This paper examines the limitations of Reinforcement Learning
(RL) as the primary approach for reducing harmful outputs in DeepSeek-R1 and
compares it with Supervised Fine-Tuning (SFT). While RL improves reasoning
capabilities, it faces challenges such as reward hacking, generalization
failures, language mixing, and high computational costs. We propose hybrid
training approaches combining RL and SFT to achieve a robust reduction in
harmful outputs. Usage recommendations and future directions for deploying
DeepSeek-R1 responsibly are also presented.
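To make the proposed two-stage recipe concrete, below is a minimal, self-contained sketch of SFT followed by a REINFORCE-style RL update on a toy policy. This is only an illustration of the general hybrid pattern, not the paper's or DeepSeek-R1's pipeline: the tiny policy network, the placeholder safety reward, and every hyperparameter are assumptions made for this example.

```python
# Toy sketch of hybrid SFT + RL fine-tuning (illustration only; not the
# DeepSeek-R1 training pipeline). Model, reward, and data are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 100, 32

class TinyPolicy(nn.Module):
    """A minimal next-token policy standing in for an LLM."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):                  # tokens: (batch, seq)
        h = self.embed(tokens).mean(dim=1)      # crude pooled context
        return self.head(h)                     # next-token logits

def safety_reward(tokens):
    """Placeholder reward: -1 for a hypothetical 'harmful' token id 0,
    +1 otherwise. A real setup would use a learned safety reward model."""
    return (tokens != 0).float() * 2.0 - 1.0

policy = TinyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Stage 1: SFT on curated (prompt, safe completion) pairs via cross-entropy.
sft_prompts = torch.randint(1, VOCAB, (64, 8))
sft_targets = torch.randint(1, VOCAB, (64,))
for _ in range(50):
    loss = F.cross_entropy(policy(sft_prompts), sft_targets)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: REINFORCE nudges the SFT-initialized policy toward higher
# safety reward, rather than starting RL from scratch.
for _ in range(50):
    prompts = torch.randint(1, VOCAB, (64, 8))
    dist = torch.distributions.Categorical(logits=policy(prompts))
    actions = dist.sample()                     # sampled "responses"
    loss = -(dist.log_prob(actions) * safety_reward(actions)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

A real pipeline would replace the toy policy with an LLM and the placeholder reward with a learned safety reward model; the intuition, as the abstract frames it, is that combining a supervised stage with RL is meant to offset RL-specific failure modes such as reward hacking and generalization failures.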
2024-12-18
arXiv
Clio: Privacy-Preserving Insights into Real-World AI Use
Clio is a privacy-preserving platform that uses AI assistants to analyze and aggregate usage patterns across millions of conversations, providing insight into real-world AI use without compromising user privacy. It identifies common use cases and language-specific trends, helps detect coordinated abuse, and supports monitoring during critical periods. The platform aims to support empirically grounded AI safety and governance.
How are AI assistants being used in the real world? While model providers in
theory have a window into this impact via their users' data, both privacy
concerns and practical challenges have made analyzing this data difficult. To
address these issues, we present Clio (Claude insights and observations), a
privacy-preserving platform that uses AI assistants themselves to analyze and
surface aggregated usage patterns across millions of conversations, without the
need for human reviewers to read raw conversations. We validate this can be
done with a high degree of accuracy and privacy by conducting extensive
evaluations. We demonstrate Clio's usefulness in two broad ways. First, we
share insights about how models are being used in the real world from one
million Claude.ai Free and Pro conversations, ranging from providing advice on
hairstyles to providing guidance on Git operations and concepts. We also
identify the most common high-level use cases on Claude.ai (coding, writing,
and research tasks) as well as patterns that differ across languages (e.g.,
conversations in Japanese discuss elder care and aging populations at
higher-than-typical rates). Second, we use Clio to make our systems safer by
identifying coordinated attempts to abuse our systems, monitoring for unknown
unknowns during critical periods like launches of new capabilities or major
world events, and improving our existing monitoring systems. We also discuss
the limitations of our approach, as well as risks and ethical concerns. By
enabling analysis of real-world AI usage, Clio provides a scalable platform for
empirically grounded AI safety and governance.
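As a rough illustration of the aggregate-and-threshold pattern the abstract describes (model-written summaries rolled up into clusters, with only sufficiently large clusters surfaced), here is a minimal sketch. It is not Anthropic's implementation: the keyword-based facet extractor and the minimum cluster size are placeholder assumptions for this example.

```python
# Minimal sketch of a Clio-style aggregate-then-threshold step
# (illustration only; not the actual Clio implementation).
from collections import Counter

MIN_CLUSTER_SIZE = 3  # hypothetical floor before a cluster may be surfaced

def extract_topic(conversation: str) -> str:
    """Stand-in for a model-written, privacy-scrubbed summary of one
    conversation; in Clio an AI assistant produces this facet so that no
    human needs to read the raw text."""
    text = conversation.lower()
    if "git" in text:
        return "guidance on Git operations"
    if "hairstyle" in text:
        return "advice on hairstyles"
    return "other"

def aggregate(conversations: list[str]) -> dict[str, int]:
    """Count conversations per topic and keep only clusters large enough
    that no single conversation is identifiable from the output."""
    counts = Counter(extract_topic(c) for c in conversations)
    return {topic: n for topic, n in counts.items() if n >= MIN_CLUSTER_SIZE}

if __name__ == "__main__":
    sample = [
        "how do I rebase in git?",
        "help with a git merge conflict",
        "when should I use git stash?",
        "what hairstyle suits a round face?",
    ]
    print(aggregate(sample))   # the lone hairstyle conversation is withheld
```

Per the abstract, the real platform delegates the summarization and clustering steps to AI assistants and validates the accuracy and privacy of the resulting aggregates, so usage patterns can be surfaced without human reviewers reading raw conversations.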