2024-12-26
arXiv

Jasper and Stella: distillation of SOTA embedding models

Dun Zhang, Fulong Wang
The paper introduces a distillation technique to create smaller, efficient embedding models and a method to reduce vector dimensions, along with alignment training for multimodal encoding, achieving high scores on MTEB benchmarks.
A crucial component of many deep learning applications (such as FAQ and RAG) is dense retrieval, in which embedding models convert raw text to numerical vectors and the most similar text is then retrieved via MIPS (Maximum Inner Product Search). Several text embedding benchmarks (e.g., MTEB, BEIR, and AIR-Bench) have been established to evaluate embedding models accurately. Thanks to these benchmarks, we can identify SOTA models; however, the deployment and application of these models in industry is hampered by their large vector dimensions and numerous parameters. To alleviate this problem: 1) we present a distillation technique that enables a smaller student model to achieve strong performance; 2) inspired by MRL, we present a training approach that reduces vector dimensions based on the model's own vectors or its teacher's vectors; 3) we perform simple yet effective alignment training between images and text to make our model a multimodal encoder. We trained the Stella and Jasper models using the techniques above and achieved high scores on the MTEB leaderboard. We release the model and data on the Hugging Face Hub (https://huggingface.co/infgrad/jasper_en_vision_language_v1), and the training logs are available at https://api.wandb.ai/links/dunnzhang0/z8jqoqpb.
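The abstract does not spell out the training objective, but a typical recipe for this kind of embedding distillation combines a cosine loss against the teacher's vectors with MRL-style truncation, so that prefixes of the student embedding remain usable on their own. A minimal PyTorch sketch under those assumptions (the loss form, dimensions, and the idea of truncating the teacher are illustrative, not the paper's exact method):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_emb, teacher_emb, mrl_dims=(256, 512, 1024)):
    """Hypothetical distillation objective: align the student with the
    teacher at full width and at several truncated (MRL-style) widths."""
    loss = 0.0
    for d in mrl_dims:
        s = F.normalize(student_emb[:, :d], dim=-1)
        # Truncating the teacher assumes its dims are importance-ordered;
        # this is a simplifying assumption for the sketch.
        t = F.normalize(teacher_emb[:, :d], dim=-1)
        loss = loss + (1.0 - (s * t).sum(dim=-1)).mean()  # cosine loss
    return loss / len(mrl_dims)

# Usage: (batch, dim) embeddings from the student and a frozen teacher
# computed on the same input texts.
student_emb = torch.randn(8, 1024)
teacher_emb = torch.randn(8, 1024)
print(distill_loss(student_emb, teacher_emb))
```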
2024-12-22
arXiv

GraphAgent: Agentic Graph Language Assistant

Yuhao Yang, Jiabin Tang, Lianghao Xia, Xingchen Zou, Yuxuan Liang
GraphAgent is an automated pipeline that integrates structured and unstructured data, using language and graph language models to handle predictive and generative tasks. It consists of three components: a Graph Generator Agent, a Task Planning Agent, and a Task Execution Agent, which collaborate to interpret user queries and execute tasks. The effectiveness of GraphAgent is demonstrated through extensive experiments on various datasets.
Real-world data is represented in both structured (e.g., graph connections) and unstructured (e.g., textual, visual information) formats, encompassing complex relationships that include explicit links (such as social connections and user behaviors) and implicit interdependencies among semantic entities, often illustrated through knowledge graphs. In this work, we propose GraphAgent, an automated agent pipeline that addresses both explicit graph dependencies and implicit graph-enhanced semantic inter-dependencies, aligning with practical data scenarios for predictive tasks (e.g., node classification) and generative tasks (e.g., text generation). GraphAgent comprises three key components: (i) a Graph Generator Agent that builds knowledge graphs to reflect complex semantic dependencies; (ii) a Task Planning Agent that interprets diverse user queries and formulates corresponding tasks through agentic self-planning; and (iii) a Task Execution Agent that efficiently executes planned tasks while automating tool matching and invocation in response to user queries. These agents collaborate seamlessly, integrating language models with graph language models to uncover intricate relational information and data semantic dependencies. Through extensive experiments on various graph-related predictive and text generative tasks on diverse datasets, we demonstrate the effectiveness of our GraphAgent across various settings. We have made our proposed GraphAgent open-source at: https://github.com/HKUDS/GraphAgent.
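The abstract describes the pipeline only at the component level; as a rough illustration of how the three agents might hand off work, here is a minimal orchestration sketch (all class names, routing rules, and stubs below are hypothetical, not from the released code):

```python
from dataclasses import dataclass

@dataclass
class Task:
    kind: str      # e.g., "node_classification" or "text_generation"
    payload: dict

def graph_lm(payload):
    # Stand-in for a graph language model call.
    return f"[graph-LM] handled {payload['query']!r}"

def text_lm(payload):
    # Stand-in for a plain language model call.
    return f"[text-LM] handled {payload['query']!r}"

class GraphGeneratorAgent:
    def build_kg(self, corpus):
        # Would extract entities/relations from unstructured text; stubbed here.
        return {"nodes": [], "edges": []}

class TaskPlanningAgent:
    def plan(self, query, kg):
        # Interpret the user query and formulate a task (self-planning stub).
        kind = "node_classification" if "classify" in query else "text_generation"
        return Task(kind=kind, payload={"query": query, "kg": kg})

class TaskExecutionAgent:
    def execute(self, task):
        # Tool matching: route the planned task to the appropriate model.
        tool = graph_lm if task.kind == "node_classification" else text_lm
        return tool(task.payload)

kg = GraphGeneratorAgent().build_kg(corpus=["doc1", "doc2"])
task = TaskPlanningAgent().plan("classify this node", kg)
print(TaskExecutionAgent().execute(task))
```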
2024-12-18
arXiv

SAFERec: Self-Attention and Frequency Enriched Model for Next Basket Recommendation

Oleg Lashinin, Denis Krasilnikov, Aleksandr Milogradskii, Marina Ananyeva
SAFERec, a new algorithm for Next-Basket Recommendation, enhances transformer-based models by incorporating item frequency information, improving their performance on NBR tasks. Experiments show SAFERec outperforms other baselines, with an 8% improvement in Recall@10.
Transformer-based approaches such as BERT4Rec and SASRec demonstrate strong performance in Next Item Recommendation (NIR) tasks. However, applying these architectures to Next-Basket Recommendation (NBR) tasks, which often involve highly repetitive interactions, is challenging due to the vast number of possible item combinations in a basket. Moreover, frequency-based methods such as TIFU-KNN and UP-CF still demonstrate strong performance in NBR tasks, frequently outperforming deep-learning approaches. This paper introduces SAFERec, a novel algorithm for NBR that enhances transformer-based architectures from NIR by incorporating item frequency information, consequently improving their applicability to NBR tasks. Extensive experiments on multiple datasets show that SAFERec outperforms all other baselines, specifically achieving an 8% improvement in Recall@10.
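The abstract does not specify how frequency information enters the architecture; one plausible reading, sketched below, is to fuse a self-attention encoding of the basket history with per-user item-frequency features before scoring items (the fusion scheme, dimensions, and module names are assumptions):

```python
import torch
import torch.nn as nn

class FrequencyEnrichedScorer(nn.Module):
    """Hypothetical sketch: fuse a self-attention summary of basket history
    with per-user item-frequency features for next-basket scoring."""
    def __init__(self, n_items, d_model=64):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, d_model)
        self.attn = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.freq_proj = nn.Linear(n_items, d_model)
        self.out = nn.Linear(2 * d_model, n_items)

    def forward(self, basket_history, item_freqs):
        # basket_history: (batch, seq_len) item ids; item_freqs: (batch, n_items)
        h = self.attn(self.item_emb(basket_history)).mean(dim=1)
        f = self.freq_proj(item_freqs)
        return self.out(torch.cat([h, f], dim=-1))  # scores over all items

model = FrequencyEnrichedScorer(n_items=1000)
scores = model(torch.randint(0, 1000, (4, 20)), torch.rand(4, 1000))
print(scores.shape)  # torch.Size([4, 1000])
```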
2024-12-18
arXiv

Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective

Xuanjing Huang, Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Bo Wang
This paper outlines a roadmap to reproducing OpenAI o1 from a reinforcement learning perspective, emphasizing four key components: policy initialization, reward design, search, and learning. These components enable the model to develop human-like reasoning, generate high-quality solutions, and improve performance with more data and parameters. The analysis provides insights into how learning and search drive the advancement of large language models.
OpenAI o1 represents a significant milestone in Artificial Intelligence, achieving expert-level performance on many challenging tasks that require strong reasoning ability. OpenAI has claimed that the main technique behind o1 is reinforcement learning. Recent works use alternative approaches, such as knowledge distillation, to imitate o1's reasoning style, but their effectiveness is limited by the capability ceiling of the teacher model. Therefore, this paper analyzes the roadmap to achieving o1 from the perspective of reinforcement learning, focusing on four key components: policy initialization, reward design, search, and learning. Policy initialization enables models to develop human-like reasoning behaviors, equipping them to effectively explore solution spaces for complex problems. Reward design provides dense and effective signals via reward shaping or reward modeling, which guide both search and learning. Search plays a crucial role in generating high-quality solutions during both training and testing, and can produce better solutions with more computation. Learning uses the data generated by search to improve the policy, and can achieve better performance with more parameters and more searched data. Existing open-source projects that attempt to reproduce o1 can be seen as parts or variants of our roadmap. Collectively, these components underscore how learning and search drive o1's advancement, making meaningful contributions to the development of LLMs.
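The four components fit together as a generate-score-update loop: search produces candidate solutions, the reward model selects among them, and learning folds the selected data back into the policy. A toy sketch of that loop with stand-in components (everything here is illustrative; the paper surveys many concrete instantiations of each step):

```python
import random

def policy(prompt):
    # Stand-in for an initialized LLM policy: sample one candidate solution.
    return f"solution-{random.randint(0, 9)} for {prompt}"

def reward_model(solution):
    # Stand-in for reward design: score a candidate (here, arbitrarily).
    return random.random()

def search(prompt, n=8):
    # Best-of-N search: more compute yields higher-reward candidates.
    candidates = [policy(prompt) for _ in range(n)]
    return max(candidates, key=reward_model)

def learn(dataset):
    # Stand-in for the learning step: fine-tune / RL-update on searched data.
    print(f"updating policy on {len(dataset)} searched examples")

dataset = [(p, search(p)) for p in ["prompt-A", "prompt-B"]]
learn(dataset)
```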
2024-12-18
arXiv

Clio: Privacy-Preserving Insights into Real-World AI Use

Jared Kaplan, Jack Clark, Alex Tamkin, Miles McCain, Kunal Handa
Clio, a privacy-preserving platform, uses AI assistants to analyze and aggregate usage patterns from millions of conversations, providing insights into real-world AI use without compromising user privacy. It identifies common use cases and language-specific trends, and helps in detecting system abuse and monitoring during critical periods. The platform aims to support empirically grounded AI safety and governance.
How are AI assistants being used in the real world? While model providers in theory have a window into this impact via their users' data, both privacy concerns and practical challenges have made analyzing this data difficult. To address these issues, we present Clio (Claude insights and observations), a privacy-preserving platform that uses AI assistants themselves to analyze and surface aggregated usage patterns across millions of conversations, without the need for human reviewers to read raw conversations. We validate that this can be done with a high degree of accuracy and privacy by conducting extensive evaluations. We demonstrate Clio's usefulness in two broad ways. First, we share insights about how models are being used in the real world from one million Claude.ai Free and Pro conversations, ranging from providing advice on hairstyles to providing guidance on Git operations and concepts. We also identify the most common high-level use cases on Claude.ai (coding, writing, and research tasks) as well as patterns that differ across languages (e.g., conversations in Japanese discuss elder care and aging populations at higher-than-typical rates). Second, we use Clio to make our systems safer by identifying coordinated attempts to abuse our systems, monitoring for unknown unknowns during critical periods like launches of new capabilities or major world events, and improving our existing monitoring systems. We also discuss the limitations of our approach, as well as risks and ethical concerns. By enabling analysis of real-world AI usage, Clio provides a scalable platform for empirically grounded AI safety and governance.
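The abstract gives only the high-level idea; a common shape for this kind of privacy-preserving aggregation, sketched below, is to have a model write a short facet summary per conversation, group the summaries, and surface only groups above a minimum size so no individual conversation is exposed (the threshold and helper functions are hypothetical, not Clio's actual parameters):

```python
from collections import Counter

def summarize_facet(conversation):
    # Stand-in for an LLM call that extracts a high-level topic label
    # without quoting user text (the real system uses model-written summaries).
    return conversation["topic"]

def aggregate(conversations, min_cluster_size=5):
    counts = Counter(summarize_facet(c) for c in conversations)
    # Privacy gate: drop clusters too small to be safely reported.
    return {topic: n for topic, n in counts.items() if n >= min_cluster_size}

convs = [{"topic": "coding help"}] * 7 + [{"topic": "rare personal issue"}] * 2
print(aggregate(convs))  # {'coding help': 7}
```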
2024-12-01
arXiv

Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks

Jiahao Cui, Hui Li, Yun Zhan, Hanlin Shang, Kaihui Cheng
This paper introduces a pretrained transformer-based video generative model for highly dynamic and realistic portrait animation, addressing challenges in non-frontal perspectives, dynamic objects, and immersive backgrounds. It uses a new identity reference network and investigates speech audio conditioning and motion frame mechanisms to maintain consistent facial identity and generate continuous video. The method shows significant improvements over prior techniques on benchmark and wild datasets.
Existing methodologies for animating portrait images face significant challenges, particularly in handling non-frontal perspectives, rendering dynamic objects around the portrait, and generating immersive, realistic backgrounds. In this paper, we introduce the first application of a pretrained transformer-based video generative model that demonstrates strong generalization capabilities and generates highly dynamic, realistic videos for portrait animation, effectively addressing these challenges. The adoption of a new video backbone model makes previous U-Net-based methods for identity maintenance, audio conditioning, and video extrapolation inapplicable. To address this limitation, we design an identity reference network consisting of a causal 3D VAE combined with a stacked series of transformer layers, ensuring consistent facial identity across video sequences. Additionally, we investigate various speech audio conditioning and motion frame mechanisms to enable the generation of continuous video driven by speech audio. Our method is validated through experiments on benchmark and newly proposed wild datasets, demonstrating substantial improvements over prior methods in generating realistic portraits characterized by diverse orientations within dynamic and immersive scenes. Further visualizations and the source code are available at: https://fudan-generative-vision.github.io/hallo3/.
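The identity reference network is described only at the block level (a causal 3D VAE feeding stacked transformer layers). The skeletal PyTorch sketch below shows one way that wiring could look, with guessed shapes and without the diffusion backbone, purely to make the data flow concrete:

```python
import torch
import torch.nn as nn

class IdentityReferenceNet(nn.Module):
    """Skeleton only: encode a reference portrait clip into identity tokens
    that could condition a video diffusion transformer. Shapes are guesses."""
    def __init__(self, d_model=256, n_layers=4):
        super().__init__()
        # Stand-in for a causal 3D VAE encoder over (frames, H, W) video.
        self.vae_enc = nn.Conv3d(3, d_model, kernel_size=4, stride=4)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, ref_video):
        # ref_video: (batch, 3, frames, H, W)
        z = self.vae_enc(ref_video)            # (B, C, f, h, w) latents
        tokens = z.flatten(2).transpose(1, 2)  # (B, f*h*w, C) token sequence
        return self.transformer(tokens)        # identity tokens for the DiT

net = IdentityReferenceNet()
out = net(torch.randn(1, 3, 8, 64, 64))
print(out.shape)  # torch.Size([1, 512, 256])
```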
2024-10-25
arXiv

Knowledge Graph Enhanced Language Agents for Recommendation

Taicheng Guo, Xiangliang Zhang, Chaochun Liu, Hai Wang, Varun Mannam
This paper introduces Knowledge Graph Enhanced Language Agents (KGLA), a framework that integrates knowledge graphs with language agents to improve recommendation systems by enriching user profiles and capturing complex relationships between users and items. The method significantly enhances recommendation performance, as demonstrated by substantial improvements in NDCG@1 on three widely used benchmarks.
Language agents have recently been used to simulate human behavior and user-item interactions for recommendation systems. However, current language agent simulations do not understand the relationships between users and items, leading to inaccurate user profiles and ineffective recommendations. In this work, we explore the utility of Knowledge Graphs (KGs), which contain extensive and reliable relationships between users and items, for recommendation. Our key insight is that the paths in a KG can capture complex relationships between users and items, eliciting the underlying reasons for user preferences and enriching user profiles. Leveraging this insight, we propose Knowledge Graph Enhanced Language Agents (KGLA), a framework that unifies language agents and KGs for recommendation systems. In the simulated recommendation scenario, we position the user and item within the KG and integrate KG paths as natural language descriptions into the simulation. This allows language agents to interact with each other and discover sufficient rationale behind their interactions, making the simulation more accurate and aligned with real-world cases, thus improving recommendation performance. Our experimental results show that KGLA significantly improves recommendation performance (with a 33%-95% boost in NDCG@1 across three widely used benchmarks) compared to the previous best baseline method.
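The core mechanism is verbalizing KG paths between a user and a candidate item so a language agent can consume them as prompt context. A minimal sketch of that step (the toy graph, path search, and text templates are simplified assumptions, not KGLA's implementation):

```python
import networkx as nx

# Toy KG: users, items, and attribute nodes with labeled edges.
kg = nx.Graph()
kg.add_edge("user:alice", "item:camera", label="purchased")
kg.add_edge("item:camera", "brand:acme", label="made_by")
kg.add_edge("brand:acme", "item:lens", label="made_by")

def verbalize_paths(user, item, max_len=4):
    """Turn KG paths between a user and a candidate item into text
    that can be appended to the language agent's prompt."""
    sentences = []
    for path in nx.all_simple_paths(kg, user, item, cutoff=max_len):
        hops = [f"{a} --{kg.edges[a, b]['label']}--> {b}"
                for a, b in zip(path, path[1:])]
        sentences.append(" ; ".join(hops))
    return sentences

print(verbalize_paths("user:alice", "item:lens"))
```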
2024-10-21
arXiv

STAR: A Simple Training-free Approach for Recommendations using Large Language Models

Dong-Ho Lee, Adam Kraft, Long Jin, Nikhil Mehta, Taibai Xu
This paper introduces STAR, a training-free approach for recommendation systems using large language models (LLMs) that combines semantic embeddings and collaborative user information. The method achieves competitive performance on next-item prediction tasks, demonstrating the potential of LLMs without fine-tuning. Experimental results show significant improvements in Hits@10 on various categories of the Amazon Review dataset.
Recent progress in large language models (LLMs) offers promising new approaches for recommendation system (RecSys) tasks. While the current state-of-the-art methods rely on fine-tuning LLMs to achieve optimal results, this process is costly and introduces significant engineering complexities. Conversely, methods that bypass fine-tuning and use LLMs directly are less resource-intensive but often fail to fully capture both semantic and collaborative information, resulting in sub-optimal performance compared to their fine-tuned counterparts. In this paper, we propose a Simple Training-free Approach for Recommendation (STAR), a framework that utilizes LLMs and can be applied to various recommendation tasks without the need for fine-tuning. Our approach involves a retrieval stage that uses semantic embeddings from LLMs combined with collaborative user information to retrieve candidate items. We then apply an LLM for pairwise ranking to enhance next-item prediction. Experimental results on the Amazon Review dataset show competitive performance for next item prediction, even with our retrieval stage alone. Our full method achieves Hits@10 performance of +23.8% on Beauty, +37.5% on Toys and Games, and -1.8% on Sports and Outdoors relative to the best supervised models. This framework offers an effective alternative to traditional supervised models, highlighting the potential of LLMs in recommendation systems without extensive training or custom architectures.
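STAR's retrieval stage blends semantic similarity from LLM embeddings with collaborative signals. A simplified scoring sketch under that reading (the linear blend and the co-occurrence statistic are assumptions, not the paper's exact formulation); the top-scoring candidates would then go to the LLM pairwise-ranking stage:

```python
import numpy as np

def score_candidates(history_ids, item_emb, cooccur, alpha=0.5):
    """Blend semantic and collaborative evidence for each candidate item.

    item_emb: (n_items, d) unit-normalized LLM embeddings of item text.
    cooccur:  (n_items, n_items) normalized co-purchase counts.
    """
    hist = item_emb[history_ids]          # (h, d) embeddings of past items
    semantic = item_emb @ hist.T          # cosine similarity to history
    collab = cooccur[:, history_ids]      # co-occurrence with history items
    return (alpha * semantic + (1 - alpha) * collab).mean(axis=1)

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
co = rng.random((100, 100))
top10 = np.argsort(-score_candidates([3, 17, 42], emb, co))[:10]
print(top10)  # candidate ids for the LLM pairwise-ranking stage
```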
2024-09-12
arXiv

Enhancing Q&A Text Retrieval with Ranking Models: Benchmarking, fine-tuning and deploying Rerankers for RAG

Gabriel de Souza P. Moreira, Ronay Ak, Benedikt Schifferer, Mengyao Xu, Radek Osmulski
This paper benchmarks and evaluates various ranking models for enhancing the accuracy of text retrieval in question-answering tasks, introducing a new model, NV-RerankQA-Mistral-4B-v3, that significantly improves accuracy. It also discusses the trade-offs between model size, accuracy, and system requirements in real-world applications.
Ranking models play a crucial role in enhancing the overall accuracy of text retrieval systems. These multi-stage systems typically utilize either dense embedding models or sparse lexical indices to retrieve relevant passages for a given query, followed by ranking models that refine the ordering of the candidate passages by their relevance to the query. This paper benchmarks various publicly available ranking models and examines their impact on ranking accuracy. We focus on text retrieval for question-answering tasks, a common use case for Retrieval-Augmented Generation systems. Our evaluation benchmarks include models, some of which are commercially viable for industrial applications. We introduce a state-of-the-art ranking model, NV-RerankQA-Mistral-4B-v3, which achieves a significant accuracy increase of ~14% compared to pipelines with other rerankers. We also provide an ablation study comparing the fine-tuning of ranking models with different sizes, losses, and self-attention mechanisms. Finally, we discuss the challenges of text retrieval pipelines with ranking models in real-world industry applications, in particular the trade-offs among model size, ranking accuracy, and system requirements such as indexing and serving latency/throughput.
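To make the two-stage setup concrete, here is a minimal retrieve-then-rerank pipeline using the sentence-transformers library with generic public checkpoints (not NV-RerankQA-Mistral-4B-v3; the model names below are just common examples):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

retriever = SentenceTransformer("all-MiniLM-L6-v2")       # dense first stage
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

passages = ["Paris is the capital of France.",
            "The Eiffel Tower is in Paris.",
            "Berlin is the capital of Germany."]
query = "What is the capital of France?"

# Stage 1: dense retrieval of top-k candidate passages.
p_emb = retriever.encode(passages, convert_to_tensor=True)
q_emb = retriever.encode(query, convert_to_tensor=True)
hits = util.semantic_search(q_emb, p_emb, top_k=2)[0]

# Stage 2: cross-encoder reranking of the candidates.
pairs = [(query, passages[h["corpus_id"]]) for h in hits]
scores = reranker.predict(pairs)
best = max(zip(scores, pairs), key=lambda x: x[0])
print(best[1][1], best[0])
```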
2024-08-28
arXiv

Conan-embedding: General Text Embedding with More and Better Negative Samples

Shiyu Li, Yang Tang, Shizhe Chen, Xi Chen
The paper introduces the conan-embedding model, which improves text embedding by using a dynamic hard negative mining method and a Cross-GPU balancing Loss to increase the number and quality of negative examples. It also leverages LLM-generated prompt-response pairs for training, achieving top performance on a Chinese text embedding benchmark.
With the growing popularity of RAG, the capabilities of embedding models are gaining increasing attention. Embedding models are primarily trained through contrastive learning, in which negative examples are a key component. Previous work has proposed various hard negative mining strategies, but these strategies are typically employed as preprocessing steps. In this paper, we propose the conan-embedding model, which makes use of more and higher-quality negative examples. First, since the model's ability to handle preprocessed negative examples evolves during training, we propose a dynamic hard negative mining method that exposes the model to more challenging negative examples throughout the training process. Second, contrastive learning requires as many negative examples as possible but is limited by GPU memory constraints; we therefore use a Cross-GPU balancing Loss to provide more negative examples for embedding training and to balance the batch size across multiple tasks. Moreover, we find that prompt-response pairs from LLMs can be used for embedding training. Our approach effectively enhances the capabilities of embedding models, and conan-embedding currently ranks first on the Chinese leaderboard of the Massive Text Embedding Benchmark (MTEB).
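As an illustration of the cross-GPU idea, the sketch below gathers embeddings from all ranks so that each GPU's contrastive loss sees every other GPU's in-batch examples as additional negatives; this is a standard pattern and assumes an initialized torch.distributed process group (the paper's balancing across tasks is not reproduced here):

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def gather_with_grad(t):
    # All-gather embeddings from every rank, keeping gradients only for the
    # local shard (a common trick in cross-GPU contrastive training).
    gathered = [torch.zeros_like(t) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, t)
    gathered[dist.get_rank()] = t  # restore the differentiable local copy
    return torch.cat(gathered, dim=0)

def contrastive_loss(q, p, temperature=0.05):
    # q, p: (local_batch, d) query/positive embeddings on this rank.
    q, p = F.normalize(q, dim=-1), F.normalize(p, dim=-1)
    all_p = gather_with_grad(p)                # (world_batch, d)
    logits = q @ all_p.T / temperature         # every other row is a negative
    offset = dist.get_rank() * q.size(0)       # index of local positives
    labels = torch.arange(q.size(0), device=q.device) + offset
    return F.cross_entropy(logits, labels)
```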