Latest Research Papers
2024-12-26
arXiv
Jasper and Stella: distillation of SOTA embedding models
The paper introduces a distillation technique to create smaller, efficient embedding models and a method to reduce vector dimensions, along with alignment training for multimodal encoding, achieving high scores on MTEB benchmarks.
A crucial component of many deep learning applications (such as FAQ and RAG)
is dense retrieval, in which embedding models are used to convert raw text to
numerical vectors and then get the most similar text by MIPS (Maximum Inner
Product Search). Some text embedding benchmarks (e.g. MTEB, BEIR, and
AIR-Bench) have been established to evaluate embedding models accurately.
Thanks to these benchmarks, we can use SOTA models; however, the deployment and
application of these models in industry are hampered by their large vector
dimensions and numerous parameters. To alleviate this problem, 1) we present a
distillation technique that enables a smaller student model to achieve good
performance; 2) inspired by MRL, we present a training approach that reduces
the vector dimensions based on the model's own vectors or its teacher's
vectors; and 3) we perform simple yet effective alignment training between
images and text to make our model a multimodal encoder. We trained the Stella
and Jasper models using the
technologies above and achieved high scores on the MTEB leaderboard. We release
the model and data at Hugging Face Hub
(https://huggingface.co/infgrad/jasper_en_vision_language_v1) and the training
logs are at https://api.wandb.ai/links/dunnzhang0/z8jqoqpb.
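Editor's note: the two mechanisms named in this abstract, MIPS retrieval and MRL-style dimension reduction, are easy to see in miniature. The sketch below is not the paper's code; it assumes random arrays stand in for model embeddings, truncates them to a smaller dimension and re-normalizes (the MRL idea), then runs Maximum Inner Product Search by brute-force dot products.

```python
import numpy as np

def truncate_and_normalize(vectors: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components (MRL-style) and re-normalize to unit length."""
    truncated = vectors[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

def mips_search(query_vec: np.ndarray, corpus_vecs: np.ndarray, top_k: int = 5) -> np.ndarray:
    """Maximum Inner Product Search: return indices of the top_k highest dot products."""
    scores = corpus_vecs @ query_vec
    return np.argsort(-scores)[:top_k]

# Toy usage with random "embeddings" standing in for real model outputs.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 1024))   # e.g. full 1024-d embeddings
query = rng.normal(size=(1, 1024))

corpus_small = truncate_and_normalize(corpus, dim=256)
query_small = truncate_and_normalize(query, dim=256)[0]
print(mips_search(query_small, corpus_small, top_k=3))
```

In production one would replace the brute-force search with an ANN index, but the truncate-then-renormalize step is the essence of serving smaller vectors.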
2024-12-22
arXiv
GraphAgent: Agentic Graph Language Assistant
GraphAgent is an automated pipeline that integrates structured and unstructured data, using language and graph language models to handle predictive and generative tasks. It consists of three components: a Graph Generator Agent, a Task Planning Agent, and a Task Execution Agent, which collaborate to interpret user queries and execute tasks. The effectiveness of GraphAgent is demonstrated through extensive experiments on various datasets.
Real-world data is represented in both structured (e.g., graph connections)
and unstructured (e.g., textual, visual information) formats, encompassing
complex relationships that include explicit links (such as social connections
and user behaviors) and implicit interdependencies among semantic entities,
often illustrated through knowledge graphs. In this work, we propose
GraphAgent, an automated agent pipeline that addresses both explicit graph
dependencies and implicit graph-enhanced semantic inter-dependencies, aligning
with practical data scenarios for predictive tasks (e.g., node classification)
and generative tasks (e.g., text generation). GraphAgent comprises three key
components: (i) a Graph Generator Agent that builds knowledge graphs to reflect
complex semantic dependencies; (ii) a Task Planning Agent that interprets
diverse user queries and formulates corresponding tasks through agentic
self-planning; and (iii) a Task Execution Agent that efficiently executes
planned tasks while automating tool matching and invocation in response to user
queries. These agents collaborate seamlessly, integrating language models with
graph language models to uncover intricate relational information and data
semantic dependencies. Through extensive experiments on various graph-related
predictive and text generative tasks on diverse datasets, we demonstrate the
effectiveness of our GraphAgent across various settings. We have made our
proposed GraphAgent open-source at: https://github.com/HKUDS/GraphAgent.
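Editor's note: the three-agent pipeline described above can be illustrated with a toy orchestration. This is a hedged sketch, not the authors' implementation; every function body is a placeholder for an LLM or graph-model call, and the names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeGraph:
    edges: list = field(default_factory=list)  # (head, relation, tail) triples

def graph_generator_agent(documents: list[str]) -> KnowledgeGraph:
    """Placeholder: GraphAgent would use a model to extract semantic entities and
    relations from unstructured text; here we emit dummy triples."""
    return KnowledgeGraph(edges=[(doc, "mentions", f"entity_{i}") for i, doc in enumerate(documents)])

def task_planning_agent(user_query: str) -> str:
    """Placeholder self-planning step: map the user query to a task type."""
    return "node_classification" if "classify" in user_query.lower() else "text_generation"

def task_execution_agent(task: str, kg: KnowledgeGraph, user_query: str) -> str:
    """Placeholder tool matching: dispatch to a graph model or a language model."""
    tools = {
        "node_classification": lambda: f"ran graph model on {len(kg.edges)} edges",
        "text_generation": lambda: f"generated text for: {user_query}",
    }
    return tools[task]()

kg = graph_generator_agent(["paper abstract A", "paper abstract B"])
query = "Please classify these nodes"
print(task_execution_agent(task_planning_agent(query), kg, query))
```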
2024-12-18
arXiv
SAFERec: Self-Attention and Frequency Enriched Model for Next Basket Recommendation
SAFERec, a new algorithm for Next-Basket Recommendation, enhances transformer-based models by incorporating item frequency information, improving their performance on NBR tasks. Experiments show SAFERec outperforms other baselines, with an 8% improvement in Recall@10.
Transformer-based approaches such as BERT4Rec and SASRec demonstrate strong
performance in Next Item Recommendation (NIR) tasks. However, applying these
architectures to Next-Basket Recommendation (NBR) tasks, which often involve
highly repetitive interactions, is challenging due to the vast number of
possible item combinations in a basket. Moreover, frequency-based methods such
as TIFU-KNN and UP-CF still demonstrate strong performance in NBR tasks,
frequently outperforming deep-learning approaches. This paper introduces
SAFERec, a novel algorithm for NBR that enhances transformer-based
architectures from NIR by incorporating item frequency information,
consequently improving their applicability to NBR tasks. Extensive experiments
on multiple datasets show that SAFERec outperforms all other baselines,
specifically achieving an 8% improvement in Recall@10.
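Editor's note: the core idea of mixing frequency evidence into a sequence model's scores can be shown in a few lines. This is a hedged sketch under assumed inputs (a stand-in score vector instead of a real SASRec/BERT4Rec head, and a hypothetical mixing weight alpha), not SAFERec itself.

```python
import numpy as np

def item_frequencies(baskets: list[list[int]], num_items: int) -> np.ndarray:
    """Per-user item frequencies, normalized over the basket history."""
    counts = np.zeros(num_items)
    for basket in baskets:
        for item in basket:
            counts[item] += 1
    return counts / max(counts.sum(), 1)

def fused_scores(transformer_scores: np.ndarray, freq: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend sequence-model scores with frequency evidence; alpha is a hypothetical weight."""
    return alpha * transformer_scores + (1 - alpha) * freq

history = [[0, 2], [2, 3], [2]]                    # past baskets for one user
freq = item_frequencies(history, num_items=5)
seq_scores = np.array([0.1, 0.4, 0.2, 0.1, 0.2])   # stand-in for a transformer scoring head
print(np.argsort(-fused_scores(seq_scores, freq))[:3])  # top-3 item ids for the next basket
```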
2024-12-18
arXiv
Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective
This paper outlines a roadmap to reproducing OpenAI o1 from a reinforcement learning perspective, emphasizing four key components: policy initialization, reward design, search, and learning. These components enable the model to develop human-like reasoning, generate high-quality solutions, and improve performance with more data and parameters. The analysis provides insights into how learning and search drive the advancement of large language models.
OpenAI o1 represents a significant milestone in Artificial Intelligence,
achieving expert-level performance on many challenging tasks that require
strong reasoning ability. OpenAI has claimed that the main technique behind o1
is reinforcement learning. Recent works use alternative approaches like
knowledge distillation to imitate o1's reasoning style, but their effectiveness
is limited by the capability ceiling of the teacher model. Therefore, this
paper analyzes the roadmap to achieving o1 from the perspective of
reinforcement learning, focusing on four key components: policy initialization,
reward design, search, and learning. Policy initialization enables models to
develop human-like reasoning behaviors, equipping them with the ability to
effectively explore solution spaces for complex problems. Reward design
provides dense and effective signals via reward shaping or reward modeling,
which is the guidance for both search and learning. Search plays a crucial role
in generating high-quality solutions during both training and testing phases,
which can produce better solutions with more computation. Learning utilizes the
data generated by search to improve the policy, which can achieve better
performance with more parameters and more searched data. Existing open-source
projects that attempt to reproduce o1 can be seen as a part or a variant of our
roadmap. Collectively, these components underscore how learning and search
drive o1's advancement, making meaningful contributions to the development of
LLMs.
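Editor's note: the interplay of search and learning described above can be made concrete with a best-of-N sampling loop. This is a hedged toy sketch, not the paper's method: the policy and reward model are random stubs, and only the selection-then-collect pattern is meant to be illustrative.

```python
import random

def policy_generate(prompt: str) -> str:
    """Stub for an initialized policy (an LLM); returns a candidate reasoning trace."""
    return f"solution_{random.randint(0, 9)} for {prompt}"

def reward_model(prompt: str, solution: str) -> float:
    """Stub reward: in practice this would be a learned outcome/process reward model."""
    return random.random()

def best_of_n_search(prompt: str, n: int = 8) -> tuple[str, float]:
    """Search phase: sample n candidates and keep the highest-reward one."""
    candidates = [policy_generate(prompt) for _ in range(n)]
    scored = [(c, reward_model(prompt, c)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])

# Learning phase (sketch): the best searched solutions become training data for the policy.
training_data = [best_of_n_search(p) for p in ["problem A", "problem B"]]
print(training_data)
```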
2024-12-18
arXiv
Clio: Privacy-Preserving Insights into Real-World AI Use
Clio, a privacy-preserving platform, uses AI assistants to analyze and aggregate usage patterns from millions of conversations, providing insights into real-world AI use without compromising user privacy. It identifies common use cases and language-specific trends, and helps in detecting system abuse and monitoring during critical periods. The platform aims to support empirically grounded AI safety and governance.
How are AI assistants being used in the real world? While model providers in
theory have a window into this impact via their users' data, both privacy
concerns and practical challenges have made analyzing this data difficult. To
address these issues, we present Clio (Claude insights and observations), a
privacy-preserving platform that uses AI assistants themselves to analyze and
surface aggregated usage patterns across millions of conversations, without the
need for human reviewers to read raw conversations. We validate that this can
be done with a high degree of accuracy and privacy through extensive
evaluations. We demonstrate Clio's usefulness in two broad ways. First, we
share insights about how models are being used in the real world from one
million Claude.ai Free and Pro conversations, ranging from advice on
hairstyles to guidance on Git operations and concepts. We also
identify the most common high-level use cases on Claude.ai (coding, writing,
and research tasks) as well as patterns that differ across languages (e.g.,
conversations in Japanese discuss elder care and aging populations at
higher-than-typical rates). Second, we use Clio to make our systems safer by
identifying coordinated attempts to abuse our systems, monitoring for unknown
unknowns during critical periods like launches of new capabilities or major
world events, and improving our existing monitoring systems. We also discuss
the limitations of our approach, as well as risks and ethical concerns. By
enabling analysis of real-world AI usage, Clio provides a scalable platform for
empirically grounded AI safety and governance.
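Editor's note: the privacy principle in this abstract (assistants label conversations, and only aggregate counts are surfaced) can be sketched in a few lines. This is a hedged toy stand-in, not Clio: the keyword-based "summarizer" below substitutes for the model-driven step, and the threshold value is hypothetical.

```python
from collections import Counter

def assistant_summarize(conversation: str) -> str:
    """Stub for the step where an AI assistant produces a short, PII-free topic label.
    In Clio this is done by a model, not keyword rules; this is a toy stand-in."""
    if "merge" in conversation or "rebase" in conversation:
        return "git help"
    return "general advice"

def aggregate_topics(conversations: list[str], min_count: int = 2) -> dict[str, int]:
    """Only aggregate counts above a threshold are surfaced, never raw conversations."""
    counts = Counter(assistant_summarize(c) for c in conversations)
    return {topic: n for topic, n in counts.items() if n >= min_count}

convos = ["how do I rebase my branch", "help me merge conflicts", "what haircut suits me"]
print(aggregate_topics(convos))  # e.g. {'git help': 2}
```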
2024-12-01
arXiv
Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks
This paper introduces a pretrained transformer-based video generative model for highly dynamic and realistic portrait animation, addressing challenges in non-frontal perspectives, dynamic objects, and immersive backgrounds. It uses a new identity reference network and investigates speech audio conditioning and motion frame mechanisms to maintain consistent facial identity and generate continuous video. The method shows significant improvements over prior techniques on benchmark and wild datasets.
Existing methodologies for animating portrait images face significant
challenges, particularly in handling non-frontal perspectives, rendering
dynamic objects around the portrait, and generating immersive, realistic
backgrounds. In this paper, we introduce the first application of a pretrained
transformer-based video generative model that demonstrates strong
generalization capabilities and generates highly dynamic, realistic videos for
portrait animation, effectively addressing these challenges. The adoption of a
new video backbone model makes previous U-Net-based methods for identity
maintenance, audio conditioning, and video extrapolation inapplicable. To
address this limitation, we design an identity reference network consisting of
a causal 3D VAE combined with a stacked series of transformer layers, ensuring
consistent facial identity across video sequences. Additionally, we investigate
various speech audio conditioning and motion frame mechanisms to enable the
generation of continuous video driven by speech audio. Our method is validated
through experiments on benchmark and newly proposed wild datasets,
demonstrating substantial improvements over prior methods in generating
realistic portraits characterized by diverse orientations within dynamic and
immersive scenes. Further visualizations and the source code are available at:
https://fudan-generative-vision.github.io/hallo3/.
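Editor's note: the identity reference network is described as a causal 3D VAE followed by stacked transformer layers. The sketch below is a loose structural analogue only, assuming a plain Conv2d stand-in for the VAE encoder and arbitrary sizes; it is not the authors' architecture.

```python
import torch
import torch.nn as nn

class IdentityReferenceNet(nn.Module):
    """Hedged sketch: a reference encoder followed by stacked transformer layers that
    emit identity tokens for conditioning a video generator. The paper uses a causal
    3D VAE as the encoder; here a plain Conv2d patch embedding stands in for it."""
    def __init__(self, dim: int = 256, num_layers: int = 4):
        super().__init__()
        self.encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # stand-in for the 3D VAE
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, reference_image: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(reference_image)        # (B, dim, H/16, W/16)
        tokens = feats.flatten(2).transpose(1, 2)    # (B, num_patches, dim)
        return self.transformer(tokens)              # identity tokens for cross-attention

identity_tokens = IdentityReferenceNet()(torch.randn(1, 3, 256, 256))
print(identity_tokens.shape)  # torch.Size([1, 256, 256])
```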
2024-10-25
arXiv
Knowledge Graph Enhanced Language Agents for Recommendation
This paper introduces Knowledge Graph Enhanced Language Agents (KGLA), a framework that integrates knowledge graphs with language agents to improve recommendation systems by enriching user profiles and capturing complex relationships between users and items. The method significantly enhances recommendation performance, as demonstrated by substantial improvements in NDCG@1 on three widely used benchmarks.
Language agents have recently been used to simulate human behavior and
user-item interactions for recommendation systems. However, current language
agent simulations do not understand the relationships between users and items,
leading to inaccurate user profiles and ineffective recommendations. In this
work, we explore the utility of Knowledge Graphs (KGs), which contain extensive
and reliable relationships between users and items, for recommendation. Our key
insight is that the paths in a KG can capture complex relationships between
users and items, eliciting the underlying reasons for user preferences and
enriching user profiles. Leveraging this insight, we propose Knowledge Graph
Enhanced Language Agents (KGLA), a framework that unifies language agents and KG
for recommendation systems. In the simulated recommendation scenario, we
position the user and item within the KG and integrate KG paths as natural
language descriptions into the simulation. This allows language agents to
interact with each other and discover sufficient rationale behind their
interactions, making the simulation more accurate and aligned with real-world
cases, thus improving recommendation performance. Our experimental results show
that KGLA significantly improves recommendation performance (with a 33%-95%
boost in NDCG@1 across three widely used benchmarks) compared to the previous
best baseline method.
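Editor's note: the key mechanism here, turning KG paths into natural-language context for a language agent, is simple to illustrate. The sketch below is a hedged toy rendering with a hypothetical two-hop path; the real framework decides which paths to include and how to phrase them.

```python
def kg_paths_to_text(paths: list[list[tuple[str, str, str]]]) -> str:
    """Render KG paths (lists of (head, relation, tail) triples) as natural-language
    sentences that can be appended to a language agent's profile prompt."""
    sentences = []
    for path in paths:
        hops = [f"{head} --{relation}--> {tail}" for head, relation, tail in path]
        sentences.append(" and then ".join(hops))
    return "User context from the knowledge graph: " + "; ".join(sentences) + "."

# Hypothetical two-hop path linking a user to a candidate item.
paths = [[("user_42", "purchased", "hiking boots"),
          ("hiking boots", "same_brand_as", "trail jacket")]]
print(kg_paths_to_text(paths))
```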
2024-10-21
arXiv
STAR: A Simple Training-free Approach for Recommendations using Large Language Models
This paper introduces STAR, a training-free approach for recommendation systems using large language models (LLMs) that combines semantic embeddings and collaborative user information. The method achieves competitive performance on next-item prediction tasks, demonstrating the potential of LLMs without fine-tuning. Experimental results show significant improvements in Hits@10 on various categories of the Amazon Review dataset.
Recent progress in large language models (LLMs) offers promising new
approaches for recommendation system (RecSys) tasks. While the current
state-of-the-art methods rely on fine-tuning LLMs to achieve optimal results,
this process is costly and introduces significant engineering complexities.
Conversely, methods that bypass fine-tuning and use LLMs directly are less
resource-intensive but often fail to fully capture both semantic and
collaborative information, resulting in sub-optimal performance compared to
their fine-tuned counterparts. In this paper, we propose a Simple Training-free
Approach for Recommendation (STAR), a framework that utilizes LLMs and can be
applied to various recommendation tasks without the need for fine-tuning. Our
approach involves a retrieval stage that uses semantic embeddings from LLMs
combined with collaborative user information to retrieve candidate items. We
then apply an LLM for pairwise ranking to enhance next-item prediction.
Experimental results on the Amazon Review dataset show competitive performance
for next item prediction, even with our retrieval stage alone. Our full method
achieves Hits@10 performance of +23.8% on Beauty, +37.5% on Toys and Games, and
-1.8% on Sports and Outdoors relative to the best supervised models. This
framework offers an effective alternative to traditional supervised models,
highlighting the potential of LLMs in recommendation systems without extensive
training or custom architectures.
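Editor's note: the retrieval stage described above blends semantic similarity from LLM embeddings with collaborative signals. The sketch below is a hedged approximation with random stand-in embeddings, a co-occurrence matrix, and a hypothetical mixing weight alpha; it is not the paper's exact scoring rule.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def score_candidate(candidate: int, history: list[int], item_embs: np.ndarray,
                    co_counts: np.ndarray, alpha: float = 0.5) -> float:
    """Blend LLM-embedding similarity with collaborative co-occurrence over the
    user's history; alpha is a hypothetical mixing weight, not the paper's value."""
    sem = np.mean([cosine(item_embs[candidate], item_embs[h]) for h in history])
    collab = np.mean([co_counts[candidate, h] for h in history])
    return alpha * sem + (1 - alpha) * collab

rng = np.random.default_rng(0)
item_embs = rng.normal(size=(10, 64))   # stand-in for LLM item embeddings
co_counts = rng.random((10, 10))        # stand-in for normalized co-engagement counts
history = [1, 4]
ranked = sorted((i for i in range(10) if i not in history),
                key=lambda i: -score_candidate(i, history, item_embs, co_counts))
print(ranked[:3])  # candidates to pass to the LLM pairwise-ranking stage
```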
2024-09-12
arXiv
Enhancing Q&A Text Retrieval with Ranking Models: Benchmarking, fine-tuning and deploying Rerankers for RAG
This paper benchmarks and evaluates various ranking models for enhancing the accuracy of text retrieval in question-answering tasks, introducing a new model, NV-RerankQA-Mistral-4B-v3, that significantly improves accuracy. It also discusses the trade-offs between model size, accuracy, and system requirements in real-world applications.
Ranking models play a crucial role in enhancing the overall accuracy of text
retrieval systems. These multi-stage systems typically utilize either dense
embedding models or sparse lexical indices to retrieve relevant passages based
on a given query, followed by ranking models that refine the ordering of the
candidate passages by their relevance to the query.
This paper benchmarks various publicly available ranking models and examines
their impact on ranking accuracy. We focus on text retrieval for
question-answering tasks, a common use case for Retrieval-Augmented Generation
systems. Our evaluation includes models, some of which are commercially viable
for industrial applications.
We introduce a state-of-the-art ranking model, NV-RerankQA-Mistral-4B-v3,
which achieves a significant accuracy increase of ~14% compared to pipelines
with other rerankers. We also provide an ablation study comparing the
fine-tuning of ranking models with different sizes, losses and self-attention
mechanisms.
Finally, we discuss challenges of text retrieval pipelines with ranking
models in real-world industry applications, in particular the trade-offs among
model size, ranking accuracy and system requirements like indexing and serving
latency / throughput.
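Editor's note: the multi-stage pipeline described in this abstract, cheap retrieval followed by an expensive reranker over the candidates, can be sketched end to end. Both models below are stubs (random embeddings and a word-overlap "reranker"); in practice they would be replaced by a real embedding model and a cross-encoder reranker such as the one introduced in the paper.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Stub bi-encoder: replace with a real embedding model in practice."""
    rng = np.random.default_rng(abs(hash(tuple(texts))) % (2**32))
    return rng.normal(size=(len(texts), 128))

def cross_encoder_score(query: str, passage: str) -> float:
    """Stub reranker: a real system would call a trained cross-encoder here."""
    return float(len(set(query.split()) & set(passage.split())))

def retrieve_then_rerank(query: str, passages: list[str], k_retrieve: int = 20, k_final: int = 5):
    # Stage 1: cheap retrieval over the whole corpus (dense dot-product here).
    q_vec, p_vecs = embed([query])[0], embed(passages)
    candidates = np.argsort(-(p_vecs @ q_vec))[:k_retrieve]
    # Stage 2: expensive reranking only over the retrieved candidates.
    reranked = sorted(candidates, key=lambda i: -cross_encoder_score(query, passages[i]))
    return [passages[i] for i in reranked[:k_final]]

docs = ["git rebase rewrites history", "ranking models refine candidate order", "RAG retrieves passages"]
print(retrieve_then_rerank("how do ranking models work", docs, k_retrieve=3, k_final=2))
```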
2024-08-28
arXiv
Conan-embedding: General Text Embedding with More and Better Negative Samples
The paper introduces the conan-embedding model, which improves text embedding by using a dynamic hard negative mining method and a Cross-GPU balancing Loss to increase the number and quality of negative examples. It also leverages LLM-generated prompt-response pairs for training, achieving top performance on a Chinese text embedding benchmark.
With the growing popularity of RAG, the capabilities of embedding models are
gaining increasing attention. Embedding models are primarily trained through
contrastive loss learning, with negative examples being a key component.
Previous work has proposed various hard negative mining strategies, but these
strategies are typically employed as preprocessing steps. In this paper, we
propose the conan-embedding model, which maximizes the utilization of more and
higher-quality negative examples. Specifically, since the model's ability to
handle preprocessed negative examples evolves during training, we propose a
dynamic hard negative mining method to expose the model to more challenging
negative examples throughout the training process. Second, contrastive
learning requires as many negative examples as possible but is limited by GPU
memory constraints. Therefore, we use a Cross-GPU balancing Loss to provide
more negative examples for embedding training and balance the batch size across
multiple tasks. Moreover, we discovered that the prompt-response pairs from
LLMs can be used for embedding training. Our approach effectively enhances the
capabilities of embedding models, currently ranking first on the Chinese
leaderboard of the Massive Text Embedding Benchmark (MTEB).
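Editor's note: "dynamic" hard negative mining means the hard negatives are re-selected with the current model as training progresses, rather than fixed at preprocessing time. The sketch below illustrates that selection step only, with random stand-in embeddings and a hypothetical margin; it is not the conan-embedding training code.

```python
import numpy as np

def current_model_score(query_emb: np.ndarray, doc_embs: np.ndarray) -> np.ndarray:
    """Similarity under the *current* model; during training this changes every step."""
    return doc_embs @ query_emb

def mine_hard_negatives(query_emb, negative_pool_embs, positive_score, top_k=4, margin=0.1):
    """Dynamic mining sketch: pick negatives the current model scores closest to
    (but still below) the positive, instead of fixing them once at preprocessing."""
    scores = current_model_score(query_emb, negative_pool_embs)
    eligible = np.where(scores < positive_score - margin)[0]
    return eligible[np.argsort(-scores[eligible])][:top_k]

rng = np.random.default_rng(0)
query = rng.normal(size=32)
positive = rng.normal(size=32)
pool = rng.normal(size=(100, 32))            # candidate negatives for this query
pos_score = float(query @ positive)
print(mine_hard_negatives(query, pool, pos_score))  # indices re-selected as training evolves
```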