Latest Research Papers
2024-12-30
arXiv
Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism
This paper introduces adaptive batch size schedules for distributed language model training that improve both training efficiency and generalization performance. The proposed methods are compatible with data and model parallelism and are empirically validated on models in the Llama family. Theoretical convergence guarantees are also provided for these adaptive schedules.
An appropriate choice of batch sizes in large-scale model training is
crucial, yet it involves an intrinsic and inevitable dilemma: large-batch
training improves training efficiency in terms of memory utilization, while
generalization performance often deteriorates because large batches leave
too little gradient noise. Despite this dilemma, the common practice of choosing batch sizes in
language model training often prioritizes training efficiency -- employing
either constant large sizes with data parallelism or implementing batch size
warmup schedules. However, such batch size schedule designs remain heuristic
and often fail to adapt to training dynamics, presenting the challenge of
designing adaptive batch size schedules. Given the abundance of available
datasets and the data-hungry nature of language models, data parallelism has
become an indispensable distributed training paradigm, enabling the use of
larger batch sizes for gradient computation. However, vanilla data parallelism
requires full replicas of model parameters, gradients, and optimizer states on each
worker, which makes training larger models with billions of parameters infeasible. To
reduce memory usage, more advanced parallelism strategies must be employed.
In this work, we propose general-purpose and theoretically principled adaptive
batch size schedules compatible with data parallelism and model parallelism. We
develop a practical implementation with PyTorch Fully Sharded Data Parallel,
facilitating the pretraining of language models of different sizes. We
empirically demonstrate that our proposed approaches outperform constant batch
sizes and heuristic batch size warmup schedules in the pretraining of models in
the Llama family, with particular focus on smaller models with up to 3 billion
parameters. We also establish theoretical convergence guarantees for such
adaptive batch size schedules with Adam for general smooth nonconvex
objectives.
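For readers who want a feel for what an adaptive schedule can key on, below is a minimal Python sketch. It is an illustrative assumption, not the paper's method: it grows the global batch size toward a gradient noise scale estimated from gradient norms measured at two batch sizes (a McCandlish et al.-style estimator), and the helper names, clipping cap, and world-size rounding are all made up for the example.

```python
# Illustrative sketch only, NOT the paper's proposed schedule: adapt the
# global batch size toward an estimated gradient noise scale, in the spirit
# of critical-batch-size heuristics (McCandlish et al., 2018).
import torch

def gradient_noise_scale(grad_small: torch.Tensor, grad_large: torch.Tensor,
                         b_small: int, b_large: int) -> float:
    """Estimate B_noise = tr(Sigma) / |G|^2 from gradient vectors measured
    at two batch sizes b_small < b_large."""
    sq_small = grad_small.pow(2).sum().item()  # |G_{b_small}|^2
    sq_large = grad_large.pow(2).sum().item()  # |G_{b_large}|^2
    # Unbiased estimates of the true squared gradient norm and noise trace.
    g2 = (b_large * sq_large - b_small * sq_small) / (b_large - b_small)
    trace_sigma = (sq_small - sq_large) / (1.0 / b_small - 1.0 / b_large)
    return max(trace_sigma, 0.0) / max(g2, 1e-12)

def next_batch_size(current: int, noise_scale: float, world_size: int = 8,
                    max_bs: int = 4096) -> int:
    """Grow (never shrink) the batch size toward the noise scale, rounded
    to a multiple of the data-parallel world size."""
    target = min(max(current, int(noise_scale)), max_bs)
    return max(world_size, (target // world_size) * world_size)

# Toy usage with synthetic "gradients"; in FSDP training these would come
# from full-model gradients gathered at two different accumulation levels.
g_a, g_b = torch.randn(1000), 0.5 * torch.randn(1000)
print(next_batch_size(256, gradient_noise_scale(g_a, g_b, 64, 512)))
```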
2024-12-22
arXiv
GraphAgent: Agentic Graph Language Assistant
GraphAgent is an automated pipeline that integrates structured and unstructured data, using language models together with graph language models to handle both predictive and generative tasks. It consists of three components: a Graph Generator Agent, a Task Planning Agent, and a Task Execution Agent, which collaborate to interpret user queries and execute the resulting tasks. The effectiveness of GraphAgent is demonstrated through extensive experiments on diverse datasets.
Real-world data is represented in both structured (e.g., graph connections)
and unstructured (e.g., textual, visual information) formats, encompassing
complex relationships that include explicit links (such as social connections
and user behaviors) and implicit interdependencies among semantic entities,
often illustrated through knowledge graphs. In this work, we propose
GraphAgent, an automated agent pipeline that addresses both explicit graph
dependencies and implicit graph-enhanced semantic interdependencies, aligning
with practical data scenarios for predictive tasks (e.g., node classification)
and generative tasks (e.g., text generation). GraphAgent comprises three key
components: (i) a Graph Generator Agent that builds knowledge graphs to reflect
complex semantic dependencies; (ii) a Task Planning Agent that interprets
diverse user queries and formulates corresponding tasks through agentic
self-planning; and (iii) a Task Execution Agent that efficiently executes
planned tasks while automating tool matching and invocation in response to user
queries. These agents collaborate seamlessly, integrating language models with
graph language models to uncover intricate relational information and data
semantic dependencies. Through extensive experiments on various graph-related
predictive and text generative tasks on diverse datasets, we demonstrate the
effectiveness of our GraphAgent across various settings. We have made our
proposed GraphAgent open-source at: https://github.com/HKUDS/GraphAgent.
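To make the division of labor among the three agents concrete, here is a minimal, self-contained Python sketch of a pipeline in the same spirit. The class names mirror the abstract, but every method body is a toy stand-in for the LLM-driven components in the official open-source implementation.

```python
# Illustrative sketch of a three-agent pipeline in the spirit of GraphAgent;
# the method bodies are toy stand-ins, not the released implementation.
from dataclasses import dataclass, field

@dataclass
class KnowledgeGraph:
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)  # (head, relation, tail) triples

class GraphGeneratorAgent:
    def build(self, documents: list[str]) -> KnowledgeGraph:
        """Extract entities and relations into a knowledge graph. A real
        system would use an LLM extractor; here we just link adjacent tokens."""
        kg = KnowledgeGraph()
        for doc in documents:
            tokens = doc.split()
            kg.nodes.update(tokens)
            kg.edges.extend(zip(tokens, ["next_to"] * (len(tokens) - 1), tokens[1:]))
        return kg

class TaskPlanningAgent:
    def plan(self, query: str) -> str:
        """Map a free-form user query to a task type via self-planning."""
        return "predictive" if "classify" in query.lower() else "generative"

class TaskExecutionAgent:
    def execute(self, task: str, kg: KnowledgeGraph, query: str) -> str:
        """Match and invoke a tool (graph LM vs. text LM) for the task."""
        tool = {"predictive": "graph_lm", "generative": "text_lm"}[task]
        return f"[{tool}] answered {query!r} using {len(kg.edges)} edges"

# Usage: the three agents collaborate on a single user query.
kg = GraphGeneratorAgent().build(["users follow topics", "topics cite papers"])
query = "classify this user node"
print(TaskExecutionAgent().execute(TaskPlanningAgent().plan(query), kg, query))
```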
2023-05-17
arXiv
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
The paper introduces Tree of Thoughts (ToT), a new framework for language model inference that enhances problem-solving by enabling exploration and strategic decision-making. ToT allows LMs to consider multiple reasoning paths and self-evaluate choices, improving performance on tasks requiring planning or search. Experiments show significant improvements in problem-solving abilities, such as increasing the success rate in the Game of 24 from 4% to 74%.
Language models are increasingly being deployed for general problem solving
across a wide range of tasks, but are still confined to token-level,
left-to-right decision-making processes during inference. This means they can
fall short in tasks that require exploration or strategic lookahead, or in
which initial decisions play a pivotal role. To surmount these challenges, we
introduce a new framework for language model inference, Tree of Thoughts (ToT),
which generalizes over the popular Chain of Thought approach to prompting
language models, and enables exploration over coherent units of text (thoughts)
that serve as intermediate steps toward problem solving. ToT allows LMs to
perform deliberate decision making by considering multiple different reasoning
paths and self-evaluating choices to decide the next course of action, as well
as looking ahead or backtracking when necessary to make global choices. Our
experiments show that ToT significantly enhances language models'
problem-solving abilities on three novel tasks requiring non-trivial planning
or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in
Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of
tasks, our method achieved a success rate of 74%. Code repo with all prompts:
https://github.com/princeton-nlp/tree-of-thought-llm.
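The core search loop is simple to state. Below is a minimal Python sketch of a ToT-style breadth-first search, one of the search strategies described in the paper; the `propose` and `evaluate` callables stand in for the LLM prompting steps, and the toy proposer/evaluator at the bottom is purely illustrative.

```python
# Illustrative sketch of ToT-style breadth-first search over "thoughts";
# the propose/evaluate callables stand in for LLM prompting steps.
from typing import Callable

def tot_bfs(root: str,
            propose: Callable[[str], list[str]],  # candidate next thoughts
            evaluate: Callable[[str], float],     # self-evaluation score
            breadth: int = 5, depth: int = 3) -> str:
    """At each level, expand every kept state with proposed thoughts and
    retain only the `breadth` highest-scoring partial solutions."""
    frontier = [root]
    for _ in range(depth):
        candidates = [c for state in frontier for c in propose(state)]
        if not candidates:
            break
        frontier = sorted(candidates, key=evaluate, reverse=True)[:breadth]
    return frontier[0]

# Toy usage: thoughts append digits, and the evaluator prefers larger sums.
best = tot_bfs(
    "",
    propose=lambda s: [s + d for d in "123"] if len(s) < 3 else [],
    evaluate=lambda s: sum(int(ch) for ch in s),
)
print(best)  # -> "333" under this toy proposer/evaluator
```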