2024-12-30
arXiv

Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism

Tim Tsz-Kit Lau, Weijian Li, Chenwei Xu, Han Liu, Mladen Kolar
This paper introduces adaptive batch size schedules for the distributed training of language models, which improve both training efficiency and generalization performance. The proposed methods are compatible with data and model parallelism and have been empirically validated on Llama family models. Theoretical convergence guarantees are also provided for these adaptive schedules.
An appropriate choice of batch sizes in large-scale model training is crucial, yet it involves an intrinsic and unavoidable dilemma: large-batch training improves training efficiency in terms of memory utilization, while generalization performance often deteriorates due to small amounts of gradient noise. Despite this dilemma, the common practice of choosing batch sizes in language model training often prioritizes training efficiency, either employing constant large sizes with data parallelism or implementing batch size warmup schedules. However, such batch size schedule designs remain heuristic and often fail to adapt to training dynamics, which leaves open the challenge of designing adaptive batch size schedules. Given the abundance of available datasets and the data-hungry nature of language models, data parallelism has become an indispensable distributed training paradigm, enabling the use of larger batch sizes for gradient computation. However, vanilla data parallelism requires replicas of model parameters, gradients, and optimizer states at each worker, which prohibits training larger models with billions of parameters. To optimize memory usage, more advanced parallelism strategies must be employed. In this work, we propose general-purpose and theoretically principled adaptive batch size schedules compatible with data parallelism and model parallelism. We develop a practical implementation with PyTorch Fully Sharded Data Parallel, facilitating the pretraining of language models of different sizes. We empirically demonstrate that our proposed approaches outperform constant batch sizes and heuristic batch size warmup schedules in the pretraining of models in the Llama family, with particular focus on smaller models with up to 3 billion parameters. We also establish theoretical convergence guarantees for such adaptive batch size schedules with Adam for general smooth nonconvex objectives.
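The memory argument in the abstract (every worker holding full replicas of parameters, gradients, and optimizer states) can be made concrete with a rough calculation. The sketch below is illustrative only and is not taken from the paper: it assumes mixed-precision Adam at 16 bytes per parameter (2 for half-precision parameters, 2 for half-precision gradients, 12 for full-precision master weights and the two Adam moments), ignores activations and communication buffers, and treats sharding as a perfectly even split across workers. The function name `model_state_gib` is hypothetical.

```python
# Rough per-worker memory for model states, comparing replicated (vanilla
# data parallel) and fully sharded layouts. All constants are assumptions.

# Assumed bytes per parameter under mixed-precision Adam:
# 2 (half-precision params) + 2 (half-precision grads)
# + 12 (fp32 master params, first moment, second moment).
BYTES_PER_PARAM = 2 + 2 + 12


def model_state_gib(num_params: int, num_workers: int, sharded: bool) -> float:
    """Return GiB of model states held by each worker (activations excluded)."""
    total_bytes = num_params * BYTES_PER_PARAM
    # Replicated: every worker stores everything. Sharded: idealized even split.
    per_worker_bytes = total_bytes / num_workers if sharded else total_bytes
    return per_worker_bytes / 2**30


if __name__ == "__main__":
    num_params = 3_000_000_000  # the largest model size mentioned in the abstract
    for workers in (1, 8, 64):
        replicated = model_state_gib(num_params, workers, sharded=False)
        sharded = model_state_gib(num_params, workers, sharded=True)
        print(f"{workers:>3} workers: replicated {replicated:.1f} GiB, "
              f"sharded {sharded:.1f} GiB per worker")
```

Under these assumptions the replicated footprint stays constant as workers are added, while the sharded footprint shrinks in proportion to the worker count, which is the motivation for combining batch size schedules with a sharded strategy.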
2024-12-22
arXiv

GraphAgent: Agentic Graph Language Assistant

Lianghao Xia, Chao Huang, Yuxuan Liang, Xingchen Zou, Jiabin Tang
GraphAgent is an automated pipeline that integrates structured and unstructured data, using language and graph language models to handle predictive and generative tasks. It consists of three components: a Graph Generator Agent, a Task Planning Agent, and a Task Execution Agent, which collaborate to interpret user queries and execute tasks. The effectiveness of GraphAgent is demonstrated through extensive experiments on various datasets.
Real-world data is represented in both structured (e.g., graph connections) and unstructured (e.g., textual, visual information) formats, encompassing complex relationships that include explicit links (such as social connections and user behaviors) and implicit interdependencies among semantic entities, often illustrated through knowledge graphs. In this work, we propose GraphAgent, an automated agent pipeline that addresses both explicit graph dependencies and implicit graph-enhanced semantic inter-dependencies, aligning with practical data scenarios for predictive tasks (e.g., node classification) and generative tasks (e.g., text generation). GraphAgent comprises three key components: (i) a Graph Generator Agent that builds knowledge graphs to reflect complex semantic dependencies; (ii) a Task Planning Agent that interprets diverse user queries and formulates corresponding tasks through agentic self-planning; and (iii) a Task Execution Agent that efficiently executes planned tasks while automating tool matching and invocation in response to user queries. These agents collaborate seamlessly, integrating language models with graph language models to uncover intricate relational information and data semantic dependencies. Through extensive experiments on various graph-related predictive and text generative tasks on diverse datasets, we demonstrate the effectiveness of our GraphAgent across various settings. We have made our proposed GraphAgent open-source at: https://github.com/HKUDS/GraphAgent.
2023-05-17
arXiv

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Dian Yu (Google DeepMind), Jeffrey Zhao (Princeton University), Izhak Shafran (Google DeepMind), Thomas L. Griffiths (Princeton University), Yuan Cao (Google DeepMind)
The paper introduces Tree of Thoughts (ToT), a new framework for language model inference that enhances problem-solving by enabling exploration and strategic decision-making. ToT allows LMs to consider multiple reasoning paths and self-evaluate choices, improving performance on tasks requiring planning or search. Experiments show significant improvements in problem-solving abilities, such as increasing the success rate in the Game of 24 from 4% to 74%.
Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices. Our experiments show that ToT significantly enhances language models' problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%. Code repo with all prompts: https://github.com/princeton-nlp/tree-of-thought-llm.
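The search loop described in the abstract (propose several candidate thoughts, evaluate them, keep the promising ones, continue) can be sketched as a breadth-first search with a fixed beam. This is a minimal illustration under stated assumptions, not the authors' implementation: the breadth-first strategy, the names `tree_of_thoughts_bfs`, `propose`, `score`, and `is_solved`, and the toy arithmetic puzzle are all hypothetical, and the proposer and scorer are plain functions standing in for language model calls. It does not implement explicit backtracking; keeping several paths alive per level plays that role here.

```python
from typing import Callable, List, Optional, Tuple

# A state is a partial solution: the sequence of values reached so far.
State = Tuple[int, ...]


def tree_of_thoughts_bfs(
    root: State,
    propose: Callable[[State], List[State]],  # stand-in for an LM generating thoughts
    score: Callable[[State], float],          # stand-in for an LM self-evaluation
    is_solved: Callable[[State], bool],
    beam_width: int = 3,
    max_depth: int = 6,
) -> Optional[State]:
    """Expand every kept state, then keep the `beam_width` best-scoring children."""
    frontier = [root]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in propose(state)]
        if not candidates:
            return None
        for candidate in candidates:
            if is_solved(candidate):
                return candidate
        # Keep several reasoning paths alive instead of committing to one.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    return None


# Toy problem (hypothetical): reach TARGET from 1 using "+3" or "*2" steps.
TARGET = 22


def propose_steps(state: State) -> List[State]:
    last = state[-1]
    return [state + (last + 3,), state + (last * 2,)]


def closeness(state: State) -> float:
    return -abs(TARGET - state[-1])


if __name__ == "__main__":
    path = tree_of_thoughts_bfs((1,), propose_steps, closeness,
                                lambda s: s[-1] == TARGET)
    print(path)
```

In a real setting `propose` and `score` would each wrap a prompt to the model, and `beam_width` trades the number of model calls against the breadth of exploration.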