2022-01-28
arXiv

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei (Google Research), Xuezhi Wang (Google Research), Dale Schuurmans (Google Research), Maarten Bosma (Google Research), Brian Ichter (Google Research)
The paper shows that including a chain of thought (a series of intermediate reasoning steps) in few-shot exemplars substantially improves the reasoning of sufficiently large language models on arithmetic, commonsense, and symbolic reasoning tasks. With just eight exemplars, a 540B-parameter model achieves state-of-the-art accuracy on the GSM8K benchmark of math word problems, surpassing finetuned GPT-3 with a verifier.
We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain-of-thought prompting, where a few chain-of-thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain-of-thought exemplars achieves state-of-the-art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
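As a rough illustration of the prompting method described above, the sketch below assembles a few-shot prompt in which each exemplar carries its intermediate reasoning before the final answer. The `Exemplar` class, the `build_cot_prompt` helper, the "Q:/A:" layout, and the sample questions are all assumptions made for this sketch, not details taken from the abstract.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    question: str
    chain_of_thought: str  # intermediate reasoning steps, written out in prose
    answer: str


def build_cot_prompt(exemplars: list[Exemplar], question: str) -> str:
    """Join worked exemplars, then append the new question with an open answer slot."""
    parts = [
        f"Q: {ex.question}\nA: {ex.chain_of_thought} The answer is {ex.answer}."
        for ex in exemplars
    ]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Hypothetical exemplar, invented for this sketch.
exemplars = [
    Exemplar(
        question="A shelf holds 3 boxes with 4 books each. How many books are on the shelf?",
        chain_of_thought="There are 3 boxes and each has 4 books, so 3 * 4 = 12 books.",
        answer="12",
    )
]

prompt = build_cot_prompt(
    exemplars, "Tom has 5 bags with 6 apples each. How many apples does he have?"
)
print(prompt)
```

The resulting string would be sent to a language model as-is; the model is expected to continue after the final "A:" with its own reasoning steps followed by an answer.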
2019-08-27
arXiv

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Nils Reimers, Iryna Gurevych
Sentence-BERT (SBERT) modifies BERT to produce semantically meaningful sentence embeddings that can be compared with cosine similarity. This cuts the time to find the most similar pair in a collection of 10,000 sentences from about 65 hours with BERT to about 5 seconds, while maintaining BERT's accuracy. SBERT outperforms other state-of-the-art sentence embedding methods on common semantic textual similarity and transfer learning tasks.
BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) have set a new state-of-the-art performance on sentence-pair regression tasks like semantic textual similarity (STS). However, they require that both sentences be fed into the network, which causes a massive computational overhead: finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (~65 hours) with BERT. The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks like clustering. In this publication, we present Sentence-BERT (SBERT), a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy of BERT. We evaluate SBERT and SRoBERTa on common STS tasks and transfer learning tasks, where they outperform other state-of-the-art sentence embedding methods.
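The "about 50 million" figure follows from counting unordered sentence pairs, n(n-1)/2, for n = 10,000. The sketch below checks that arithmetic and shows how a most-similar-pair search reduces to cosine comparisons once each sentence has a fixed-size embedding. The function names and the three toy vectors are assumptions for illustration; real SBERT embeddings would come from the model and have far more dimensions.

```python
import math
from itertools import combinations


def num_pairs(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_pair(embeddings):
    """Return the index pair (i, j), i < j, with the highest cosine similarity."""
    return max(
        combinations(range(len(embeddings)), 2),
        key=lambda ij: cosine_similarity(embeddings[ij[0]], embeddings[ij[1]]),
    )


pair_count = num_pairs(10_000)
print(pair_count)  # 49995000, i.e. about 50 million cross-encoder passes

# Toy stand-ins for sentence embeddings (hypothetical values).
toy_embeddings = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
best = most_similar_pair(toy_embeddings)
print(best)  # (0, 1)
```

With a pair-input model, every one of those pairs needs its own forward pass. With independent embeddings, only n forward passes are needed, and the pairwise step is cheap vector arithmetic.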
2017-06-12
arXiv

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones
The paper introduces the Transformer, a new neural network architecture that relies entirely on attention mechanisms, eliminating the need for recurrence and convolutions. This model outperforms existing methods in machine translation tasks, achieving state-of-the-art results with faster training times. The Transformer also demonstrates strong performance in other tasks like English constituency parsing.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
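The abstract does not spell out the attention mechanism itself. As a minimal sketch, assuming the scaled dot-product form softmax(QK^T / sqrt(d_k)) V, the pure-Python version below computes attention for small lists of vectors. The function names and toy inputs are illustrative assumptions, and the sketch omits multi-head projections, masking, and batching.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """For each query, return a weighted average of the value vectors.

    Weights are softmax(q . k / sqrt(d_k)) taken over all keys.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


# Two identical keys receive equal weight, so the output is the mean of the values.
toy_output = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[2.0], [4.0]],
)
print(toy_output)
```

Each query's output depends only on dot products with the keys, not on the outputs for earlier positions, which is one way to see why such a model parallelizes more readily than a recurrent one.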