Latest Research Papers
2022-01-28
arXiv
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
The paper demonstrates that providing a chain of thought in prompts significantly enhances the reasoning capabilities of large language models, leading to improved performance on various reasoning tasks. This method, even with few exemplars, can achieve state-of-the-art accuracy, surpassing fine-tuned models.
We explore how generating a chain of thought -- a series of intermediate
reasoning steps -- significantly improves the ability of large language models
to perform complex reasoning. In particular, we show how such reasoning
abilities emerge naturally in sufficiently large language models via a simple
method called chain of thought prompting, where a few chain of thought
demonstrations are provided as exemplars in prompting. Experiments on three
large language models show that chain of thought prompting improves performance
on a range of arithmetic, commonsense, and symbolic reasoning tasks. The
empirical gains can be striking. For instance, prompting a 540B-parameter
language model with just eight chain of thought exemplars achieves state of the
art accuracy on the GSM8K benchmark of math word problems, surpassing even
finetuned GPT-3 with a verifier.
2019-08-27
arXiv
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Sentence-BERT (SBERT) modifies BERT to produce semantically meaningful sentence embeddings, reducing the computational cost of finding similar sentence pairs from 65 hours to about 5 seconds while maintaining accuracy. It outperforms other state-of-the-art methods on common semantic textual similarity and transfer learning tasks.
BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) has set a new
state-of-the-art performance on sentence-pair regression tasks like semantic
textual similarity (STS). However, it requires that both sentences are fed into
the network, which causes a massive computational overhead: Finding the most
similar pair in a collection of 10,000 sentences requires about 50 million
inference computations (~65 hours) with BERT. The construction of BERT makes it
unsuitable for semantic similarity search as well as for unsupervised tasks
like clustering.
In this publication, we present Sentence-BERT (SBERT), a modification of the
pretrained BERT network that use siamese and triplet network structures to
derive semantically meaningful sentence embeddings that can be compared using
cosine-similarity. This reduces the effort for finding the most similar pair
from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while
maintaining the accuracy from BERT.
We evaluate SBERT and SRoBERTa on common STS tasks and transfer learning
tasks, where it outperforms other state-of-the-art sentence embeddings methods.
2017-06-12
arXiv
Attention Is All You Need
The paper introduces the Transformer, a new neural network architecture that relies entirely on attention mechanisms, eliminating the need for recurrence and convolutions. This model outperforms existing methods in machine translation tasks, achieving state-of-the-art results with faster training times. The Transformer also demonstrates strong performance in other tasks like English constituency parsing.
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks in an encoder-decoder configuration. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer, based
solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to be
superior in quality while being more parallelizable and requiring significantly
less time to train. Our model achieves 28.4 BLEU on the WMT 2014
English-to-German translation task, improving over the existing best results,
including ensembles by over 2 BLEU. On the WMT 2014 English-to-French
translation task, our model establishes a new single-model state-of-the-art
BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction
of the training costs of the best models from the literature. We show that the
Transformer generalizes well to other tasks by applying it successfully to
English constituency parsing both with large and limited training data.