Abstract
This paper introduces the Transformer, a novel network architecture for sequence transduction that relies solely on attention mechanisms, dispensing with recurrence and convolutions entirely. The proposed model utilizes stacked self-attention and feed-forward networks to draw global dependencies between input and output, allowing for significantly more parallelization and reduced training time compared to recurrent neural networks. Experiments on machine translation tasks (WMT 2014 English-to-German and English-to-French) demonstrate that the Transformer achieves state-of-the-art quality at a fraction of the computational cost of existing best models.
Key Concepts
- Transformer Architecture: A sequence transduction model composed of stacked encoder and decoder layers using only attention and feed-forward mechanisms.
- Self-Attention: An attention mechanism relating different positions within a single sequence to compute representations, capturing dependencies regardless of distance.
- Multi-Head Attention: Projects queries, keys, and values into multiple representation subspaces to jointly attend to information from different positions and subspaces.
- Scaled Dot-Product Attention: Calculates attention weights by taking the dot product of queries and keys, scaled by $1/\sqrt{d_k}$, to prevent large dot products from pushing the softmax into regions with extremely small gradients (a minimal sketch follows this list).
- Positional Encoding: Injects information about token order using sine and cosine functions of different frequencies since the model lacks recurrence.
- Residual Connections and Layer Normalization: Residual paths around sub-layers, followed by Layer Normalization, facilitate training deep networks.
- Encoder-Decoder Attention: Connects the encoder and decoder where queries come from the decoder and keys/values from the encoder output.
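The following is a minimal NumPy sketch of scaled dot-product self-attention and a multi-head wrapper, following the definitions above. The shapes, random projection matrices, and function names are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)    # (seq_q, seq_k)
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # weighted sum of values

def multi_head_attention(X, heads, W_o):
    """Each head projects X into its own subspace, attends there, and the
    concatenated head outputs are projected back to d_model by W_o."""
    outputs = [scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
               for Wq, Wk, Wv in heads]
    return np.concatenate(outputs, axis=-1) @ W_o

# Toy self-attention: 4 positions, d_model = 8, h = 2 heads of size d_k = 4.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
heads = [tuple(rng.standard_normal((8, 4)) for _ in range(3)) for _ in range(2)]
W_o = rng.standard_normal((8, 8))
print(multi_head_attention(X, heads, W_o).shape)       # (4, 8)
```

Because each head attends in a lower-dimensional subspace, the total cost stays close to single-head attention with full dimensionality while avoiding the averaging effect noted under Key Claims.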
Key Equations and Algorithms
- Scaled Dot-Product Attention: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(QK^{\top}/\sqrt{d_k}\right)V$, computes the attention output as a weighted sum of the values.
- Position-wise Feed-Forward Network: $\mathrm{FFN}(x) = \max(0,\ xW_1 + b_1)W_2 + b_2$, applied identically and independently to each position in the sequence.
- Learning Rate Schedule: $lrate = d_{\text{model}}^{-0.5} \cdot \min(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5})$, which increases the learning rate linearly for the first $warmup\_steps$ training steps and then decreases it proportionally to the inverse square root of the step number (see the sketch after this list).
- Residual Connection: $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, applies layer normalization to the sum of the sub-layer input and the sub-layer output.
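A short NumPy sketch of the position-wise feed-forward network and the learning rate schedule as written above. The default values $d_{\text{model}} = 512$ and $warmup\_steps = 4000$ are the paper's base-model settings; everything else (names, toy shapes) is illustrative.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied per position."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def lrate(step, d_model=512, warmup_steps=4000):
    """d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5):
    linear warmup, then inverse-square-root decay."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The schedule peaks exactly at step == warmup_steps, then decays.
print(lrate(1), lrate(4000), lrate(100000))
```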
Key Claims and Findings
- The Transformer achieves a BLEU score of 28.4 on WMT 2014 English-to-German, surpassing all previously reported ensembles by over 2.0 BLEU.
- The model requires significantly less time to train than recurrent or convolutional counterparts, with large models trained in 3.5 days on eight GPUs.
- The architecture allows for significantly more parallelization because it does not require sequential computation within training examples.
- Multi-head attention enables the model to jointly attend to information from different representation subspaces at different positions without the inhibition caused by averaging in single-head attention.
- Sinusoidal positional encoding may allow the model to extrapolate to sequence lengths longer than those encountered during training (sketched below).
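A minimal sketch of the sinusoidal encoding, assuming the paper's definition $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{\text{model}}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}})$; the function name and even $d_{\text{model}}$ are assumptions of this sketch.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i2 = np.arange(0, d_model, 2)[None, :]         # even dimension indices
    angles = pos / np.power(10000.0, i2 / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# PE is a fixed function of position, so rows for positions beyond any
# training length can be generated on demand, enabling extrapolation.
print(positional_encoding(50, 8).shape)            # (50, 8)
```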
Terminology
- Scaled Dot-Product Attention: An attention function where the dot products between queries and keys are scaled by $1/\sqrt{d_k}$ before softmax normalization.
- Encoder-Decoder Attention: An attention mechanism where queries originate from the previous decoder layer and keys/values originate from the encoder output.
- Position-wise Feed-Forward Network: A fully connected network consisting of two linear transformations and a ReLU activation, applied independently to each position.
- Label Smoothing: A regularization technique that smooths the target distributions during training; it hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score (sketched after this list).
- Residual Dropout: Dropout applied to the output of each sub-layer before it is added to the sub-layer input and normalized, and to the sums of the embeddings and positional encodings, to prevent overfitting.
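One common formulation of label smoothing, sketched in NumPy. The paper uses $\epsilon_{ls} = 0.1$ but does not spell out this exact parameterization, so spreading the mass uniformly over the other classes is an assumption of this sketch.

```python
import numpy as np

def smooth_labels(targets, vocab_size, eps=0.1):
    """Replace one-hot targets with a smoothed distribution: the correct
    class gets 1 - eps; eps is spread uniformly over the other classes
    (the spreading rule is an assumption, not spelled out in the paper)."""
    n = len(targets)
    smoothed = np.full((n, vocab_size), eps / (vocab_size - 1))
    smoothed[np.arange(n), targets] = 1.0 - eps
    return smoothed

# Two target tokens in a toy 5-word vocabulary, eps = 0.1 as in the paper.
print(smooth_labels(np.array([2, 0]), vocab_size=5))
```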
Connections to Existing Wiki Pages
- index - General context for neural sequence transduction and model architecture topics.
- nvidia - Hardware reference for training infrastructure (8 NVIDIA P100 GPUs).
- NCP-AAI_Part_1_Exam_Prep_FULL - Likely resource for studying foundational AI model architectures covered in this paper.