Non-Autoregressive Sequence-to-Sequence Models

Non-autoregressive sequence-to-sequence (seq2seq) models generate all output tokens in parallel rather than one token at a time.

Autoregressive vs Non-Autoregressive

Definition: Autoregressive (Gu 2017): An autoregressive model generates its output one token at a time, with each token conditioned on the sequence of tokens previously generated. Examples of autoregressive models include RNNs, LSTMs, CNNs (with masked convolution layers), and Transformers (examples from Gu 2017).
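To make the step-by-step dependence concrete, here is a minimal sketch of greedy autoregressive decoding. The scoring function `toy_next_token_scores` is a hypothetical stand-in for a trained decoder (it simply reverses the source), and the names `greedy_decode` and `EOS` are illustrative assumptions, not taken from Gu 2017.

```python
from typing import Callable, Dict, List

EOS = "</s>"  # hypothetical end-of-sequence symbol


def toy_next_token_scores(source: List[str], prefix: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a trained decoder.

    Returns a score for every candidate next token, given the source and
    the tokens generated so far. Toy rule: emit the source in reverse
    order, then EOS.
    """
    remaining = len(source) - len(prefix)
    target = source[remaining - 1] if remaining > 0 else EOS
    vocab = set(source) | {EOS}
    return {tok: (1.0 if tok == target else 0.0) for tok in vocab}


def greedy_decode(
    source: List[str],
    score_fn: Callable[[List[str], List[str]], Dict[str, float]] = toy_next_token_scores,
    max_len: int = 50,
) -> List[str]:
    """Generate one token per step; each step sees the full prefix."""
    prefix: List[str] = []
    for _ in range(max_len):
        scores = score_fn(source, prefix)  # one sequential model call per token
        best = max(scores, key=scores.get)
        if best == EOS:
            break
        prefix.append(best)
    return prefix
```

The loop cannot be parallelized across output positions, because step t needs the tokens chosen at steps 1 to t-1. The number of sequential model calls therefore grows with the output length.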

Definition: Non-Autoregressive (Gu 2017): A non-autoregressive model removes the conditional dependence between output tokens and generates them in parallel. See equation 3 of Gu 2017 for an example.
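The two definitions can be written as factorizations of the output probability. The sketch below uses notation close to Gu 2017, but the exact indexing is reproduced from memory and should be checked against the paper. Here X = x_{1:T'} is the source sentence of length T', Y = y_{1:T} is the target of length T, y_0 and y_{T+1} are begin and end symbols, and p_L is a distribution over target lengths.

```latex
\begin{align}
% Autoregressive: each token is conditioned on all previously generated tokens
p_{AR}(Y \mid X; \theta) &= \prod_{t=1}^{T+1} p(y_t \mid y_{0:t-1}, x_{1:T'}; \theta) \\
% Non-autoregressive: tokens are conditionally independent given the source and the length
p_{NA}(Y \mid X; \theta) &= p_L(T \mid x_{1:T'}; \theta) \cdot \prod_{t=1}^{T} p(y_t \mid x_{1:T'}; \theta)
\end{align}
```

In the second line no factor depends on another output token, so all T factors can be evaluated in parallel once the length T is chosen.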

Note: There are also global models, such as Conditional Random Fields, which are neither autoregressive nor non-autoregressive by the above definitions. Instead, they perform inference (using dynamic programming, MCMC, etc.) to maximize a global scoring function.
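As an illustration of inference over a global score, here is a minimal sketch of Viterbi decoding for a linear-chain model. It assumes the score of a label sequence is the sum of per-position emission scores and pairwise transition scores; the function name and argument layout are illustrative assumptions, and the input must contain at least one position.

```python
from typing import List, Tuple


def viterbi(
    emissions: List[List[float]], transitions: List[List[float]]
) -> Tuple[float, List[int]]:
    """Return the highest-scoring label sequence and its score.

    emissions[t][y]      : score of label y at position t
    transitions[p][y]    : score of moving from label p to label y
    """
    n_labels = len(transitions)
    # best[y] = best score of any label sequence ending in label y so far
    best = list(emissions[0])
    backptrs: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_best, ptrs = [], []
        for y in range(n_labels):
            prev = max(range(n_labels), key=lambda p: best[p] + transitions[p][y])
            new_best.append(best[prev] + transitions[prev][y] + emissions[t][y])
            ptrs.append(prev)
        best = new_best
        backptrs.append(ptrs)
    # Follow back-pointers from the best final label
    last = max(range(n_labels), key=lambda y: best[y])
    path = [last]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    path.reverse()
    return best[last], path
```

With emissions `[[1, 0], [0, 1], [1, 0]]` and a penalty of -2 for switching labels, the per-position argmax would pick labels 0, 1, 0, while the global optimum keeps label 0 throughout. This is the sense in which the model scores whole sequences rather than individual tokens.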

Summary

From Zhou & Keung 2020 - Improving Non-autoregressive Neural Machine Translation with Monolingual Data:

Many non-autoregressive (NAR) translation methods have been proposed, including latent space models (Gu et al., 2017; Ma et al., 2019; Shu et al., 2019), iterative refinement methods (Lee et al., 2018; Ghazvininejad et al., 2019), and alternative loss functions (Libovicky and Helcl, 2018; Wang et al., 2019; Wei et al., 2019; Li et al., 2019; Shao et al., 2019). The decoding speedup for NAR models is typically 2-15× depending on the specific setup (e.g., the number of length candidates, number of latent samples, etc.), and NAR models can be tuned to achieve different trade-offs between time complexity and decoding quality (Gu et al., 2017; Wei et al., 2019; Ghazvininejad et al., 2019; Ma et al., 2019).

All these methods are based on transformer modules (Vaswani et al., 2017), and depend on a well-trained AR model to obtain its output translations to create targets for NAR model training.
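The iterative refinement family mentioned in the quote can be sketched roughly as follows, in the style of Mask-Predict (Ghazvininejad et al 2019): predict every masked position in parallel, then re-mask the least confident positions and repeat, with the number of re-masked tokens decaying linearly. The predictor `toy_predict` is a hypothetical stand-in for a conditional masked language model, and the target length is assumed to be given (here, equal to the source length) rather than predicted.

```python
from typing import Callable, List, Tuple

MASK = "<mask>"  # hypothetical mask symbol


def toy_predict(source: List[str], tokens: List[str]) -> List[Tuple[str, float]]:
    """Hypothetical stand-in for a conditional masked LM.

    Returns a (token, confidence) pair for every position at once.
    Toy rule: the target is the reversed source, and confidence is higher
    when neighbouring positions are already unmasked.
    Assumes len(tokens) == len(source).
    """
    n = len(tokens)
    out = []
    for i in range(n):
        neighbours = [tokens[j] for j in (i - 1, i + 1) if 0 <= j < n]
        context = sum(tok != MASK for tok in neighbours)
        out.append((source[n - 1 - i], 0.5 + 0.2 * context + 0.01 * i))
    return out


def mask_predict(
    source: List[str],
    length: int,
    predict_fn: Callable[[List[str], List[str]], List[Tuple[str, float]]] = toy_predict,
    iterations: int = 3,
) -> List[str]:
    tokens = [MASK] * length
    confidences = [0.0] * length
    for t in range(iterations):
        predictions = predict_fn(source, tokens)  # one parallel model call
        # Fill every currently masked position; keep unmasked ones unchanged
        for i, (tok, conf) in enumerate(predictions):
            if tokens[i] == MASK:
                tokens[i], confidences[i] = tok, conf
        # Re-mask the least confident tokens; the count decays to zero
        n_mask = length * (iterations - 1 - t) // iterations
        for i in sorted(range(length), key=lambda k: confidences[k])[:n_mask]:
            tokens[i] = MASK
    return tokens
```

The number of model calls equals `iterations` regardless of output length, which is one of the knobs behind the speed and quality trade-off described above: more iterations cost more time but let later predictions condition on earlier ones.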

Key Papers

Gu et al 2017 - Non-Autoregressive Neural Machine Translation
Kaiser et al 2018 - Fast Decoding in Sequence Models Using Discrete Latent Variables
Ghazvininejad et al 2019 - Mask-Predict: Parallel Decoding of Conditional Masked Language Models
Lee et al 2020 - On the Discrepancy between Density Estimation and Sequence Generation
Zhou & Keung 2020 - Improving Non-autoregressive Neural Machine Translation with Monolingual Data
Gu & Kong 2020 - Fully Non-autoregressive Neural Machine Translation: Tricks of the Trade. Extensive experiments pushing the boundary of non-autoregressive methods toward autoregressive performance. Uses CTC loss.
Kasai et al 2020 - Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff in Machine Translation
Liu et al 2021 - Enriching Non-Autoregressive Transformer with Syntactic and Semantic Structures for Neural Machine Translation (not very good, not published. Changmao plans to redo this with AMR structures as input)
Saharia et al 2020 - Non-Autoregressive Machine Translation with Latent Alignments. From Google; uses CTC loss, as Gu & Kong 2020 also do.
Kasai et al 2020 - Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation (the same paper as the other Kasai et al 2020 entry in this list, under a different title)
Wu et al 2024 - Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration
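Several entries above mention CTC loss. As a minimal sketch of the idea behind it: the model predicts a sequence longer than the target in parallel, and the output is obtained by merging consecutive repeated symbols and then dropping blanks. The function name and the blank symbol below are illustrative assumptions.

```python
from typing import List


def ctc_collapse(frames: List[str], blank: str = "<blank>") -> List[str]:
    """Map a frame-level prediction to an output sequence.

    Consecutive repeats are merged first, then blank symbols are removed,
    so a blank between two identical tokens keeps both of them.
    """
    out: List[str] = []
    prev = None
    for tok in frames:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out
```

For example, the frames `a a <blank> a b b <blank>` collapse to `a a b`. Many different frame sequences collapse to the same output, and CTC training sums the probability over all of them, which is why the target length does not have to be fixed in advance.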

Papers

Santilli et al 2023 - Accelerating Transformer Inference for Translation via Parallel Decoding
