====== Non-Autoregressive Sequence-to-Sequence Models ======

Non-autoregressive seq2seq models produce outputs in parallel rather than one token at a time.

===== Autoregressive vs Non-Autoregressive =====

**Definition: Autoregressive ([[https://arxiv.org/pdf/1711.02281.pdf|Gu 2017]]):** An autoregressive model operates one step at a time, generating each token conditioned on the sequence of tokens previously generated. Examples of autoregressive models include RNNs, LSTMs, CNNs (with masked convolution layers), and Transformers (from [[https://arxiv.org/pdf/1711.02281.pdf|Gu 2017]]).

**Definition: Non-Autoregressive ([[https://arxiv.org/pdf/1711.02281.pdf|Gu 2017]]):** A non-autoregressive model removes the conditional dependence between output tokens and generates them all in parallel. See equation 3 of [[https://arxiv.org/pdf/1711.02281.pdf|Gu 2017]] for an example.

**Note:** There are also **global models** like [[ml:Conditional Random Field|Conditional Random Fields]], which are neither autoregressive nor non-autoregressive by the above definitions. Instead, they perform inference (using dynamic programming, MCMC, etc.) to maximize a global scoring function.

===== Summary =====

From [[https://www.aclweb.org/anthology/2020.acl-main.171.pdf|Zhou & Keung 2020 - Improving Non-autoregressive Neural Machine Translation with Monolingual Data]]:
Many non-autoregressive (NAR) translation methods have been proposed, including latent space models (Gu et al., 2017; Ma et al., 2019; Shu et al., 2019), iterative refinement methods (Lee et al., 2018; Ghazvininejad et al., 2019), and alternative loss functions (Libovicky and Helcl, 2018; Wang et al., 2019; Wei et al., 2019; Li et al., 2019; Shao et al., 2019). The decoding speedup for NAR models is typically 2-15× depending on the specific setup (e.g., the number of length candidates, number of latent samples, etc.), and NAR models can be tuned to achieve different trade-offs between time complexity and decoding quality (Gu et al., 2017; Wei et al., 2019; Ghazvininejad et al., 2019; Ma et al., 2019). All these methods are based on transformer modules (Vaswani et al., 2017), and depend on a well-trained AR model to obtain its output translations to create targets for NAR model training.

===== Key Papers =====

  * [[https://arxiv.org/pdf/1711.02281.pdf|Gu et al 2017 - Non-Autoregressive Neural Machine Translation]]
  * [[http://proceedings.mlr.press/v80/kaiser18a/kaiser18a.pdf|Kaiser et al 2018 - Fast Decoding in Sequence Models Using Discrete Latent Variables]]
  * **[[https://arxiv.org/pdf/1904.09324.pdf|Ghazvininejad et al 2019 - Mask-Predict: Parallel Decoding of Conditional Masked Language Models]]**
  * [[https://arxiv.org/pdf/2002.07233.pdf|Lee et al 2020 - On the Discrepancy between Density Estimation and Sequence Generation]]
  * [[https://www.aclweb.org/anthology/2020.acl-main.171.pdf|Zhou & Keung 2020 - Improving Non-autoregressive Neural Machine Translation with Monolingual Data]]
  * **[[https://arxiv.org/pdf/2012.15833.pdf|Gu & Kong 2020 - Fully Non-autoregressive Neural Machine Translation: Tricks of the Trade]]** Extensive experiments pushing the boundary of non-autoregressive methods toward autoregressive performance.
Uses CTC loss.
  * [[https://arxiv.org/pdf/2006.10369.pdf|Kasai et al 2020 - Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff in Machine Translation]]
  * [[https://arxiv.org/pdf/2101.08942.pdf|Liu et al 2021 - Enriching Non-Autoregressive Transformer with Syntactic and Semantic Structures for Neural Machine Translation]] (not very good; not published. Changmao plans to redo this with AMR structures as input)
  * **[[https://arxiv.org/pdf/2004.07437.pdf|Saharia et al 2020 - Non-Autoregressive Machine Translation with Latent Alignments]]** From Google; also uses a CTC loss (cf. Gu & Kong 2020)
  * [[https://arxiv.org/pdf/2404.12022|Wu et al 2024 - Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration]]

===== Papers =====

  * [[https://arxiv.org/pdf/2305.10427.pdf|Santilli et al 2023 - Accelerating Transformer Inference for Translation via Parallel Decoding]]
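The autoregressive/non-autoregressive distinction defined at the top of this page comes down to the factorization: autoregressive decoding needs a sequential loop because each token conditions on the generated prefix, while non-autoregressive decoding can fill every position independently given only the source. A minimal sketch of that dependency structure, where ''toy_logits'' is a made-up scoring function standing in for a trained model (not any real library API):

```python
def toy_logits(src, prefix, position):
    """Hypothetical scoring function standing in for a trained model.

    Returns a token id; the point here is only what it conditions on.
    """
    return (sum(src) + sum(prefix) + position) % 5


def decode_autoregressive(src, length):
    """Each token depends on all previously generated tokens -> sequential loop."""
    out = []
    for t in range(length):
        out.append(toy_logits(src, out, t))  # conditions on the prefix `out`
    return out


def decode_non_autoregressive(src, length):
    """Tokens are conditionally independent given src -> all positions at once."""
    return [toy_logits(src, [], t) for t in range(length)]  # empty prefix
```

Note that the non-autoregressive list comprehension could be executed in parallel across positions, which is exactly the property NAR models exploit for decoding speedups.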
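The Mask-Predict paper bolded above sits between the two extremes: it starts from a fully masked target, predicts all positions in parallel, then re-masks and re-predicts the lowest-confidence positions over a small constant number of iterations. A toy sketch of that loop, where ''predict'' is a hypothetical stand-in for a conditional masked language model (it returns fake tokens and confidences, not real model outputs):

```python
import random

MASK = "<mask>"


def predict(src, tgt):
    """Hypothetical masked LM: fills every masked slot in parallel.

    Returns (token, confidence) per position; unmasked tokens are kept.
    """
    rng = random.Random(0)  # fixed seed so the sketch is deterministic
    out = []
    for i, tok in enumerate(tgt):
        if tok == MASK:
            out.append((f"tok{i}", rng.random()))  # fake token + confidence
        else:
            out.append((tok, 1.0))
    return out


def mask_predict(src, length, iterations=3):
    tgt = [MASK] * length  # start from a fully masked target
    for it in range(iterations):
        preds = predict(src, tgt)
        tgt = [tok for tok, _ in preds]
        # Linear mask-decay schedule: fewer positions re-masked each round.
        n_mask = int(length * (1 - (it + 1) / iterations))
        if n_mask == 0:
            break
        # Re-mask the n_mask lowest-confidence positions for the next pass.
        worst = sorted(range(length), key=lambda i: preds[i][1])[:n_mask]
        for i in worst:
            tgt[i] = MASK
    return tgt
```

With a real conditional masked LM in place of ''predict'', this gives the constant-number-of-passes decoding described in the paper: each pass is fully parallel, and quality improves with more iterations.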