====== Sequence to Sequence Models ======

===== Decoding Strategies =====

See also [[Decoding]].

  * [[https://arxiv.org/pdf/1808.10006.pdf|Murray & Chiang 2018 - Correcting Length Bias in Neural Machine Translation]]
  * Nucleus Sampling: [[https://arxiv.org/pdf/1904.09751.pdf|Holtzman et al 2019 - The Curious Case of Neural Text Degeneration]]
  * [[https://arxiv.org/pdf/1908.04319.pdf|Welleck et al 2019 - Neural Text Generation with Unlikelihood Training]]
  * [[https://arxiv.org/pdf/1908.10090.pdf|Stahlberg & Byrne 2019 - On NMT Search Errors and Model Errors: Cat Got Your Tongue?]] An exact decoding method for seq2seq models. Follow-up work: [[https://arxiv.org/pdf/2012.13454.pdf|Shi et al 2020 - Why Neural Machine Translation Prefers Empty Outputs]]
  * [[https://arxiv.org/pdf/2010.02650.pdf|Meister et al 2020 - If beam search is the answer, what was the question?]]
  * [[https://arxiv.org/pdf/2102.02574.pdf|Hargreaves et al 2021 - Incremental Beam Manipulation for Natural Language Generation]]
  * **Diverse k-Best and Lattice Decoding**
    * [[https://arxiv.org/pdf/2112.07660.pdf|Xu & Durrett 2021 - Massive-scale Decoding for Text Generation using Lattices]] Produces a lattice of diverse generated outputs.
  * **Constrained Decoding**
    * [[https://arxiv.org/pdf/1704.01696.pdf|Yin & Neubig 2017 - A Syntactic Neural Model for General-Purpose Code Generation]]
    * [[https://aclanthology.org/2020.acl-main.677.pdf|Wang et al 2020 - RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers]]
    * [[https://arxiv.org/pdf/2109.05093.pdf|Scholak et al 2021 - PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models]]
    * [[https://arxiv.org/pdf/2201.11227.pdf|Poesia et al 2022 - Synchromesh: Reliable Code Generation from Pre-trained Language Models]] They created a tool that takes an ANTLR parser and a string and returns the set of valid next-token completions (see Sect. 3.1).
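The core mechanism behind these constrained decoders can be sketched in a few lines: at each step, ask a grammar checker (the role Synchromesh's completion engine or PICARD's incremental parser plays) for the set of valid next tokens, and mask everything else out of the model's distribution. This is a toy illustration, not any paper's actual implementation — the vocabulary, the fixed logit table, and the hand-written validity checker for flat arithmetic expressions are all invented for the example.

```python
import math

# Toy vocabulary and a stand-in "model" that returns fixed logits for the
# next token. A real system would query a neural LM here; this table is
# deliberately biased toward "+" so that UNconstrained greedy decoding
# would immediately produce ungrammatical output.
VOCAB = ["1", "+", "(", ")", "<eos>"]

def toy_logits(prefix):
    return {"1": 1.0, "+": 2.0, "(": 0.5, ")": 0.5, "<eos>": 0.0}

def valid_next_tokens(prefix):
    """Stand-in for a completion engine: allow only continuations that keep
    a simple arithmetic expression well-formed."""
    depth = prefix.count("(") - prefix.count(")")
    last = prefix[-1] if prefix else None
    valid = set()
    if last in (None, "+", "("):   # an operand must come next
        valid |= {"1", "("}
    else:                          # after "1" or ")": operator, close, or end
        valid.add("+")
        if depth > 0:
            valid.add(")")
        else:
            valid.add("<eos>")
    return valid

def constrained_greedy_decode(max_len=10):
    prefix = []
    for _ in range(max_len):
        logits = toy_logits(prefix)
        allowed = valid_next_tokens(prefix)
        # Mask: invalid tokens get -inf so they can never be selected.
        masked = {t: (logits[t] if t in allowed else -math.inf) for t in VOCAB}
        tok = max(masked, key=masked.get)
        if tok == "<eos>":
            break
        prefix.append(tok)
    return prefix
```

Despite the model preferring "+" at every step, the mask forces the output to alternate operands and operators ("1 + 1 + ..."), i.e. every prefix stays grammatical.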
    * **[[https://arxiv.org/pdf/2305.13971|Geng et al 2023 - Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning]]** An elegant solution: uses Grammatical Framework to constrain the outputs.
    * **[[https://arxiv.org/pdf/2403.06988|Beurer-Kellner et al 2024 - Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation]]**
  * **Parallel Decoding**
    * [[https://arxiv.org/pdf/2305.10427.pdf|Santilli et al 2023 - Accelerating Transformer Inference for Translation via Parallel Decoding]]
  * **Speculative Decoding**
    * Overviews
      * [[https://arxiv.org/pdf/2401.07851|Xia et al 2024 - Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding]]
      * [[https://arxiv.org/pdf/2405.13019|Khoshnoodi et al 2024 - A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models]]
    * [[https://arxiv.org/pdf/2211.17192|Leviathan et al 2022 - Fast Inference from Transformers via Speculative Decoding]]
    * [[https://arxiv.org/pdf/2404.11912|Sun et al 2024 - TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding]]
    * [[https://arxiv.org/pdf/2502.17421|Yang et al 2025 - LongSpec: Long-Context Speculative Decoding with Efficient Drafting and Verification]]
    * [[https://arxiv.org/pdf/2505.20776|Cha et al 2025 - SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences]]
  * **Miscellaneous Decoding Techniques**
    * Contrastive Decoding
      * [[https://aclanthology.org/2024.eacl-long.155.pdf|Waldendorf et al 2024 - Contrastive Decoding Reduces Hallucinations in Large Multilingual Machine Translation Models]]

===== Issues in Seq2Seq Models =====

==== Length Issues ====

  * [[https://www.aclweb.org/anthology/D16-1248.pdf|Shi et al 2016 - Why Neural Translations are the Right Length]]
  * [[https://arxiv.org/pdf/1808.10006.pdf|Murray & Chiang 2018 - Correcting Length Bias in Neural Machine Translation]]
  * [[https://arxiv.org/pdf/2012.13454.pdf|Shi et al 2020 - Why Neural Machine Translation Prefers Empty Outputs]] If you add a different EOS token for each length and do not apply label smoothing to the EOS tokens, the empty-output and length problems go away.
  * [[https://arxiv.org/pdf/2205.00659.pdf|Liang et al 2022 - The Implicit Length Bias of Label Smoothing on Beam Search Decoding]] Introduces a method for correcting the length bias induced by label smoothing, which improves translation quality.

==== Exposure Bias ====

See also this [[https://towardsdatascience.com/what-is-teacher-forcing-3da6217fed1c|post]]; references at the bottom.

  * [[https://arxiv.org/pdf/2005.03642.pdf|Wang & Sennrich 2020 - On Exposure Bias, Hallucination and Domain Shift in Neural Machine Translation]] Uses minimum risk training (i.e. a risk loss function), which shows a consistent improvement across models.
  * [[https://arxiv.org/pdf/1905.10617.pdf|He et al 2021 - Exposure Bias versus Self-Recovery: Are Distortions Really Incremental for Autoregressive Text Generation?]] Published [[https://aclanthology.org/2021.emnlp-main.415.pdf|here]].
  * [[https://aclanthology.org/2022.findings-acl.58.pdf|Arora et al 2022 - Why Exposure Bias Matters: An Imitation Learning Perspective of Error Accumulation in Language Generation]] Shows that exposure bias leads to an accumulation of errors during generation (repetition, etc.), and that perplexity does not capture this.
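The train/test mismatch behind exposure bias can be made concrete with a toy model. Everything below is invented for illustration: a deterministic "model" that maps the previous token to a predicted next token, with a single learned mistake. Under teacher forcing it conditions on the gold prefix, so one mistake costs one error; at inference it conditions on its own output, so the same mistake pushes it off-distribution and the errors accumulate (here, as degenerate repetition, the failure mode Arora et al discuss).

```python
# Deterministic next-token table the toy model learned, containing one
# mistake: after "b" it predicts "x" instead of the gold "c".
NEXT = {"<s>": "a", "a": "b", "b": "x", "c": "d", "d": "e", "e": "<eos>"}

def predict(token):
    # Off-distribution inputs (like "x") were never seen during training;
    # as a crude stand-in for degenerate behavior, the model repeats them.
    return NEXT.get(token, token)

GOLD = ["a", "b", "c", "d", "e"]

def teacher_forced_errors():
    """Training-time view: every prediction conditions on the gold prefix,
    so the single learned mistake causes exactly one error."""
    preds = [predict(t) for t in ["<s>"] + GOLD[:-1]]
    return sum(p != g for p, g in zip(preds, GOLD))

def free_running(n=5):
    """Inference-time view: the model conditions on its own outputs, so
    after the first mistake it never recovers."""
    out, tok = [], "<s>"
    for _ in range(n):
        tok = predict(tok)
        out.append(tok)
    return out
```

Teacher forcing reports a single wrong token, while free-running generation collapses into repeating "x" from the error onward — the same model, evaluated two ways, which is also why perplexity (a teacher-forced quantity) understates the problem.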
=== Scheduled Sampling ===

  * [[https://arxiv.org/pdf/1506.03099.pdf|Bengio et al 2015 - Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks]] Scheduled sampling attempts to avoid the exposure bias problem of teacher forcing by sampling predictions from the model, according to a schedule, during training.
  * Scheduled Sampling is actually DAgger: [[https://arxiv.org/pdf/1011.0686.pdf|Ross et al 2010 - A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning]] (see Graham Neubig's slides)
  * [[https://arxiv.org/pdf/1906.07651.pdf|Mihaylova & Martins 2019 - Scheduled Sampling for Transformers]]

===== Sequence to Sequence Model Variants =====

  * [[Noisy channel model]]: [[https://www.aclweb.org/anthology/D19-1571.pdf|Yee et al 2019 - Simple and Effective Noisy Channel Modeling for Neural Machine Translation]] "noisy channel models can outperform a direct model by up to 3.2 BLEU"
  * Fast Variants
    * [[https://arxiv.org/pdf/1705.03122.pdf|Gehring et al 2017 - Convolutional Sequence to Sequence Learning]] 10x faster. A very strong (high-BLEU) baseline is given in [[https://arxiv.org/pdf/1711.04956v5.pdf|Edunov et al 2017]].
  * Non-Autoregressive, see [[Non-Autoregressive Seq2seq]]

===== Datasets =====

  * Standard seq2seq datasets
    * WMT 2014 & 2016 (En-De and En-Fr)
    * Neural abstractive summarization ([[https://arxiv.org/pdf/1509.00685.pdf|Rush 2015]])
  * Easy dialog datasets
  * Some easy semantic parsing datasets? E2E dataset?

===== Misc Papers =====

  * [[https://arxiv.org/pdf/2002.07233.pdf|Lee et al 2020 - On the Discrepancy between Density Estimation and Sequence Generation]]

===== Related Pages =====

  * [[Generation]]
  * [[Label Bias Problem]]
  * [[Machine Translation]]
  * [[Non-Autoregressive Seq2seq]]
  * [[RNN]]
  * [[Summarization]]