Table of Contents
Sequence to Sequence Models
Decoding Strategies
Issues in Seq2Seq Models
Length Issues
Exposure Bias
Sequence to Sequence Model Variants
Datasets
Misc Papers
Related Pages
Sequence to Sequence Models
Decoding Strategies
See also Decoding.
Murray & Chiang 2018 - Correcting Length Bias in Neural Machine Translation
Nucleus Sampling:
Holtzman et al 2019 - The Curious Case of Neural Text Degeneration
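A minimal NumPy sketch of the top-p rule from the paper; the toy logits and the cutoff p=0.9 are illustrative:

```python
import numpy as np

def nucleus_sample(logits, p=0.9, rng=None):
    """Sample from the smallest token set whose cumulative probability
    exceeds p (the top-p rule of Holtzman et al 2019)."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax
    order = np.argsort(probs)[::-1]              # tokens by descending prob
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    nucleus = order[:cutoff]                     # smallest set with mass >= p
    return int(rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum()))

# Toy 5-token vocabulary; real logits come from the model's final layer.
print(nucleus_sample(np.array([2.0, 1.5, 0.3, -1.0, -2.0]), p=0.9))
```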
Welleck et al 2019 - Neural Text Generation with Unlikelihood Training
Stahlberg & Byrne 2019 - On NMT Search Errors and Model Errors: Cat Got Your Tongue?
Exact decoding method for seq2seq models; a minimal sketch of the pruned search is given after the follow-up list below. Follow-up work:
Shi et al 2020 - Why Neural Machine Translation Prefers Empty Outputs
Meister et al 2020 - If beam search is the answer, what was the question?
Hargreaves et al 2021 - Incremental Beam Manipulation for Natural Language Generation
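The pruning idea behind exact decoding: each extension adds a non-positive log-probability, so any prefix scoring no better than the best complete hypothesis found so far can be discarded. The `next_log_probs` interface and the toy distribution below are invented for illustration:

```python
import math

def exact_decode(next_log_probs, eos, max_len, lower_bound=-math.inf):
    """Depth-first search for the highest-scoring output sequence, with
    admissible pruning (the idea behind Stahlberg & Byrne 2019).
    next_log_probs(prefix) is assumed to return {token: log_prob}."""
    best = (lower_bound, None)

    def dfs(prefix, score):
        nonlocal best
        if score <= best[0]:
            return                      # prune: cannot beat best hypothesis
        if prefix and prefix[-1] == eos:
            best = (score, prefix)      # complete hypothesis
            return
        if len(prefix) >= max_len:
            return
        # Expand high-probability continuations first to tighten the bound.
        for tok, lp in sorted(next_log_probs(prefix).items(),
                              key=lambda kv: -kv[1]):
            dfs(prefix + [tok], score + lp)

    dfs([], 0.0)
    return best

# Toy 3-token model (0 = EOS); a degenerate table, purely for illustration.
table = {(): {1: math.log(0.6), 2: math.log(0.4)},
         (1,): {0: math.log(0.9), 2: math.log(0.1)},
         (2,): {0: math.log(1.0)},
         (1, 2): {0: math.log(1.0)}}
print(exact_decode(lambda p: table[tuple(p)], eos=0, max_len=3))
```

In practice the lower bound is seeded with a beam search result, which is what makes the search tractable.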
Diverse k-Best and Lattice Decoding
Xu & Durrett 2021 - Massive-scale Decoding for Text Generation using Lattices
Produces a lattice of diverse generated outputs.
Constrained Decoding
Yin & Neubig 2017 - A Syntactic Neural Model for General-Purpose Code Generation
Wang et al 2020 - RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers
Scholak et al 2021 - PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models
Poesia et al 2022 - Synchromesh: Reliable Code Generation from Pre-trained Language Models
They created a tool that takes an ANTLR parser and a string and returns the set of valid next-token completions (see Section 3.1).
Geng et al 2023 - Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning
Elegant solution. Uses Grammatical Framework to constrain the outputs.
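The common mechanics behind these systems: at each step, query a completion engine for the tokens the grammar allows next, mask everything else to -inf, then decode as usual. A toy sketch, where the grammar `a b* $` and the `valid_next` engine are invented stand-ins for the ANTLR/GF machinery:

```python
import numpy as np

VOCAB = ["a", "b", "$"]  # "$" = end of sequence

def valid_next(prefix):
    """Hypothetical completion engine: tokens allowed so that the final
    string matches  a b* $ ."""
    return {"a"} if prefix == "" else {"b", "$"}

def constrained_greedy_decode(logits_fn, max_len=10):
    """Mask grammar-invalid tokens before each argmax step."""
    out = ""
    for _ in range(max_len):
        logits = logits_fn(out).copy()
        allowed = valid_next(out)
        for i, tok in enumerate(VOCAB):
            if tok not in allowed:
                logits[i] = -np.inf      # token forbidden by the grammar
        tok = VOCAB[int(np.argmax(logits))]
        if tok == "$":
            break
        out += tok
    return out

# Toy "model" that prefers "a" everywhere; the mask still yields a legal string.
print(constrained_greedy_decode(lambda prefix: np.array([3.0, 2.0, -1.0 + len(prefix)])))
```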
Beurer-Kellner et al 2024 - Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation
Parallel Decoding
Santilli et al 2023 - Accelerating Transformer Inference for Translation via Parallel Decoding
Speculative Decoding
Overviews
Xia et al 2024 - Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
Khoshnoodi et al 2024 - A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models
Leviathan et al 2023 - Fast Inference from Transformers via Speculative Decoding
Sun et al 2024 - TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Yang et al 2025 - LongSpec: Long-Context Speculative Decoding with Efficient Drafting and Verification
Cha et al 2025 - SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences
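A minimal sketch of the accept/reject rule from Leviathan et al: a cheap draft model proposes gamma tokens, the expensive target model verifies them, and a rejected position is resampled from the residual distribution, so accepted tokens are distributed exactly as the target model. The context-free toy distributions stand in for real models:

```python
import numpy as np

def speculative_step(draft, target, gamma, rng):
    """One verification round: draft(ctx) and target(ctx) return
    next-token distributions for the draft and target models."""
    ctx, proposed = [], []
    for _ in range(gamma):                        # draft proposes gamma tokens
        q = draft(ctx + proposed)
        proposed.append(int(rng.choice(len(q), p=q)))
    out = []
    for tok in proposed:
        q, p = draft(ctx + out), target(ctx + out)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)                       # accept the draft token
        else:                                     # reject: resample residual
            resid = np.maximum(p - q, 0.0)
            out.append(int(rng.choice(len(resid), p=resid / resid.sum())))
            break
    return out

# Context-free toy distributions over 4 tokens (real models condition on ctx).
q = np.array([0.4, 0.3, 0.2, 0.1])
p = np.array([0.3, 0.3, 0.3, 0.1])
print(speculative_step(lambda ctx: q, lambda ctx: p, gamma=4,
                       rng=np.random.default_rng(0)))
```

The full algorithm also samples one bonus token from the target model when all gamma drafts are accepted; the sketch omits that bookkeeping.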
Miscellaneous Decoding Techniques
Contrastive Decoding
Waldendorf et al 2024 - Contrastive Decoding Reduces Hallucinations in Large Multilingual Machine Translation Models
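A sketch of the general expert-amateur form of contrastive decoding (following Li et al 2022, not necessarily this paper's exact variant): prefer tokens the expert scores much higher than the amateur, restricted to tokens the expert itself finds plausible. The toy distributions are invented:

```python
import numpy as np

def contrastive_step(expert_log_probs, amateur_log_probs, alpha=0.1):
    """One greedy contrastive-decoding step: maximize the expert-amateur
    log-prob gap over tokens with expert prob >= alpha * max expert prob."""
    plausible = expert_log_probs >= np.log(alpha) + expert_log_probs.max()
    scores = np.where(plausible,
                      expert_log_probs - amateur_log_probs,
                      -np.inf)
    return int(np.argmax(scores))

# Toy 4-token vocabulary; in practice both models share a tokenizer.
expert = np.log(np.array([0.50, 0.30, 0.15, 0.05]))
amateur = np.log(np.array([0.60, 0.10, 0.15, 0.15]))
print(contrastive_step(expert, amateur))   # picks token 1, not the amateur-favored 0
```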
Issues in Seq2Seq Models
Length Issues
Shi et al 2016 - Why Neural Translations are the Right Length
Murray & Chiang 2018 - Correcting Length Bias in Neural Machine Translation
Shi et al 2020 - Why Neural Machine Translation Prefers Empty Outputs
If you add a different EOS token for each length, and do not perform label smoothing on the EOS tokens, then the empty-sentence and length issues go away.
Liang et al 2022 - The Implicit Length Bias of Label Smoothing on Beam Search Decoding
Introduces a method for correcting the length bias induced by label smoothing, which improves translation quality.
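A toy illustration of the per-word reward correction studied in Murray & Chiang 2018: adding a bonus per output word counteracts beam search's preference for short hypotheses. The reward value and scores below are invented:

```python
def corrected_score(log_prob, length, word_reward=0.5):
    """Hypothesis score with a per-word reward (word_reward is
    illustrative; Murray & Chiang 2018 learn it from data)."""
    return log_prob + word_reward * length

# Invented scores: raw log p(y|x) favors the short hypothesis...
candidates = [("the cat", -2.0, 2), ("the cat sat on the mat", -3.5, 6)]
# ...but with the reward, the longer hypothesis wins.
print(max(candidates, key=lambda c: corrected_score(c[1], c[2]))[0])
```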
Exposure Bias
See also this post (references at the bottom).
Wang & Sennrich 2020 - On Exposure Bias, Hallucination and Domain Shift in Neural Machine Translation
Uses minimum risk training (i.e., a risk loss function), which shows a consistent improvement across models.
He et al 2021 - Exposure Bias versus Self-Recovery: Are Distortions Really Incremental for Autoregressive Text Generation?
Published here.
Arora et al 2022 - Why Exposure Bias Matters: An Imitation Learning Perspective of Error Accumulation in Language Generation
Shows that exposure bias leads to an accumulation of errors during generation (such as repetition), which perplexity does not capture.
Scheduled Sampling
Bengio et al 2015 - Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
Scheduled sampling attempts to avoid the exposure bias of teacher forcing by feeding the model its own sampled predictions during training, according to an annealing schedule; see the sketch at the end of this subsection.
Scheduled Sampling is actually DAgger:
Ross et al 2010 - A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
(see Graham Neubig's slides)
Mihaylova & Martins 2019 - Scheduled Sampling for Transformers
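A minimal sketch of scheduled sampling with the inverse-sigmoid decay from Bengio et al 2015; the toy model, gold sentence, and k value are illustrative:

```python
import math
import random

def scheduled_inputs(gold, predict, step, k=5.0, rng=random):
    """Decoder inputs for one training sequence: feed the gold token with
    probability epsilon, else the model's own previous prediction.
    epsilon follows the paper's inverse-sigmoid decay (k sets the speed)."""
    epsilon = k / (k + math.exp(step / k))   # prob of teacher forcing
    inputs, prev = [], "<s>"
    for gold_tok in gold:
        inputs.append(prev)
        prev = gold_tok if rng.random() < epsilon else predict(inputs)
    return inputs

# Toy "model" that always predicts "la"; late in training (step=20),
# epsilon is small and most inputs come from the model itself.
random.seed(0)
print(scheduled_inputs(["le", "chat", "dort"], lambda prefix: "la", step=20))
```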
Sequence to Sequence Model Variants
Noisy channel model:
Yee et al 2019 - Simple and Effective Noisy Channel Modeling for Neural Machine Translation
“noisy channel models can outperform a direct model by up to 3.2 BLEU”
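In this setup the direct model p(y|x) generates an n-best list, and each candidate y is rescored with a channel model p(x|y) and a language model p(y). A toy reranker; the weights are illustrative (Yee et al 2019 tune them on held-out data):

```python
def noisy_channel_score(direct_lp, channel_lp, lm_lp, lam1=1.0, lam2=0.3):
    """Rescore one candidate y: log p(y|x) + lam1*log p(x|y) + lam2*log p(y)."""
    return direct_lp + lam1 * channel_lp + lam2 * lm_lp

# Hypothetical n-best list: (candidate, log p(y|x), log p(x|y), log p(y)).
nbest = [("output A", -3.0, -6.0, -4.0),
         ("output B", -3.5, -4.0, -3.0)]
# "output B" wins: it better explains the source under the channel model.
print(max(nbest, key=lambda c: noisy_channel_score(*c[1:]))[0])
```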
Fast Variants
Gehring et al 2017 - Convolutional Sequence to Sequence Learning
10x faster. A very strong (high BLEU score) baseline is given in Edunov et al 2017.
Non-Autoregressive: see Non-Autoregressive Seq2seq.
Datasets
Standard seq2seq datasets
WMT 2014 & 2016 (En-De, and En-Fr)
Neural abstractive summarization (Rush 2015)
Easy dialog datasets
Some easy semantic parsing datasets? E2E dataset?
Misc Papers
Lee et al 2020 - On the Discrepancy between Density Estimation and Sequence Generation
Related Pages
Generation
Label Bias Problem
Machine Translation
Non-Autoregressive Seq2seq
RNN
Summarization