Table of Contents
Sequence to Sequence Models
Decoding Strategies
Issues in Seq2Seq Models
Length Issues
Exposure Bias
Sequence to Sequence Model Variants
Datasets
Misc Papers
Related Pages
Sequence to Sequence Models
Decoding Strategies
See also Decoding.
Murray & Chiang 2018 - Correcting Length Bias in Neural Machine Translation
Nucleus Sampling:
Holtzman et al 2019 - The Curious Case of Neural Text Degeneration
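A minimal NumPy sketch of the top-p rule from the paper; the toy logits and the cutoff p=0.9 are illustrative:

```python
import numpy as np

def nucleus_sample(logits, p=0.9, rng=None):
    """Sample from the smallest token set whose cumulative probability
    exceeds p (the top-p rule of Holtzman et al 2019)."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax
    order = np.argsort(probs)[::-1]              # tokens by descending prob
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    nucleus = order[:cutoff]                     # smallest set with mass >= p
    return int(rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum()))

# Toy 5-token vocabulary; real logits come from the model's final layer.
print(nucleus_sample(np.array([2.0, 1.5, 0.3, -1.0, -2.0]), p=0.9))
```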
Welleck et al 2019 - Neural Text Generation with Unlikelihood Training
Stahlberg & Byrne 2019 - On NMT Search Errors and Model Errors: Cat Got Your Tongue?
Exact decoding method for seq2seq models; a minimal sketch of the pruned search is given after the follow-up list below. Follow-up work:
Shi et al 2020 - Why Neural Machine Translation Prefers Empty Outputs
Meister et al 2020 - If beam search is the answer, what was the question?
Hargreaves et al 2021 - Incremental Beam Manipulation for Natural Language Generation
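The pruning idea behind exact decoding: each extension adds a non-positive log-probability, so any prefix scoring no better than the best complete hypothesis found so far can be discarded. The `next_log_probs` interface and the toy distribution below are invented for illustration:

```python
import math

def exact_decode(next_log_probs, eos, max_len, lower_bound=-math.inf):
    """Depth-first search for the highest-scoring output sequence, with
    admissible pruning (the idea behind Stahlberg & Byrne 2019).
    next_log_probs(prefix) is assumed to return {token: log_prob}."""
    best = (lower_bound, None)

    def dfs(prefix, score):
        nonlocal best
        if score <= best[0]:
            return                      # prune: cannot beat best hypothesis
        if prefix and prefix[-1] == eos:
            best = (score, prefix)      # complete hypothesis
            return
        if len(prefix) >= max_len:
            return
        # Expand high-probability continuations first to tighten the bound.
        for tok, lp in sorted(next_log_probs(prefix).items(),
                              key=lambda kv: -kv[1]):
            dfs(prefix + [tok], score + lp)

    dfs([], 0.0)
    return best

# Toy 3-token model (0 = EOS); a degenerate table, purely for illustration.
table = {(): {1: math.log(0.6), 2: math.log(0.4)},
         (1,): {0: math.log(0.9), 2: math.log(0.1)},
         (2,): {0: math.log(1.0)},
         (1, 2): {0: math.log(1.0)}}
print(exact_decode(lambda p: table[tuple(p)], eos=0, max_len=3))
```

In practice the lower bound is seeded with a beam search result, which is what makes the search tractable.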
Diverse k-Best and Lattice Decoding
Xu & Durrett 2021 - Massive-scale Decoding for Text Generation using Lattices
Produces a lattice of diverse generated outputs.
Constrained Decoding
Yin & Neubig 2017 - A Syntactic Neural Model for General-Purpose Code Generation
Wang et al 2020 - RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers
Scholak et al 2021 - PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models
Poesia et al 2022 - Synchromesh: Reliable Code Generation from Pre-trained Language Models
They created a tool that takes an ANTLR parser and a string and returns the set of valid next-token completions (see Section 3.1).
Geng et al 2023 - Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning
Elegant solution. Uses Grammatical Framework to constrain the outputs.
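The common mechanics behind these systems: at each step, query a completion engine for the tokens the grammar allows next, mask everything else to -inf, then decode as usual. A toy sketch, where the grammar `a b* $` and the `valid_next` engine are invented stand-ins for the ANTLR/GF machinery:

```python
import numpy as np

VOCAB = ["a", "b", "$"]  # "$" = end of sequence

def valid_next(prefix):
    """Hypothetical completion engine: tokens allowed so that the final
    string matches  a b* $ ."""
    return {"a"} if prefix == "" else {"b", "$"}

def constrained_greedy_decode(logits_fn, max_len=10):
    """Mask grammar-invalid tokens before each argmax step."""
    out = ""
    for _ in range(max_len):
        logits = logits_fn(out).copy()
        allowed = valid_next(out)
        for i, tok in enumerate(VOCAB):
            if tok not in allowed:
                logits[i] = -np.inf      # token forbidden by the grammar
        tok = VOCAB[int(np.argmax(logits))]
        if tok == "$":
            break
        out += tok
    return out

# Toy "model" that prefers "a" everywhere; the mask still yields a legal string.
print(constrained_greedy_decode(lambda prefix: np.array([3.0, 2.0, -1.0 + len(prefix)])))
```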
Beurer-Kellner et al 2024 - Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation
Parallel Decoding
Santilli et al 2023 - Accelerating Transformer Inference for Translation via Parallel Decoding
Speculative Decoding
Overviews
Xia et al 2024 - Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
Khoshnoodi et al 2024 - A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models
Leviathan et al 2023 - Fast Inference from Transformers via Speculative Decoding
Sun et al 2024 - TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Yang et al 2025 - LongSpec: Long-Context Speculative Decoding with Efficient Drafting and Verification
Cha et al 2025 - SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences
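A minimal sketch of the accept/reject rule from Leviathan et al: a cheap draft model proposes gamma tokens, the expensive target model verifies them, and a rejected position is resampled from the residual distribution, so accepted tokens are distributed exactly as the target model. The context-free toy distributions stand in for real models:

```python
import numpy as np

def speculative_step(draft, target, gamma, rng):
    """One verification round: draft(ctx) and target(ctx) return
    next-token distributions for the draft and target models."""
    ctx, proposed = [], []
    for _ in range(gamma):                        # draft proposes gamma tokens
        q = draft(ctx + proposed)
        proposed.append(int(rng.choice(len(q), p=q)))
    out = []
    for tok in proposed:
        q, p = draft(ctx + out), target(ctx + out)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)                       # accept the draft token
        else:                                     # reject: resample residual
            resid = np.maximum(p - q, 0.0)
            out.append(int(rng.choice(len(resid), p=resid / resid.sum())))
            break
    return out

# Context-free toy distributions over 4 tokens (real models condition on ctx).
q = np.array([0.4, 0.3, 0.2, 0.1])
p = np.array([0.3, 0.3, 0.3, 0.1])
print(speculative_step(lambda ctx: q, lambda ctx: p, gamma=4,
                       rng=np.random.default_rng(0)))
```

The full algorithm also samples one bonus token from the target model when all gamma drafts are accepted; the sketch omits that bookkeeping.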
Miscellaneous Decoding Techniques
Contrastive Decoding
Waldendorf et al 2024 - Contrastive Decoding Reduces Hallucinations in Large Multilingual Machine Translation Models
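A sketch of the general expert-amateur form of contrastive decoding (following Li et al 2022, not necessarily this paper's exact variant): prefer tokens the expert scores much higher than the amateur, restricted to tokens the expert itself finds plausible. The toy distributions are invented:

```python
import numpy as np

def contrastive_step(expert_log_probs, amateur_log_probs, alpha=0.1):
    """One greedy contrastive-decoding step: maximize the expert-amateur
    log-prob gap over tokens with expert prob >= alpha * max expert prob."""
    plausible = expert_log_probs >= np.log(alpha) + expert_log_probs.max()
    scores = np.where(plausible,
                      expert_log_probs - amateur_log_probs,
                      -np.inf)
    return int(np.argmax(scores))

# Toy 4-token vocabulary; in practice both models share a tokenizer.
expert = np.log(np.array([0.50, 0.30, 0.15, 0.05]))
amateur = np.log(np.array([0.60, 0.10, 0.15, 0.15]))
print(contrastive_step(expert, amateur))   # picks token 1, not the amateur-favored 0
```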
Issues in Seq2Seq Models
Length Issues
Shi et al 2016 - Why Neural Translations are the Right Length
Murray & Chiang 2018 - Correcting Length Bias in Neural Machine Translation
Shi et al 2020 - Why Neural Machine Translation Prefers Empty Outputs
If you add a different EOS token for each length, and do not perform label smoothing on the EOS tokens, then the empty-sentence and length issues go away.
Liang et al 2022 - The Implicit Length Bias of Label Smoothing on Beam Search Decoding
Introduces a method for correcting the length bias induced by label smoothing, which improves translation quality.
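A toy illustration of the per-word reward correction studied in Murray & Chiang 2018: adding a bonus per output word counteracts beam search's preference for short hypotheses. The reward value and scores below are invented:

```python
def corrected_score(log_prob, length, word_reward=0.5):
    """Hypothesis score with a per-word reward (word_reward is
    illustrative; Murray & Chiang 2018 learn it from data)."""
    return log_prob + word_reward * length

# Invented scores: raw log p(y|x) favors the short hypothesis...
candidates = [("the cat", -2.0, 2), ("the cat sat on the mat", -3.5, 6)]
# ...but with the reward, the longer hypothesis wins.
print(max(candidates, key=lambda c: corrected_score(c[1], c[2]))[0])
```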
Exposure Bias
See also this post (references at the bottom).
Wang & Sennrich 2020 - On Exposure Bias, Hallucination and Domain Shift in Neural Machine Translation
Uses minimum risk training (i.e., a risk loss function), which shows a consistent improvement across models.
He et al 2021 - Exposure Bias versus Self-Recovery: Are Distortions Really Incremental for Autoregressive Text Generation?
Published here.
Arora et al 2022 - Why Exposure Bias Matters: An Imitation Learning Perspective of Error Accumulation in Language Generation
Shows that exposure bias leads to an accumulation of errors during generation (such as repetition), which perplexity does not capture.
Scheduled Sampling
Bengio et al 2015 - Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
Scheduled sampling attempts to avoid the exposure bias of teacher forcing by feeding the model its own sampled predictions during training, according to an annealing schedule; see the sketch at the end of this subsection.
Scheduled Sampling is actually DAgger:
Ross et al 2010 - A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
(see Graham Neubig's slides)
Mihaylova & Martins 2019 - Scheduled Sampling for Transformers
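A minimal sketch of scheduled sampling with the inverse-sigmoid decay from Bengio et al 2015; the toy model, gold sentence, and k value are illustrative:

```python
import math
import random

def scheduled_inputs(gold, predict, step, k=5.0, rng=random):
    """Decoder inputs for one training sequence: feed the gold token with
    probability epsilon, else the model's own previous prediction.
    epsilon follows the paper's inverse-sigmoid decay (k sets the speed)."""
    epsilon = k / (k + math.exp(step / k))   # prob of teacher forcing
    inputs, prev = [], "<s>"
    for gold_tok in gold:
        inputs.append(prev)
        prev = gold_tok if rng.random() < epsilon else predict(inputs)
    return inputs

# Toy "model" that always predicts "la"; late in training (step=20),
# epsilon is small and most inputs come from the model itself.
random.seed(0)
print(scheduled_inputs(["le", "chat", "dort"], lambda prefix: "la", step=20))
```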
Sequence to Sequence Model Variants
Noisy channel model:
Yee et al 2019 - Simple and Effective Noisy Channel Modeling for Neural Machine Translation
“noisy channel models can outperform a direct model by up to 3.2 BLEU”
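In this setup the direct model p(y|x) generates an n-best list, and each candidate y is rescored with a channel model p(x|y) and a language model p(y). A toy reranker; the weights are illustrative (Yee et al 2019 tune them on held-out data):

```python
def noisy_channel_score(direct_lp, channel_lp, lm_lp, lam1=1.0, lam2=0.3):
    """Rescore one candidate y: log p(y|x) + lam1*log p(x|y) + lam2*log p(y)."""
    return direct_lp + lam1 * channel_lp + lam2 * lm_lp

# Hypothetical n-best list: (candidate, log p(y|x), log p(x|y), log p(y)).
nbest = [("output A", -3.0, -6.0, -4.0),
         ("output B", -3.5, -4.0, -3.0)]
# "output B" wins: it better explains the source under the channel model.
print(max(nbest, key=lambda c: noisy_channel_score(*c[1:]))[0])
```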
Fast Variants
Gehring et al 2017 - Convolutional Sequence to Sequence Learning
10x faster. A very strong (high BLEU score) baseline is given in Edunov et al 2017.
Non-Autoregressive: see Non-Autoregressive Seq2seq.
Datasets
Standard seq2seq datasets
WMT 2014 & 2016 (En-De, and En-Fr)
Neural abstractive summarization (Rush 2015)
Easy dialog datasets
Some easy semantic parsing datasets? E2E dataset?
Misc Papers
Lee et al 2020 - On the Discrepancy between Density Estimation and Sequence Generation
Related Pages
Generation
Label Bias Problem
Machine Translation
Non-Autoregressive Seq2seq
RNN
Summarization