====== Machine Translation ======

===== Overviews =====

For a reading list, see [[https://github.com/THUNLP-MT/MT-Reading-List|The Machine Translation Reading List]]

  * [[https://arxiv.org/pdf/1709.07809.pdf|Philipp Koehn's 2017 Draft Chapter on NMT]]
  * [[http://mt-class.org/jhu/assets/nmt-book.pdf|Philipp Koehn's draft book on NMT]]
  * [[https://www.amazon.com/Neural-Machine-Translation-Philipp-Koehn/dp/1108497322/ref=sr_1_1?dchild=1&keywords=Neural+Machine+Translation&qid=1617092140&sr=8-1|Philipp Koehn's 2020 Book - Neural Machine Translation]]
  * [[https://arxiv.org/pdf/1912.02047.pdf|Stahlberg 2019 - Neural Machine Translation: A Review and Survey]]
  * [[https://arxiv.org/pdf/2002.07526.pdf|Yang et al 2020 - A Survey of Deep Learning Techniques for Neural Machine Translation]]
  * [[https://arxiv.org/pdf/2012.15515.pdf|Tan et al 2020 - Neural Machine Translation: A Review of Methods, Resources, and Tools]]

===== Key Papers =====

  * [[https://arxiv.org/pdf/1409.0473.pdf|Bahdanau et al 2014 - Neural Machine Translation by Jointly Learning to Align and Translate]]
  * [[https://arxiv.org/pdf/1508.07909.pdf|Sennrich et al 2016 - Neural Machine Translation of Rare Words with Subword Units]]
  * [[https://arxiv.org/pdf/1706.03762.pdf|Vaswani et al 2017 - Attention Is All You Need]]

===== System Papers =====

  * [[https://arxiv.org/pdf/1609.08144.pdf|Wu et al 2016 - Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation]]
  * [[https://arxiv.org/pdf/1709.05820.pdf|Levin et al 2017 - Toward a Full-Scale Neural Machine Translation in Production: The Booking.com Use Case]]
  * [[https://arxiv.org/pdf/1703.03906.pdf|Britz et al 2017 - Massive Exploration of Neural Machine Translation Architectures]]
  * [[https://arxiv.org/pdf/1907.05019.pdf|Arivazhagan et al 2019 - Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges]] See the list of open problems. From Google's MT team; did they deploy this?
  * [[https://arxiv.org/pdf/1803.05567.pdf|Hassan et al 2018 - Achieving Human Parity on Automatic Chinese to English News Translation]]
  * [[https://research.facebook.com/file/585831413174038/No-Language-Left-Behind--Scaling-Human-Centered-Machine-Translation.pdf|Costa-jussà et al 2022 - No Language Left Behind: Scaling Human-Centered Machine Translation]] [[https://ai.facebook.com/blog/nllb-200-high-quality-machine-translation/|blog]] [[https://ai.facebook.com/research/no-language-left-behind/|website]] [[https://github.com/facebookresearch/fairseq/tree/nllb/?fbclid=IwAR1dOIBFelfGY48IJe0MgkUhJnqw3SP2y3O4VhlKs5-QM3dXuFRw4HIleZU|model]] [[https://nllb.metademolab.com/|demo]] A Transformer encoder-decoder model with a sparsely gated mixture of experts (50B parameters); smaller distilled versions are also available.

===== Baselines =====

  * [[https://arxiv.org/pdf/1706.09733.pdf|Denkowski & Neubig 2017 - Stronger Baselines for Trustable Results in Neural Machine Translation]]

===== Syntax in MT =====

  * [[http://www.statmt.org/wmt16/pdf/W16-2209.pdf|Sennrich & Haddow 2016 - Linguistic Input Features Improve Neural Machine Translation]]
  * [[https://arxiv.org/pdf/1702.01147.pdf|Nadejde et al 2017 - Predicting Target Language CCG Supertags Improves Neural Machine Translation]]
  * [[https://arxiv.org/pdf/1808.10267.pdf|Currey & Heafield 2018 - Multi-Source Syntactic Neural Machine Translation]]

===== Multilingual Translation =====

In multilingual translation, one system is built to translate between many language pairs (rather than just one).
  * [[https://arxiv.org/pdf/1903.00089.pdf|Aharoni et al 2019 - Massively Multilingual Neural Machine Translation]]
  * [[https://arxiv.org/pdf/2008.00401.pdf|Tang et al 2020 - Multilingual Translation with Extensible Multilingual Pretraining and Finetuning]]

===== Low-Resource =====

  * [[https://arxiv.org/pdf/1604.02201.pdf|Zoph et al 2016 - Transfer Learning for Low-Resource Neural Machine Translation]]
  * [[https://www.aclweb.org/anthology/N18-1032.pdf|Gu et al 2018 - Universal Neural Machine Translation for Extremely Low Resource Languages]]
  * Comparison of SMT vs NMT for low-resource MT
    * [[https://arxiv.org/pdf/1905.11901.pdf|Sennrich & Zhang 2019 - Revisiting Low-Resource Neural Machine Translation: A Case Study]]
    * [[https://www.aclweb.org/anthology/2020.lrec-1.325.pdf|Duh et al 2020 - Benchmarking Neural and Statistical Machine Translation on Low-Resource African Languages]]
  * [[https://research.facebook.com/file/585831413174038/No-Language-Left-Behind--Scaling-Human-Centered-Machine-Translation.pdf|Costa-jussà et al 2022 - No Language Left Behind: Scaling Human-Centered Machine Translation]] [[https://github.com/facebookresearch/flores|dataset]] [[https://ai.facebook.com/blog/nllb-200-high-quality-machine-translation/|blog]] [[https://ai.facebook.com/research/no-language-left-behind/|website]] [[https://github.com/facebookresearch/fairseq/tree/nllb/?fbclid=IwAR1dOIBFelfGY48IJe0MgkUhJnqw3SP2y3O4VhlKs5-QM3dXuFRw4HIleZU|model]] [[https://nllb.metademolab.com/|demo]] A Transformer encoder-decoder model with a sparsely gated mixture of experts (50B parameters); smaller distilled versions are also available.
  * [[https://aclanthology.org/2022.wmt-1.73.pdf|Marco & Fraser 2022 - Findings of the WMT 2022 Shared Tasks in Unsupervised MT and Very Low Resource Supervised MT]]

===== Character-Level =====

  * [[https://arxiv.org/pdf/1808.09943.pdf|Cherry et al 2018 - Revisiting Character-Based Neural Machine Translation with Capacity and Compression]]

===== Domain Adaptation =====

See also [[Domain Adaptation]].

  * Surveys
    * [[https://arxiv.org/pdf/1806.00258.pdf|Chu & Wang 2018 - A Survey of Domain Adaptation for Neural Machine Translation]]
  * [[https://www.aclweb.org/anthology/N19-1209.pdf|Thompson et al 2019 - Overcoming Catastrophic Forgetting During Domain Adaptation of Neural Machine Translation]]

===== Pretraining =====

  * [[https://arxiv.org/pdf/1804.06323.pdf|Qi et al 2018 - When and Why are Pre-trained Word Embeddings Useful for Neural Machine Translation?]]
  * [[https://arxiv.org/pdf/2002.06823.pdf|Zhu et al 2020 - Incorporating BERT into Neural Machine Translation]]
  * [[https://arxiv.org/pdf/2008.00401.pdf|Tang et al 2020 - Multilingual Translation with Extensible Multilingual Pretraining and Finetuning]]

===== Unsupervised =====

  * [[https://arxiv.org/pdf/1711.00043.pdf|Lample et al 2017 - Unsupervised Machine Translation Using Monolingual Corpora Only]]
  * [[https://arxiv.org/pdf/1906.06718.pdf|Luo et al 2019 - Neural Decipherment via Minimum-Cost Flow: from Ugaritic to Linear B]]
  * [[https://arxiv.org/pdf/1905.02450.pdf|Song et al 2019 - MASS: Masked Sequence to Sequence Pre-training for Language Generation]] [[https://github.com/microsoft/MASS|github]]
  * [[https://arxiv.org/pdf/2004.05516.pdf|Marchisio et al 2020 - When Does Unsupervised Machine Translation Work?]]
  * [[https://arxiv.org/pdf/2106.15818.pdf|Marchisio et al 2021 - On Systematic Style Differences between Unsupervised and Supervised MT and an Application for High-Resource Machine Translation]]
  * [[https://aclanthology.org/2022.wmt-1.73.pdf|Marco & Fraser 2022 - Findings of the WMT 2022 Shared Tasks in Unsupervised MT and Very Low Resource Supervised MT]]
  * [[https://arxiv.org/pdf/2310.10385.pdf|Tan & Monz 2023 - Towards a Better Understanding of Variations in Zero-Shot Neural Machine Translation Performance]]

===== Sentence Alignment =====

Before an MT system can be trained, the sentences in the parallel documents need to be aligned to create sentence pairs.

  * [[https://www.aclweb.org/anthology/D19-1136.pdf|Thompson & Koehn 2019 - Vecalign: Improved Sentence Alignment in Linear Time and Space]]
  * Mining parallel sentences: some of these methods can be used to mine parallel sentences from large collections of documents
    * [[https://github.com/facebookresearch/LASER|LASER]]
    * [[https://arxiv.org/pdf/1812.10464.pdf|Artetxe & Schwenk 2019 - Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond]] [[https://github.com/facebookresearch/LASER|github]] Used in NLLB
    * [[https://arxiv.org/pdf/2205.12654.pdf|Heffernan et al 2022 - Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages]]

===== Statistical MT =====

See also [[Statistical Machine Translation]]. Recent papers related to SMT:

  * [[https://www.aclweb.org/anthology/E17-1100.pdf|Toral & Sánchez-Cartagena 2017 - A Multifaceted Evaluation of Neural versus Phrase-Based Machine Translation for 9 Language Directions]]

===== Evaluation =====

For an overview, see [[http://www.phontron.com/class/mtandseq2seq2018/assets/slides/mt-fall2018.chapter11.pdf|Evaluating MT Systems]].

=== Papers ===

See also the metrics task at WMT each year, which measures the correlation of automatic metrics with human evaluations.
  * BLEU: [[https://aclanthology.org/P02-1040.pdf|Papineni et al 2002 - BLEU: a Method for Automatic Evaluation of Machine Translation]]
  * METEOR: [[https://aclanthology.org/W05-0909.pdf|Banerjee & Lavie 2005 - METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments]] and [[https://aclanthology.org/W07-0734.pdf|Lavie & Agarwal 2007]], [[https://www.cs.cmu.edu/~alavie/METEOR/|download]]
  * TER: [[https://aclanthology.org/2006.amta-papers.25.pdf|Snover et al 2006 - A Study of Translation Edit Rate with Targeted Human Annotation]]
  * Multi-lingual METEOR: [[https://aclanthology.org/W14-3348.pdf|Denkowski & Lavie 2014 - Meteor Universal: Language Specific Translation Evaluation for Any Target Language]], [[https://www.cs.cmu.edu/~alavie/METEOR/|download]]
  * chrF: [[https://aclanthology.org/W15-3049.pdf|Popovic 2015 - chrF: character n-gram F-score for automatic MT evaluation]]
  * chrF++: [[https://www.statmt.org/wmt17/pdf/WMT70.pdf|Popovic 2017 - chrF++: words helping character n-grams]], [[https://github.com/m-popovic/chrF|github]]
  * [[https://arxiv.org/pdf/1804.08771.pdf|Post 2018 - A Call for Clarity in Reporting BLEU Scores]]
  * BERTScore: [[https://arxiv.org/abs/1904.09675|Zhang et al 2019 - BERTScore: Evaluating Text Generation with BERT]]
  * [[https://arxiv.org/pdf/2004.06063.pdf|Freitag et al 2020 - BLEU might be Guilty but References are not Innocent]]
  * [[https://arxiv.org/pdf/2106.15195.pdf|Marie et al 2021 - Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers]]
  * [[https://arxiv.org/pdf/2310.10482.pdf|Guerreiro et al 2023 - xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection]]
  * [[https://arxiv.org/pdf/2302.14520|Kocmi & Federmann 2023 - Large Language Models Are State-of-the-Art Evaluators of Translation Quality]]
  * **Evaluation of Metrics**
    * **[[https://aclanthology.org/2021.tacl-1.87.pdf|Freitag et al 2021 - Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation]]** Used in [[https://www2.statmt.org/wmt24/metrics-task.html|WMT]]

=== BLEU ===

Note that BLEU is a corpus-level metric: averaging BLEU scores computed at the sentence level will not give the same result as corpus-level BLEU. Corpus-level BLEU is the standard one reported in papers.

Notes: To assess length effects (translations being too short), people often report the brevity penalty (BP) computed when calculating BLEU. Most BLEU evaluation scripts report this number as ''BP = ''.

  * [[https://github.com/mjpost/sacrebleu|SacreBLEU]] (recommended) [[https://arxiv.org/pdf/1804.08771.pdf|paper]]
    * Internally uses [[https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v13a.pl|mteval-v13a.pl]] as the tokenizer
    * If you want to simulate SacreBLEU evaluation, but with statistical significance, you can use the [[https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v13a.pl|mteval-v13a.pl]] script to tokenize your output and references, and then use [[https://github.com/jhclark/multeval|MultEval]]
  * [[https://github.com/neulab/compare-mt|Compare-MT]] can analyze the differences between two systems and compute statistical significance. [[https://www.aclweb.org/anthology/N19-4007.pdf|paper]]
  * Historical: Moses's [[https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl|multi-bleu.perl]]
  * Jon Clark's [[https://github.com/jhclark/multeval|MultEval]] does automatic bootstrap resampling to compute statistical significance. [[http://www.cs.cmu.edu/~jhclark/pubs/significance.pdf|paper]]

===== Datasets =====

==== Papers About Corpus Collection ====

  * [[https://aclanthology.org/J03-3002.pdf|Resnik & Smith 2003 - The Web as a Parallel Corpus]] - The foundational paper about collecting parallel data from the web.
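Web-crawled collections like the one above are typically turned into training bitext by mining sentence pairs: embed the sentences on each side with a multilingual encoder (e.g. LASER, listed under Sentence Alignment) and score candidate pairs. The sketch below illustrates the margin-based scoring idea from Artetxe & Schwenk 2019; the hand-made toy "embeddings" and the ''margin_scores'' helper are illustrative assumptions, not any library's API.

```python
import numpy as np

def margin_scores(src_emb, tgt_emb, k=2):
    """Ratio margin score between all source/target sentence embeddings.

    score(x, y) = cos(x, y) / (mean cos to the k nearest neighbours of x and y),
    which down-weights "hub" sentences that are close to everything.
    """
    # Normalise rows so dot products are cosine similarities.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T  # (n_src, n_tgt) cosine similarity matrix

    # Mean similarity of each sentence to its k nearest neighbours on the other side.
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # per source sentence
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # per target sentence
    return sim / ((knn_src[:, None] + knn_tgt[None, :]) / 2.0)

# Toy embeddings: source sentence i should pair with target sentence i.
src = np.eye(3)
tgt = np.array([[0.9, 0.1, 0.0],
                [0.1, 0.9, 0.1],
                [0.0, 0.1, 0.9]])

pairs = margin_scores(src, tgt).argmax(axis=1)  # best target per source
print(pairs)  # [0 1 2]
```

In a real pipeline the embeddings come from a pretrained multilingual encoder and nearest neighbours are found with approximate search (e.g. FAISS) rather than a dense similarity matrix, since the candidate pool is billions of sentences.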
==== Standard Datasets ====

  * WMT 2014 En-Fr, etc.
  * [[http://www.statmt.org/wmt16/translation-task.html|WMT 2016]]
    * Nice scripts to download and preprocess: [[https://github.com/tensorflow/nmt/blob/master/nmt/scripts/wmt16_en_de.sh|wmt16_en_de.sh]]
  * [[https://opus.nlpl.eu/|Opus - The Open Parallel Corpus]]

==== Datasets for Small-Scale Experiments ====

  * [[https://wit3.fbk.eu/2013-01|IWSLT 2013 MT Datasets]] English-French (200K sentence pairs), used for example [[https://arxiv.org/pdf/1911.01986.pdf|here]].
  * [[https://wit3.fbk.eu/2014-01|IWSLT 2014]] English-German (160K sentence pairs), used for example [[https://arxiv.org/pdf/1911.01986.pdf|here]].
  * [[https://www.cs.cmu.edu/~ark/global-voices/|Malagasy-English dataset]] (80K sentence pairs) Malagasy is a morphologically rich language (WARNING: hasn't been used in a while, no recent neural models to compare to)

==== Low-Resource Datasets ====

  * [[https://arxiv.org/pdf/1902.01382.pdf|Guzmán et al 2019]] [[https://github.com/facebookresearch/flores/tree/master/floresv1|dataset]] Four language pairs: Nepali-English, Sinhala-English, Khmer-English, Pashto-English
  * [[http://www.cs.cmu.edu/~ark/global-voices/|Malagasy-English dataset]] (Jeff recommends)
  * [[https://github.com/ldmt-muri/muri-data|LDMT MURI Data]] (ask Jeff for it, he has access)
  * [[https://github.com/facebookresearch/flores|Flores-101 dataset]] Paper: [[https://scontent-sjc3-1.xx.fbcdn.net/v/t39.8562-6/196203317_1861942553982349_5142503689226033347_n.pdf?_nc_cat=110&ccb=1-5&_nc_sid=ae5e01&_nc_ohc=5qR8k_1OO8UAX_B9sJK&_nc_ht=scontent-sjc3-1.xx&oh=ce279e1cc9b9734be2b394c5cfe2cceb&oe=6166924D|Goyal et al 2021]] 3001 sentences translated into 101 languages
  * [[https://github.com/facebookresearch/flores|Flores-200 dataset]] Paper: [[https://arxiv.org/pdf/2207.04672.pdf|Costa-jussà et al 2022]]
  * [[https://github.com/ZhangShiyue/ChrEn|Cherokee-English dataset]] Recommended (recent, 2020)

==== Large Datasets ====

  * [[https://arxiv.org/pdf/1911.06154.pdf|El-Kishky et al 2019 - CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs]]

===== Software =====

See also [[https://arxiv.org/pdf/2012.15515.pdf|Tan et al 2020]].

  * FairSeq
  * OpenNMT
  * Sockeye
  * Nematus

===== Resources =====

  * Conferences and Workshops
    * [[https://aclanthology.org/venues/wmt/|WMT]] (Workshop on Machine Translation, now Conference on Machine Translation)
  * Books
    * [[https://www.amazon.com/Neural-Machine-Translation-Philipp-Koehn/dp/1108497322/ref=sr_1_1?dchild=1&keywords=Neural+Machine+Translation&qid=1617092140&sr=8-1|Koehn 2020 - Neural Machine Translation]]
  * Wikis
    * [[http://www.statmt.org/survey/|MT Research Survey Wiki]] Covers neural methods as well
  * Bibliographies
    * [[https://github.com/THUNLP-MT/MT-Reading-List|MT Reading List (constantly updated)]]

===== People =====

  * [[https://scholar.google.com/citations?user=phgBJXYAAAAJ&hl=en|Wilker Aziz]]
  * [[https://scholar.google.com/citations?user=iPAX6jcAAAAJ&hl=en|Marine Carpuat]]
  * [[https://scholar.google.com/citations?user=dok0514AAAAJ&hl=en|David Chiang]]
  * [[https://scholar.google.com/citations?user=dLaR9lgAAAAJ&hl=en|Orhan Firat]]
  * [[https://scholar.google.com/citations?user=VEJE37AAAAAJ&hl=en|Kenneth Heafield]]
  * [[https://scholar.google.com/citations?user=d7PTaOYAAAAJ&hl=en|Kevin Knight]]
  * [[https://scholar.google.com/citations?user=OsIZgIYAAAAJ&hl=en|Philipp Koehn]]
  * [[https://scholar.google.com/citations?user=wlosgkoAAAAJ&hl=en|Graham Neubig]]
  * [[https://scholar.google.com/citations?user=4w7LhxsAAAAJ&hl=en|Matt Post]]
  * [[https://scholar.google.com/citations?user=XTpJvCgAAAAJ&hl=en|Rico Sennrich]]

===== Related Pages =====

  * [[Cross-Lingual Transfer]]
  * [[Multilinguality]]
  * [[Noisy Channel Model]]
  * [[Seq2seq]]
  * [[ml:Software]]
  * [[Statistical Machine Translation]]