====== Machine Translation ======

===== Overviews =====

For a reading list, see [[https://github.com/THUNLP-MT/MT-Reading-List|The Machine Translation Reading List]]

  * [[https://arxiv.org/pdf/1709.07809.pdf|Philipp Koehn's 2017 Draft Chapter on NMT]]
  * [[http://mt-class.org/jhu/assets/nmt-book.pdf|Philipp Koehn's draft book on NMT]]
  * [[https://www.amazon.com/Neural-Machine-Translation-Philipp-Koehn/dp/1108497322/ref=sr_1_1?dchild=1&keywords=Neural+Machine+Translation&qid=1617092140&sr=8-1|Philipp Koehn's 2020 Book - Neural Machine Translation]]
  * [[https://arxiv.org/pdf/1912.02047.pdf|Stahlberg 2019 - Neural Machine Translation: A Review and Survey]]
  * [[https://arxiv.org/pdf/2002.07526.pdf|Yang et al 2020 - A Survey of Deep Learning Techniques for Neural Machine Translation]]
  * [[https://arxiv.org/pdf/2012.15515.pdf|Tan et al 2020 - Neural Machine Translation: A Review of Methods, Resources, and Tools]]

===== Key Papers =====

  * [[https://arxiv.org/pdf/1409.0473.pdf|Bahdanau et al 2014 - Neural Machine Translation by Jointly Learning to Align and Translate]]
  * [[https://arxiv.org/pdf/1508.07909.pdf|Sennrich et al 2016 - Neural Machine Translation of Rare Words with Subword Units]]
  * [[https://arxiv.org/pdf/1706.03762.pdf|Vaswani et al 2017 - Attention Is All You Need]]

===== System Papers =====

  * [[https://arxiv.org/pdf/1609.08144.pdf|Wu et al 2016 - Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation]]
  * [[https://arxiv.org/pdf/1709.05820.pdf|Levin et al 2017 - Toward a Full-Scale Neural Machine Translation in Production: The Booking.com Use Case]]
  * [[https://arxiv.org/pdf/1703.03906.pdf|Britz et al 2017 - Massive Exploration of Neural Machine Translation Architectures]]
  * [[https://arxiv.org/pdf/1907.05019.pdf|Arivazhagan et al 2019 - Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges]] See the list of open problems. From Google's MT team; did they deploy this?
  * [[https://arxiv.org/pdf/1803.05567.pdf|Hassan et al 2018 - Achieving Human Parity on Automatic Chinese to English News Translation]]
  * [[https://research.facebook.com/file/585831413174038/No-Language-Left-Behind--Scaling-Human-Centered-Machine-Translation.pdf|Costa-jussà et al 2022 - No Language Left Behind: Scaling Human-Centered Machine Translation]] [[https://ai.facebook.com/blog/nllb-200-high-quality-machine-translation/|blog]] [[https://ai.facebook.com/research/no-language-left-behind/|website]] [[https://github.com/facebookresearch/fairseq/tree/nllb/?fbclid=IwAR1dOIBFelfGY48IJe0MgkUhJnqw3SP2y3O4VhlKs5-QM3dXuFRw4HIleZU|model]] [[https://nllb.metademolab.com/|demo]] A Transformer encoder-decoder model with a sparsely gated mixture of experts (50B parameters); smaller distilled versions are also available.

===== Baselines =====

  * [[https://arxiv.org/pdf/1706.09733.pdf|Denkowski & Neubig 2017 - Stronger Baselines for Trustable Results in Neural Machine Translation]]

===== Syntax in MT =====

  * [[http://www.statmt.org/wmt16/pdf/W16-2209.pdf|Sennrich & Haddow 2016 - Linguistic Input Features Improve Neural Machine Translation]]
  * [[https://arxiv.org/pdf/1702.01147.pdf|Nadejde et al 2017 - Predicting Target Language CCG Supertags Improves Neural Machine Translation]]
  * [[https://arxiv.org/pdf/1808.10267.pdf|Currey & Heafield 2018 - Multi-Source Syntactic Neural Machine Translation]]

===== Multilingual Translation =====

In multilingual translation, one system is built to translate between many language pairs (rather than just one).
  * [[https://arxiv.org/pdf/1903.00089.pdf|Aharoni et al 2019 - Massively Multilingual Neural Machine Translation]]
  * [[https://arxiv.org/pdf/2008.00401.pdf|Tang et al 2020 - Multilingual Translation with Extensible Multilingual Pretraining and Finetuning]]

===== Low-Resource =====

  * [[https://arxiv.org/pdf/1604.02201.pdf|Zoph et al 2016 - Transfer Learning for Low-Resource Neural Machine Translation]]
  * [[https://www.aclweb.org/anthology/N18-1032.pdf|Gu et al 2018 - Universal Neural Machine Translation for Extremely Low Resource Languages]]
  * Comparison of SMT vs NMT for low-resource MT
    * [[https://arxiv.org/pdf/1905.11901.pdf|Sennrich & Zhang 2019 - Revisiting Low-Resource Neural Machine Translation: A Case Study]]
    * [[https://www.aclweb.org/anthology/2020.lrec-1.325.pdf|Duh et al 2020 - Benchmarking Neural and Statistical Machine Translation on Low-Resource African Languages]]
  * [[https://research.facebook.com/file/585831413174038/No-Language-Left-Behind--Scaling-Human-Centered-Machine-Translation.pdf|Costa-jussà et al 2022 - No Language Left Behind: Scaling Human-Centered Machine Translation]] [[https://github.com/facebookresearch/flores|dataset]] [[https://ai.facebook.com/blog/nllb-200-high-quality-machine-translation/|blog]] [[https://ai.facebook.com/research/no-language-left-behind/|website]] [[https://github.com/facebookresearch/fairseq/tree/nllb/?fbclid=IwAR1dOIBFelfGY48IJe0MgkUhJnqw3SP2y3O4VhlKs5-QM3dXuFRw4HIleZU|model]] [[https://nllb.metademolab.com/|demo]] A Transformer encoder-decoder model with a sparsely gated mixture of experts (50B parameters); smaller distilled versions are also available.
  * [[https://aclanthology.org/2022.wmt-1.73.pdf|Marco & Fraser 2022 - Findings of the WMT 2022 Shared Tasks in Unsupervised MT and Very Low Resource Supervised MT]]

===== Character-Level =====

  * [[https://arxiv.org/pdf/1808.09943.pdf|Cherry et al 2018 - Revisiting Character-Based Neural Machine Translation with Capacity and Compression]]

===== Domain Adaptation =====

See also [[Domain Adaptation]].

  * Surveys
    * [[https://arxiv.org/pdf/1806.00258.pdf|Chu & Wang 2018 - A Survey of Domain Adaptation for Neural Machine Translation]]
  * [[https://www.aclweb.org/anthology/N19-1209.pdf|Thompson et al 2019 - Overcoming Catastrophic Forgetting During Domain Adaptation of Neural Machine Translation]]

===== Pretraining =====

  * [[https://arxiv.org/pdf/1804.06323.pdf|Qi et al 2018 - When and Why are Pre-trained Word Embeddings Useful for Neural Machine Translation?]]
  * [[https://arxiv.org/pdf/2002.06823.pdf|Zhu et al 2020 - Incorporating BERT into Neural Machine Translation]]
  * [[https://arxiv.org/pdf/2008.00401.pdf|Tang et al 2020 - Multilingual Translation with Extensible Multilingual Pretraining and Finetuning]]

===== Unsupervised =====

  * [[https://arxiv.org/pdf/1711.00043.pdf|Lample et al 2017 - Unsupervised Machine Translation Using Monolingual Corpora Only]]
  * [[https://arxiv.org/pdf/1906.06718.pdf|Luo et al 2019 - Neural Decipherment via Minimum-Cost Flow: from Ugaritic to Linear B]]
  * [[https://arxiv.org/pdf/1905.02450.pdf|Song et al 2019 - MASS: Masked Sequence to Sequence Pre-training for Language Generation]] [[https://github.com/microsoft/MASS|github]]
  * [[https://arxiv.org/pdf/2004.05516.pdf|Marchisio et al 2020 - When Does Unsupervised Machine Translation Work?]]
  * [[https://arxiv.org/pdf/2106.15818.pdf|Marchisio et al 2021 - On Systematic Style Differences between Unsupervised and Supervised MT and an Application for High-Resource Machine Translation]]
  * [[https://aclanthology.org/2022.wmt-1.73.pdf|Marco & Fraser 2022 - Findings of the WMT 2022 Shared Tasks in Unsupervised MT and Very Low Resource Supervised MT]]
  * [[https://arxiv.org/pdf/2310.10385.pdf|Tan & Monz 2023 - Towards a Better Understanding of Variations in Zero-Shot Neural Machine Translation Performance]]

===== Sentence Alignment =====

Before an MT system can be trained, the sentences in the parallel documents need to be aligned to create sentence pairs.

  * [[https://www.aclweb.org/anthology/D19-1136.pdf|Thompson & Koehn 2019 - Vecalign: Improved Sentence Alignment in Linear Time and Space]]
  * Mining parallel sentences: some of these methods can be used to mine parallel sentences from large collections of documents
    * [[https://github.com/facebookresearch/LASER|LASER]]
    * [[https://arxiv.org/pdf/1812.10464.pdf|Artetxe & Schwenk 2019 - Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond]] [[https://github.com/facebookresearch/LASER|github]] Used in NLLB
    * [[https://arxiv.org/pdf/2205.12654.pdf|Heffernan et al 2022 - Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages]]

===== Statistical MT =====

See also [[Statistical Machine Translation]]. Recent papers related to SMT:

  * [[https://www.aclweb.org/anthology/E17-1100.pdf|Toral & Sánchez-Cartagena 2017 - A Multifaceted Evaluation of Neural versus Phrase-Based Machine Translation for 9 Language Directions]]

===== Evaluation =====

For an overview, see [[http://www.phontron.com/class/mtandseq2seq2018/assets/slides/mt-fall2018.chapter11.pdf|Evaluating MT Systems]].

=== Papers ===

See also the metrics task at WMT each year, which measures the correlation of automatic metrics with human evaluations.
  * BLEU: [[https://aclanthology.org/P02-1040.pdf|Papineni et al 2002 - BLEU: a Method for Automatic Evaluation of Machine Translation]]
  * METEOR: [[https://aclanthology.org/W05-0909.pdf|Banerjee & Lavie 2005 - METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments]] and [[https://aclanthology.org/W07-0734.pdf|Lavie & Agarwal 2007]], [[https://www.cs.cmu.edu/~alavie/METEOR/|download]]
  * TER: [[https://aclanthology.org/2006.amta-papers.25.pdf|Snover et al 2006 - A Study of Translation Edit Rate with Targeted Human Annotation]]
  * Multi-lingual METEOR: [[https://aclanthology.org/W14-3348.pdf|Denkowski & Lavie 2014 - Meteor Universal: Language Specific Translation Evaluation for Any Target Language]], [[https://www.cs.cmu.edu/~alavie/METEOR/|download]]
  * chrF: [[https://aclanthology.org/W15-3049.pdf|Popovic 2015 - chrF: character n-gram F-score for automatic MT evaluation]]
  * chrF++: [[https://www.statmt.org/wmt17/pdf/WMT70.pdf|Popovic 2017 - chrF++: words helping character n-grams]], [[https://github.com/m-popovic/chrF|github]]
  * [[https://arxiv.org/pdf/1804.08771.pdf|Post 2018 - A Call for Clarity in Reporting BLEU Scores]]
  * BERTScore: [[https://arxiv.org/abs/1904.09675|Zhang et al 2019 - BERTScore: Evaluating Text Generation with BERT]]
  * [[https://arxiv.org/pdf/2004.06063.pdf|Freitag et al 2020 - BLEU might be Guilty but References are not Innocent]]
  * [[https://arxiv.org/pdf/2106.15195.pdf|Marie et al 2021 - Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers]]
  * [[https://arxiv.org/pdf/2310.10482.pdf|Guerreiro et al 2023 - xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection]]
  * [[https://arxiv.org/pdf/2302.14520|Kocmi & Federmann 2023 - Large Language Models Are State-of-the-Art Evaluators of Translation Quality]]
  * **Evaluation of Metrics**
    * **[[https://aclanthology.org/2021.tacl-1.87.pdf|Freitag et al 2021 - Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation]]** Used in [[https://www2.statmt.org/wmt24/metrics-task.html|WMT]]

=== BLEU ===

Note that BLEU is a corpus-level metric: averaging BLEU scores computed at the sentence level will not give the same result as corpus-level BLEU. Corpus-level BLEU is the standard one reported in papers.

Notes: To assess length effects (translations being too short), people often report the brevity penalty (BP) computed when calculating BLEU. Most BLEU evaluation scripts report this number as ''BP = ''.

  * [[https://github.com/mjpost/sacrebleu|SacreBLEU]] (recommended) [[https://arxiv.org/pdf/1804.08771.pdf|paper]]
    * Internally uses [[https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v13a.pl|mteval-v13a.pl]] as the tokenizer
    * If you want to simulate SacreBLEU evaluation, but with statistical significance, you can use the [[https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v13a.pl|mteval-v13a.pl]] script to tokenize your output and references, and then use [[https://github.com/jhclark/multeval|MultEval]]
  * [[https://github.com/neulab/compare-mt|Compare-MT]] can analyze the differences between two systems and compute statistical significance. [[https://www.aclweb.org/anthology/N19-4007.pdf|paper]]
  * Historical: Moses's [[https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl|multi-bleu.perl]]
  * Jon Clark's [[https://github.com/jhclark/multeval|MultEval]] does automatic bootstrap resampling to compute statistical significance. [[http://www.cs.cmu.edu/~jhclark/pubs/significance.pdf|paper]]

===== Datasets =====

==== Papers About Corpus Collection ====

  * [[https://aclanthology.org/J03-3002.pdf|Resnik & Smith 2003 - The Web as a Parallel Corpus]] - The foundational paper about collecting parallel data from the web.
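Web-crawled collections like the one above are typically turned into training bitext by mining sentence pairs: embed the sentences on each side with a multilingual encoder (e.g. LASER, listed under Sentence Alignment) and score candidate pairs. The sketch below illustrates the margin-based scoring idea from Artetxe & Schwenk 2019; the hand-made toy "embeddings" and the ''margin_scores'' helper are illustrative assumptions, not any library's API.

```python
import numpy as np

def margin_scores(src_emb, tgt_emb, k=2):
    """Ratio margin score between all source/target sentence embeddings.

    score(x, y) = cos(x, y) / (mean cos to the k nearest neighbours of x and y),
    which down-weights "hub" sentences that are close to everything.
    """
    # Normalise rows so dot products are cosine similarities.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T  # (n_src, n_tgt) cosine similarity matrix

    # Mean similarity of each sentence to its k nearest neighbours on the other side.
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # per source sentence
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # per target sentence
    return sim / ((knn_src[:, None] + knn_tgt[None, :]) / 2.0)

# Toy embeddings: source sentence i should pair with target sentence i.
src = np.eye(3)
tgt = np.array([[0.9, 0.1, 0.0],
                [0.1, 0.9, 0.1],
                [0.0, 0.1, 0.9]])

pairs = margin_scores(src, tgt).argmax(axis=1)  # best target per source
print(pairs)  # [0 1 2]
```

In a real pipeline the embeddings come from a pretrained multilingual encoder and nearest neighbours are found with approximate search (e.g. FAISS) rather than a dense similarity matrix, since the candidate pool is billions of sentences.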
==== Standard Datasets ====

  * WMT 2014 En-Fr, etc.
  * [[http://www.statmt.org/wmt16/translation-task.html|WMT 2016]]
    * Nice scripts to download and preprocess: [[https://github.com/tensorflow/nmt/blob/master/nmt/scripts/wmt16_en_de.sh|wmt16_en_de.sh]]
  * [[https://opus.nlpl.eu/|Opus - The Open Parallel Corpus]]

==== Datasets for Small-Scale Experiments ====

  * [[https://wit3.fbk.eu/2013-01|IWSLT 2013 MT Datasets]] English-French (200K sentence pairs), used for example [[https://arxiv.org/pdf/1911.01986.pdf|here]].
  * [[https://wit3.fbk.eu/2014-01|IWSLT 2014]] English-German (160K sentence pairs), used for example [[https://arxiv.org/pdf/1911.01986.pdf|here]].
  * [[https://www.cs.cmu.edu/~ark/global-voices/|Malagasy-English dataset]] (80K sentence pairs) Malagasy is a morphologically rich language (WARNING: hasn't been used in a while, no recent neural models to compare to)

==== Low-Resource Datasets ====

  * [[https://arxiv.org/pdf/1902.01382.pdf|Guzmán et al 2019]] [[https://github.com/facebookresearch/flores/tree/master/floresv1|dataset]] Four language pairs: Nepali-English, Sinhala-English, Khmer-English, Pashto-English
  * [[http://www.cs.cmu.edu/~ark/global-voices/|Malagasy-English dataset]] (Jeff recommends)
  * [[https://github.com/ldmt-muri/muri-data|LDMT MURI Data]] (ask Jeff for it, he has access)
  * [[https://github.com/facebookresearch/flores|Flores-101 dataset]] Paper: [[https://scontent-sjc3-1.xx.fbcdn.net/v/t39.8562-6/196203317_1861942553982349_5142503689226033347_n.pdf?_nc_cat=110&ccb=1-5&_nc_sid=ae5e01&_nc_ohc=5qR8k_1OO8UAX_B9sJK&_nc_ht=scontent-sjc3-1.xx&oh=ce279e1cc9b9734be2b394c5cfe2cceb&oe=6166924D|Goyal et al 2021]] 3001 sentences translated into 101 languages
  * [[https://github.com/facebookresearch/flores|Flores-200 dataset]] Paper: [[https://arxiv.org/pdf/2207.04672.pdf|Costa-jussà et al 2022]]
  * [[https://github.com/ZhangShiyue/ChrEn|Cherokee-English dataset]] Recommended (recent, 2020)

==== Large Datasets ====

  * [[https://arxiv.org/pdf/1911.06154.pdf|El-Kishky et al 2019 - CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs]]

===== Software =====

See also [[https://arxiv.org/pdf/2012.15515.pdf|Tan et al 2020]].

  * FairSeq
  * OpenNMT
  * Sockeye
  * Nematus

===== Resources =====

  * Conferences and Workshops
    * [[https://aclanthology.org/venues/wmt/|WMT]] (Workshop on Machine Translation, now Conference on Machine Translation)
  * Books
    * [[https://www.amazon.com/Neural-Machine-Translation-Philipp-Koehn/dp/1108497322/ref=sr_1_1?dchild=1&keywords=Neural+Machine+Translation&qid=1617092140&sr=8-1|Koehn 2020 - Neural Machine Translation]]
  * Wikis
    * [[http://www.statmt.org/survey/|MT Research Survey Wiki]] Covers neural methods as well
  * Bibliographies
    * [[https://github.com/THUNLP-MT/MT-Reading-List|MT Reading List (constantly updated)]]

===== People =====

  * [[https://scholar.google.com/citations?user=phgBJXYAAAAJ&hl=en|Wilker Aziz]]
  * [[https://scholar.google.com/citations?user=iPAX6jcAAAAJ&hl=en|Marine Carpuat]]
  * [[https://scholar.google.com/citations?user=dok0514AAAAJ&hl=en|David Chiang]]
  * [[https://scholar.google.com/citations?user=dLaR9lgAAAAJ&hl=en|Orhan Firat]]
  * [[https://scholar.google.com/citations?user=VEJE37AAAAAJ&hl=en|Kenneth Heafield]]
  * [[https://scholar.google.com/citations?user=d7PTaOYAAAAJ&hl=en|Kevin Knight]]
  * [[https://scholar.google.com/citations?user=OsIZgIYAAAAJ&hl=en|Philipp Koehn]]
  * [[https://scholar.google.com/citations?user=wlosgkoAAAAJ&hl=en|Graham Neubig]]
  * [[https://scholar.google.com/citations?user=4w7LhxsAAAAJ&hl=en|Matt Post]]
  * [[https://scholar.google.com/citations?user=XTpJvCgAAAAJ&hl=en|Rico Sennrich]]

===== Related Pages =====

  * [[Cross-Lingual Transfer]]
  * [[Multilinguality]]
  * [[Noisy Channel Model]]
  * [[Seq2seq]]
  * [[ml:Software]]
  * [[Statistical Machine Translation]]