Machine Translation
Overviews
For a reading list, see The Machine Translation Reading List
Key Papers
System Papers
- Arivazhagan et al 2019 - Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges. See their list of open problems. From Google's MT team; did they deploy this?
- Costa-jussà et al 2022 - No Language Left Behind: Scaling Human-Centered Machine Translation blog website model demo Transformer encoder-decoder model with a sparsely gated mixture of experts. 50B params, with distilled versions also released.
Baselines
Syntax in MT
Multilingual Translation
Multilingual translation is where you build a single system that translates between many language pairs (rather than just one).
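A common recipe for this (following Johnson et al 2017's multilingual NMT work) is to train one shared model on the concatenation of all language pairs, with a target-language token prepended to each source sentence. The exact tag format below is illustrative, not standardized:

```python
def tag_source(sentence: str, tgt_lang: str) -> str:
    """Prepend a target-language token so one shared model can serve
    every translation direction (tag format is illustrative)."""
    return f"<2{tgt_lang}> {sentence}"

# Mixed-direction training examples for a single multilingual model:
batch = [
    (tag_source("How are you?", "fr"), "Comment allez-vous ?"),
    (tag_source("How are you?", "de"), "Wie geht es Ihnen?"),
]
```

At inference time the same tag selects the output language, which is also what enables zero-shot directions the model never saw paired training data for.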
Low-Resource
- Comparison of SMT vs NMT for low-resource MT
- Costa-jussà et al 2022 - No Language Left Behind: Scaling Human-Centered Machine Translation (see Key Papers above)
Character-Level
Domain Adaptation
See also Domain Adaptation.
- Surveys
Pretraining
Unsupervised
Sentence Alignment
Before an MT system can be trained, the sentences in the parallel documents need to be aligned to create sentence pairs.
- Mining parallel sentences
- Some of these methods can be used to mine parallel sentences from large collections of documents
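The simplest alignment signal is sentence length. Below is a toy sketch in the spirit of Gale-Church: a DP over 1-1 matches and 1-0 / 0-1 skips, where the absolute character-length difference and the fixed skip penalty are simplifications of the original probabilistic cost model:

```python
def align(src, tgt, skip_cost=10.0):
    """Toy length-based sentence alignment: dynamic program over
    1-1 matches (cost = |length difference|) and 1-0 / 0-1 skips."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # 1-1 match
                c = cost[i - 1][j - 1] + abs(len(src[i - 1]) - len(tgt[j - 1]))
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i - 1, j - 1)
            if i > 0 and cost[i - 1][j] + skip_cost < cost[i][j]:  # skip source
                cost[i][j], back[i][j] = cost[i - 1][j] + skip_cost, (i - 1, j)
            if j > 0 and cost[i][j - 1] + skip_cost < cost[i][j]:  # skip target
                cost[i][j], back[i][j] = cost[i][j - 1] + skip_cost, (i, j - 1)
    pairs, (i, j) = [], (n, m)
    while (i, j) != (0, 0):  # backtrack, keeping only 1-1 matches
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            pairs.append((src[i - 1], tgt[j - 1]))
        i, j = pi, pj
    return pairs[::-1]

src = ["Hello.", "This sentence has no counterpart on the target side.", "Goodbye."]
tgt = ["Bonjour.", "Au revoir."]
pairs = align(src, tgt)  # the long middle sentence is skipped
```

Real aligners additionally model 2-1 and 1-2 merges and use either a probabilistic length model (Gale-Church) or sentence embeddings; this sketch only shows the DP skeleton.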
Statistical MT
See also Statistical Machine Translation. Recent papers related to SMT:
Evaluation
For an overview, see Evaluating MT Systems.
Papers
See also the metrics task at WMT, which every year correlates automatic metrics with human evaluations.
BLEU
Note that BLEU is a corpus-level metric, and that averaging BLEU scores computed at the sentence level will not give the same result as corpus-level BLEU. Corpus-level BLEU is the standard one reported in papers.
Notes: To assess length effects (translations being too short), people often report the brevity penalty (BP) computed when calculating BLEU. Most BLEU evaluation scripts report this number as BP = .
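To see why the two disagree, here is a toy BLEU sketch (single reference, n-grams only up to 2, simplified brevity penalty; not a replacement for SacreBLEU). Corpus-level BLEU pools clipped n-gram counts over all sentences before taking precisions, which is not the same as averaging per-sentence scores:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyps, refs, max_n=2):
    """Toy corpus-level BLEU: clipped n-gram counts are pooled over the
    whole corpus, then precisions and the brevity penalty are computed."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            matches[n - 1] += sum((hc & rc).values())  # clipped counts
            totals[n - 1] += sum(hc.values())
    if 0 in matches:
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = min(1.0, math.exp(1 - ref_len / hyp_len))  # brevity penalty
    return bp * math.exp(log_prec)

hyps = ["the cat sat", "a dog barked loudly at night"]
refs = ["the cat sat down", "the dog barked loudly at night"]

corpus = bleu(hyps, refs)
avg_sentence = sum(bleu([h], [r]) for h, r in zip(hyps, refs)) / 2
# The two values differ because pooled counts weight sentences by length
# and the brevity penalty is applied once, at the corpus level.
```

For numbers reported in papers, compute the corpus-level score with a standard tool such as SacreBLEU rather than anything hand-rolled.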
- If you want to simulate SacreBLEU evaluation but with statistical significance testing, you can use the mteval-v13a.pl script to tokenize your output and references, then run MultEval
- Compare-MT can analyze the differences between two systems and compute statistical significance. paper
- Historical: Moses's multi-bleu.pl
Datasets
Standard Datasets
- WMT 2014 En-Fr, etc
- Nice scripts to download and preprocess: wmt16_en_de.sh
Datasets for Small-Scale Experiments
- IWSLT 2013 MT Datasets English-French (200K sentence pairs), used for example here.
- IWSLT 2014 English-German (160K sentence pairs), used for example here.
- Malagasy-English dataset (80K sentence pairs). Malagasy is a morphologically rich language. (WARNING: hasn't been used in a while, so there are no recent neural models to compare to.)
Low-Resource Datasets
- Guzmán et al 2019 dataset Four language pairs: Nepali-English, Sinhala-English, Khmer-English, Pashto-English
- Malagasy-English dataset (Jeff recommends)
- LDMT MURI Data (ask Jeff for it, he has access)
- Flores-101 dataset. Paper: Goyal et al 2021. 3001 sentences translated into 101 languages.
- Cherokee-English dataset Recommended (recent, 2020)
Large Datasets
Software
See also Tan 2020.
- FairSeq
- OpenNMT
- Sockeye
- Nematus
Resources
- Conferences and Workshops
- WMT (Workshop on Machine Translation, now Conference on Machine Translation)
- Books
- Wikis
- MT Research Survey Wiki Covers neural methods as well
- Bibliographies