


Machine Translation

Overviews

Key Papers

System Papers

Baselines

Syntax in MT

Multilingual Translation

In multilingual translation, one system is built to translate between many language pairs (rather than just one).

Low-Resource

Character-Level

Domain Adaptation

Pretraining

Unsupervised

Sentence Alignment

Before an MT system can be trained, the sentences in the parallel documents need to be aligned to create sentence pairs.

Statistical MT

Evaluation

For an overview, see Evaluating MT Systems.

Papers

BLEU

Note that BLEU is a corpus-level metric, and that averaging BLEU scores computed at the sentence level will not give the same result as corpus-level BLEU. Corpus-level BLEU is the standard one reported in papers.
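The difference can be seen in a small sketch. The code below is an illustrative, simplified BLEU (single reference, whitespace tokens, no smoothing), not SacreBLEU's implementation; all function names are hypothetical. Corpus-level BLEU pools n-gram counts and lengths over all sentences before computing precisions, whereas the sentence-level average computes a score per sentence first.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_stats(hyp, ref, max_n=4):
    """Clipped n-gram matches, n-gram totals, and lengths for one sentence pair."""
    matches, totals = [], []
    for n in range(1, max_n + 1):
        h, r = ngram_counts(hyp, n), ngram_counts(ref, n)
        matches.append(sum(min(c, r[g]) for g, c in h.items()))
        totals.append(sum(h.values()))
    return matches, totals, len(hyp), len(ref)

def bleu_from_stats(matches, totals, hyp_len, ref_len):
    """Unsmoothed BLEU (0-100) from pooled statistics."""
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # any zero precision gives zero without smoothing
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / len(matches)
    bp = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * bp * math.exp(log_prec)

def corpus_bleu(hyps, refs):
    """Pool statistics over the whole corpus, then compute BLEU once."""
    agg_m, agg_t, hyp_len, ref_len = [0] * 4, [0] * 4, 0, 0
    for hyp, ref in zip(hyps, refs):
        m, t, h, r = bleu_stats(hyp, ref)
        agg_m = [a + b for a, b in zip(agg_m, m)]
        agg_t = [a + b for a, b in zip(agg_t, t)]
        hyp_len += h
        ref_len += r
    return bleu_from_stats(agg_m, agg_t, hyp_len, ref_len)

def mean_sentence_bleu(hyps, refs):
    """Compute BLEU per sentence, then average (not the standard metric)."""
    scores = [bleu_from_stats(*bleu_stats(h, r)) for h, r in zip(hyps, refs)]
    return sum(scores) / len(scores)

# Toy data: the second hypothesis is too short to contain any 4-gram.
hyps = ["the cat sat on the mat".split(), "a dog barked".split()]
refs = ["the cat sat on the mat".split(), "a dog barked loudly".split()]

corpus_score = corpus_bleu(hyps, refs)
mean_score = mean_sentence_bleu(hyps, refs)
print(f"corpus-level BLEU:      {corpus_score:.2f}")  # 89.48
print(f"mean sentence-level:    {mean_score:.2f}")    # 50.00
```

On this toy input the short second sentence scores 0 on its own, dragging the average down to 50, while the pooled corpus-level score is only reduced by the brevity penalty.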

Note: To assess length effects (translations that are too short), people often report the brevity penalty (BP) computed when calculating BLEU. Most BLEU evaluation scripts print this number in their output after "BP =".
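For reference, the brevity penalty in the standard BLEU definition can be written as follows, where c is the total length of the system output and r is the effective reference length (both symbols are notation introduced here, summed over the whole corpus):

```latex
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r, \\
\exp\!\left(1 - \dfrac{r}{c}\right) & \text{if } c \le r.
\end{cases}
% Worked example (illustrative numbers): c = 90, r = 100
% BP = \exp(1 - 100/90) = \exp(-1/9) \approx 0.895
```

A BP of 1 means the output is at least as long as the references; values below 1 indicate that translations are too short and the BLEU score has been scaled down accordingly.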

  • SacreBLEU (recommended) paper
    • By default, tokenizes the same way as mteval-v13a.pl (the "13a" tokenizer)
    • If you want to simulate SacreBLEU evaluation, but with statistical significance, you can use the mteval-v13a.pl script to tokenize your output and references, and then use MultEval
  • Compare-MT: Can analyze the differences between two systems and compute statistical significance. paper
  • Historical: Moses's multi-bleu.pl
  • Jon Clark's MultEval: Does automatic bootstrap resampling to compute statistical significance. paper
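As a rough illustration of the bootstrap resampling idea mentioned in the list above, here is a minimal sketch of paired bootstrap resampling. It is a generic version of the technique and is not claimed to match what MultEval or Compare-MT actually implement; the function names are hypothetical, and a toy corpus-level unigram precision stands in for BLEU to keep the example short.

```python
import random
from collections import Counter

def unigram_precision(hyps, refs):
    """Toy corpus-level metric (a stand-in for BLEU): clipped unigram precision."""
    matched = total = 0
    for hyp, ref in zip(hyps, refs):
        ref_counts = Counter(ref)
        matched += sum(min(c, ref_counts[w]) for w, c in Counter(hyp).items())
        total += len(hyp)
    return matched / total if total else 0.0

def paired_bootstrap(score_fn, sys_a, sys_b, refs, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A outscores system B."""
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_samples):
        # Sample sentence indices with replacement; use the same sample
        # for both systems so the comparison is paired.
        idx = [rng.randrange(n) for _ in range(n)]
        sample_refs = [refs[i] for i in idx]
        score_a = score_fn([sys_a[i] for i in idx], sample_refs)
        score_b = score_fn([sys_b[i] for i in idx], sample_refs)
        wins_a += score_a > score_b
    return wins_a / n_samples

# Toy data: system A reproduces the references, system B makes one error per sentence.
refs = [s.split() for s in ["the cat sat", "a dog barked", "birds can fly"]]
sys_a = [s.split() for s in ["the cat sat", "a dog barked", "birds can fly"]]
sys_b = [s.split() for s in ["the cat ran", "a dog meowed", "birds can swim"]]

win_rate = paired_bootstrap(unigram_precision, sys_a, sys_b, refs)
print(f"A beats B on {win_rate:.0%} of resampled test sets")  # 100%
```

A win rate close to 1 (for example, above 0.95) is commonly read as evidence that A is better than B on this test set; on real data the score function would be a corpus-level metric such as BLEU, recomputed on each resample.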

Datasets

Papers About Corpus Collection

Standard Datasets

Datasets for Small-Scale Experiments

  • IWSLT 2013 MT Datasets English-French (200K sentence pairs), used for example here.
  • IWSLT 2014 English-German (160K sentence pairs), used for example here.
  • Malagasy-English dataset (80K sentence pairs). Malagasy is a morphologically rich language. (WARNING: hasn't been used in a while, so there are no recent neural models to compare to.)

Low-Resource Datasets

Large Datasets

Software

See also Tan 2020.

  • FairSeq
  • OpenNMT
  • Sockeye
  • Nematus

Resources

People

nlp/machine_translation.1698194242.txt.gz · Last modified: 2023/10/25 00:37 by jmflanig
