Ensembling

Ensembling combines several models to improve generalization performance. For example, ensembling models trained with different random seeds almost always improves performance. This technique is often used when performance is the main objective, such as in competitions like WMT. In papers, however, because ensembling often gives a large improvement, researchers usually compare non-ensembled methods to other non-ensembled methods, and ensembled methods to other ensembled methods. For an example, see Gehring et al. 2017.

Basic Method

For models trained with cross-entropy, the standard method of ensembling in NLP is to average the probabilities predicted by the models at test time and predict using the averaged probabilities. There are two standard ways to create the different models for ensembling:

Multirun ensembling: models come from training runs with different random seeds.
Checkpoint ensembling: models come from different checkpoints of a single training run. This has the advantage that only a single training run is needed.
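
The probability averaging described above can be sketched as follows. This is a minimal illustration, not a reference implementation. It assumes each model has already produced a probability distribution over the same classes for one input, and that the models are weighted uniformly. The function names and the example numbers are hypothetical. The same code applies whether the models come from different seeds or from different checkpoints.

```python
from typing import List, Sequence


def average_probabilities(model_probs: Sequence[Sequence[float]]) -> List[float]:
    """Uniformly average per-class probabilities from several models."""
    if not model_probs:
        raise ValueError("need probabilities from at least one model")
    num_classes = len(model_probs[0])
    if any(len(p) != num_classes for p in model_probs):
        raise ValueError("all models must score the same set of classes")
    num_models = len(model_probs)
    # For each class, take the mean of that class's probability across models.
    return [sum(p[c] for p in model_probs) / num_models for c in range(num_classes)]


def ensemble_predict(model_probs: Sequence[Sequence[float]]) -> int:
    """Return the index of the class with the highest averaged probability."""
    avg = average_probabilities(model_probs)
    return max(range(len(avg)), key=avg.__getitem__)


# Hypothetical outputs of three models (e.g. three seeds or three checkpoints)
# for a single input over three classes.
probs = [
    [0.5, 0.4, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.6, 0.1],
]
prediction = ensemble_predict(probs)
```

In this example the first model on its own would pick class 0, but the averaged distribution favors class 1. For sequence models such as translation systems, the same averaging would be applied to the next-token distribution at each decoding step.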

See Koehn 2020 p. 148 or pdf here.