Ensembling
Ensembling combines several models to improve generalization performance. For example, ensembling models trained with different random seeds almost always improves performance. This technique is often used when performance is the main objective, such as in competitions like WMT. However, because ensembling often gives a large improvement, papers usually compare non-ensembled methods to other non-ensembled methods, and ensembled methods to ensembled methods. See Gehring et al. 2017 for an example of this.
Introduction: Method
For models trained separately with cross-entropy, the standard method of ensembling in NLP is to simply average the probabilities of the models at test time and predict using the averaged distribution (see the sketch after the list below). There are two standard ways to create the different models for ensembling:
- Multirun ensembling: models come from training runs with different random seeds
- Checkpoint ensembling: models come from different checkpoints of a single training run
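The following is a minimal sketch of test-time probability averaging, assuming PyTorch models that map an input batch to unnormalized logits; the function and variable names are illustrative, not from any particular library.

<code python>
import torch
import torch.nn.functional as F

def ensemble_predict(models, x):
    """Average the predictive distributions of several models and predict.

    Assumes each model returns logits of shape (batch, num_classes).
    """
    probs = [F.softmax(m(x), dim=-1) for m in models]   # per-model probability distributions
    avg = torch.stack(probs).mean(dim=0)                # average probabilities (not logits)
    return avg.argmax(dim=-1)                           # predict the highest-probability class

# Toy usage: three "models" standing in for runs with different seeds or checkpoints.
models = [torch.nn.Linear(10, 5) for _ in range(3)]
x = torch.randn(4, 10)
print(ensemble_predict(models, x))
</code>

Note that the averaging is done in probability space rather than logit space, which matches the standard practice described above for models trained with cross-entropy.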