Transformers are hard to train (Liu et al. 2020, Understanding the Difficulty of Training Transformers), and often we can't even get them to overfit, so we just train for as long as we can. This isn't a good situation, and it suggests there is some issue with the normalization, the initialization, or the optimizer. Feedforward networks, CNNs, and RNNs had similar issues for a long time, and those issues were largely fixed by Glorot initialization, batch normalization, and layer normalization. The open problem is:
what are the optimal initialization and normalization procedures for the Transformer?
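For context, the two classical fixes mentioned above can be sketched in a few lines. Below is a minimal, dependency-free illustration of Glorot (Xavier) uniform initialization and per-vector layer normalization, the building blocks whose Transformer-specific analogues remain an open question; function names and the seeded RNG are illustrative choices, not from the source.

```python
import math
import random

def glorot_uniform(fan_in, fan_out, rng=None):
    """Glorot/Xavier uniform init: limit = sqrt(6 / (fan_in + fan_out)),
    chosen so activation variance is roughly preserved across layers."""
    rng = rng or random.Random(0)  # seeded for reproducibility (illustrative)
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

def layer_norm(x, eps=1e-5):
    """Normalize one activation vector to zero mean / unit variance,
    as applied per token inside each Transformer sub-layer."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]
```

Even these simple recipes interact nontrivially with depth in Transformers (e.g. where the layer norm is placed relative to the residual connection), which is part of why the question above is still open.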