Time and Space Complexity: The Transformer requires O(n^2) computation and O(n^2) memory in the sequence length n (Subramanian et al. 2019), because of the n × n attention matrix. However, the experiments in Subramanian et al. 2019 (Fig. 2) appear to show memory usage growing only linearly with sequence length, presumably because the attention matrix does not dominate the memory footprint at the lengths tested.
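A back-of-the-envelope sketch of why this can happen (the head count, model width, and fp32 assumption here are illustrative, not taken from the paper): the attention scores grow as n^2 per head, while a single activation tensor grows only as n · d_model, so at modest n the linear terms can still dominate.

```python
def attention_matrix_bytes(n, num_heads=8, dtype_bytes=4):
    """Bytes for the raw attention score matrices: one (n, n) matrix per head."""
    return num_heads * n * n * dtype_bytes

def activation_bytes(n, d_model=512, dtype_bytes=4):
    """Bytes for one (n, d_model) activation tensor; grows only linearly in n."""
    return n * d_model * dtype_bytes

if __name__ == "__main__":
    for n in (128, 512, 2048):
        print(n, attention_matrix_bytes(n), activation_bytes(n))
```

At n = 128 the per-head score matrices here total about 0.5 MB versus 0.25 MB for one activation tensor; quadrupling n quadruples the gap, so the quadratic term only dominates once sequences get long.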
Expressiveness and Representation Power:
See also the group FLaNN (Formal Languages and Neural Networks).
See also Mechanistic Interpretability and Transformer Circuits.
These are ablation experiments on the Transformer, such as ablating the multi-head attention or comparing against an LSTM augmented with multi-head attention.
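As a minimal sketch of what a head ablation means mechanically (a toy single-layer implementation, not any particular paper's setup): each head is computed independently, so a head can be ablated by zeroing its output before the heads are concatenated.

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, head_mask):
    """Toy multi-head self-attention over rows of x; head_mask[h] = 0 ablates head h."""
    outputs = []
    for h, m in enumerate(head_mask):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        scores = q @ k.T / np.sqrt(q.shape[-1])          # (n, n) attention logits
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
        outputs.append(m * (weights @ v))                # zeroed out if ablated
    return np.concatenate(outputs, axis=-1)              # (n, num_heads * d_head)
```

Comparing the model's output with `head_mask = [1, 1, ...]` against masks that zero individual heads gives a crude measure of how much each head contributes.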
See also Software.