Transformers

Overview

Surveys

Transformer Properties

Time and Space Complexity: The Transformer requires O(n^2) computation time and O(n^2) memory (Subramanian et al. 2019) because of the n×n attention matrix. However, the experiments in Subramanian et al. 2019 (Fig. 2) appear to show memory usage growing roughly linearly with sequence length, presumably because the attention matrix does not dominate the memory footprint at the sequence lengths tested.
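A minimal sketch (not from the cited paper; all names are illustrative) of single-head scaled dot-product attention, showing where the O(n^2) term comes from: the score/weight matrix has shape (n, n), while the other activations are only O(n·d).

<code python>
# Minimal self-attention sketch in NumPy; illustrative only.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over a length-n sequence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # each (n, d): O(n*d) memory
    scores = Q @ K.T / np.sqrt(Q.shape[-1])     # (n, n): the O(n^2) term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                          # (n, d)

n, d = 1024, 64
X = np.random.randn(n, d)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
# The (n, n) weight matrix costs n^2 floats; at moderate n it can still be
# dwarfed by the O(n * d_model) activations elsewhere in the network, which
# is one possible reading of the roughly linear memory curve noted above.
</code>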

Expressiveness and Representation Power:

See also the group FLaNN (Formal Languages and Neural Networks).

Analysis and Interpretation

Transformer Variants: Overviews

Improvements

Mixture of Experts (MoE) Transformers

Ablation Experiments on the Transformer

These are ablation experiments on the Transformer, such as ablating multi-head attention or comparing against an LSTM equipped with multi-head attention (a head-masking sketch follows below).
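One common ablation setup is to zero out the outputs of selected attention heads at inference time and measure the resulting drop in task performance. The sketch below is an assumption of how such a head-ablation experiment might be wired up, not the procedure of any specific paper; all names are illustrative.

<code python>
# Head-ablation sketch in NumPy: zero out masked heads' outputs.
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads, head_mask=None):
    n, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    outputs = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)          # softmax over keys
        head_out = w @ V[:, sl]
        if head_mask is not None and head_mask[h] == 0:
            head_out = np.zeros_like(head_out)      # ablate this head
        outputs.append(head_out)
    return np.concatenate(outputs, axis=-1) @ Wo

n, d_model, n_heads = 16, 64, 8
X = np.random.randn(n, d_model)
Wq, Wk, Wv, Wo = (np.random.randn(d_model, d_model) for _ in range(4))
full = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
# Ablate head 0 and compare downstream metrics between `full` and `ablated`.
ablated = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads,
                               head_mask=[0, 1, 1, 1, 1, 1, 1, 1])
</code>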

Pruning Attention Heads

Training

Efficient Transformers

Survey Papers

Papers

Datasets and Benchmarks

Long-Context Transformers

Survey Papers

Papers

Position Embeddings

Software

See also Software.
