Neural Network Architectures

Overviews

Feedforward Networks

  • GLU (Gated Linear Unit; sometimes classed as an activation function, but structurally it is a small feedforward architecture). Variants: Shazeer (2020) finds that ReGLU and SwiGLU work well.
  • Capsule networks (also used in a CNN-type architecture)
  • Sparsely-Gated Mixture-of-Experts (Shazeer et al. 2017). Used to greatly scale up the number of parameters with only a sub-linear increase in computation, since each input activates only a few experts. Many parallel feedforward "expert" networks are gated by a separate routing network; the paper reports over 1000x improvements in model capacity.
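The SwiGLU variant mentioned above can be written in a few lines. A minimal sketch (dimension names and the random toy weights are illustrative, not from Shazeer 2020):

```python
import numpy as np

def swish(x):
    # Swish / SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W, V, W2):
    # FFN_SwiGLU(x) = (Swish(x W) * (x V)) W2, per Shazeer 2020
    return (swish(x @ W) * (x @ V)) @ W2

# Toy dimensions; d_model and d_ff are arbitrary here
rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
x = rng.standard_normal((2, d_model))
W = rng.standard_normal((d_model, d_ff))
V = rng.standard_normal((d_model, d_ff))
W2 = rng.standard_normal((d_ff, d_model))
y = swiglu_ffn(x, W, V, W2)   # shape (2, 8)
```

Note the gating: one projection (`x @ V`) is multiplied elementwise by the activated other projection, which is what distinguishes GLU-family blocks from a plain two-layer FFN.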
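The sparse gating idea can be sketched as follows. This is a toy dense-loop version, assuming top-k routing without the noise term from the paper; function names and the demo experts are made up for illustration:

```python
import numpy as np

def top_k_gate(logits, k=2):
    # Keep the k largest gate logits per row, softmax over them, zero the rest
    idx = np.argsort(logits, axis=-1)[..., ::-1][..., :k]
    masked = np.full_like(logits, -np.inf)
    np.put_along_axis(masked, idx, np.take_along_axis(logits, idx, axis=-1), axis=-1)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, gate_W, experts, k=2):
    # A real implementation only evaluates the selected experts (that is
    # where the compute savings come from); this dense loop is for clarity.
    gates = top_k_gate(x @ gate_W, k=k)        # (batch, n_experts)
    out = np.zeros_like(x)
    for i, expert in enumerate(experts):
        out += gates[:, i:i + 1] * expert(x)
    return out

# Toy demo: three "experts" that just rescale their input
x = np.array([[0.5, -0.2]])
gate_W = np.array([[1.0, 0.0, -1.0],
                   [0.0, 1.0, 0.5]])
experts = [lambda h, c=c: c * h for c in (1.0, 2.0, 3.0)]
y = moe_forward(x, gate_W, experts)            # shape (1, 2)
```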

Connections

  • ReZero. Similar to residual connections, but the residual branch is scaled by a trainable parameter that is initialized to zero, so every block starts out as the identity map.

Sequence Networks

See also State-Space Models.

Tree Networks

Graph Networks

See also Wu et al 2019 - A Comprehensive Survey on Graph Neural Networks and Graph Neural Networks.

  • Graph convolution networks
  • Graph transformers
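A single graph convolution layer in the Kipf & Welling (2017) form can be sketched as below; the tiny path graph and weights are illustrative only:

```python
import numpy as np

def gcn_layer(A, H, W):
    # H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W), where A is the adjacency
    # matrix; the added identity gives each node a self-loop, and the
    # degree normalization keeps feature magnitudes stable.
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

# 3-node path graph, 2 features per node
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.eye(3, 2)
W = np.eye(2)
out = gcn_layer(A, H, W)   # shape (3, 2)
```

Each layer mixes every node's features with its neighbors', so stacking k layers aggregates information from the k-hop neighborhood.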

Activation Functions

See also the table in Wikipedia's Activation functions.

Comparisons:
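As a quick reference alongside the Wikipedia table, three of the most common activations can be sketched directly (the GELU below is the standard tanh approximation, not the exact Gaussian CDF form):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):
    # tanh approximation of GELU (Hendrycks & Gimpel 2016)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def swish(x):
    # Swish / SiLU: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

xs = np.linspace(-3.0, 3.0, 7)
# All three go to 0 for large negative x and approach x for large
# positive x; they differ mainly in smoothness near zero.
```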

Set and Pooling Networks

Memory Architectures

RNN Cells

See also Wikipedia - Recurrent Neural Networks and Yu et al 2019 - A Review of Recurrent Neural Networks: LSTM Cells and Network Architectures.

  • Feedforward network (Elman network)
  • Feedforward network with residual connections (with careful tuning, reportedly performs as well as LSTMs)
  • LSTM
    • Forget gate
    • Peephole connections
  • GRU (has been shown to underperform the LSTM cell in some comparisons)
  • Minimal Gated Unit (MGU)
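The standard LSTM cell with a forget gate (no peephole connections) can be sketched as below; the gate stacking order and toy shapes are a convention chosen here, not a fixed standard:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h, c, Wx, Wh, b):
    # One LSTM step. Wx: (d_in, 4*d), Wh: (d, 4*d), b: (4*d,);
    # the four gates are stacked along the last axis as [i, f, g, o].
    z = x @ Wx + h @ Wh + b
    d = h.shape[-1]
    i = sigmoid(z[..., 0 * d:1 * d])   # input gate
    f = sigmoid(z[..., 1 * d:2 * d])   # forget gate
    g = np.tanh(z[..., 2 * d:3 * d])   # candidate cell state
    o = sigmoid(z[..., 3 * d:4 * d])   # output gate
    c_new = f * c + i * g              # cell state: gated copy + gated write
    h_new = o * np.tanh(c_new)         # hidden state exposed to the next layer
    return h_new, c_new

# Toy shapes: input dim 3, hidden dim 2
rng = np.random.default_rng(1)
d_in, d = 3, 2
Wx = rng.standard_normal((d_in, 4 * d))
Wh = rng.standard_normal((d, 4 * d))
b = np.zeros(4 * d)
x = rng.standard_normal((1, d_in))
h = np.zeros((1, d))
c = np.zeros((1, d))
h_new, c_new = lstm_cell(x, h, c, Wx, Wh, b)
```

The GRU and MGU differ mainly in merging or dropping gates: the GRU ties the input and forget gates together and has no separate cell state, and the MGU reduces further to a single gate.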

Position Embeddings

Attention Mechanisms

See also the Attention Mechanisms page.

Neurosymbolic Networks

Dynamic Neural Networks

Miscellaneous Architectures

ml/nn_architectures.1691646427.txt.gz · Last modified: 2023/08/10 05:47 by jmflanig
