====== Neural Network Architectures ======

===== Overviews =====

  * [[https://www.mitpressjournals.org/doi/pdf/10.1162/neco_a_01199|Yu et al 2019 - A Review of Recurrent Neural Networks: LSTM Cells and Network Architectures]]
  * [[https://arxiv.org/pdf/1901.00596.pdf|Wu et al 2019 - A Comprehensive Survey on Graph Neural Networks]]
  * [[https://arxiv.org/pdf/2102.11972.pdf|Narang et al 2021 - Do Transformer Modifications Transfer Across Implementations and Applications?]] Comparison of many Transformer model variants

===== Feedforward Networks =====

  * [[https://arxiv.org/pdf/1505.00387.pdf|Highway networks]]
  * [[https://arxiv.org/abs/1612.08083|GLU]] (also considered a kind of activation, but it's closer to a feedforward architecture). Variants: [[https://arxiv.org/pdf/2002.05202.pdf|Shazeer 2020]] finds that ReGLU and SwiGLU work well.
  * [[https://arxiv.org/pdf/1710.09829.pdf|Capsule networks]] (also used in a CNN-type architecture)
  * [[https://arxiv.org/pdf/1701.06538.pdf|Sparsely-Gated Mixture-of-Experts]]. Greatly scales up the number of parameters with only a minor increase in computation: a gating network activates just a few of many parallel feedforward expert networks per input, so compute grows far more slowly than parameter count. Achieves over 1000x improvements in model capacity.
  * [[https://arxiv.org/pdf/1902.05770.pdf|Dou et al 2019 - Dynamic Layer Aggregation for Neural Machine Translation with Routing-by-Agreement]]
  * [[https://arxiv.org/pdf/2204.00595|Dao et al 2022 - Monarch: Expressive Structured Matrices for Efficient and Accurate Training]]

===== Connections =====

  * [[https://arxiv.org/pdf/1512.03385.pdf|Residual connections]]
  * [[https://arxiv.org/pdf/2003.04887.pdf|ReZero]] Similar to residual connections, but with a trainable parameter that controls the strength of the nonlinearity (initialized to zero).

===== Sequence Networks =====

See also [[State-Space Models]].
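Many of the recurrent models below build on the basic Elman recurrence, h_t = tanh(W_xh x_t + W_hh h_{t-1} + b). A minimal NumPy sketch (the dimensions and random weights here are purely illustrative):

```python
import numpy as np

def elman_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of the Elman recurrence: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Toy setup: 4-dimensional inputs, 8-dimensional hidden state.
rng = np.random.default_rng(0)
d_in, d_h = 4, 8
W_xh = rng.normal(scale=0.1, size=(d_h, d_in))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
b_h = np.zeros(d_h)

# Run a sequence of 5 inputs through the cell.
h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):
    h = elman_step(x_t, h, W_xh, W_hh, b_h)
print(h.shape)  # (8,)
```

A Jordan network differs only in that the recurrent input comes from the previous output rather than the previous hidden state; LSTMs and GRUs replace the single tanh update with gated updates (see the RNN Cells section below).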
  * RNNs: [[https://crl.ucsd.edu/~elman/Papers/fsit.pdf|Elman networks]], [[https://cseweb.ucsd.edu/~gary/PAPER-SUGGESTIONS/Jordan-TR-8604-OCRed.pdf|Jordan networks]]
  * [[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.676.4320&rep=rep1&type=pdf|LSTMs]], [[https://arxiv.org/pdf/1406.1078.pdf|GRUs]] (see [[nn_architectures#RNN Cells]])
  * [[https://arxiv.org/pdf/1410.5401.pdf|Neural Turing Machines]] Cool idea, but in practice the paper limits the size of the external memory, which makes the model more like a neural finite-state machine (see p. 11, footnote 2). This paper need not be the definitive architecture for NTMs, though.
  * [[https://en.wikipedia.org/wiki/Differentiable_neural_computer|Differentiable Neural Computer]] An extension of Neural Turing Machines
  * [[https://arxiv.org/pdf/1506.03134.pdf|Pointer Networks]]
  * StackRNNs, StackLSTMs
  * [[https://arxiv.org/pdf/1609.09106.pdf|HyperNetworks]] Uses one network to generate the weights of another network.
  * Memory networks (i.e. End-to-end memory networks)
  * [[http://proceedings.mlr.press/v48/danihelka16.pdf|Associative LSTM]]
  * [[https://arxiv.org/pdf/1705.07393|Recurrent Additive Networks]] An early state space model
  * [[https://arxiv.org/pdf/1704.04368.pdf|Pointer-Generator Networks]]
  * [[https://arxiv.org/pdf/1705.03122.pdf|Convolutional Seq2seq]]
  * [[https://arxiv.org/pdf/1601.06733.pdf|Long Short-Term Memory-Network (LSTMN)]] Augments the LSTM cell with a memory network
  * [[https://arxiv.org/pdf/1610.10099.pdf|ByteNet]] Dilated convolution network for seq2seq that stacks the encoder and decoder and doesn't use attention. Operates in linear time.
  * [[https://arxiv.org/pdf/1706.03762.pdf|Transformers]]
  * [[http://proceedings.mlr.press/v80/kaiser18a/kaiser18a.pdf|Latent Transformer]] Non-autoregressive Transformer using latent variables
  * [[https://www.aclweb.org/anthology/W18-6219.pdf|Simple Self-Attention Network (SSAN)]] Single-layer transformer with 1 attention head
  * [[https://www.aclweb.org/anthology/P18-1167.pdf|RNMT+]] Hybrid RNN/Transformer architecture. Outperforms the Transformer by half a BLEU point
  * [[https://arxiv.org/pdf/1807.03819.pdf|Universal Transformers]] A recurrent (across layers) Transformer with dynamic halting at each position
  * [[https://arxiv.org/pdf/1905.13324.pdf|Lightweight Recurrent Networks]] Related to the Transformer, LRNs are a drop-in replacement for other RNNs that removes the sequential nature of RNN processing, essentially using a key-query-value attention mechanism instead of the recurrence.
  * [[https://arxiv.org/pdf/2002.09402.pdf|Feedback Transformer]] Makes the Transformer recurrent by allowing each timestep to look back at all layers. Improves performance, but the recurrence makes training much slower.
  * [[https://arxiv.org/pdf/2109.00301.pdf|∞-former (Infinite former)]] Infinite Memory Transformer
  * [[https://arxiv.org/pdf/2105.03824.pdf|FNet]] A faster, attention-free Transformer architecture based on Fourier transforms
  * [[https://arxiv.org/pdf/2305.10991.pdf|Anthe: Less is More! A slim architecture for optimal language translation]]
  * [[https://arxiv.org/pdf/2305.13048|RWKV (Receptance Weighted Key Value) Network]] Information is passed across positions using a positional weight decay that gates the information. Allows parallel training like the Transformer, but more efficient inference like an RNN
  * [[https://arxiv.org/pdf/2307.08621.pdf|RetNet (Retentive Network)]]

===== Tree Networks =====

  * [[https://arxiv.org/pdf/1503.00075.pdf|TreeLSTM]], also [[https://arxiv.org/pdf/1503.04881.pdf|S-LSTMs]]

===== Graph Networks =====

See also [[https://arxiv.org/pdf/1901.00596.pdf|Wu et al 2019 - A Comprehensive Survey on Graph Neural Networks]] and [[Graph_NN|Graph Neural Networks]].

  * Graph convolution networks
  * Graph transformers

===== Activation Functions =====

See also the table in Wikipedia's [[https://en.wikipedia.org/wiki/Activation_function|Activation functions]].

  * Sigmoid, Tanh, etc.
  * Softmax
  * [[https://arxiv.org/pdf/1302.4389.pdf|Maxout]] ([[https://stats.stackexchange.com/questions/129698/what-is-maxout-in-neural-network|explanation]])
  * [[http://www.iro.umontreal.ca/~lisa/publications2/index.php/attachments/single/205|Softsign]]
  * [[https://www.jmlr.org/papers/volume12/collobert11a/collobert11a.pdf|HardTanh]] (from [[https://ronan.collobert.com/pub/matos/2004_phdthesis_lip6.pdf|Collobert 2004]])
  * [[http://proceedings.mlr.press/v15/glorot11a/glorot11a.pdf|ReLU]] (history: also popularized [[https://storage.googleapis.com/pub-tools-public-publication-data/pdf/40811.pdf|here]] and [[https://www.cs.toronto.edu/~fritz/absps/reluICML.pdf|earlier]])
  * [[https://ai.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf|Leaky ReLU]]
  * [[https://arxiv.org/pdf/1502.01852.pdf|Parametric ReLU (PReLU)]] Leaky ReLU with learned parameters.
  * [[https://arxiv.org/abs/1612.08083|GLU]] and [[https://arxiv.org/pdf/2002.05202.pdf|variants]]
  * [[https://arxiv.org/pdf/1606.08415.pdf|Gaussian error linear units (GELU)]] Roughly xσ(1.702x). Used in GPT-2 and BERT.
  * [[https://arxiv.org/pdf/1710.05941.pdf|Swish]] f(x) = xσ(βx). β=1.702 approximates GELU; β=1 is the [[https://arxiv.org/pdf/1702.03118.pdf|Sigmoid-weighted Linear Unit (SiLU)]]
  * [[https://arxiv.org/pdf/2307.16389.pdf|STL]] Signed Truncated Logarithm. Very cool activation function with great properties.

Comparisons:

  * [[https://arxiv.org/pdf/2102.11972.pdf|Narang et al 2021]] Compares activation functions in the Transformer

===== Matrices =====

Various representations of matrices, such as sparse or low-dimensional ones.

  * Tensor networks
  * [[https://arxiv.org/pdf/2106.09685|LoRA]]
  * [[https://arxiv.org/pdf/2204.00595|Monarch Matrices]]

===== Set and Pooling Networks =====

  * Max, average pooling
  * Attention
  * Transformer (it is actually a set network) and the [[https://www.aclweb.org/anthology/W18-6219.pdf|Simple Self-Attention Network (SSAN)]], a single-layer transformer with 1 attention head
  * [[https://arxiv.org/pdf/1703.06114.pdf|Deep sets]]
  * [[https://people.cs.umass.edu/~miyyer/pubs/2015_acl_dan.pdf|Deep averaging networks (DAN)]] aka the neural bag-of-words model (NBOW)
  * [[https://www.aclweb.org/anthology/W16-1626.pdf|Weighted deep averaging networks]]. (A natural extension would be to predict the vector "a" from a pooling operation over vectors. Not sure if anyone has done this yet.)
  * [[https://arxiv.org/pdf/2001.00610.pdf|Weighted Multiset Automata]]
  * See also [[https://arxiv.org/pdf/1511.06391.pdf|Vinyals et al 2015 - Order Matters: Sequence to sequence for sets]]
  * BiLSTM Aggregation
  * [[https://aclanthology.org/N16-1174.pdf|Attentive Pooling]], also described in [[https://aclanthology.org/2020.acl-main.267.pdf|Attentive Pooling with Learnable Norms]]

===== Memory Architectures =====

  * [[https://arxiv.org/pdf/1506.02516.pdf|Neural Stacks, Queues, and DeQues]] (see also [[https://arxiv.org/pdf/1612.00712.pdf|Probabilistic Neural Programs]])
  * Associative Memories
    * [[https://arxiv.org/pdf/1503.08895.pdf|Memory networks]] A simple key-value associative memory
    * [[http://proceedings.mlr.press/v48/danihelka16.pdf|Holographic Reduced Representations]] An associative memory that compresses a collection of key-value vectors into a fixed-size representation using an approximation
    * [[https://arxiv.org/pdf/2109.00301.pdf|Continuous unbounded memory]] (see sections 3.2-3.3)

===== RNN Cells =====

See also [[https://en.wikipedia.org/wiki/Recurrent_neural_network|Wikipedia - Recurrent Neural Networks]] and [[https://www.mitpressjournals.org/doi/pdf/10.1162/neco_a_01199|Yu et al 2019 - A Review of Recurrent Neural Networks: LSTM Cells and Network Architectures]]

  * Feedforward network (Elman network)
  * Feedforward network with residual connections (with careful tuning, has been shown to perform as well as LSTMs, I believe)
  * LSTM
    * Forget gate
    * Peephole connections
    * [[http://proceedings.mlr.press/v48/danihelka16.pdf|Associative LSTM]]
  * [[https://arxiv.org/pdf/1406.1078.pdf|GRU]] (has been shown not to perform as well as the LSTM cell, for example [[https://arxiv.org/pdf/1611.01734.pdf|here]])
  * Minimal Gated Unit (MGU)

===== Position Embeddings =====

See [[nlp:Transformers#Position Embeddings]].

===== Attention Mechanisms =====

See also the [[nlp:Attention Mechanisms]] page.
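Most of the mechanisms below vary the scoring function. Scaled dot-product attention, the Transformer's variant of Luong's dot-product score, can be sketched as follows (shapes are illustrative; Luong's original score omits the √d scaling):

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n_queries, n_keys)
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # (n_queries, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 16))   # 3 queries
K = rng.normal(size=(5, 16))   # 5 keys
V = rng.normal(size=(5, 16))   # 5 values
out = dot_product_attention(Q, K, V)
print(out.shape)  # (3, 16)
```

The linear-time and random-feature variants listed below keep this interface but replace the exact softmax with approximations that avoid materializing the full n×n score matrix.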
  * Feedforward attention (the original one)
  * [[https://arxiv.org/pdf/1508.04025.pdf|Dot product attention]] (aka Luong attention)
  * [[https://arxiv.org/pdf/1701.01811.pdf|Structural attention]]
  * [[https://arxiv.org/pdf/1702.00887.pdf|Structured Attention Networks]]
  * [[https://arxiv.org/pdf/1911.03875.pdf|Label Attention Layer]]
  * [[https://arxiv.org/pdf/2006.16236.pdf|Linear Attention]] (Faster to compute - makes the Transformer O(n))
  * [[https://openreview.net/forum?id=QtTKTdVrFBB|Random Feature Attention]] Uses random features to approximate the softmax, making it O(1).
  * [[https://arxiv.org/pdf/2006.07214.pdf|Continuous Attention Mechanism]], used [[https://arxiv.org/pdf/2109.00301.pdf|here]]
  * [[https://arxiv.org/pdf/2209.10655.pdf|Single-Headed Gated Attention]] Can simulate multi-head attention, and is more expressive (see Sect 3.3 and Theorem 1).

===== Neurosymbolic Networks =====

See also [[nlp:Neurosymbolic Methods]]

  * [[https://arxiv.org/pdf/1802.08535.pdf|PossibleWorldNet]]
  * [[https://openaccess.thecvf.com/content_cvpr_2016/papers/Andreas_Neural_Module_Networks_CVPR_2016_paper.pdf|Neural Module Networks]] (see [[ml:modularity#Neural Module Networks]])
  * [[https://arxiv.org/pdf/1612.00712.pdf|Probabilistic Neural Programs]]

===== Dynamic Neural Networks =====

See also [[Conditional Computation]].

===== Miscellaneous Architectures =====

  * [[Infinite Neural Networks]] (GPNNs and Neural Tangent Kernel)
  * [[Model Compression#Binarized Neural Networks]]
  * [[https://arxiv.org/pdf/1508.05051.pdf|Auto-Sizing Neural Networks]]