====== Neural Network Architectures ======

===== Overviews =====

  * [[https://www.mitpressjournals.org/doi/pdf/10.1162/neco_a_01199|Yu et al 2019 - A Review of Recurrent Neural Networks: LSTM Cells and Network Architectures]]
  * [[https://arxiv.org/pdf/1901.00596.pdf|Wu et al 2019 - A Comprehensive Survey on Graph Neural Networks]]
  * [[https://arxiv.org/pdf/2102.11972.pdf|Narang et al 2021 - Do Transformer Modifications Transfer Across Implementations and Applications?]] Comparison of many Transformer model variants

===== Feedforward Networks =====

  * [[https://arxiv.org/pdf/1505.00387.pdf|Highway networks]]
  * [[https://arxiv.org/abs/1612.08083|GLU]] (also considered a kind of activation, but it's closer to a feedforward architecture). Variants: [[https://arxiv.org/pdf/2002.05202.pdf|Shazeer 2020]] finds that ReGLU and SwiGLU work well.
  * [[https://arxiv.org/pdf/1710.09829.pdf|Capsule networks]] (also used in a CNN-type architecture)
  * [[https://arxiv.org/pdf/1701.06538.pdf|Sparsely-Gated Mixture-of-Experts]]. Greatly scales up the number of parameters with only a minor increase in computation: a gating network activates just a few of many parallel feedforward expert networks per input, so compute grows far more slowly than parameter count. Achieves over 1000x improvements in model capacity.
  * [[https://arxiv.org/pdf/1902.05770.pdf|Dou et al 2019 - Dynamic Layer Aggregation for Neural Machine Translation with Routing-by-Agreement]]
  * [[https://arxiv.org/pdf/2204.00595|Dao et al 2022 - Monarch: Expressive Structured Matrices for Efficient and Accurate Training]]

===== Connections =====

  * [[https://arxiv.org/pdf/1512.03385.pdf|Residual connections]]
  * [[https://arxiv.org/pdf/2003.04887.pdf|ReZero]] Similar to residual connections, but with a trainable parameter that controls the strength of the nonlinearity (initialized to zero).

===== Sequence Networks =====

See also [[State-Space Models]].
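Many of the recurrent models below build on the basic Elman recurrence, h_t = tanh(W_xh x_t + W_hh h_{t-1} + b). A minimal NumPy sketch (the dimensions and random weights here are purely illustrative):

```python
import numpy as np

def elman_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of the Elman recurrence: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Toy setup: 4-dimensional inputs, 8-dimensional hidden state.
rng = np.random.default_rng(0)
d_in, d_h = 4, 8
W_xh = rng.normal(scale=0.1, size=(d_h, d_in))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
b_h = np.zeros(d_h)

# Run a sequence of 5 inputs through the cell.
h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):
    h = elman_step(x_t, h, W_xh, W_hh, b_h)
print(h.shape)  # (8,)
```

A Jordan network differs only in that the recurrent input comes from the previous output rather than the previous hidden state; LSTMs and GRUs replace the single tanh update with gated updates (see the RNN Cells section below).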
  * RNNs: [[https://crl.ucsd.edu/~elman/Papers/fsit.pdf|Elman networks]], [[https://cseweb.ucsd.edu/~gary/PAPER-SUGGESTIONS/Jordan-TR-8604-OCRed.pdf|Jordan networks]]
  * [[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.676.4320&rep=rep1&type=pdf|LSTMs]], [[https://arxiv.org/pdf/1406.1078.pdf|GRUs]] (see [[nn_architectures#RNN Cells]])
  * [[https://arxiv.org/pdf/1410.5401.pdf|Neural Turing Machines]] Cool idea, but in practice the paper limits the size of the external memory, which makes the model more like a neural finite-state machine (see p. 11, footnote 2). This paper need not be the definitive architecture for NTMs, though.
  * [[https://en.wikipedia.org/wiki/Differentiable_neural_computer|Differentiable Neural Computer]] An extension of Neural Turing Machines
  * [[https://arxiv.org/pdf/1506.03134.pdf|Pointer Networks]]
  * StackRNNs, StackLSTMs
  * [[https://arxiv.org/pdf/1609.09106.pdf|HyperNetworks]] Uses one network to generate the weights of another network.
  * Memory networks (i.e. End-to-end memory networks)
  * [[http://proceedings.mlr.press/v48/danihelka16.pdf|Associative LSTM]]
  * [[https://arxiv.org/pdf/1705.07393|Recurrent Additive Networks]] An early state space model
  * [[https://arxiv.org/pdf/1704.04368.pdf|Pointer-Generator Networks]]
  * [[https://arxiv.org/pdf/1705.03122.pdf|Convolutional Seq2seq]]
  * [[https://arxiv.org/pdf/1601.06733.pdf|Long Short-Term Memory-Network (LSTMN)]] Augments the LSTM cell with a memory network
  * [[https://arxiv.org/pdf/1610.10099.pdf|ByteNet]] Dilated convolution network for seq2seq that stacks the encoder and decoder and doesn't use attention. Operates in linear time.
  * [[https://arxiv.org/pdf/1706.03762.pdf|Transformers]]
  * [[http://proceedings.mlr.press/v80/kaiser18a/kaiser18a.pdf|Latent Transformer]] Non-autoregressive Transformer using latent variables
  * [[https://www.aclweb.org/anthology/W18-6219.pdf|Simple Self-Attention Network (SSAN)]] Single-layer transformer with 1 attention head
  * [[https://www.aclweb.org/anthology/P18-1167.pdf|RNMT+]] Hybrid RNN/Transformer architecture. Outperforms the Transformer by half a BLEU point
  * [[https://arxiv.org/pdf/1807.03819.pdf|Universal Transformers]] A recurrent (across layers) Transformer with dynamic halting at each position
  * [[https://arxiv.org/pdf/1905.13324.pdf|Lightweight Recurrent Networks]] Related to the Transformer, LRNs are a drop-in replacement for other RNNs that removes the sequential nature of RNN processing, essentially using a key-query-value attention mechanism instead of the recurrence.
  * [[https://arxiv.org/pdf/2002.09402.pdf|Feedback Transformer]] Makes the Transformer recurrent by allowing each timestep to look back at all layers. Improves performance, but the recurrence makes training much slower.
  * [[https://arxiv.org/pdf/2109.00301.pdf|∞-former (Infinite former)]] Infinite Memory Transformer
  * [[https://arxiv.org/pdf/2105.03824.pdf|FNet]] A faster, attention-free Transformer architecture based on Fourier transforms
  * [[https://arxiv.org/pdf/2305.10991.pdf|Anthe: Less is More! A slim architecture for optimal language translation]]
  * [[https://arxiv.org/pdf/2305.13048|RWKV (Receptance Weighted Key Value) Network]] Information is passed across positions using a positional weight decay that gates the information. Allows parallel training like the Transformer, but more efficient inference like an RNN
  * [[https://arxiv.org/pdf/2307.08621.pdf|RetNet (Retentive Network)]]

===== Tree Networks =====

  * [[https://arxiv.org/pdf/1503.00075.pdf|TreeLSTM]], also [[https://arxiv.org/pdf/1503.04881.pdf|S-LSTMs]]

===== Graph Networks =====

See also [[https://arxiv.org/pdf/1901.00596.pdf|Wu et al 2019 - A Comprehensive Survey on Graph Neural Networks]] and [[Graph_NN|Graph Neural Networks]].

  * Graph convolution networks
  * Graph transformers

===== Activation Functions =====

See also the table in Wikipedia's [[https://en.wikipedia.org/wiki/Activation_function|Activation functions]].

  * Sigmoid, Tanh, etc.
  * Softmax
  * [[https://arxiv.org/pdf/1302.4389.pdf|Maxout]] ([[https://stats.stackexchange.com/questions/129698/what-is-maxout-in-neural-network|explanation]])
  * [[http://www.iro.umontreal.ca/~lisa/publications2/index.php/attachments/single/205|Softsign]]
  * [[https://www.jmlr.org/papers/volume12/collobert11a/collobert11a.pdf|HardTanh]] (from [[https://ronan.collobert.com/pub/matos/2004_phdthesis_lip6.pdf|Collobert 2004]])
  * [[http://proceedings.mlr.press/v15/glorot11a/glorot11a.pdf|ReLU]] (history: also popularized [[https://storage.googleapis.com/pub-tools-public-publication-data/pdf/40811.pdf|here]] and [[https://www.cs.toronto.edu/~fritz/absps/reluICML.pdf|earlier]])
  * [[https://ai.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf|Leaky ReLU]]
  * [[https://arxiv.org/pdf/1502.01852.pdf|Parametric ReLU (PReLU)]] Leaky ReLU with learned parameters.
  * [[https://arxiv.org/abs/1612.08083|GLU]] and [[https://arxiv.org/pdf/2002.05202.pdf|variants]]
  * [[https://arxiv.org/pdf/1606.08415.pdf|Gaussian error linear units (GELU)]] Roughly xσ(1.702x). Used in GPT-2 and BERT.
  * [[https://arxiv.org/pdf/1710.05941.pdf|Swish]] f(x) = xσ(βx). β=1.702 approximates GELU; β=1 is the [[https://arxiv.org/pdf/1702.03118.pdf|Sigmoid-weighted Linear Unit (SiLU)]]
  * [[https://arxiv.org/pdf/2307.16389.pdf|STL]] Signed Truncated Logarithm. Very cool activation function with great properties.

Comparisons:

  * [[https://arxiv.org/pdf/2102.11972.pdf|Narang et al 2021]] Compares activation functions in the Transformer

===== Matrices =====

Various representations of matrices, such as sparse or low-dimensional ones.

  * Tensor networks
  * [[https://arxiv.org/pdf/2106.09685|LoRA]]
  * [[https://arxiv.org/pdf/2204.00595|Monarch Matrices]]

===== Set and Pooling Networks =====

  * Max, average pooling
  * Attention
  * Transformer (it is actually a set network) and the [[https://www.aclweb.org/anthology/W18-6219.pdf|Simple Self-Attention Network (SSAN)]], a single-layer transformer with 1 attention head
  * [[https://arxiv.org/pdf/1703.06114.pdf|Deep sets]]
  * [[https://people.cs.umass.edu/~miyyer/pubs/2015_acl_dan.pdf|Deep averaging networks (DAN)]] aka the neural bag-of-words model (NBOW)
  * [[https://www.aclweb.org/anthology/W16-1626.pdf|Weighted deep averaging networks]]. (A natural extension would be to predict the vector "a" from a pooling operation over vectors. Not sure if anyone has done this yet.)
  * [[https://arxiv.org/pdf/2001.00610.pdf|Weighted Multiset Automata]]
  * See also [[https://arxiv.org/pdf/1511.06391.pdf|Vinyals et al 2015 - Order Matters: Sequence to sequence for sets]]
  * BiLSTM Aggregation
  * [[https://aclanthology.org/N16-1174.pdf|Attentive Pooling]], also described in [[https://aclanthology.org/2020.acl-main.267.pdf|Attentive Pooling with Learnable Norms]]

===== Memory Architectures =====

  * [[https://arxiv.org/pdf/1506.02516.pdf|Neural Stacks, Queues, and DeQues]] (see also [[https://arxiv.org/pdf/1612.00712.pdf|Probabilistic Neural Programs]])
  * Associative Memories
    * [[https://arxiv.org/pdf/1503.08895.pdf|Memory networks]] A simple key-value associative memory
    * [[http://proceedings.mlr.press/v48/danihelka16.pdf|Holographic Reduced Representations]] An associative memory that compresses a collection of key-value vectors into a fixed-size representation using an approximation
    * [[https://arxiv.org/pdf/2109.00301.pdf|Continuous unbounded memory]] (see sections 3.2-3.3)

===== RNN Cells =====

See also [[https://en.wikipedia.org/wiki/Recurrent_neural_network|Wikipedia - Recurrent Neural Networks]] and [[https://www.mitpressjournals.org/doi/pdf/10.1162/neco_a_01199|Yu et al 2019 - A Review of Recurrent Neural Networks: LSTM Cells and Network Architectures]]

  * Feedforward network (Elman network)
  * Feedforward network with residual connections (with careful tuning, has been shown to perform as well as LSTMs, I believe)
  * LSTM
    * Forget gate
    * Peephole connections
    * [[http://proceedings.mlr.press/v48/danihelka16.pdf|Associative LSTM]]
  * [[https://arxiv.org/pdf/1406.1078.pdf|GRU]] (has been shown not to perform as well as the LSTM cell, for example [[https://arxiv.org/pdf/1611.01734.pdf|here]])
  * Minimal Gated Unit (MGU)

===== Position Embeddings =====

See [[nlp:Transformers#Position Embeddings]].

===== Attention Mechanisms =====

See also the [[nlp:Attention Mechanisms]] page.
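Most of the mechanisms below vary the scoring function. Scaled dot-product attention, the Transformer's variant of Luong's dot-product score, can be sketched as follows (shapes are illustrative; Luong's original score omits the √d scaling):

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n_queries, n_keys)
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # (n_queries, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 16))   # 3 queries
K = rng.normal(size=(5, 16))   # 5 keys
V = rng.normal(size=(5, 16))   # 5 values
out = dot_product_attention(Q, K, V)
print(out.shape)  # (3, 16)
```

The linear-time and random-feature variants listed below keep this interface but replace the exact softmax with approximations that avoid materializing the full n×n score matrix.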
  * Feedforward attention (the original one)
  * [[https://arxiv.org/pdf/1508.04025.pdf|Dot product attention]] (aka Luong attention)
  * [[https://arxiv.org/pdf/1701.01811.pdf|Structural attention]]
  * [[https://arxiv.org/pdf/1702.00887.pdf|Structured Attention Networks]]
  * [[https://arxiv.org/pdf/1911.03875.pdf|Label Attention Layer]]
  * [[https://arxiv.org/pdf/2006.16236.pdf|Linear Attention]] (Faster to compute - makes the Transformer O(n))
  * [[https://openreview.net/forum?id=QtTKTdVrFBB|Random Feature Attention]] Uses random features to approximate the softmax, making it O(1).
  * [[https://arxiv.org/pdf/2006.07214.pdf|Continuous Attention Mechanism]], used [[https://arxiv.org/pdf/2109.00301.pdf|here]]
  * [[https://arxiv.org/pdf/2209.10655.pdf|Single-Headed Gated Attention]] Can simulate multi-head attention, and is more expressive (see Sect 3.3 and Theorem 1).

===== Neurosymbolic Networks =====

See also [[nlp:Neurosymbolic Methods]]

  * [[https://arxiv.org/pdf/1802.08535.pdf|PossibleWorldNet]]
  * [[https://openaccess.thecvf.com/content_cvpr_2016/papers/Andreas_Neural_Module_Networks_CVPR_2016_paper.pdf|Neural Module Networks]] (see [[ml:modularity#Neural Module Networks]])
  * [[https://arxiv.org/pdf/1612.00712.pdf|Probabilistic Neural Programs]]

===== Dynamic Neural Networks =====

See also [[Conditional Computation]].

===== Miscellaneous Architectures =====

  * [[Infinite Neural Networks]] (GPNNs and Neural Tangent Kernel)
  * [[Model Compression#Binarized Neural Networks]]
  * [[https://arxiv.org/pdf/1508.05051.pdf|Auto-Sizing Neural Networks]]