====== NN Architectures ======

  * [[https://arxiv.org/pdf/1710.09829.pdf|Capsule networks]] (also used in a CNN-type architecture)
  * [[https://arxiv.org/pdf/1701.06538.pdf|Sparsely-Gated Mixture-of-Experts]] Greatly scales up the number of parameters with only a sub-linear increase in computation, since only a few experts run per input. Uses many parallel feedforward expert networks gated by a routing network; up to 1000x improvements in model capacity (a minimal sketch follows this list).
  * [[https://arxiv.org/pdf/1902.05770.pdf|Dou et al 2019 - Dynamic Layer Aggregation for Neural Machine Translation with Routing-by-Agreement]]
  * [[https://arxiv.org/pdf/2204.00595|Dao et al 2022 - Monarch: Expressive Structured Matrices for Efficient and Accurate Training]]
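
A minimal sketch of the sparsely gated mixture-of-experts idea above, assuming plain top-k gating (the paper also adds gating noise and load-balancing losses, omitted here); the dimensions and ''top_k'' value are illustrative:

<code python>
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class SparseMoE:
    """Sparsely gated MoE: only top_k of n_experts run per input,
    so compute grows much more slowly than parameter count."""
    def __init__(self, d_in, d_hidden, n_experts, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        # Each expert is a small two-layer feedforward network.
        self.W1 = rng.normal(0, 0.02, (n_experts, d_in, d_hidden))
        self.W2 = rng.normal(0, 0.02, (n_experts, d_hidden, d_in))
        self.Wg = rng.normal(0, 0.02, (d_in, n_experts))  # gating network

    def __call__(self, x):
        scores = x @ self.Wg                    # score every expert
        top = np.argsort(scores)[-self.top_k:]  # indices of the top_k experts
        gates = softmax(scores[top])            # renormalize over the winners
        out = np.zeros_like(x)
        for g, i in zip(gates, top):            # evaluate only the winners
            h = np.maximum(x @ self.W1[i], 0.0)  # ReLU hidden layer
            out += g * (h @ self.W2[i])
        return out

moe = SparseMoE(d_in=16, d_hidden=32, n_experts=8, top_k=2)
y = moe(np.random.default_rng(1).normal(size=16))
</code>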
  
===== Connections =====
  * Memory networks (e.g. End-to-End Memory Networks)
  * [[http://proceedings.mlr.press/v48/danihelka16.pdf|Associative LSTM]]
  * [[https://arxiv.org/pdf/1705.07393|Recurrent Additive Networks]] An early state space model
  * [[https://arxiv.org/pdf/1704.04368.pdf|Pointer-Generator Networks]]
  * [[https://arxiv.org/pdf/1705.03122.pdf|Convolutional Seq2seq]]
  * [[https://arxiv.org/pdf/2105.03824.pdf|FNet]] A faster, attention-free Transformer architecture based on Fourier transforms
  * [[https://arxiv.org/pdf/2305.10991.pdf|Anthe: Less is More! A slim architecture for optimal language translation]]
  * [[https://arxiv.org/pdf/2305.13048|RWKV (Receptance Weighted Key Value) Network]] Passes information across positions using a learned positional weight decay that gates it. Allows parallel training like a Transformer but efficient, RNN-like inference (a simplified sketch follows this list)
  * [[https://arxiv.org/pdf/2307.08621.pdf|RetNet (Retentive Network)]]
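
A simplified sketch of the decay-gated aggregation at the core of RWKV, run in its RNN-style inference mode (the actual WKV operator also has a per-channel bonus term for the current token and numerical-stability rescaling, omitted here; names and shapes are illustrative). The same sums can be computed in parallel over positions for training:

<code python>
import numpy as np

def wkv_recurrent(k, v, w):
    """Simplified RWKV-style aggregation over time.

    k, v: arrays of shape (T, d); w: per-channel decay rate, shape (d,).
    out[t] is a decay-weighted average of v[0..t] with weights exp(k)."""
    T, d = k.shape
    num = np.zeros(d)   # running sum of exp(k_i) * v_i, decayed each step
    den = np.zeros(d)   # running sum of exp(k_i), decayed each step
    out = np.zeros((T, d))
    decay = np.exp(-w)  # older positions are down-weighted by exp(-w) per step
    for t in range(T):
        num = decay * num + np.exp(k[t]) * v[t]
        den = decay * den + np.exp(k[t])
        out[t] = num / den
    return out

rng = np.random.default_rng(0)
T, d = 5, 4
out = wkv_recurrent(rng.normal(size=(T, d)), rng.normal(size=(T, d)),
                    w=np.full(d, 0.5))
</code>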
  
  * [[https://arxiv.org/pdf/1606.08415.pdf|Gaussian error linear units (GELU)]] Roughly xσ(1.702x). Used in GPT-2 and BERT.
  * [[https://arxiv.org/pdf/1710.05941.pdf|Swish]] f(x) = xσ(βx). β=1.702 approximates GELU; β=1 is the [[https://arxiv.org/pdf/1702.03118.pdf|Sigmoid-weighted Linear Unit (SiLU)]] (checked numerically below)
  * [[https://arxiv.org/pdf/2307.16389.pdf|STL]] Signed Truncated Logarithm. Very cool activation function with great properties.
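
The Swish/GELU relationship above is easy to check numerically; a quick sketch with illustrative values:

<code python>
import numpy as np
from math import erf

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x); beta=1 is SiLU, beta=1.702 approximates GELU
    return x * sigmoid(beta * x)

def gelu_exact(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF
    return x * 0.5 * (1.0 + np.vectorize(erf)(x / np.sqrt(2.0)))

x = np.linspace(-4.0, 4.0, 81)
print(np.max(np.abs(gelu_exact(x) - swish(x, beta=1.702))))  # small, ~0.02
</code>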
  
Comparisons:
  * [[https://arxiv.org/pdf/2102.11972.pdf|Narang et al 2021]] Compares activation functions in the Transformer
  
===== Matrices =====
Various structured representations of weight matrices, such as sparse or low-rank ones.
  * Tensor networks
  * [[https://arxiv.org/pdf/2106.09685|LoRA]] (sketched below)
  * [[https://arxiv.org/pdf/2204.00595|Monarch Matrices]]
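
A minimal sketch of the LoRA idea from the list above: the frozen weight ''W'' gets a trainable low-rank update scaled by ''alpha/r'', so only ''r*(d_in+d_out)'' parameters train instead of ''d_in*d_out''. Shapes and hyperparameters here are illustrative:

<code python>
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=4):
    """Low-rank adaptation: y = x @ W + (alpha/r) * x @ A @ B.

    W (d_in, d_out) stays frozen; only A (d_in, r) and B (r, d_out) train."""
    return x @ W + (alpha / r) * (x @ A) @ B

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 4
W = rng.normal(0, 0.02, (d_in, d_out))  # frozen pretrained weight
A = rng.normal(0, 0.02, (d_in, r))      # trainable down-projection
B = np.zeros((r, d_out))                # trainable up-projection, zero-init
x = rng.normal(size=d_in)
y = lora_forward(x, W, A, B)            # equals x @ W at initialization
</code>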
  
===== Set and Pooling Networks =====
See also [[nlp:Neurosymbolic Methods]]
  * [[https://arxiv.org/pdf/1802.08535.pdf|PossibleWorldNet]]
  * [[https://openaccess.thecvf.com/content_cvpr_2016/papers/Andreas_Neural_Module_Networks_CVPR_2016_paper.pdf|Neural Module Networks]] (see [[ml:modularity#Neural Module Networks]])
  * [[https://arxiv.org/pdf/1612.00712.pdf|Probabilistic Neural Programs]]
  