Differences

This shows you the differences between two versions of the page.

--- ml:nn_architectures [2024/04/30 08:26] – [Sequence Networks] jmflanig
+++ ml:nn_architectures [2025/03/25 07:34] (current) – [Sequence Networks] jmflanig
@@ Line 12: / Line 12: @@
   * [[https://arxiv.org/pdf/1701.06538.pdf|Sparsely-Gated Mixture-of-Experts]]. Used to greatly scale up the number of parameters with only a small increase in computation (whether the increase is sub-linear still needs checking). Uses many feedforward expert networks, of which a gating network selects a sparse subset for each input. Reports over 1000x improvements in model capacity.
   * [[https://arxiv.org/pdf/1902.05770.pdf|Dou et al 2019 - Dynamic Layer Aggregation for Neural Machine Translation with Routing-by-Agreement]]
+  * [[https://arxiv.org/pdf/2204.00595|Dao et al 2022 - Monarch: Expressive Structured Matrices for Efficient and Accurate Training]]
 ===== Connections =====
@@ Line 29: / Line 30: @@
   * Memory networks (i.e. End-to-end memory networks)
   * [[http://proceedings.mlr.press/v48/danihelka16.pdf|Associative LSTM]]
-  * [[https://arxiv.org/pdf/1705.07393|Recurrent Additive Networks]] RNN with residual connections
+  * [[https://arxiv.org/pdf/1705.07393|Recurrent Additive Networks]] An early state space model
   * [[https://arxiv.org/pdf/1704.04368.pdf|Pointer-Generator Networks]]
   * [[https://arxiv.org/pdf/1705.03122.pdf|Convolutional Seq2seq]]
@@ Line 44: / Line 45: @@
   * [[https://arxiv.org/pdf/2105.03824.pdf|FNet]] A faster, attention-free Transformer architecture based on Fourier transforms
   * [[https://arxiv.org/pdf/2305.10991.pdf|Anthe: Less is More! A slim architecture for optimal language translation]]
+  * [[https://arxiv.org/pdf/2305.13048|RWKV (Receptance Weighted Key Value) Network]] Passes information across positions using a position-dependent decay that gates the information. Trains in parallel like the Transformer, but has efficient inference like an RNN.
   * [[https://arxiv.org/pdf/2307.08621.pdf|RetNet (Retentive Network)]]
@@ Line 72: / Line 74: @@
   * [[https://arxiv.org/pdf/2102.11972.pdf|Narang et al 2021]] Compares activation functions in the Transformer
+===== Matrices =====
+Various structured representations of matrices, such as sparse or low-rank ones.
+  * Tensor networks
+  * [[https://arxiv.org/pdf/2106.09685|LoRA]]
+  * [[https://arxiv.org/pdf/2204.00595|Monarch Matrices]]
 ===== Set and Pooling Networks =====