====== ml:fine-tuning ======

Last modified: 2025/07/14 07:37 (current) by jmflanig
  * [[https://arxiv.org/pdf/2205.05638.pdf|Liu et al 2022 - Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning]]
  * **[[https://arxiv.org/pdf/2306.09782|Lv et al 2023 - Full Parameter Fine-tuning for Large Language Models with Limited Resources]]**
    * [[https://arxiv.org/pdf/2310.10195|Lv et al 2023 - AdaLomo: Low-memory Optimization with Adaptive Learning Rate]]
  * [[https://arxiv.org/pdf/2403.14608|Han et al 2024 - Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey]]
    * [[https://arxiv.org/pdf/2401.14556|2024 - Looking Right is Sometimes Right: Investigating the Capabilities of Decoder-only LLMs for Sequence Labeling]] Says "LLMs fall short of achieving state-of-the-art results in information extraction (IE) tasks, many of which are formulated as sequence labeling (SL)"
    * [[https://arxiv.org/pdf/2404.05961|LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders]] Shows that Mistral was probably pre-trained with some bidirectional attention.
    * [[https://arxiv.org/pdf/2504.06225|Zhang et al 2025 - Encoder-Decoder Gemma: Improving the Quality-Efficiency Trade-Off via Adaptation]] They claim to be the first to adapt pretrained decoder-only LLMs into encoder-decoder models, which is incorrect.
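The LLM2Vec and encoder-decoder adaptation papers above rest on one structural fact: a decoder-only transformer differs from a bidirectional encoder only in the causal attention mask. A minimal numpy sketch of that difference (illustrative code, not taken from either paper):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(Q, K, V, causal=True):
    # Scaled dot-product self-attention over a single sequence.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    if causal:
        # Decoder-only: each position attends only to itself and earlier
        # positions. Dropping this mask yields bidirectional attention.
        n = scores.shape[0]
        scores = np.where(np.tril(np.ones((n, n), bool)), scores, -np.inf)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))          # 5 tokens, 8-dim states
causal_out = self_attention(X, X, X, causal=True)
bidir_out = self_attention(X, X, X, causal=False)

# Under the causal mask the first token can only attend to itself,
# so its output is just its own value vector; bidirectionally it
# mixes in information from later tokens.
assert np.allclose(causal_out[0], X[0])
assert not np.allclose(causal_out[0], bidir_out[0])
```

This is why "adapting" a decoder-only LLM into an encoder (or the encoder half of an encoder-decoder) is largely a matter of removing the mask and continuing training so the weights adjust to the new attention pattern.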
  
===== Parameter-Efficient Tuning (PET) =====
  * **QLoRA**: [[https://arxiv.org/pdf/2305.14314.pdf|Dettmers et al 2023 - QLORA: Efficient Finetuning of Quantized LLMs]]
  * [[https://arxiv.org/pdf/2305.17333.pdf|Malladi et al 2023 - Fine-Tuning Language Models with Just Forward Passes]]
  * [[https://arxiv.org/pdf/2312.09979|Dou et al 2023 - LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin]] Combines LoRA with a mixture-of-experts-style plugin to mitigate forgetting of world knowledge during fine-tuning.
  * [[https://arxiv.org/pdf/2402.03293|Hao et al 2024 - FLORA: Low-Rank Adapters Are Secretly Gradient Compressors]]
  * [[https://arxiv.org/pdf/2403.03507.pdf|Zhao et al 2024 - GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection]] Can also be used for pre-training.
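Several of the methods above (QLoRA, LoRAMoE, FLORA) build on the LoRA idea from Hu et al 2021: freeze the pretrained weight and learn only a low-rank additive update. A minimal numpy sketch of that idea (names and hyperparameters here are illustrative, not from any specific library):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 4             # rank r << d_in, d_out
alpha = 8.0                            # LoRA scaling hyperparameter

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection, zero-init
                                       # so training starts from W exactly

def lora_forward(x):
    # y = W x + (alpha / r) * B A x  -- only A and B receive gradients,
    # so the trainable parameter count is r * (d_in + d_out), not d_in * d_out.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B zero-initialized, the adapted layer reproduces the frozen one:
assert np.allclose(lora_forward(x), W @ x)

full_params = W.size
lora_params = A.size + B.size
print(f"trainable params: {lora_params} vs full fine-tuning: {full_params}")
```

The follow-on papers vary the ingredients: QLoRA quantizes the frozen `W` to 4 bits, LoRAMoE routes among several such adapter pairs, and FLORA/GaLore reinterpret the low-rank structure as gradient compression.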
