====== Fine-Tuning ======

This page lists fine-tuning methods such as Adapters, LoRA, BitFit, NoisyTune, etc.

===== Overviews =====

* **[[https://arxiv.org/pdf/2006.04884.pdf|Mosbach et al 2020 - On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines]]** Gives a good baseline setting of hyperparameters for tuning BERT in section 6: fine-tune using Adam with bias correction and a learning rate of 2e−5 for 20 epochs, with the learning rate linearly increased for the first 10% of steps and linearly decayed to zero afterward.
* [[https://arxiv.org/pdf/2203.06904.pdf|Ding et al 2022 - Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models]]
* [[https://arxiv.org/pdf/2110.04366.pdf|He et al 2022 - Towards a Unified View of Parameter-Efficient Transfer Learning]]
* [[https://arxiv.org/pdf/2205.05638.pdf|Liu et al 2022 - Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning]]
* **[[https://arxiv.org/pdf/2306.09782|Lv et al 2023 - Full Parameter Fine-tuning for Large Language Models with Limited Resources]]**
* [[https://arxiv.org/pdf/2310.10195|Lv et al 2023 - AdaLomo: Low-memory Optimization with Adaptive Learning Rate]]
* [[https://arxiv.org/pdf/2403.14608|Han et al 2024 - Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey]]
* [[https://arxiv.org/pdf/2408.13296|2024 - The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities]] Misses a lot of material; not really the ultimate guide.
* [[https://arxiv.org/pdf/2411.09539|Szep et al 2024 - A Practical Guide to Fine-tuning Language Models with Limited Data]]
* **Blog Posts, etc.**
  * [[https://learn.microsoft.com/en-us/ai/playbook/technology-guidance/generative-ai/working-with-llms/fine-tuning-recommend|Microsoft - Recommendations for LLM fine-tuning]]
  * [[https://www.acorn.io/resources/learning-center/fine-tuning-llm/|Acorn - Fine-Tuning LLMs: Top 6 Methods, Challenges and Best Practices]]
  * [[https://openpipe.ai/blog/fine-tuning-best-practices-chapter-2-models|OpenPipe - Fine-tuning Best Practices Chapter 2: Models]]

{{media:fine-tuning-methods.png}}\\
Figure from [[https://arxiv.org/pdf/2106.04647.pdf|Mahabadi 2021]].

===== General Papers =====

See also [[ml:Optimization#Instability of Fine-tuning|Optimization - Instability of Fine-tuning]].

* [[https://arxiv.org/pdf/1909.11299.pdf|Lee et al 2019 - Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models]]
* [[https://arxiv.org/pdf/2002.06305.pdf|Dodge et al 2020 - Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping]] The reported instability can largely be mitigated by training for more epochs; see [[https://arxiv.org/pdf/2006.04884.pdf|Mosbach 2020]].
* [[https://arxiv.org/pdf/2005.02178.pdf|Zhou et al 2020 - IsoBN: Fine-Tuning BERT with Isotropic Batch Normalization]]
* [[https://arxiv.org/pdf/2006.04884.pdf|Mosbach et al 2020 - On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines]] Advocates a simple baseline in section 6: fine-tune using Adam with bias correction and a learning rate of 2e−5 for 20 epochs, with the learning rate linearly increased for the first 10% of steps and linearly decayed to zero afterward.
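The Mosbach et al 2020 baseline maps directly onto a standard linear warmup-then-decay learning-rate schedule. A minimal sketch of the multiplier, assuming a hypothetical 100 batches per epoch (not a number from the paper); the multiplier can be passed to PyTorch's ''LambdaLR'':

```python
# Mosbach et al 2020 baseline (section 6): Adam with bias correction,
# peak lr 2e-5, 20 epochs, linear warmup over the first 10% of steps,
# then linear decay to zero.
def linear_warmup_decay(step, total_steps, warmup_frac=0.1):
    """Learning-rate multiplier in [0, 1] for a given optimizer step."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return step / warmup_steps  # ramp 0 -> 1
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))  # decay 1 -> 0

peak_lr = 2e-5
total_steps = 20 * 100  # 20 epochs x 100 batches per epoch (hypothetical)

def lr_at(step):
    return peak_lr * linear_warmup_decay(step, total_steps)
```

With Hugging Face ''transformers'' the same schedule is available out of the box as ''get_linear_schedule_with_warmup''.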
* [[https://arxiv.org/pdf/2006.05987.pdf|Zhang et al 2020 - Revisiting Few-sample BERT Fine-tuning]]
* Gradual Fine-Tuning: [[https://arxiv.org/pdf/2103.02205.pdf|Xu et al 2021 - Gradual Fine-Tuning for Low-Resource Domain Adaptation]]
* [[https://arxiv.org/pdf/2106.14282.pdf|Zhou & Srikumar 2021 - A Closer Look at How Fine-tuning Changes BERT]]
* EasyAdapt: [[https://arxiv.org/pdf/2109.04711|Bai et al 2021 - Pre-train or Annotate? Domain Adaptation with a Constrained Budget]] Adapts [[https://arxiv.org/pdf/0907.1815|Daumé III 2009 - Frustratingly Easy Domain Adaptation]] to the Transformer era. Also considers the tradeoff between pretraining on in-domain data vs. annotating in-domain data under budget constraints.
* Child-Tuning: [[https://arxiv.org/pdf/2109.05687.pdf|Xu et al 2021 - Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning]] Applies a mask so that only a subset of the weights is fine-tuned; shows this outperforms regular fine-tuning.
* NoisyTune: [[https://arxiv.org/pdf/2202.12024.pdf|Wu et al 2022 - NoisyTune: A Little Noise Can Help You Finetune Pretrained Language Models Better]] Shows that adding a small perturbation to the parameters before fine-tuning can improve results.
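To make the NoisyTune idea concrete, the sketch below perturbs each parameter matrix once, before fine-tuning starts, with uniform noise scaled by that matrix's own standard deviation. This is a NumPy sketch, not the authors' code; the noise range and the intensity hyperparameter ''lam'' follow the paper's formulation only approximately:

```python
import numpy as np

def noisy_tune(params, lam=0.15, seed=0):
    """Perturb each parameter matrix with uniform noise scaled by its own
    standard deviation (the NoisyTune recipe), applied once before fine-tuning.
    lam controls the noise intensity; its exact role may differ from the paper."""
    rng = np.random.default_rng(seed)
    out = []
    for w in params:
        noise = rng.uniform(-lam / 2, lam / 2, size=w.shape) * w.std()
        out.append(w + noise)
    return out

# Conceptual usage: extract the model's weight matrices, perturb them with
# noisy_tune, load them back, then fine-tune as usual.
```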
* [[https://arxiv.org/pdf/2305.17333.pdf|Malladi et al 2023 - Fine-Tuning Language Models with Just Forward Passes]]
* [[https://arxiv.org/pdf/2310.10908|Qiu et al 2023 - Unlocking Emergent Modularity in Large Language Models]]
* [[https://arxiv.org/pdf/2406.15330|Li et al 2024 - Gradient-Mask Tuning Elevates the Upper Limits of LLM Performance]]
* **Removing the Causal Mask in Decoder-Only Models**
  * [[https://arxiv.org/pdf/2401.14556|2024 - Looking Right is Sometimes Right: Investigating the Capabilities of Decoder-only LLMs for Sequence Labeling]] Notes that "LLMs fall short of achieving state-of-the-art results in information extraction (IE) tasks, many of which are formulated as sequence labeling (SL)".
  * [[https://arxiv.org/pdf/2404.05961|LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders]] Shows that Mistral was probably pre-trained with some bidirectional attention.
  * [[https://arxiv.org/pdf/2504.06225|Zhang et al 2025 - Encoder-Decoder Gemma: Improving the Quality-Efficiency Trade-Off via Adaptation]] The authors appear to believe they are the first to adapt pretrained decoder-only LLMs to encoder-decoder models, which is incorrect.

===== Parameter-Efficient Tuning (PET) =====

See also [[ml:gpu_deep_learning#Memory Reduction Techniques|]].
* Adapter Layers: [[https://arxiv.org/pdf/1902.00751.pdf|Houlsby et al 2019 - Parameter-Efficient Transfer Learning for NLP]]
  * PyTorch code examples: [[https://github.com/Adapter-Hub/adapter-transformers|adapter-transformers]], [[https://github.com/Adapter-Hub/adapter-transformers/tree/master/notebooks|Colab notebook tutorials]], [[https://github.com/Adapter-Hub/adapter-transformers/blob/master/notebooks/01_Adapter_Training.ipynb|Training an Adapter for a Transformer model]]
* P-Tuning: [[https://arxiv.org/pdf/2103.10385.pdf|Liu et al 2021 - GPT Understands, Too]]
* [[https://arxiv.org/pdf/2106.04489.pdf|Mahabadi et al 2021 - Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks]]
* [[https://arxiv.org/pdf/2106.04647.pdf|Mahabadi et al 2021 - COMPACTER: Efficient Low-Rank Hypercomplex Adapter Layers]]
* **LoRA**: [[https://arxiv.org/pdf/2106.09685.pdf|Hu et al 2021 - LoRA: Low-Rank Adaptation of Large Language Models]]
* BitFit: [[https://arxiv.org/pdf/2106.10199.pdf|Ben-Zaken et al 2021 - BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models]]
* MAM-Adapter: [[https://arxiv.org/pdf/2110.04366.pdf|He et al 2022 - Towards a Unified View of Parameter-Efficient Transfer Learning]]
* **[[https://arxiv.org/pdf/2202.07962.pdf|Chen et al 2022 - Revisiting Parameter-Efficient Tuning: Are We Really There Yet?]]**
* T-Few: [[https://arxiv.org/pdf/2205.05638.pdf|Liu et al 2022 - Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning]]
* **QLoRA**: [[https://arxiv.org/pdf/2305.14314.pdf|Dettmers et al 2023 - QLoRA: Efficient Finetuning of Quantized LLMs]]
* [[https://arxiv.org/pdf/2305.17333.pdf|Malladi et al 2023 - Fine-Tuning Language Models with Just Forward Passes]]
* LoRAMoE: [[https://arxiv.org/pdf/2312.09979|Dou et al 2023 - LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin]] Combines LoRA with MoE to improve performance.
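To make the LoRA entry above concrete: the method freezes a pretrained weight ''W'' and learns a low-rank update ''(alpha/r)·BA'', with ''B'' zero-initialized so training starts from exactly the pretrained model, and the update can be merged into ''W'' after training for zero inference overhead. A minimal NumPy sketch; the shapes and the init scale for ''A'' are illustrative, not the paper's exact settings:

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha/r) * B @ A,
    as in Hu et al 2021. B starts at zero, so the initial forward pass
    is identical to the frozen layer's."""
    def __init__(self, W, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                     # frozen pretrained weight
        self.A = rng.normal(0, 0.01, size=(r, d_in))   # trainable, small random init
        self.B = np.zeros((d_out, r))                  # trainable, zero init
        self.scale = alpha / r

    def forward(self, x):
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

    def merge(self):
        """Fold the adapter into W for inference (no extra latency)."""
        return self.W + self.scale * self.B @ self.A
```

Only ''A'' and ''B'' (''r·(d_in + d_out)'' parameters per layer) would receive gradients, which is where the memory savings come from.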
* FLORA: [[https://arxiv.org/pdf/2402.03293|Hao et al 2024 - FLORA: Low-Rank Adapters Are Secretly Gradient Compressors]]
* GaLore: [[https://arxiv.org/pdf/2403.03507.pdf|Zhao et al 2024 - GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection]] Can also be used for pre-training.
* DoRA: [[https://arxiv.org/pdf/2402.09353|Liu et al 2024 - DoRA: Weight-Decomposed Low-Rank Adaptation]] **Notes that a performance gap still often exists between PEFT and full fine-tuning** (cited for this by [[https://arxiv.org/pdf/2405.15525|He 2024]]).
* [[https://arxiv.org/pdf/2405.15525|He et al 2024 - Sparse Matrix in Large Language Model Fine-tuning]]

===== Related Pages =====

* [[nlp:Domain Adaptation]]
* [[nlp:Pretraining]]
* [[nlp:Prompting]]
* [[nlp:prompting#soft-prompting_etc|Prompt-Tuning, Soft-Prompting, etc.]]
* [[NN Training]]