Fine-Tuning
This page lists fine-tuning methods such as Adapters, LoRA, BitFit, NoisyTune, etc.
Overviews
- Mosbach et al 2020 - On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines Gives a good baseline setting of hyperparameters for tuning BERT in section 6: fine-tune using Adam with bias correction and a learning rate of 2e−5 for 20 epochs, with the learning rate linearly increased for the first 10% of steps and linearly decayed to zero afterward (a minimal sketch of this schedule follows this list).
- 2024 - The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities Missing lots of stuff. Not really the ultimate guide.
- Blog Posts, etc
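
A minimal PyTorch sketch of the Mosbach et al recipe above, assuming a HuggingFace-style model and dataloader (model and train_loader are placeholders, not from the paper):

  import torch
  from transformers import get_linear_schedule_with_warmup

  num_epochs = 20                            # Mosbach et al: train longer
  total_steps = num_epochs * len(train_loader)
  warmup_steps = int(0.1 * total_steps)      # warm up for the first 10% of steps

  # torch.optim.AdamW applies Adam's bias correction by default; the original
  # BERTAdam omitted it, which Mosbach et al identify as a source of instability.
  optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
  scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)

  for epoch in range(num_epochs):
      for batch in train_loader:
          loss = model(**batch).loss
          loss.backward()
          optimizer.step()
          scheduler.step()
          optimizer.zero_grad()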

Figure from Mahabadi 2021.
General Papers
See also Optimization - Instability of Fine-tuning.
- Dodge et al 2020 - Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping The instability it documents can largely be mitigated by training for more epochs; see Mosbach et al 2020.
- Mosbach et al 2020 - On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines Advocates a simple baseline in section 6: fine-tune using Adam with bias correction and a learning rate of 2e−5 for 20 epochs, with the learning rate linearly increased for the first 10% of steps and linearly decayed to zero afterward.
- Gradual Fine-Tuning: Xu et al 2021 - Gradual Fine-Tuning for Low-Resource Domain Adaptation
- Wu et al 2022 - NoisyTune: A Little Noise Can Help You Finetune Pretrained Language Models Better Shows that adding a small random perturbation to the pretrained parameters before fine-tuning can improve results (a minimal sketch follows this list).
- Removing the Causal Mask In Decoder-Only Models
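
A minimal sketch of the NoisyTune perturbation: each parameter matrix gets uniform noise scaled by its own standard deviation before fine-tuning starts (the noise_lambda value here is illustrative):

  import torch

  @torch.no_grad()
  def noisy_tune(model, noise_lambda=0.15):
      # NoisyTune: W += U(-lambda/2, lambda/2) * std(W), applied to
      # every parameter matrix before any fine-tuning steps are taken.
      for param in model.parameters():
          if param.numel() <= 1:
              continue  # std() is undefined for single-element tensors
          noise = torch.rand_like(param) - 0.5  # uniform in [-0.5, 0.5]
          param.add_(noise * noise_lambda * param.std())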
Parameter-Efficient Tuning (PET)
See also Memory Reduction Techniques.
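
As a concrete illustration of the PET idea, here is a minimal LoRA-style layer: a frozen linear layer plus a trainable low-rank update (the rank and scaling defaults are illustrative, not from any paper below):

  import torch
  import torch.nn as nn

  class LoRALinear(nn.Module):
      # Wraps a frozen nn.Linear with a trainable low-rank update:
      # y = W x + (alpha / r) * B A x, where only A and B are trained.
      def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
          super().__init__()
          self.base = base
          for p in self.base.parameters():
              p.requires_grad = False  # freeze the pretrained weights
          self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
          self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at step 0
          self.scaling = alpha / r

      def forward(self, x):
          return self.base(x) + self.scaling * (x @ self.A.t() @ self.B.t())

Only A and B receive gradients, so optimizer state and saved checkpoints shrink accordingly.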
- PyTorch code examples: the Adapter Transformers Colab notebook tutorials, e.g. Training an Adapter for a Transformer model
- P-tuning: Liu et al 2021 - GPT Understands, Too
- Zhao et al 2024 - GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection Can also be used for pre-training (a minimal sketch of the projection idea follows).
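
A minimal sketch of the GaLore projection idea (not the reference implementation; the rank and refresh interval are illustrative):

  import torch

  class GaLoreProjector:
      # Project the gradient of a 2D weight matrix into a rank-r subspace,
      # so the optimizer's moment buffers live in the smaller space.
      def __init__(self, rank=4, update_every=200):
          self.rank, self.update_every = rank, update_every
          self.step, self.P = 0, None

      def project(self, grad):  # (m, n) -> (r, n)
          if self.P is None or self.step % self.update_every == 0:
              U, _, _ = torch.linalg.svd(grad, full_matrices=False)
              self.P = U[:, : self.rank]  # top-r left singular vectors
          self.step += 1
          return self.P.t() @ grad

      def project_back(self, update):  # (r, n) -> (m, n)
          return self.P @ update

The projected gradient is fed to a standard optimizer such as Adam, and the resulting low-rank update is projected back before being applied to the full weight matrix.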
Related Pages