nlp:instruction-tuning
nlp/instruction-tuning · Last modified: 2025/06/01 22:58 by jmflanig
  * **[[https://
  * [[https://
  * [[https://
  * **[[https://
  * RSO: [[https://
  * [[https://
  * [[https://
  * **[[https:// RLHF" and that "PPO is unnecessarily complicated for a pre-trained LLM environment."
  * [[https://
  * [[https://
  * [[https://
  * Group Relative Policy Optimization (GRPO): [[https://
  * [[https://
  * [[https://
  * [[https://
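The GRPO entry above names a critic-free policy-gradient method: a group of completions is sampled per prompt, each is scored by a reward model, and the advantage of each completion is its reward normalized against the group's own mean and standard deviation, so no learned value function is needed as a baseline. A minimal sketch of that normalization step (the function name and example rewards are illustrative, not taken from any of the linked papers):

```python
# Group-relative advantage estimation, the core idea of GRPO:
# normalize each completion's reward within its own sampled group
# instead of subtracting a learned value-function baseline.
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """Return (r_i - mean) / (std + eps) for each reward in one group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: G = 4 completions of a single prompt, scored by a reward model.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
# Within a group the advantages sum to ~0: above-average completions
# get positive advantage, below-average ones negative.
```

These per-token-sequence advantages then weight a clipped policy-gradient objective, much as PPO's advantages do, but without the cost of training a separate critic.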
  * [[https://
  * [[https://
  * [[https://
  * **Multi-Dimensional Rewards**
    * [[https://
    * [[https://
  * **Analyzing,
    * [[https://
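The **Multi-Dimensional Rewards** entry above concerns reward models that score several axes at once rather than emitting one scalar. A common way to use such scores for policy optimization is to scalarize them with a weighted sum; a minimal sketch, where the dimension names and weights are illustrative assumptions rather than values from any linked paper:

```python
# Scalarizing a multi-dimensional reward into a single training signal.
# The dimension names and weights below are illustrative assumptions.
def scalarize(reward_vector, weights):
    """Weighted sum over the reward dimensions listed in `weights`."""
    return sum(weights[dim] * reward_vector[dim] for dim in weights)

reward = scalarize(
    {"helpfulness": 0.8, "harmlessness": 0.9, "verbosity": 0.3},
    {"helpfulness": 1.0, "harmlessness": 1.0, "verbosity": -0.5},
)
# reward = 0.8 + 0.9 - 0.15 = 1.55
```

Negative weights (as on verbosity here) turn a scored dimension into a penalty; the chosen weights encode the trade-off between axes and are themselves a tuning decision.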
===== Datasets =====
  * [[Alignment]]
  * [[Human-In-The-Loop]]
  * [[ml:
  * [[human-in-the-loop#