
nlp:llm_safety

Differences

This shows you the differences between two versions of the page.

Old revision: nlp:llm_safety [2025/08/28 05:13] – [Overviews] jmflanig
New revision: nlp:llm_safety [2026/03/07 22:18] (current) – [Papers] jmflanig
Line 13:
   * [[https://arxiv.org/pdf/2404.12038|Xu et al 2024 - Uncovering Safety Risks in Open-source LLMs through Concept Activation Vector]]
   * [[https://arxiv.org/pdf/2404.13208|Wallace et al 2024 - The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions]]
 +  * [[https://arxiv.org/pdf/2508.06601|O'Brien et al 2025 - Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs]]
  
 ===== Jailbreaking LLMs =====
nlp/llm_safety.1756357987.txt.gz · Last modified: 2025/08/28 05:13 by jmflanig
