====== Large Language Model Safety ======

===== Overviews =====
  * [[https://arxiv.org/pdf/2305.11391|Huang et al 2023 - A Survey of Safety and Trustworthiness of Large Language Models through the Lens of Verification and Validation]]
  * [[https://arxiv.org/pdf/2308.05374|Liu et al 2023 - Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment]]
  * [[https://arxiv.org/pdf/2402.09283|Dong et al 2024 - Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey]]
  * **[[https://arxiv.org/pdf/2412.17686|Shi et al 2024 - Large Language Model Safety: A Holistic Survey]]** Great survey
  * [[https://arxiv.org/pdf/2501.17805|2025 - International AI Safety Report]] Safety for AI in general
  
===== Papers =====
  * [[https://arxiv.org/pdf/2310.01405|Zou et al 2023 - Representation Engineering: A Top-Down Approach to AI Transparency]]
  * [[https://arxiv.org/pdf/2404.09932|Anwar et al 2024 - Foundational Challenges in Assuring Alignment and Safety of Large Language Models]]
  * [[https://arxiv.org/pdf/2404.12038|Xu et al 2024 - Uncovering Safety Risks in Open-source LLMs through Concept Activation Vector]]
  * [[https://arxiv.org/pdf/2404.13208|Wallace et al 2024 - The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions]]
  * [[https://arxiv.org/pdf/2508.06601|O'Brien et al 2025 - Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs]]
  
===== Jailbreaking LLMs =====
  * [[https://arxiv.org/pdf/2307.15043.pdf|Zou et al 2023 - Universal and Transferable Adversarial Attacks on Aligned Language Models]]
  * [[https://arxiv.org/pdf/2403.12171|Zhou et al 2024 - EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models]]
  * [[https://arxiv.org/pdf/2404.16873|Paulus et al 2024 - AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs]]
  
===== Related Pages =====
  * [[AGI]]
  * [[Alignment]]
  * [[ml:Mechanistic Interpretability]]
  * [[ml:Model Editing and Unlearning|Model Editing]]
  * [[ml:Trustworthy AI]]