--- nlp:llm_safety [2025/05/01 10:10] – [Papers] jmflanig
+++ nlp:llm_safety [2026/03/07 22:18] (current) – [Papers] jmflanig
@@ Line 1: / Line 1: @@
 ====== Large Language Model Safety ======
+===== Overviews =====
+  * [[https://arxiv.org/pdf/2305.11391|Huang et al 2023 - A Survey of Safety and Trustworthiness of Large Language Models through the Lens of Verification and Validation]]
+  * [[https://arxiv.org/pdf/2308.05374|Liu et al 2023 - Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment]]
+  * [[https://arxiv.org/pdf/2402.09283|Dong et al 2024 - Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey]]
+  * **[[https://arxiv.org/pdf/2412.17686|Shi et al 2024 - Large Language Model Safety: A Holistic Survey]]** Great survey
+  * [[https://arxiv.org/pdf/2501.17805|2025 - International AI Safety Report]] Safety for AI in general
 ===== Papers =====
@@ Line 6: / Line 13: @@
   * [[https://arxiv.org/pdf/2404.12038|Xu et al 2024 - Uncovering Safety Risks in Open-source LLMs through Concept Activation Vector]]
   * [[https://arxiv.org/pdf/2404.13208|Wallace et al 2024 - The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions]]
+  * [[https://arxiv.org/pdf/2508.06601|O'Brien et al 2025 - Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs]]
 ===== Jailbreaking LLMs =====
@@ Line 19: / Line 27: @@
   * [[AGI]]
   * [[Alignment]]
+  * [[ml:Mechanistic Interpretability]]
   * [[ml:Model Editing and Unlearning|Model Editing]]
+  * [[ml:Trustworthy AI]]