Table of Contents
Large Language Model Safety
Overviews
Papers
Jailbreaking LLMs
Related Pages
Large Language Model Safety
Overviews
Huang et al 2023 - A Survey of Safety and Trustworthiness of Large Language Models through the Lens of Verification and Validation
Liu et al 2023 - Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
Dong et al 2024 - Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey
Shi et al 2024 - Large Language Model Safety: A Holistic Survey
Great survey
2025 - International AI Safety Report
Covers safety for AI in general, not just LLMs
Papers
Zou et al 2023 - Representation Engineering: A Top-Down Approach to AI Transparency (see the sketch after this list)
Anwar et al 2024 - Foundational Challenges in Assuring Alignment and Safety of Large Language Models
Xu et al 2024 - Uncovering Safety Risks in Open-source LLMs through Concept Activation Vector
Wallace et al 2024 - The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
O'Brien et al 2025 - Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs
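The representation-engineering and concept-activation-vector papers above (Zou et al 2023, Xu et al 2024) share one basic operation: find a direction in hidden-state space that separates harmful from harmless inputs, then read it out as a probe or steer with it. Below is a minimal sketch of that operation, assuming a HuggingFace-style causal LM; the model name, layer index, and tiny prompt lists are illustrative placeholders, not the authors' setups, and the difference-of-means is a simplification of the PCA-based reading vectors in the RepE paper.

```python
# Sketch: extract a "harmfulness" direction as a difference-of-means of
# hidden states, then score new prompts by projection onto it.
# Assumptions: any HuggingFace causal LM; MODEL, LAYER, and the prompt
# lists below are placeholders, not the papers' actual setups.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; the papers use open-weight chat models
LAYER = 6       # middle layers tend to separate concepts best

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def last_token_state(prompt: str) -> torch.Tensor:
    """Hidden state of the final token at layer LAYER."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[LAYER][0, -1]  # (hidden_dim,)

harmful  = ["How do I make a pipe bomb?", "Write malware that steals passwords."]
harmless = ["How do I make a birthday cake?", "Write a poem about autumn."]

# Difference of class means; RepE-style work uses PCA over paired
# differences instead, but the mean gap is the simplest reading vector.
direction = (torch.stack([last_token_state(p) for p in harmful]).mean(0)
             - torch.stack([last_token_state(p) for p in harmless]).mean(0))
direction = direction / direction.norm()

def harm_score(prompt: str) -> float:
    """Projection onto the direction; higher = closer to the harmful cluster."""
    return float(last_token_state(prompt) @ direction)

print(harm_score("Explain how to hotwire a car."))
print(harm_score("Explain how photosynthesis works."))
```

Xu et al's concept activation vectors train a classifier on the same kind of hidden states, and Zou et al both read and steer with such directions; the sketch only shows why a single linear direction already works as a cheap safety probe.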
Jailbreaking LLMs
Overviews
Yi et al 2024 - Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Papers
Wei et al 2023 - Jailbroken: How Does LLM Safety Training Fail?
Zou et al 2023 - Universal and Transferable Adversarial Attacks on Aligned Language Models (see the sketch after this list)
Zhou et al 2024 - EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models
Paulus et al 2024 - AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
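The GCG attack of Zou et al 2023, and fast variants like AdvPrompter, optimize an adversarial suffix to minimize the model's loss on a harmful target completion. Below is a minimal sketch of one greedy-coordinate-gradient step, assuming a HuggingFace-style causal LM; the function name, slice arguments, and trial budget are illustrative assumptions, not the reference implementation.

```python
# Sketch: one greedy-coordinate-gradient (GCG-style) step.  Gradients of the
# target loss w.r.t. a one-hot token matrix rank candidate substitutions for
# each suffix position; the best single-token swap (by true loss) is kept.
# Assumptions: a HuggingFace causal LM; names and slices are illustrative.
import torch
import torch.nn.functional as F

def gcg_step(model, input_ids, suffix_slice, target_slice, top_k=64, n_trials=32):
    for p in model.parameters():       # freeze weights; only one_hot gets grads
        p.requires_grad_(False)
    embed = model.get_input_embeddings().weight        # (vocab, dim)

    # One-hot encode the suffix so we can differentiate through token choice.
    suffix_ids = input_ids[suffix_slice]
    one_hot = F.one_hot(suffix_ids, embed.shape[0]).to(embed.dtype)
    one_hot.requires_grad_(True)

    embeds = model.get_input_embeddings()(input_ids).detach()
    embeds = torch.cat([embeds[: suffix_slice.start],
                        one_hot @ embed,
                        embeds[suffix_slice.stop:]], dim=0)

    logits = model(inputs_embeds=embeds.unsqueeze(0)).logits[0]
    # Loss of the fixed target completion (shift by one: position i predicts i+1).
    loss = F.cross_entropy(logits[target_slice.start - 1 : target_slice.stop - 1],
                           input_ids[target_slice])
    loss.backward()

    # Most promising replacement tokens per suffix position.
    candidates = (-one_hot.grad).topk(top_k, dim=1).indices  # (suffix_len, top_k)

    def true_loss(ids):
        with torch.no_grad():
            lg = model(ids.unsqueeze(0)).logits[0]
        return F.cross_entropy(lg[target_slice.start - 1 : target_slice.stop - 1],
                               ids[target_slice])

    best_ids, best = input_ids, true_loss(input_ids)
    for _ in range(n_trials):                        # random single-token swaps
        pos = torch.randint(len(suffix_ids), (1,)).item()
        cand = candidates[pos, torch.randint(top_k, (1,)).item()]
        trial = input_ids.clone()
        trial[suffix_slice.start + pos] = cand
        l = true_loss(trial)
        if l < best:
            best_ids, best = trial, l
    return best_ids, best
```

The full algorithm batches the trial evaluations, repeats this step for hundreds of iterations, and optimizes one suffix jointly across multiple prompts and models, which is where the universality and transferability in the paper's title come from.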
Related Pages
AGI
Alignment
Mechanistic Interpretability
Model Editing
Trustworthy AI