Finding Safety Neurons in Large Language Models
CoRR (2024)
Abstract
Large language models (LLMs) excel in various capabilities but also pose
safety risks such as generating harmful content and misinformation, even after
safety alignment. In this paper, we explore the inner mechanisms of safety
alignment from the perspective of mechanistic interpretability, focusing on
identifying and analyzing safety neurons within LLMs that are responsible for
safety behaviors. We propose generation-time activation contrasting to locate
these neurons and dynamic activation patching to evaluate their causal effects.
Experiments on multiple recent LLMs show that: (1) Safety neurons are sparse
and effective. We can restore 90% safety performance with intervention only
on about 5% of all the neurons. (2) Safety neurons encode transferable
mechanisms. They exhibit consistent effectiveness on different red-teaming
datasets. The finding of safety neurons also helps interpret the "alignment tax". We
observe that the identified key neurons for safety and helpfulness
significantly overlap, but they require different activation patterns of the
shared neurons. Furthermore, we demonstrate an application of safety neurons in
detecting unsafe outputs before generation. Our findings may promote further
research on understanding LLM alignment. The source code will be publicly
released to facilitate future research.
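
The two techniques named in the abstract can be illustrated concretely. Below is a minimal sketch, assuming HuggingFace-style causal LMs from the LLaMA family (so each block's MLP activation function sits at `model.model.layers[i].mlp.act_fn`, a common hook point for "neurons" in MLP interpretability). All helper names (`record_mlp_activations`, `contrast_neurons`, `patch_and_generate`) are hypothetical, not the authors' released code; for brevity, activations are recorded on a single forward pass over the prompt, whereas the paper contrasts activations at generation time.

```python
# Hypothetical sketch of activation contrasting and activation patching.
import torch

def record_mlp_activations(model, input_ids):
    """Run one forward pass and record mean per-neuron MLP activations."""
    acts, hooks = {}, []
    for i, layer in enumerate(model.model.layers):
        def make_hook(idx):
            def hook(module, inputs, output):
                # Average over batch and sequence -> one value per neuron.
                acts[idx] = output.detach().float().mean(dim=(0, 1))
            return hook
        hooks.append(layer.mlp.act_fn.register_forward_hook(make_hook(i)))
    with torch.no_grad():
        model(input_ids)
    for h in hooks:
        h.remove()
    return acts  # {layer_idx: tensor of shape [intermediate_size]}

def contrast_neurons(aligned_model, base_model, prompts, top_k=1000):
    """Activation contrasting: score each neuron by how differently it
    fires in the aligned vs. the unaligned model; keep the top-k gaps
    as safety-neuron candidates."""
    diffs = {}
    for input_ids in prompts:
        a = record_mlp_activations(aligned_model, input_ids)
        b = record_mlp_activations(base_model, input_ids)
        for layer in a:
            gap = (a[layer] - b[layer]).abs()
            diffs[layer] = diffs.get(layer, 0.0) + gap
    flat = [(layer, j, v.item())
            for layer, vs in diffs.items() for j, v in enumerate(vs)]
    flat.sort(key=lambda t: -t[2])
    return flat[:top_k]  # list of (layer, neuron, score)

def patch_and_generate(base_model, aligned_acts, candidates, input_ids,
                       max_new_tokens=64):
    """Activation patching: generate from the base model while overwriting
    candidate neurons with aligned-model activations, to test their causal
    effect on safety behavior."""
    by_layer = {}
    for layer, neuron, _ in candidates:
        by_layer.setdefault(layer, []).append(neuron)

    hooks = []
    for layer_idx, neurons in by_layer.items():
        idx = torch.tensor(neurons)
        target = aligned_acts[layer_idx][idx]

        def make_hook(idx=idx, target=target):
            def hook(module, inputs, output):
                patched = output.clone()
                patched[..., idx] = target.to(patched.dtype)
                return patched  # returning a value replaces the output
            return hook

        mlp_act = base_model.model.layers[layer_idx].mlp.act_fn
        hooks.append(mlp_act.register_forward_hook(make_hook()))
    try:
        return base_model.generate(input_ids, max_new_tokens=max_new_tokens)
    finally:
        for h in hooks:
            h.remove()
```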
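
The detection application could likewise be sketched as a simple linear probe fit on safety-neuron activations computed from the prompt alone, so likely unsafe generations can be flagged before decoding. The probe choice and the labeled prompt set below are assumptions for illustration, not the paper's setup; `record_mlp_activations` is the hypothetical helper from the sketch above.

```python
# Hypothetical pre-generation unsafe-output detector via a linear probe.
import numpy as np
from sklearn.linear_model import LogisticRegression

def neuron_features(model, prompts, candidates):
    """One feature vector per prompt: activations of candidate neurons."""
    rows = []
    for input_ids in prompts:
        acts = record_mlp_activations(model, input_ids)
        rows.append(np.array([acts[layer][n].item()
                              for layer, n, _ in candidates]))
    return np.stack(rows)

# Usage, assuming prompts labeled safe=0 / unsafe=1:
# X = neuron_features(model, train_prompts, candidates)
# probe = LogisticRegression(max_iter=1000).fit(X, labels)
# risky = probe.predict(neuron_features(model, new_prompts, candidates))
```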