Cracking LLMs!? The "Representation Hijacking" Case Where Safety Measures Leak Like a Sieve 💥

Published: 2025/12/3 13:19:34

Ultra-short summary: A weak spot in LLMs (AI) has been found! Let's get defenses in place so it can't be abused 💖

✨ Gyaru-style sparkle points ✨

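● Hacking what goes on "inside the LLM's head"! An attack that swaps words to make the model do bad things 😳
● Existing security can't stop it! That's the new attack technique called "Doublespeak" 🤯
● Companies need countermeasures! Boost LLM safety and grab that business opportunity 😎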

In-Context Representation Hijacking

Itay Yona / Amir Sarid / Michael Karasik / Yossi Gandelsman

We introduce Doublespeak, a simple in-context representation hijacking attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., bomb) with a benign token (e.g., carrot) across multiple in-context examples, provided as a prefix to a harmful request. We demonstrate that this substitution leads to the internal representation of the benign token converging toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., "How to build a carrot?") are internally interpreted as disallowed instructions (e.g., "How to build a bomb?"), thereby bypassing the model's safety alignment. We use interpretability tools to show that this semantic overwrite emerges layer by layer, with benign meanings in early layers converging into harmful semantics in later ones. Doublespeak is optimization-free, broadly transferable across model families, and achieves strong success rates on closed-source and open-source systems, reaching a 74% attack success rate (ASR) on Llama-3.3-70B-Instruct with a single-sentence context override. Our findings highlight a new attack surface in the latent space of LLMs, revealing that current alignment strategies are insufficient and should instead operate at the representation level.
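The layer-by-layer convergence described in the abstract can be illustrated with a small analysis-side sketch. This is not the authors' interpretability tooling, which the abstract does not specify; it is one simple way to quantify convergence, assuming you already have, for each layer, a hidden-state vector for the benign token in the hijacked context and a reference vector for the harmful keyword. The function names, the 0.9 threshold, and the 2-D toy vectors are all hypothetical and synthetic.

```python
import math
from typing import List, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def layerwise_similarity(
    hijacked: Sequence[Sequence[float]],
    reference: Sequence[Sequence[float]],
) -> List[float]:
    """Per-layer similarity between the benign token's representation
    (in the hijacked context) and the harmful keyword's representation."""
    if len(hijacked) != len(reference):
        raise ValueError("both inputs must cover the same number of layers")
    return [cosine(h, r) for h, r in zip(hijacked, reference)]


def first_converged_layer(
    sims: Sequence[float], threshold: float = 0.9
) -> Optional[int]:
    """Index of the first layer whose similarity reaches the threshold,
    or None if no layer does. The threshold is an arbitrary assumption."""
    for layer, sim in enumerate(sims):
        if sim >= threshold:
            return layer
    return None


# Synthetic toy data (NOT taken from the paper): three layers, 2-D vectors.
# The benign token starts orthogonal to the reference and drifts toward it.
toy_hijacked = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
toy_reference = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]

toy_sims = layerwise_similarity(toy_hijacked, toy_reference)
toy_layer = first_converged_layer(toy_sims)
```

With a real model, the per-layer vectors would come from whatever hidden-state extraction your framework provides. A similarity that rises with depth would be consistent with the abstract's claim that the overwrite appears in later layers, and a defense operating at the representation level could monitor a signal of this kind.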

cs / cs.CL / cs.AI / cs.CR / cs.LG
