Break LLMs with Emoji! 💥 A Jailbreak Study Bursts onto the Scene!

Published: 2026/1/2 10:49:06

I. Research Overview

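Research objectives
- Societal problem: LLMs (large language models) sometimes say dangerous stuff, right? And apparently some people exploit that 😭
- Academic challenge: Safety measures that only look at text aren't enough 🙅‍♀️ It turns out LLMs also have weak spots against attacks that come in through other elements, like emoji!
- Results:
  - Demonstrated that emoji can be used to "jailbreak" LLMs 😎
  - Also found that the level of safety differs from one LLM to another!
  - Safety measures at the prompt (instruction text) level turn out to be super important!
- Impact:
  - When building LLMs, you've got to strengthen safety measures with emoji in mind too!
  - This might make AI something people can use with more peace of mind 💕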
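
So what could "keeping emoji in mind" at the prompt level look like? Here is a tiny Python sketch, just for illustration and not the paper's method. It counts emoji-ish characters in a prompt and produces a text-only copy that a safety check could also look at. The names `EMOJI_RANGES`, `is_emoji_char`, and `screen_prompt` are made up for this example, and the code point ranges are a rough heuristic rather than the full Unicode emoji definition.

```python
# Rough heuristic ranges (an assumption for illustration),
# NOT the complete Unicode emoji definition.
EMOJI_RANGES = [
    (0x1F300, 0x1FAFF),  # pictographs, emoticons, transport, and related blocks
    (0x2600, 0x27BF),    # miscellaneous symbols and dingbats
]
# Code points that glue emoji sequences together.
JOINERS = {0x200D, 0xFE0F}  # zero-width joiner, variation selector-16


def is_emoji_char(ch):
    """Return True if the character looks emoji-related under the heuristic."""
    cp = ord(ch)
    return cp in JOINERS or any(lo <= cp <= hi for lo, hi in EMOJI_RANGES)


def screen_prompt(prompt):
    """Summarize the emoji content of a prompt (hypothetical helper).

    Returns the number of emoji-related code points, their share of the
    prompt, and the prompt with those code points stripped out.
    """
    emoji_chars = [ch for ch in prompt if is_emoji_char(ch)]
    text_only = "".join(ch for ch in prompt if not is_emoji_char(ch))
    return {
        "emoji_count": len(emoji_chars),
        "emoji_ratio": len(emoji_chars) / len(prompt) if prompt else 0.0,
        "text_only": text_only,
    }
```

A pipeline could, for example, run its usual safety check on both the original prompt and `text_only`, or flag prompts whose `emoji_ratio` is unusually high. Whether either idea actually helps is an open question that this sketch does not answer.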

Research background


Emoji-Based Jailbreaking of Large Language Models

M P V S Gopinadh / S Mahaboob Hussain

Large Language Models (LLMs) are integral to modern AI applications, but their safety alignment mechanisms can be bypassed through adversarial prompt engineering. This study investigates emoji-based jailbreaking, where emoji sequences are embedded in textual prompts to trigger harmful and unethical outputs from LLMs. We evaluated 50 emoji-based prompts on four open-source LLMs: Mistral 7B, Qwen 2 7B, Gemma 2 9B, and Llama 3 8B. Metrics included jailbreak success rate, safety alignment adherence, and latency, with responses categorized as successful, partial, and failed. Results revealed model-specific vulnerabilities: Gemma 2 9B and Mistral 7B exhibited 10% success rates, while Qwen 2 7B achieved full alignment (0% success). A chi-square test (chi^2 = 32.94, p < 0.001) confirmed significant inter-model differences. While prior works focused on emoji attacks targeting safety judges or classifiers, our empirical analysis examines direct prompt-level vulnerabilities in LLMs. The results reveal limitations in safety mechanisms and highlight the necessity for systematic handling of emoji-based representations in prompt-level safety and alignment pipelines.
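
The abstract reports a chi-square test across models. As a rough illustration of how such a test works, here is a minimal pure-Python sketch of a chi-square test of independence on a models-by-outcomes count table. The function names are hypothetical. The abstract gives neither the full count table nor the degrees of freedom, so the sketch contains no data from the paper. The p-value helper uses a closed form that only holds for even degrees of freedom.

```python
import math


def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for a contingency table.

    `table` is a list of rows of observed counts, e.g. one row per model and
    one column per outcome category. Assumes every row and column total is
    nonzero (otherwise an expected count would be zero).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs a nonzero total")

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def chi_square_p_value_even_dof(stat, dof):
    """Upper-tail p-value of the chi-square distribution, even `dof` only.

    Uses the closed form exp(-x/2) * sum_{k < dof/2} (x/2)^k / k!.
    """
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form needs a positive, even dof")
    half = stat / 2
    terms = (half ** k / math.factorial(k) for k in range(dof // 2))
    return math.exp(-half) * sum(terms)
```

If the reported test was run on a 4 x 3 table (four models, three outcome categories), which is an assumption, the degrees of freedom would be 6. In that case `chi_square_p_value_even_dof(32.94, 6)` comes out well below 0.001, in line with the p-value stated in the abstract.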

cs / cs.CR / cs.AI
