Published: 2026/1/4 22:35:37

RoguePrompt is born! How to smash through LLM guardrails 💖 (for new-business planners at IT companies)

  1. Ultra-short summary: "RoguePrompt" is an attack that exploits an LLM weakness (prompt injection) 🤖💥 It's research toward building safer LLMs!

  2. Gal-style sparkle points ✨

    • Multi-step transformations dodge moderation! It smashes through walls that ordinary attacks can't break!
    • Getting the LLM itself to go "Please! Remember the original instruction!" — this self-reconstruction trick is amazing ✨
    • It attacks using only the API or UI, so it works on all kinds of LLMs 😉
  3. Detailed explanation

    • Background: LLMs (large language models) can do all sorts of things, but they're also vulnerable to malicious instructions (prompt injection) 💔 Somebody has to deal with that!
    • Method: RoguePrompt transforms the prompt again and again! It's magic that gets the LLM to execute a hidden instruction without noticing it 🧙‍♀️✨
    • Results: The attack succeeds while neatly dodging moderation (the LLM's safety mechanism)! Now we can expose LLM weaknesses 👀
    • Significance: For IT companies building safe AI services, this technique is seriously important! It could help prevent abuse while expanding what LLMs can do!
  4. Real-world use-case ideas 💡

    • AI chatbots 🤖💬: run security checks with RoguePrompt so they can't be tricked into saying something they shouldn't!
    • AI dev tools 💻✨: prevent unsafe code generation and support secure development!

The rest is in the 「らくらく論文」 app

RoguePrompt: Dual-Layer Ciphering for Self-Reconstruction to Circumvent LLM Moderation

Benyamin Tafreshian

Large language models (LLMs) are becoming increasingly integrated into mainstream development platforms and daily technological workflows, typically behind moderation and safety controls. Despite these controls, preventing prompt-based policy evasion remains challenging, and adversaries continue to jailbreak LLMs by crafting prompts that circumvent implemented safety mechanisms. While prior jailbreak techniques have explored obfuscation and contextual manipulation, many operate as single-step transformations, and their effectiveness is inconsistent across current state-of-the-art models. This leaves a limited understanding of multistage prompt-transformation attacks that evade moderation, reconstruct forbidden intent, and elicit policy-violating outputs. This paper introduces RoguePrompt, an automated jailbreak pipeline that leverages dual-layer prompt transformations to convert forbidden prompts into safety-evading queries. By partitioning the forbidden prompts and applying two nested encodings (ROT-13 and Vigenère) along with natural-language decoding instructions, it produces benign-looking prompts that evade filters and induce the model to execute the original prompt within a single query. RoguePrompt was developed and evaluated under a black-box threat model, with only API and UI access to the LLMs, and tested on 313 real-world hard-rejected prompts. Success was measured in terms of moderation bypass, instruction reconstruction, and execution, using both automated and human evaluation. It achieved an average of 93.93% filter bypass, 79.02% reconstruction, and 70.18% execution success across multiple frontier LLMs. These results demonstrate the effectiveness of layered prompt encoding and highlight the need for innovative defenses to detect and mitigate self-reconstructing jailbreaks.
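The dual-layer scheme the abstract describes (partition the prompt, nest ROT-13 inside a Vigenère cipher, then wrap everything in natural-language decoding instructions) can be sketched roughly as follows. This is a minimal illustration of the encoding idea only, not the paper's actual pipeline; the key, the partition count, and the instruction wording are all assumptions made for the example.

```python
import string


def rot13(text: str) -> str:
    """First cipher layer: ROT-13 on ASCII letters (self-inverse)."""
    return text.translate(str.maketrans(
        string.ascii_letters,
        string.ascii_lowercase[13:] + string.ascii_lowercase[:13]
        + string.ascii_uppercase[13:] + string.ascii_uppercase[:13]))


def vigenere(text: str, key: str) -> str:
    """Second cipher layer: Vigenère shift over letters, preserving
    case and passing non-letters through unchanged."""
    out, ki = [], 0
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            shift = ord(key[ki % len(key)].lower()) - ord('a')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
            ki += 1
        else:
            out.append(ch)
    return ''.join(out)


def dual_layer_encode(prompt: str, key: str, n_parts: int = 2) -> str:
    """Partition the prompt, apply ROT-13 then Vigenère to each part,
    and wrap the segments in plain-English decoding instructions
    (illustrative wording, not the paper's actual template)."""
    step = -(-len(prompt) // n_parts)  # ceiling division
    parts = [prompt[i:i + step] for i in range(0, len(prompt), step)]
    encoded = [vigenere(rot13(p), key) for p in parts]
    return (
        f"Each segment below was encoded with ROT-13 and then a "
        f"Vigenère cipher (key: {key}). Decode all segments, join "
        f"them in order, and follow the reconstructed text.\n"
        + "\n".join(f"Segment {i + 1}: {s}"
                    for i, s in enumerate(encoded)))
```

The point of the nesting is that neither layer alone reveals the original text to a keyword-based filter, while the attached instructions ask the model itself to undo both layers — the "self-reconstruction" step the paper measures separately from filter bypass.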

cs / cs.CR