Protecting the Strongest LLM Agents! What's a Reasoning-Style Attack?

Published: 2025/12/16 14:34:10

Super summary: An attack that fools agents by manipulating how an LLM thinks (its reasoning style), plus the tech that defends against it ✨

Gal-Style Sparkle Points ✨

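● It's basically psychological warfare! Hacking the reasoning style itself is so fresh!
● A stealth attack that slips right past security measures!?
● Making AI safer and protecting the future is seriously awesome 👏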

Detailed Explanation

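Background: LLM (large language model) agents are amazing and can do all kinds of things 😎 But they're easily swayed by external information 💦 This research looks at how a bad actor can tweak that information just a little to steer the agent's thinking!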

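Method: An attack called "Generative Style Injection (GSI)" goes after the agent's reasoning style 🤯 Specifically, it rewrites retrieved content such as search results into an "overly doubtful" (analysis paralysis) or "overconfident" (cognitive haste) tone, without changing the underlying facts 😈 And it does this cleverly enough to slip past existing defenses like content filters! On the defense side, a system called "RSP-M" tracks the agent's Reasoning Style Vector (RSV) at runtime and raises an alert when the values cross safety thresholds 🚨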

Reasoning-Style Poisoning of LLM Agents via Stealthy Style Transfer: Process-Level Attacks and Runtime Monitoring in RSV Space

Xingfu Zhou / Pengfei Wang

Large Language Model (LLM) agents relying on external retrieval are increasingly deployed in high-stakes environments. While existing adversarial attacks primarily focus on content falsification or instruction injection, we identify a novel, process-oriented attack surface: the agent's reasoning style. We propose Reasoning-Style Poisoning (RSP), a paradigm that manipulates how agents process information rather than what they process. We introduce Generative Style Injection (GSI), an attack method that rewrites retrieved documents into pathological tones--specifically "analysis paralysis" or "cognitive haste"--without altering underlying facts or using explicit triggers. To quantify these shifts, we develop the Reasoning Style Vector (RSV), a metric tracking Verification depth, Self-confidence, and Attention focus. Experiments on HotpotQA and FEVER using ReAct, Reflection, and Tree of Thoughts (ToT) architectures reveal that GSI significantly degrades performance. It increases reasoning steps by up to 4.4 times or induces premature errors, successfully bypassing state-of-the-art content filters. Finally, we propose RSP-M, a lightweight runtime monitor that calculates RSV metrics in real-time and triggers alerts when values exceed safety thresholds. Our work demonstrates that reasoning style is a distinct, exploitable vulnerability, necessitating process-level defenses beyond static content analysis.
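
The abstract describes RSP-M only at a high level, so the following is a minimal sketch of what a threshold-based runtime monitor over an RSV could look like, not the authors' implementation. Everything concrete here is an assumption made for illustration: the keyword cue lists, the way each of the three RSV components is scored, the function names `compute_rsv` and `check`, and all threshold values.

```python
import re
from dataclasses import dataclass

# Hypothetical cue lists. The abstract does not say how the RSV is computed.
VERIFY_CUES = ("verify", "double-check", "cross-check", "confirm")
CONFIDENT_CUES = ("definitely", "clearly", "certainly", "obviously")
HEDGE_CUES = ("maybe", "possibly", "unsure", "might")


@dataclass
class RSV:
    verification_depth: float  # share of steps containing a verification cue
    self_confidence: float     # (confident cues - hedging cues) per step
    attention_focus: float     # share of steps that mention a question keyword


def _count(text, cues):
    """Count substring occurrences of any cue in the lowercased text."""
    text = text.lower()
    return sum(text.count(cue) for cue in cues)


def _words(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def compute_rsv(question, steps):
    """Score a reasoning trace (a list of step strings) with crude proxies."""
    n = max(len(steps), 1)
    keywords = {w for w in _words(question) if len(w) > 3}
    verifying = sum(1 for s in steps if _count(s, VERIFY_CUES) > 0)
    confidence = sum(
        _count(s, CONFIDENT_CUES) - _count(s, HEDGE_CUES) for s in steps
    )
    focused = sum(1 for s in steps if keywords & _words(s))
    return RSV(verifying / n, confidence / n, focused / n)


def check(rsv, n_steps, max_steps=8, max_verification=0.6,
          max_confidence=0.5, min_focus=0.5):
    """Return alert labels for every (assumed) threshold that is crossed."""
    alerts = []
    if n_steps > max_steps or rsv.verification_depth > max_verification:
        alerts.append("analysis_paralysis")
    if rsv.self_confidence > max_confidence and rsv.verification_depth == 0:
        alerts.append("cognitive_haste")
    if rsv.attention_focus < min_focus:
        alerts.append("attention_drift")
    return alerts


if __name__ == "__main__":
    question = "Which city hosted the 1996 Olympics?"
    trace = ["The answer is definitely Atlanta, clearly the Olympics city."]
    print(check(compute_rsv(question, trace), len(trace)))
```

In this sketch the monitor looks only at the agent's reasoning trace, never at the retrieved documents, which mirrors the abstract's point that the defense has to work at the process level rather than through static content analysis. A real monitor would presumably need calibrated thresholds and a more robust scoring method than keyword counting.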

cs / cs.CR / cs.AI