Overcoming deep learning's weak spots! This is research on building safe, trustworthy AI ♪
Gal-style sparkle points ✨
● Make LIME your ally!: LIME tells you "where the model is looking" — and here it's used to build attack-resistant AI! So clever!
● Triple defense!: Three methods protect the AI! Like the ultimate shield 🛡️✨
● Transparency matters too!: It makes clear why the AI decided the way it did! Reassuring for users ♪
The growing reliance on deep learning models in safety-critical domains such as healthcare and autonomous navigation underscores the need for defenses that are both robust to adversarial perturbations and transparent in their decision-making. In this paper, we identify a connection between interpretability and robustness that can be directly leveraged during training. Specifically, we observe that spurious, unstable, or semantically irrelevant features identified through Local Interpretable Model-Agnostic Explanations (LIME) contribute disproportionately to adversarial vulnerability. Building on this insight, we introduce an attribution-guided refinement framework that transforms LIME from a passive diagnostic into an active training signal. Our method systematically suppresses spurious features using feature masking, sensitivity-aware regularization, and adversarial augmentation in a closed-loop refinement pipeline. This approach does not require additional datasets or model architectures and integrates seamlessly into standard adversarial training. Theoretically, we derive an attribution-aware lower bound on adversarial distortion that formalizes the link between explanation alignment and robustness. Empirical evaluations on CIFAR-10, CIFAR-10-C, and CIFAR-100 demonstrate substantial improvements in adversarial robustness and out-of-distribution generalization.
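The core idea — using attribution stability to decide which features to suppress during training — can be sketched in a few lines. The snippet below is a hypothetical, simplified stand-in: it does not call the actual LIME library or the paper's pipeline, but illustrates the feature-masking step by treating features whose attribution scores fluctuate across repeated explanation runs as spurious and zeroing them out.

```python
import numpy as np

def attribution_mask(attributions, stability_threshold=0.5):
    """Build a binary feature mask from repeated attribution runs.

    attributions: array of shape (num_runs, num_features), e.g. LIME
    weights collected over several explanation runs (here simulated).
    Features whose attributions are unstable across runs are masked out,
    mirroring the paper's idea of suppressing spurious features.
    """
    mean = attributions.mean(axis=0)
    std = attributions.std(axis=0)
    # Coefficient of variation as a simple instability proxy:
    # high relative variance -> likely spurious / unstable feature.
    instability = std / (np.abs(mean) + 1e-8)
    return (instability < stability_threshold).astype(float)

# Toy example: feature 0 gets a consistently large attribution,
# feature 1 only receives noise around zero.
rng = np.random.default_rng(0)
attr = np.stack([np.array([1.0, 0.0]) + rng.normal(0.0, 0.05, size=2)
                 for _ in range(20)])

mask = attribution_mask(attr)      # stable feature kept, noisy one dropped
x = np.array([0.7, 0.3])
x_masked = x * mask                # masked input fed back into training
```

In the full method this mask would feed a closed-loop refinement pipeline alongside sensitivity-aware regularization and adversarial augmentation; the thresholding scheme shown here is an illustrative assumption, not the paper's exact criterion.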