Fighting "Variant Contamination" in LLM Evaluation! 😎

Published: 2026/1/8 12:48:40


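Super-short summary: A new technique for checking whether an LLM's test scores come from cheating!
Gal-style sparkle points ✨
- It spots cases where the test data (the exam questions) is way too similar to what the LLM already studied 👀
- It even catches "paraphrased" versions that older methods couldn't find ✨
- LLM performance gets evaluated properly, so services built on it might get better too 😍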
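Detailed explanation
- Background: LLMs (large language models) are amazing and useful for all kinds of things! But if the test has been gamed, you can't tell how good the model really is... 😭
- Method: A new method called "DVD" checks for cheating at test time! It samples the LLM's answers with temperature and watches how much the difficulty of the low-probability tokens jumps around. If that variance is abnormally high, it's the "aha, this one was memorized!" signal.
- Results: With DVD, cheating gets detected way more accurately! It consistently beats the perplexity-based, Min-k%++, edit-distance (CDD), and embedding-similarity baselines, so you can see the LLM's true strength and use it with confidence 💕
- Why it matters (the seriously awesome ♡ point): Once you know a model's real ability, you can put LLMs to work in better services! For example, AI chatbots might get smarter? 😍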
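Real-world use ideas 💡
- 💡 Companies building services on AI can check whether their LLM test results are contaminated and raise the quality of their service!
- 💡 Since AI performance can be evaluated properly, more companies might feel safe adopting AI!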
For those who want to dig deeper 🔍 Keywords
- DVD (Detection via Variance of generation Distribution)
- Variant contamination
- LLM evaluation


DVD: A Robust Method for Detecting Variant Contamination in Large Language Model Evaluation

Renzhao Liang / Jingru Chen / Bo Jia / Bo Deng / Chenggang Xie / Yidong Wang / Ke Jin / Xin Wang / Linfeng Zhang / Cunxiang Wang

Evaluating large language models (LLMs) is increasingly confounded by variant contamination: the training corpus contains semantically equivalent yet lexically or syntactically altered versions of test items. Unlike verbatim leakage, these paraphrased or structurally transformed variants evade existing detectors based on sampling consistency or perplexity, thereby inflating benchmark scores via memorization rather than genuine reasoning. We formalize this problem and introduce DVD (Detection via Variance of generation Distribution), a single-sample detector that models the local output distribution induced by temperature sampling. Our key insight is that contaminated items trigger alternation between a memory-adherence state and a perturbation-drift state, yielding abnormally high variance in the synthetic difficulty of low-probability tokens; uncontaminated items remain in drift with comparatively smooth variance. We construct the first benchmark for variant contamination across two domains, Omni-MATH and SuperGPQA, by generating and filtering semantically equivalent variants, and simulate contamination via fine-tuning models of different scales and architectures (Qwen2.5 and Llama3.1). Across datasets and models, DVD consistently outperforms perplexity-based, Min-k%++, edit-distance (CDD), and embedding-similarity baselines, while exhibiting strong robustness to hyperparameters. Our results establish variance of the generation distribution as a principled and practical fingerprint for detecting variant contamination in LLM evaluation.
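To make the variance idea concrete, here is a toy Python sketch. It is not the paper's implementation: the abstract does not define "synthetic difficulty", so this sketch assumes it can be approximated by the negative log-probability of a sampled token, and it treats tokens below a probability threshold as "low-probability". The function names, the threshold, the variance cutoff, and the example probabilities are all hypothetical, chosen only for illustration.

```python
import math
from statistics import pvariance


def low_prob_difficulty_variance(token_probs, threshold=0.1):
    """Variance of -log(p) over tokens whose probability is below `threshold`.

    ASSUMPTION: -log(p) stands in for the paper's "synthetic difficulty";
    the real definition may differ.
    """
    difficulties = [-math.log(p) for p in token_probs if 0.0 < p < threshold]
    if len(difficulties) < 2:
        # Too few low-probability tokens to measure any spread.
        return 0.0
    return pvariance(difficulties)


def flag_contaminated(token_probs, threshold=0.1, variance_cutoff=1.0):
    """Flag an item when the variance exceeds a (hypothetical) cutoff."""
    return low_prob_difficulty_variance(token_probs, threshold) > variance_cutoff


# Made-up per-token probabilities for one temperature-sampled answer each.
smooth_item = [0.05, 0.06, 0.05, 0.04, 0.90, 0.80]       # steady drift
spiky_item = [0.09, 0.0001, 0.90, 0.08, 0.00001, 0.95]   # sharp swings

print(flag_contaminated(smooth_item), flag_contaminated(spiky_item))
```

In this toy setup the item whose low-probability tokens swing sharply in difficulty is flagged, while the item with a steady profile is not. A real detector would read the token probabilities from the model's sampling output and calibrate the cutoff on data known to be clean or contaminated.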

cs / cs.AI
