Published: 2026/1/2 5:26:27

Level Up Your LLM Evals! Auto-Generate Benchmarks with InfoSynth☆ (Super-Simple Ver.)

TL;DR: Test problems for LLMs (large language models)? Let's auto-generate them! It's efficient, and it can even catch an LLM that's cheating ✨

✨ Gal-Style Sparkle Points ✨

● Test problems that used to be written by hand can now be generated automatically. Amazing! 🤖✨
● You can check whether an LLM is cheating, so the scores are trustworthy 💖
● It might even help you build new services! 🚀

Detailed Explanation

Background

Read the rest in the 「らくらく論文」 app

InfoSynth: Information-Guided Benchmark Synthesis for LLMs

Ishir Garg / Neel Kolhe / Xuandong Zhao / Dawn Song

Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation. However, efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, a process that is both expensive and time-consuming. Furthermore, existing benchmarks often leak into LLM training data, necessitating novel and diverse benchmarks to accurately assess models' genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions for new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher novelty and diversity than their seed datasets. Moreover, our algorithm provides a way to control the novelty, diversity, and difficulty of generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, novel, and diverse benchmarks for LLMs. Project Page: https://ishirgarg.github.io/infosynth_web/
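To make the information-theoretic idea concrete: the abstract says novelty and diversity are quantified with KL divergence and entropy, without running costly model evaluations. The paper's exact formulation is not given here, so the sketch below is only an illustrative assumption: it treats a benchmark as an empirical distribution over discrete problem categories (e.g. topic clusters), uses Shannon entropy of the synthesized set as a diversity score, and KL divergence from the seed set's distribution as a novelty score. The function names and the category-histogram representation are hypothetical, not taken from InfoSynth.

```python
import math
from collections import Counter

def distribution(labels):
    """Empirical distribution over discrete categories (e.g. problem-topic clusters)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def entropy(dist):
    """Shannon entropy in nats; higher means the benchmark covers categories more evenly."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q); higher means the synthesized set P diverges more from the seed set Q.
    eps smooths categories that are absent from Q so the divergence stays finite."""
    keys = set(p) | set(q)
    return sum(p[k] * math.log((p[k] + eps) / (q.get(k, 0.0) + eps))
               for k in keys if p.get(k, 0.0) > 0)

# Toy example: a seed benchmark skewed toward string problems vs. a broader synthesized set.
seed = distribution(["strings", "strings", "math", "graphs"])
synth = distribution(["math", "graphs", "dp", "geometry"])
diversity = entropy(synth)            # high when problems spread across many categories
novelty = kl_divergence(synth, seed)  # high when synth covers categories the seed lacks
```

Under this toy view, a genetic-algorithm pipeline could use these two scores directly as fitness terms, keeping candidate problems that push the synthesized distribution away from the seed while holding entropy high.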

cs / cs.CL