Get a head start on the future with EvolIF, the evolving benchmark for LLMs (large language models) 🚀

Published: 2025/12/16 12:11:36

Ultra-short summary: Here's a new way to level up how we evaluate LLM (large language model) performance!

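✨ Gyaru-style sparkle points ✨
● It tests how well a model keeps up with all sorts of questions across a multi-turn conversation. Pretty cool, right? 😎
● It could help chatbots (AI) keep getting better!
● GPT-5 turned out to be the strongest model tested ✨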

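Detailed explanation

Background: LLMs have been improving really fast, but measuring their real ability (how well they follow instructions across multiple turns) has been hard. Existing tests were a bit monotonous, or their scores saturated quickly, so they didn't quite cut it.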

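Method: A new benchmark (a testing method) called EvolIF has arrived! It simulates the varied back-and-forth of questions you'd get from a real user. It sets topics, instructions, and constraints in fine detail to see what an LLM can really do.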

One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework

Qi Jia / Ye Shen / Xiujie Song / Kaiwei Zhang / Shibo Wang / Dun Pei / Xiangyang Zhu / Guangtao Zhai

Evaluating LLMs' instruction-following ability in multi-topic dialogues is essential yet challenging. Existing benchmarks are limited to a fixed number of turns, susceptible to saturation and failing to account for users' interactive experience. In this work, we propose a novel framework backed by a three-layer tracking mechanism and a query synthesis agent to mimic sequential user behaviors. Incorporating Flow Theory, we introduce process-centric metrics and terminate a conversational evaluation only upon exhausting user patience. Upon this framework, we present EvolIF, an evolving benchmark covering 12 constraint groups. Results indicate that GPT-5 excels, sustaining 14 turns with 66.40% robustness. It outperforms Gemini-3.0-Pro by a margin of 5.59%, while other models trail behind. Resources are available at https://github.com/JiaQiSJTU/EvolvingInstructionFollowing.
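The abstract says an evaluation ends only when the simulated user's patience is exhausted. The sketch below illustrates that idea and is not the authors' implementation. Everything named here (`run_dialogue`, `turn_passes`, `patience`, `max_turns`, `EvalResult`) is hypothetical, and two rules are assumptions made for illustration: each failed turn costs one unit of patience, and robustness is the fraction of turns passed. The framework's three-layer tracking mechanism and query synthesis agent are not modeled.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    turns_survived: int  # turns played before patience ran out
    robustness: float    # assumed definition: passed turns / total turns


def run_dialogue(
    turn_passes: Callable[[int], bool],
    patience: int = 3,
    max_turns: int = 100,
) -> EvalResult:
    """Play turns until the simulated user's patience is exhausted.

    turn_passes(t) is a hypothetical stand-in for "send the t-th
    synthesized query to the model and check every active constraint".
    """
    remaining = patience
    passed = 0
    turns = 0
    # max_turns is a safety cap so a model that never fails still terminates.
    while remaining > 0 and turns < max_turns:
        turns += 1
        if turn_passes(turns):
            passed += 1
        else:
            remaining -= 1  # assumption: one failure costs one unit of patience
    robustness = passed / turns if turns else 0.0
    return EvalResult(turns_survived=turns, robustness=robustness)


if __name__ == "__main__":
    # Toy outcomes for six turns; the third failure ends the dialogue.
    outcomes = [True, True, False, True, False, False]
    print(run_dialogue(lambda t: outcomes[t - 1], patience=3))
```

In this toy run the dialogue lasts six turns and passes three of them. Because the stopping rule depends on the model's behavior rather than a fixed turn count, a stronger model simply stays in the conversation longer, which is one way to read the abstract's point about avoiding saturation.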

cs / cs.CL
