Published: 2025/12/16 9:13:56

Measure problem difficulty the easy way with LLMs 💖

  1. Super summary: An amazing trick that uses an LLM to automatically judge how hard a problem is!

  2. Sparkly gal-style highlights ✨

    • Even problems humans can't handle can get a difficulty rating from an LLM! ✨
    • Difficulty grading gets automated, so dataset building goes way faster! 💻
    • It's packed with hints for making AI even smarter! 🧠
  3. Detailed explanation

    • Background: Today's AIs (LLMs) are seriously amazing, right? But making them even smarter takes high-quality data, and sorting that data by difficulty level used to be a massive pain…
    • Method: Have an LLM compare pairs of problems, and it can tell you which one is harder! It's just like a test! No grading and no human evaluators needed, so it's a huge time-saver ♪
    • Results: With this method you can estimate the difficulty even of problems AI can't solve, and dataset creation gets blazing fast! It could help boost AI performance too!
    • Significance (the "this is wild ♡" point): It's a chance to level up AI-powered services! For example, AI chatbots could get smarter, and education apps could be personalized for each learner. Nothing but good stuff! 💖
  4. Real-world use-case ideas 💡

    • An AI tutoring app that auto-adjusts problem difficulty to each student's level! 📚
    • A quiz app feature that tunes question difficulty so users never get bored! 🥳
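The "Method" bullet above boils down to: an LLM answers many "which of these two problems is harder?" questions, and Bradley-Terry scores are then fit to those verdicts (as the abstract below describes). Here is a minimal sketch of that scoring step, using the standard MM iteration for the Bradley-Terry model; the comparison data is made up for illustration.

```python
# Turn pairwise "which problem is harder?" verdicts into difficulty
# scores with the Bradley-Terry model, fit by the standard MM iteration.

def bradley_terry(n_items, comparisons, iters=200):
    """comparisons: list of (harder, easier) index pairs, one per verdict."""
    wins = [0.0] * n_items            # times each item was judged harder
    pair_counts = {}                  # how often each unordered pair met
    for w, l in comparisons:
        wins[w] += 1
        key = (min(w, l), max(w, l))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    p = [1.0] * n_items               # strength (difficulty) parameters
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            denom = sum(c / (p[a] + p[b])
                        for (a, b), c in pair_counts.items() if i in (a, b))
            new_p.append(wins[i] / denom if denom > 0 else p[i])
        total = sum(new_p)            # renormalize so scores stay bounded
        p = [x * n_items / total for x in new_p]
    return p

# Toy data: problem 2 is usually judged harder than 1, and 1 harder than 0.
verdicts = ([(2, 1)] * 8 + [(1, 2)] * 2 + [(1, 0)] * 8 + [(0, 1)] * 2
            + [(2, 0)] * 9 + [(0, 2)] * 1)
scores = bradley_terry(3, verdicts)
print(scores)  # scores[2] > scores[1] > scores[0]
```

The fitted scores recover the expected ordering: the problem that wins its pairwise "harder" matchups most often gets the highest score.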


Estimating problem difficulty without ground truth using Large Language Model comparisons

Marthe Ballon / Andres Algaba / Brecht Verbeken / Vincent Ginis

Recent advances in the finetuning of large language models (LLMs) have significantly improved their performance on established benchmarks, emphasizing the need for increasingly difficult, synthetic data. A key step in this data generation pipeline is a method for estimating problem difficulty. Current approaches, such as human calibration or performance-based scoring, fail to generalize to out-of-distribution problems, i.e., problems currently unsolvable by humans and LLMs, because they are not scalable, are time-consuming, and depend on ground truth. Therefore, we propose a new method for estimating problem difficulty, LLM compare, that addresses these limitations. An LLM performs pairwise difficulty comparisons, and then Bradley-Terry scores are computed based on the outcomes. To validate our method, we first propose a conceptual framework that positions existing approaches on three orthogonal planes--construction, scale and dependence--identifying which quadrants a measure needs to occupy to score out-of-distribution problems. LLM compare naturally occupies all desirable quadrants as the first measure that is continuous and dynamic, model-agnostic and independent of ground truth information. As a second validation, we show that LLM compare demonstrates strong alignment with human annotations: Pearson $r \geq 0.80$ for $n=1876$. Thirdly, we show that LLM compare is robust to hallucinations, with less than $6\%$ degradation in Pearson correlation for $10\%$ noise injection. Our work represents a significant step towards replacing time-consuming human annotations and synthetic data generation, and will be an important driver for curriculum design, model evaluation, and AI-assisted research ideation.
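The robustness claim at the end of the abstract can be illustrated with a toy simulation. This is not the paper's experiment: the difficulties are synthetic, the "hallucinations" are random verdict flips, and the scores are simple win rates standing in for the full Bradley-Terry pipeline. The point is only that pairwise-comparison scores tolerate a modest fraction of flipped judgments.

```python
# Toy illustration: inject 10% judgment noise into simulated pairwise
# "which is harder?" verdicts and check how well the resulting scores
# still correlate with the hidden ground-truth difficulties.
import random

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(0)
n = 50
true_difficulty = [random.random() for _ in range(n)]  # hidden ground truth

def win_rate_scores(flip_rate):
    """Score each problem by its fraction of 'judged harder' wins."""
    wins, games = [0] * n, [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            harder = i if true_difficulty[i] > true_difficulty[j] else j
            if random.random() < flip_rate:   # a "hallucinated" verdict
                harder = i + j - harder       # flip to the other problem
            wins[harder] += 1
            games[i] += 1
            games[j] += 1
    return [w / g for w, g in zip(wins, games)]

r_clean = pearson(win_rate_scores(0.0), true_difficulty)
r_noisy = pearson(win_rate_scores(0.1), true_difficulty)
print(r_clean, r_noisy)  # correlation stays high despite 10% noise
```

Because every problem participates in many comparisons, a few flipped verdicts barely move its aggregate score, which is the intuition behind the small degradation the paper reports.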

cs / cs.LG / cs.AI