Published: 2026/1/5 13:57:38

Title & Super-Short Summary: Raising smarter AI! A riddle benchmark 🚀

Hey gals~! I found a super interesting new paper! It's the "NazoNazo Benchmark", which measures how smart an AI really is using children's riddles! It might help with building smarter AI 💖

✨ Gyaru Sparkle Points ✨

● Existing benchmarks (AI tests) couldn't capture how smart an AI truly is! 😱
● It uses Japanese riddles, so it helps with AI development suited to Japan 💖
● It's amazing that it can evaluate an AI's "Aha!" insight and whether the AI knows how smart it is (metacognition) ✨

🌟 Detailed Explanation 🌟

Continued in the 「らくらく論文」 app

Japanese Children's Riddles as a Benchmark for Machine Insight and Metacognition

Masaharu Mizumoto / Dat Nguyen / Zhiheng Han / Jiyuan Fang / Heyuan Guan / Xingfu Li / Naoya Shiraishi / Yo Nakawake / Le Minh Nguyen

Benchmark saturation and contamination have obscured genuine advances in reasoning for large language models (LLMs). We introduce NazoNazo Benchmark, a low-cost, renewable test built from Japanese children's riddles that demand insight-based reasoning, or representational shifts rather than knowledge recall. We evaluate 38 frontier LLMs (2023-2025) on 201 riddles and a 120-item human-comparison subset, finding that non-reasoning models average 7.6%, reasoning models 17.6%, and humans ~53% accuracy. Importantly, thought-log analysis reveals that reasoning in Japanese did not necessarily improve accuracy, indicating that language understanding alone is insufficient for insight reasoning. Notably, models sometimes generated correct candidates but failed to endorse them, suggesting weak metacognitive control rather than a lack of knowledge. This "verification failure" indicates that CoT outputs can reflect genuine intermediate reasoning states rather than post-hoc rationalizations. By exposing this metacognitive bottleneck - models' inability to recognize when they are right - the benchmark provides a scalable, cross-linguistic testbed for studying machine insight, confidence calibration, and self-evaluation. NazoNazo Benchmark thus offers not only a fresh challenge to current LLMs but also a concrete target for developing AI metacognitive psychology and enhancing machine Aha! capability.
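The abstract reports exact accuracy figures (e.g. non-reasoning models averaging 7.6% on 201 riddles), which implies a simple scoring loop over the riddle set. Below is a minimal sketch of such exact-match scoring; all names here (`Riddle`, `score`, the answer normalization) are illustrative assumptions, not the paper's actual evaluation harness.

```python
# Minimal sketch of exact-match accuracy scoring on a riddle set.
# `Riddle` and `score` are hypothetical illustrations, not the
# paper's actual evaluation code.
from dataclasses import dataclass


@dataclass
class Riddle:
    question: str
    answers: set  # accepted surface forms of the answer


def normalize(text):
    # Trivial normalization; a real harness would need more care
    # with Japanese orthographic variants.
    return text.strip().lower()


def score(riddles, query_model):
    """Return exact-match accuracy of `query_model` over `riddles`."""
    correct = sum(
        normalize(query_model(r.question)) in {normalize(a) for a in r.answers}
        for r in riddles
    )
    return correct / len(riddles)


# Toy example with a classic Japanese riddle: which "pan" (パン)
# can't you eat? A frying pan (フライパン).
riddles = [Riddle("パンはパンでも食べられないパンは?", {"フライパン"})]
model = lambda q: "フライパン"
print(score(riddles, model))  # 1.0
```

A real harness would also log the model's chain of thought, since the paper's key finding (candidates generated but not endorsed) requires inspecting intermediate reasoning, not just final answers.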

cs / cs.AI