The Ultimate Data Analysis Benchmark Is Born! 🎉 (DAComp)

Published: 2025/12/3 23:21:28

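Super summary: A brand-new test for evaluating how well data-analysis AI performs! 🤖✨
Gyaru-style sparkle points ✨
- It's a test that checks how good an AI is at fully automating data analysis! How cool is that? 🤩
- It can also evaluate the real-world complexity that existing tests couldn't cover! So impressive 💖
- There's a Chinese version too, so it could totally go global! 🌎✨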
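Detailed explanation
- Background: AI (LLMs) has been evolving like crazy lately! We're in an era where you can hand data analysis over to AI! But it's a problem if there's no proper standard for evaluating how well that AI performs, right? 🤔 That's where DAComp comes in!
- Method: DAComp is a test that covers data analysis from start to finish (data preparation, analysis, and producing results). It measures how smartly an AI can analyze data! The AI gets tried out on all sorts of tasks modeled on real companies' data and problems! 💻🌟
- Results: DAComp reveals the AI's "real ability" that earlier tests couldn't measure! It gets easier to spot the capable AIs, and AI progress speeds up too! ✨
- Significance (the seriously cool ♡ point): Thanks to DAComp, data analysis might get a lot easier! Companies could make better use of their data, and there's a big chance business grows even more! The whole IT industry could get a boost, right? 🚀💖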
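Ideas for real-world use 💡
- When you want to pick an AI data-analysis service, you can use DAComp results as a reference! 😎
- When your own company is adopting AI, check that AI's abilities with DAComp! 👍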

DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle

Fangyu Lei / Jinxiang Meng / Yiming Huang / Junjie Zhao / Yitong Zhang / Jianwen Luo / Xin Zou / Ruiyi Yang / Wenbo Shi / Yan Gao / Shizhu He / Zuo Wang / Qian Liu / Yang Wang / Ke Wang / Jun Zhao / Kang Liu

Real-world enterprise data intelligence workflows encompass data engineering that turns raw sources into analysis-ready tables and data analysis that converts those tables into decision-oriented insights. We introduce DAComp, a benchmark of 210 tasks that mirrors these complex workflows. Data engineering (DE) tasks require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines from scratch and evolving existing systems under evolving requirements. Data analysis (DA) tasks pose open-ended business problems that demand strategic planning, exploratory analysis through iterative coding, interpretation of intermediate results, and the synthesis of actionable recommendations. Engineering tasks are scored through execution-based, multi-metric evaluation. Open-ended tasks are assessed by a reliable, experimentally validated LLM-judge, which is guided by hierarchical, meticulously crafted rubrics. Our experiments reveal that even state-of-the-art agents falter on DAComp. Performance on DE tasks is particularly low, with success rates under 20%, exposing a critical bottleneck in holistic pipeline orchestration, not merely code generation. Scores on DA tasks also average below 40%, highlighting profound deficiencies in open-ended reasoning and demonstrating that engineering and analysis are distinct capabilities. By clearly diagnosing these limitations, DAComp provides a rigorous and realistic testbed to drive the development of truly capable autonomous data agents for enterprise settings. Our data and code are available at https://da-comp.github.io
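
The abstract says open-ended tasks are scored by an LLM judge guided by hierarchical rubrics, but it does not spell out how the rubric scores are combined. As a rough illustration only, the sketch below shows one plausible way such a hierarchy could be rolled up: each parent item takes the weighted mean of its children. The `RubricNode` class, the `aggregate` function, the rubric item names, the weights, and the leaf scores are all hypothetical and are not taken from DAComp.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RubricNode:
    """Hypothetical rubric item: leaves are judged directly, parents aggregate children."""
    name: str
    weight: float = 1.0
    children: List["RubricNode"] = field(default_factory=list)


def aggregate(node: RubricNode, leaf_scores: Dict[str, float]) -> float:
    """Return a score in [0, 1]: the leaf's own score, or the weighted mean of the children."""
    if not node.children:
        return leaf_scores[node.name]
    total_weight = sum(child.weight for child in node.children)
    weighted_sum = sum(child.weight * aggregate(child, leaf_scores) for child in node.children)
    return weighted_sum / total_weight


# Assumed two-level rubric for an analysis report (illustrative names and weights).
rubric = RubricNode("report", children=[
    RubricNode("analysis", weight=2.0, children=[
        RubricNode("correct_numbers"),
        RubricNode("sound_method"),
    ]),
    RubricNode("recommendations", weight=1.0),
])

# Stand-in for per-item scores that an LLM judge would return.
judge_scores = {"correct_numbers": 1.0, "sound_method": 0.5, "recommendations": 0.0}
overall = aggregate(rubric, judge_scores)
print(f"overall score: {overall:.2f}")
```

With these made-up inputs, the `analysis` branch averages to 0.75 and carries twice the weight of `recommendations`, so the overall score is 0.50. The real benchmark may weight or combine rubric items quite differently.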

cs / cs.CL / cs.AI