An AI benchmark that supercharges paper peer review has arrived!
Gal-style sparkle points ✨
● Overcomes a weak spot of LLMs (large language models)! LLMs used to struggle to spot subtle mistakes and contradictions in papers, and PaperAudit-Bench has their back 💖
● It can evaluate paper errors by type, so the accuracy of AI peer review goes way up! Like a charismatic proofreading girl 👯♀️
● New business chances incoming! AI review platforms, domain-specialized AI review services, the dreams keep growing~ ✨
Detailed explanation
Large language models can generate fluent peer reviews, yet their assessments often lack sufficient critical rigor when substantive issues are subtle and distributed across a paper. In this paper, we introduce PaperAudit-Bench, which consists of two components: (1) PaperAudit-Dataset, an error dataset covering both errors identifiable within individual sections and those requiring cross-section reasoning, designed for controlled evaluation under long-context settings; and (2) PaperAudit-Review, an automated review framework that integrates structured error detection with evidence-aware review generation to support critical assessment. Experiments on PaperAudit-Bench reveal large variability in error detectability across models and detection depths, highlighting the difficulty of identifying such errors under long-context settings. Relative to representative automated reviewing baselines, incorporating explicit error detection into the review workflow produces systematically stricter and more discriminative evaluations, demonstrating its suitability for peer review. Finally, we show that the dataset supports training lightweight LLM detectors via SFT and RL, enabling effective error detection at reduced computational cost.
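To make the two-stage idea concrete, here is a minimal Python sketch of how a pipeline like PaperAudit-Review could be wired: a structured error-detection pass over the full paper, followed by review generation that is conditioned on the detected evidence. All names here (ErrorFinding, call_llm, detect_errors, generate_review, the JSON schema) are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import json
from dataclasses import dataclass


# Hypothetical error record; the field names are assumptions, not from the paper.
# It mirrors the dataset's split between errors visible inside a single section
# and errors that require cross-section reasoning.
@dataclass
class ErrorFinding:
    kind: str            # e.g. "methodological", "numerical", "citation"
    scope: str           # "within-section" or "cross-section"
    sections: list[str]  # section ids that contain the supporting evidence
    description: str     # short statement of the suspected inconsistency


def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion client; swap in your own."""
    raise NotImplementedError("plug in an LLM client here")


def detect_errors(paper_sections: dict[str, str]) -> list[ErrorFinding]:
    """Stage 1: structured error detection over the full (long-context) paper."""
    body = "\n\n".join(f"## {name}\n{text}" for name, text in paper_sections.items())
    prompt = (
        "Audit the paper below for factual, methodological, and consistency errors, "
        "including inconsistencies that only appear when comparing sections. "
        "Return a JSON list of objects with keys kind, scope, sections, description.\n\n"
        + body
    )
    raw = call_llm(prompt)
    return [ErrorFinding(**item) for item in json.loads(raw)]


def generate_review(paper_sections: dict[str, str], findings: list[ErrorFinding]) -> str:
    """Stage 2: evidence-aware review generation conditioned on detected errors."""
    evidence = "\n".join(
        f"- [{f.scope}/{f.kind}] {f.description} (sections: {', '.join(f.sections)})"
        for f in findings
    )
    prompt = (
        "Write a critical peer review of the paper. Ground every weakness you raise "
        "in the detected issues listed below and cite the relevant sections.\n\n"
        f"Detected issues:\n{evidence or '- none found'}\n\n"
        "Paper:\n" + "\n\n".join(paper_sections.values())
    )
    return call_llm(prompt)
```

In this sketch, feeding the stage-1 findings into the stage-2 prompt is what would make the generated review stricter and more evidence-grounded than a single free-form reviewing pass, which is the behavior the abstract reports for explicit error detection in the review workflow.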