Bias Analysis in LLM-Based Search Evaluation: An Explainer for New Business Development Teams

Published: 2026/1/5 3:02:33

Title & Super Summary: Digging into bias in LLM evaluation! Let's level up search quality 💖

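🌟 Gyaru-Style Sparkle Points ✨
● Did you know that search evaluation done by LLMs (large language models) can be biased? 🥺
● The paper analyzes the relationship between queries (search terms) and documents (search results)! ✨
● Pinning down the bias so we can build better search systems... how cool is that? 😍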

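Detailed Explanation

Background
Using LLMs to judge search relevance is all the rage, but it turns out they seem to be biased 😥 Compared with human assessors, their labels sometimes come out just... different! So this research looks closely at when LLMs get it wrong, not just how good they are on average, so that search evaluation (and the search results built on it) can get better!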

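Method
Each query is paired with a document, and the pair is embedded into a shared semantic space so its meaning relationship can be analyzed! 🧐 The pairs are then grouped (clustering), and for each group you check how well the LLM labels and the human labels agree! 💡 Clusters where they consistently disagree are where the bias shows up. The key seems to be using an agreement metric called Gwet's AC1.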
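Curious what Gwet's AC1 actually computes? Here is a minimal sketch of the standard two-rater form: AC1 = (p_a - p_e) / (1 - p_e), where p_a is the observed agreement and p_e = (1 / (K - 1)) * Σ π_k (1 - π_k), with π_k the average share of category k across both raters. The function name `gwet_ac1` and the toy binary labels are assumptions for illustration only; the summary does not say how the paper sets up its labels or its exact computation.

```python
from collections import Counter


def gwet_ac1(labels_a, labels_b, categories=None):
    """Gwet's AC1 for two raters labelling the same items (illustrative sketch)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    # Use the given category set, or fall back to the labels actually observed.
    cats = set(categories) if categories is not None else set(labels_a) | set(labels_b)
    k = len(cats)
    if k < 2:
        return 1.0  # a single category means agreement is trivially perfect

    n = len(labels_a)
    # Observed agreement: share of items where both raters chose the same label.
    p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement, from the average prevalence of each category.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    prevalence = [(count_a[c] + count_b[c]) / (2 * n) for c in cats]
    p_e = sum(pi * (1 - pi) for pi in prevalence) / (k - 1)

    return (p_a - p_e) / (1 - p_e)


# Toy data (hypothetical): 1 = relevant, 0 = not relevant.
human = [1, 1, 1, 0, 0, 0, 1, 0]
llm = [1, 1, 0, 0, 0, 0, 1, 1]
score = gwet_ac1(human, llm)
print(f"AC1 = {score:.2f}")
```

In the setting described above, you would run something like this once per cluster and compare the scores, instead of computing a single number over the whole collection.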

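Results
They found the queries where bias tends to show up! 🔍 For certain kinds of search terms (definition-seeking, policy-related, or ambiguous ones), the LLM and human labels drift apart more easily! 🧐 And that gives you hints for improving your search system 💖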


Query-Document Dense Vectors for LLM Relevance Judgment Bias Analysis

Samaneh Mohtadi / Gianluca Demartini

Large Language Models (LLMs) have been used as relevance assessors for Information Retrieval (IR) evaluation collection creation due to reduced cost and increased scalability as compared to human assessors. While previous research has looked at the reliability of LLMs as compared to human assessors, in this work, we aim to understand if LLMs make systematic mistakes when judging relevance, rather than just understanding how good they are on average. To this aim, we propose a novel representational method for queries and documents that allows us to analyze relevance label distributions and compare LLM and human labels to identify patterns of disagreement and localize systematic areas of disagreement. We introduce a clustering-based framework that embeds query-document (Q-D) pairs into a joint semantic space, treating relevance as a relational property. Experiments on TREC Deep Learning 2019 and 2020 show that systematic disagreement between humans and LLMs is concentrated in specific semantic clusters rather than distributed randomly. Query-level analyses reveal recurring failures, most often in definition-seeking, policy-related, or ambiguous contexts. Queries with large variation in agreement across their clusters emerge as disagreement hotspots, where LLMs tend to under-recall relevant content or over-include irrelevant material. This framework links global diagnostics with localized clustering to uncover hidden weaknesses in LLM judgments, enabling bias-aware and more reliable IR evaluation.
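The hotspot idea in the abstract (queries whose agreement varies widely across their clusters, with LLMs either under-recalling or over-including) can be sketched roughly as follows. This is an assumption-laden illustration, not the authors' pipeline: cluster IDs are taken as already assigned, labels are binarized, plain agreement rate stands in for the paper's agreement statistic, and `disagreement_hotspots` and the record layout are hypothetical names.

```python
from collections import defaultdict


def disagreement_hotspots(records):
    """Rank queries by how much human-LLM agreement varies across their clusters.

    records: iterable of (query_id, cluster_id, human_label, llm_label),
    with binary labels (1 = relevant, 0 = not relevant).
    """
    cells = defaultdict(lambda: [0, 0])  # (query, cluster) -> [agreements, total]
    errors = defaultdict(lambda: {"under_recall": 0, "over_include": 0})

    for query_id, cluster_id, human, llm in records:
        cell = cells[(query_id, cluster_id)]
        cell[0] += int(human == llm)
        cell[1] += 1
        if human == 1 and llm == 0:
            errors[query_id]["under_recall"] += 1  # LLM missed relevant content
        elif human == 0 and llm == 1:
            errors[query_id]["over_include"] += 1  # LLM accepted irrelevant content

    # Collect the agreement rate of every cluster a query appears in.
    per_query = defaultdict(list)
    for (query_id, _cluster), (agree, total) in cells.items():
        per_query[query_id].append(agree / total)

    report = [
        {"query_id": q, "spread": max(rates) - min(rates), **errors[q]}
        for q, rates in per_query.items()
    ]
    # Largest spread first: these are the candidate hotspots.
    return sorted(report, key=lambda row: row["spread"], reverse=True)


# Toy records (hypothetical data).
records = [
    ("q1", "c1", 1, 1), ("q1", "c1", 0, 0),
    ("q1", "c2", 1, 0), ("q1", "c2", 1, 0),
    ("q2", "c1", 1, 1),
    ("q2", "c3", 0, 0), ("q2", "c3", 0, 1),
]
report = disagreement_hotspots(records)
for row in report:
    print(row)
```

Using the max-minus-min spread is only one simple way to express "large variation"; a variance or a chance-corrected statistic per cluster would be an equally reasonable choice.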

cs / cs.IR / cs.AI / cs.CL
