Published: 2026/1/5 13:19:13

Supercharging travel-planning AI 🚀 TravelBench

Ultra-short summary: a new test for measuring how well travel-planning AI performs! It covers all kinds of trips, which is awesome ✨

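● Think of it as a "skills test" for travel-planning AI!
● It checks whether the AI can satisfy users' fine-grained wishes, too 💖
● It even shows where travel-planning AI hits its "limits"!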


TravelBench: A Broader Real-World Benchmark for Multi-Turn and Tool-Using Travel Planning

Xiang Cheng / Yulan Hu / Xiangwen Zhang / Lu Xu / Zheng Pan / Xin Li / Yong Liu

Travel planning is a natural real-world task for testing the planning and tool-use abilities of large language models (LLMs). Although prior work has studied LLM performance on travel planning, existing settings still differ from real-world needs, mainly due to limited domain coverage, insufficient modeling of users' implicit preferences in multi-turn conversations, and a lack of clear evaluation of agents' capability boundaries. To mitigate these gaps, we propose TravelBench, a benchmark for fully real-world travel planning. We collect user queries, user profiles, and tools from real scenarios, and construct three subtasks (Single-Turn, Multi-Turn, and Unsolvable) to evaluate an agent's three core capabilities in real settings: (1) solving problems autonomously, (2) interacting with users over multiple turns to refine requirements, and (3) recognizing the limits of its own abilities. To enable stable tool invocation and reproducible evaluation, we cache real tool-call results and build a sandbox environment that integrates ten travel-related tools. Agents can combine these tools to solve most practical travel planning problems, and our systematic verification demonstrates the stability of the proposed benchmark. We further evaluate multiple LLMs on TravelBench and conduct an in-depth analysis of their behaviors and performance. TravelBench provides a practical and reproducible evaluation benchmark to advance research on LLM agents for travel planning. Our code and data will be available after internal review.
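
The abstract says real tool-call results are cached so that tool invocation is stable and evaluation is reproducible. The sketch below shows one way such a replay cache could work. It is not the authors' implementation (their code is not yet released). The class `CachedToolSandbox`, the tool `search_poi`, and its parameters are hypothetical names chosen for illustration.

```python
import hashlib
import json
from typing import Any, Callable, Dict


class CachedToolSandbox:
    """Hypothetical sandbox: records each tool result once, then replays it."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}
        self._cache: Dict[str, Any] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        """Make a tool callable by name inside the sandbox."""
        self._tools[name] = fn

    @staticmethod
    def _key(name: str, args: Dict[str, Any]) -> str:
        # Canonical JSON with sorted keys, so argument order does not
        # change the cache key.
        payload = json.dumps({"tool": name, "args": args},
                             sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, name: str, **args: Any) -> Any:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        key = self._key(name, args)
        if key not in self._cache:
            # First call: invoke the underlying tool and record its result.
            self._cache[key] = self._tools[name](**args)
        # Later identical calls replay the recorded result.
        return self._cache[key]


# Illustrative stand-in for a live tool whose output changes on every call.
calls = {"n": 0}


def search_poi(city: str, keyword: str) -> list:
    calls["n"] += 1
    return [f"{keyword} in {city} #{calls['n']}"]


sandbox = CachedToolSandbox()
sandbox.register("search_poi", search_poi)
first = sandbox.call("search_poi", city="Kyoto", keyword="temple")
second = sandbox.call("search_poi", keyword="temple", city="Kyoto")
```

Here `first` and `second` are identical and the underlying tool runs only once, even though the second call passes its arguments in a different order. A real benchmark would presumably persist the cache to disk and decide how to handle cache misses during evaluation. Both are left out of this sketch.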

cs / cs.AI