Turbocharge Your LLM! Seize the Future with Query Management ✨

Published: 2026/1/1 17:26:59

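Ultra-short summary: Cut LLM latency (TTFT = time to first token, TPOT = time per output token)! Query management for blazing-fast AI!
Gal-Style Sparkle Points ✨
- Shorten TTFT (the wait until the first bit of the response) for blazing-fast AI 😎
- A trick called Prefix Reuse saves on computation 💰
- A new method called k-LPM seems to cover the weak spots of the existing FCFS and LPM scheduling strategies 🌟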
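Detailed Breakdown
- Background: LLMs (large language models) are useful for all sorts of things, but slow processing has been the bottleneck 😭. For online services especially, response speed is super important!
- Method: Get clever about query scheduling (queries = questions and requests, scheduling = deciding what gets processed in what order) ✨ The authors developed a new algorithm, k-LPM, that makes good use of Prefix Reuse while keeping things fair across queries.
- Results: With k-LPM, P99 TTFT dropped significantly compared with baseline methods, and reportedly the number of requests handled at once went up too! Yay 💖
- Significance: Chatbots, search engines, and all kinds of AI services get way nicer to use! Companies and users both end up happy 🥰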
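
To make Prefix Reuse a bit more concrete, here is a tiny sketch (not the paper's code!). It assumes a toy cache that is just a list of already-processed token sequences, and it counts how many tokens still need fresh computation for a new query. The names `shared_prefix_len` and `prefill_cost`, and the shopping prompts, are made up for illustration.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefill_cost(query, cache):
    """Tokens left to compute if the best-matching cached prefix is reused.

    `cache` is a plain list of earlier token sequences (a toy assumption;
    a real system would use something like a radix tree).
    """
    best = max((shared_prefix_len(query, c) for c in cache), default=0)
    return len(query) - best


# Two queries that share the same 7-token system prompt.
system = ["You", "are", "a", "helpful", "shop", "assistant", "."]
q1 = system + ["Is", "this", "bag", "waterproof", "?"]
q2 = system + ["Does", "it", "come", "in", "red", "?"]

cache = []
cost_q1 = prefill_cost(q1, cache)  # 12: nothing cached yet
cache.append(q1)
cost_q2 = prefill_cost(q2, cache)  # 6: the 7 system-prompt tokens are reused
```

The second query only pays for its 6 new tokens, which is the kind of saving that makes the processing order matter.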
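Real-World Use Ideas 💡
- AI chatbots that reply the instant you ask! No more getting annoyed mid-conversation. How great is that?
- When shopping online, AI tells you product details right away! You find what you want fast, so you might just end up buying it 🛍️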

LLM Query Scheduling with Prefix Reuse and Latency Constraints

Gregory Dexter / Shao Tang / Ata Fatahi Baarzi / Qingquan Song / Tejas Dharamsi / Aman Gupta

The efficient deployment of large language models (LLMs) in online settings requires optimizing inference performance under stringent latency constraints, particularly the time-to-first-token (TTFT) and time-per-output-token (TPOT). This paper focuses on the query scheduling problem for LLM inference with prefix reuse, a technique that leverages shared prefixes across queries to reduce computational overhead. Our work reveals previously unknown limitations of the existing first-come-first-serve (FCFS) and longest-prefix-match (LPM) scheduling strategies with respect to satisfying latency constraints. We present a formal theoretical framework for LLM query scheduling under RadixAttention, a prefix reuse mechanism that stores and reuses intermediate representations in a radix tree structure. Our analysis establishes the NP-hardness of the scheduling problem with prefix reuse under TTFT constraints and proposes a novel scheduling algorithm, $k$-LPM, which generalizes existing methods by balancing prefix reuse and fairness in query processing. Theoretical guarantees demonstrate that $k$-LPM achieves improved TTFT performance under realistic traffic patterns captured by a data generative model. Empirical evaluations in a realistic serving setting validate our findings, showing significant reductions in P99 TTFT compared to baseline methods.
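
The trade-off between prefix reuse and fairness can be illustrated with a toy scheduler. The abstract does not spell out how $k$-LPM works, so the interleaving rule below (take up to `k` longest-prefix-match picks, then one FCFS pick, and repeat) is an assumption made for illustration, not a statement of the authors' algorithm. The cache is also deliberately tiny: it holds only the most recently processed query, which is far simpler than RadixAttention's radix tree. The function names and example queries are hypothetical.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def schedule(queries, k):
    """Order queries (given in arrival order) and count prefill tokens.

    k = 0    -> pure FCFS
    k = None -> pure LPM
    k >= 1   -> ASSUMED k-LPM rule: k LPM picks, then one FCFS pick, repeat
    Toy cache assumption: only the last processed query stays cached.
    """
    cached = ()
    waiting = list(range(len(queries)))  # indices, oldest first
    order, prefill_tokens, lpm_streak = [], 0, 0
    while waiting:
        if k is not None and lpm_streak >= k:
            pick = waiting[0]  # FCFS step: serve the oldest waiting query
            lpm_streak = 0
        else:
            # LPM step: max() returns the first maximal item, so ties
            # go to the oldest waiting query.
            pick = max(waiting,
                       key=lambda i: shared_prefix_len(queries[i], cached))
            lpm_streak += 1
        waiting.remove(pick)
        reused = shared_prefix_len(queries[pick], cached)
        prefill_tokens += len(queries[pick]) - reused
        cached = queries[pick]
        order.append(pick)
    return order, prefill_tokens


# Four queries share prefix "A"; one query (index 1) has prefix "B".
queries = [("A", "1"), ("B", "1"), ("A", "2"), ("A", "3"), ("A", "4")]

fcfs = schedule(queries, 0)      # ([0, 1, 2, 3, 4], 8)
lpm = schedule(queries, None)    # ([0, 2, 3, 4, 1], 7)
k_lpm = schedule(queries, 2)     # ([0, 2, 1, 3, 4], 8)
```

In this toy run, pure LPM computes the fewest tokens but pushes the lone "B" query to the very end, while the assumed interleaving with `k = 2` serves it third at the cost of one extra token. That is the flavor of trade-off a TTFT constraint on every query would force a scheduler to manage.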

cs / cs.DS