Ultra-fast LLM decoding tech that's hyping up the IT industry 🔥
Gal's killer points
● It makes decoding (kinda like translation) blazing fast! Faster processing means snappier apps ✨
● Multiple requests share the same info (the prefix) to cut out wasted work! Too smart 👏
● It plugs straight into an existing LLM framework (vLLM), so it's super easy to adopt 🎵
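The prefix-sharing idea above can be sketched minimally: if queries are grouped by the prefix they share, the prefix's KV cache only needs to be loaded once per group instead of once per request. The `prefix_key` field below is a hypothetical identifier (e.g. a hash of the shared prompt blocks), not an API from the paper or vLLM.

```python
from collections import defaultdict

def pack_by_prefix(requests):
    """Group request IDs by their shared prefix.

    requests: iterable of (request_id, prefix_key) pairs, where
    prefix_key is a hypothetical identifier for the shared prompt
    (system prompt, tool template, RAG context, ...).
    Returns {prefix_key: [request_ids]} so each shared prefix's
    KV cache can be loaded once and reused for the whole pack.
    """
    packs = defaultdict(list)
    for rid, key in requests:
        packs[key].append(rid)
    return dict(packs)
```

For example, three requests where two share a system prompt would collapse into two packs, halving the prefix KV loads for that prompt.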
LLM serving is increasingly dominated by decode attention, which is a memory-bound operation due to massive KV cache loading from global memory. Meanwhile, real-world workloads exhibit substantial, hierarchical shared prefixes across requests (e.g., system prompts, tools/templates, RAG). Existing attention implementations fail to fully exploit prefix sharing: one-query-per-CTA execution repeatedly loads shared prefix KV cache, while one-size-fits-all tiling leaves on-chip resources idle and exacerbates bubbles for uneven KV lengths. These choices amplify memory bandwidth pressure and stall memory-bound decode attention. This paper introduces PAT, a prefix-aware attention kernel implementation for LLM decoding that organizes execution with a pack-forward-merge paradigm. PAT packs queries by shared prefix to reduce repeated memory accesses and runs a customized multi-tile kernel to achieve high resource efficiency. It further applies practical multi-stream forwarding and KV splitting to reduce resource bubbles. The final merge performs online softmax with negligible overhead. We implement PAT as an off-the-shelf plugin for vLLM. Evaluation on both real-world and synthetic workloads shows that PAT reduces attention latency by 53.5% on average and TPOT (time per output token) by 17.0-93.1% under the same configurations against state-of-the-art attention kernels. PAT's source code is publicly available at https://github.com/flashserve/PAT.
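The "final merge performs online softmax" step can be illustrated with a small NumPy sketch. This is not PAT's kernel code, just the standard online-softmax merge rule: each KV split produces a normalized partial output plus its running max and sum-of-exponentials, and two such partials can be combined as if softmax had been computed over the concatenated splits.

```python
import numpy as np

def partial_attention(q, k, v):
    # Attention for one query over one KV segment. Returns the
    # normalized partial output o, plus the bookkeeping values
    # (running max m, sum-of-exponentials l) needed for merging.
    s = k @ q                   # scores over this segment, shape (seq,)
    m = s.max()                 # running max for numerical stability
    p = np.exp(s - m)
    l = p.sum()                 # sum of exponentials
    o = (p @ v) / l             # normalized partial output, shape (d,)
    return o, m, l

def merge(o1, m1, l1, o2, m2, l2):
    # Combine two partial results; rescaling by exp(m_i - m) makes
    # both partials consistent with the global max before averaging.
    m = max(m1, m2)
    a1 = l1 * np.exp(m1 - m)
    a2 = l2 * np.exp(m2 - m)
    return (o1 * a1 + o2 * a2) / (a1 + a2), m, a1 + a2

# Splitting the KV cache and merging reproduces full attention:
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 8))
o_full, _, _ = partial_attention(q, k, v)
o1, m1, l1 = partial_attention(q, k[:10], v[:10])
o2, m2, l2 = partial_attention(q, k[10:], v[10:])
o_merged, _, _ = merge(o1, m1, l1, o2, m2, l2)
```

Because the merge is a cheap per-query reduction, splitting long or uneven KV ranges across streams (as the abstract describes) costs little at the end, which is why the paper can report "negligible overhead" for this stage.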