Published: 2025/10/23 11:14:41

Ultimate LLM speed-up 🚀 Twilight has arrived!

Super summary: Magic that makes LLM computation lighter 🧙‍♀️✨ Smart selection with Top-p!

🌟 Gyaru-style sparkle points ✨

● Speeds up LLMs the smart way! Feels like nailing a diet 💖
● Top-p is the secret sauce! It picks tokens smartly to match the situation ✨
● Powers up existing techniques! Your current LLM might get blazing fast too 😍

Here comes the detailed explanation~!


Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning

Chaofan Lin / Jiaming Tang / Shuo Yang / Hanshuo Wang / Tian Tang / Boyu Tian / Ion Stoica / Song Han / Mingyu Gao

Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been a hot research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget, which presents a significant challenge during deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we find that borrowing top-$p$ sampling (nucleus sampling) into sparse attention can, surprisingly, achieve adaptive budgeting. Based on this, we propose Twilight, a framework that brings adaptive sparsity to any existing sparse attention algorithm without sacrificing its accuracy. Empirical results show that Twilight adaptively prunes up to 98% of redundant tokens, leading to a $15.4\times$ acceleration in self-attention operations and a $3.9\times$ acceleration in end-to-end per-token latency in long-context LLM decoding.
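The idea borrowed from nucleus sampling is: for each query, keep only the smallest set of keys whose attention probabilities sum to at least $p$, so a peaked attention distribution gets a tiny budget while a flat one keeps many keys, automatically. Below is a minimal PyTorch sketch of that top-$p$ pruning criterion for a single query's attention scores; it is not Twilight's hierarchical implementation, and the function name, threshold value, and example tensors are illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation) of top-p pruning applied to
# one query's attention distribution: keep the smallest set of keys whose
# cumulative softmax mass reaches p, and drop the rest.
import torch

def top_p_attention_mask(scores: torch.Tensor, p: float = 0.95) -> torch.Tensor:
    """scores: [num_keys] raw attention logits for a single query.
    Returns a boolean mask over keys (True = keep)."""
    probs = torch.softmax(scores, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep keys until the cumulative probability first reaches p
    # (the key that crosses the threshold is included).
    keep_sorted = cumulative - sorted_probs < p
    mask = torch.zeros_like(probs, dtype=torch.bool)
    mask[sorted_idx[keep_sorted]] = True
    return mask

# Hypothetical example logits: a peaked distribution keeps very few keys,
# a flat one keeps nearly all of them.
peaked = torch.tensor([8.0, 1.0, 0.5, 0.1, -2.0])
flat = torch.tensor([1.0, 0.9, 1.1, 0.95, 1.05])
print(top_p_attention_mask(peaked).sum().item())  # few keys survive
print(top_p_attention_mask(flat).sum().item())    # most keys survive
```

In this toy run the peaked logits keep a single key while the flat logits keep all five, which is exactly the adaptive-budget behavior the abstract describes: the budget follows the shape of the attention distribution instead of being fixed in advance.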

cs / cs.LG / cs.CL