Published: 2025/11/7 19:27:58

SnapStream Arrives! A God-Tier Upgrade for LLM Long-Context Processing 🚀

Ultra-short summary: Memory savings and a huge speed boost for LLM long-context processing! SnapStream is amazing ✨

Gyaru-Style Sparkle Points ✨
● Tackles the memory problem of LLMs (large language models), helping cut costs 💰
● Long contexts (long texts) get processed smoothly, totally stress-free 💖
● Drops right into existing systems, so you can use it immediately, which is the best part 😍

Detailed Explanation

Background
LLMs may be smart, but once the input gets long they gobble up memory 🥺 The KV cache (the store where attention data is kept) balloons with context length, which also drags down processing speed 💦 A quick back-of-envelope below shows just how fast it grows.
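These numbers are mine, not the paper's: a rough estimate of per-token KV cache size, assuming Llama-3.1-8B-Instruct's published configuration of 32 layers, 8 KV heads (grouped-query attention), head dimension 128, with the cache stored in fp16.

```python
# Rough back-of-envelope (illustrative numbers, not from the paper):
# per-token KV cache size for Llama-3.1-8B-Instruct, assuming 32 layers,
# 8 KV heads (GQA), head dim 128, and an fp16 (2-byte) cache.
layers, kv_heads, head_dim, fp16_bytes = 32, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * fp16_bytes  # 2 = keys + values
print(per_token / 1024, "KiB per token")                   # 128.0 KiB
print(per_token * 128_000 / 2**30, "GiB at 128k context")  # ~15.6 GiB per sequence
```

At 128k context that is on the order of 16 GiB of cache for a single sequence, before you even batch requests, which is exactly the pressure the paper is responding to.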

Method
SnapStream combines a technique that compresses (shrinks) the KV cache with techniques that excel at long-context processing! They tested it on top-tier LLMs like Llama-3 and DeepSeek-R1, and the results were spot-on 🤩 A minimal sketch of the idea follows below.
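The paper's own code isn't shown here, so this is only a minimal sketch of the two ingredients it builds on: StreamingLLM-style retention of "sink" and recent tokens, plus SnapKV-style selection of high-attention prefix tokens. The function name `compress_kv` and the budget parameters `n_sink`, `n_topk`, and `n_recent` are my own illustrative choices, not the authors' API.

```python
# Hypothetical sketch (not the authors' code): compress one layer's KV cache by
# keeping StreamingLLM-style sink tokens, SnapKV-style high-attention middle
# tokens, and a recent window. Shapes follow standard multi-head attention.
import torch

def compress_kv(keys, values, obs_queries, n_sink=4, n_topk=256, n_recent=1024):
    """keys/values: [n_heads, seq_len, head_dim]; obs_queries: queries from an
    observation window at the end of the prompt, [n_heads, n_obs, head_dim]."""
    n_heads, seq_len, head_dim = keys.shape
    if seq_len <= n_sink + n_topk + n_recent:
        return keys, values  # under budget, nothing to evict yet

    # Middle region eligible for SnapKV-style selection.
    mid = slice(n_sink, seq_len - n_recent)
    # Score middle tokens by total attention mass from the observation queries.
    attn = torch.softmax(
        obs_queries @ keys.transpose(1, 2) / head_dim ** 0.5, dim=-1
    )                                             # [n_heads, n_obs, seq_len]
    scores = attn[:, :, mid].sum(dim=1)           # [n_heads, n_mid]
    # Keep the top-k middle tokens, re-sorted into positional order.
    top = scores.topk(n_topk, dim=-1).indices.sort(dim=-1).values + n_sink

    idx = torch.cat([
        torch.arange(n_sink).expand(n_heads, -1),                       # sinks
        top,                                                            # salient middle
        torch.arange(seq_len - n_recent, seq_len).expand(n_heads, -1),  # recent window
    ], dim=-1)
    gather = idx.unsqueeze(-1).expand(-1, -1, head_dim)
    return keys.gather(1, gather), values.gather(1, gather)
```

Sorting the selected indices keeps the compressed cache in positional order, which matters for rotary position embeddings; the part the paper identifies as genuinely hard is fitting this kind of eviction into static graphs and continuous batching in a production serving stack.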


SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

Jonathan Li / Nasim Farahini / Evgenii Iuliugin / Magnus Vesterlund / Christian Häggström / Guangtao Wang / Shubhangi Upasani / Ayush Sachdeva / Rui Li / Faline Fu / Chen Wu / Ayesha Siddiqua / John Long / Tuowen Zhao / Matheen Musaddiq / Håkan Zeffer / Yun Du / Mingran Wang / Qinghua Li / Bo Li / Urmish Thakker / Raghu Prabhakar

The proliferation of 100B+ parameter Large Language Models (LLMs) with 100k+ context length support has resulted in increasing demands for on-chip memory to support large KV caches. Techniques such as StreamingLLM and SnapKV demonstrate how to control KV cache size while maintaining model accuracy. Yet, these techniques are not commonly used within industrial deployments using frameworks like vLLM or SGLang. The reason is twofold: on one hand, the static graphs and continuous batching methodology employed by these frameworks make it difficult to admit modifications to the standard multi-head attention algorithm, while on the other hand, the accuracy implications of such techniques on modern instruction-following and reasoning models are not well understood, obfuscating the need for implementing these techniques. In this paper, we explore these accuracy implications on Llama-3.1-8B-Instruct and DeepSeek-R1, and develop SnapStream, a KV cache compression method that can be deployed at scale. We demonstrate the efficacy of SnapStream in a 16-way tensor-parallel deployment of DeepSeek-671B on SambaNova SN40L accelerators running at 128k context length and up to 1832 tokens per second in a real production setting. SnapStream enables $4\times$ improved on-chip memory usage and introduces minimal accuracy degradation on LongBench-v2, AIME24 and LiveCodeBench. To the best of our knowledge, this is the first implementation of sparse KV attention techniques deployed in a production inference system with static graphs and continuous batching.

cs / cs.AI / cs.AR / cs.DC