Sequence models, but way cooler 💖

Published: 2025/12/17 6:15:24

Super-short summary: Attention and SSMs, brought together in one framework! Model design is about to level up big time ✨

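✨ Gyaru-style sparkle points ✨
● Attention and SSMs, fused! What models can do goes through the roof 🚀
● Chasing the right balance between expressive power and ease of training 🥰
● With AI tech, our future might get even happier 🎵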

Here comes the detailed breakdown!

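Background: Sequence models (models that handle ordered data such as text and time series) are killing it in natural language processing and lots of other fields! But how they work is so complicated that our understanding hasn't caught up yet 💔 Transformers (attention mechanisms) and SSMs are each good at different things, so wouldn't it be unbeatable if we could use them together? ✨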

How Many Heads Make an SSM? A Unified Framework for Attention and State Space Models

Ali Ghodsi

Sequence modeling has produced diverse architectures -- from classical recurrent neural networks to modern Transformers and state space models (SSMs) -- yet a unified theoretical understanding of expressivity and trainability trade-offs remains limited. We introduce a unified framework that represents a broad class of sequence maps via an input-dependent effective interaction operator $W_{ij}(X)$, making explicit two recurring construction patterns: (i) the Unified Factorized Framework (Explicit) (attention-style mixing), in which $W_{ij}(X)$ varies through scalar coefficients applied to shared value maps, and (ii) Structured Dynamics (Implicit) (state-space recurrences), in which $W_{ij}$ is induced by a latent dynamical system. Using this framework, we derive three theoretical results. First, we establish the Interaction Rank Gap: models in the Unified Factorized Framework, such as single-head attention, are constrained to a low-dimensional operator span and cannot represent certain structured dynamical maps. Second, we prove an Equivalence (Head-Count) Theorem showing that, within our multi-head factorized class, representing a linear SSM whose lag operators span a $k$-dimensional subspace on length-$n$ sequences requires and is achievable with $H=k$ heads. Third, we prove a Gradient Highway Result, showing that attention layers admit inputs with distance-independent gradient paths, whereas stable linear dynamics exhibit distance-dependent gradient attenuation. Together, these results formalize a fundamental trade-off between algebraic expressivity (interaction/operator span) and long-range gradient propagation, providing theoretical grounding for modern sequence architecture design.
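To make the interaction operator $W_{ij}$ from the abstract a little more concrete, here is a minimal NumPy sketch. It is not the paper's code, and several things in it are assumptions made for illustration: the recurrence convention $h_t = A h_{t-1} + B u_t$, $y_t = C h_t$ (so that the lag operator is $W_{ij} = C A^{i-j} B$), the softmax form of single-head attention with $W_{ij}(X) = \alpha_{ij}(X) W_V$, the helper names `ssm_lag_operators`, `attention_operators`, and `span_dim`, and all sizes and matrices. The sketch compares the dimension of the linear span of the operators in each case and prints the norms of the SSM lag operators.

```python
import numpy as np

def ssm_lag_operators(A, B, C, n):
    """Lag operators W_l = C A^l B for h_t = A h_{t-1} + B u_t, y_t = C h_t (assumed convention)."""
    ops, A_pow = [], np.eye(A.shape[0])
    for _ in range(n):
        ops.append(C @ A_pow @ B)
        A_pow = A @ A_pow
    return np.stack(ops)  # shape (n, d_out, d_in)

def attention_operators(X, Wq, Wk, Wv):
    """Single-head softmax attention written as W_ij(X) = alpha_ij(X) * Wv."""
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(Wq.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)    # each row of alpha sums to 1
    return alpha[:, :, None, None] * Wv          # shape (n, n, d, d)

def span_dim(ops):
    """Dimension of the linear span of a stack of matrices (rank of their flattened forms)."""
    flat = ops.reshape(-1, ops.shape[-2] * ops.shape[-1])
    return int(np.linalg.matrix_rank(flat))

# Illustrative sizes and parameters (all hypothetical).
rng = np.random.default_rng(0)
n, d, m = 6, 2, 3                      # sequence length, token dim, state dim
A = np.diag([0.9, 0.5, -0.3])          # stable dynamics: spectral radius 0.9 < 1
B = rng.standard_normal((m, d))
C = rng.standard_normal((d, m))
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

W_ssm = ssm_lag_operators(A, B, C, n)
W_att = attention_operators(X, Wq, Wk, Wv)

k_ssm = span_dim(W_ssm)                # dimension spanned by the SSM lag operators
k_att = span_dim(W_att)                # dimension spanned by the attention operators

# Spectral norm of each lag operator; bounded by ||C|| * ||B|| * 0.9**lag here.
lag_norms = np.linalg.norm(W_ssm, ord=2, axis=(1, 2))

print("attention operator span dimension:", k_att)
print("SSM lag operator span dimension:  ", k_ssm)
print("SSM lag operator norms by lag:    ", np.round(lag_norms, 4))
```

In this toy setting every attention operator is a scalar multiple of the same `Wv`, so the span is one-dimensional, while the SSM lag operators span more than one dimension. Read through the abstract's head-count statement, that span dimension plays the role of $k$, the number of heads the factorized class would need. The printed lag norms are bounded by a geometrically shrinking envelope, a rough picture of the distance-dependent attenuation the abstract attributes to stable linear dynamics.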

cs / cs.LG / cs.AI
