MS-Temba! The magic that smartly understands long videos 🧙‍♀️✨

Published: 2025/12/17 8:05:56

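Ultra-short summary: Analyze long videos with Mamba! Activity recognition gets a massive boost 🚀
Gal-style sparkle points ✨
- ● Way better bang for the buck than Transformers! The lower computational cost means even long videos get processed smoothly 💖
- ● It handles every time scale (from short actions to long ones), so it can properly pick out all kinds of actions 😉
- ● No problem even when multiple actions happen at once! It detects them accurately, so it looks useful in all sorts of situations 🌟
Detailed explanation
- Background: Watching a long video (like 40+ minutes!) and understanding what's going on is hard, right? Until now, AI struggled to process long videos and was bad at telling apart actions of different durations 😢
- Method: The authors take on long-video analysis with MS-Temba, a model built on the state-space model "Mamba" 💪 Mamba's computational cost is low, so it runs smoothly even on long videos, and dilated SSMs were added so it can capture information at multiple time scales too!
- Results: Activity recognition accuracy on ADL videos (videos of daily living) apparently went way up! Parameter efficiency is reportedly more than 5x better than Transformer-based methods, too 👏
- Significance (the seriously amazing ♡ point): This could be useful in all sorts of fields, like surveillance cameras, elderly care, and smart homes! Maybe we could even build human-like AI⁉️
Ideas for real-world use 💡
- 💡 Automatically spot suspicious behavior on street security cameras! It might help stop crime before it happens 🚓
- 💡 Catch falls and other emergencies early in monitoring systems for older people! It could support a safe, worry-free life 👵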


MS-Temba: Multi-Scale Temporal Mamba for Understanding Long Untrimmed Videos

Arkaprava Sinha / Monish Soundar Raj / Pu Wang / Ahmed Helmy / Hieu Le / Srijan Das

Temporal Action Detection (TAD) in untrimmed videos poses significant challenges, particularly for Activities of Daily Living (ADL), which require models to (1) process long-duration videos, (2) capture temporal variations in actions, and (3) simultaneously detect dense overlapping actions. Existing CNN- and Transformer-based approaches struggle to jointly capture fine-grained detail and long-range structure at scale. The State-Space Model (SSM) based Mamba offers powerful long-range modeling, but naive application to TAD collapses fine-grained temporal structure and fails to account for the challenges inherent to TAD. To this end, we propose Multi-Scale Temporal Mamba (MS-Temba), which extends Mamba to TAD with newly introduced dilated SSMs. Each Temba block, comprising dilated SSMs coupled with our proposed additional losses, enables the learning of discriminative representations across temporal scales. A lightweight Multi-scale Mamba Fuser then unifies these multi-scale features via SSM-based aggregation, yielding precise action-boundary localization. With only 17M parameters, MS-Temba achieves state-of-the-art performance on the densely labeled ADL benchmarks TSU and Charades, and further generalizes to long-form video summarization, setting new state-of-the-art results on TVSum and SumMe.
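
To make the idea of a "dilated SSM" a little more concrete, here is a minimal toy sketch in NumPy. It is not the authors' implementation: it assumes one plausible reading of "dilated", namely that the hidden state at step t is updated from the state at step t - d instead of t - 1, and it uses a fixed diagonal linear recurrence rather than Mamba's input-dependent (selective) scan. The names `dilated_ssm_scan` and `multi_scale_features`, the parameters `a`, `b`, `c`, and the dilation rates are all hypothetical, and plain concatenation stands in for the SSM-based Multi-scale Mamba Fuser.

```python
import numpy as np

def dilated_ssm_scan(x, a, b, c, dilation=1):
    """Toy diagonal linear SSM with a dilated recurrence (assumed form).

    x: array of shape (T, D), one feature vector per time step.
    a, b, c: arrays of shape (D,), fixed per-channel parameters.
    Recurrence: h[t] = a * h[t - dilation] + b * x[t],  y[t] = c * h[t].
    """
    x = np.asarray(x, dtype=float)
    T, D = x.shape
    h = np.zeros((T, D))
    for t in range(T):
        # Steps earlier than `dilation` have no predecessor, so start from zero.
        prev = h[t - dilation] if t >= dilation else 0.0
        h[t] = a * prev + b * x[t]
    return c * h

def multi_scale_features(x, a, b, c, dilations=(1, 2, 4)):
    """Run one scan per dilation rate and concatenate along the feature axis.

    Concatenation is only a placeholder for the paper's SSM-based fuser.
    """
    outputs = [dilated_ssm_scan(x, a, b, c, d) for d in dilations]
    return np.concatenate(outputs, axis=1)

# Usage: a single impulse at t = 0 shows how dilation spreads the state.
x = np.zeros((6, 1))
x[0, 0] = 1.0
a, b, c = np.array([0.5]), np.array([1.0]), np.array([1.0])
y_dense = dilated_ssm_scan(x, a, b, c, dilation=1)
y_dilated = dilated_ssm_scan(x, a, b, c, dilation=2)
features = multi_scale_features(x, a, b, c)
```

In this toy setup the cost of each scan grows linearly with the number of time steps T, and a larger dilation lets the same decay factor reach further back in time, which is the intuition behind covering several temporal scales at once.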

cs / cs.CV
