Ultra Summary: Smart parsing for online video! With future prediction, it pulls off real-time, high-accuracy event detection!
✨ Gal-Style Sparkle Points ✨
● It parses video in real time (like, instantly!), so it's bound to shine for live streaming 💖
● It even predicts the future to catch events accurately, and it's robust to noise too. How amazing is that? ✨
● The model is light and capable enough to run on all sorts of devices, so it works on your phone too 📱💕
Here comes the detailed explanation!
Background: Event parsing for video has usually meant careful offline processing of the whole clip, but that takes time, and plenty of scenarios need real-time results, right? 🤔 This research aims at technology that uses both audio and visuals to parse online video "right now!"
Audio-visual event parsing plays a crucial role in understanding multimodal video content, but existing methods typically rely on offline processing of entire videos with large model sizes, limiting their real-time applicability. We introduce Online Audio-Visual Event Parsing (On-AVEP), a novel paradigm for parsing audio, visual, and audio-visual events by sequentially analyzing incoming video streams. The On-AVEP task necessitates models with two key capabilities: (1) accurate online inference, to effectively distinguish events with unclear and limited context in online settings, and (2) real-time efficiency, to balance high performance with computational constraints. To cultivate these, we propose the Predictive Future Modeling (PreFM) framework, featuring (a) predictive multimodal future modeling to infer and integrate beneficial future audio-visual cues, thereby enhancing contextual understanding, and (b) modality-agnostic robust representation along with focal temporal prioritization to improve precision and generalization. Extensive experiments on the UnAV-100 and LLP datasets show that PreFM outperforms state-of-the-art methods by a large margin while using significantly fewer parameters, offering an insightful approach for real-time multimodal video understanding. Code is available at https://github.com/XiaoYu-1123/PreFM.
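The abstract gives only a high-level view of PreFM, so here is a minimal, hypothetical PyTorch sketch of the core idea as described: at each streaming step, fuse the audio and visual features observed so far through a shared (modality-agnostic) projection, predict a few future feature vectors, and classify events from the combined current-plus-predicted context. All class names, dimensions, and module choices (`PreFMSketch`, the GRU encoder, `future_head`, etc.) are illustrative assumptions, not the authors' implementation — see the linked repository for the real code.

```python
import torch
import torch.nn as nn

class PreFMSketch(nn.Module):
    """Illustrative sketch (not the official PreFM): fuse observed
    audio/visual features, predict future cues, classify events."""

    def __init__(self, feat_dim=512, num_events=25, future_steps=4):
        super().__init__()
        # Shared projection applied to both modalities
        # (a stand-in for the modality-agnostic representation)
        self.shared_proj = nn.Linear(feat_dim, feat_dim)
        # Causal recurrent encoder keeps inference strictly online
        self.encoder = nn.GRU(feat_dim, feat_dim, batch_first=True)
        # Predicts `future_steps` future feature vectors from the current state
        self.future_head = nn.Linear(feat_dim, future_steps * feat_dim)
        self.future_steps = future_steps
        self.feat_dim = feat_dim
        # Multi-label event classifier over current + predicted-future context
        self.classifier = nn.Linear(2 * feat_dim, num_events)

    def forward(self, audio_feats, visual_feats, state=None):
        # audio_feats, visual_feats: (B, T_observed, feat_dim);
        # only frames seen so far -- no access to the real future.
        fused = self.shared_proj(audio_feats) + self.shared_proj(visual_feats)
        ctx, state = self.encoder(fused, state)
        current = ctx[:, -1]  # latest hidden state
        # Infer future cues and pool them into one context vector
        future = self.future_head(current).view(
            -1, self.future_steps, self.feat_dim)
        future_ctx = future.mean(dim=1)
        logits = self.classifier(torch.cat([current, future_ctx], dim=-1))
        return logits, state  # recurrent state carried across segments

# Simulated streaming inference over 10 incoming one-segment chunks
model = PreFMSketch()
state = None
for t in range(10):
    audio = torch.randn(1, 1, 512)   # placeholder audio features
    visual = torch.randn(1, 1, 512)  # placeholder visual features
    logits, state = model(audio, visual, state)
    events = logits.sigmoid() > 0.5  # per-segment multi-label decisions
```

Because the recurrent state is carried forward and each step sees only the segments received so far, the loop stays causal, which is the constraint the On-AVEP setting imposes; the predicted future vectors substitute for the real future context that offline methods would read directly.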