● It predicts what happens next in a video with an LLM (large language model)!
● The key point is that it uses linguistic knowledge, not just visual information 💖
● It could be useful in all kinds of areas, like surveillance systems and robotics!
Background: There's a technique called the "scene graph" that captures what's going on in a video, but predicting the future with it has been hard 😢 So this research started from the idea: what if linguistic knowledge could make those predictions much better?
Method: An LLM predicts the relationships between the things (objects) that appear in the video and the actions that will follow! By handling object prediction and relation prediction as separate stages, they built a framework (OOTSM) that gets the most out of the LLM, as sketched below!
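As a rough illustration of that two-stage split, here is a minimal sketch in Python. The prompt wording, the generic `generate` function, and the name `anticipate_two_stage` are assumptions for illustration only, not the paper's actual OOTSM templates or interface.

```python
# Minimal sketch of a two-stage LLM anticipation loop (hypothetical prompts;
# not the paper's actual OOTSM implementation).
from typing import Callable, Dict, List

def anticipate_two_stage(
    observed_graphs: List[str],      # textualized scene graphs of the observed frames
    generate: Callable[[str], str],  # any LLM text-generation function (assumption)
    horizon: int,                    # number of future frames to anticipate
) -> Dict[str, List[str]]:
    context = "Observed scene graphs:\n" + "\n".join(observed_graphs)

    # Stage 1: anticipate which objects will appear in the future frames.
    stage1 = context + f"\nList the objects expected in the next {horizon} frames, comma-separated:"
    future_objects = [o.strip() for o in generate(stage1).split(",") if o.strip()]

    # Stage 2: forecast an object-centric relation trajectory for each anticipated object.
    trajectories: Dict[str, List[str]] = {}
    for obj in future_objects:
        stage2 = (
            context
            + f"\nForecast the person-{obj} relations for each of the next "
            + f"{horizon} frames, one line per frame:"
        )
        trajectories[obj] = generate(stage2).splitlines()[:horizon]
    return trajectories
```

Splitting the task this way keeps each prompt short and focused: the model first commits to an object set, then reasons about one object's relations at a time.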
A scene graph is a structured representation of objects and their spatio-temporal relationships in dynamic scenes. Scene Graph Anticipation (SGA) involves predicting future scene graphs from video clips, enabling applications in intelligent surveillance and human-machine collaboration. While recent SGA approaches excel at leveraging visual evidence, long-horizon forecasting fundamentally depends on semantic priors and commonsense temporal regularities that are challenging to extract purely from visual features. To explicitly model these semantic dynamics, we propose Linguistic Scene Graph Anticipation (LSGA), a linguistic formulation of SGA that performs temporal relational reasoning over sequences of textualized scene graphs, with visual scene-graph detection handled by a modular front-end when operating on video. Building on this formulation, we introduce the Object-Oriented Two-Stage Method (OOTSM), a language-based framework that anticipates object-set dynamics and forecasts object-centric relation trajectories with temporal consistency regularization, and we evaluate it on a dedicated benchmark constructed from Action Genome annotations. Extensive experiments show that compact fine-tuned language models with up to 3B parameters consistently outperform strong zero- and one-shot API baselines, including GPT-4o, GPT-4o-mini, and DeepSeek-V3, under matched textual inputs and context windows. When coupled with off-the-shelf visual scene-graph generators, the resulting multimodal system achieves substantial improvements on video-based SGA, boosting long-horizon mR@50 by up to 21.9% over strong visual SGA baselines.
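For concreteness, here is what a textualized scene-graph sequence for LSGA might look like. This is only a sketch: the helper `textualize_frame`, the serialization format, and the example triplets are assumptions following the usual Action Genome (subject, predicate, object) form, not the paper's exact representation.

```python
# Minimal sketch of serializing frame-level scene-graph triplets into text
# (illustrative format only).
from typing import List, Tuple

Triplet = Tuple[str, str, str]  # e.g. ("person", "holding", "cup")

def textualize_frame(frame_id: int, triplets: List[Triplet]) -> str:
    """Serialize one frame's scene graph as a single text line."""
    return f"frame {frame_id}: " + "; ".join(f"{s} {p} {o}" for s, p, o in triplets)

observed = [
    textualize_frame(1, [("person", "looking_at", "cup"), ("person", "holding", "cup")]),
    textualize_frame(2, [("person", "holding", "cup"), ("person", "drinking_from", "cup")]),
]
# `observed` is the kind of textual context a language model would reason over
# to anticipate the scene graphs of frames 3, 4, ...
```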