Smarter Pharma Videos! VLM Scaling Research 🚀

Published: 2026/1/8 12:42:17

Super summary: Use a smart AI (a VLM) to quickly make sense of pharma-industry videos! It works even when you skimp on GPUs 👍

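✨ Gal-Style Sparkle Points ✨
● Pharma-industry videos are super long and come in tons of varieties, right? Making that efficient with AI is downright divine ✨
● The research is built to hold up even when GPU resources (the brains of image processing) are limited, which really hits you in the feels 💖
● Building a platform that can analyze videos while staying compliant (following the rules) is seriously impressive 🌟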

Here comes the detailed breakdown!

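Background

The pharma industry has mountains of video material: clinical trials, meetings, patient-facing content. But having humans watch all of it is a slog 💦 On top of that, GPUs are expensive, right? 💰 So the goal is to build an AI that understands videos smartly while skimping on GPUs!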

Scaling Vision Language Models for Pharmaceutical Long Form Video Reasoning on Industrial GenAI Platform

Suyash Mishra / Qiang Li / Srikanth Patil / Satyanarayan Pati / Baddu Narendra

Vision Language Models (VLMs) have shown strong performance on multimodal reasoning tasks, yet most evaluations focus on short videos and assume unconstrained computational resources. In industrial settings such as pharmaceutical content understanding, practitioners must process long-form videos under strict GPU, latency, and cost constraints, where many existing approaches fail to scale. In this work, we present an industrial GenAI framework that processes over 200,000 PDFs, 25,326 videos across eight formats (e.g., MP4 and M4V), and 888 multilingual audio files in more than 20 languages. Our study makes three contributions: (i) an industrial large-scale architecture for multimodal reasoning in pharmaceutical domains; (ii) an empirical analysis of over 40 VLMs on two leading benchmarks (Video-MME and MMBench) and a proprietary dataset of 25,326 videos across 14 disease areas; and (iii) four findings relevant to long-form video reasoning: the role of multimodality, attention mechanism trade-offs, temporal reasoning limits, and challenges of video splitting under GPU constraints. Results show 3-8x efficiency gains with SDPA attention on commodity GPUs, multimodality improving results in up to 8 of 12 task domains (especially length-dependent tasks), and clear bottlenecks in temporal alignment and keyframe detection across open- and closed-source VLMs. Rather than proposing a new "A+B" model, this paper characterizes the practical limits, trade-offs, and failure patterns of current VLMs under realistic deployment constraints, and provides actionable guidance for both researchers and practitioners designing scalable multimodal systems for long-form video understanding in industrial domains.
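The abstract names video splitting under GPU constraints as a challenge but does not describe a specific method. As a rough illustration of the general idea only, here is a minimal Python sketch that cuts a long video into overlapping time windows and picks a fixed number of frame timestamps per window, so each VLM call sees a bounded number of frames. The names `Segment` and `split_video`, the midpoint sampling rule, and all parameter values are assumptions made for this example, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """Hypothetical container for one window of a long video."""
    start_s: float               # window start, in seconds
    end_s: float                 # window end, in seconds
    frame_times_s: List[float]   # timestamps of the frames sampled in this window


def split_video(duration_s: float, segment_s: float, overlap_s: float,
                frames_per_segment: int) -> List[Segment]:
    """Split [0, duration_s] into overlapping windows with a fixed frame budget.

    Illustrative assumption: frames are sampled at the midpoints of
    `frames_per_segment` equal sub-intervals of each window.
    """
    if duration_s <= 0 or segment_s <= 0 or frames_per_segment <= 0:
        raise ValueError("duration_s, segment_s, frames_per_segment must be positive")
    if not 0 <= overlap_s < segment_s:
        raise ValueError("overlap_s must satisfy 0 <= overlap_s < segment_s")

    stride = segment_s - overlap_s  # how far each new window advances
    segments: List[Segment] = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)  # clip the last window
        step = (end - start) / frames_per_segment
        times = [start + (i + 0.5) * step for i in range(frames_per_segment)]
        segments.append(Segment(start, end, times))
        if end >= duration_s:
            break  # the video is fully covered
        start += stride
    return segments


if __name__ == "__main__":
    # A 10-minute video, 2-minute windows, 20 s overlap, 8 frames per window.
    for seg in split_video(600.0, 120.0, 20.0, 8):
        print(f"{seg.start_s:6.1f}s to {seg.end_s:6.1f}s: {len(seg.frame_times_s)} frames")
```

With these example values the video is covered by six windows of eight frames each. A larger overlap lowers the chance that an event is cut at a window boundary but raises the number of model calls, which is one way the GPU budget and temporal coverage pull against each other.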

cs / cs.CV / cs.LG
