Published: 2025/11/8 1:47:40

Robots Are Way Too Good at Grasping! ✨ (Zero-Shot Grasp Detection)

Super-short summary: Robots can grasp objects with AI, and they don't even need training data! 🤖💕

Gyaru-Style Sparkle Points ✨

● No giant dataset needed! A VLM (a model that handles images and language together) recognizes objects smartly 💖
● A three-stage prompt lets you give detailed instructions like how to shape the grasp, which is seriously impressive ✨
● It can grasp brand-new objects too, so it's bound to shine across all kinds of industries! 😍

Detailed Explanation

Background: Grasping (picking up) objects is genuinely hard for robots! 💦 Until now you either trained on massive datasets or computed the object's geometry, and those approaches often couldn't cope with objects they had never seen 😢

Method: Enter the VLM (Vision-Language Model)! ✨ Because it is trained on images and text together, you can tell it in words how to grasp something! Using an RGB-D image (color plus depth) and a three-stage prompt (basically a set of grasping instructions), it can pick up all kinds of objects! A rough sketch of what such staged prompts could look like follows below.
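
To make the "three-stage prompt" idea more concrete, here is a minimal Python sketch. The helper name build_staged_prompts and the prompt wording are illustrative assumptions on my part; the summary and abstract only say that the VLM is asked to generate a goal image where a straight rod "impales" the object, so treat these strings as a sketch, not the authors' actual prompts.

```python
# Illustrative only: the real VLAD-Grasp prompt text is not given in this summary.
def build_staged_prompts(object_name: str) -> list[str]:
    """Three coarse stages: describe the object, reason about where an
    antipodal grasp axis could pass through it, then request a goal image
    in which a straight rod "impales" the object along that axis."""
    return [
        f"Describe the {object_name} in this image and its rough shape.",
        f"Through which part could a thin straight rod pass so that both ends "
        f"stick out on opposite sides of the {object_name}?",
        f"Generate an image of the same scene with such a rod impaling the {object_name}.",
    ]

print(build_staged_prompts("mug"))
```

In practice each stage's answer would be fed back to the vision-language model as context for the next stage, ending with the generated goal image.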

Read the rest in the 「らくらく論文」 app

VLAD-Grasp: Zero-shot Grasp Detection via Vision-Language Models

Manav Kulshrestha / S. Talha Bukhari / Damon Conover / Aniket Bera

Robotic grasping is a fundamental capability for autonomous manipulation; however, most existing methods rely on large-scale expert annotations and necessitate retraining to handle new objects. We present VLAD-Grasp, a Vision-Language model Assisted zero-shot approach for Detecting grasps. From a single RGB-D image, our method (1) prompts a large vision-language model to generate a goal image where a straight rod "impales" the object, representing an antipodal grasp, (2) predicts depth and segmentation to lift this generated image into 3D, and (3) aligns generated and observed object point clouds via principal component analysis and correspondence-free optimization to recover an executable grasp pose. Unlike prior work, our approach is training-free and does not rely on curated grasp datasets. Despite this, VLAD-Grasp achieves performance that is competitive with or superior to that of state-of-the-art supervised models on the Cornell and Jacquard datasets. We further demonstrate zero-shot generalization to novel real-world objects on a Franka Research 3 robot, highlighting vision-language foundation models as powerful priors for robotic manipulation.
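
Step (3) of the pipeline aligns the generated and observed object point clouds. Below is a minimal NumPy sketch of a PCA-based coarse alignment, assuming both clouds are plain (N, 3) arrays; the function names are illustrative and the paper's correspondence-free optimization (which would also resolve the sign ambiguity of PCA axes) is not reproduced here.

```python
import numpy as np

def pca_frame(points: np.ndarray):
    """Centroid and right-handed principal axes (as columns) of an (N, 3) cloud."""
    centroid = points.mean(axis=0)
    centered = points - centroid
    # SVD of the centered cloud gives principal directions ordered by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axes = vt.T
    if np.linalg.det(axes) < 0:      # enforce a proper rotation (det = +1)
        axes[:, -1] *= -1
    return centroid, axes

def pca_align(generated_pts: np.ndarray, observed_pts: np.ndarray):
    """Coarse rigid transform (R, t) mapping the generated cloud onto the observed one."""
    c_gen, a_gen = pca_frame(generated_pts)
    c_obs, a_obs = pca_frame(observed_pts)
    R = a_obs @ a_gen.T              # rotate generated principal axes onto observed ones
    t = c_obs - R @ c_gen
    return R, t
```

Given such a transform, the rod's axis in the generated goal image could be mapped into the observed scene to define the antipodal grasp direction; the refinement that picks among the flipped-axis candidates is left to the paper's optimization step.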

cs / cs.RO / cs.AI / cs.LG