Super-short summary: An AI (EasyOcc) that teaches itself to understand 3D space has arrived! It's poised to shine in self-driving and lots of other fields! 🚗💨
Gal-style sparkle points ✨ ● It can recognize 3D space with zero manual labeling, which is divine 🥺💖 ● It cuts compute costs AND plays nicely with existing models 👍 ● Thanks to EasyOcc, the future of tech might totally change ⁉️
Detailed explanation ● Background: Getting AI to understand 3D space used to be a real struggle 💦 Conventional AI needed tons of labeled data, and the computation was complicated too… 😱 But thanks to recent VFMs (Visual Foundation Models), AI can now learn on its own!
● Method: EasyOcc uses Grounded-SAM and Metric3Dv2 to attach pseudo-labels (stand-in labels, not human-made ones) to 3D space 👀✨ Using those as hints, the AI learns the 3D scene! On top of that, it leverages temporal information to understand things in even more detail!
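To make the labeling idea concrete, here is a minimal sketch of how per-pixel semantic masks (the kind Grounded-SAM produces) and metric depth (the kind Metric3Dv2 produces) could be lifted into a 3D voxel label grid. The function name, array shapes, intrinsics handling, and grid parameters are all illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def lift_to_voxel_labels(semantic_mask, depth, K,
                         voxel_size=0.4,
                         grid_min=(-40.0, -40.0, -1.0),
                         grid_shape=(200, 200, 16)):
    """Back-project per-pixel class ids into a 3D voxel label grid.

    semantic_mask: (H, W) int class ids (e.g. Grounded-SAM-style output).   # assumption
    depth:         (H, W) metric depth in meters (e.g. Metric3Dv2-style).   # assumption
    K:             (3, 3) camera intrinsics.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Unproject every pixel: p_cam = depth * K^-1 @ [u, v, 1]^T
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    pts = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)

    labels = semantic_mask.reshape(-1)
    keep = depth.reshape(-1) > 0          # drop pixels with no valid depth
    pts, labels = pts[keep], labels[keep]

    # Quantize points into voxel indices and keep only those inside the grid.
    idx = np.floor((pts - np.asarray(grid_min)) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    idx, labels = idx[inside], labels[inside]

    # Write class ids per voxel (-1 = no pseudo-label). Last point wins here;
    # a real pipeline might prefer majority voting per voxel.
    grid = np.full(grid_shape, -1, dtype=np.int32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = labels
    return grid
```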
Self-supervised models have recently achieved notable advancements, particularly in the domain of semantic occupancy prediction. These models utilize sophisticated loss computation strategies to compensate for the absence of ground-truth labels. For instance, techniques such as novel view synthesis, cross-view rendering, and depth estimation have been explored to address semantic and depth ambiguity. However, such techniques typically incur high computational and memory costs during training, especially in the case of novel view synthesis. To mitigate these issues, we propose 3D pseudo-ground-truth labels generated by the foundation models Grounded-SAM and Metric3Dv2, and harness temporal information for label densification. Our 3D pseudo-labels can be easily integrated into existing models, yielding substantial performance improvements: mIoU increases by 45%, from 9.73 to 14.09, when they are integrated into the OccNeRF model. This stands in contrast to earlier advances in the field, which are often not readily transferable to other architectures. Additionally, we propose a streamlined model, EasyOcc, which achieves 13.86 mIoU while learning solely from our labels, avoiding the complex rendering strategies mentioned above. Furthermore, our method enables models to attain state-of-the-art performance when evaluated on the full scene without applying the camera mask: EasyOcc achieves 7.71 mIoU, outperforming the previous best model by 31%. These findings highlight the critical importance of foundation models, temporal context, and the choice of loss computation space in self-supervised learning for comprehensive scene understanding.
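As a complement to the abstract's mention of temporal densification, the following is a minimal sketch of the underlying idea: pseudo-labeled points from neighboring frames are transformed into the target frame's coordinates using ego poses, so the merged set covers regions a single view misses. The data layout, pose convention (ego-to-world 4x4 matrices), and function name are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def densify_with_temporal_frames(points_per_frame, labels_per_frame, poses, target_idx):
    """Merge pseudo-labeled points from several frames into one target frame.

    points_per_frame: list of (N_i, 3) point arrays in each frame's ego coords.  # assumption
    labels_per_frame: list of (N_i,) class-id arrays (per-frame pseudo-labels).
    poses:            list of (4, 4) ego-to-world transforms, one per frame.     # assumed convention
    """
    world_to_target = np.linalg.inv(poses[target_idx])
    merged_pts, merged_lbls = [], []
    for pts, lbls, pose in zip(points_per_frame, labels_per_frame, poses):
        # source ego -> world -> target ego
        T = world_to_target @ pose
        homo = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
        merged_pts.append((homo @ T.T)[:, :3])
        merged_lbls.append(lbls)
    return np.concatenate(merged_pts), np.concatenate(merged_lbls)
```

Note that a straight rigid transform like this smears dynamic objects (e.g., moving cars) across frames; a real pipeline would need to handle moving objects separately.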