Published: 2025/12/3 21:18:45

Heya~! It's Aya, the ultimate gal-explainer AI 💕 Let's hype up this paper! ✨

The Secret of LLMs! Let's Master In-Context Learning ☆

Super summary: This research makes the LLM (AI) brain 🧠 even smarter at ICL (learning from a few shown examples)! The big takeaway: the initial setup (initialization) matters a LOT 💖

✨ Gal-Style Sparkle Points ✨

● LLMs are geniuses that learn new things just from being shown a few hints (examples) ✨ That's ICL!
● Just changing the initial setup (initialization) can send ICL performance through the roof! Kind of like makeup 💄
● With this technique, AI should be able to do way more things! The future is too exciting~ 🥰


The Initialization Determines Whether In-Context Learning Is Gradient Descent

Shifeng Xie / Rui Yuan / Simone Rossi / Thomas Hannagan

In-context learning (ICL) in large language models (LLMs) is a striking phenomenon, yet its underlying mechanisms remain only partially understood. Previous work connects linear self-attention (LSA) to gradient descent (GD), but this connection has primarily been established under simplified conditions, with zero-mean Gaussian priors and zero initialization for GD. However, subsequent studies have challenged this simplified view by highlighting its overly restrictive assumptions, demonstrating instead that under conditions such as multi-layer or nonlinear attention, self-attention performs optimization-like inference that is akin to, but distinct from, GD. We investigate how multi-head LSA approximates GD under more realistic conditions, specifically when incorporating non-zero Gaussian prior means in linear regression formulations of ICL. We first extend the multi-head LSA embedding matrix by introducing an initial estimate for the query, referred to as the initial guess. We prove an upper bound on the number of heads needed in the ICL linear regression setup. Our experiments confirm this result and further reveal that a performance gap between one-step GD and multi-head LSA persists. To close this gap, we introduce yq-LSA, a simple generalization of single-head LSA with a trainable initial guess yq. We theoretically establish the capabilities of yq-LSA and provide experimental validation on linear regression tasks, thereby extending the theory that bridges ICL and GD. Finally, inspired by our findings in the linear regression case, we consider widely used LLMs augmented with initial-guess capabilities and show that their performance improves on a semantic similarity task.
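To make the GD-with-initialization picture concrete, here is a minimal numerical sketch of the in-context linear regression setup the abstract describes: one step of gradient descent on the context examples, started either from zero or from a non-zero point whose prediction for the query plays the role of the initial guess yq. The dimensions, learning rate, and prior mean below are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

# Illustrative sketch (not the paper's exact construction): one step of
# gradient descent on in-context linear regression. The LSA <-> GD
# correspondence says a linear self-attention layer can implement this
# kind of prediction for the query.

rng = np.random.default_rng(0)
d, n = 5, 32                      # feature dim, number of context examples

w_star = rng.normal(1.0, 1.0, d)  # task weights from a NON-zero-mean Gaussian prior
X = rng.normal(size=(n, d))       # in-context inputs x_1..x_n
y = X @ w_star                    # in-context labels y_1..y_n
x_q = rng.normal(size=d)          # query input

# One step of GD on the squared loss L(w) = ||Xw - y||^2 / (2n),
# starting from w0 (w0 = 0 recovers the usual zero-initialization result).
def one_step_gd_prediction(w0, lr):
    grad = X.T @ (X @ w0 - y) / n
    w1 = w0 - lr * grad
    return x_q @ w1

# Zero initialization: the prediction is (lr/n) * sum_i (x_q . x_i) * y_i,
# a scaled dot-product read-out over the context -- exactly the shape of
# computation a linear attention head performs.
pred_zero = one_step_gd_prediction(np.zeros(d), lr=0.5)

# Non-zero initialization shifts the prediction by an initial-guess term
# y_q = x_q . w0, which is what a trainable initial guess can supply.
w0 = np.full(d, 1.0)              # e.g. start at the prior mean (assumption)
pred_init = one_step_gd_prediction(w0, lr=0.5)
y_q = x_q @ w0

print(f"target        : {x_q @ w_star:+.3f}")
print(f"GD from zero  : {pred_zero:+.3f}")
print(f"GD from prior : {pred_init:+.3f} (initial guess y_q = {y_q:+.3f})")
```

Under this toy setup, starting the single GD step from the prior mean rather than from zero typically lands closer to the target when the task prior is non-zero-mean, which is the intuition behind giving LSA a trainable initial guess.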

cs / cs.LG / cs.AI