Published: 2025/11/8 0:46:36

LLMs Transformed by Post-Training! Unraveling Their Internal Structure ✨ (TL;DR: Solving LLM mysteries to brighten the future!)

I. Research Overview

  1. Purpose of the Research

    • This study investigates what changes post-training causes inside LLMs (large language models)! 🤔
    • It focuses on LLMs' "knowledge", "truthfulness", "refusal", and "confidence" to uncover how post-training affects each of them!
    • It analyzes the internal structure of LLMs in ways prior work couldn't! Amazing!
    • This research might make LLMs even more usable and smarter 💕
  2. Research Background

Continued in the "らくらく論文" app

How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence

Hongzhe Du / Weikai Li / Min Cai / Karim Saraipour / Zimin Zhang / Himabindu Lakkaraju / Yizhou Sun / Shichang Zhang

Post-training is essential for the success of large language models (LLMs), transforming pre-trained base models into more useful and aligned post-trained models. While many works have studied post-training algorithms and evaluated post-trained models by their outputs, how post-training reshapes LLMs internally remains understudied. In this paper, we compare base and post-trained LLMs mechanistically from four perspectives to better understand post-training effects. Our findings across model families and datasets reveal that: (1) Post-training does not change the factual knowledge storage locations, and it adapts knowledge representations from the base model while developing new knowledge representations; (2) Both truthfulness and refusal can be represented by vectors in the hidden representation space. The truthfulness direction is highly similar between the base and post-trained model, and it is effectively transferable for interventions; (3) The refusal direction is different between the base and post-trained models, and it shows limited forward transferability; (4) Differences in confidence between the base and post-trained models cannot be attributed to entropy neurons. Our study provides insights into the fundamental mechanisms preserved and altered during post-training, facilitates downstream tasks like model steering, and could potentially benefit future research in interpretability and LLM post-training. Our code is publicly available at https://github.com/HZD01/post-training-mechanistic-analysis.
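The abstract's claims about truthfulness and refusal being "represented by vectors in the hidden representation space" that are "transferable for interventions" can be sketched with two standard building blocks: a difference-of-means direction between two sets of hidden states, and activation addition to steer along that direction. The numpy toy below uses synthetic hidden states and illustrative function names; it is a minimal sketch of the general technique, not the authors' implementation (their actual code is in the linked repository).

```python
import numpy as np

rng = np.random.default_rng(0)

def direction_from_means(pos, neg):
    # Difference-of-means direction between two activation sets, unit-normalized.
    d = pos.mean(axis=0) - neg.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(hidden, direction, alpha):
    # Activation addition: shift every hidden state along `direction` by `alpha`.
    return hidden + alpha * direction

def cosine(u, v):
    # Cosine similarity, e.g. to compare base vs. post-trained directions.
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy "hidden states": two clusters separated along a planted direction.
dim = 64
planted = rng.normal(size=dim)
pos = rng.normal(size=(100, dim)) + planted   # e.g. truthful-prompt activations
neg = rng.normal(size=(100, dim)) - planted   # e.g. untruthful-prompt activations

d = direction_from_means(pos, neg)
print(cosine(d, planted) > 0.9)   # the extracted direction recovers the planted one

# Steering the "negative" activations raises their projection onto the direction.
steered = steer(neg, d, alpha=4.0)
print((steered @ d).mean() > (neg @ d).mean())
```

In practice the two activation sets would come from contrastive prompt pairs run through a specific layer of the base or post-trained model, and `cosine` would measure how similar the two models' directions are, which is how claims like "the truthfulness direction is highly similar between the base and post-trained model" are quantified.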

cs / cs.CL / cs.AI / cs.LG