Super-short summary: RLVR, the hot reinforcement-learning recipe for LLMs, looks amazing but actually has limits?! We dissect it thoroughly and get real about the future of AI! ✨
Recent advances in LLMs highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI capabilities, particularly on complex logical tasks. However, it remains unclear whether the current practice of RLVR truly expands a model's reasoning boundary or mainly amplifies high-reward outputs that the base model already knows, yielding better precision. This study presents an empirical investigation that provides new insight into the potential limits of the common RLVR recipe.

We examine how, under current training conditions, RLVR operates as a support-constrained optimization mechanism: it stays within the base model's initial distribution and may therefore restrict the discovery of entirely novel solutions. We also identify an entropy-reward trade-off: while the current RLVR recipe reliably improves precision, it may progressively narrow exploration and overlook correct yet underrepresented solutions.

Extensive experiments show that although the current RLVR recipe consistently improves pass@1, the shrinkage of empirical support generally outweighs its expansion under larger sampling budgets, failing to recover correct answers that were previously accessible to the base model. Interestingly, we also observe that while RLVR sometimes increases token-level entropy, creating greater uncertainty at each generation step, answer-level entropy declines; these seemingly more uncertain generation paths ultimately converge onto a smaller set of distinct answers.

Taken together, our findings reveal potential limits of the current RLVR recipe in extending reasoning horizons. Breaking this invisible leash may require future algorithmic innovations, such as explicit exploration mechanisms or hybrid strategies that allocate probability mass to underrepresented solution regions.
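To make the pass@k and entropy observations concrete, here is a minimal, self-contained Python sketch. It is not the paper's evaluation code: the per-problem sample counts and answer strings are invented for illustration. It uses the standard unbiased pass@k estimator and a simple answer-level entropy calculation to show how a model can win at pass@1 yet fall behind at larger sampling budgets once its empirical support shrinks, and how samples can collapse onto fewer distinct answers.

```python
import math
from collections import Counter


def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), where c of the
    n sampled completions are correct (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def mean_pass_at_k(counts, k):
    """Average pass@k over problems; counts[i] = (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)


# Invented per-problem correct-sample counts out of n = 64 samples each.
# The hypothetical "rlvr" model is more precise on the problems it still
# solves, but its empirical support has shrunk: it never solves the last
# two problems, which the base model occasionally solved.
base = [(64, 10), (64, 6), (64, 20), (64, 2), (64, 1)]
rlvr = [(64, 40), (64, 30), (64, 55), (64, 0), (64, 0)]

for k in (1, 8, 64):
    print(f"pass@{k:<2}  base={mean_pass_at_k(base, k):.3f}  "
          f"rlvr={mean_pass_at_k(rlvr, k):.3f}")
# Pattern: rlvr wins at k=1, but the base model overtakes it at large k
# because its support still covers the problems rlvr no longer solves.


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution over
    distinct final answers, i.e. answer-level entropy."""
    total = len(answers)
    return -sum((c / total) * math.log(c / total)
                for c in Counter(answers).values())


# Invented final answers from 8 samples per model: generation paths can
# look locally uncertain yet still collapse onto fewer distinct answers.
print(answer_entropy(["42", "41", "42", "7", "13", "42", "6", "42"]))    # base
print(answer_entropy(["42", "42", "42", "42", "42", "41", "42", "42"]))  # rlvr
```

Token-level entropy, by contrast, would be computed from the model's per-step next-token distributions, which requires access to its logits and is therefore omitted from this sketch.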