Title & ultra-short summary: DeepSearch! A framework that supercharges LLM reasoning 🎉
🌟 Gal-style sparkle points ✨
● Fixes the "slump" of LLMs (those clever AIs) ✨ In other words, it tackles the phenomenon where training plateaus!
● Plugs MCTS (a star of game AI) into RLVR (an amazing reward system)! The strongest combo, right?
● An advanced problem-solving AI is born! Smart chatbots and more, the dream keeps growing~💖
Detailed explanation
● Background: Today's LLMs are impressive, but reasoning (actually thinking things through) is still a weak spot 💧 Their training methods fall short, so they often fail to find all the solutions. DeepSearch showed up to overcome exactly that weakness!
● Method: DeepSearch's plan is to plug MCTS into RLVR and widen the exploration! Because it can hunt for answers in lots of different ways, training proceeds smoothly ✨ On top of that, it hands out rewards cleverly, prunes wasteful computation... it's packed with fine-grained tricks!
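To make the "plug MCTS into training" idea concrete, here is a minimal toy sketch of the classic MCTS loop (selection → expansion → simulation → backpropagation) over a stand-in reasoning task with a verifiable reward. This is hypothetical illustration code, not the authors' implementation: the toy state, moves, and reward are invented for the sketch.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state          # partial "reasoning" state (here: an int)
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0            # sum of rollout rewards

    def uct(self, c=1.4):
        # Standard UCT score: exploit mean reward, explore rarely-visited nodes.
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits)

MOVES = [1, 2, 3]    # toy "reasoning steps": add 1, 2, or 3
TARGET = 10          # verifiable goal (stands in for a checkable final answer)

def rollout(state):
    # Random playout; reward 1 only if we land on the target exactly,
    # mimicking RLVR's verifiable 0/1 reward.
    while state < TARGET:
        state += random.choice(MOVES)
    return 1.0 if state == TARGET else 0.0

def mcts(root_state, iters=500):
    root = Node(root_state)
    for _ in range(iters):
        node = root
        # 1) Selection: descend by UCT until reaching a leaf.
        while node.children:
            node = max(node.children, key=Node.uct)
        # 2) Expansion: add children unless the state is terminal.
        if node.state < TARGET:
            node.children = [Node(node.state + m, node) for m in MOVES]
            node = random.choice(node.children)
        # 3) Simulation and 4) Backpropagation.
        reward = rollout(node.state)
        while node:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Return the most-visited first step, as a supervision signal would use.
    return max(root.children, key=lambda n: n.visits).state

random.seed(0)
print(mcts(0))  # one of the moves 1, 2, or 3
```

In DeepSearch the analogue of `rollout` is generating and verifying a reasoning trace, and the backed-up statistics give per-step credit assignment during training rather than only at inference time.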
● Results: With DeepSearch, training no longer stalls! In other words, the model just keeps getting smarter 💖 It can now solve hard problems too, and the AI's potential shoots way up ⤴︎⤴︎
Although RLVR has become an essential component for developing advanced reasoning skills in language models, contemporary studies have documented training plateaus after thousands of optimization steps, i.e., notable decreases in performance gains despite increased computational investment. This limitation stems from the sparse exploration patterns inherent in current RLVR practices, where models rely on limited rollouts that often miss critical reasoning paths and fail to provide systematic coverage of the solution space. We present DeepSearch, a framework that integrates Monte Carlo Tree Search (MCTS) directly into RLVR training. In contrast to existing methods that rely on tree search only at inference, DeepSearch embeds structured search into the training loop, enabling systematic exploration and fine-grained credit assignment across reasoning steps. Through training-time exploration, DeepSearch addresses the fundamental bottleneck of insufficient exploration, which leads to diminishing performance improvements over prolonged training steps. Our contributions include: (1) a global frontier selection strategy that prioritizes promising nodes across the search tree, (2) selection with entropy-based guidance that identifies confident paths for supervision, and (3) adaptive replay buffer training with solution caching for efficiency. Experiments on mathematical reasoning benchmarks show that DeepSearch achieves 62.95% average accuracy and establishes a new state-of-the-art for 1.5B reasoning models, while using 5.7x fewer GPU hours than extended training approaches. These results highlight the importance of strategic exploration over brute-force scaling and demonstrate the promise of algorithmic innovation for advancing RLVR methodologies. DeepSearch establishes a new direction for scaling reasoning capabilities through systematic search rather than prolonged computation.
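Contribution (2), entropy-based guidance, can be illustrated with a small sketch: among candidate reasoning paths that reach a verified answer, prefer the one the model was most confident about, measured by the mean token-level entropy of its per-step distributions. This is an invented toy under that reading of the abstract, not the paper's actual scoring code; all names and data below are hypothetical.

```python
import math

def step_entropy(probs):
    # Shannon entropy (in nats) of one step's token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def path_confidence(step_distributions):
    # Lower mean entropy across steps = a more confident path.
    ents = [step_entropy(p) for p in step_distributions]
    return sum(ents) / len(ents)

def select_supervision_path(candidates):
    # candidates: list of (is_correct, step_distributions) pairs.
    # Keep only verified-correct paths, then pick the most confident one.
    verified = [c for c in candidates if c[0]]
    if not verified:
        return None
    return min(verified, key=lambda c: path_confidence(c[1]))

# Two correct paths: the second has sharply peaked distributions
# (low entropy), so it is chosen for supervision.
paths = [
    (True,  [[0.4, 0.3, 0.3], [0.5, 0.25, 0.25]]),
    (True,  [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]),
    (False, [[0.99, 0.01]]),   # wrong answer: excluded regardless of confidence
]
best = select_supervision_path(paths)
print(round(path_confidence(best[1]), 3))
```

The same confidence signal could also feed the frontier-selection step, ranking unexplored nodes across the whole tree rather than only within one branch, though the exact scoring used in DeepSearch is specified in the paper, not here.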