ゲームAI、最終回まで最強になるってマジ！？😎✨

Published：2025/11/7 20:30:07

ゲームAI、最終回まで最強になるってマジ！？😎✨

超要約：ゲームAIの学習が、最終的にちゃんと賢くなる方法を見つけたよ！🎉

● ゲームAIが、最後まできちんと賢くなるようにしたんだって！💖 ● 「ブラックボックス還元」っていうスゴ技で、どんなゲームにも応用できるっぽい！😳 ● AIがもっと賢くなるから、色んなことに役立つ予感！😍

詳細解説いくよ～！

背景ゲームAI、最近めっちゃ進化してるじゃん？🎮でもさ、AIの学習って、たまに「あれ？最後は微妙じゃね？」みたいなこと、あるんだよね～💦今回の研究は、AIが最後までちゃんと賢くなるように、頑張ったって話だよ！

続きは「らくらく論文」アプリで

From Average-Iterate to Last-Iterate Convergence in Games: A Reduction and Its Applications

Yang Cai / Haipeng Luo / Chen-Yu Wei / Weiqiang Zheng

The convergence of online learning algorithms in games under self-play is a fundamental question in game theory and machine learning. Among various notions of convergence, last-iterate convergence is particularly desirable, as it reflects the actual decisions made by the learners and captures the day-to-day behavior of the learning dynamics. While many algorithms are known to converge in the average-iterate, achieving last-iterate convergence typically requires considerably more effort in both the design and the analysis of the algorithm. Somewhat surprisingly, we show in this paper that for a large family of games, there exists a simple black-box reduction that transforms the average iterates of an uncoupled learning dynamics into the last iterates of a new uncoupled learning dynamics, thus also providing a reduction from last-iterate convergence to average-iterate convergence. Our reduction applies to games where each player's utility is linear in both their own strategy and the joint strategy of all opponents. This family includes two-player bimatrix games and generalizations such as multi-player polymatrix games. By applying our reduction to the Optimistic Multiplicative Weights Update algorithm, we obtain new state-of-the-art last-iterate convergence rates for uncoupled learning dynamics in multi-player zero-sum polymatrix games: (1) an $O(\frac{\log d}{T})$ last-iterate convergence rate under gradient feedback, representing an exponential improvement in the dependence on the dimension $d$ (i.e., the maximum number of actions available to either player); and (2) an $\widetilde{O}(d^{\frac{1}{5}} T^{-\frac{1}{5}})$ last-iterate convergence rate under bandit feedback, improving upon the previous best rates of $\widetilde{O}(\sqrt{d} T^{-\frac{1}{8}})$ and $\widetilde{O}(\sqrt{d} T^{-\frac{1}{6}})$.

cs / cs.GT / cs.LG / math.OC

Arxivで見る