Reinforcement Learning for Mean-field System with Unknown System Information
LIN Yingxia1, QI Qingyuan2
1. College of Automation, Qingdao University, Qingdao 266071, China; 2. Qingdao Innovation and Development Center of Harbin Engineering University, Qingdao 266000, China
Abstract: In this paper, the infinite horizon linear quadratic (LQ) optimal control problem for mean-field systems with unknown system information is solved using a completely model-free reinforcement learning (RL) approach. Although the mean-field terms in the system dynamics and the cost function destroy the adaptiveness of the control law, the optimal stabilization control is successfully obtained by combining the proposed RL algorithm with Least Squares Temporal Difference (LSTD) estimation. In addition, the control policy is further improved by introducing off-policy learning. We also prove that the algorithm produces stabilizing policies provided the estimation errors remain sufficiently small.
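To illustrate the model-free policy-iteration idea underlying this class of methods, the sketch below runs least-squares Q-function evaluation (LSTD-Q) with greedy policy improvement on a scalar LQ problem. It is a minimal illustration only: the system parameters `a`, `b` and cost weights `q`, `r` are hypothetical, the mean-field coupling terms treated in the paper are omitted for brevity, and the simulated dynamics stand in for the unknown system that the learner queries as a black box.

```python
import numpy as np

# --- illustrative scalar system (hypothetical values, not from the paper) ---
a, b = 1.1, 1.0      # unknown to the learner; used only to simulate transitions
q, r = 1.0, 1.0      # quadratic cost weights

rng = np.random.default_rng(0)

def collect(k, n=200):
    """One-step transitions under the behavior policy u = -k*x + exploration."""
    x = rng.uniform(-1, 1, n)
    u = -k * x + 0.5 * rng.uniform(-1, 1, n)    # exploration noise
    xn = a * x + b * u                          # system queried as a black box
    return x, u, xn

def phi(x, u):
    """Quadratic features so that Q(x, u) = phi(x, u) @ theta."""
    return np.stack([x * x, 2 * x * u, u * u], axis=-1)

def lstdq(k, n=200):
    """Least-squares solution of the Bellman equation for the policy u = -k*x."""
    x, u, xn = collect(k, n)
    un = -k * xn                                # next action from target policy
    A = phi(x, u) - phi(xn, un)                 # Bellman residual features
    c = q * x * x + r * u * u                   # one-step cost
    theta, *_ = np.linalg.lstsq(A, c, rcond=None)
    return theta                                # theta = [H_xx, H_xu, H_uu]

k = 1.0                                         # initial stabilizing gain
for _ in range(10):                             # policy iteration loop
    H_xx, H_xu, H_uu = lstdq(k)
    k = H_xu / H_uu                             # greedy policy improvement

# model-based optimal gain via Riccati iteration, for comparison only
p = 1.0
for _ in range(500):
    p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
k_star = a * b * p / (r + b * b * p)
print(abs(k - k_star))                          # learned gain matches the LQR gain
```

Because the transitions here are deterministic and the Q-function is exactly quadratic, each LSTD-Q solve recovers the policy's Q-function exactly and the loop reduces to Hewer's policy iteration; handling process noise and the mean-field expectation terms is precisely what requires the additional machinery developed in the paper.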
LIN Yingxia, QI Qingyuan. Reinforcement Learning for Mean-field System with Unknown System Information[J]. Complex Systems and Complexity Science, 2025, 22(3): 153-160.