Reinforcement Learning for Mean-field System with Unknown System Information
LIN Yingxia1, QI Qingyuan2
1. College of Automation, Qingdao University, Qingdao 266071, China; 2. Qingdao Innovation and Development Center of Harbin Engineering University, Qingdao 266000, China
Abstract: In this paper, the infinite horizon linear quadratic (LQ) optimal control problem for mean-field systems with unknown system information is solved using a completely model-free reinforcement learning (RL) approach. Although the mean-field terms in the system dynamics and the cost function destroy the adaptiveness of the control law, the optimal stabilization control is successfully obtained by combining the proposed RL algorithm with Least Squares Temporal Difference (LSTD) estimation. In addition, the control policy is further improved by introducing off-policy learning. We also prove that the algorithm produces stabilizing policies provided the estimation errors remain sufficiently small.
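To illustrate the model-free policy-iteration idea underlying this class of methods, the sketch below runs least-squares Q-function evaluation (LSTD-Q) with greedy policy improvement on a scalar LQ problem. It is a minimal illustration only: the system parameters `a`, `b` and cost weights `q`, `r` are hypothetical, the mean-field coupling terms treated in the paper are omitted for brevity, and the simulated dynamics stand in for the unknown system that the learner queries as a black box.

```python
import numpy as np

# --- illustrative scalar system (hypothetical values, not from the paper) ---
a, b = 1.1, 1.0      # unknown to the learner; used only to simulate transitions
q, r = 1.0, 1.0      # quadratic cost weights

rng = np.random.default_rng(0)

def collect(k, n=200):
    """One-step transitions under the behavior policy u = -k*x + exploration."""
    x = rng.uniform(-1, 1, n)
    u = -k * x + 0.5 * rng.uniform(-1, 1, n)    # exploration noise
    xn = a * x + b * u                          # system queried as a black box
    return x, u, xn

def phi(x, u):
    """Quadratic features so that Q(x, u) = phi(x, u) @ theta."""
    return np.stack([x * x, 2 * x * u, u * u], axis=-1)

def lstdq(k, n=200):
    """Least-squares solution of the Bellman equation for the policy u = -k*x."""
    x, u, xn = collect(k, n)
    un = -k * xn                                # next action from target policy
    A = phi(x, u) - phi(xn, un)                 # Bellman residual features
    c = q * x * x + r * u * u                   # one-step cost
    theta, *_ = np.linalg.lstsq(A, c, rcond=None)
    return theta                                # theta = [H_xx, H_xu, H_uu]

k = 1.0                                         # initial stabilizing gain
for _ in range(10):                             # policy iteration loop
    H_xx, H_xu, H_uu = lstdq(k)
    k = H_xu / H_uu                             # greedy policy improvement

# model-based optimal gain via Riccati iteration, for comparison only
p = 1.0
for _ in range(500):
    p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
k_star = a * b * p / (r + b * b * p)
print(abs(k - k_star))                          # learned gain matches the LQR gain
```

Because the transitions here are deterministic and the Q-function is exactly quadratic, each LSTD-Q solve recovers the policy's Q-function exactly and the loop reduces to Hewer's policy iteration; handling process noise and the mean-field expectation terms is precisely what requires the additional machinery developed in the paper.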
LIN Yingxia, QI Qingyuan. Reinforcement Learning for Mean-field System with Unknown System Information[J]. Complex Systems and Complexity Science, 2025, 22(3): 153-160.