SCIENCE CHINA Information Sciences, Volume 62, Issue 12: 222201(2019) https://doi.org/10.1007/s11432-018-9865-9

Online adaptive Q-learning method for fully cooperative linear quadratic dynamic games

  • Received: Dec 19, 2018
  • Accepted: Mar 29, 2019
  • Published: Nov 12, 2019


A model-based offline policy iteration (PI) algorithm and a model-free online Q-learning algorithm are proposed for solving fully cooperative linear quadratic dynamic games. The PI-based adaptive Q-learning method can learn the feedback Nash equilibrium online from state samples generated by behavior policies, without querying the system model. Unlike existing Q-learning methods, the proposed algorithm performs both policy evaluation and policy improvement adaptively. We prove the convergence of the offline PI algorithm by showing that it is equivalent to Newton's method applied to the game algebraic Riccati equation (GARE). Furthermore, we prove that the proposed Q-learning method converges to the Nash equilibrium under a small learning rate, provided that certain persistence of excitation conditions hold; these conditions can be easily met by suitable behavior policies. Simulation results demonstrate the good performance of the proposed online adaptive Q-learning algorithm.
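As a rough illustration of the model-based offline PI idea, the following minimal Python sketch alternates policy evaluation and policy improvement for a toy problem. It is not the algorithm from the paper. It assumes a discrete-time system with a scalar state, x_{k+1} = a*x_k + sum_i b_i*u_{i,k}, and assumes that all players share a single cost, sum_k (q*x_k^2 + sum_i r_i*u_{i,k}^2), so that the joint gains can be improved together. All names (`evaluate_policy`, `improve_policy`, `policy_iteration`, `riccati_residual`) and all numerical values are hypothetical.

```python
def evaluate_policy(a, b, r, q, k):
    """Policy evaluation: cost coefficient p of the joint policy u_i = -k_i * x.

    Solves p = q + sum(r_i * k_i^2) + a_c^2 * p, where a_c is the
    closed-loop coefficient. The gains must be stabilizing (|a_c| < 1).
    """
    a_c = a - sum(bi * ki for bi, ki in zip(b, k))
    if abs(a_c) >= 1.0:
        raise ValueError("gains are not stabilizing")
    stage = q + sum(ri * ki * ki for ri, ki in zip(r, k))
    return stage / (1.0 - a_c * a_c)


def improve_policy(a, b, r, p):
    """Policy improvement: joint minimizer of the shared quadratic cost.

    Closed form of (R + p*b*b^T)^{-1} * b * p * a for diagonal R and scalar state.
    """
    s = sum(bi * bi / ri for bi, ri in zip(b, r))
    return [p * a * (bi / ri) / (1.0 + p * s) for bi, ri in zip(b, r)]


def policy_iteration(a, b, r, q, k0, iters=20):
    """Alternate evaluation and improvement from stabilizing initial gains k0."""
    k = list(k0)
    p = None
    for _ in range(iters):
        p = evaluate_policy(a, b, r, q, k)
        k = improve_policy(a, b, r, p)
    return p, k


def riccati_residual(a, b, r, q, p):
    """Residual of the scalar Riccati equation p = q + a^2 * p / (1 + p * s)."""
    s = sum(bi * bi / ri for bi, ri in zip(b, r))
    return q + a * a * p / (1.0 + p * s) - p


if __name__ == "__main__":
    # Hypothetical two-player example with an unstable open-loop system (a > 1).
    a, b, r, q = 1.2, [1.0, 0.5], [1.0, 2.0], 1.0
    p, k = policy_iteration(a, b, r, q, k0=[0.5, 0.4])
    print("p =", p, "gains =", k)
    print("Riccati residual =", riccati_residual(a, b, r, q, p))
```

In this toy setting the Riccati residual shrinks rapidly across iterations, which is the behavior one would expect from a Newton-type method. The sketch uses the model parameters a and b directly, so it corresponds only to the offline, model-based setting and not to the online Q-learning method.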


