SCIENCE CHINA Information Sciences, Volume 62, Issue 5: 052204 (2019) https://doi.org/10.1007/s11432-018-9602-1

Policy iteration based Q-learning for linear nonzero-sum quadratic differential games

  • Received: Jul 4, 2018
  • Accepted: Sep 5, 2018
  • Published: Apr 2, 2019


In this paper, a policy iteration (PI)-based Q-learning algorithm is proposed to solve infinite-horizon linear nonzero-sum quadratic differential games with completely unknown dynamics. The Q-learning algorithm, which employs off-policy reinforcement learning (RL), can learn the Nash equilibrium and the corresponding value functions online, using data sets generated by behavior policies. First, we prove the equivalence between the proposed off-policy Q-learning algorithm and an offline PI algorithm by selecting specific initially admissible policies that can be learned online. Then, the convergence of the off-policy Q-learning algorithm is proved under a mild rank condition that can be easily satisfied by injecting appropriate probing noises into the behavior policies. The generated data sets can be reused throughout the learning process, which makes the algorithm computationally efficient. Simulation results demonstrate the effectiveness of the proposed Q-learning algorithm.
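The rank condition mentioned above is the key to making the off-policy data sets informative: a quadratic value function $x^{\rm T}Px$ is identified from data only if the quadratic regressor built along the trajectory has full column rank, and injecting probing noise into the behavior policy excites the state enough to guarantee this. The following is a minimal illustrative sketch of that idea; the second-order system matrices, the behavior-policy gain, and the probing signal are assumptions chosen for illustration, not taken from the paper.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Illustrative second-order system and stabilizing behavior-policy gain
# (all numerical values are assumptions, not from the paper).
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
K = np.array([[1.0, 1.0]])  # A - B*K is Hurwitz for this choice

def probing(t):
    # sum-of-sinusoids probing noise injected into the behavior policy
    return 0.5 * (np.sin(1.3 * t) + np.sin(7.1 * t) + np.cos(3.7 * t))

def closed_loop(t, x):
    u = -(K @ x)[0] + probing(t)          # behavior policy + probing noise
    return A @ x + B.ravel() * u

ts = np.linspace(0.0, 10.0, 201)
sol = solve_ivp(closed_loop, (0.0, 10.0), [1.0, -1.0], t_eval=ts, rtol=1e-8)
X = sol.y.T                                # sampled states along the trajectory

# Independent entries of x (x) x: the regressor that must have full
# column rank for a quadratic value function to be identifiable.
Phi = np.stack([X[:, 0] ** 2, X[:, 0] * X[:, 1], X[:, 1] ** 2], axis=1)
print(np.linalg.matrix_rank(Phi))          # 3: full column rank
```

Once the rank condition holds, the same data matrix can be reused at every iteration of the learning loop, which is the source of the computational efficiency noted above.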


This work was supported by the National Natural Science Foundation of China (Grant No. 61203078) and the Key Project of Shenzhen Robotics Research Center NSFC (Grant No. U1613225).




    Algorithm 1 Model-based offline PI algorithm

    Step 1: (Initialization) Start with a set of initially stabilizing feedback gains $K_1^1,~\ldots,K_N^1$.

    Step 2: (Policy evaluation) For a given set of stabilizing feedback gains $K_1^l,~\ldots,K_N^l$, solve for the positive definite matrices $P_1^l,~\ldots,P_N^l$ using the following Lyapunov equations:
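In the standard $N$-player LQ formulation with dynamics $\dot{x}=Ax+\sum_{j=1}^{N}B_ju_j$, state weights $Q_i$, and control weights $R_{ij}$ (symbols assumed here, following the usual coupled Lyapunov-iteration setup), the policy-evaluation equations take the form $(A-\sum_{j=1}^{N}B_jK_j^l)^{\rm T}P_i^l+P_i^l(A-\sum_{j=1}^{N}B_jK_j^l)+Q_i+\sum_{j=1}^{N}(K_j^l)^{\rm T}R_{ij}K_j^l=0$ for $i=1,\ldots,N$, followed by the policy-improvement step $K_i^{l+1}=R_{ii}^{-1}B_i^{\rm T}P_i^l$. A minimal numerical sketch of this iteration for a symmetric two-player scalar game (all values are illustrative assumptions) is:

```python
# Model-based offline PI for a two-player scalar LQ nonzero-sum game.
# All numerical values are illustrative assumptions, not from the paper.
A, B1, B2 = 1.0, 1.0, 1.0                 # dx/dt = A x + B1 u1 + B2 u2 (A unstable)
Q1, Q2 = 1.0, 1.0                          # state weights of players 1 and 2
R11, R12, R21, R22 = 1.0, 1.0, 1.0, 1.0    # control weights R_ij

K1, K2 = 1.0, 1.0                          # initially stabilizing feedback gains
for _ in range(50):
    Ac = A - B1 * K1 - B2 * K2             # closed-loop dynamics under current gains
    # Policy evaluation: scalar coupled Lyapunov equations
    #   2*Ac*P_i + Q_i + R_i1*K1^2 + R_i2*K2^2 = 0
    P1 = -(Q1 + R11 * K1**2 + R12 * K2**2) / (2 * Ac)
    P2 = -(Q2 + R21 * K1**2 + R22 * K2**2) / (2 * Ac)
    # Policy improvement: K_i = R_ii^{-1} * B_i * P_i
    K1, K2 = B1 * P1 / R11, B2 * P2 / R22

print(P1, K1)  # symmetric game: P1 = P2 = (1 + sqrt(3))/2, approx. 1.3660
```

At the fixed point the gains satisfy the coupled algebraic Riccati equations of the game; the model-free Q-learning algorithm of the paper recovers the same iteration from data, without using $A$, $B_1$, or $B_2$.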

Copyright 2020 Science China Press Co., Ltd. All rights reserved.