
SCIENTIA SINICA Informationis, Volume 50, Issue 3: 375-395 (2020) https://doi.org/10.1360/SSI-2019-0184

Feasibility of reinforcement learning for UAV-based target searching in a simulated communication denied environment

More info
  • Received: Aug 27, 2019
  • Accepted: Oct 4, 2019
  • Published: Feb 27, 2020

Abstract

Target searching is crucial in real-world scenarios such as search and rescue at disaster sites and battlefield target reconnaissance. Unmanned aerial vehicles (UAVs) are an ideal technical solution for target searching in large-scale and high-risk areas because they are agile, low cost, and able to collaborate and carry different sensors. In complex scenarios such as battlefields, the lack of communication infrastructure and intensive interference mean that UAVs often operate in communication denied environments. As a result, fast and reliable communication channels between UAVs and ground operators are difficult to establish. In such conditions, UAVs must therefore complete tasks autonomously and intelligently, without receiving real-time commands from operators. With the rapid advances in artificial intelligence, reinforcement learning has shown potential for solving continuous decision-making problems. The target searching problem studied in this paper falls into this category and is suitable for reinforcement learning technologies. However, the feasibility of reinforcement learning for UAV-based target searching in communication denied environments is unclear and thus requires in-depth investigation. As a pilot study in this direction, this paper models the target searching problem in communication denied and confrontation situations and proposes a simulation environment based on this model. Extensive experiments are conducted to answer the following questions. (1) Can reinforcement learning be applied to target searching by multiple UAVs in communication denied environments? (2) What are the advantages and disadvantages of different reinforcement learning algorithms in solving this problem? (3) How does the degree of communication denial influence the performance of these algorithms? Mainstream reinforcement learning algorithms are used in the simulations, and the results are analyzed quantitatively, leading to the following observations. (1) Reinforcement learning can effectively solve the target searching problem for multiple UAVs in communication denied environments. (2) Compared with other algorithms, an autonomous decision-making UAV cluster based on a deep Q-network (DQN) exhibits the best problem-solving ability. (3) Algorithm performance changes with the degree of communication denial but remains largely stable as communication conditions vary.
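
The abstract refers to UAV clusters that make decisions with a deep Q-network (DQN). As a purely illustrative sketch (not the authors' implementation), the following Python code outlines a DQN agent of the kind that could control one drone in a simulated grid search environment; the network size, replay buffer, environment API, and hyperparameters are assumptions, and PyTorch is assumed to be available.

# Illustrative DQN agent for a hypothetical grid-world target-search task.
# The environment API, state encoding, and hyperparameters are assumptions.
import random
from collections import deque

import torch
import torch.nn as nn
import torch.optim as optim


class QNet(nn.Module):
    """Small fully connected Q-network mapping a state vector to action values."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, x):
        return self.net(x)


class DQNAgent:
    def __init__(self, state_dim, n_actions, gamma=0.99, lr=1e-3, eps=0.1):
        self.q = QNet(state_dim, n_actions)
        self.target_q = QNet(state_dim, n_actions)
        self.target_q.load_state_dict(self.q.state_dict())
        self.opt = optim.Adam(self.q.parameters(), lr=lr)
        self.replay = deque(maxlen=50_000)  # experience replay buffer
        self.gamma, self.eps, self.n_actions = gamma, eps, n_actions

    def act(self, state):
        # Epsilon-greedy action selection over the drone's local observation.
        if random.random() < self.eps:
            return random.randrange(self.n_actions)
        with torch.no_grad():
            return int(self.q(torch.tensor(state, dtype=torch.float32)).argmax())

    def store(self, s, a, r, s2, done):
        self.replay.append((s, a, r, s2, float(done)))

    def learn(self, batch_size=64):
        if len(self.replay) < batch_size:
            return
        batch = random.sample(self.replay, batch_size)
        s, a, r, s2, d = (torch.tensor(x, dtype=torch.float32) for x in zip(*batch))
        # TD target computed from the periodically synchronized target network.
        with torch.no_grad():
            target = r + self.gamma * (1 - d) * self.target_q(s2).max(dim=1).values
        pred = self.q(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(pred, target)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()

    def sync_target(self):
        # Typically called every few hundred learning steps in DQN training loops.
        self.target_q.load_state_dict(self.q.state_dict())

In a multi-UAV setting, one such agent (or a shared policy) would be instantiated per drone and trained against the simulated opponent, using rewards such as those listed in Table 1 below.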


Funded by

Science and Technology Innovation 2030 Major Project on "New Generation Artificial Intelligence" (2018AAA0102302)

Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing University



  • Figure 1

    (Color online) Simulation environment. (a) Overview of battlefield simulation environment; (b) partial enlargement

  • Figure 2

    (Color online) RQ1 results. (a) Mean number of targets acquired; (b) mission complete rate; (c) mean mission complete time

  • Figure 3

    RQ3 results. (a) Mission complete rate; (b) mean mission complete time; (c) mean number of targets acquired

  • Table 1   Reward settings (see the code sketch after this list)
    Environmental feedback                  Reward
    Finding a target                        1000
    Destroying a drone on the enemy side    125
    A drone being destroyed                 -125
    Moving a step                           -1
  • Table 2   RQ1 experiment settings
    Number   Algorithm (red)   Algorithm (blue)
    1        Random walk       Random walk
    2        DQN               Random walk
    3        L-QL              Random walk
    4        A3C               Random walk
    5        DPPO              Random walk
  • Table 3   RQ2 experiment settings
    Number   Algorithm (red)   Algorithm (blue)
    1        DQN               L-QL
    2        DQN               DPPO
    3        L-QL              DPPO
  • Table 4   Score settings (see the code sketch after this list)
    Mission result                 Score
    One target acquired            1
    Both targets acquired          2
    No target acquired             0
    Each drone being destroyed     -0.1
  • Table 5   Victory rate (VR) (%)
             DQN       L-QL      DPPO
    DQN      –         72.2      90.1
    L-QL     14.9      –         42.7
    DPPO     5.1       28.4      –
  • Table 6   Mission complete rate (MCR) (%)
             DQN       L-QL      DPPO
    DQN      –         49.1      72.0
    L-QL     4.6       –         20.4
    DPPO     0.7       2.6       –
  • Table 7   Mission complete time steps (mean MCT)
             DQN       L-QL      DPPO
    DQN      –         358.56    297.79
    L-QL     484.61    –         613.55
    DPPO     905.14    683.58    –
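
For concreteness, the per-step rewards of Table 1 and the per-episode score of Table 4 can be written as small functions. This is a minimal sketch with event names assumed for illustration; it is not taken from the paper's implementation.

# Illustrative encoding of Table 1 (per-step reward) and Table 4 (episode score).
# The event names are assumptions made for this sketch.

STEP_REWARDS = {
    "target_found": 1000,          # finding a target
    "enemy_drone_destroyed": 125,  # destroying a drone on the enemy side
    "own_drone_destroyed": -125,   # losing one of our own drones
    "step": -1,                    # cost of moving one step
}


def step_reward(events):
    """Sum the rewards for all events observed in one simulation step."""
    return sum(STEP_REWARDS[e] for e in events)


def episode_score(targets_acquired, drones_lost):
    """Episode score per Table 4: 1 point per acquired target (at most 2),
    minus 0.1 for each drone destroyed on the scoring side."""
    return min(targets_acquired, 2) - 0.1 * drones_lost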
