
Multi-Agent Reinforcement Learning with Information-sharing Constrained Policy Optimization for Global Cost Environment

Bibliographic Details
Published in: IFAC-PapersOnLine 2023-01, Vol. 56 (2), p. 1558-1565
Main Authors: Okawa, Yoshihiro, Dan, Hayato, Morita, Natsuki, Ogawa, Masatoshi
Format: Article
Language:English
Description
Summary: Multi-agent Reinforcement Learning (MARL) is a machine learning method that solves problems by using multiple learning agents in a data-driven manner. Because it can utilize multiple agents simultaneously, MARL has become an efficient solution to large-scale problems in a wide range of fields. However, as with general single-agent reinforcement learning, MARL requires trial and error to acquire appropriate policies for each agent during learning. How to guarantee performance and constraint satisfaction in MARL is therefore a critical issue for its application to real-world problems. In this study, we propose an Information-sharing Constrained Policy Optimization (IsCPO) method for MARL that guarantees constraint satisfaction during learning. Specifically, IsCPO sequentially updates the policies of multiple agents in random order while sharing with the next agent the surrogate costs and KL divergence used to evaluate the current and updated policies. In addition, if no candidate policy update complies with the shared information, IsCPO skips updating the policies of the remaining agents until the next iteration. As a result, IsCPO acquires individual suboptimal policies for the agents while satisfying constraints on global costs that depend on the state of the environment and the actions of multiple agents. We also introduce a practical algorithm for IsCPO that simplifies its implementation by adopting several mathematical approximations. Finally, we show its validity and effectiveness through simulation results on a multiple cart-pole problem and a base station sleep control problem in a mobile network.
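
The sequential update scheme summarized above can be sketched in a few lines of Python. This is a minimal illustration under assumed interfaces: the Agent class, its solve_trust_region_step method, and the Candidate record are hypothetical stand-ins for the paper's CPO-style constrained subproblem, not the authors' implementation.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Candidate:
    policy: object         # updated policy parameters
    surrogate_cost: float  # estimated contribution to the global surrogate cost
    kl: float              # KL divergence from the agent's current policy

class Agent:
    """Hypothetical agent interface (not from the paper); the step method
    would wrap a CPO-style trust-region subproblem for this agent."""
    def solve_trust_region_step(self, max_surrogate_cost: float,
                                max_kl: float) -> Optional[Candidate]:
        raise NotImplementedError
    def set_policy(self, policy: object) -> None:
        raise NotImplementedError

def iscpo_iteration(agents: Sequence[Agent],
                    cost_limit: float, kl_limit: float) -> None:
    """One IsCPO iteration: update agents one by one in random order,
    passing the remaining surrogate-cost and KL budgets to the next agent."""
    remaining_cost = cost_limit  # shared budget on the global surrogate cost
    remaining_kl = kl_limit      # shared budget on total policy divergence
    for agent in random.sample(list(agents), len(agents)):
        candidate = agent.solve_trust_region_step(
            max_surrogate_cost=remaining_cost,
            max_kl=remaining_kl,
        )
        if candidate is None:
            # No feasible candidate under the shared information: skip the
            # remaining agents until the next iteration, as IsCPO prescribes.
            break
        agent.set_policy(candidate.policy)
        # Share what this update consumed with the next agent in the order.
        remaining_cost -= candidate.surrogate_cost
        remaining_kl -= candidate.kl
```

Because each agent subtracts its consumed budget before the next agent updates, the global cost constraint holds across the whole iteration rather than per agent, which is the point of the information sharing.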
ISSN: 2405-8963
DOI: 10.1016/j.ifacol.2023.10.1854