International Journal of Advanced Robotic Systems

Game Theory Models for Multi-robot Patrolling of Infrastructures

Keywords: Multi-Robot Patrolling, Game Theory


Introduction
Domains where distributed surveillance, inspection, or control are required are candidates for being secured through patrolling tasks, usually performed by walking throughout the area at regular intervals (Abate, 1996), (Almeida et al., 2004). Current security system solutions are mostly predictable and inflexible. Additionally, since they are controlled by human operators, their performance can be affected by limitations such as boredom, distraction, or fatigue. Furthermore, in some environments, people must deal with hazardous conditions. As a consequence, it is important to improve the security elements used in these types of systems so that they assist human beings in dangerous scenarios such as mine clearing or search and rescue operations. Humans are then able to perform other types of high-level tasks, e.g., monitoring the system from a safe location (Oates et al., 2009).

Recently, new research efforts have arisen that try to solve some of the challenges related to the automation of security tasks by using mobile robots (Everett, 2003). Mobile robots aim to perform useful tasks that a human either cannot, or would prefer not to, do; moreover, the robot should ideally do them better, cheaper, safer, and more reliably. Security systems that utilize mobile robots in these types of applications have many advantages, e.g., they do not experience human limitations. However, some tasks are so complex that a single robot cannot achieve good results, especially in the presence of uncertainties, incomplete information, and distributed control. To overcome these challenges, Multi-Robot Systems can be used. They are characterized as a set of homogeneous or heterogeneous robots operating in the same environment using cooperative behaviors (Farinelli et al., 2004).
In this paper, new collaborative multi-robot approaches for infrastructure security applications at critical facilities are explored. The work focuses on area patrol, i.e., the activity of going throughout an area. Thus, given a set of robots and a set of points of interest, the patrolling problem consists of constantly visiting these points at irregular time intervals for security purposes. This problem has been formulated using concepts from Graph Theory to represent the environment, where nodes stand for specific locations of interest and edges for possible paths between them. In this representation, each path has a cost that represents the time required to go from one node to another. The main advantage of this representation is its generality: it can easily be applied in other domains, e.g., computer networks or distributed coverage. Additionally, a wide variety of problems, such as cleaning or surveillance, may be reformulated as particular patrolling tasks. Since the patrolling problem seeks to maximize the number of visits to each node in a given environment, a good patrolling strategy must reduce the time lag between two visits to the same location (Chevaleyre, 2004).
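The graph formulation above can be illustrated with a minimal sketch. The class and method names below are hypothetical, chosen only for illustration; they are not the authors' implementation.

```python
# Minimal sketch of the environment representation: an undirected weighted
# graph where nodes are points of interest and edge weights are travel costs.
class PatrolGraph:
    def __init__(self):
        self.adj = {}  # node -> {neighbor: travel cost}

    def add_edge(self, u, v, cost):
        # Undirected: register the edge in both directions.
        self.adj.setdefault(u, {})[v] = cost
        self.adj.setdefault(v, {})[u] = cost

    def neighbors(self, u):
        # Available actions at node u: move to any adjacent node.
        return self.adj[u]

g = PatrolGraph()
g.add_edge("A", "B", 3)
g.add_edge("B", "C", 2)
g.add_edge("A", "C", 4)
print(g.neighbors("B"))  # {'A': 3, 'C': 2}
```

At each node, the set of incident edges doubles as the robot's action set, which is what lets the normal-form games of section 3 be defined per node.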
The main contributions of this work are summarized as follows. An analysis of the behavior of game theory models in the context of the multi-robot patrolling problem is presented. A dynamic and distributed solution has been developed to solve the aforementioned problem. A novel decision-making rule has been defined; this rule encourages robot dispersion, i.e., at each point of interest, each robot chooses a different available set of actions. A demonstration of how multiple-robot interaction arises from the definition of multiple games at each point of interest has been described. Finally, a detailed study of the behavior of the parameters of the implemented models is provided.
The rest of this paper is organized as follows. Section 2 briefly describes related work. Section 3 gives definitions of game theory and introduces the problem. Section 4 shows the implemented models in order to solve the patrolling problem. Section 5 presents the evaluation and experimental results. Finally, section 6 summarizes the obtained results.

Related work
The multi-robot patrolling problem has received much attention in recent years, especially in works that develop algorithms to coordinate decision-making among robots (Portugal and Rocha, 2011). These works have implemented different principles, such as reinforcement learning; negotiation methods (Hwang, 2009); swarm optimization (Glad and Buffet, 2009); cycle and partitioning strategies (Chevaleyre, 2004); and adaptive solutions (Sempé and Drogoul, 2003). A description of all of them can be found in a recent survey (Portugal and Rocha, 2011). Beyond this survey, the multi-robot patrolling problem was tackled in (Ahmadi and Stone, 2006). In that work, the problem was called Continuous Area Sweeping and was solved with an area-partitioning method. Moreover, in (Aguirre et al., 2011), multi-robot patrolling is applied to patrol national borders; elements of game theory as well as Monte Carlo simulation are used to solve the problem via genetic algorithms. Another work that utilizes game theory principles is described in (An et al., 2012), which presents solutions to competitive or zero-sum games for the protection of critical infrastructure via Stackelberg games.
Among all these works, three are directly related to this work. The pioneering work on the multi-robot patrolling problem was carried out by (Machado et al., 2003). In that work, the authors defined an evaluation criterion based on idleness, i.e., the time that a place remains unvisited. Thus, Total Idleness is defined as the average of the idleness of all places in a given environment. Since this criterion is widely used in the literature, it was used to measure the performance of the methods proposed in this work. Moreover, the problem of generating a patrol path inside a target area was tackled in (Elmaliach et al., 2007). The algorithm applied to generate this patrol path is called Cycle, and it guarantees that each point is covered with the same optimal frequency. The solution presented in that work uses the Spanning Tree Coverage method to find a Hamiltonian path of minimal cost. Once a path is obtained, robots are uniformly distributed along it and follow the same patrol route over and over. Thus, uniform frequency of the multi-robot patrolling task is achieved as long as at least one robot continues working properly. Moreover, the authors present criteria based on frequency optimization to evaluate multi-robot patrolling algorithms. Finally, (Portugal and Rocha, 2010) present an algorithm called MSP. This algorithm divides the environment into regions of the same dimension by using a balanced graph-partitioning approach. Each of these regions is assigned to a robot that follows a local patrolling route. The procedure to obtain this patrolling route mainly seeks Euler and Hamilton circuits and paths. However, if such circuits and paths do not exist, the procedure seeks longest paths and non-Hamiltonian cycles. Non-Hamiltonian cycles are selected only when they contain at least half of the vertices of the graph; otherwise, the patrolling route remains the longest path. Since the longest path and the non-Hamiltonian cycle do not contain all vertices of the graph, the procedure adds the remaining vertices to complete the patrolling route. Finally, an inverse-path procedure is used to return to the starting vertex of the route when required.
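The idleness criterion of Machado et al. can be sketched directly from its definition: a node's idleness is the time elapsed since its last visit, and total idleness averages over all nodes. This is an illustrative sketch, not the simulator's code; function names are hypothetical.

```python
# Idleness criterion (Machado et al., 2003), sketched from the definition
# in the text: idleness of a node = time since its last visit.
def instantaneous_idleness(last_visit, t):
    """last_visit: dict mapping node -> time of its last visit; t: current time."""
    return {n: t - tv for n, tv in last_visit.items()}

def total_idleness(last_visit, t):
    # Average of the idleness of all places in the environment.
    idl = instantaneous_idleness(last_visit, t)
    return sum(idl.values()) / len(idl)

last_visit = {"A": 0, "B": 4, "C": 7}
print(total_idleness(last_visit, 10))  # (10 + 6 + 3) / 3 = 6.333...
```

A good patrolling strategy keeps this quantity low over time, which is exactly the evaluation used in section 5.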
Previous literature has demonstrated the effectiveness of methods that implement solutions based on cycles and paths (Chevaleyre, 2004). The good performance of these approaches can be explained by their centralized and explicit coordination scheme (Almeida et al., 2004). However, a centralized solution has several disadvantages, such as lack of scalability in the number of places to protect and susceptibility to single-point failure, due to its unique, and hence vulnerable, control point. In addition, these approaches are deterministic, and therefore not suitable for security purposes due to their predictability.
The present work differs from others in the manner in which the patrolling problem was solved: by implementing learning models from Game Theory. The theory of learning in games defines equilibrium as the result of dynamic adjustment processes in which players interact for optimality over time in repeated normal-form games. Thus, they compute their myopic best response based on the accumulated experience achieved by tracking the previous plays of the other players. The learning model selected in this work to patrol an environment was proposed by Camerer and is called Experience-Weighted Attraction (EWA) (Camerer, 1999). Implementing such adaptive models allows the development of dynamic and distributed solutions, in contrast to several works in the literature.

Concepts from game theory
A brief overview of concepts and definitions from game theory (Fudenberg, 1998) is given in order to clarify the description in the following sections. In this work, an abstract representation of the environment as an undirected weighted graph G has been adopted. This graph is an ordered pair consisting of a set E(G) of edges and a set N(G) of nodes. Each node is a special point of interest that needs to be observed in search of intruders, although it is assumed that such observation is instantaneous. Each edge represents a path, labeled with a number corresponding to a cost proportional to its length.
Thus, given such a graph and a set of robots, the patrolling task consists of visiting at each time step as many nodes as possible in order to minimize the time lag between two visits to the same node. Therefore, each node is not only an environment point of interest to be inspected, but also a point where interaction among agents arises, i.e., each robot at graph node n ∈ N(G) must select, based on the other robots' selections, an appropriate action in order to choose the next node to visit. Taking this interaction into account, normal-form games have been defined at each graph node n ∈ N(G).

Definition 1 (Normal-Form Game)
Formally, a finite n-robot normal-form game Γ consists of a set M = {1, …, n} of robots; for each robot i ∈ M, a set S_i of strategies, with S = S_1 × … × S_n the set of strategy profiles; and, for each robot i ∈ M, a payoff function u_i : S → ℝ assigning a payoff to each strategy profile of the game. A strategy is the criterion taken into account to determine the action to be selected.

Thus, at every time step, a robot i ∈ M reaches a node n ∈ N(G) and plays its corresponding normal-form game Γ. As a consequence, the robot chooses its individual strategy s_i^j ∈ S_i considering the strategies selected by all the other robots. The action related to the chosen strategy leads the robot to the next node. Finally, the interaction among robots arises when each robot sends a message indicating the strategy it selected. A robot can select an action with probability one or by randomizing over the set of available actions according to some probability distribution. Such strategies are called pure and mixed, respectively.

Definition 2 (Pure and Mixed Strategies)
Given a set of available actions, a pure strategy selects a single action with probability one, whereas a mixed strategy is a probability distribution over the available actions, in which Pr(a_i^j) is the probability that action a_i^j will be played by robot i ∈ M. The robots that interact in these types of games choose an action that maximizes their expected payoff considering the actions selected by all the other robots. This is called the best response, and it leads to the central solution concept of game theory, the Nash equilibrium.

In the EWA model, each strategy s_i^j ∈ S_i has a numerical value called attraction, A_i^j(t), which determines the probability of choosing that strategy. Each attraction has an initial value, which is updated each period through the use of two rules. The first rule updates the attractions:

A_i^j(t) = [φ N(t−1) A_i^j(t−1) + (δ + (1 − δ) I(s_i^j, s_i(t))) π_i(s_i^j, s_{−i}(t))] / N(t),

where the decay rate φ depreciates previous attractions, δ weights forgone payoffs, I(·, ·) is an indicator function equal to one when its arguments coincide, and π_i is the payoff function. The second rule updates the amount of experience according to

N(t) = ρ N(t−1) + 1,

where ρ is the experience decay rate. Strategies are then chosen with a logistic stochastic response function,

P_i^j(t+1) = e^{λ A_i^j(t)} / Σ_k e^{λ A_i^k(t)},

where λ is the response sensitivity. With λ = 0 the choice is stochastic (uniformly random), while λ → ∞ yields the best response.
Beyond these rules, specific values of δ, φ, and ρ reduce this general model to special cases such as reinforcement and belief-based models.
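The two EWA rules and the logistic choice function can be sketched as follows. This is an illustrative sketch of the update equations above, not the authors' code; the parameter names (phi, delta, rho, lam) follow the reconstruction in the text.

```python
import math
import random

def ewa_update(A, N_prev, chosen, payoffs, phi, delta, rho):
    """One EWA period. A: dict strategy -> attraction; payoffs: dict
    strategy -> payoff the strategy would earn given the others' play."""
    N = rho * N_prev + 1.0  # experience update rule
    new_A = {}
    for s, a in A.items():
        indicator = 1.0 if s == chosen else 0.0
        # delta weights forgone payoffs; played strategies get full weight.
        weight = delta + (1.0 - delta) * indicator
        new_A[s] = (phi * N_prev * a + weight * payoffs[s]) / N
    return new_A, N

def logit_choice(A, lam, rng=random):
    """Logistic stochastic response: P(s) proportional to exp(lam * A[s])."""
    strategies = list(A)
    z = [math.exp(lam * A[s]) for s in strategies]
    r = rng.random() * sum(z)
    for s, w in zip(strategies, z):
        r -= w
        if r <= 0:
            return s
    return strategies[-1]
```

With lam = 0 every strategy is equally likely, matching the stochastic limit described above; large lam concentrates probability on the highest-attraction strategy.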

Reinforcement Model
In the reinforcement model of EWA, every time step that a robot i ∈ M reaches a graph node n ∈ N(G), it performs three steps. In the first step, the robot selects one of the strategies available at that node. This selection is based on a logistic stochastic response function,

P_i^j(t+1) = e^{λ R_i^j(t)} / Σ_k e^{λ R_i^k(t)},

where λ is the response sensitivity and R_i^j(t) is the reinforcement value of strategy s_i^j. When a reinforcement value is updated, its related strategy is reinforced. Thus, in the second step, once a strategy is selected, only this strategy is reinforced by the received payoff:

R_i^j(t) = φ R_i^j(t−1) + I(s_i^j, s_i(t)) π_i(s_i^j, s_{−i}(t)).

As can be seen, this rule is the result of setting δ = 0 in the EWA model, in which case reinforcements are averages of previous attractions and the incremental reinforcement. The reinforcement values of the strategies available at each node are set to an initial value R_i^j(0). Finally, in the last step, the robot i ∈ M communicates the strategy it selected to the other robots, so that they update the reinforcement value of the strategy selected by robot i at node n ∈ N(G).
Thus, similarly to the behavior of attractions in the EWA model, in the reinforcement case each robot shapes the reinforcement of each strategy by utilizing the aforementioned rules. Algorithm 1 describes the three steps performed in this model.
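The three steps (selection, reinforcement, communication) can be sketched as below. This is an illustration of the model described above, not Algorithm 1 itself; the function names and message format are hypothetical.

```python
import math
import random

def select_strategy(R, lam, rng=random):
    # Step 1: logistic stochastic response over reinforcement values.
    strategies = list(R)
    z = [math.exp(lam * R[s]) for s in strategies]
    r = rng.random() * sum(z)
    for s, w in zip(strategies, z):
        r -= w
        if r <= 0:
            return s
    return strategies[-1]

def reinforce(R, chosen, payoff, phi):
    # Step 2: only the selected strategy is reinforced (the delta = 0 case
    # of EWA); all other reinforcement values are left untouched.
    R[chosen] = phi * R[chosen] + payoff
    return R

def broadcast(robot_id, node, chosen):
    # Step 3: communicate the chosen strategy so the other robots can
    # update their copy of this node's reinforcement values.
    return {"robot": robot_id, "node": node, "strategy": chosen}
```

A usage round at one node: `s = select_strategy(R, lam)`, then `reinforce(R, s, payoff, phi)`, then send `broadcast(i, n, s)` to the team.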

Belief-Based Model
Belief-based models start with the premise that each robot i ∈ M identifies that it is playing a game Γ with other robots, and it forms beliefs about what these robots will play in the future based on its past observations. It then attempts to define dynamic processes that lead to a Nash equilibrium by choosing a best-response strategy that maximizes its expected payoff given its beliefs.
There are different iterative learning rules to form beliefs. One widely used model of learning is the process of Weighted Fictitious Play and its variants, such as Cournot Best-Response Dynamics, which looks back only one play, as opposed to Fictitious Play, which looks back over all previous plays (Brown, 1951). At each time step in the model of Weighted Fictitious Play, each robot i ∈ M chooses its strategies to maximize its expected payoff given its prediction about the distribution over the strategies of the other robots at that time step. Therefore, Weighted Fictitious Play is an instance of model-based learning in which a robot maintains beliefs about its opponents' play. In this learning rule, the initial prior belief that robot i ∈ M assigns to the strategies s_{−i} of the robots −i is governed by an initial weight assigned to each strategy. The initial weight assigned is different for each strategy. This assignment ensures that the subsequent weight updates do not lead to weights with the same value, which avoids selection problems.
The belief that robot i ∈ M assigns to the robots −i playing s_{−i} at time step t is given by normalizing weights that decay over time:

B_i(s_{−i}; t) = w(s_{−i}; t) / Σ_{s'_{−i}} w(s'_{−i}; t),  with  w(s_{−i}; t) = γ w(s_{−i}; t−1) + I(s_{−i}, s_{−i}(t)),

where γ is the belief decay rate and I(·, ·) indicates whether the profile s_{−i} was actually observed. This updating rule can equivalently be defined in terms of previous-period beliefs. As in the case of beliefs, expected payoffs can be expressed as a function of previous-period expected payoffs, which yields

E_i(s_i; t) = Σ_{s_{−i}} B_i(s_{−i}; t) π_i(s_i, s_{−i}).

Finally, the best response of robot i ∈ M in Weighted Fictitious Play is given by

s_i(t+1) = argmax_{s_i ∈ S_i} E_i(s_i; t).
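The belief update, normalization, and best response above can be sketched as follows. This is an illustrative sketch consistent with the reconstructed equations; the function names and the example payoff function are hypothetical.

```python
def update_weights(w, observed, gamma):
    # Decay all weights by gamma; add 1 to the profile actually observed.
    return {s: gamma * wt + (1.0 if s == observed else 0.0)
            for s, wt in w.items()}

def beliefs(w):
    # Beliefs are weights normalized into a probability distribution.
    total = sum(w.values())
    return {s: wt / total for s, wt in w.items()}

def best_response(own_strategies, w, payoff):
    """payoff(s_i, s_minus_i) -> payoff of s_i against profile s_minus_i."""
    b = beliefs(w)
    def expected(si):
        return sum(p * payoff(si, sj) for sj, p in b.items())
    return max(own_strategies, key=expected)

w = update_weights({"L": 1.0, "R": 2.0}, "L", 0.5)   # {'L': 1.5, 'R': 1.0}
payoff = lambda si, sj: 1.0 if si == sj else 0.0      # simple matching game
print(best_response(["L", "R"], w, payoff))           # 'L'
```

With gamma = 0 this collapses to Cournot Best-Response Dynamics (only the last play matters); with gamma = 1 it is classical Fictitious Play over the whole history.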

Experiments and results
In order to evaluate and compare this implementation with other methods, a patrolling simulator developed from pioneering work (Machado et al., 2003) has been used.
Thus, the first experiments aim at analyzing the behavior of these models with different values of their parameters, namely δ, φ, and λ for the EWA model; φ and λ for the reinforcement model; and γ for the belief-based model. To do so, the map shown in figure 1(a) was used, where small unfilled circles stand for nodes or points of interest, lines stand for the edges of the graph, i.e., the paths that robots use to move throughout the map, and big filled circles stand for the robots patrolling the map. In this set of experiments, a group of 20 robots started at node number 22 and patrolled until each node had been visited 256 times.

Figure 2 shows the performance of the evaluated models. In each plot, color intensity, coordinates, and box plots represent the total idleness. Thus, figure 2(a) shows the behavior of the EWA model using six slice planes through volumetric data created from the parameter values. Most notably, these results indicate that, regardless of which map is used, in all cases at least one of the methods presented in this work improves on the MSP algorithm. Taking into account that both the Cycle and MSP algorithms use a centralized and explicit coordination scheme, this improvement in performance is significant. Finally, in 80% of the cases the MSP algorithm does not work due to partitioning problems; Portugal and Rocha (2010) describe the reasons for these problems. It is worth noting that the proposed solution does not suffer from them.

Conclusions
Several dynamic and distributed collaborative multi-robot approaches for security applications at critical facilities have been developed. Thus, a team of robots endowed with patrolling behaviors based on learning models from game theory, as well as a thorough study of such models in the context of the patrolling problem, has been presented. As shown in section 5, a significant improvement in performance was obtained by using the proposed methods with respect to the Cycle and MSP algorithms. Moreover, the distributed characteristics of these models offer solutions with several advantages, such as scalability, modularity, and incremental expandability. Furthermore, the behavior of robots patrolling with the techniques of this work is non-deterministic, which is suitable for security applications, since intelligent intruders can learn patrolling paths and, based on this information, attack the protected system. An evaluation to support this claim is beyond the scope of this work; however, results in (Sak and Wainer, 2008) demonstrate that system protection based on non-static solutions is less susceptible to attack.
Despite the good performance achieved with the implemented models, significant questions remain for future research. Firstly, interference among robots arises when more than one robot utilizes the same edge. In order to avoid this interference, it is necessary to evaluate whether the selected edge is being used by other robots; future research studying the behavior of these methods including this aspect is necessary. Secondly, the metric of the patrolling simulator used to evaluate the performance of the algorithms only includes the idleness of each node; it does not take into account whether an edge connected to that node (in the case that it has more than one) is in use. Thus, an interesting line of future research consists of evaluating the behavior of the algorithms under such a restriction, since it would yield a more secure system. Finally, even though the expected payoff matrix defined here has achieved suitable results, new matrix definitions should be explored.