Consensus, cooperative learning, and flocking for multiagent predator avoidance

Multiagent coordination is highly desirable for a wide variety of tasks. In nature, coordinated flocking is common, often serving to defend against or escape from predators. In this article, a hybrid multiagent system is proposed that integrates consensus, cooperative learning, and flocking control to determine the direction of attacking predators and to learn to flock away from them in a coordinated manner. The system is entirely distributed, requiring communication only between neighboring agents. The fusion of consensus and collaborative reinforcement learning allows agents to learn cooperatively in a variety of multiagent coordination tasks, but this article focuses on flocking away from attacking predators. The flocking results show that the agents are able to flock to a target effectively without colliding with each other or with obstacles. Multiple reinforcement learning methods are evaluated for the task, with cooperative learning that utilizes function approximation for state-space reduction performing the best. The results of the proposed consensus algorithm show that it provides quick and accurate transmission of information between agents in the flock. Simulations validate the proposed hybrid system in both one- and two-predator environments, resulting in an efficient cooperative learning behavior. In the future, this system of using consensus to determine the state and reinforcement learning to learn from it can be applied to additional multiagent tasks.


Introduction Motivation
Multiagent cooperative learning continues to be a major research interest in robotics, with a wide range of applications. [1][2][3] One such application is tracking wildfires with multiple communicating agents so that the fire can be handled before it causes excessive destruction. 4,5 Another is using multiple agents to map the structure of a pipeline whose structural failure could cause extensive damage and financial loss. 6 A further possibility is mapping and exploring unknown environments. 7 This article aims to solve the task of multiagent predator avoidance 8,9 with an intelligent hybrid system.
Most current multiagent research incorporates some form of consensus, 10 movement control, 11 or reinforcement learning, 12 but a combination of all three to achieve an efficient cooperative learning behavior is largely unexplored. Consensus has been used to determine the location of obstacles 13 or to make measurements in a scalar field. [14][15][16] Movement control of multiagent systems can take the form of cooperative path planning, as in the literature. [17][18][19] Alternatively, control can be achieved through flocking in varying formations [20][21][22] to accomplish a variety of tasks. 23,24 Reinforcement learning has been implemented cooperatively in a variety of ways for multiagent environments, such as a GridWorld 25,26 and box pushing. 27 Although these uses of consensus, movement control, and reinforcement learning are valuable in their own right, this article aims to build a more intelligent hybrid system.
In nature, flocking has long been observed in many environments, [28][29][30] with one possible goal being defense from predators, as can be seen in schools of fish. Even simulated environments 31,32 that reward individuals only for their own survival result in flocking-like formations of the agents. Flocking for survival thus has benefits in both natural and simulated environments. The goal of this article is to create a hybrid system that combines consensus, flocking, and multiagent reinforcement learning into one intelligent system that can sense and learn to escape from attacking predators.

Literature review
Flocking background.-Methods of ensuring that agents flock have been proposed and studied in the literature. [33][34][35][36][37] Inspired by the natural flocking of birds and fish, 38 flocking algorithms have been developed. These algorithms allow agents to flock in different patterns in a distributed manner that requires communication only between direct neighbors rather than across the entire flock. The main rules that define flocking are that flockmates maintain a distance from each other, without getting too close or too far apart, and match the velocity of the other flockmates. Reinforcement learning approaches have also been applied to flocking, in which agents individually or cooperatively learn to flock without the use of specific algorithms. 39,40 There are also cases in which reinforcement learning is implemented to teach agents to go to specific targets, with a flocking algorithm used for obstacle avoidance where learning fails. 41 Flocking by itself is not particularly useful if the agents flock directly toward a predator, so this work aims to combine learning with flocking to escape from predators effectively. Other flocking implementations, such as that proposed in the literature, 42 allow agents to flock together with minimal information transfer, but the resulting formation is not ideal for quickly avoiding a predator. Additional flocking features can also be added, such as handling faulty robots, 43 time delays in information transfer between robots, 44 or changes of formation. 45 These are not the focus of this article but could potentially be added in the future.
Reinforcement learning background.-Cooperation is an important part of a flock learning to do a task together. In the literature, 46 agents are not necessarily flocking together, but through cooperation they complete their task effectively. Cooperation is already a large part of flocking algorithms, enabling them to run in a distributed manner, and it must likewise be used to learn effectively in a distributed manner. Simple Q-learning 47 does not achieve the necessary amount of cooperation, so a more cooperative approach 9,48 is required. Unfortunately, as the number of agents increases, the state space grows and requires longer training to effectively learn to flock together to the same destination. For this, function approximation techniques 49,50 are useful. However, in current research, function approximation in combination with cooperative learning is largely unexplored. 51,52

Consensus background.-Partial observability of the state space is another issue for escaping attacking predators. In a school of fish, not all fish can see an attacking predator, yet they manage to utilize flocking to maximize their safety anyway. The same is true for our agents: only agents on the outside of a flock will be able to see an oncoming predator. A method of communicating to the other agents that a predator is approaching is thus necessary. This can be seen as a sort of event-triggered consensus, such as that proposed in the literature. 53 Algorithms have been proposed to allow multiagent systems to reach a consensus on measurements from multiple agents that do not necessarily agree with each other. 54,55 Many uses of consensus, however, are for estimating some kind of measurement. 56,57 This work intends to use consensus in combination with reinforcement learning to make a more intelligent system. Some works, such as by Zhang et al. 51 and Xu et al., 52 implement a hybrid system of consensus and reinforcement learning, but they use consensus to determine the global reward of the system. This work instead uses consensus to determine the state, relying on local rewards.

Contributions
In this article, we combine the benefits of consensus, flocking, and reinforcement learning to create a hybrid system, as shown in Figure 1. This system assumes partial observability in that only agents on the outside of the flock near an approaching predator are able to see the attacking direction of the predator. They must use consensus to inform the rest of the flock about the attacking direction of the predator, which is then used with reinforcement learning to learn a target (a safe place) to move toward. That target is then used by a flocking algorithm to give each agent a control input to move each agent toward the target in a flocking formation.
The contributions of this article are then as follows:

• Utilization of consensus between agents for state approximation in reinforcement learning.
• Cooperative learning with a large number of agents.
• Implementation of function approximation to reduce the state space for a large number of agents.
• Integration of reinforcement learning and flocking to learn where to flock.
• A fast and accurate consensus algorithm for a large number of agents.
• An entirely distributed system.

Paper organization
The organization of the remainder of this article is as follows. In the "Flocking control" section, a method for flocking is introduced. The "Multiagent learning" section covers the multiagent learning that is used to ensure the agents flock together to the same target. The "Consensus for multiagent state approximation" section details a consensus-based method of state approximation for the agents. This is followed by the "Simulation and results" section, which details the combination of flocking, reinforcement learning, and consensus into a hybrid system with which the agents learn to flock away from a predator. Lastly, the "Conclusion and future work" section covers the conclusion with analysis and potential future development.

Flocking control algorithm
In this section, the flocking algorithm used for the hybrid system is presented. To learn to avoid predators, the agents must be able to flock together. Using flocking methodologies presented in the literature, 33 the network topology consists of a graph G = (V, E) with a set of vertices V = {1, 2, . . ., n} and edges E ⊆ {(i, j) : i, j ∈ V, j ≠ i}. In this graph, the agents are the vertices, and the edges are communication links between neighboring agents. Modeling the agents as particles, the equations of motion are given by

q̇_i = p_i, ṗ_i = u_i (1)

where q_i is the position of agent i, p_i is its velocity, and u_i is its acceleration, which serves as the control input.
The neighbors of an agent are given by

N_i = {j ∈ V : ‖q_j − q_i‖ < r} (2)

where ‖·‖ is the Euclidean norm and r is the interaction range of an agent. Flocking can take many formations, but the one used here is an α-lattice, in which

‖q_j − q_i‖ = d ∀ j ∈ N_i(q) (3)

for a desired distance d, where d = r/k for a scale factor k.
In flocking, each agent determines its control input with a gradient-based term f_i^g given by 33

f_i^g = c_1 Σ_{j∈N_i} φ_α(‖q_j − q_i‖_σ) n_ij (4)

where c_1 is a positive constant, n_ij = σ_ε(q_j − q_i) = (q_j − q_i)/√(1 + ε‖q_j − q_i‖²), and φ_α(·) is a pairwise attractive/repulsive force that maintains the desired distance d between agents. The σ-norm ‖·‖_σ is given by ‖x‖_σ = (1/ε)(√(1 + ε‖x‖²) − 1), which is differentiable everywhere for ε > 0. An obstacle avoidance term f_i^o applies the repulsive part of f_i^g to points on obstacles treated as virtual neighbors N_i^β:

f_i^o = c_1 Σ_{j∈N_i^β} φ_β(‖q_j − q_i‖_σ) n_ij (5)

where φ_β(·) is a purely repulsive force and b_ij(q) = ρ_h(‖q_j − q_i‖_σ / r_σ) is the element at row i and column j of an adjacency matrix over the interval [0, 1) for virtual neighbors j. A velocity consensus term f_i^d is given by

f_i^d = c_2 Σ_{j∈N_i} a_ij(q)(p_j − p_i) (6)

where a_ij(q) = ρ_h(‖q_j − q_i‖_σ / r_σ) is an adjacency matrix over the interval [0, 1), and c_2 is a positive constant. ρ_h(·) is a bump function that smoothly varies between 0 and 1. One possible definition is

ρ_h(z) = 1, z ∈ [0, h)
ρ_h(z) = (1/2)[1 + cos(π(z − h)/(1 − h))], z ∈ [h, 1]
ρ_h(z) = 0, otherwise (7)

where h ∈ (0, 1). A navigational term f_i^γ, which determines the direction the agents should move toward, is given by

f_i^γ = −c_1t(q_i − q_t) − c_2t(p_i − p_t) (8)

where c_1t and c_2t are positive constants, and q_t and p_t are the position and velocity of the target, respectively. These terms can be combined to form the control input for each agent,

u_i = f_i^g + f_i^o + f_i^d + f_i^γ (9)

This method allows the agents to flock together in an α-lattice formation toward a target location.
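The control input u_i described above can be sketched in code. The following Python fragment is an illustrative sketch, not the authors' implementation: the obstacle-avoidance term f_i^o is omitted for brevity, φ_α uses a simplified action function, and all constants (EPS, R, D, H, C1, etc.) are assumed demonstration values.

```python
import numpy as np

# Assumed demonstration constants (not values from the article).
EPS, R, D, H = 0.1, 10.0, 7.0, 0.2
C1, C2, C1T, C2T = 1.0, 1.0, 0.5, 0.5

def sigma_norm(z):
    """sigma-norm of a vector: differentiable everywhere for EPS > 0."""
    return (np.sqrt(1.0 + EPS * np.dot(z, z)) - 1.0) / EPS

def bump(z):
    """Bump function rho_h(z), smoothly varying from 1 to 0."""
    if 0 <= z < H:
        return 1.0
    if H <= z <= 1:
        return 0.5 * (1.0 + np.cos(np.pi * (z - H) / (1.0 - H)))
    return 0.0

def phi_alpha(z):
    """Simplified attractive/repulsive force holding sigma-distance d."""
    d_sig = sigma_norm(np.array([D]))
    r_sig = sigma_norm(np.array([R]))
    s = z - d_sig
    return bump(z / r_sig) * s / np.sqrt(1.0 + s * s)

def control_input(qi, pi, neighbors, q_t, p_t):
    """u_i = gradient term + velocity consensus term + navigation term."""
    r_sig = sigma_norm(np.array([R]))
    u = np.zeros(2)
    for qj, pj in neighbors:
        z = sigma_norm(qj - qi)
        n_ij = (qj - qi) / np.sqrt(1.0 + EPS * np.dot(qj - qi, qj - qi))
        u += C1 * phi_alpha(z) * n_ij          # gradient term f_i^g
        u += C2 * bump(z / r_sig) * (pj - pi)  # velocity consensus f_i^d
    u += -C1T * (qi - q_t) - C2T * (pi - p_t)  # navigational term f_i^gamma
    return u
```

In practice the same control law would be evaluated for every agent at each time step, using only that agent's neighbor positions and velocities.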

Results of flocking algorithm
The results of this flocking algorithm (u_i) can be seen in Figure 2, where the agents flock to the green dot without an obstacle, and in Figure 3 with an obstacle. The agents are initialized randomly over a 120 × 120 area and flock toward the green target. The blue lines represent communication links between agents. It can be seen that the agents maintain their distance from each other without drifting too far apart, eventually converging to an α-lattice formation. In the case of an obstacle, the agents manage to avoid colliding with it. This flocking algorithm thus provides a viable method of escaping from predators and also provides a communication structure between agents that can be used for the communication required to learn cooperatively.

Multiagent learning
In this section, an entirely decentralized reinforcement learning method for a network to learn to flock together to specified targets is presented. Independent and cooperative learning methods are presented. In addition to this, a method of cooperative learning with function approximation is evaluated against standard cooperative learning.

Learning model
The model of the learning algorithm is similar to that proposed in the literature. 9 Using a state, action, and reward model for an agent i, let the current state, action, and reward be s_i, a_i, and r_i, with the next state and next action as s′_i and a′_i, respectively.

Reward.-The flocking algorithm used provides flocking in an α-lattice formation. This formation ensures that agents on the inside of the formation have up to six neighbors, while agents on the outside have one to five neighbors. To match this formation, the reward is defined as

r_i = |N_i| · D_r, |N_i| < 6
r_i = 6 · D_r, otherwise (10)

so that the maximum reward an agent can get is 6, if it has all six neighbors, to encourage flocking.
The reward is then scaled depending on the direction of the predator. The scale factor D_r is split into five categories, consisting of the best target, good targets, average targets, bad targets, and the worst target, which can be visualized in Figure 4. Agents choosing the action corresponding to the best target receive the full reward defined in equation (10). Actions corresponding to good targets are scaled down to 75% of that reward, average targets to 50%, bad targets to 25%, and the worst target to 0%. This encourages the agents to learn the optimal target to move toward while maintaining the importance of flocking together. The addition of more predators scales the reward multiplicatively. For example, if there are two predators and an agent chooses an action that is a good direction with respect to both of them, the reward is scaled to 75% twice. This would fail if there were eight predators with one in each direction, but in that case there is no safe space for the agents to go.
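As a sketch, the capped reward of equation (10) and the multiplicative directional scaling can be written as follows. The `reward` function and category names are illustrative; only the scale percentages are taken from the text.

```python
# Scale factors per target category, as described in the text.
SCALE = {"best": 1.0, "good": 0.75, "average": 0.5, "bad": 0.25, "worst": 0.0}

def reward(num_neighbors, action_categories):
    """Base reward capped at six neighbors, scaled once per predator.

    action_categories: one category per predator for the chosen action,
    e.g. ["good", "good"] for two predators.
    """
    d_r = 1.0
    for cat in action_categories:
        d_r *= SCALE[cat]  # multiplicative scaling, one factor per predator
    return min(num_neighbors, 6) * d_r
```

For two predators with the chosen action rated "good" for both, the reward is 6 × 0.75 × 0.75 = 3.375, matching the multiplicative scaling described above.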

Cooperative learning
To learn to flock to the same target together, a cooperative learning method is implemented. Agents learning independently in this environment will take many learning episodes to converge, or never converge at all, as can be seen in the literature. 9 To learn cooperatively, each agent first performs independent learning 47 on its individual table Q_i as follows:

Q_i(s_i, a_i) ← Q_i(s_i, a_i) + α[r_i + γ max_{a′} Q_i(s′_i, a′) − Q_i(s_i, a_i)] (11)

where α is a learning rate and γ is a discounting factor. This independent learning is not capable of converging in any reasonable amount of time for this application, so cooperative learning must be used. After performing independent learning, the Q-table of each agent is further updated by communicating with its neighbors using the following 9 :

Q_i(s_i, a_i) ← w Q_i(s_i, a_i) + (1 − w)/|N_i| Σ_{j∈N_i} Q_j(s_j, a_i) (12)

where w is a weight, 0 ≤ w ≤ 1, that determines how much an agent should trust its neighbors versus itself: w = 1 means the agent trusts only itself, and w = 0 means the agent trusts only its neighbors. The weight can be a static value, but in this application it is defined as w = 1/(|N_i| + 1) so that each agent trusts every other agent equally. Dividing the sum by |N_i| is required so that, over the course of learning, the Q-values do not converge to infinity too quickly. Note that the update from the neighbors is based on each neighbor's state s_j and the agent's own action a_i.
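A minimal sketch of the independent and cooperative updates described above, assuming tabular Q arrays indexed by (state, action); the parameter values and shapes are illustrative, not the article's configuration.

```python
import numpy as np

# Assumed demonstration hyperparameters.
ALPHA, GAMMA = 0.1, 0.9

def independent_update(Q_i, s, a, r, s_next):
    """Standard Q-learning step on agent i's own table."""
    td = r + GAMMA * np.max(Q_i[s_next]) - Q_i[s, a]
    Q_i[s, a] += ALPHA * td

def cooperative_update(Q_i, s, a, neighbor_tables, neighbor_states):
    """Blend agent i's value with neighbor values for the same action a;
    each neighbor j contributes from its own state s_j."""
    n = len(neighbor_tables)
    w = 1.0 / (n + 1)  # equal trust in self and each neighbor
    neighbor_sum = sum(Q_j[s_j, a]
                       for Q_j, s_j in zip(neighbor_tables, neighbor_states))
    Q_i[s, a] = w * Q_i[s, a] + (1 - w) * neighbor_sum / n
```

With w = 1/(n + 1), the update reduces to averaging the agent's own value with each neighbor's value equally, which matches the equal-trust weighting described above.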

Action selection
The action selection of an agent is based on the maximum Q-value approach, 48,58 in which the action with the highest Q-value for a given state is chosen. This method of choosing the action is highly exploitative, with no exploration. To introduce exploration, we use ε-greedy. 47 With a small probability 0 ≤ ε_g ≤ 1, the highest Q-value is ignored and an action is instead selected at random. This can be modeled as

a_i = a_max ∈ A_i, if ε_g < random(0, 1)
a_i = a_random ∈ A_i, otherwise (13)

This random action selection allows an agent to explore a new action that might return a higher reward. The same action selection can be used for function approximation learning, replacing the Q-value with the θ parameter vector.
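A minimal ε-greedy selector matching the rule above; representing the current state's Q-values as a list is an assumption for illustration.

```python
import random

def select_action(q_values, eps_g=0.1):
    """epsilon-greedy: explore with probability eps_g, else take argmax."""
    if random.random() < eps_g:                # explore: random action
        return random.randrange(len(q_values))
    # exploit: action with the highest Q-value in this state
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Setting eps_g = 0 recovers pure greedy selection; eps_g = 1 gives uniform random actions.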

Function approximation
Although the cooperative Q-learning algorithm performs better than independent learning, as seen in the literature, 9 it can still be improved to obtain better results and faster convergence. The direction of the predator dir_p is already discretized into eight directions; however, the number of neighbors |N_i| grows with the number of agents used. Due to the random initialization of agents at the start of each episode, it is possible for every agent to be a neighbor of every other agent. As the episode progresses and the α-lattice formation is achieved, however, this state takes one of seven values: no neighbor, or one to six neighbors. This state size is not particularly large, but the Q-values for higher neighbor counts should ideally still be found to ensure smooth flocking to the target. A radial basis function (RBF) method of function approximation is used to achieve quicker learning. A fixed sparse representation method was also explored, but RBF was found to perform better. The state, action, and reward representations remain the same, but the number of neighbors |N_i| is now approximated using RBF.

Radial basis function.
Function approximation makes it possible to approximate a state space rather than simply discretize it.
There are many methods of function approximation, but the method most applicable here was a simple RBF approach. The RBF scheme maps the original Q-table to a parameter vector θ as 49,50

Q_i(s_i, a_i) = Σ_l φ_l(s_i, a_i) θ_{i,l} = φ^T(s_i, a_i) θ_i (14)

where the RBF kernel φ is a column vector of length l · |A|. The output of the l-th RBF kernel is given as

φ_l(s) = exp(−‖s − s_l‖² / (2μ_l²)) (15)

where s is the current state, s_l is the center of RBF kernel l, and μ_l is the radius of RBF kernel l, producing the shape of a Gaussian bell. A larger μ_l thus produces a flatter RBF.
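The Gaussian RBF feature map can be sketched as follows, with example kernel centers and radii (not the article's values) over a one-dimensional neighbor-count state; the per-action parameter layout is an assumption for illustration.

```python
import numpy as np

# Example kernel centers and radii over a neighbor-count state; assumed
# demonstration values, not the article's configuration.
CENTERS = np.array([0.0, 2.0, 4.0, 6.0])  # s_l
RADII = np.full(4, 1.5)                   # mu_l

def rbf_features(s):
    """phi_l(s) = exp(-||s - s_l||^2 / (2 mu_l^2)): a Gaussian bell per kernel."""
    return np.exp(-((s - CENTERS) ** 2) / (2.0 * RADII ** 2))

def q_value(s, a, theta):
    """Q(s, a) = phi(s)^T theta_a, with one parameter slice per action."""
    return rbf_features(s) @ theta[a]
```

A state lying exactly on a kernel center activates that kernel fully (value 1), with neighboring kernels contributing smaller, smoothly decaying values.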

Function approximation learning.
The cooperative learning algorithm from equations (11) and (12) is still used to learn; however, it is modified to operate on the parameter vector θ and RBF kernel φ from equation (15) rather than on Q-values. 49 The independent part of RBF learning is then given as

θ_i ← θ_i + α[r_i + γ max_{a′} φ^T(s′_i, a′) θ_i − φ^T(s_i, a_i) θ_i] φ(s_i, a_i) (16)

with the same learning rate α and discount factor γ as before. The cooperative portion is

θ_i ← w θ_i + (1 − w)/|N_i| Σ_{j∈N_i} θ_j (17)

The key difference is that each agent must now communicate a θ vector rather than its Q-table, and θ is approximately one-sixth of the original size. In a state space in which the direction of predators is ignored (|dir_p| = 1), the Q-table size is 2 × 10⁴ while θ is 3.2 × 10³. The results of learning in this state space are shown in Figure 5. Clearly, cooperative learning with function approximation performs significantly better than without, in both space and training time required. Because of this large gap in learning effectiveness, only cooperative learning with function approximation is used for the larger state space, where |dir_p| = 8.
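The parameter-space updates described above can be sketched as a semi-gradient TD step followed by neighbor averaging; the generic feature function `phi`, the per-action θ rows, and all constants are illustrative assumptions.

```python
import numpy as np

# Assumed demonstration hyperparameters.
ALPHA, GAMMA = 0.1, 0.9

def rbf_update(theta, phi, s, a, r, s_next, actions):
    """Semi-gradient step: move theta[a] along phi(s) by the TD error."""
    q_next = max(phi(s_next) @ theta[a2] for a2 in actions)
    td = r + GAMMA * q_next - phi(s) @ theta[a]
    theta[a] += ALPHA * td * phi(s)

def cooperative_theta(theta_i, neighbor_thetas):
    """Blend parameter vectors instead of whole Q-tables."""
    n = len(neighbor_thetas)
    w = 1.0 / (n + 1)  # equal trust in self and each neighbor
    return w * theta_i + (1 - w) * sum(neighbor_thetas) / n
```

Because only θ is exchanged, each communication round transmits the compressed parameterization rather than a full table, which is the space saving noted above.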

Consensus for multiagent state approximation
In this section, a way of sensing and communicating the direction of a predator is presented. Each agent has a predator sensing radius r_p within which it can sense a predator. If a predator is within that radius, the agent knows its relative angle to the predator. These angles can be used to determine the direction the predator is coming from. However, not every agent will be in range of the predator to see the direction it is coming from, and the agents that are in range will not necessarily agree on that direction.
To solve this, a weighted voting method is introduced for agents to share and achieve a consensus on the direction of the predator.

Predator sensing
A method of agents achieving consensus proposed in the literature 56 is used as a starting point for the weighted voting procedure. The algorithm is split into two components: a measurement step and a consensus step. For the measurement step, if an agent is in range of the predator, it measures the relative direction of the predator. As mentioned previously, the direction is discretized into eight evenly split directions. These directions are assigned to an information vector info_i as

info_i = d, if w_p falls within the 45° sector centered on direction d, d ∈ {1, …, 8} (18)

where w_p is the angle between the agent and the predator, 0 ≤ w_p ≤ 360. These directions are visualized in Figure 6. The directions do not have to be symmetric or positioned as they are here, and there can be more or fewer directions as desired. For this application, however, eight evenly divided directions were chosen so that each is 45° wide (360/8 = 45) and aligned to the cardinal directions.
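Assuming the sector layout described above, with direction 1 centered on 0° and numbering increasing counterclockwise, the discretization can be written compactly. The closed-form indexing is our reading of the sector boundaries, not code from the article.

```python
def direction_of(w_p):
    """Map a relative predator angle (degrees, 0-360) to a direction 1..8.

    Direction 1 covers [-22.5, 22.5) around 0 degrees; each sector is
    45 degrees wide, increasing counterclockwise.
    """
    return int(((w_p + 22.5) % 360) // 45) + 1
```

For example, an angle of 10° falls in direction 1, while 45° lands exactly on the center of direction 2.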
Each agent's information vector is assigned a weight, or belief factor, weight_{i,d}. This weight matrix is of size number of agents by number of directions, n · |dir_p|, and is determined from an agent's measurement by equation (19), where w_m is the middle angle of the measured direction, info_i + 1 is the next direction counterclockwise, and info_i − 1 is the next direction clockwise. For example, if info_i = 1, then info_i − 1 = 8. The scale factor (w_m − w_p + 45)/45 splits the distance weight 1 − ‖q_p − q_i‖/r_p between two adjacent directions of the weight vector. This scale factor is between 0 and 1 and is determined by where the measured angle lies relative to the center of the direction. For example, the center of direction one, w_m, is 0°, so a measurement of exactly 0° places all of the weight in direction one, while a measurement near a sector boundary splits the weight almost evenly with the adjacent direction.

Consensus
For consensus, each agent updates its information info_i and weight weight_{i,d} based on its neighbors N_i. The goal is for all agents to agree on the same info_i, and for that info_i to be as accurate as possible, thus achieving consensus on the direction of the predator. To do this, a weighted voting method is implemented, in which the weights of an agent and its neighbors are summed into the weighted direction vector weight_i, such that weight_i ← weight_i + Σ_{j∈N_i} weight_j. Then info_i is set to the direction with the maximum weight, info_i = argmax_d(weight_{i,d}). The weight and information are updated for all agents in this manner for a set number of iterations c_m, and then the weight of each agent is replaced by the maximum weight among itself and its neighbors, weight_i = max(weight_{N_i} ∪ weight_i). Sharing the maximum weight after the set number of iterations allows all agents to converge quickly to the same predator direction. The measurement and consensus steps are combined in Algorithm 2.
Using this algorithm, info i is found for each agent, and given enough iterations, the proposed consensus will converge to the same value for all agents. This value is used to determine the state dir p in the reinforcement learning component.
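The voting and max-weight-sharing steps of Algorithm 2 can be sketched as follows. This is a simplified reading, not the article's code: agent weights are rows of an n × 8 array, a single max-sharing round follows the c_m voting rounds, and "maximum weight" is interpreted as adopting the entire weight vector of the neighborhood member with the largest entry.

```python
import numpy as np

def consensus(weights, adjacency, c_m):
    """Weighted-voting consensus on the predator direction.

    weights:   (n, 8) array, one belief vector per agent
    adjacency: list of neighbor index lists, one per agent
    c_m:       number of weighted voting iterations
    Returns the direction (1..8) each agent settles on.
    """
    n = weights.shape[0]
    for _ in range(c_m):
        new = weights.copy()
        for i in range(n):
            for j in adjacency[i]:
                new[i] += weights[j]  # sum neighbor weight vectors
        weights = new
    # Max-weight sharing: adopt the strongest belief in the neighborhood.
    shared = weights.copy()
    for i in range(n):
        best = max([i] + adjacency[i], key=lambda j: weights[j].max())
        shared[i] = weights[best]
    return shared.argmax(axis=1) + 1  # directions numbered 1..8
```

In a three-agent line where only one end agent sensed the predator, a single voting round plus max-sharing is enough for the far agent to adopt the sensed direction.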

Validation

Algorithm 2 is tested in an environment where 50 agents flock to a static position while a predator, denoted by a large red circle, moves in a circle around the flock over 900 iterations, as can be seen in Figure 7. Visually, the agents, represented by triangles, change color according to the direction in which they perceive the predator after consensus. The consensus component was allowed to run for 20 consensus iterations, and all agents were found to converge to the same info_i within that duration. The number 20 was chosen as a deliberately large value to ensure that the agents have enough iterations to achieve a consensus. The average convergence time for each of the 900 iterations of one run can be seen in Figure 8. Figure 9 compares the state found through consensus to the actual state relative to the center of mass of the flock. The consensus is not always perfectly accurate, which can be attributed to the lack of full observability and the lack of symmetry in the flock. The algorithm was always able to fully converge for varying values of c_m, as can be seen in Table 1. However, if the maximum-weight sharing step is skipped by setting c_m to an arbitrarily high value, the algorithm will not fully converge even given up to 40 iterations, as can be seen in Figures 10 and 11.
From Table 1, we can see that if the largest measured weight is spread from one side of the flock to the other, c m = 0, it takes 8.2 iterations to completely reach every agent in the flock.
That number of iterations to achieve a consensus cannot be made smaller due to the network communication limitations and the size of the flock. Each additional weighted voting iteration c_m adds correspondingly more iterations to converge. It is also clear that adding more weighted voting iterations does not necessarily increase the accuracy of the consensus, as can be seen by comparing c_m = 2 and c_m = 3. Letting c_m = 10 yields about a 1.5% increase in accuracy compared with c_m = 2, but at a much higher computational cost. The error in accuracy can be attributed to the formation of the flock, as can be seen in Figure 12. Despite the predator being east of the flock as a whole, the agents converge to northeast because only one agent is in range of the predator to sense it. Using these data in the hybrid system going forward, we let c_m = 2 and let the number of consensus iterations be 12 to allow some margin of error to account for a poor flocking structure.

Algorithm 2.
Consensus on direction of predator.
Among the alternate consensus methods tested, one was able to achieve a higher accuracy. This method uses consensus to determine the center of mass of the flock and the absolute position of the predator; once each agent has the center-of-mass position and the predator position, it can determine the direction of the predator for itself. However, this approach is not used, for a few reasons. First and foremost, the time it takes to reach a consensus is at least two to three times the number of iterations of the proposed approach, making it much slower both to learn and to use practically for predator detection. Second, it requires that each agent have an absolute coordinate for itself and for the predator, rather than the relative directions needed only by the agents in range of the predator. In many environments, such coordinates may be unknown or heavily corrupted by noise, so this approach was not used.

Simulation and results
In this section, we go over the implementation details and results of the hybrid learning system. We use consensus to determine the direction of the predator dir_p. This state is then used in the multiagent learning for the θ table. The multiagent learning produces an action, which is a target to flock toward, that is used by the flocking algorithm. Finally, the flocking algorithm produces a control input for each agent to flock to its chosen target. Using this system, we can teach agents to detect and flock away from predators. The simulation was developed using MATLAB. Each learning episode lasts a time duration long enough for the agents to flock to the determined target, and the number of iterations within each episode is determined by the time step. A smaller time step gives more iterations per episode but faster learning, due to more communications occurring over the episode. Smaller time steps also provide smoother flocking. Each agent is represented by plotting its position as a triangle at each time step in a two-dimensional MATLAB plot.

Simulation environment
The learning environment is set up in a manner, as shown in Figure 13, where the triangles represent the agents, and the large red circle represents the predator. The eight smaller green circles around the edge represent the eight static targets for the eight actions. Each episode begins by randomly initializing 50 agents in a 120 × 120 area and the predator in one of eight directions. The predator then moves toward the center of mass of the agents. The predator is placed far enough away that the agents will be fully connected to each other but not necessarily in perfect flocking formation by the time the predator gets in range. This is done to ensure that each agent is able to get the direction of the predator through consensus so that an agent does not get left behind due to being initialized too far away from the rest of the agents.

Learning configuration.
For the single-predator environment, the direction of the predator is initially east, then northeast, and so on, in a counterclockwise rotation so that ideally every possible state dir_p is encountered once every eight episodes. The learning is conducted over 56 episodes, and the results can be seen below. For the two-predator environment, the learning is conducted over 216 episodes, for the reason explained below. For the ε-greedy action selection in equation (13), an ε_g value of 0.1 is chosen to allow the agents to explore other actions more quickly while not hindering the smoothness of flocking too severely.

Results
Single direction state.-We first look at a scenario in which all agents are in the same direction state through the use of consensus. The average results of 10 runs over eight episodes can be seen in Figures 14 and 15. The agents fully converge for a single direction state in four episodes. The theoretical number of episodes required to learn is thus 8 directions times 4 episodes, or 32. However, because flocks are not perfect, each direction may not be seen four times within those 32 episodes, so 56 episodes are used for the single-predator environment and 216 for the two-predator environment to account for slower learning due to consensus.
Single predator.-Six runs were conducted and averaged for the single-predator environment. In Figure 16, it can be seen that by episode 32, that is, the predator coming from each direction four times, the agents have mostly converged to the same target. In a few cases, full convergence is not reached until approximately episode 48, or each direction occurring six times, which is expected due to potential consensus inaccuracies.
The position of agents during the first learning episode can be seen in Figure 17. Some agents are in different states and the agents in the same state have not learned to go to the same target yet. This produces the messy flocking shape that can be seen. In Figure 18, the agents are all in the same state and have learned to go to the same target in an α-lattice formation.
In Figure 19, we have the trajectory the agents take in the final learning episode, where the pink triangles are the random initialization of the agents. By the last episode, it can be seen that all agents have converged to the same target flocking in a relatively smooth manner despite the random action selection of ε-greedy.
Two predators.-The system was also tested with two predators, using an expanded information vector to account for the extra predator and, consequently, a larger state space.
The addition of a second predator increases the number of predator starting positions by a factor of eight, from 8 to 64. Besides the longer computation time needed to handle a second predator, eight times as many episodes must now be performed for all direction combinations to be encountered. This cost cannot be reduced by applying function approximation to the learning process without shrinking the problem itself. However, the number of direction combinations can be lowered. Instead of learning the directions of the two predators separately, we treat the predators as interchangeable: detecting the first predator in direction 1 and the second in direction 2 is the same state as detecting the first in direction 2 and the second in direction 1, so the system learns two state combinations simultaneously. The eight combinations in which both predators are detected in the same direction are not reducible. This lowers the number of direction combinations from 64 to 36, reducing the number of episodes, and thus the learning time, from 8 times that of a single predator to 4.5 times. For this reason, the two-predator learning is conducted over 216 episodes; the results of one run are shown in Figure 20. By 144 episodes, that is, all direction combinations having been encountered four times, the agents have almost converged. By 180 episodes, each combination having been seen five times, the agents have completely converged to the same action for each state.
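The interchangeable-predator reduction amounts to mapping each ordered pair of directions to a canonical unordered pair. A minimal sketch (the function name is illustrative), which also reproduces the 64 → 36 count stated above:

```python
def canonical_state(dir1, dir2):
    """Treat the two predators as interchangeable: (1, 2) and (2, 1)
    map to the same learning state."""
    return (min(dir1, dir2), max(dir1, dir2))

# Enumerating all ordered pairs over 8 directions yields 36 distinct
# canonical states: 28 mixed-direction pairs plus 8 same-direction pairs.
states = {canonical_state(a, b) for a in range(8) for b in range(8)}
```

Both orderings of a detection update the same table entry, which is why one encountered episode effectively covers two ordered combinations.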
The movement of the agents flocking away from the two predators, before and after they have learned, can be seen in Figures 21 and 22, respectively; the behavior looks similar to that of the single predator case.

Discussion
In Figure 21, it can be seen that the agents split into two subgroups. This was a fairly common occurrence in unlearned episodes and is caused by the communication between neighbors. In this case, many agents may have found the west direction best while others preferred northwest. Their close neighbors would eventually agree and move with them toward those targets, but not in sufficient numbers to sway the opposing majority. Here, the neighborhoods are visually split between the top half and the bottom half, with agents on the border picking one side or the other depending on the ε-greedy action selection. The best direction is then learned in later episodes, when the agents are randomly repositioned and the neighborhoods mix together.
Another aspect to consider is what happens if the agents surround a predator. In particular, if the agents are distributed evenly around the predator, then the predator lies in no particular cardinal direction relative to the flock as a whole. Currently, there is no check for this condition, but one could easily be added to stop the agents from moving until the predator moves into a position with a well-defined cardinal direction. As implemented, the agents will instead move in either the first or last direction, depending on the implementation, at which point the predator will likely acquire a direction after one time step.
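One way such a check could be sketched, not part of the paper's implementation: fuse the per-agent bearings to the predator into a mean heading, and declare the direction ambiguous when the bearings nearly cancel out (as they do when agents evenly surround the predator). The function name, the `min_resultant` threshold, and the bearing representation are assumptions; the 1 = east, 3 = north numbering follows the paper's information-value convention.

```python
import math

def flock_direction(bearings, min_resultant=0.25):
    """Fuse per-agent bearings (radians, 0 = east) to the predator into
    one of the eight directions (1 = east, 3 = north, ...), or None when
    the agents surround the predator and no dominant direction exists."""
    x = sum(math.cos(b) for b in bearings)
    y = sum(math.sin(b) for b in bearings)
    # Resultant vector length near zero means the bearings cancel out.
    if math.hypot(x, y) < min_resultant * len(bearings):
        return None  # ambiguous: hold position until the predator moves
    angle = math.atan2(y, x) % (2 * math.pi)
    # Quantize the mean heading to the nearest of 8 directions.
    return int(round(angle / (math.pi / 4))) % 8 + 1
```

Returning None gives the flock an explicit signal to wait, rather than defaulting to the first or last direction.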

Conclusion and future work Conclusion
This article presented a hybrid system that achieves an efficient cooperative learning behavior. The system is applied to the task of escaping attacking predators while maintaining a flocking formation.
Flocking is used to move the agents in an α-lattice formation to a target location. Multiple reinforcement learning methods were then presented, using the flocking targets as actions in the learning algorithm. Independent learning, with and without function approximation, proved unreliable at getting the agents to flock to the same target. Cooperative learning without function approximation performed better than both independent methods but still required a considerable number of episodes. Cooperative learning with function approximation performed best, requiring very few episodes for the agents to converge to the same target.
Finally, a detection method is used to determine the direction of attacking predators. The proposed method was shown to be very accurate, requiring only a few more iterations than it takes to transmit information from one edge of the flock to the other. The detected predator directions determine the state of the system for the reinforcement learning component.
The hybrid system was then developed consisting of flocking control, function approximated cooperative learning, and consensus to allow agents to learn the location of a predator and where to flock away from it. The system was tested in one and two predator environments with results showing the success of the system as an efficient cooperative learning method.

Future work
Although the hybrid system proposed here was used to solve the task of agents flocking together to targets to avoid predators, it is generic enough to be used in a variety of multiagent tasks. Applying this hybrid approach, using consensus to determine the states and reinforcement learning to learn them, to achieve efficient learning in other tasks is a direction for future work. A possible improvement to this work is a more parallelized simulation framework to allow faster learning, particularly as additional predators grow the state space and learning configuration. Further testing can be done with an expanded state space of more predators and with a three-dimensional simulation environment. In addition, implementing smarter predators and different types of agents are also tasks that could be investigated in the future.

Figure captions:
- Visualization of the direction classifications. The large red circle represents the predator and the triangle represents the agent. The blue circle represents the best target, green circles are good targets, yellow circles are average targets, orange circles are bad targets, and the small red circle is the worst target in this scenario.
- Comparison of convergence between RBF and Q learning with and without cooperative learning. All four algorithms were run 10 times for 20 episodes and the results averaged. Cooperative RBF converged within four episodes, while cooperative Q learning did not fully converge within 20 episodes. Both independent algorithms were unable to learn. RBF: radial basis function.
- Direction and associated information values for an agent observing a predator. Each color corresponds to an information value 1 through 8, with 1 representing east, 3 representing north, and so forth.
- The average info_i for the agents in each iteration. Some states occur for a longer duration due to the shape of the flock.
- The cause of error of the consensus algorithm. Despite the predator being east of the flock as a whole, the only agent in range senses it as northeast, causing the error seen in Table 1.
- Initialization of an episode with agents randomly distributed.
- The number of agents choosing the same action for a single direction state, by episode. The agents fully converge within four episodes.
- The number of agents choosing the same action for a single direction state, by iteration, with dashed lines marking the start of each episode. Most learning occurs at the beginning of an episode while the flock is connected.
- The number of agents choosing the same action for the single predator environment, averaged over six runs. The agents converge by each direction being encountered six times.
- (a-d) Fifty agents flocking away from one predator before learning the same target.
- Trajectory the agents take in the last episode, where the triangles mark the agents' initial positions. After the agents form the flocking formation, the flocking to the target is smooth despite the ε-greedy random action selection.
- Number of agents choosing the same action for two predators over 216 episodes. After 180 episodes, all agents choose the same action for each state.
- (a-d) Fifty agents flocking away from two predators before learning the same target.
- (a-d) Fifty agents flocking away from two predators after learning the same target.