Auton Agent Multi-Agent Syst DOI 10.1007/s10458-016-9355-3
Allocating training instances to learning agents for team formation
Somchaya Liemhetcharat1,2 · Manuela Veloso3
© The Author(s) 2016
Abstract  Agents can learn to improve their coordination with their teammates and increase team performance. There is a finite number of training instances, where each training instance is an opportunity for the learning agents to improve their coordination. In this article, we focus on allocating training instances to learning agent pairs, i.e., pairs that improve coordination with each other, with the goal of team formation. Agents learn at different rates, and hence, the allocation of training instances affects the performance of the team formed. We build upon previous work on the Synergy Graph model, which is learned completely from data and represents agents' capabilities and compatibility in a multi-agent team. We formally define the learning agents team formation problem, and compare it with the multi-armed bandit problem. We consider learning agent pairs that improve linearly and geometrically, i.e., the marginal improvement decreases by a constant factor. We contribute algorithms that allocate the training instances, and compare against algorithms from the multi-armed bandit problem. In our simulations, we demonstrate that our algorithms perform similarly to the bandit algorithms in the linear case, and outperform them in the geometric case. Further, we apply our model and algorithms to a multi-agent foraging problem, thus demonstrating the efficacy of our algorithms in general multi-agent problems.

Keywords  Learning agent · Ad hoc agent · Multi-agent · Team formation
1 Introduction

Multi-agent teams have been applied to various domains, such as task allocation, foraging, and multi-agent planning. We recently introduced the Synergy Graph model, where team
Somchaya Liemhetcharat (corresponding author)
[email protected]

1 Institute for Infocomm Research, A*STAR, Singapore, Singapore
2 Present Address: Uber Advanced Technologies Center, Pittsburgh, PA, USA
3 Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, USA
performance depends on single-agent capabilities and the pairwise compatibility among the agents [28,33]. The single-agent capabilities and pairwise compatibility are assumed to be fixed and initially unknown, and the Synergy Graph is completely learned from observations, thus removing the need to have a priori knowledge about the agents' capabilities and performance in a multi-agent team. The learned Synergy Graph is then used to form a near-optimal multi-agent team for a task. In this work, we use the term agents to refer to physical robots, simulated robots, and software agents.

What if some agents are learning agents, i.e., they are capable of learning to improve their coordination with their teammates? Learning agents, which we detail in the related work section, improve coordination by modeling their teammates and varying their behaviors to maximize the team performance. We consider learning agent pairs that improve their coordination with each other. By doing so, we encapsulate pairs of learning agents that simultaneously learn, as well as pairs consisting of a learning agent and a regular agent.

In this article, we focus on and formally define the learning agents team formation problem, where the goal is to form a multi-agent team, and there is a fixed number of training instances available before the team is formed. Each training instance is an opportunity for a learning agent pair to improve its coordination. A motivating example is from sports, where a coach has limited opportunities to train his/her team before the actual game.

We use the term "allocating a training instance" to refer to the selection (with replacement) of a learning agent pair, and giving the pair the opportunity to improve their coordination. For example, in sports, the coach allocates training instances to pairs (e.g., a pair that practices passing the ball upfield in soccer), and the pair improves its coordination in passing and receiving the ball. Across multiple training instances, the same learning agent pair may be selected, i.e., the pairs are selected with replacement for each allocation of a training instance. Using the sports example again, the coach may select a particular pair of players to continue practicing over a few training instances, or the coach may select one pair for one training instance, a different pair for the next, and then choose the first pair again for the third training instance. After all the training instances have been allocated, the multi-agent team is selected for the task (e.g., in sports, after all the training instances are done, right before the sports season begins, the coach selects which members form the team).

Hence, the allocation of training instances has a large impact on the performance of the formed team. In particular, a team with low performance before training may outperform all other teams after training, if it is comprised of learning agent pairs that have high learning rates, i.e., large improvements in coordination per training instance. However, the heterogeneous learning rates of learning agent pairs are initially unknown, and have to be learned from observations after each training instance. Thus, solving the learning agents team formation problem requires balancing exploring and exploiting, while modeling the learning rates and keeping the end goal of team formation in mind.
We consider scenarios where learning agent pairs improve linearly, and where learning agent pairs improve geometrically, i.e., the marginal improvement decreases by a constant factor 0 < γ < 1 after each training instance. We consider the optimal allocation of training instances if the learning rates are known, and introduce algorithms that allocate training instances while modeling the learning rates, in both the linear and geometric scenarios. There are parallels between the learning agents team formation problem and the multi-armed bandit problem, namely by considering each learning agent pair as an arm and a training instance as pulling an arm. However, a key difference is that our goal is to form the team with optimal performance, while the goal of the multi-armed bandit problem is to maximize the
cumulative sum of rewards. We compare our algorithms with the upper confidence bound and Thompson sampling algorithms from the multi-armed bandit problem.

Thus, the focus of this article is on how to allocate the training instances to the learning agent pairs, so as to balance exploration (better modeling the agents' learning rates) and exploitation (improving the agents' coordination), with the goal of forming a multi-agent team with high team performance after all the training instances have been allocated. We build upon previous work to learn an initial Synergy Graph of the agents' capabilities and compatibility, then use the model and algorithms contributed in this article to select learning agent pairs for the training instances, and to update the Synergy Graph. Finally, we use the previously-contributed Synergy Graph team formation algorithm to form the multi-agent team for the task.

Existing research in multi-agent teams typically focuses on forming teams of existing agents with known capabilities, or assumes the capabilities are static. On the other hand, research in learning agents typically focuses on how agents can learn to improve their performance in a team, such as in the presence of previously unknown teammates. Our article blends these two fields together—to form teams comprising learning agents. Our approach is general and applicable to most learning agent algorithms, since we do not control the agents' actions and policies; instead, we model their learning rate through observations, even when the agents have no prior coordination experience with one another. We balance exploration and exploitation to improve the agents' coordination, with the goal of forming an effective multi-agent team.

We conduct experiments in simulation, varying the number of learning agent pairs to choose from and training instances, and whether learning agent pairs improve linearly or geometrically. We demonstrate that the algorithms we contribute perform similarly to the bandit algorithms when improvements in coordination are linear, and outperform them in the geometric scenario. Our algorithms perform close to optimal without having a priori information about the heterogeneous learning rates. Further, we apply the Synergy Graph model and our learning agent algorithms to the multi-agent continuous foraging problem, where the foraging agents' goal is to maximize the rate of foraging per unit time. We show that we outperform the comparison algorithms, thus demonstrating the efficacy of our model and algorithms in the learning agent team formation problem.

The layout of our paper is as follows: Sect. 2 discusses related work and how our work builds upon prior research on learning agents, and Sect. 3 discusses the background on the Synergy Graph model. Section 4 formally defines the learning agents team formation problem and gives an overview of our approach. Sections 5 and 6 consider learning agent pairs that improve linearly and geometrically. Section 7 presents how our model and algorithms are applied to general multi-agent problems. Sections 8 and 9 present our experiments and results with synthetic data and in the multi-agent foraging domain respectively, and Sect. 10 concludes.
2 Related work

In this section, we describe the related work in four areas: multi-agent learning and learning agents, ad hoc agents (a subset of learning agents), team formation and coalition formation, and the multi-armed bandit problem and how it has been applied to multi-agent problems. Lastly, we describe how our contributions in this article use prior research in these areas, and how our work differs from them.
2.1 Multi-agent learning and learning agents

Multi-agent learning and learning agents have been studied in various fields, such as game theory, multi-agent planning, and reinforcement learning, e.g., [37,42,44]. There are various challenges in the multi-agent reinforcement learning (MARL) approach, e.g., the "complexity of MARL is exponential" due to the number of agents, states and actions that add to the curse of dimensionality [13]. Also, the learning problem is non-stationary as "each agent is faced with a moving-target learning problem: the best policy changes as the other agents' policies change". The focus of multi-agent learning and learning agents is typically to improve the team performance, by changing the agents' actions and policies through learning. Our focus in this article is not on how agents can learn and improve their performance, but how to model the agents' learning rates from observations, and balance exploration (improving our estimates) and exploitation (improving the final team's performance) to allocate training instances to the agents, in order to form the optimal team.

There are many different approaches to multi-agent learning and learning agents, and we give some examples below. Albrecht and Ramamoorthy [3] compared five multi-agent learning (MAL) algorithms using repeated matrix games where the agents adapt their behaviors instead of having fixed behaviors in a series of ad hoc team problems. Albrecht and Ramamoorthy [4] also contributed a game-theoretic model called the stochastic Bayesian game and a best-response learning method based on Harsanyi-Bellman Ad Hoc Coordination and reinforcement learning for ad hoc coordination in a multi-agent logistics domain. Blumenthal and Parker [12] used punctuated anytime learning to co-evolve team capture strategies for dissimilar robots, and assumed that the robots can employ different strategies in their behaviors to improve the team's performance of capturing a robot. Jumadinova et al. [24] considered agents that learn from interactions with other agents and acquire capabilities, so as to determine the team to form based on how much capabilities can be learned and which agent to learn from. Eck and Soh [20] proposed a learning-based solution where agents learn how different types of information gathering actions improve their knowledge about the environment.

In our case, we do not make any assumptions about what the agents and teams learn, but instead measure and model the improvement over time. Hence, the learning agents could be any one of the above approaches (or any others in the literature), and we model the agents' learning rates instead, in order to allocate the training instances and form an effective multi-agent team.
2.2 Ad hoc agents

The ad hoc teamwork problem was recently introduced, where an autonomous agent learns to collaborate with previously unknown teammates [40]. An ad hoc agent can lead multiple teammates to select the optimal joint action [2], and an ad hoc agent can also model its teammates and change its policy to optimize a team goal [9]. Ad hoc agents are a subset of learning agents, with the key difference being that the ad hoc agent has no prior information about its teammates (that may or may not learn from experience). We describe three approaches of ad hoc agents below.

One approach of ad hoc research focuses on task selection, e.g., Melo and Sardinha [36] investigated the problem of ad hoc teamwork, where a learning agent has to identify the cooperative task among a set of possible tasks and coordinate with other unknown agents to complete the task.
Another approach of ad hoc agents focuses on different types of opponents and teammates. Chakraborty and Stone [16] explored optimal strategies against a memory-bounded learning opponent and also examined the problem of learning a Markovian ad hoc teammate [15,17]. Similarly, Agmon and Stone [1] considered an N-agent team that is led by a single agent to determine their optimal joint utility, where the other agents are memory-bounded. Wu et al. [46] considered ad hoc agent teams that learn from past interaction history and find best-response actions that are consistent with a fixed past history window.

A third approach considers how an ad hoc agent can build models of its teammates in order to adapt its behavior [10]. Barrett [6] contributed the Planning and Learning to Adapt Swiftly to Teammates to Improve Cooperation (PLASTIC) algorithms that enable an ad hoc agent to cooperate with different teammates, where prior knowledge of past teammates can be applied to current teammates. Recently, Barrett and Stone [7,8] introduced PLASTIC-Policy, which learns policies based on the interactions with past teammates and reuses policies to adapt to new teammates.

However, while we discuss ad hoc agents in detail, our work is not specific to ad hoc agents and is applicable to general learning agents that learn to improve their coordination with teammates. We do not control how the agents learn or select the task, and instead model the agents' learning rate, allocate training instances to them, and form the final team for the task.
2.3 Team formation and coalition formation

We are interested in modeling learning agents for team formation. Our perspective is different from other learning agents research in that they typically focus on how an agent can learn and adjust its behaviors based on its teammates, while our focus is on modeling the impact of learning agents on the team, and allocating training instances for the learning agents to improve their coordination with teammates, and hence improve team performance. We use the recently-introduced Synergy Graph model [28,29,33] to compute team performance, where team performance goes beyond the sum of single-agent capabilities and also depends on pairwise compatibility. We provide a background of Synergy Graphs and how we apply it to our problem in the next section.

Team formation typically focuses on selecting the agents for a team, given information about the agents' performance. For example, Gordin et al. used genetic algorithms with a shared memory approach to select two agents (one from each population) to determine the most successful team of two [23,38]. More recently, Caire and Bikakis [14] described a methodology for computing possible coalitions for a multi-robot system. To form the best coalition, Bikakis and Caire [11] modeled "agents as contexts in Multi-Context Systems (MCS) and dependence relations among agents as bridge rules" and found the best solutions by determining all possible coalitions using MCS equilibria algorithms and the set of requirements. Zihayat et al. [47] contributed a two-phase solution to determine the team that optimizes three objectives and find the Pareto front.

Another aspect of team formation is credit assignment of team performance, in order to model the agent's performance. Chang et al. [18] used a Kalman filter to distribute the observed reward according to the assumption that the observed reward is a sum of the agent's direct contribution and the contributions of other teammates. In this article, we use the Synergy Graph team formation algorithm to find the near-optimal team after a Synergy Graph has been learned from data, and the learning of the Synergy Graph performs credit assignment of the team performance. The benefit of the Synergy Graph approach is that the agents' capabilities and compatibility are learned from data, and the Synergy Graph is used to form the near-optimal team for the task.
Further, while various team performance models and coalition formation algorithms have been proposed, we use the Synergy Graph model because it has been applied to many domains, such as software agents [33], robot teams [29], and human teams [26].
2.4 Multi-armed bandit problem

Multi-armed bandit problems have been applied to ad hoc teamwork, e.g., by considering two agents as a multi-armed bandit problem where a memory-bounded agent B observes the actions of agent A and their consequences, and determines what actions agent A should choose in order to maximize utility [41]. Crandall proposed the Game Abstraction by Experts (GABE) method, which enables repeated stochastic games to be formulated as a multi-armed bandit problem that can be solved using existing algorithms [19].

There are similarities between the learning agents team formation problem and the multi-armed bandit problem, which we detail later in Sect. 4. We consider two algorithms from the multi-armed bandit literature: upper confidence bound (UCB) [5] and Thompson sampling (TS) [43]. Each arm in the multi-armed bandit problem has an unknown probability p that is estimated, and UCB pulls the arm with the highest upper confidence bound on p. TS draws a sample for each arm based on its estimated distribution, and pulls the arm with the highest sample. We compare our algorithms for the learning agents team formation problem against the UCB and TS algorithms.
2.5 How our work fits

Our goal is to allocate training instances, i.e., select learning agent pairs and give them the opportunity to improve their coordination, and then select the optimal team for the multi-agent task after all training instances have been allocated. We build upon the related research, and focus on the following:

– We use the existing Synergy Graph research to construct a Synergy Graph completely from data, to provide an initial model of the agents' individual capabilities and their compatibility in a team;
– We do not control the agents' policies or learning algorithms; instead, we treat the agents as "black boxes" and seek to learn, model and quantify how they improve their coordination with experience;
– We select learning agent pairs and provide a pair with a training instance, i.e., an opportunity for the pair to improve the pairwise coordination, and to improve our estimate of the agents' learning rate;
– Selecting learning agent pairs involves a trade-off between exploring (improving our estimates of the learning rate) and exploiting (increasing the coordination and performance of the multi-agent team at the end), and we describe the similarities and differences to the multi-armed bandit problem later;
– After all training instances have been allocated (i.e., no more learning can be done), we form the multi-agent team that is optimal for the task, using the learned model that is improved from data.
3 Background on the Synergy Graph model

This section provides background on the Synergy Graph model. While the Synergy Graph model is not strictly required for the contributions of this article, it is used in the experiments to
Fig. 1 A Weighted Synergy Graph with 5 agents. The distances between agent pairs represent their pairwise compatibility, and agent capabilities Ci are represented as normally-distributed variables
generate an initial model of the multi-agent team performance, and the equations underlying the improvement of learning agents over time are based on it. As such, a background is given in this section to provide context for the following sections. The definitions and symbols used in this section will be consistent with the rest of the paper.
3.1 The Synergy Graph model

The Synergy Graph model was first introduced to solve team formation problems [28], where given a set of agents A, the goal is to select the optimal team A* ⊆ A. The capabilities of every agent ai ∈ A and the performance of teams are initially unknown, but observations o_A of some teams A ⊆ A are available. From these observations o_A, a Synergy Graph is constructed.

We first introduced the unweighted Synergy Graph [28], where every agent ai ∈ A is represented as a vertex in a graph, and unweighted edges connect some pairs of agents, such that the entire graph is connected (but not necessarily fully connected). Each agent also has an associated capability Ci ∼ N(μi, σi²), which represents the agent's individual capability at the task, where N(μi, σi²) is a Normal distribution with mean μi and variance σi². The distance between agents represents their task-based compatibility, where agents that work well together have shorter pairwise distances. Agents that are not directly connected can still be selected to work together on the task; their pairwise distance in the graph is the number of edges on the shortest path between them.
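As a concrete illustration of the unweighted case (not from the paper), the following Python sketch computes the pairwise distance between two agents as the shortest number of edges between them; the graph representation and function name are our own assumptions.

from collections import deque

def shortest_edge_distance(edges, a, b):
    # Breadth-first search over an undirected, unweighted graph given as a list of (u, v) edges.
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    frontier, seen = deque([(a, 0)]), {a}
    while frontier:
        node, d = frontier.popleft()
        if node == b:
            return d
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return None  # unreachable; should not happen for a connected Synergy Graph

# Example: 5 agents in a chain a1-a2-a3-a4-a5; a1 and a3 are two edges apart.
edges = [("a1", "a2"), ("a2", "a3"), ("a3", "a4"), ("a4", "a5")]
print(shortest_edge_distance(edges, "a1", "a3"))  # -> 2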
3.2 Extensions to the Synergy Graph model

The Synergy Graph model was extended to use weighted edges for a larger representational space [33], and Fig. 1 shows an example of a Weighted Synergy Graph with 5 agents. The Weighted Synergy Graph was shown to effectively model ad hoc team performance in RoboCup Rescue, where each RoboCup Rescue algorithm (that controlled multiple simulated robots each) was treated as a software agent.

In addition, the Synergy Graph model has been used for role assignment of robots [29], configuring robot teams [32], forming teams robust to failure [30], and predicting game outcomes in human basketball (NBA) teams [26,27]. Thus, the efficacy of the Synergy Graph model has been demonstrated on software agents, simulated and real robots, and on human teams. In the above-mentioned work, the number of agents was assumed to be initially known, i.e., A is known a priori, and research has been performed to incrementally add new agents into an existing Synergy Graph [31].
3.3 Modeling learning agents with the Synergy Graph model

In the work described above, the capabilities Ci of the agents and their pairwise compatibility φ_{i,j} have so far been assumed to be static, and the observations of performance assume that
S(A) is static. In this article, we consider the scenario where A is fixed and known a priori, but S₂(ai, aj) increases with each training instance. Further, we consider that the compatibility φ_{i,j} increases while the agents' capabilities (Ci and Cj) remain static.

In particular, we assume that there are K training instances, where each training instance is an opportunity for a pair of agents ⟨ai, aj⟩ to improve their coordination. For example, in human soccer, K training instances could represent K days of training, where a coach can drill and train one pair of soccer players every day, and the same pair could be chosen to be trained for multiple days if desired (i.e., the agent pairs are selected with replacement). The goal is to form a team of n* ≤ |A| agents after the K training instances, e.g., the coach has to form a team of 11 soccer players after the K days of training.

When we use the term "allocate" a training instance, we refer to the selection of a pair of agents ⟨ai, aj⟩ such that the coordination φ_{i,j} will improve. Hence, allocating K training instances refers to selecting K pairs of agents (with possible duplicate pairs) and improving their pairwise coordination. Thus, the problem is to decide which pairs of agents to allocate the K training instances to, such that the best team with n* agents can be formed afterwards.
4 Formal problem definition and our approach

In this section, we formally define the learning agents team formation problem, give an overview of our approach, and compare the learning agents team formation problem with the multi-armed bandit problem.
4.1 Learning agents team formation problem

We are interested in team formation, i.e., selecting a subset of agents, in order to maximize a performance function. We begin with the set of agents and the definition of a team:

Definition 1  The set of agents is A = {a1, . . . , aN}, where each ai ∈ A is an agent.

Definition 2  A team is any subset A ⊆ A.
Definition 3  A learning agent pair is a pair of agents ⟨ai, aj⟩ ∈ A², i ≠ j. The set of learning agent pairs is L ⊆ A².

Learning agent pairs improve their performance when they are allocated training instances, which we define next:

Definition 4  A training instance k ∈ {1, . . . , K} is the selection (with replacement) of a learning agent pair ⟨ai, aj⟩ ∈ L, such that the coordination of ⟨ai, aj⟩ improves.

The performance of a team depends on its composition, and in this work, we modify the Synergy function of the Synergy Graph model [28,33]:

Definition 5  The synergy S^{(k)}(A) of a team A after k training instances is:

$$S^{(k)}(A) = \frac{1}{\binom{|A|}{2}} \sum_{\{a_i, a_j\} \subseteq A} S_2^{(k)}(a_i, a_j), \quad \text{such that} \quad S_2^{(k)}(a_i, a_j) = \phi_{i,j}^{(k)} \cdot (C_i + C_j)$$

where φ^{(k)}_{i,j} ∈ R⁺ is the coordination level (or compatibility) between ai and aj, and Ci, Cj are Normally-distributed variables representing ai and aj's capabilities at the task.
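To make Definition 5 concrete, here is a minimal Python sketch (not from the paper) of the Synergy function; the helper names are ours, capabilities are tracked as (mean, variance) pairs, and pairwise terms are treated as independent for simplicity when averaging.

from itertools import combinations
from math import comb

def pairwise_synergy(phi, cap_i, cap_j):
    # S2(ai, aj) = phi_ij * (Ci + Cj), returned as (mean, variance) of a Normal variable.
    mu = phi * (cap_i[0] + cap_j[0])
    var = (phi ** 2) * (cap_i[1] + cap_j[1])
    return mu, var

def team_synergy(team, phi_of, cap_of):
    # S(A): average of the pairwise synergies over all unordered pairs in the team.
    total_mu = total_var = 0.0
    for i, j in combinations(sorted(team), 2):
        mu, var = pairwise_synergy(phi_of[frozenset((i, j))], cap_of[i], cap_of[j])
        total_mu += mu
        total_var += var  # simplification: pairwise terms assumed independent
    n_pairs = comb(len(team), 2)
    return total_mu / n_pairs, total_var / (n_pairs ** 2)

# Toy example with 3 agents and made-up numbers.
caps = {"a1": (2.0, 0.5), "a2": (1.0, 0.2), "a3": (3.0, 1.0)}
phis = {frozenset(("a1", "a2")): 0.8, frozenset(("a1", "a3")): 0.5, frozenset(("a2", "a3")): 1.2}
print(team_synergy({"a1", "a2", "a3"}, phis, caps))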
When agents learn to perform better over time, there are two possible reasons: the agent learns about the task and improves its capability Ci at the task; or the agent learns to coordinate with its teammates better and improves φ_{i,j}. We are interested in the latter, where agents learn to coordinate better with their teammates.

We use the Synergy function to compute team performance for two main reasons. First, the performance of a team goes beyond the sum of single-agent capabilities, i.e., the capabilities of an agent pair are weighted by the coordination between them. Second, improvements in coordination are modeled with changes in φ_{i,j}. We consider only positive φ_{i,j} in this work (please see [26,27] on adversarial teams where teams play a zero-sum game). However, "negative coordination" is handled by the Synergy function, where it computes the average of the pairwise Synergy. Hence, a pair with poor coordination will have a low φ_{i,j}, which causes the overall team synergy to decrease. Please see our previous work [33] for a detailed example.

We focus on pairs of agents that learn to improve their coordination. By doing so, we focus on the improvement of the pair, without doing credit assignment on which agent of the pair (or both) is actually learning. Research on ad hoc agents, e.g., [9], demonstrated that changing the behavior of an agent in response to a teammate improves team performance. An ad hoc agent ai would be represented by considering all agent pairs that include it (e.g., ⟨ai, aj⟩, ⟨ai, ak⟩). Our formulation thus encompasses ad hoc agents, and other learning agents, such as pairs of agents that can only improve performance with each other and no other agents. We use the term learning agents to refer to agents that learn to coordinate better with teammates.

The coordination of a learning agent pair increases after they are allocated training instances:

Definition 6  The coordination of a learning agent pair ⟨ai, aj⟩ ∈ L after the kth training instance is φ^{(k)}_{i,j}. Note that if training instance k was allocated to ⟨ai, aj⟩ ∈ L, then φ^{(k)}_{i,j} > φ^{(k−1)}_{i,j}; otherwise φ^{(k)}_{i,j} = φ^{(k−1)}_{i,j}.

Training instances are allocated to learning agent pairs, and observations are obtained:

Definition 7  If the kth training instance is allocated to the learning agent pair ⟨ai, aj⟩ ∈ L, then an observation o_{i,j} ∼ S₂^{(k)}(ai, aj) is obtained for ⟨ai, aj⟩.

Since ⟨ai, aj⟩ are learning, φ_{i,j} increases as a function of the number of training instances allocated to ⟨ai, aj⟩ (and φ^{(k)}_{i,j} > φ^{(k−1)}_{i,j}), and o_{i,j} similarly increases in expectation. There are K > 0 training instances, and the goal is to form the optimal team of given size n* after K instances:

Definition 8  The optimal team A*_K is the team of size n* with the highest mean synergy S^{(K)} after K training instances.

Since learning agent pairs improve their coordination given training instances, the performance of a team A ⊆ A at the end of the training instances depends on the number of learning agent pairs in A, and the number of training instances each learning agent pair is allocated out of K.
4.1.1 Representing coordination improvement in Synergy Graphs

We previously introduced the Weighted Synergy Graph model [33], where agents are represented as vertices in a weighted connected graph, and their capabilities are represented
as Normally-distributed variables. The compatibility between agents is represented by the shortest distance in the weighted graph, and the compatibility and agent capabilities are used to compute team performance. In this section, we discuss how our learning agents team formation problem is related to Synergy Graphs. However, our contribution in this article is not limited to Synergy Graphs—it applies to general learning agents who learn to coordinate better over time.

We assume that a Weighted Synergy Graph S_initial is given, that represents the team performance of the agents in A prior to the K training instances. In particular, each ⟨ai, aj⟩ ∈ L corresponds to an edge (ai, aj, w_{i,j}) in S_initial. We formally define the Dynamic Weighted Synergy Graph (DyWeSG) model, that represents such learning edges and their improvement rate:

Definition 9  The Dynamic Weighted Synergy Graph (DyWeSG) model is a tuple (G, C), where:
– G = (V, E) is a connected weighted graph;
– V = A, i.e., the set of vertices corresponds to the set of agents;
– E = E_learn ∪ E_regular, i.e., there are two types of edges, learning edges and regular edges;
– ∀⟨ai, aj⟩ ∈ L, e_{i,j} = (ai, aj, w_{i,j}, l_{i,j}) ∈ E_learn is a learning edge between agents ai and aj with initial weight w_{i,j} ∈ R⁺ and a learning rate l_{i,j} ∼ N(μ_{i,j}, σ²_{i,j});
– ∀⟨ai, aj⟩ ∉ L, e_{i,j} = (ai, aj, w_{i,j}) ∈ E_regular is an edge between agents ai and aj with weight w_{i,j} ∈ R⁺;
– C = {C1, . . . , CN}, where Ci ∼ N(μi, σi²) is agent ai's capability.

The learning rate l_{i,j} models how much the coordination between ai and aj improves in each training instance. We assume that the improvement rate is static and constant, and we use a Normal distribution N(μ_{i,j}, σ²_{i,j}) as our estimate of it. Thus, from a Weighted Synergy Graph S_initial, a DyWeSG S is created where the agent capabilities are identical, and the graph structure is identical except for learning agent pairs. Such pairs are modeled with special edges that have an initial weight corresponding to the edge in S_initial and are augmented with the learning rate l_{i,j}.

Figure 2 shows an example DyWeSG with 5 agents, where 4 agent pairs that learn to coordinate better over time are represented with bold edges. An ad hoc agent [40] is represented with all its edges bold, such as a1 in Fig. 2. Other forms of learning agents, such as specific pairs of agents that only improve with respect to each other, are represented only with their edges, such as {a4, a5} in Fig. 2. Using the DyWeSG model, we compute the dynamic pairwise synergy of two agents:
Fig. 2 A Dynamic Weighted Synergy Graph with 5 agents. Agent pairs that learn to coordinate better over time are denoted with bold edges. Initial edge weights are shown in black text, and the learning rates are shown in blue (Color figure online)
Definition 10  The dynamic pairwise synergy between ai and aj, where ⟨ai, aj⟩ ∈ L, after k ≤ K training instances is:

$$S_{d,2}^{(k)}(a_i, a_j) = \left(\phi_{i,j}^{(0)} + F_{i,j}(k_{i,j}, l_{i,j})\right) \cdot (C_i + C_j)$$

where φ^{(0)}_{i,j} is the initial compatibility between ai and aj, k_{i,j} ≤ k is the number of training instances ⟨ai, aj⟩ have had out of the k ≤ K training instances allocated so far, and l_{i,j} ∼ N(μ_{i,j}, σ²_{i,j}) is the estimate of ⟨ai, aj⟩'s learning rate.

The dynamic pairwise synergy uses the sum of the agents' capabilities Ci + Cj, and multiplies it by the sum of the initial compatibility φ^{(0)}_{i,j} and the increase in compatibility F_{i,j}(k_{i,j}, l_{i,j}), which is based on the learning rate l_{i,j} and the number of training instances k_{i,j} the agent pair ⟨ai, aj⟩ has received. As such, the DyWeSG assumes that each training instance improves the compatibility of an agent pair monotonically.

For any other pair of agents, the pairwise synergy function of the Weighted Synergy Graph model applies, i.e., S₂(ai, aj) = φ^{(0)}_{i,j} · (Ci + Cj). The synergy of a team containing learning agents is:

Definition 11  The synergy of a set of agents A ⊆ A after k ≤ K training instances is:

$$S_d^{(k)}(A) = \frac{1}{\binom{|A|}{2}} \sum_{\{a_i, a_j\} \subseteq A} \begin{cases} S_{d,2}^{(k)}(a_i, a_j) & \text{if } \langle a_i, a_j\rangle \in L \\ S_2(a_i, a_j) & \text{otherwise} \end{cases}$$
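The following Python sketch (our own illustration, not the paper's implementation) tracks only the mean of Definitions 10 and 11; the coordination gain F covers both the linear and geometric forms discussed later, and all function names are assumptions.

from itertools import combinations

def coordination_gain(k_ij, l_ij, gamma=None):
    # F_ij: linear gain k*l, or geometric gain l*(1 - gamma^k)/(1 - gamma) when a decay factor is given.
    if gamma is None:
        return k_ij * l_ij
    return l_ij * (1.0 - gamma ** k_ij) / (1.0 - gamma)

def dynamic_pair_synergy(phi0, k_ij, l_ij, mu_i, mu_j, gamma=None):
    # Mean of S_d,2: (phi0 + F_ij(k_ij, l_ij)) * (mu_i + mu_j).
    return (phi0 + coordination_gain(k_ij, l_ij, gamma)) * (mu_i + mu_j)

def dynamic_team_synergy(team, learning_pairs, phi0_of, counts, rates, mu_of, gamma=None):
    # Mean of S_d: average pairwise synergy, using the dynamic form only for learning pairs.
    pairs = list(combinations(sorted(team), 2))
    total = 0.0
    for i, j in pairs:
        key = frozenset((i, j))
        if key in learning_pairs:
            total += dynamic_pair_synergy(phi0_of[key], counts.get(key, 0), rates[key],
                                          mu_of[i], mu_of[j], gamma)
        else:
            total += phi0_of[key] * (mu_of[i] + mu_of[j])
    return total / len(pairs)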
4.2 Overview of approach

We use the modified Synergy function to model the performance of multi-agent teams [28,33]. However, our approach is general and applicable to other multi-agent team models:

1. We model the coordination of learning agent pairs as φ^{(k)}_{i,j} = φ^{(0)}_{i,j} + F_{i,j}(k_{i,j}, l_{i,j}), where:
   – φ^{(0)}_{i,j} is the initial coordination level of ⟨ai, aj⟩;
   – F_{i,j} : Z⁺₀ × R⁺ → R⁺ is the coordination gain function (we consider F_{i,j} being a linear or geometric function);
   – k_{i,j} ≤ k is the number of training instances allocated to ⟨ai, aj⟩ after k ≤ K training instances;
   – l_{i,j} is an initially-unknown learning rate of ⟨ai, aj⟩;
2. We iteratively allocate training instances using estimates of l_{i,j};
3. We use the observations o_{i,j} after each training instance to improve our estimate of l_{i,j};
4. After the K training instances have been allocated, we use the Synergy Graph team formation algorithm to form a near-optimal team with n* members.

We assume that the capability Ci of all agents ai ∈ A, the coordination gain function F_{i,j}, and the initial coordination φ^{(0)}_{i,j} of all agent pairs ⟨ai, aj⟩ are known a priori.¹ The only unknowns are l_{i,j}.

¹ More generally, we use our previous Synergy Graph algorithms to learn the capabilities and coordination of the agents completely from observations.
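As a high-level sketch of the four-step approach above (our own illustration; the selection rule, Kalman update, and team formation functions are placeholders that later sections instantiate):

def run_training_phase(K, learning_pairs, select_pair, train_and_observe,
                       kalman_update, form_team):
    # Iteratively allocate the K training instances, then form the team once.
    for k in range(1, K + 1):
        pair = select_pair(k, learning_pairs)   # e.g., UCB, TS, or AdHocLinear (Sect. 5)
        o = train_and_observe(pair)             # one training instance, noisy observation o_ij
        kalman_update(pair, o)                  # refine the estimate of l_ij
    return form_team()                          # Synergy Graph team formation after K instances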
4.3 Comparison with multi-armed bandits

There are similarities between the learning agents team formation problem and the multi-armed bandit problem, where learning agent pairs are arms in the bandit problem:

1. There are a fixed number of trials (training instances);
2. Each trial improves the estimate of l_{i,j};
3. There is an optimal allocation of the K trials if all l_{i,j} were known.

However, there is a key difference between the two problems: the typical goal of the multi-armed bandit problem is to maximize the cumulative sum of rewards, while the goal of the learning agents team formation problem is to maximize the mean performance of a team after the K trials. Pulling an arm in the multi-armed bandit problem always improves the final score on expectation. In the learning agents team formation problem, assigning an agent pair a training instance improves their coordination, but may not affect the final team's score. For example, if the agent pair ⟨ai, aj⟩ received k_{i,j} ≤ K training instances, but the team A that is formed does not contain the pair ⟨ai, aj⟩, then the k_{i,j} training instances did not add to the final score. There is recent work where, for example, the goal is to minimize simple regret [21] instead of total regret, but the difference still remains that our goal is to maximize the mean performance of a team after the K trials, and is not directly tied to regret.
5 Learning agents that improve linearly

In this section, we consider learning agent pairs that improve their coordination linearly: φ^{(k)}_{i,j} = φ^{(0)}_{i,j} + F_{i,j}(k_{i,j}, l_{i,j}), where F_{i,j}(k_{i,j}, l_{i,j}) = k_{i,j} · l_{i,j}. First, we consider the optimal allocation of K training instances assuming all l_{i,j} are known. Next, we adapt two algorithms from the bandit literature to the learning agents team formation problem, and contribute an algorithm that approximates the optimal allocation. We consider non-linear (i.e., geometric) improvement in coordination in the next section, and introduce other algorithms that are applicable to the geometric case. We use a Kalman filter [39] to provide estimates φ̂^{(k)}_{i,j} and l̂_{i,j} of φ^{(k)}_{i,j} and l_{i,j} respectively, and the Kalman filter is updated with new observations o_{i,j}.
5.1 Computing the optimal allocation

Suppose that ∀⟨ai, aj⟩ ∈ L, l_{i,j} is known. To compute the optimal allocation, we iterate through every possible team A, and compute the allocation of the K training instances given A. The allocation k_{A*} corresponding to the team A* with the maximum score is then the optimal allocation.

When there are no learning agent pairs in A, the allocation does not matter. Otherwise, the optimal allocation is to pick the best learning agent pair in A and allocate all K training instances to it, as shown below:

Lemma 1  Let ⟨ai, aj⟩ ∈ L such that ai ∈ A and aj ∈ A. Let ⟨aα, aβ⟩ ∈ L such that aα ∈ A and aβ ∈ A. Let p_{i,j} = l_{i,j}(μi + μj), where l_{i,j} is the learning rate of ⟨ai, aj⟩, and μi, μj are the mean capabilities of the agents ai and aj respectively.
Let p_{α,β} = l_{α,β}(μα + μβ), where l_{α,β} is the learning rate of ⟨aα, aβ⟩, and μα, μβ are the mean capabilities of the agents aα and aβ respectively. Without loss of generality, suppose p_{i,j} ≥ p_{α,β} ∀⟨aα, aβ⟩ ∈ L. Then, k_A = (⟨ai, aj⟩, . . . , ⟨ai, aj⟩) is the optimal allocation of training instances.

Proof  Let k_A = (⟨a_{1,1}, a_{1,2}⟩, . . . , ⟨a_{K,1}, a_{K,2}⟩) be some allocation of training instances such that ∀γ ∈ {1, . . . , K}, ⟨a_{γ,1}, a_{γ,2}⟩ ∈ L.

Let $k_{\{i,j\}} = \sum_{\langle a_{\gamma,1}, a_{\gamma,2}\rangle \in k_A} \mathbb{1}_{\{a_i, a_j\}}\!\left(\langle a_{\gamma,1}, a_{\gamma,2}\rangle\right)$.

Let $k_{\{\alpha,\beta\}} = \sum_{\langle a_{\gamma,1}, a_{\gamma,2}\rangle \in k_A} \mathbb{1}_{\{a_\alpha, a_\beta\}}\!\left(\langle a_{\gamma,1}, a_{\gamma,2}\rangle\right)$ for all ⟨aα, aβ⟩ ≠ ⟨ai, aj⟩ ∈ k_A.

Let $v_{\{i,j\}} = \left(\phi_{i,j}^{(0)} + k_{i,j} \cdot l_{i,j}\right) \cdot (\mu_i + \mu_j)$.

Let $v_{\{\alpha,\beta\}} = \left(\phi_{\alpha,\beta}^{(0)} + k_{\alpha,\beta} \cdot l_{\alpha,\beta}\right) \cdot (\mu_\alpha + \mu_\beta)$ for all ⟨aα, aβ⟩ ≠ ⟨ai, aj⟩ ∈ k_A.

Then $\operatorname{argmax}_{k_A}\left(\operatorname{mean}(S(A))\right) = \operatorname{argmax}_{k_A}\left(v_{i,j} + \sum_{\{a_\alpha, a_\beta\} \neq \{a_i, a_j\} \in k_A} v_{\alpha,\beta}\right)$, and

$$v_{i,j} + \sum_{\{a_\alpha, a_\beta\} \neq \{a_i, a_j\} \in k_A} v_{\alpha,\beta} = p + k_{i,j} \cdot p_{i,j} + \sum_{\{a_\alpha, a_\beta\} \neq \{a_i, a_j\}} k_{\alpha,\beta} \cdot p_{\alpha,\beta},$$

where $p = \phi_{i,j}^{(0)}(\mu_i + \mu_j) + \sum_{\{a_\alpha, a_\beta\} \neq \{a_i, a_j\} \in k_A} \phi_{\alpha,\beta}^{(0)}(\mu_\alpha + \mu_\beta)$.

Hence $k_A = (\langle a_i, a_j\rangle, \ldots, \langle a_i, a_j\rangle) = \operatorname{argmax}_{k_A}\left(v_{i,j} + \sum v_{\alpha,\beta}\right)$, since $p_{i,j} \geq p_{\alpha,\beta}$ for all ⟨aα, aβ⟩ ∈ L. □
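A small brute-force check of Lemma 1 under the linear model (all numbers below are made up, and the code is our own illustration): with known learning rates, concentrating all K instances on the pair with the largest p_ij = l_ij · (μi + μj) maximizes the team's mean synergy.

from itertools import product

mu = {"a1": 2.0, "a2": 1.0, "a3": 3.0}
phi0 = {frozenset(p): 1.0 for p in [("a1", "a2"), ("a1", "a3"), ("a2", "a3")]}
rates = {frozenset(("a1", "a2")): 0.3, frozenset(("a1", "a3")): 0.1}  # learning pairs L
K = 4

def mean_synergy(counts):
    pairs = [frozenset(p) for p in [("a1", "a2"), ("a1", "a3"), ("a2", "a3")]]
    total = 0.0
    for key in pairs:
        i, j = sorted(key)
        phi = phi0[key] + counts.get(key, 0) * rates.get(key, 0.0)
        total += phi * (mu[i] + mu[j])
    return total / len(pairs)

best = max(product(rates, repeat=K),
           key=lambda alloc: mean_synergy({k: alloc.count(k) for k in rates}))
print(best)  # all K instances go to {"a1", "a2"}, the pair with the largest p_ij (0.9 vs 0.5)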
5.2 Estimating the learning rate

The previous subsection showed that the optimal allocation of training instances is to allocate all of them to a single learning agent pair, given that the learning rates l_{i,j} for all learning agent pairs ⟨ai, aj⟩ ∈ L are known. In this subsection, we consider how l_{i,j} is estimated using observations o_{i,j}. Recall that an observation o_{i,j} ∼ S₂^{(k)}(ai, aj) is obtained after ⟨ai, aj⟩ is allocated a training instance.

Since $S_2^{(k)}(a_i, a_j) = \phi_{i,j}^{(k)} \cdot (C_i + C_j)$, and the coordination $\phi_{i,j}^{(k)} = \phi_{i,j}^{(0)} + F_{i,j}(k_{i,j}, l_{i,j})$, we use $C_{i,j} = C_i + C_j$ for notational simplicity, and rewrite the equation as follows:

$$S_2^{(k)}(a_i, a_j) = \phi_{i,j}^{(k)} \cdot C_{i,j} = \left(\phi_{i,j}^{(0)} + k_{i,j} \cdot l_{i,j}\right) \cdot C_{i,j} \quad (1)$$

We represent the learning agent pair's performance as a linear process $x_k = F x_{k-1}$, where each step in the process corresponds to being allocated a training instance:

– State $x_k = \begin{bmatrix} \phi_{i,j} \\ l_{i,j} \end{bmatrix}$;
– State transition model $F = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$;
– Observation model $H = \begin{bmatrix} \mu_{i,j} & 0 \end{bmatrix}$;
– Observation noise $R_k = \sigma_{i,j}^2 \cdot \phi_{i,j}^2$, where $C_{i,j} \sim N(\mu_{i,j}, \sigma_{i,j}^2)$.

Hence, we use a Kalman filter [39] to estimate l_{i,j}. In particular, we use the current estimate of φ_{i,j} (the first term of x_k) in the computation of R_k every iteration, and initialize the Kalman filter with $x_0 = \begin{bmatrix} \phi_{i,j}^{(0)} \\ 0 \end{bmatrix}$ and state covariance $V_0 = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix}$.

As each new observation o_{i,j} is received, we run one iteration of the Kalman filter to improve our estimate of l_{i,j}.
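A minimal Kalman-filter sketch for the linear case, following the state-space model above (our own Python illustration; the class name, priors, and the synthetic observations are assumptions, not the paper's implementation):

import numpy as np

class PairLearningKF:
    def __init__(self, phi0, mu_pair, var_pair):
        self.x = np.array([phi0, 0.0])          # state: [phi_ij, l_ij]
        self.P = np.diag([0.0, 1.0])            # phi0 known exactly, l_ij initially uncertain
        self.F = np.array([[1.0, 1.0],
                           [0.0, 1.0]])         # phi <- phi + l, l <- l
        self.H = np.array([[mu_pair, 0.0]])     # observation o ~ mu_pair * phi
        self.var_pair = var_pair                # variance of C_ij = C_i + C_j

    def update(self, o):
        # Predict one training instance, then correct with the observation.
        x_pred = self.F @ self.x
        P_pred = self.F @ self.P @ self.F.T
        R = self.var_pair * (x_pred[0] ** 2)    # observation noise uses the current phi estimate
        S = self.H @ P_pred @ self.H.T + R
        K = P_pred @ self.H.T / S               # Kalman gain (scalar observation)
        self.x = x_pred + (K * (o - self.H @ x_pred)).ravel()
        self.P = (np.eye(2) - K @ self.H) @ P_pred
        return self.x[1]                        # current estimate of l_ij

kf = PairLearningKF(phi0=1.0, mu_pair=3.0, var_pair=0.25)
for o in [3.9, 5.1, 6.0]:                       # synthetic observations of S2(ai, aj)
    print(kf.update(o))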
5.3 Allocating training instances

We consider two algorithms from the multi-armed bandit problem: the Upper Confidence Bound algorithm [5] and the Thompson sampling algorithm [43]. Next, we contribute an algorithm that solves the learning agents team formation problem by approximating the optimal allocation.
5.3.1 Algorithms from the bandit problem

The computation of the UCB in the learning agents team formation problem is slightly different from the UCB algorithm in the multi-armed bandit problem, because the coordination of learning agent pairs changes as training instances are allocated to them, while the probabilities are constant in the multi-armed bandit problem. Since the optimal solution is to always allocate all training instances to a single learning agent pair, the modified UCB estimates the UCB of a learning agent pair if all remaining training instances were allocated to it. The pair with the highest UCB is then allocated the kth training instance:

$$\text{ucb}_{\text{linear}}^{(k)} = \operatorname{argmax}_{\{\alpha,\beta\}} UCB\!\left(S_2^{(K)}(a_\alpha, a_\beta)\right) \quad (2)$$
$$= \operatorname{argmax}_{\{\alpha,\beta\}} UCB\!\left(\hat{\phi}_{\alpha,\beta}^{(K)}\right) \quad (3)$$
$$= \operatorname{argmax}_{\{\alpha,\beta\}} UCB\!\left(\hat{\phi}_{\alpha,\beta}^{(k)} + (K - k + 1) \cdot \hat{l}_{\alpha,\beta}\right) \quad (4)$$
$$= \operatorname{argmax}_{\{\alpha,\beta\}} \left[ UCB\!\left(\hat{\phi}_{\alpha,\beta}^{(k)}\right) + (K - k + 1) \cdot UCB\!\left(\hat{l}_{\alpha,\beta}\right) \right] \quad (5)$$

where $UCB(X) = E(X) + \sqrt{\operatorname{var}(X)}$ [5].

Equation 2 estimates the pairwise synergy of ⟨aα, aβ⟩. Equation 3 estimates the final coordination of ⟨aα, aβ⟩ after all K training instances are allocated (since Cα and Cβ are constant and can be removed from the argmax). Equation 4 assumes that all remaining training instances were allocated to ⟨aα, aβ⟩, where φ̂^{(k)}_{α,β} is the estimate of the coordination of learning agent pair ⟨aα, aβ⟩, and l̂_{α,β} is the estimate of their learning rate. Equation 5 uses the fact that φ̂^{(k)}_{α,β} and l̂_{α,β} are independent in order to distribute the UCB function.

In the modified Thompson sampling (TS) algorithm, the same derivation from Equations 2 to 4 is performed, except by finding the argmax of the highest sample from the distribution [43]:

$$\text{ts}_{\text{linear}}^{(k)} = \operatorname{argmax}_{\{\alpha,\beta\}} \text{sample}\!\left(S_2^{(K)}(a_\alpha, a_\beta)\right) \quad (6)$$
$$= \operatorname{argmax}_{\{\alpha,\beta\}} \text{sample}\!\left(\hat{\phi}_{\alpha,\beta}^{(k)} + (K - k + 1) \cdot \hat{l}_{\alpha,\beta}\right) \quad (7)$$
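As an illustration of the modified selection rules (Eqs. 4–7), here is a Python sketch assuming each pair's Kalman filter exposes the mean and variance of φ̂ and l̂; the data layout and function names are our own assumptions.

import math, random

def ucb(mean, var):
    return mean + math.sqrt(var)

def select_pair_ucb_linear(estimates, k, K):
    # estimates: {pair: (phi_mean, phi_var, l_mean, l_var)} after k-1 allocations.
    def score(e):
        phi_mean, phi_var, l_mean, l_var = e
        return ucb(phi_mean, phi_var) + (K - k + 1) * ucb(l_mean, l_var)
    return max(estimates, key=lambda pair: score(estimates[pair]))

def select_pair_ts_linear(estimates, k, K):
    # Thompson sampling: draw one sample of the projected final coordination per pair.
    def sample(e):
        phi_mean, phi_var, l_mean, l_var = e
        phi_s = random.gauss(phi_mean, math.sqrt(phi_var))
        l_s = random.gauss(l_mean, math.sqrt(l_var))
        return phi_s + (K - k + 1) * l_s
    return max(estimates, key=lambda pair: sample(estimates[pair]))

est = {("a1", "a2"): (1.2, 0.05, 0.30, 0.02), ("a1", "a3"): (1.0, 0.10, 0.45, 0.08)}
print(select_pair_ucb_linear(est, k=3, K=10))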
5.3.2 Learning agents team formation algorithm

While the UCB and TS algorithms provide a means of allocating training instances to learning agent pairs, the algorithms do not focus on the goal of forming the optimal team after K training instances, since the goal in the multi-armed bandit problem is to maximize the cumulative sum of rewards. We contribute an algorithm that takes into account the goal of the learning agents team formation problem.

Algorithm 1 shows the pseudocode for approximating the optimal solution to the learning agents team formation problem. For each learning agent pair, the algorithm computes the
maximum coordination of the pair, using the upper confidence bound of the coordination and learning rate estimates, assuming all remaining training instances are allocated to it (line 4), i.e., summing the mean and standard deviation of the current estimates of φ^{(k)}_{i,j} and l_{i,j}. For all other learning pairs, the mean of the current estimates of the coordination is used (line 5). The team of size n* with such an arrangement is found (lines 6–11), and the corresponding learning agent pair is allocated the training instance (line 13). The Kalman filter is updated using the observation (line 14).

The computational complexity of AdHocLinear is $O\!\left(K |L| \binom{N}{n^*} n^{*2}\right)$, where the $\binom{N}{n^*} n^{*2}$ term comes from Line 7, as FormTeam performs a brute-force search across all possible teams of size n*. The complexity of forming the team is reduced to $O(n^{*2})$ by using the Synergy Graph team formation algorithm that approximates the optimal team [28], instead of finding the optimal team, making the overall computational complexity of AdHocLinear $O(K |L| n^{*2})$. Our previous work (e.g., [33]) showed that the Synergy Graph team formation algorithm found near-optimal teams without performing a brute-force search across all possible teams.
Algorithm 1 Allocate training instances assuming agents improve coordination linearly.
AdHocLinear(K)
1:  for k = 1, . . . , K do
2:    (A_best, v_best) ← (∅, −∞)
3:    for all ⟨a_i, a_j⟩ ∈ L do
4:      φ̃^{(K)}_{i,j} ← (E(φ̂^{(k)}_{i,j}) + √var(φ̂^{(k)}_{i,j})) + (K − k + 1) · (E(l̂_{i,j}) + √var(l̂_{i,j}))
5:      φ̃^{(K)}_{α,β} ← E(φ̂^{(k)}_{α,β})  ∀{α, β} ≠ {i, j}
6:      // S̃ is the Synergy Graph that uses φ̃^{(K)} for agent compatibility
7:      A_{i,j} ← FormTeam(A, S̃)
8:      if E(S̃(A_{i,j})) ≥ v_best then
9:        pair_best ← ⟨a_i, a_j⟩
10:       (A_best, v_best) ← (A_{i,j}, E(S̃(A_{i,j})))
11:     end if
12:   end for
13:   o_k ← Train(pair_best)
14:   KalmanUpdate(pair_best, o_k)
15: end for
16: return A_best
The key difference between AdHocLinear and UCB is that in the latter, the pair {ai, aj} with the highest φ̃^{(K)}_{i,j} (as computed in Line 4 of Algorithm 1) would be allocated the training instance. In our algorithm, the upper confidence bound of a learning agent pair's coordination is used to estimate the performance of teams, and the training instance is allocated to the corresponding learning agent pair whose team has the highest estimated performance. Hence, the training instance is not always allocated to the learning agent pair with the highest upper confidence bound.

Each learning agent pair is separately evaluated such that all the remaining training instances are allocated to it. With any number of training instances remaining, the optimal solution is always to allocate all the training instances to a single pair, assuming the learning rates are known. Thus, our algorithm considers such an approach and computes the best possible team if all training instances are allocated to a single learning agent pair. Hence, although our algorithm performs the computation as if a single learning agent pair receives all the remaining training instances, only one training instance is allocated to that pair, and
the process repeats; in the next iteration, the computation of the remaining training instances (one less than the previous iteration) may indicate that a different learning agent pair should be allocated that training instance. In this way, our algorithm adapts to the changing estimates of the learning rates of the learning agent pairs.
6 Agents that improve geometrically

In the previous section, we considered learning agent pairs whose coordination increased linearly with each training instance. However, a linear improvement may not reflect learning in many situations—typically there is a large amount of improvement in the early stages, but the marginal improvement decreases as more learning is done. In this section, we consider learning agent pairs that have a geometric improvement in coordination.

Specifically, we consider $\phi_{i,j}^{(k)} = \phi_{i,j}^{(0)} + F_{i,j}(k_{i,j}, l_{i,j})$ where the coordination gain function $F_{i,j}(k_{i,j}, l_{i,j}) = \sum_{k'=1}^{k_{i,j}} l_{i,j} \cdot \gamma_{i,j}^{k'-1}$. The coordination function is non-linear, and with each training instance, the learning agent pair's coordination is increased by a marginally smaller amount, i.e., the coordination gains between the kth and (k+1)th training instances differ by a factor of 0 < γ_{i,j} < 1. Such a formulation brings about a new feature of the problem: if |L| ≥ 2, it is not always optimal to always train the same learning agent pair, since a different pair may provide a higher increase in team performance. We are interested in how well the algorithms in the previous section perform with such geometric improvements in coordination.

First, in Sect. 6.1, we consider the optimal allocation of training instances assuming all learning rates l_{i,j} and decay factors γ_{i,j} are known. Subsequently, in later subsections, we assume that the learning rates l_{i,j} are unknown, but the decay factors γ_{i,j} are known. While the learning agents improve geometrically, the Kalman filter can still be used to estimate the learning rates, by assuming that the decay rate γ_{i,j} is known, so only l_{i,j} is unknown, similar to the linear case. In Sect. 6.2, we discuss how the Kalman filter is used to estimate l_{i,j}. Section 6.3 describes how the algorithms of Sect. 5 are modified to fit the new problem. The algorithms of Sect. 5 assume that all remaining training instances are allocated to a single learning agent pair (which was optimal in the linear case), and the algorithm we introduce in Sect. 6.3 uses the same approach. However, the optimal solution in the geometric case does not allocate all remaining instances to a single learning agent pair, and we explore an algorithm in Sect. 6.4 that uses an integer solver to allocate the remaining training instances.
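A quick numerical check (made-up numbers, our own snippet) that the geometric gain sum matches the closed form used in Sect. 6.1, namely l·(1 − γ^k)/(1 − γ):

l, gamma, k = 0.5, 0.8, 6
summed = sum(l * gamma ** (kp - 1) for kp in range(1, k + 1))
closed = l * (1 - gamma ** k) / (1 - gamma)
print(round(summed, 10) == round(closed, 10))  # True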
6.1 Optimally allocating training instances

We first consider the optimal allocation of training instances, given a fixed team A of size n*.
The coordination gain $F_{i,j}(k_{i,j}, l_{i,j}) = \sum_{k'=1}^{k_{i,j}} l_{i,j} \cdot \gamma_{i,j}^{k'-1}$, which can be simplified to $F_{i,j}(k_{i,j}, l_{i,j}) = l_{i,j} \cdot \frac{1-\gamma_{i,j}^{k_{i,j}}}{1-\gamma_{i,j}}$. The synergy of A is:

$$S^{(K)}(A) = \frac{1}{\binom{|A|}{2}} \sum_{\{a_i, a_j\} \subseteq A} \left( \phi_{i,j}^{(0)} C_{i,j} + l_{i,j} \cdot \frac{1-\gamma_{i,j}^{k_{i,j}}}{1-\gamma_{i,j}} \, C_{i,j} \right)$$
Since the only variable is k_{i,j} (the allocation of training instances), the optimal allocation Allocation(A, K) given A is:

$$\operatorname{argmax} \sum_{\langle a_i, a_j\rangle \in A^2 \cap L} l_{i,j} \cdot \frac{1-\gamma_{i,j}^{k_{i,j}}}{1-\gamma_{i,j}} \, C_{i,j} \quad \text{such that} \quad \sum_{\langle a_i, a_j\rangle \in A^2 \cap L} k_{i,j} = K$$

With such a formulation, the optimal allocation of training instances given K can be found using a non-linear integer solver. When A is unknown, the optimal allocation is:

$$\text{Optimal}_{\text{geometric}}(K) = \operatorname{argmax}_{A \subseteq \mathbf{A} \text{ s.t. } |A| = n^*} \text{Allocation}(A, K)$$

However, computing the optimal allocation is infeasible given that the non-linear integer solver is run on every possible team, thus having a runtime of $O\!\left(\binom{N}{n^*} \cdot 2^{|L|}\right)$.
(k)
(k)
S2 (ai , a j ) = φi, j · Ci, j ⎛ ⎞ ki, j (0) ⎠ = ⎝φi, j + (li, j · γi,k−1 j ) · Ci, j
(8)
k=1
Similar to Sect. 5.2, we represent the learning agent pair’s performance as a linear process instance. xk = F xk−1 , where each step in the process corresponds to an allocated training φi, j , where li, j is We use a similar Kalman filter as Sect. 5.2, except that the state xk = li, j 1 1 . the marginal learning rate, and the state transition matrix F = 0 γi, j Hence, even though the coordination gains are non-linear, the state transition model remains linear and a Kalman filter is sufficient to model it. The key difference in the Kalman filter for geometric gains is that both the coordination φi, j and the marginal learning rate li, j change with each iteration of the Kalman filter, and we assume that the decay rates γi, j are known a priori. In the linear case, li, j is constant, so only φi, j changes.
6.3 Applying the linear algorithms to agents with geometric improvements In this section, we discuss how the UCB and TS algorithms are applied to agents with geometric improvements, and how we can adapt AdHocLinear (Algorithm 1) to the geometric case as well.
123
Auton Agent Multi-Agent Syst
6.3.1 UCB and TS algorithms in the geometric case First, we describe how the Upper Confidence bound algorithm [5] determines the learning agent pair to be allocated the training instance: (k) (K ) ucbgeometric = argmax{α,β} U C B φˆ α,β (9) K −k+1 (k) (k) ucb = argmax{α,β} U C B φˆ + lˆ α,β · γ k −1 (10) α,β
geometric
k =1
(k) ucbgeometric
= argmax{α,β}
α,β
K −k+1 (k) + U C B(lˆ α,β ) · γ k −1 U C B φˆ α,β
k =1
α,β
(11)
√ where U C B(X ) = E(X ) + var (X ) [5]. Equation 9 comes from the same derivation as Eqs. 2 to 3 in the linear case. Equation 10 (k) uses the Kalman estimate of the coordination φˆ α,β and the marginal learning rate lˆ α,β , i.e., the learning rate after the kth training instance. Equation 11 distributes the UCB function, similar to Equation 5 in the linear case. Similarly, the Thompson Sampling algorithm is updated for the geometric case: K −k+1 −1 (k) (k) k ts = argmax{α,β} sample φˆ + lˆ α,β · γ (12) geometric
α,β
k =1
α,β
where Eq. 12 is similar to Eq. 10, except that the sample function is used instead of UCB.
6.3.2 Adapting AdHocLinear to the geometric case The algorithm we contributed in the previous section, AdHocLinear (Algorithm 1), was designed for learning agent pairs that improved their coordination linearly. However, the algorithm only needs to be modified slightly to apply to learning agent pairs that improve their coordination geometrically, and we name this modified algorithm AdHocGeometric. The only differences between AdHocGeometric and AdHocLinear are on line 4 and line 14 of Algorithm 1. Line 4 of Algorithm 1 computes the estimated upper confidence bound of the learning agent pair’s coordination assuming all remaining training instances are allocated to the pair. For geometric coordination improvements, the upper confidence bound of the coordination should be: K −α+1 (K ) ← (E(φˆ (k) ) + var (φˆ (k) )) + (E(lˆi, j ) + var (lˆi, j )) · φ γi,k j−1 i, j i, j i, j k =1
based on Eq. 11. Hence, by changing line 4 of Algorithm 1, AdHocGeometric is suited for learning agents that improve geometrically, with a similar computational complexity as Algorithm 1. The update to Line 14 of AdHocGeometric uses the Kalman update from Sect. 6.2 to update the estimates of the learning rate li, j in the geometric case. These changes allow the algorithm to compute the final estimated coordination of the learning agent pairs when the improvements are geometric, while preserving the nature of the algorithm. We evaluate the performance of AdHocGeometric later in the experiments sections.
123
Auton Agent Multi-Agent Syst
6.4 Using an integer solver to allocate training instances The changes to the algorithms described above provide an allocation of training instances in the geometric case, but in their computations, they assume that a learning agent pair is allocated the remainder of the training instances. In Sect. 6.1, we described the optimal allocation of training instances in the geometric case, that requires a priori knowledge of the learning rates li, j . Algorithm 2 uses the approach of the optimal solution to allocate the training instances, but uses the Kalman Filter to estimate li, j . Algorithm 2 Use a solver to find the allocation of remaining training instances, assuming agents improve coordination geometrically. AdHocGeometricSolver(K ) 1: for k = 1, . . . , K do 2: (Abest , vbest ) ← (∅, −∞) 3: for all A ⊆ A s.t. |A| = n ∗ do 4: (v A , kk , . . . , k K ) ← Allocation’(A, K − k + 1) 5: if v A ≥ vbest then 6: pairbest ← kk 7: (Abest , vbest ) ← (A, v A ) 8: end if 9: end for 10: ok ← Train(pairbest ) 11: KalmanUpdate(pairbest , ok ) 12: end for 13: return Abest
The function Allocation’ (line 4 of Algorithm 2) uses a non-linear integer solver to allocate the remaining training instances, and is similar to the Allocation function of the optimal solution, except that the upper confidence bound of the learning rates are used, i.e., E(lˆi, j ) + var (lˆi, j ), since li, j is unknown and being estimated by the Kalman filter
(Sect. 6.2). Allocation’ returns the best allocation returned by the integer solver (kk , . . . , k K ), and the value of the final team (v A ) if the allocation is performed. The best allocation is sorted so that the learning agent pair with the highest contribution gain from the allocation is returned. However, since the non-linear integer solver is run for all possible teams A, and for K iterations, the algorithm AdHocGeometricSolver is infeasible when the number of possible teams is large, having a runtime of O(K nN∗ 2|L| ). We present AdHocGeometricSolver as a baseline to consider if computation was not an issue, while still preserving the nature of the learning agents team formation problem, i.e., that the learning rates of the learning agent pairs are initially unknown. We later show in our experiments that our AdHocGeometric algorithm has similar performance with a much smaller runtime.
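Allocation' hands the allocation problem to a non-linear integer solver; as an illustration of the objective it optimizes, the sketch below scores a single candidate allocation of the remaining training instances using the upper confidence bound of the learning rates. The team-value proxy here simply sums post-training coordination values, whereas the actual objective uses the Synergy equation, so treat this as a simplified stand-in.

```python
import math

def allocation_value(team_pairs, allocation, remaining):
    """Score one candidate allocation of `remaining` training instances.

    `team_pairs` maps (i, j) -> dict with phi (current coordination),
    l_mean / l_var (Kalman estimate of the learning rate) and gamma (decay).
    `allocation` maps (i, j) -> number of training instances it receives.
    Returns the summed post-training coordination, a proxy for team value.
    """
    assert sum(allocation.values()) == remaining
    total = 0.0
    for pair, est in team_pairs.items():
        n = allocation.get(pair, 0)
        l_ucb = est["l_mean"] + math.sqrt(est["l_var"])   # optimistic learning rate
        gain = l_ucb * sum(est["gamma"] ** (k - 1) for k in range(1, n + 1))
        total += est["phi"] + gain
    return total
```

A solver (or, for tiny instances, exhaustive enumeration of allocations) would maximize this quantity over all ways of splitting the remaining instances among the team's learning edges.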
7 Application to general multi-agent problems

In the previous two sections, we discussed two classes of agents—those that improve their coordination linearly, and those that improve their coordination geometrically. In this section, we consider how our model and algorithms can be applied to a general multi-agent problem. First, we describe the criteria that are required to apply our model and algorithms.
7.1 Criteria to apply learning agents team formation problem

There are four main criteria that determine whether our model and algorithms can be applied to a multi-agent problem:

1. Team performance must be observable and measurable;
2. The performance of a two-agent team must be observable and measurable;
3. Team performance can be formulated in proportion to the team size;
4. Some agent pairs will improve with "experience".
For a given multi-agent problem, there must be a way to quantify the performance of a multi-agent team at a task, and the performance must be a real number. The actual value of the performance is domain-specific, as is the method to compute it. For example, in soccer, if the team performance is the number of goals scored minus the number of goals conceded, the first criterion is fulfilled. It is acceptable that the observation of team performance is noisy and/or the team performance itself is non-deterministic—the Synergy Graph models the uncertainty with Normal distributions. The second criterion requires that two-agent teams are feasible and that their performance can be measured. In some multi-agent problems, the task cannot be completed with two agents (e.g., moving a heavy object where any two agents have insufficient strength to move it), and such problems are out of scope for this article. An example of a multi-agent problem that fits this criterion is multi-agent foraging—a team of any size (including two agents) is able to forage items, and the team performance could be the number of items foraged, thus fulfilling the first and second criteria. The third criterion is related to the Synergy equation (Definition 11), where the synergy of a team is computed as the average of the pairwise synergies. Hence, the two examples above do not fulfill the third criterion—in general, increasing the size of the team would increase the goal difference in soccer and the number of items foraged in multi-agent foraging. One common way to adapt the examples is to divide the proposed score by the number of agents in the team, i.e., the goal difference per player in soccer, and the number of items foraged per agent in multi-agent foraging. If the multi-agent problem is such that the performance measure is tied closely to the number of agents, and averaging is not applicable, it is also possible to adapt Definition 11 to remove the averaging factor. The fourth criterion differentiates the multi-agent problem as a problem with learning agents, which is in scope for this article; otherwise, the first three criteria would mean that the regular Synergy Graph model and algorithms can be applied to solve it. This criterion ensures that some of the agents are capable of learning, and that, given training instances (i.e., opportunities to gain "experience"), these agents will improve their performance. It is not strictly required that agents learn in pairs, or that they only improve their coordination and do not learn about the task. However, our algorithms assume that the learning agent pairs only improve their coordination and not their individual capabilities, so the algorithms may perform more poorly if that assumption does not hold. Examples of agents that only improve their coordination are ad hoc agents [9,22] that improve their models of teammates with experience and optimize their actions to improve the overall team performance. When a multi-agent problem fulfills all four criteria, we can apply the models and algorithms we contribute, and we detail the process below.
7.2 Learning an initial Synergy Graph

The Synergy Graph model is learned completely from data—observations of team performance and the associated teams. Figure 3 shows the process of the Synergy Graph learning algorithm, and more details can be found in previous Synergy Graph work, e.g., [25,28,33]. In previous work, applying the Synergy Graph learning algorithm to multi-agent team formation problems has used observations of teams of two and three agents. Typically, thirty observations per team have been collected, so as to obtain an accurate estimate of the standard deviations of the observational data. However, the Synergy Graph learning algorithm is also applicable to observations of other team sizes—teams of two and three agents were selected so that the performance of larger teams could be predicted from the learned model, to demonstrate the efficacy of the Synergy Graph model. In addition, we have previously contributed variations of the Synergy Graph learning algorithm that only require a single observation per team instead of thirty [29].

Fig. 3 The Synergy Graph learning algorithm [25], where observations are used to iteratively improve the Synergy Graph structure and compute the capabilities of the agents

The Synergy Graph learning algorithm starts with an initial random guess of the graph structure, which is then used to compute the agents' capabilities. Subsequently, the graph structure is iteratively improved and the agents' capabilities recomputed, such that the overall Synergy Graph more closely models the data. To apply the learning agents team formation problem to a general multi-agent problem, the first step is to obtain observations of team performance and learn the initial Synergy Graph, and we refer to our previous work (e.g., [28,33]) for examples of how this is done. Learning the initial Synergy Graph assumes that the agent capabilities and compatibility are static during the initial phase, which is not a perfect match to our learning agents domain. However, this article focuses on the allocation of training instances after the initial Synergy Graph is learned; thus, we assume that the agent capabilities and compatibility are static during the initial phase. As part of future work, we will consider how to learn an initial Synergy Graph consisting of learning agent pairs. We also discuss the implications of this assumption later in Sect. 7.5.
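The iterative improvement can be pictured as a stochastic local search over the graph. The sketch below is a heavily simplified stand-in for the learning algorithm of [25,28,33]: it assumes a score(graph) routine that fits agent capabilities to the observations and returns a log-likelihood, and it perturbs only edge weights rather than the full structure, which the actual algorithm also modifies.

```python
import random

def learn_structure(agents, score, iterations=1000, seed=0):
    """Simplified stand-in for Synergy Graph structure learning.

    `score(graph)` is assumed to fit agent capabilities to the training
    observations for the given graph and return a log-likelihood; a graph
    is represented here as a dict mapping agent pairs to edge weights.
    """
    rng = random.Random(seed)
    # Start from a random fully connected weighted graph (S_random).
    graph = {(a, b): rng.uniform(1.0, 5.0)
             for i, a in enumerate(agents) for b in agents[i + 1:]}
    best, best_score = dict(graph), score(graph)
    for _ in range(iterations):
        candidate = dict(best)
        edge = rng.choice(list(candidate))
        # Perturb one edge weight; the full algorithm also adds/removes edges.
        candidate[edge] = max(0.1, candidate[edge] + rng.gauss(0.0, 0.5))
        cand_score = score(candidate)
        if cand_score >= best_score:      # greedy hill climbing on the fit
            best, best_score = candidate, cand_score
    return best                           # S_initial
```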
7.3 Allocating training instances

After the initial Synergy Graph S_initial has been learned, the learning agent pairs have to be identified. We assume that the domain and/or problem will specify which agent pairs are capable of learning—otherwise, all pairs can be considered as learning agent pairs, and the algorithm will eventually estimate the non-learning pairs to have learning rates of zero. In the Dynamic Weighted Synergy Graph model (Definition 9), every learning agent pair a_i, a_j is denoted by a learning edge e_{i,j} = (a_i, a_j, w_{i,j}, l_{i,j}) ∈ E_learn. The weight w_{i,j} is initialized to be the shortest distance between a_i and a_j in the initial Synergy Graph S_initial, i.e., w_{i,j} would be the weight of the edge between a_i and a_j if such an edge existed in S_initial; otherwise w_{i,j} would be the shortest distance between them using other edges. The learning rate l_{i,j} is initially unknown, and we use a Normal distribution N(μ_{i,j}, σ²_{i,j}) to model it. The values of μ_{i,j} and σ²_{i,j} to use are domain-specific. Once the Dynamic Weighted Synergy Graph S_dynamic has been created, our learning algorithms can be used to allocate the K training instances. The choice of which algorithm to use depends on the domain and the characteristics of the learning agents in the problem; if the characteristics are unknown, then it can be assumed that the agents learn geometrically, since learning linearly can be modeled with a geometric decay rate of 1. We will discuss the algorithms and their performance later in Sects. 8 and 9.
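A minimal sketch of this initialization step, assuming the initial Synergy Graph is given as a weighted adjacency list; the prior parameters mu0 and var0 are illustrative placeholders for the domain-specific choices mentioned above.

```python
import heapq

def shortest_distance(graph, source, target):
    """Dijkstra over the weighted Synergy Graph.

    `graph` maps agent -> list of (neighbour, edge_weight).
    """
    dist, frontier = {source: 0.0}, [(0.0, source)]
    while frontier:
        d, node = heapq.heappop(frontier)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue
        for nxt, w in graph.get(node, []):
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(frontier, (nd, nxt))
    return float("inf")

def make_learning_edges(initial_graph, learning_pairs, mu0=0.5, var0=1.0):
    """Create learning edges (a_i, a_j, w_ij, l_ij) for the Dynamic Weighted Synergy Graph.

    w_ij is the shortest distance between a_i and a_j in S_initial; l_ij is
    modelled with a Normal prior N(mu0, var0), whose parameters are domain-specific.
    """
    return {
        (ai, aj): {
            "w": shortest_distance(initial_graph, ai, aj),
            "l_mean": mu0,
            "l_var": var0,
        }
        for (ai, aj) in learning_pairs
    }
```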
7.4 Forming the multi-agent team

After the K training instances have been allocated, the final team can be formed using the Dynamic Weighted Synergy Graph. The Synergy Graph team formation algorithm [28,33] can be used to form a near-optimal team. The algorithm assumes that the size n* of the desired team is given, as it is typically a domain-specific number (e.g., 11 players in a human soccer team). Otherwise, the team formation algorithm can be iterated over different values of n, and the team with the highest modeled performance is selected as the final result. Thus, through the steps described above (and illustrated in Fig. 4), the Synergy Graph model and the algorithms contributed in this article and others are used to solve the learning agents team formation problem, and applied to general multi-agent problems. We will use these steps in Sect. 9 in the multi-agent foraging domain as an example, and to evaluate the efficacy of our model and algorithms.
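The team formation step itself comes from [28,33] and is approximate rather than exhaustive; the brute-force sketch below only illustrates the selection criterion, assuming a synergy(team) function that evaluates a candidate team on the Dynamic Weighted Synergy Graph, and optionally iterating over team sizes when n* is not given.

```python
from itertools import combinations

def form_team(agents, synergy, n_star=None):
    """Brute-force stand-in for the final team formation step.

    `synergy(team)` is assumed to evaluate a candidate team with the Synergy
    equation on the Dynamic Weighted Synergy Graph. If n_star is None, all
    team sizes are tried and the best-performing team is returned.
    """
    sizes = [n_star] if n_star is not None else range(2, len(agents) + 1)
    best_team, best_value = None, float("-inf")
    for n in sizes:
        for team in combinations(agents, n):
            value = synergy(team)
            if value > best_value:
                best_team, best_value = team, value
    return best_team
```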
7.5 Relevance to learning agents and limitations

Section 7.1 discussed the four criteria for applying our approach to general multi-agent problems, and Sects. 7.2 to 7.4 discussed the steps to do so. The fourth criterion in Sect. 7.1 states that some agent pairs should improve with "experience". We did not list any restrictions on how the agents learn, such as whether and how they model their teammates, choose better joint actions, or switch behaviors over time. Our approach is general and applicable to any learning agent, as we do not control the agents' behavior or how they choose their actions and policies. Sections 5 and 6 describe the linear and geometric models we use to model the agents' learning rates, and it is possible that the learning agents in the literature do not necessarily follow these models.
Fig. 4 The process of applying the Synergy Graph model and learning agent algorithms to a general multi-agent problem
However, as we show later in Sect. 9, we are still able to allocate training instances effectively when applying our approach to a multi-agent foraging problem. Further, the learning rate models can be improved as future work, to better model how agents improve over time with experience. Thus, our approach is broadly applicable to general multi-agent interaction without prior coordination. In Sect. 7.2, we discussed one limitation of our current approach, namely that the initial Synergy Graph is learned with the assumption that the agents' capabilities and coordination are static during that phase. We believe that this limitation is reasonable, since learning agents are generally controlled by software, and can be set to not learn during that phase. In that way, the initial Synergy Graph would accurately reflect the agents' capabilities and compatibility prior to learning and the allocation of the training instances. However, this would require that the learning agent algorithms are slightly tweaked, or that the learning agents are "reset" after each observation is made during the initial creation of the Synergy Graph, which we believe is typically feasible. If it is not possible to disable the learning during the initial phase, for example if the approach is applied to human or human-robot teams, then the initial Synergy Graph would not be fully accurate, and would only approximate the agents' average capabilities and compatibility during the initial phase. That would introduce some error into the system, and we seek to address this as future work, to learn an initial Synergy Graph in the presence of learning agents.
8 Experiments with synthetic data

This section describes the experiments we conducted with synthetic data to compare the performance of our algorithms, AdHocLinear, AdHocGeometric, and AdHocGeometricSolver, against the benchmarks of UCB and TS. In the next section, we describe our experiments and results on applying our model and algorithms to the domain of multi-agent foraging.
For our experiments with synthetic data, we conduct three sets of experiments:

– We first consider agents that improve linearly, and compare AdHocLinear against the optimal allocation, UCB (linear) and TS (linear);
– We next consider agents that improve geometrically, and compare AdHocGeometric and AdHocGeometricSolver against the optimal allocation, UCB (geometric) and TS (geometric);
– The last set considers scaled-up experiments (in terms of the number of agents and learning agent pairs) in the geometric case, and compares AdHocGeometric against UCB (geometric) and TS (geometric). We do not compare against the optimal allocation since its computation is exponential; similarly, we do not test AdHocGeometricSolver.

We first describe our overall experimental setup, then the results and analysis of the three sets of experiments.
8.1 Experimental setup

To generate the agent capabilities and pairwise coordination, we used a Synergy Graph [28] with weighted edges and |A| vertices, with the agent capabilities C_i ~ N(μ_i, σ_i²) such that μ_i ∈ (m/2, 3m/2) and σ_i² ∈ (0, m²), where the multiplier m = 100. We used the Synergy Graph model as it provides a means to compute the coordination between agent pairs, and we generated the agent capabilities with a large variation among the agents. For each learning agent pair, we randomly sampled the learning rates, such that in the linear case, the learning rate l_{i,j} ∈ (0, 0.1), and in the geometric case, the initial learning rate l_{i,j} ∈ (0, 1) and the decay rate γ_{i,j} ∈ (0.9, 0.95). The bounds of the learning and decay rates were chosen such that the coordination gains after allocating the training instances do not completely overshadow the initial team performance before learning, and such that the learning rates do not decay too quickly in the geometric case. In the first and second sets of experiments, we set |A| = 10 and |L| = 5, and randomly selected the |L| learning agent pairs. We limited the size of |A| and |L| since the computation of the optimal allocation is exponential in the geometric case. We generated 50 Synergy Graphs for the linear case and 50 Synergy Graphs for the geometric case, and the corresponding variables for the learning agent pairs. We also varied the number of training instances, i.e., K = 20, 40, ..., 280, 300. In the third set of experiments, we set |A| ∈ {20, 30} and |L| ∈ {5, 10, 15, 20}. For each value of |A| and |L|, we generated 30 Synergy Graphs and the corresponding variables for the learning agent pairs. We set the number of training instances to K = 300, the upper limit of K from the first and second sets of experiments, so that we could compare the overall performance of AdHocGeometric against the benchmarks. In all three sets of experiments, we set the size of the desired team to be formed to n* = 5. The choice of n* = 5 was arbitrary and chosen only to have a large number of possible teams to select from. The team performance is computed using Definition 11 with the Dynamic Weighted Synergy Graph model.
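A sketch of how one synthetic configuration can be generated following this setup; the Synergy Graph edge weights themselves are omitted, and the helper is only meant to mirror the sampling ranges described above.

```python
import random

def generate_setup(num_agents=10, num_learning_pairs=5, geometric=True, m=100, seed=0):
    """Generate one synthetic configuration following the experimental setup.

    Agent capabilities are N(mu_i, sigma_i^2) with mu_i in (m/2, 3m/2) and
    sigma_i^2 in (0, m^2); learning (and decay) rates are sampled per pair.
    """
    rng = random.Random(seed)
    capabilities = [
        (rng.uniform(m / 2, 3 * m / 2), rng.uniform(0.0, m ** 2))  # (mu_i, sigma_i^2)
        for _ in range(num_agents)
    ]
    all_pairs = [(i, j) for i in range(num_agents) for j in range(i + 1, num_agents)]
    learning_pairs = {}
    for pair in rng.sample(all_pairs, num_learning_pairs):
        if geometric:
            learning_pairs[pair] = {"l": rng.uniform(0.0, 1.0),
                                    "gamma": rng.uniform(0.9, 0.95)}
        else:
            learning_pairs[pair] = {"l": rng.uniform(0.0, 0.1)}
    return capabilities, learning_pairs
```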
8.2 Experiments with agents that improve linearly

Our first set of experiments considers learning agent pairs that improve their coordination linearly, and Fig. 5 shows the performance of teams formed by the various algorithms. The optimal solution was computed following Sect. 5.1.
Fig. 5 The performance of teams formed with various algorithms when coordination increases linearly
All three algorithms [AdHocLinear, UCB (linear), and TS (linear)] demonstrate performance that is close to the optimal allocation. Figure 5 shows the standard deviations of the algorithms with the shaded areas, and illustrates that the standard deviations of all the algorithms largely overlap. In particular, AdHocLinear has a similar performance to both UCB (linear) and TS (linear), showing that considering the compatibility of a learning agent pair is sufficient to solve the problem, without having to consider the overall performance of the team. We believe this is due to the linear setup, as shown by the fact that the optimal allocation assigns all the training instances to a single pair. Further, we believe the geometric case, which we discuss next, is more realistic in terms of the improvements of learning agents.
8.3 Experiments with agents that improve geometrically

Our second set of experiments considers learning agents that improve their coordination geometrically. Figure 6 shows the performance of the various algorithms. The optimal solution was computed following Sect. 6.1. AdHocGeometric, AdHocGeometricSolver, UCB (geometric), and TS (geometric) estimate the learning rates and allocate the training instances iteratively, while Geometric Optimal and SinglePair know the learning rates. Geometric Optimal uses the known learning rates and computes the optimal allocation of training instances, while SinglePair selects a single learning agent pair that receives all the training instances. SinglePair is analogous to the Optimal algorithm in Fig. 5, but it is not the optimal solution in the geometric case; however, we include it here for comparison. The shaded regions in Fig. 6 show the standard deviations of the algorithms—the standard deviations of Optimal, AdHocGeometricSolver, and AdHocGeometric mostly overlap, and the standard deviations of SinglePair, UCB (geometric) and TS (geometric) mostly overlap. AdHocGeometric performs slightly worse than AdHocGeometricSolver, but both are close to the optimal geometric solution.
Fig. 6 The performance of teams formed with various algorithms when coordination increases geometrically
Hence, even though the optimal geometric solution and AdHocGeometricSolver require a non-linear integer solver, an allocation algorithm such as AdHocGeometric is sufficient to achieve high performance without an exponential computational cost. When learning agent pairs increase their coordination geometrically, AdHocGeometric and AdHocGeometricSolver significantly outperform UCB (geometric) and TS (geometric). UCB and TS do not perform well in the geometric case, being close to SinglePair (which was optimal in the linear case but is sub-optimal here), even though the algorithms were updated for the geometric case (Sect. 6.3). Hence, our results show that it is important to keep the team formation goal in mind while allocating training instances in the geometric case.
8.4 Further experiments with agents that improve geometrically

The second set of experiments showed that AdHocGeometric performs similarly to AdHocGeometricSolver, and with a much lower computational cost. Hence, in the experiments below, we increased the number of agents and the number of learning agent pairs, and only compare AdHocGeometric to UCB (geometric) and TS (geometric). We do not consider AdHocGeometricSolver or the optimal geometric solver since they would be intractable to compute. In addition, since the number of feasible teams of size n* = 5 grows very large, we modified the algorithms to approximate the optimal team formation instead of doing a brute-force argmax computation. For example, AdHocGeometric would not use an argmax function, but would instead use the Synergy Graph team formation algorithm [28,33] to approximate the optimal team. We have previously shown that the Synergy Graph team formation algorithm finds near-optimal teams without the exponential computational cost [28,33]. Figure 7 shows the results for |A| = 20 and Fig. 8 shows the results for |A| = 30, with the number of learning agent pairs varying from 5 to 20. In both figures, the shaded regions represent the standard deviations of the results. The number of training instances was K = 300, and the figures show the performance of the team formed after all K training instances have been allocated.
Fig. 7 The improvement in performance of teams formed by AdHocGeometric against the benchmarks, when coordination increases geometrically, with twenty total agents and three hundred training instances, and varying number of learning agent pairs
Fig. 8 The improvement in performance of teams formed by AdHocGeometric against the benchmarks, when coordination increases geometrically, with thirty total agents and three hundred training instances, and varying number of learning agent pairs
Overall, across the different values of |A| and |L|, it is clear that AdHocGeometric outperforms UCB (geometric) and TS (geometric), similar to the results in the previous subsection, demonstrating that scaling up the number of agents and learning agent pairs does not affect the relative performance of the algorithms. We performed a Wilcoxon signed-rank test on our results; Table 1 shows the p-values for each variation of the experiment (number of learning agent pairs, number of total agents), where each variation had 30 trials. Overall, across the number of learning agent pairs, AdHocGeometric outperformed UCB (geometric) and TS (geometric) with p = 2.2 × 10^−21 and p = 4.0 × 10^−21 respectively when |A| = 20 (120 trials), and with p = 2.2 × 10^−21 and p = 2.9 × 10^−21 respectively when |A| = 30 (120 trials). Hence, from these experiments, we demonstrate that AdHocGeometric performs significantly better than UCB (geometric) and TS (geometric) even when the number of agents and learning agent pairs are scaled up. Further, the computational cost of our AdHocGeometric algorithm is polynomial even in the geometric case where the optimal solution is exponential, and we have shown in the previous subsection that AdHocGeometric performs close to optimal in the geometric case.

Table 1 The p-values of the Wilcoxon signed-rank test of AdHocGeometric versus UCB (geometric) and AdHocGeometric versus TS (geometric), across different numbers of learning agent pairs, and different numbers of total agents

# learning agent pairs | 20 total agents: vs UCB (geo) | 20 total agents: vs TS (geo) | 30 total agents: vs UCB (geo) | 30 total agents: vs TS (geo)
5       | 1.9 × 10^−6  | 5.8 × 10^−6  | 2.1 × 10^−6  | 4.3 × 10^−6
10      | 1.9 × 10^−6  | 2.6 × 10^−6  | 1.7 × 10^−6  | 1.7 × 10^−6
15      | 1.7 × 10^−6  | 1.7 × 10^−6  | 1.7 × 10^−6  | 1.7 × 10^−6
20      | 1.7 × 10^−6  | 1.7 × 10^−6  | 1.7 × 10^−6  | 1.7 × 10^−6
Overall | 2.2 × 10^−21 | 4.0 × 10^−21 | 2.2 × 10^−21 | 2.9 × 10^−21
9 Experiments with multi-agent foraging

In the previous section, we discussed experiments where the data was synthetic and generated from Synergy Graphs that follow our model. In this section, we conduct experiments in the multi-agent foraging domain, to demonstrate how our model and algorithms are applied to a multi-agent problem, and to evaluate the efficacy of our algorithms compared to the benchmarks. We first provide a description of the multi-agent foraging problem we are trying to solve, then describe our experimental setup and the different variables used in the experiments. Finally, we discuss our results and analysis.
9.1 Multi-agent foraging problem

We consider the multi-agent continuous foraging problem [34,35,45], where there is a heterogeneous multi-agent team whose goal is to maximize the rate of items foraged over time. The multi-agent team is solving the continuous foraging problem, because the items to be foraged replenish probabilistically over time, and the agents have to maximize the rate of items foraged. For a full description of the problem statement, we refer to [34,35]. In the works cited above, the items to be foraged spawn from pre-determined locations (known to the agents), and these foraging locations can follow different replenishment models—Bernoulli (where an item has a probability of being generated every time step
at every location), Poisson (where the number of items generated every time step follows a Poisson distribution), and Logistic (where the number of items generated every time step depends on the number of existing items at that location, akin to the number of living creatures in a population replenishing themselves). In this article, we use the Poisson replenishment model, because more than 1 item can be generated per location per time step (compared to the Bernoulli). In addition, the Poisson distribution models certain real-world phenomena such as mail being ready for pickup at mailboxes. The agents do not know the rates of replenishment (which vary across the locations), and have to use models to predict how many resources are available at each location. When an agent arrives at a location, it observes the number of resources available there, and updates its model. It also communicates the observed number to any other agent that is within communication range. Multiple algorithms have also been proposed to solve the multi-agent continuous foraging problem, and we use one of them—the Greedy Rate algorithm—in this article [34]. The Greedy Rate algorithm was shown to perform well in both the Bernoulli and Poisson replenishment models [34], and can also be applied to the multi-agent item delivery problem [35].
9.2 Experimental setup

We set the simulated world to be 1 × 1 unit in size (the actual value does not matter—only the relative speeds of the agents in the world), and the number of simulated agents to be |A| = 10. Each agent a_i had an associated speed s_i and capacity c_i. s_i affected how quickly the agent could move in the world, and c_i affected the maximum number of items the agent could carry at any one time. For each experiment, we sampled s_i ~ N(0.2, 0.1²) (with minimum and maximum values of 0.1 and 0.4 respectively). Similarly, c_i was randomly generated from {1, ..., 10}. Between every pair of agents a_i, a_j, we defined a communication range r_{i,j}. The communication range r_{i,j} was sampled from a Normal distribution N(0.1, 0.1²) with a minimum value of 0.2 and a maximum of 0.4. Note that the mean of the distribution was 0.1, so we only used one side of the Normal distribution to generate the communication ranges. As mentioned above, agents that arrive at foraging locations observe the number of resources available for foraging, and communicate that number to any other agents within their communication range. In addition, some agent pairs are able to share their models, which also depends on their communication range. Sharing models allows the agents to exchange more information, increase the fidelity of their models, and improve their foraging ability. Hence, from the Synergy Graph perspective used in this article, the speed and capacity of an agent affect its capability, while the communication range and whether it shares its model affect its compatibility. The Greedy Rate algorithm was not designed to learn from experience across experiments, but only to improve its model from observations within an experiment [34]. To simulate "learning" in pairs, we allowed the communication range to increase whenever a learning agent pair is allocated a training instance. The goal is to maximize the rate of foraging, and we defined the performance of the multi-agent team to be the percentage of items foraged per time step. Thus, a score of 100 means that the team foraged all the items that were generated at the locations, and a score of 50 means half the items were foraged. We generated the worlds such that there were |A| foraging locations. We chose to normalize the scores to be a percentage, because the items
replenished probabilistically across experiments, and using the raw numbers would not give an accurate sense of the performance of the multi-agent team. Also, when a learning agent pair a_i, a_j is selected, the two agents are simulated in a world with 2 foraging locations, so that the performance score is not a function of the number of agents (otherwise, 2 agents in a world of 10 foraging locations would be very unlikely to forage close to 100% of the items generated). Since we do not know the learning agents' model, we assumed that the agents improve coordination following the geometric model. In the experiments in the previous section, the geometric rate was known, but it is unclear what the rate should be in these experiments. Hence, we varied the geometric rate γ_{i,j} from 0.1, 0.2, ..., 0.9. In addition, we allowed all pairs of agents to "learn", so the number of learning agent pairs is |L| = (10 choose 2) = 45, which is significantly larger than in the experiments in the previous section. We also varied the probability that agent pairs would share models from 0.1, 0.2, ..., 0.9. We set the size of the final team to n* = 10, so all the agents are used in the final team. The number of training instances was set to K = 200. Thus, the space of our experiments was large, and we repeated each set of variables 30 times. Although setting n* = 10 makes the team formation problem trivial (all the agents are in the team), we chose such a value for n* so that the benefit of our algorithm's allocation of training instances could be compared to the baselines (UCB and TS); the previous section already illustrated the benefit of including team formation.
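A sketch of the agent generation and the normalized performance score described above; the clipping of the one-sided Normal for the communication range is our interpretation of the stated minimum and maximum values, and the helper names are illustrative.

```python
import random

def generate_foraging_agents(num_agents=10, seed=0):
    """Sample the heterogeneous foraging agents used in the experiments.

    Speeds are drawn from N(0.2, 0.1^2) clipped to [0.1, 0.4], capacities from
    {1, ..., 10}, and pairwise communication ranges from the upper half of
    N(0.1, 0.1^2), clipped to [0.2, 0.4].
    """
    rng = random.Random(seed)
    speeds = [min(0.4, max(0.1, rng.gauss(0.2, 0.1))) for _ in range(num_agents)]
    capacities = [rng.randint(1, 10) for _ in range(num_agents)]
    comm_range = {}
    for i in range(num_agents):
        for j in range(i + 1, num_agents):
            r = 0.1 + abs(rng.gauss(0.0, 0.1))      # one-sided Normal around 0.1
            comm_range[(i, j)] = min(0.4, max(0.2, r))
    return speeds, capacities, comm_range

def foraging_score(items_foraged, items_generated):
    """Team performance: percentage of generated items that were foraged."""
    return 100.0 * items_foraged / max(1, items_generated)
```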
9.3 Results and analysis

The first step in applying our model and algorithms to the problem is to learn an initial Synergy Graph from the data. To do so, we generated observations of how well teams of 2 and 3 agents perform, and learned a Weighted Synergy Graph. After the initial Synergy Graph is learned, we can apply our AdHocGeometric algorithm to allocate the K training instances, and form a team of size n* after all K training instances have been allocated. Table 2 shows the performance of the formed teams, where n* = 10, the probability of sharing models was set to 50%, and γ_{i,j} varies from 0.1 to 0.9. The variance of the algorithms—AdHocGeometric, UCB (geometric) and TS (geometric)—is very large, mainly due to the variance in the replenishment model and agent variables (such as their speeds and capacities).
Table 2 The foraging rate of the various algorithms, as γ_{i,j} varies

γ_{i,j} (%) | Foraging rate (%)
            | AdHocGeometric | UCB (geometric) | TS (geometric)
10 | 84.3 ± 10.6 | 82.9 ± 10.6 | 83.0 ± 10.7
20 | 85.2 ± 10.4 | 83.0 ± 10.7 | 83.3 ± 10.8
30 | 85.1 ± 10.5 | 83.0 ± 10.7 | 83.2 ± 11.1
40 | 85.2 ± 10.3 | 82.9 ± 10.9 | 82.8 ± 10.9
50 | 85.2 ± 10.3 | 82.9 ± 10.9 | 82.9 ± 10.9
60 | 85.2 ± 10.3 | 82.9 ± 10.9 | 83.0 ± 10.7
70 | 85.2 ± 10.3 | 83.1 ± 10.9 | 83.1 ± 10.9
80 | 85.2 ± 10.3 | 83.1 ± 10.7 | 83.3 ± 11.0
90 | 85.2 ± 10.3 | 83.1 ± 10.7 | 83.0 ± 10.8
Fig. 9 The gain in foraging % when using AdHocGeometric over UCB (geometric) and TS (geometric), over a range of γi, j
Table 3 The p-values of the Wilcoxon signed-rank test of AdHocGeometric versus UCB (geometric) and AdHocGeometric versus TS (geometric), across different values of γ_{i,j}, the geometric rate used in the Kalman filter

γ_{i,j}  | AdHocGeometric versus UCB (geometric) | AdHocGeometric versus TS (geometric)
0.1     | 2.4 × 10^−6  | 9.3 × 10^−6
0.2     | 1.1 × 10^−5  | 4.3 × 10^−6
0.3     | 7.0 × 10^−6  | 4.1 × 10^−6
0.4     | 2.9 × 10^−6  | 1.9 × 10^−6
0.5     | 2.6 × 10^−6  | 2.1 × 10^−6
0.6     | 2.4 × 10^−6  | 2.4 × 10^−6
0.7     | 2.6 × 10^−6  | 1.7 × 10^−6
0.8     | 2.0 × 10^−5  | 4.4 × 10^−5
0.9     | 2.4 × 10^−6  | 2.1 × 10^−6
Overall | 1.0 × 10^−42 | 3.0 × 10^−42
Figure 9 shows a plot of the gain in performance of AdHocGeometric versus UCB (geometric), and of AdHocGeometric versus TS (geometric), removing some of the variance in performance due to the random nature of the task; e.g., a world that replenishes items very quickly will have a low overall team score regardless of the algorithm, since the agents cannot forage all the items, while a world that replenishes slowly will have a high overall team score regardless of the algorithm. However, the relative performance of AdHocGeometric versus UCB (geometric) and TS (geometric) is more consistent. Thus, Table 2 shows that the different values of γ_{i,j} have little effect on the performance of the final team, and Fig. 9 shows that AdHocGeometric outperforms UCB (geometric) and TS (geometric).
Table 4 The foraging rate of the various algorithms, while varying the probability that agents share models when in communication range

Prob of sharing models (%) | Foraging rate (%)
                           | AdHocGeometric | UCB (geometric) | TS (geometric)
10 | 82.1 ± 11.3 | 80.4 ± 11.8 | 80.7 ± 11.6
20 | 84.2 ± 10.5 | 82.1 ± 10.8 | 82.1 ± 11.1
30 | 85.2 ± 10.3 | 82.9 ± 10.9 | 82.9 ± 11.0
40 | 87.0 ± 9.3  | 84.4 ± 9.9  | 84.6 ± 9.9
50 | 88.2 ± 8.9  | 85.6 ± 9.7  | 85.7 ± 9.6
60 | 89.0 ± 8.8  | 86.2 ± 9.6  | 86.6 ± 10.2
70 | 89.6 ± 8.5  | 86.9 ± 9.6  | 86.9 ± 9.5
80 | 90.2 ± 8.2  | 87.2 ± 9.5  | 87.5 ± 9.1
90 | 90.8 ± 8.1  | 88.0 ± 9.1  | 88.0 ± 9.0
Fig. 10 The gain in foraging % when using AdHocGeometric over UCB (geometric) and TS (geometric), as the probability that agents share models when in communication range varies
Table 3 shows the p-values of the Wilcoxon signed-rank test for each value of γ_{i,j}, where each variation had 30 trials. Overall, AdHocGeometric outperformed UCB (geometric) with p = 1.0 × 10^−42 (270 total trials) and TS (geometric) with p = 3.0 × 10^−42 (270 total trials). Table 4 shows the performance as the probability of agent pairs sharing models varied from 10 to 90%, with n* = 10 and γ_{i,j} = 0.7. Similar to Table 2, the large variance in the performance is due to the random nature of the task. Figure 10 shows a plot of the gain in performance of AdHocGeometric versus UCB (geometric), and of AdHocGeometric versus TS (geometric), as the probability of sharing models varies.
Table 5 The p-values of the Wilcoxon signed-rank test of AdHocGeometric versus UCB (geometric) and AdHocGeometric versus TS (geometric), across different probabilities of agents sharing models

Prob of sharing models (%) | Versus UCB (geometric) | Versus TS (geometric)
10      | 1.4 × 10^−5  | 4.2 × 10^−4
20      | 2.1 × 10^−6  | 3.2 × 10^−6
30      | 2.6 × 10^−6  | 3.2 × 10^−6
40      | 1.7 × 10^−6  | 3.5 × 10^−6
50      | 1.1 × 10^−5  | 4.3 × 10^−6
60      | 1.9 × 10^−6  | 2.6 × 10^−5
70      | 2.6 × 10^−6  | 6.3 × 10^−6
80      | 2.4 × 10^−5  | 1.0 × 10^−5
90      | 2.4 × 10^−6  | 1.9 × 10^−6
Overall | 7.4 × 10^−44 | 2.4 × 10^−40
Table 4 and Fig. 10 show that as the probability of sharing models increases, the overall team performance increases across all the algorithms, as is expected since the agents do better at the task. However, AdHocGeometric remains ahead of UCB (geometric) and TS (geometric). Table 5 shows the p-values of the Wilcoxon signed-rank test for each probability of agent pairs sharing models, where each variation had 30 trials. Overall, AdHocGeometric outperformed UCB (geometric) with p = 7.4 × 10^−44 (270 total trials) and TS (geometric) with p = 2.4 × 10^−40 (270 total trials). Thus, even in a general multi-agent problem, our model and algorithms can be applied to effectively allocate training instances to learning agents, and outperform the benchmark algorithms.
10 Conclusion

We are interested in agents that learn to coordinate with their teammates, in particular, pairs of agents that improve their coordination through training instances. We formally defined the learning agents team formation problem, where the goal is to allocate a fixed number of training instances to learning agent pairs that have heterogeneous capabilities and learning rates, so as to form a team, i.e., select a subset of the agents, that maximizes the team performance. We considered two categories of learning agent pairs: those that improved their coordination linearly, and those that improved geometrically, i.e., the marginal improvement decreased by a constant factor with each training instance. We defined the optimal allocation of training instances in both categories, assuming that the learning rates of the agent pairs are known. We contributed two algorithms to solve the learning agents team formation problem, one for the linear category and one for the geometric category, that estimate the learning rates and allocate the training instances. There are similarities between the learning agents team formation and multi-armed bandit problems, and we compared our algorithms against the upper confidence bound (UCB) and Thompson sampling (TS) algorithms from the multi-armed bandit problem. In our simulated experiments, we showed that our algorithms performed similarly to the bandit algorithms in the linear category, and outperformed them in the geometric category,
finding near-optimal solutions in both categories. Further, the optimal solution in the geometric case requires exponential computation time, as does our AdHocGeometricSolver algorithm, which similarly uses an integer solver. We showed in our experiments that the AdHocGeometric algorithm, which is adapted from the AdHocLinear algorithm designed for the linear case, performs similarly to the AdHocGeometricSolver algorithm but at a much lower computational cost, and is hence better suited for general use. We applied our model and algorithms to the multi-agent foraging domain, and compared AdHocGeometric to the benchmark algorithms UCB and TS. We showed that AdHocGeometric outperformed the benchmark algorithms, thus demonstrating the efficacy of our algorithm in the learning agents team formation problem, and its applicability to general multi-agent problems. In our current approach, we assume that the agents do not improve coordination while the initial Synergy Graph is learned, and only begin improving when training instances are allocated. Such an assumption may not be generally true, and we consider learning an initial Synergy Graph while agents are actively learning as future work. Research in learning agents is steadily progressing, and agents are learning at faster rates, across more domains, and interacting with previously unknown teammates. Our contribution in this article builds upon this research, by allowing the use of existing learning agents in the literature (without modifications, since we do not need to control the agents' actions and policies), modeling the agents' learning rates through observations, choosing which agents should be given the training instances to improve their coordination, and forming a multi-agent team. Thus, our approach allows multi-agent teams to interact without prior coordination, and to improve their coordination to form an effective multi-agent team for a task.

Acknowledgements The authors thank Junyun Tay for her feedback and comments, and Yucheng Low for his help with the statistical significance tests. This work was partially supported by the Air Force Research Laboratory under grant number FA87501020165, by the Office of Naval Research under grant number N0001409-1-1031, and the Agency for Science, Technology, and Research (A*STAR), Singapore. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of any sponsoring institution, the U.S. government or any other entity. This work was supported by the A*STAR Computational Resource Centre through the use of its high performance computing facilities.
References

1. Agmon, N., & Stone, P. (2011). Leading multiple ad hoc teammates in joint action settings. In Proceedings of the international workshop on interactive decision theory and game theory (pp. 2–8).
2. Agmon, N., & Stone, P. (2012). Leading ad hoc agents in joint action settings with multiple teammates. In Proceedings of the international conference on autonomous agents and multiagent systems (pp. 341–348).
3. Albrecht, S. V., & Ramamoorthy, S. (2012). Comparative evaluation of MAL algorithms in a diverse set of ad hoc team problems. In Proceedings of the international conference on autonomous agents and multiagent systems (pp. 349–356).
4. Albrecht, S. V., & Ramamoorthy, S. (2013). A game-theoretic model and best-response learning method for ad hoc coordination in multiagent systems. In Proceedings of the international conference on autonomous agents and multiagent systems (pp. 1155–1156).
5. Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3), 235–256.
6. Barrett, S. (2014). Making friends on the fly: Advances in ad hoc teamwork. Ph.D. thesis, The University of Texas at Austin.
7. Barrett, S., & Stone, P. (2014). Cooperating with unknown teammates in robot soccer. In Proceedings of the international workshop on multiagent interaction without prior coordination.
8. Barrett, S., & Stone, P. (2015). Cooperating with unknown teammates in complex domains: A robot soccer case study of ad hoc teamwork. In Proceedings of the international conference on artificial intelligence (pp. 2010–2016).
9. Barrett, S., Stone, P., & Kraus, S. (2011). Empirical evaluation of ad hoc teamwork in the pursuit domain. In Proceedings of the international conference on autonomous agents and multiagent systems (pp. 567–574).
10. Barrett, S., Stone, P., Kraus, S., & Rosenfeld, A. (2012). Learning teammate models for ad hoc teamwork. In Proceedings of the international workshop on adaptive and learning agents.
11. Bikakis, A., & Caire, P. (2014). Computing coalitions in multiagent systems: A contextual reasoning approach. In Proceedings of the European conference on multi-agent systems (pp. 85–100).
12. Blumenthal, H. J., & Parker, G. (2004). Co-evolving team capture strategies for dissimilar robots. In Proceedings of the 2004 AAAI fall symposium (Vol. 2).
13. Buşoniu, L., Babuška, R., & De Schutter, B. (2010). Multi-agent reinforcement learning: An overview. In D. Srinivasan & L. C. Jain (Eds.), Innovations in multi-agent systems and applications-1 (pp. 183–221). Berlin: Springer.
14. Caire, P., & Bikakis, A. (2014). A MCS-based methodology for computing coalitions in multirobot systems. In Proceedings of the international workshop on cognitive robotics.
15. Chakraborty, D. (2014). Sample efficient multiagent learning in the presence of Markovian agents. Ph.D. thesis, The University of Texas at Austin.
16. Chakraborty, D., & Stone, P. (2008). Online multiagent learning against memory bounded adversaries. In Proceedings of the joint European conference on machine learning and knowledge discovery in databases (pp. 211–226).
17. Chakraborty, D., & Stone, P. (2013). Cooperating with a Markovian ad hoc teammate. In Proceedings of the international conference on autonomous agents and multiagent systems (pp. 1085–1092).
18. Chang, Y., Ho, T., & Kaelbling, L. P. (2004). All learning is local: Multi-agent learning in global reward games. In Proceedings of the international conference on neural information processing systems (pp. 807–814).
19. Crandall, J. W. (2014). Non-myopic learning in repeated stochastic games. arXiv preprint arXiv:1409.8498.
20. Eck, A., & Soh, L. K. (2015). To ask, sense, or share: Ad hoc information gathering. In Proceedings of the international conference on autonomous agents and multiagent systems (pp. 367–376).
21. Feldman, Z., & Domshlak, C. (2014). Simple regret optimization in online planning for Markov decision processes. Journal of Artificial Intelligence Research, 51, 165–205.
22. Genter, K., Agmon, N., & Stone, P. (2013). Ad hoc teamwork for leading a flock. In Proceedings of the international conference on autonomous agents and multiagent systems (pp. 531–538).
23. Gordin, M., Sen, S., & Puppala, N. (1997). Evolving cooperative groups: Preliminary results. In International workshop on multi-agent learning.
24. Jumadinova, J., Dasgupta, P., & Soh, L. K. (2014). Strategic capability-learning for improved multiagent collaboration in ad hoc environments. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 44(8), 1003–1014.
25. Liemhetcharat, S. (2013). Representation, planning and learning of dynamic ad hoc robot teams. Ph.D. thesis, Carnegie Mellon University.
26. Liemhetcharat, S., & Luo, Y. (2015). Adversarial synergy graph model for predicting game outcomes in human basketball. In Proceedings of the international workshop on adaptive and learning agents.
27. Liemhetcharat, S., & Luo, Y. (2015). Applying the synergy graph model to human basketball (extended abstract). In Proceedings of the international conference on autonomous agents and multiagent systems (pp. 1695–1696).
28. Liemhetcharat, S., & Veloso, M. (2012). Modeling and learning synergy for team formation with heterogeneous agents. In Proceedings of the international conference on autonomous agents and multiagent systems (pp. 365–375).
29. Liemhetcharat, S., & Veloso, M. (2012). Weighted synergy graphs for role assignment in ad hoc heterogeneous robot teams. In Proceedings of the IEEE/RSJ international conference on intelligent robots and systems (pp. 5247–5254).
30. Liemhetcharat, S., & Veloso, M. (2013). Forming an effective multi-robot team robust to failures. In Proceedings of the IEEE/RSJ international conference on intelligent robots and systems (pp. 5240–5245).
31. Liemhetcharat, S., & Veloso, M. (2013). Learning the synergy of a new teammate. In Proceedings of the IEEE/RSJ international conference on intelligent robots and systems (pp. 5246–5251).
32. Liemhetcharat, S., & Veloso, M. (2013). Synergy graphs for configuring robot team members. In Proceedings of the international conference on autonomous agents and multiagent systems (pp. 111–118).
33. Liemhetcharat, S., & Veloso, M. (2014). Weighted synergy graphs for effective team formation with heterogeneous ad hoc agents. Artificial Intelligence, 208, 41–65.
34. Liemhetcharat, S., Yan, R., & Tee, K. (2015). Continuous foraging and information gathering in a multi-agent team. In Proceedings of the international conference on autonomous agents and multiagent systems (pp. 1325–1333).
35. Liemhetcharat, S., Yan, R., Tee, K., & Lee, M. (2015). Multi-robot item delivery and foraging: Two sides of a coin. Robotics, Special Issue on Recent Advances in Multi-Robot Systems: Algorithms, and Applications, 4(3), 365–397.
36. Melo, F. S., & Sardinha, A. (2016). Ad hoc teamwork by learning teammates' task. Journal of Autonomous Agents and Multi-Agent Systems, 30(2), 175–219.
37. Panait, L., & Luke, S. (2005). Cooperative multi-agent learning: The state of the art. Journal of Autonomous Agents and Multi-Agent Systems, 11(3), 387–434.
38. Puppala, N., Sen, S., & Gordin, M. (1998). Shared memory based cooperative coevolution. In Proceedings of the IEEE international conference on evolutionary computation (pp. 570–574).
39. Russell, S., & Norvig, P. (2003). AI: A modern approach. Upper Saddle River: Prentice Hall.
40. Stone, P., Kaminka, G., Kraus, S., & Rosenschein, J. (2010). Ad hoc autonomous agent teams: Collaboration without pre-coordination. In Proceedings of the international conference on artificial intelligence (pp. 1504–1509).
41. Stone, P., & Kraus, S. (2010). To teach or not to teach? Decision making under uncertainty in ad hoc teams. In Proceedings of the international conference on autonomous agents and multiagent systems (pp. 117–124).
42. Tan, M. (1993). Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the international conference on machine learning (pp. 330–337).
43. Thompson, W. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4), 285–294.
44. Tuyls, K., & Nowe, A. (2005). Evolutionary game theory and multi-agent reinforcement learning. Journal of Knowledge Engineering Review, 20(1), 65–90.
45. Wang, C., Liemhetcharat, S., & Low, K. (2016). Multi-agent continuous transportation with online balanced partitioning (extended abstract). In Proceedings of the international conference on autonomous agents and multiagent systems (pp. 1303–1304).
46. Wu, F., Zilberstein, S., & Chen, X. (2011). Online planning for ad hoc autonomous agent teams. In Proceedings of the international joint conference on artificial intelligence (pp. 439–445).
47. Zihayat, M., Kargar, M., & An, A. (2014). Two-phase Pareto set discovery for team formation in social networks. In Proceedings of the IEEE/WIC/ACM international joint conferences on web intelligence (WI) and intelligent agent technologies (IAT) (pp. 304–311).