Distributed consensus problem with caching on federated learning framework

The federated learning framework facilitates more applications of deep learning algorithms on existing network architectures, where the model parameters are aggregated in a centralized manner. However, some federated learning participants are often inaccessible, for example during a power shortage or in a dormant state. This forces us to explore the possibility that parameter aggregation is operated in an ad hoc manner, based on consensus computing. In addition, since a caching mechanism is indispensable to any federated learning mobile node, it is necessary to investigate the connection between caching and consensus computing. In this article, we first propose a novel federated learning paradigm, which supports an ad hoc operation mode for federated learning participants. Second, a discrete-time dynamic equation and its control law are formulated to satisfy the demands of the federated learning framework, with a quantized caching scheme designed to mask the uncertainties from both asynchronous updates and measurement noises. Then, the consensus conditions and the convergence of the consensus protocol are deduced analytically, and a quantized caching strategy to optimize the convergence speed is provided. Our major contribution is to give the basic theory of the distributed consensus problem for the federated learning framework, and the theoretical results are validated by numerical simulations.


Introduction
Federated learning (FL) is an emerging and promising decentralized machine learning approach that trains models collaboratively on the local datasets of various mobile devices. The local model updates are sent to a server for aggregation and the updated global model is fed back for the next round of local training, instead of transmitting the raw data to a data center. Hence, the trained models are shared and the privacy of local datasets is preserved, while the communication cost can be reduced greatly. 1 Given these compelling benefits, rapidly increasing research attention has been dedicated to applying FL in wireless communications to support more intelligent and more convenient applications, 2 for example, an image classification task for vehicular edge computing, 3 content popularity prediction for augmented reality (AR) applications, 4 and signal classification or deep anomaly detection in industrial distributed wireless sensor networks. 5,6 The past 2 years have witnessed an increasing interest in studying how to employ FL on mobile ad hoc networks (MANET) or multi-agent systems (MAS), such as unmanned aerial vehicles (UAV) 7 and the vehicular Internet of Things (IoT). 8 Nevertheless, the mobile devices acting as FL clients in these studies are designed to communicate directly with an FL server (or a cluster center that acts as a relay), rather than to operate in an ad hoc mode. This situation is also present when FL is applied to wireless sensor networks. 5,6 The distributed consensus problem is the theoretical basis for supporting FL parameter synchronization (namely, the aggregation of model updates) in the context of an ad hoc operation mode.
The distributed consensus problem (also known as consensus computing) is to make the scalar states of a set of nodes (or agents) converge to the same value, or to their average, under local communication constraints. A distributed consensus algorithm or protocol is an interaction rule that specifies the information exchange between an agent and all of its neighbors on the network. The essence of this algorithm is that, in each round, one or more agents exchange information with their immediate neighbors, and each agent then updates its estimate of the state by combining it with the estimates of its neighbors.
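The interaction rule described above can be illustrated with a minimal sketch. It assumes a hypothetical undirected ring of four nodes, scalar states, a fixed step size of 0.3, and ideal links (no delay, no packet loss); none of these values come from the article.

```python
def consensus_round(states, neighbors, step):
    """One synchronous round: every node nudges its value toward its neighbors'."""
    return [
        x_i + step * sum(states[j] - x_i for j in neighbors[i])
        for i, x_i in enumerate(states)
    ]


def run_consensus(states, neighbors, step=0.3, rounds=200):
    """Repeat the local update; no node ever needs a view of the whole network."""
    for _ in range(rounds):
        states = consensus_round(states, neighbors, step)
    return states


# Hypothetical undirected ring of four nodes: node -> immediate neighbors.
RING = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
initial = [0.1, 0.4, 0.7, 1.0]
final = run_consensus(initial, RING)
print(final)  # every entry ends up close to the average of `initial`
```

Because the ring is undirected, the sum of the states is preserved in each round, so the common limit is the average of the initial values.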
What are the benefits brought by the ad hoc operation mode for FL framework? (1) Either FL server or cluster center may be inaccessible in some scenarios, such as in a power shortage or device dormant state, (2) when a large number of FL clients send their requests to link an FL server or a cluster center, it may be overwhelmed due to its limited capability, (3) this operation mode enables short-range communications such as ultra-wideband by which a large channel capacity can be obtained for local model updates' transmissions that are capacity-consuming, (4) in this mode, FL parameter synchronization is enabled by mobile devices in a coordinated manner in the absence of FL server or cluster center, and (5) the asynchronous transmissions of local model updates are allowed to a considerable extent in this operation mode.
In view of the above reasons, a novel FL paradigm is proposed at the beginning of this article, as shown in Figure 1. The most salient challenge for this paradigm is the communication overhead of parameter synchronization. Thanks to sparsification, 9 combined with quantization, compression, or selective communication technologies, 2 the size of model or gradient updates can be reduced tremendously. Our scheme is therefore practicable, particularly if ultra-wideband communication is employed in this paradigm as well.
In short, to the best of our knowledge, the distributed consensus problem over the FL framework has not been studied so far. That is our motivation for exploring this issue.

Related work
As mentioned above, we have not yet seen an FL framework whose mobile clients operate in an ad hoc manner. The existing studies of FL over MANET (or MAS) concentrate on communication cost reduction, FL client selection, data privacy and security, and so on.
Zhang and Hanzo 7 proposed an FL-aided multi-UAV system to conduct classification tasks for exploration scenarios, where each UAV is coordinated by a ground fusion center acting as the FL server to form a cooperative network. A weighted zero-forcing precoding algorithm is used by each UAV to mitigate the interference to the FL server. Bao et al. 8 proposed an edge computing-based joint client selection and networking scheme for vehicular IoT, where some vehicles are assigned to act as both edge nodes (aka cluster centers) and FL clients via a distributed approach. The selected clients act as forwarders between common vehicles and the FL server. Lu et al. 10 employed an FL architecture empowered by blockchain to address data privacy concerns on the Internet of Vehicles, where the security of shared data is guaranteed by integrating learning parameters into a blockchain. Regarding the aforementioned sparsification technology, Sun et al. 9 presented a general gradient sparsification framework as another way to reduce the communication cost of FL parameter synchronization on IoT, where top-1 accuracy on the validation data sets is maintained when 99.9% of gradients are sparsified. As for FL applications in distributed sensor networks, Liu et al. 5 used an FL paradigm to fuse the learning process and recognition results of each sensor node for the modulation recognition of wireless signals. Liu et al. 6 proposed an on-device FL-based deep anomaly detection framework for sensing time-series data. Both of these FL frameworks require parameter aggregators as FL servers, instead of an ad hoc operation mode.
To date, the distributed consensus problem for idealized models has reached a reasonable degree of maturity. 11 Such models assume that each agent (or node) obtains its neighbors' information timely and precisely, with instantaneous transmissions, perfect clock synchronization, concurrent updates, identical agent dynamics, and even fixed network topologies. Nonetheless, wireless networked systems in practical applications often operate in uncertain communication environments and are inevitably subject to communication latency, asynchronous clocks and updates, agents' heterogeneity, topological dynamics, as well as measurement noises (including additive and multiplicative noises). What, then, are the specific demands on consensus computing for an FL framework over MANET (or MAS)? (1) Due to the intermittent and unreliable communications arising from topological dynamics and channel noise, measurement noises that may be treated as time delay (i.e. additive noise) and packet loss (i.e. multiplicative noise) should be taken into account. Furthermore, the time delay on each link (including channel and queuing delay) should be allowed to be non-identical and time-varying, and the packet loss rate on each link (including channel and queuing packet loss) may be non-identical and time-varying as well; (2) given that some nodes are sometimes inaccessible (e.g. in a dormant state), FL parameter aggregation must be achieved in an asynchronous manner; (3) in consideration of nodes' heterogeneity, caching on each node ought to be non-identical; and (4) consensus computing is required to make good use of the broadcast nature of wireless communications, since its convergence can thereby be accelerated tremendously. 12 Although the existing distributed consensus algorithms explore some of the four aspects mentioned above, to the best of our knowledge, an algorithm that fulfills all of these requirements has not been seen yet.
Moreover, the caching issue in connection with consensus computing has not received any attention up to now. Olfati-Saber and Murray 13 provided consensus protocols and their convergence analysis for directed balanced networks with constant time-delays by introducing disagreement functions, and established a direct connection between the algebraic connectivity of a graph and the convergence of a linear consensus protocol. Savino et al. 14 contributed a sufficient condition of consensus for discrete-time switching networks, based on linear matrix inequalities that consider the joint effect of time-varying delays and topological uncertainty. Under the assumption that the delay is time-varying and the undirected network is connected, Wang et al. 15 derived the conditions to guarantee consensus for continuous-time multi-agent systems. For the consensus problem of a switched multi-agent system composed of continuous-time and discrete-time subsystems, Zheng and Wang 16 proposed a linear consensus protocol and proved that this consensus problem is solvable under arbitrary switching with an undirected connected graph, a directed graph, and switching topologies, respectively. Kar and Moura 17 studied distributed average consensus with intermittent topologies and noisy channels in sensor networks, which leads to a bias-variance dilemma, that is, running consensus for longer reduces the bias of the final average estimate but increases its variance, and presented two versions of consensus that compromise on this tradeoff. Zong et al. 18 investigated the stochastic consensus conditions of linear MAS with fixed time-delays and stochastic multiplicative noises. First, the stochastic stability of stochastic differential delay equations driven by multiplicative noises is examined. Then, sufficient conditions are deduced for mean-square and a.s. consensus. Zheng et al. 19 studied the mean-square consensus problem of discrete-time linear MAS over directed networks with constant delay and non-identical packet dropouts. Sufficient consensus conditions are obtained in terms of delay, packet dropout rates, network topology and agent dynamics. On the basis of a first-order average-consensus protocol with switching networks and additive noises, Chen et al. 20 gave a quantitative description of the relation between convergence speed and the connectivity of topologies by using stochastic approximation methods and establishing a critical consensus condition for network topologies.
In short, inspired by the fact that caching plays a critical role in FL operations, 21,22 we will investigate the connection between caching and consensus computing, while discussing the condition for reaching consensus.

Problem formulation
Algebraic graph theory
A MANET (or MAS) is described as a sequence of weighted digraphs $G(k) = \{V, E(k)\}$ with graph index $k \in \{1, 2, \ldots, \infty\}$, where $V$ is the set of $N$ nodes (or agents), $E(k)$ is the set of all existing wireless links, and $w_{ij}$ is the weight of link $(i, j)$.
Suppose that the probability that there exists a link $(i, j)$ is $p_{ij}$ and the packet loss rate over the entire network is $p$. Here $P_{ij}(k)$ indicates whether there exists a link $(i, j)$, $e_{ij}(k)$ indicates whether a packet loss occurs over the link $(i, j)$, and both $P_{ij}(k)$ and $e_{ij}(k)$ follow a Bernoulli process. In addition, let $\lambda_i$ denote the $i$th eigenvalue of the averaged Laplacian $L(k)$.
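To make the Bernoulli link model concrete, the following sketch samples the weight that is effectively seen on one link per round and compares its empirical mean with the product of the link weight, the link probability and the delivery probability. It assumes the two indicators are independent and that the loss indicator equals 1 when the packet is delivered (which matches the mean $1 - p$ used later in Lemma 1). The numeric values and function names are hypothetical.

```python
import random


def effective_weight(w_ij, p_link, p_loss, rng):
    """Sample the weight actually seen on link (i, j) in one round.

    Assumption: link existence and packet delivery are independent Bernoulli
    draws, and a message counts only if the link exists AND is not lost.
    """
    link_up = rng.random() < p_link       # P_ij(k) = 1 with probability p_ij
    delivered = rng.random() >= p_loss    # e_ij(k) = 1 with probability 1 - p
    return w_ij if (link_up and delivered) else 0.0


def average_weight(w_ij, p_link, p_loss, samples=20000, seed=0):
    """Empirical mean of the effective weight over many rounds."""
    rng = random.Random(seed)
    total = sum(effective_weight(w_ij, p_link, p_loss, rng) for _ in range(samples))
    return total / samples


# Hypothetical link: weight 1.0, present 80% of the time, 10% packet loss.
empirical = average_weight(w_ij=1.0, p_link=0.8, p_loss=0.1)
expected = 1.0 * 0.8 * (1 - 0.1)
print(empirical, expected)
```

Under these assumptions the averaged Laplacian is built from such mean weights, which is why its eigenvalues, rather than those of any single random graph, enter the analysis.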

Distributed consensus problem
Each node (or agent) in a networked system can be described by a discrete-time dynamic equation, that is
$$x_i(k+1) = A x_i(k) + B u_i(k)$$
where $x_i \in \mathbb{R}^{n}$ and $u_i \in \mathbb{R}^{m}$ are the state and input of node $i$, respectively. $A$ and $B$ are the coefficient matrices that characterize node $i$. In view of the existing results, 23 we can assume that all eigenvalues of $A$ are either on or outside the unit circle, $[A|B]$ is controllable, and the union of the sequence of weighted directed graphs, that is, $G = \bigcup G(k)$, contains a directed spanning tree.
Considering both delay and packet loss, the input $u_i$ is expressed in terms of the control gain $K$, the delay $\tau$ (an integer), and the neighbor set $N_i$ of node $i$. Substituting this input into the node dynamics gives the update equation. The networked system is said to reach a mean-square consensus if there exists a control gain $K$ such that
$$\lim_{k \to \infty} E\{\|x_i(k) - x_j(k)\|^{2}\} = 0, \quad \forall i, j \in \{1, \ldots, N\} \quad (6)$$
where $\|\cdot\|$ represents the Euclidean norm of the vector.
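The following sketch simulates one plausible instance of such a delayed, lossy update for scalar states. It assumes $A = 1$, $B = 1$, a scalar gain, a fixed integer delay applied to both the neighbor's and the node's own value inside the input, and independent packet loss per link and round. This specific form of the input, the ring topology and all numeric values are assumptions for illustration, not the article's control law.

```python
import random


def simulate(initial, neighbors, gain, delay, p_loss, rounds, seed=0):
    """Scalar consensus with delayed neighbor data and random packet loss."""
    rng = random.Random(seed)
    # history[0] is x(k - delay), history[-1] is x(k); pad so k = 0 is defined.
    history = [list(initial) for _ in range(delay + 1)]
    for _ in range(rounds):
        current, delayed = history[-1], history[0]
        nxt = []
        for i, x_i in enumerate(current):
            u_i = 0.0
            for j in neighbors[i]:
                if rng.random() >= p_loss:           # packet from j delivered
                    u_i += gain * (delayed[j] - delayed[i])
            nxt.append(x_i + u_i)                    # A = 1, B = 1
        history = history[1:] + [nxt]                # slide the delay window
    return history[-1]


# Hypothetical ring of four nodes with a 2-round delay and 20% packet loss.
RING = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
final = simulate([0.1, 0.4, 0.7, 1.0], RING, gain=0.02, delay=2,
                 p_loss=0.2, rounds=2000)
print(final, max(final) - min(final))
```

The spread between the largest and smallest state shrinks toward zero, which is the scalar counterpart of condition (6). Because losses are not symmetric between the two directions of a link, the common value need not equal the exact initial average.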

Consensus protocol
In this section, we will derive a criterion to evaluate the consensus of the networked system and present the consensus protocol.

Consensus conditions
For a networked system with delay and packet loss, the update equation is given by equation (7).
Lemma 1. If there exists a positive-definite matrix $P$ satisfying equation (8), then the system in equation (7) reaches mean-square consensus, where $\mu = E(e(k)) = 1 - p$ and $\sigma^{2} = p(1 - p)$ is the variance of $e(k)$. The proof of this lemma is given in Appendix 1.
We employ the control gain $K$ defined in equation (9). 19 $Q$ in equation (9) is a positive-definite matrix that meets the Riccati inequality (10), where $\tilde{A} = (\tau + 1)A - \tau \cdot I$ and $g = (\tau + 1)(1 - p)/(2\tau(1 - p) \ldots$ The consensus error is defined using $r_j$, the $j$th element of the vector $r = [r_1 \; r_2 \; \ldots \; r_N]^{T}$ that satisfies $r^{T} \cdot L(k - \tau) = 0$ and $r^{T} \cdot 1_N = 1$.
Since $G$ has a directed spanning tree, there exist matrices $Y$ and $S$ such that $c = [1_N \; Y]$ and $c^{-1} = \begin{bmatrix} r^{T} \\ S^{T} \end{bmatrix}$. Equation (6) is then equivalent to a limit condition on the consensus error as $k \to \infty$. 19 Thus, if this condition holds for all $i \in \{1, 2, \ldots, N\}$, consensus is reached in the mean-square sense.
To study the topological consensus conditions further, we state Lemma 2. It is difficult to form a balanced digraph (or a balanced joint digraph) in broadcast-based networks. That is why the convergence result is sometimes biased away from its accurate value, which has been questioned. 24
Lemma 2. If the network is a balanced digraph, its final convergence value is the mean of its initial values.
The proof is given in Appendix 2.
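Lemma 2, and the bias that appears when the digraph is not balanced, can be checked numerically with a small sketch. It assumes ideal links, scalar states, unit link weights and a step size of 0.3; the two four-node digraphs below are hypothetical examples.

```python
def run(initial, in_neighbors, step=0.3, rounds=300):
    """Iterate x_i <- x_i + step * sum over in-neighbors j of (x_j - x_i)."""
    states = list(initial)
    for _ in range(rounds):
        states = [
            x_i + step * sum(states[j] - x_i for j in in_neighbors[i])
            for i, x_i in enumerate(states)
        ]
    return states


initial = [0.0, 0.2, 0.4, 1.0]
mean = sum(initial) / len(initial)

# Directed cycle: each node hears only its predecessor (in-degree = out-degree).
BALANCED = {0: [3], 1: [0], 2: [1], 3: [2]}
# Same cycle plus one extra link 2 -> 0, which breaks the balance.
UNBALANCED = {0: [3, 2], 1: [0], 2: [1], 3: [2]}

balanced_final = run(initial, BALANCED)
unbalanced_final = run(initial, UNBALANCED)
print(mean, balanced_final[0], unbalanced_final[0])
```

On the balanced cycle every node settles at the mean 0.4. With the extra link, all nodes still agree, but on a weighted average of the initial values (about 0.367 here), which illustrates the bias mentioned before Lemma 2.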

Quantized caching
Because delay occurs when messages are sent between nodes, asynchrony in communications is inevitable and must be taken into account. We use a quantized caching mechanism to mask the uncertainties from asynchronous updates and varying delays. As shown in Figure 2, node $i$ receives messages with different delays (from $t - \tau\Delta t$ to $t$) and caches them. The messages sent by the neighbor node $j$ at the moment $t - \tau\Delta t$ may arrive at $t - (\tau - 1)\Delta t$ or $t - (\tau - 2)\Delta t$ due to the delay. The quantized caching mechanism caches these messages after they arrive, and node $i$ processes them at the moment $t$. The algorithm given in Table 1 is used to simulate the consensus process of the networked system.
Before the convergence analysis, we give the message overhead complexity of this consensus algorithm. Assume that the number of messages sent by all nodes over the entire network is $d$ in a period of time $\tau_{\max}$, which denotes the maximum allowed time delay $\tau$ (without packet loss). Thus, the message overhead in a round of communication is $\tau(d/\tau_{\max})$ over the entire network. With $k$ iterated rounds, the message overhead complexity is $k\tau(d/\tau_{\max})$.
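The caching step and the overhead expression can be sketched as follows. The class and function names are hypothetical, the cache is modeled simply as a window measured in time slots, and the numbers in the usage lines are illustrative rather than taken from the simulation settings.

```python
def message_overhead(rounds, cache, sent, max_delay):
    """Messages handled network-wide: k * tau * (d / tau_max)."""
    per_round = cache * (sent / max_delay)
    return rounds * per_round


class DelayCache:
    """Holds messages that arrive within the last `cache` slots before time t."""

    def __init__(self, cache):
        self.cache = cache
        self.slots = {}  # arrival delay (in slots) -> list of values

    def receive(self, value, delay):
        """Store a message that arrived `delay` slots late; drop it if too late."""
        if delay > self.cache:
            return False
        self.slots.setdefault(delay, []).append(value)
        return True

    def flush(self):
        """At moment t, hand over everything cached and empty the buffer."""
        values = [v for delay in sorted(self.slots) for v in self.slots[delay]]
        self.slots = {}
        return values


buffer = DelayCache(cache=3)
accepted = [buffer.receive(0.2, 1), buffer.receive(0.5, 3), buffer.receive(0.9, 4)]
batch = buffer.flush()                      # values processed together at time t
overhead = message_overhead(rounds=400, cache=310, sent=1000, max_delay=400)
print(accepted, batch, overhead)
```

A message that arrives later than the cache window is counted as lost in this sketch, which is the link between the cache size and the packet loss rate explored in the convergence optimization.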

Convergence analysis
In this section, we will discuss the impact of time-delay and packet loss on the convergence speed of the network.

Convergence speed
Inequality (66) in Appendix 1 indicates that the convergence speed is determined by $e^{g}$: the larger the value of $g$, the faster the algorithm converges. Now let $s^{*} := e^{g}$ with $s^{*} \in (1, s_0)$, where $s_0$ is the solution of $H(s) = 0$ for $s > 1$, as shown in Appendix 1. If $s_0$ becomes larger, $s^{*}$ can also become larger. Therefore, analyzing the convergence speed of the algorithm reduces to locating the zero point of the function $H(s)$ for $s > 1$: the larger this zero point, the faster the algorithm converges.
Based on equation (13) with $\mu = 1 - p$ and on equation (63) in Appendix 1, Lemma 1 indicates that $E[P_1] < 0$ must be satisfied to ensure the mean-square consensus of the networked system.
Introducing the control gain $K$, with $v^{*} = 2/(\lambda_2 + \lambda_{\max})$, 19 $E[P_1] < 0$ holds as long as $P$ is a positive-definite matrix. We also assume that $A = 1$, $B = b \in \mathbb{R}^{1 \times 1}$, $K = k \in \mathbb{R}^{1 \times 1}$, and that all of the digraphs are balanced. When each node state is one-dimensional, $H(s)$ can be written in terms of a scalar function $h(s)$, and the zero point of $H(s)$ is also the zero point of $h(s)$ when $s > 1$. Since we have previously proved that $H(s)$ has a unique zero in the region $s > 1$, $h(s)$ also has a unique zero there.
Combining these with the analysis of $H(s)$, we can conclude that the larger the zero point of $h(s)$ in the region $s > 1$, the faster the algorithm converges.

Impact of time delay and packet loss rate
From equation (21), we know that $h(s)$ is related to the packet loss rate $p$ and the delay $\tau$. In this subsection, we will analyze the effects of these parameters on the convergence speed of the networked system.
We first investigate the impact of packet loss rate.
Let $v := \lambda_i v^{*}$ and $y := s$. The zero point of $h(y)$ is then an implicit function of $p$, $\tau$ and $v$.
We can obtain the partial derivative $h_p(p, y)$ of $h(p, y)$ with respect to the packet loss rate. Since $y > 0$, it follows that $h_p(p, y) > 0$. Considering these, we can prove that $y'(p) < 0$, that is, the zero point of $h(y)$ decreases as the packet loss rate increases.
The impact of delay is analyzed in the same way as that of the packet loss rate. We first need to determine the sign of $y'(\tau)$, which by the implicit function theorem is $y'(\tau) = -h_{\tau}(\tau, y)/h_y(\tau, y)$, where the partial derivative of $h(\tau, y)$ with respect to $\tau$ is
$$h_{\tau}(\tau, y) = y\frac{dg}{d\tau} + f(y - 1)y^{\tau}\ln y + \frac{df}{d\tau}(y - 1)y^{\tau} \quad (28)$$
Based on equation (20), we have $h_{\tau}(\tau, y) > 0$ and hence $y'(\tau) < 0$. To sum up, the zero point of $h(y)$, and therefore the convergence speed, decreases as either the packet loss rate or the delay increases.

Convergence optimization
Assume that the maximum time delay is $\tau_{\max}$, the data arrival rate in each round of communication follows a Poisson distribution, and the neighbor nodes send $d$ messages on average at a certain time. Therefore, the average number of messages received in a round of communication is $d/\tau_{\max}$, and the parameter of the Poisson distribution is $\lambda = d/\tau_{\max}$. When the cache is set to $\tau$, the number of received messages follows a corresponding probability distribution whose expectation is $\tau(d/\tau_{\max})$. The total packet loss rate $p$ then combines the packet loss rate caused by queuing or caching with $p_{ch}$, the packet loss rate caused by channel noise.

Table 1
Input: Initial states $x_1(0), x_2(0), \ldots, x_N(0)$
Output: Alignment value $x$
1: Generate the initial states of the $N$ nodes, $x_1(0), x_2(0), \ldots, x_N(0)$
2: The $N$ nodes start their broadcasting
3: For $i = 1$ to $k$
4: Node $i$ receives messages from its neighbors
5: Node $i$ updates its state after $\tau$ based on $x_i(k + 1) = A x_i(k) + B u_i(k)$
6: Node $i$ broadcasts its new state
7: If $x_i(k) = x_j(k)$ for $\forall i, j$, the alignment is reached at round $k$

Introducing $p$ into $h(y)$, we get
$$h(y) = (2 + g)y + f \cdot y^{\tau}(y - 1) - 2 \quad (33)$$
The sign of $y'(\tau)$ is analyzed according to the existence theorem of the implicit function, $y'(\tau) = -h_{\tau}(\tau, y)/h_y(\tau, y)$. Calculating $h_y(\tau, y)$ and $h_{\tau}(\tau, y)$, respectively, we get
$$h_y(\tau, y) = 2 + g + f\tau y^{\tau - 1}(y - 1) + f y^{\tau}$$
and
$$h_{\tau}(\tau, y) = y\frac{dg}{d\tau} + f y^{\tau}(y - 1)\ln y + \frac{df}{d\tau}y^{\tau}(y - 1) \quad (37)$$
When $p$ is incorporated into $y'(\tau)$, the sign of $y'(\tau)$ cannot be determined analytically. When $\tau = 0$, the packet loss rate is 1 and the convergence speed is 0. Increasing $\tau$ effectively reduces the packet loss rate and thus improves the convergence speed. When $\tau \to \tau_{\max}$, the additional packet loss rate caused by caching (or queuing) approaches 0, so the delay effect dominates and the convergence speed decreases with the increase of $\tau$.
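The trade-off described above can be illustrated with a small sketch. It reads the stated expectation literally: if $d$ messages are sent and $\tau(d/\tau_{\max})$ are received on average, the fraction missed because of caching is $1 - \tau/\tau_{\max}$. How this combines with the channel loss is an assumption here (independent losses), and the function names and numeric values are hypothetical; this is not the article's closed-form expression.

```python
def caching_loss(cache, max_delay):
    """Fraction of messages missed because the cache window is too short."""
    if max_delay <= 0:
        raise ValueError("max_delay must be positive")
    cache = min(max(cache, 0), max_delay)   # clamp to the meaningful range
    return 1.0 - cache / max_delay


def total_loss(p_channel, p_cache):
    """Assumed combination: a message must survive both the channel and the cache."""
    return 1.0 - (1.0 - p_channel) * (1.0 - p_cache)


# Hypothetical setting: tau_max = 400 rounds, channel loss p_ch = 0.1.
TAU_MAX, P_CH = 400, 0.1
table = {tau: total_loss(P_CH, caching_loss(tau, TAU_MAX))
         for tau in (0, 100, 310, 400)}
for tau, p in table.items():
    print(tau, round(p, 4))
```

With no cache every message is lost, and with a cache as long as the maximum delay only the channel loss remains, matching the two limiting cases in the text. Between these limits, a larger cache lowers the loss but adds delay, which is why an interior optimum can exist.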
The extreme point of $y(\tau)$ can be obtained by solving the corresponding stationarity equations, and the solution is given by equation (39).

Simulation results
The RWP (Random Waypoint) mobility model is selected to simulate the movement trajectories and information exchanges of nodes moving within a circular area with a radius of 10 m. Assume that the initial state values $x_i \in \mathbb{R}^{1 \times 1}$ on each node $i$ are uniformly distributed between 0 and 1. Table 2 provides the simulation parameters. Figure 3(a)-(d) illustrates the variations of node states for $\tau = 4$, $\tau = 200$, $\tau = 400$ and $\tau = 310$, respectively. The x-axis represents the communication rounds, and the y-axis represents the state values of each node. It can be seen from Figure 3(a) that our algorithm converges just after 800 rounds, which is faster than the 1000 rounds in Figure 3(b) and the more than 1000 rounds in Figure 3(c). This reveals that the convergence speed of the consensus algorithm declines as the time delay increases. Using equation (39), we can calculate that the optimal delay setting should be 310, as shown in Figure 3(d). It can be observed from this subfigure that convergence sets in at around 400 rounds, which is much faster than the results above for both $\tau < 300$ and $\tau > 300$. Figure 4 reflects the impact of packet loss rate on the convergence speed. Therein, the x-axis represents the communication rounds, and the y-axis represents the state values of each node. Figure 4(a)-(c) illustrates the variations of node states for $p_{ch} = 0$, $p_{ch} = 0.1$ and $p_{ch} = 0.3$ when $\tau = 310$, respectively. It can be observed from Figure 4(a) that our algorithm converges just after 200 rounds, which is much faster than in Figure 4(b) and (c) for the same delay. This indicates that the convergence speed of the consensus algorithm declines as the packet loss rate increases. Therefore, the packet loss rate plays a more significant role in convergence speed than the time delay.

Conclusion
Both the time delay and the packet loss rate on each link are allowed to be non-identical and even time-varying under our control law, by employing different quantized caching policies for different nodes. FL parameter aggregation can also be achieved in an asynchronous manner by caching messages on a node for a period of time before the update. Besides, the exchange of messages between neighbor nodes proceeds by broadcasting under our control law. However, broadcasting will probably bias the convergence result away from its accurate value, which we leave for future work. The consensus conditions deduced analytically reveal that neither the time delay nor the packet loss rate affects whether the consensus protocol converges; they affect only its convergence speed. Nevertheless, the union of the sequence of directed network graphs is required to contain a directed spanning tree. As a result, caching on mobile devices plays a critical role in consensus computing. We conclude that FL participants can operate in an ad hoc manner, although the centralized operation mode cannot be replaced completely.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported by the National Natural Science Foundation of China (grant no. 61771354).

ORCID iD
Xin Yan https://orcid.org/0000-0003-3630-8173

Appendix 1
We build a Lyapunov energy function $V_1(k)$, where $P$ is a positive-definite matrix and $P^{T} = P$. 25 Substituting equation (43) into (42) gives an expression in which $T = eBK + A - I$. Since $P^{T} = P$, a cross term of the form $2[Tx(k)]^{T} P \sum^{k-1} \ldots$ appears. Due to $2x^{T}y \le \tau\|x\|^{2} + \frac{1}{\tau}\|y\|^{2}$, this term can be bounded, and we then obtain an expression in which $A_s = PT + T^{T}P + T^{T}PT$. We build another Lyapunov energy function $V_2(k)$, in which $x(j)$ is replaced with its update equation, that is
$$V_2(k + 1) = V_2(k) + x^{T}(k)(eBK)^{T}P(eBK)x(k) - \sum_{j = k + s}^{k - 1} x^{T}(j)(eBK)^{T}P(eBK)x(j)$$
Letting $V(k) = V_1(k) + V_2(k)$, we get an expression in which
$$P_1 = (\tau + 1)T^{T}PT + \tau(eBK)^{T}P(eBK)$$
Hence, it is observed that $P_p = E[P_1]$.
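The bound on $2x^{T}y$ used in this proof is a standard weighted form of Young's inequality. A short derivation, assuming real vectors $x$, $y$ and a strictly positive delay $\tau$, is sketched below for completeness.

```latex
\begin{aligned}
0 &\le \left\| \sqrt{\tau}\,x - \tfrac{1}{\sqrt{\tau}}\,y \right\|^{2}
   = \tau \|x\|^{2} - 2x^{T}y + \tfrac{1}{\tau}\|y\|^{2}, \\
\Rightarrow\quad 2x^{T}y &\le \tau \|x\|^{2} + \tfrac{1}{\tau}\|y\|^{2},
   \qquad \tau > 0 .
\end{aligned}
```

The weight $\tau$ is a free positive parameter in this inequality; choosing it equal to the delay is what allows the cross term to be absorbed into the two Lyapunov functions.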