HiCoACR: A reconfiguration decision-making model for reconfigurable security protocol based on hierarchically collaborative ant colony

Reconfigurable security protocols, with dynamic protocol configuration and flexible resource allocation, have become a state-of-the-art technology to guarantee the security of space-ground integrated network. However, reconfiguration decision-making for reconfigurable security protocols remains a major challenge in order to adapt to diverse secure service requirements and deploy higher security level but more complicated security strategies in nodes with limited resources and computing abilities. To handle this problem commendably, a hierarchically collaborative ant colony–based reconfiguration decision-making model called HiCoACR is proposed. This model, inspired by the ideas of hierarchical reinforcement learning and population collaboration, decomposes the reconfiguration decision-making problem into two sub-problems by introducing a two-level hierarchy ant colony consisting of the Explorer and the Worker. The Explorer controls directions of protocol reconfiguration and generates abstract scheduling sub-goals which are conveyed from the Worker. While the Worker schedules most suitable cryptogram resources for each sub-goal received and produces the optimal reconfiguration solution which is verified and re-optimized by a Lévy process–based stochastic gradient descent algorithm. Both the Explorer and the Worker adopt a modified version of ant colony algorithm to fulfill its targets, where a hierarchical pheromone is defined to reinforce positive behaviors of each ant colony. Experiment results suggest that HiCoACR outperforms baseline algorithms and possesses well model transferability.


Introduction
Security protocol is a primary means for guaranteeing the security of network services, such as network communication, data transmission, and authentication. However, no one security protocol could sufficiently cater to all the diverse security requirements, especially in space-ground integrated network (SGIN), where severely limited resource further decreases the possibility of fulfilling diverse security requirements and deploying higher security level but more complicated security strategies. 1 Reconfigurable secure protocol (RSP), with adaptively dynamic protocol reconfiguration and flexible cryptogram resource configuration, 1,2 offers a great possibility of alleviating resources constrains, improving resources utilization and enhancing network security in SGIN. Security protocol reconfiguration (SPR) generates protocols by determining proper protocol flow and its components, picking cryptogram resources that satisfy the functional requirements and performance indexes for each component, and then assembling these selected resources in terms of the protocol flow. Since the performance of all components determines the efficiency of the generated target protocol, one of the key issues in SPR is to determine the corresponding protocol reconfiguration flow and its components and schedule optimal cryptogram resources so as to form the target security protocol, which is defined as reconfiguration decision-making problem (RDMP) in this article.
Generally, the scales of cryptogram resources and the diversities in design standards, cryptosystem, and application situations of cryptogram resources enlarge the solution space of RDMP. What is worse, the uncertainty of protocol flow caused by reconfiguration granularity, further intensify this situation. In addition, RDMP is a combinatorial optimization problem, a nondeterministic (NP)-hard problem, where the optimal solution relies on both reconfiguration protocol flow and cryptogram resources. However, previous works mainly focus on resources scheduling and ignore the role of reconfiguration flow in RDMP. Thus, designing an accurate and efficient reconfiguration decisionmaking mechanism is of great significance to RSP.
To address this problem, a hierarchically collaborative ant colony-based reconfiguration decision-making model called HiCoACR is proposed. The HiCoACR model takes inspiration from hierarchical reinforcement learning 3,4 and collaborative behaviors of biological populations, 5 where the RDMP is decomposed into two subproblems including sub-goal generation for resource scheduling and resource scheduling for sub-goals. And a hierarchical coordinated ant colony consists of the Explorer and the Worker is introduced to address these two subproblems, respectively. In details, the Explorer controls directions of protocol reconfiguration and explores abstract resources scheduling sub-goals at lower temporal resolution in a latent state space, while the Worker schedules proper cryptogram resources for the sub-goals deriving from the Explorer and generates optimal solution for the target protocol. Both the Explorer and the Worker operate with an improved ant colony algorithm and a hierarchical pheromone is defined to reinforce the positive behaviors of the Explorer and the Worker. In addition, a Le´vy theory 6 -based stochastic gradient algorithm is adopted to verify and re-optimize the solution produced by the Explorer and the Worker so as to generate the best reconfiguration solution.
The key contributions of this article include (1) a reconfiguration decision-making model HiCoACR for reconfigurable security protocol that decouples RDMP into scheduling sub-goal generation and resources scheduling for sub-goals; (2) a hierarchical pheromone that indicates the reconfiguration policies of HiCoACR and its updating mechanism; (3) a Le´vy flight-based stochastic gradient algorithm to verify and re-optimize the optimal solution.

Related works
Naturally speaking, the core of RDMP lies in optimization scheduling, which can be classified into static scheduling and dynamic scheduling. Conventional methods for static scheduling include genetic algorithm, 7-9 particle swarm, 10,11 heuristic search, 12,13 and so on. Dynamic scheduling algorithms include load balancing, [14][15][16] fuzzy stochastic optimization, 10,17 deep reinforcement learning, 18,19 and so on. A heuristic search algorithm is adopted to find the optimal reconfiguration scheme for the target security protocol in given solution space. 20 A multi-objective evolutionary algorithm (MOEA)-based reconfigurable optimization algorithm is introduced to address energy-saving issues in reconfigurable sensor network, where a specific MOEA framework is applied to make a reasonable trade-off choice from the set of Pareto-optimal solutions according to their preferences and system requirements. 21 A quality of service (QoS)-aware service composition algorithm is proposed to handle dynamic web service composition problem by taking the advantages of both global optimization and local optimization. In this article, mixed integer programming is adopted to find the optimal decomposition of global QoS constraints into local constraints and distributed local selection is introduced to find the best web services that satisfy these local constraints. 22 A cuckoo's breeding behavior-based heuristic method and a natural gene evolution-based heuristic method are proposed to find the best solution or near best solution of service composition, where the former applies a 1-OPT (algorithm that could reaches the solution by changing a single element in the current solution) heuristic to expand the search space in a controlled way and the later uses two memory structures to avoid the stagnation in a local optimum solution, and to ensure that exploitation and exploration are properly performed. 23 A heuristic algorithm T-HEU (a trust-based heuristic algorithm) combining trust-based selection, convex hulls, and global optimization is proposed to cope with the service composition problem. In T-HEU, the trustbased selection method is used to filter untrustworthy component services, the convex hulls to reduce the search space in the process of service composition, and the heuristic global optimization approach to obtain the near-optimal solution. 24 A systematic approach based on a fuzzy linguistic preference model and an evolutionary algorithm is put forward to solve the service level agreement (SLA)-constrained service composition problems. Specifically, the weighted Tchebycheff distance is introduced first to model this problem, and a fuzzy preference model for preference representation and weight assignment is presented, two evolutionary algorithms proposed for service composition. 25 Meanwhile, RDMP may also belong to the sequential decision tasks, 26 where multi-step problems have been solved and the consequences of a step may unfold gradually over many subsequent steps and choices. Multi-objective problems have been widely examined in many areas of decision-making. 27 In general, modelfree reinforcement learning and model-based reinforcement learning are the main solutions. A FeUdal network, a novel architecture for hierarchical reinforcement learning, is proposed to solve sequential decision task such as playing computer games (such as Montezuma's revenge), which provides great inspiration for method proposed in this article. 3 All aforementioned works serve as inspiration for the computational methods developed in this article.

Formalization of RDMP
Definition 1. Reconfiguration model for SPR is defined as a quadruples RSP = (RG, RR, RE, RX ), where RG denotes reconfiguration goals, depicting functional and performance requirements of the target security protocol. RR refers to as all available cryptogram resources including cryptocards, cryptographic devices developed on field-programmable gate arrays (FPGAs), or reconfigurable processors. RE represents reconfiguration efficiency criteria for protocol performance evaluation. RS indicates reconfiguration solutions of RDMP for the given protocol. One reconfiguration solution often consists of protocol flow and corresponding cryptogram resources. Definition 2. RG denotes functional requirements and minimum performance indexes of the target protocol.
As complex protocols are formally derived from basic security components, 28 RG could be decomposed into a sub-goal set RG = frg 1 , rg 2 , . . . , rg n g, where each sub-goal can be matched to a security component-a cryptogram resource. 8rg i 2 RG, 1 ł i ł n, there exits req(rg i ) = (fc i , pm i , pf i ), where fc i , pm i , and pf i indicate functionality, interface, and performance required by sub-goal rg i , respectively. Intuitively speaking, there are multiple decomposition schemes for a given RG due to differences in reconfiguration granularity. Thus, there may exist other sub-goal sets RG 0 = frg 0 1 , rg 0 2 , . . . , rg 0 m g that satisfies RG = frg 1 , rg 2 , . . . , rg n g = frg 0 1 , rg 0 2 , . . . , rg 0 m g = RG 0 .
Definition 3. Cryptography reconfigurable cell (CRC) are independent cryptogram components which are reconfigurable in functional, structural, and temporal dimensions.
For each CRC, its attribute set is defined as a quintet CRC = (id, tp, fc, pf , pm), where id uniquely identifies a CRC, tp points out the type of CRC, such as atom CRC and compound CRC, fc depicts the functionality of CRC, pf describes the performance of CRC, including execution time, energy consumption, safety level, and so on, and pm denotes its input and output interfaces. All CRCs together constitute the total cryptogram resources RR = fcrc i i = 1, 2, . . . , N j g.
Definition 4. RE refers to overall performance of cryptogram resource or a security protocol. Efficiency of security protocol SP is recorded as RE(SP), and efficiency of a cryptogram resource crc i is recorded as RE(crc i ). Efficiency of a protocol is usually evaluated using certain efficiency model, which is described in section ''RE evaluation.'' Definition 5. Reconfiguration solution (RS) depicts the result of reconfiguration decision-making for SPR, which consist of a protocol flow and a set of cryptogram resources constituting the flow.
Based on Petri Net model, 29 RS could be formulated as RS = (State, Res; Flow). Where Flow : State $ Res denotes the flow of a target protocol, transition set Res = fc 1 , c 2 , . . . , c n g are components of the target protocol satisfying Res & RR, and all resources in Res together form the target protocol according to the Flow. Place State refers to the system state set. As there exist more than one RS satisfying RG, solution with highest RE is regarded as the best RS and recorded as RS best . Problem 1. RDMP for SPR is defined as follows: given reconfiguration resource RR = fcrc i i = 1, 2, . . . , N j g, a target security protocol rsp, and corresponding reconfiguration goal RG(rsp), RDMP is to find the optimal solution RS best = (Ŝ,T ;F) that satisfies the following conditions: where m is the size ofT and S t2T t À! F rsp indicates that all cryptogram resources inT together form the target protocol rsp according to the protocol flowF.
2. There exist a sub-goal set frg 1 , rg 2 , . . . , rg m g = RG and a one-to-one mapping f : T ! RG, where 8rg j 2 RG, there is one and only one t i 2T (i = 1, 2, . . . , n) satisfying equations , where x 1 y means that ''x is better than y,''fc(x), pm(x), pf (x) denotes functionality, interfaces, and performance of x, respectively. 3. pf (RS best ) 1 pf (rsp), which means that the overall performance of RS best is higher than the overall performance requirements of the target protocol rsp. 4. 8RS = (S, T ; F), RE(RS best ) 1 RE(RS).
According to the above definitions, RDMP is to find the best solution with highest RE under the condition of meeting basic functional requirements and performance index. Figure 1 briefly illustrates the RDMP, where dotted lines with different colors represent different reconfiguration solutions and the green dotted line is the best solution.

The HiCoACR model
HiCoACR is a two-level hierarchy architecture consisting of two modules-the Explorer and the Worker, each of which implements certain sub-tasks of reconfiguration decision-making. The Explorer points out the reconfiguration directions of protocol flow and generates scheduling sub-goal, while the Worker implements the cryptogram resources scheduling sub-tasks, where both together generate an optimal solution of RDMP coordinately.
In details, at time t, the Explorer apperceives a latent system state s t + 1 and generates a cryptogram resources scheduling sub-goal rg t + 1 (0 ł t ł K À 1), where K is the scale of sub-goals. The sub-goal rg t + 1 points out the reconfiguration directions and all frg t + 1 (0 ł t ł K À 1)g together cover the overall protocol RG, which also indicate the components of the selected protocol flow. The Worker follows the directions guided by sub-goal rg t + 1 and schedules optimal resource c 0 t + 1 for rg t + 1 according to s t + 1 . Then all components fc 0 t + 1 (0 ł t ł K À 1)g are combined to produce the candidate optimal solution. And a stochastic gradient algorithm based on Le´vy flight is conducted to verify and optimize the candidate solution so as to get the final optimal solution.
Both the sub-goal generation process in the Explorer and the resource scheduling process for sub-goal in the Worker are trained by a modified version of ant colony algorithm. Meanwhile, a hierarchical pheromone phe is introduced to reinforce the positive behaviors of the Explorer and the Worker during training and to produce reconfiguration policy during reconfiguration decision-making. In essence, the greater the RE of a RS is, the higher the concentration of phe is and the policies that the related protocol flow and resources are to be selected with higher probabilities will form.
The overall design of HiCoACR model is illustrated in Figure 2 and equations (1)-(6) depict the forward dynamics of HiCoACR model where h E resources scheduling.''Equation (3)

Sub-goal generation
The intention of the Explorer is to produce a sequence of sub-goals guiding the directions of reconfiguration. Each sub-goal represents a component of the protocol flow. As mentioned above, there may be multiple flows corresponding to the target protocol due to difference in reconfiguration granularity. Taking protocol flows in Figure 3, for instance, there are seven distinct flows between state s iÀ1 and state s i + 3 . At state s i , there are three alternative sub-goals to construct the protocol. The Explorer selects one and generates the sub-goal rg i . When the sub-goal rg i is accomplished by the Worker, state transition from s i to s i + 1 is triggered.
Given certain security protocol, its RG could be expressed as a goal vector RG 2 R k . And for each cryptogram resource c 2 RR, it will satisfy the requirements of one sub-goal. Thus, the Explorer first produces a goal vector for each cryptogram resource. All sub-goal vectors of the whole cryptogram resources could form a matrix.
To generate proper sub-goal, the Explorer adopts an ant colony algorithm with Q-learning, which combines the advantage of optimization theory, non-linear control, and reinforce learning. The Explorer takes current system state s i and RG as input and outputs the subgoal. In this algorithm, when producing sub-goals, an adaptive pseudorandom ratio selection rule is introduced at probability u 0 , and the sub-goal generation policy indicated by flow pheromone is adopted at probability 1 À u 0 . What is more, sub-goal generation policies are trained by updating flow pheromone, which is detailed in section ''Hierarchical pheromone updating.'' At each system state, the Explorer generates a sub-goal according to policies given in equations (7) and (8) where t ij (t) is the amount of pheromone deposited for the sub-goal that triggers state transition ij, and a1 ø 0 is the heuristic factor reflecting the relative importance of pheromones t ij (t) in guiding sub-goal generation process. And h ik (t) = 1=RE(res ik ) is heuristic function indicating the expectation that the sub-goal triggering state transition ij is generated, and b1 ø 1 is the expectational heuristic factor reflecting the relative importance of heuristic factor a1 in guiding sub-goal generation process. And allowed i denotes possible states set that state i may transfer to. Sub-goal generation is implemented repeatedly until the target protocol ends.

Cryptogram resources scheduling
Once receiving a sub-goal conveyed from the Explorer, the Worker first computes best cryptogram resource matching the sub-goal. And then, cryptogram resources for all sub-goals together constitute the candidate optimal solution for RDMP. Finally, candidate optimal solution is verified and optimized to obtain the final optimal solution.
Cryptogram resource matching for sub-goal. Scenario of cryptogram resource matching for a sub-goal is depicted in Figure 4, where given 8rg i (i = 1, 2, . . . , t), there may exist a cryptogram resources set RR i 2 RR that could satisfy the functional requirements and performance indexes of the sub-goal rg i . Cryptogram resource matching aims at finding the best-suitable cryptogram resource for each received sub-goal. Similarly, to address this issue, an ant colony algorithm that includes just once cryptogram resource matching is introduced. Each matching process aims at finding the preponderant cryptogram resource for current sub-goal. Rules for cryptogram resource matching are depicted in equations (9) and (10) where d 2 ½0, 1 and a random walk rule depicted in equation (9) is adopted when u ł d; otherwise, a determinate matching policy shown in equation (10) is is the amount of pheromone for cryptogram resource c i and h c i (t) is heuristic function indicating the expectation that c i matches rg i . MT (rg i ) is resources set that matching sub-goal rg i . c i is set to be the ith component of the solution after each matching process. And a2 ø 0 is the heuristic factor reflecting the relative importance of accumulated pheromones t c i (t) in guiding resources matching process, b2 ø 1 is the expectational heuristic factor reflecting the relative importance of heuristic factor a2.
where K denotes the scale of all epochs and there exists RE(RS ca 0 ) 1 RE(RS i ).
Solution verification and optimization. In case that the candidate solution is a relative maximum one rather than the absolute one, an optimal solution verification and optimization process is indispensable to guarantee that the real optimal solution is obtained. To address this issue, a Le´vy flight 6 -based stochastic gradient algorithm is employed. The main idea is to introduce the advantages of frequent short-distance steps and accidental long-distance steps in Le´vy flight into stochastic gradient policies. Assume RS ca 0 to be the initial candidate solution and X ca i to be the ith updating of RS ca 0 , the solution updating rule is formulized as follows

RS ca
where s is step factor controlling the range of optimization and s = 1 can be used usually in most cases, RS lbest denotes the historically best candidate solution, and L evy(j) provides the random step length from a Le´vy distribution For ease of calculation, equation (14) is adopted to calculate the Le´vy random number where m, n follow the normal distribution and j = 1:5 To speed the convergence process of optimal solution optimization, some solutions are discarded and substituted by new solutions with certain probability. These new solutions are generated based on random walk policy as follows where RS ca k is a discarded solution and g 0 is the step factor.
After the aforementioned solution optimization, assume RS ca n to be the optimized solution, and then the final optimal solution RS best can be obtained with equation (17) Hierarchical pheromone updating In HiCoACR, besides the proposed hierarchical coordinated ant colony algorithm, a hierarchical pheromone is defined for reinforcing the positive feedback of the Explorer and the Worker and for sharing wisdom among populations as well. In addition, hierarchical pheromone points out the reconfiguration policies of RDMP. The hierarchical pheromone is defined as follows where the upper pheromone phe f = fphe f ij i, j 2 N j g denotes pheromone on sub-goals and phe f ij is the amount of pheromone on the sub-goal triggering state transition ij, which is recorded as t ij t ð Þ in equations (7) and (8). However, the lower pheromone phe r = fphe r i i 2 N, 0\i\Num j g refers to as pheromone for cryptogram resources and phe r i is the amount of pheromone on the ith cryptogram resource, which is recorded as t c i (t) in equations (9) and (10).
Once sub-goal generating or cryptogram resource matching is finished, a pheromone updating process is triggered. Considering the difference of ant colony algorithms used in the Explorer and the Worker, different pheromone updating rules are adopted.
Pheromone updating for sub-goals. Pheromone updating for sub-goals occurs mainly in the sub-goal generation phase. As it is mentioned in section ''Sub-goal generation,'' sub-goal generation adopts the ant colony algorithm with Q-learning. Once the sub-goal is generated, the pheromone of this sub-goal is updated according to the following rule where r f 2 (0, 1) denotes the volatilization factor of protocol flow pheromone, whose value directly affects the direction controlling ability of the Explorer and convergence speed of the ant colony. Dphe f ij depicts increment of sub-goal pheromone triggering state transition ij. And, g 1 is the discount factor for computing the accumulated reward. Finally, after the model is trained, the sub-goals pheromone may form a hypergraph and could be transferred into sub-goal generation policies using equation (8).
Updating of pheromone for cryptogram resources. Pheromone updating for cryptogram resources is triggered in two occasions, namely the cryptogram resource matching phase and the solution verification and optimization phase. During the cryptogram resource matching phase, when the best-suitable cryptogram resource for each received sub-goal is found, pheromone for this best-suitable cryptogram resource will be updated. Meanwhile, when the optimal solution is generated in solution verification and optimization phase, pheromones for all cryptogram resources constituting the optimal solution will be updated. The updating rule can be formulized as follows where r r 2 (0, 1) refers to as the volatilization factor of resources pheromone, which directly affects the ratio of choosing resources as best-suitable resources that have been selected before; Dphe r c depicts the increment of pheromone on cryptogram resources c. In cryptogram resource matching phase, Dphe r c = W =RE(c). While in solution verification and optimization phase, Dphe r c = W =RE X lbest ð Þ, where X lbest is local optimal solution in an epoch and W is a constant variable. Once the hierarchical pheromone is trained, the hierarchical pheromone could be transferred into resources scheduling policies. where time, sec, cpu, memory, and energy refer to as Etime, security level, CPU consumption, memory consumption, and energy consumption, respectively. Analysis shows that these indexes can be classified into two categories, namely efficiency gain indexes and efficiency loss indexes. For instance, security level belongs to the efficiency gain indexes, where the higher the security level, the higher the RE is. While E-time and resources consumption belong to the efficiency loss indexes, where the lower the E-time and resources consumption, the higher the RE is. The goal of reconfiguration decision-making is to decrease efficiency loss and increase efficiency gain so as to find the solution with highest RE. Without loss of generality, a linear model is adopted as efficiency evaluation model to computer the RE of a security protocol or a cryptogram resource

RE evaluation
where RE(x) denotes RE of x, v ! indicates the coefficient vector of efficiency indexes, and re(x) ! refers to as efficiency vector of x. The efficiency evaluation model could be extended or modified to adapt to different practical application scenarios. Incentive function tanh (RE(x)) could also be introduced to highlight efficiency difference of two solutions. However, it is worth noting that considering the features of efficiency indexes, components ofre cannot be calculated by simply doing weighted summation while computing the RE of a security protocol. Instead, each component ofre is calculated with certain rules. In detail, when calculating value of time inre, it is equivalent to the shortest time problem in an activity on edge (AOE) network. When calculating value of cpu, memory, energy inre, the weighted summation is effective and to the point. Meanwhile, while calculating sec, the Short board effect should be the calculation basis.

Reconfiguration decision-making algorithm based on HiCoACR
Once HiCoACR is trained, the reconfiguration decision-making can be done to address the RDMP with HiCoACR. The reconfiguration decision-making algorithm takes the cryptogram resources and trained hierarchical pheromone as input, and output the optimal resources set and corresponding protocol flow for the target protocol.
In details, the algorithm begins at the initial system state s 0 . Then the Explorer generates the sub-goal according to sub-goal generation policies transferred from the sub-goal pheromone and the Worker schedules the best cryptogram resource for this sub-goal with scheduling policies derived from cryptogram resources pheromone. The Explorer and the Worker keep doing these steps until the system reaches end state s end , where all components are scheduled and could constitute the whole protocol. The algorithm procedure is depicted in Algorithm 1.
Where PolicyCalculation calculates policies with the hierarchical pheromone through equations (8) and (10). More details of function SubgoalGeneration and function ResourceMatching are depicted in sections ''Subgoal generation'' and ''Cryptogram resource matching for sub-goal,'' respectively. With this algorithm, an optimal solution for RDMP could be obtained finally.

Experiments setup
To explore the properties of HiCoACR, experiments are carried out to analyze its convergence, performance, and transferability. The subjects in our experiments include a reconfigurable cryptogram resources set and a security protocols set. Table 1 lists a simplified reconfigurable cryptogram resources set, which includes a mass of atomic and compound cryptogram resources in diverse forms such as cryptocards, cryptographic devices developed on FPGAs or reconfigurable processors, and so on. The developing platforms, performances, and other attributes are left out in Table 1. And for any one resource list in Table 1, there may be multiple physical devices with different performances corresponding to it.
Security protocols adopted to test HiCoACR are listed in Table 2, which include authentication protocols, communication protocols, and key agreement protocols. Convergence and performance experiments are mainly tested on authentication protocols, but communication protocols and key agreement protocols are utilized to demonstrate the transferability of HiCoACR. Moreover, details of these security protocols can be derived from related references.
The simulation mainly includes three procedures. The first phase is data preparing, where data such as functionality, interface, and performance of all cryptogram resources and the composition of target protocols have to be provided. In fact, we extract these data of cryptogram resources and target protocols and organize them in JSON format instead of operating physical devices which simplifies the training process. Usually, resource files with different amounts of cryptogram resources are prepared to simulate different scales of cryptogram resources.
The second phase is model adjusting, where the inputs and parameters of our algorithm are adjusted to simulate different scenarios. For instance, the scale of cryptogram resources is set to different values by inputting different resource files while testing their influences on convergence ratio. While examining the influence of pheromone volatility factors on convergence ratio, parameter r f in equation (19) and r r in equation (21) are set to different values between 0 and 1. Parameters in simulations are initialized as follows. Set a1 = 2, b1 = 2, u 0 = 0:2 for equations (7) and (8). Set a2 = 1, b2 = 4, d = 0:3 for equations (9) and (10). Set s = 1, j = 1:5, g 0 = 1:5 for equations (12)- (15). Set r f = 0:4, r r = 0:35, g 1 = 0:5, W = 10 for equations (19)- (21). Let the numbers of ants in the Explorer and the Worker be 50.
The third phase is model training, where the convergence, performance, and model transferability are analyzed. During this phase, the proposed model is executed and a best reconfiguration solution for the target protocol is generated. Meanwhile, the policies for reconfiguration decision-making are stored, which can be transferred to reconfiguration decision-making for other target protocols.

Convergence analysis under different parameters
Convergence analysis mainly focuses on the influences of cryptogram resources requirements of security protocol, total scale of cryptogram resources, the step length of global optimization, the pheromone volatility factor, and the population collaboration on the model convergence. And the stopping condition for training is that the deviation of two adjacent solutions is less than 0.01.
Resources requirements of protocol and total scale of cryptogram resources. The total size of cryptogram resources carried on nodes and the cryptogram resources requirements of security protocol directly affect the performance of the reconfiguration decision model. To verify the influences, authentication protocols listed in Table  2 are chosen as target security protocol to represent different scales of cryptogram resources requirements, and we range the total scale of cryptogram resource from 0 to 500.
The influences of cryptogram resources requirements or scale of cryptogram resources on model convergence are illustrated in Figure 5. The x-axis represents the  30 Jiang's scheme, 31 Priauth, 32 Roy-Chowdhury's scheme, 33 and Chen's scheme 34 Communication Mahmoud's scheme, 35 Khalil's scheme, 36 and RiSeG 37 Key agreement Tseng's scheme 38 and Amin's scheme 39 total scale of cryptogram resource and the y-axis denotes the convergence time. From Figure 5, it can be drawn that when cryptogram resource requirement is fixed, the larger the size of the total cryptogram resources, the slower the convergence rate. The reason is that the larger the size of the total cryptogram resources is, the bigger the searching space of RS is, which results in slower convergence speed. Thus, larger scale of cryptogram resource leads to lower model convergence rate. Meanwhile, when the value of x-axis is set to be fixed, the convergence time of Hsieh's scheme, Priauth, Jiang's scheme, Roy-Chowdhury's scheme, and Chen's scheme increase in turn, where the resources requirement increases in turn as well. That is to say that when the size of the total cryptogram resources is fixed, the greater size the resources requirement of protocol is, the slower the convergence rate is. Similarly, the reason is that the times of loops enlarge with the increase of resources requirement, which leads to slower convergence as well. Even when concurrency strategies are adopted to optimize the algorithm performance, the tendency will remain the same for the reason that the next loop depends on the pheromone updated in last loop.
Step type in solution verification and optimization phase. In solution verification and optimization of HiCoACR, a Le´vy flight-based stochastic gradient decline algorithm is adopted. The convergence rate of this stochastic gradient decline algorithm greatly influences the convergence rate of HiCoACR, where the Le´vy step plays a vital role. To analyze the influence of different step types on model convergence, the fixed step and the random step are taken as benchmarks. And L evy(j) = 1 for the fixed step and L evy(j) = 2 3 rand(1) for the random step, where rand is a random number generator.
From Figure 6, we can find that the Le´vy flight step outperforms the random step and fixed step in convergence. After deep analysis, we find that the Le´vy flightbased stochastic gradient decline algorithm adopts the Le´vy flight step which combines long step and short step, and steps will be adjusted according to the distance to the optimal solution, resulting in faster convergence speed. In addition, compared to fixed step, random step possesses faster convergence because randomness of step length increases the probability to find the best reconfiguration solution.
Volatility factor of pheromone. In HiCoACR, a hierarchical pheromone is defined to reinforce positive behaviors, which is adopted to imitate memory ability of ant colonies and is updated in each loop. Meanwhile, the pheromone volatility factor is used to simulate memory deterioration. To examine the effects of volatility factor, the total size of cryptogram resources is set to be 100 and choose Hsieh's scheme as target security protocol. As both the flow pheromone and cryptogram resource pheromone have its own volatility factor, the influence of these two volatility factors is tested separately. Value of volatility factor for resources pheromone is set to be 0.08, 0. From Figure 7, we can find intuitively that larger volatility factor possesses less convergence time, which means the larger volatility factor, the faster the convergence speed. The reason is that the volatility factor denotes forgetting probability of history interactive Figure 5. Influence of resources requirements and resource size on convergence. Figure 6. Influence of step type on convergence.
information among populations. When a pheromone gets to 0, the sub-goal generation action or matching action is forgotten, which influences the ability of solution searching. When the volatility factor gets larger, the pheromones of unvisited cryptogram resources or protocol flow which are relatively less than that of frequently visited ones will drop to 0 much faster and the probability that the visited ones are selected again will get larger, leading to the speed-up of convergence process. However, larger volatility factor leads to less diversity of reconfiguration policies due to the rapid decline of pheromone, which decreases the randomness of sub-goal generation and cryptogram resources matching.
In addition, when the volatility factor of protocol flow and cryptogram resources are set equal, the convergence time of the former is shorter than that of the latter. Because volatility factor of protocol flow not only affects the directions of reconfiguration, but also influences the resources pheromone. When the volatility factor of protocol pheromone gets larger, the probabilities that unvisited protocol flow will be selected is lower, which will lead to the pheromones of corresponding cryptogram resources in unvisited protocol flows lower than that of cryptogram resources in frequently visited protocol flows. This further increases the convergence speed. Thus, the influence of volatility factor of protocol flow pheromone is greater than that of resources pheromone. And either of these two volatility factors is set unsuitable, the collaboration relationship of the Explorer and the Worker will be broken.
Population collaboration. Population collaboration is one of the most vital mechanism to decrease the searching space of reconfiguration solutions in HiCoACR. To analyze its influence on model convergence, change the ant number in the Explorer and the Worker to adjust the degree of collaboration of the Explorer and the Worker. First, let the numbers of ants in the Worker to be 50, set numbers of ants in the Explorer to be 0, 10, 20, . . . , 100. Then let the numbers of ants in the Explorer to be 50, set numbers of ants in the Worker to be 0, 10, 20, . . . , 100. Meanwhile, set the total size of cryptogram resources to be 100 and choose Hsieh's scheme as target security protocol.
From Figure 8, when the number of ants in the Explorer or in the Worker drops to 0, the convergence time gets longer. Meanwhile, when the number of ants reaches a certain value, the convergence time is tending toward stability. The reason is that when the ant number in the Explorer or the Worker equals 0, there exists only one population and interpopulation collaboration disappears, which results in that HiCoACR degenerates to Cuckoo search algorithm when number of ants in the Explorer is 0 or degenerates to ant-Q algorithm when number of ants in the Worker equals 0. And at that moment, the convergence time approximately reaches the ceiling. In addition, according to the experiments results, number of ants in the Explorer and the Worker is more suitable to be set to about 50 in terms of the given protocol samples.

Performance analysis
Performance analysis of HiCoACR model aims to compare model performance with certain benchmark algorithms. Set size of whole cryptogram resources to be 100, take ant colony algorithm, 40 Cuckoo search algorithm, 41 and reinforcement learning 42 as benchmark algorithm and set authentication protocol in Table 2 as target security protocol.  From Table 3, it could be seen that HiCoACR model possesses less time cost and much higher accuracy, which indicates that HiCoACR outperforms given benchmark algorithms in accuracy and efficiency. According to above designation, HiCoACR model divides the reconfiguration solution searching process into two sub-process including controlling directions of protocol flow and cryptogram resources matching. Sub-goal generation phase implemented by the Explorer screens the reconfiguration solutions with higher RE and controls the directions of sub-process according to protocol flows using ant colony algorithm with Q-learning, which greatly narrows the solution space. Meanwhile, cryptogram resources scheduling phase executed by the Worker is consist of sub-goal matching, candidate solution producing, solution verification and optimization and responsible for reconfiguration solution producing and optimization. Compared to above benchmark algorithms, HiCoACR adopts the advantages of population collaboration and reinforcement learning. When the number of ants in the Explorer is 0, HiCoACR degrades into Cuckoo search algorithm. When the number of ants in the Worker is 0, HiCoACR degenerates into ant colony algorithm with Q-learning. Thus, HiCoACR outperforms the given benchmarks in performance.

Model transferability analysis
Model transferability analysis aims at verifying the transferability of trained model on different categories of security protocols. Model transferability mainly reflects in the acceleration of convergence under the condition of ensuring accuracy in reconfiguration decision-making. Thus, average speed-up ratio is adopted to assess the model transferability. And speed-up ratio is defined as the ratio of convergence time of baseline and convergence time of given model, namely where sr is speed-up ratio of a certain protocol under mode m, t b denotes convergence time of baseline, and t m refers to as convergence time of given mode m.
Meanwhile, average speed-up ratio is weighted mean of all speed-up ratio for each protocol in the test sample. Three modes are taken into consideration in this experiment including without pre-training, pre-trained with same category, and pre-trained with different categories. For mode of without pre-training, no reconfiguration policy is trained ahead of time and the pheromone of both protocol flow and cryptogram resources is initialized with 0. For mode of pre-trained with same category, reconfiguration policy has been generated according to security protocols the same category with the test samples. For mode of pre-trained with different categories, reconfiguration policy has already been generated with security protocols whose category is different with the test samples.
Set size of total cryptogram resources to be 100, take the first three authentication protocols in Table 2 as training samples and adopt rest two authentication protocols, key agreement protocols and communication protocols as test samples. Meanwhile, ant colony algorithm, Cuckoo search algorithm, and reinforcement learning are selected as benchmark algorithms as well.
Set the time cost of the mode without pre-training as baseline, we can see from Table 4 that average speed-up ratio of pre-trained with protocols in same category is larger than that of pre-trained with protocols in different categories. The reason is that the mode pre-trained with same category provides more useful information for reconfiguration than the mode pre-trained with different categories. Meanwhile, we can find that average speedup ratio of HiCoACR is higher than that of given benchmarks regardless of application modes. The reason is that both protocol pheromone and resources pheromone can provide reconfiguration policy during decision-making, which results in less time cost in HiCoACR than other benchmark algorithms. Thus, HiCoACR outperforms the given benchmark in model transferability.
In conclusion, the hierarchically collaborative ant colony-based reconfiguration proposed in this article enjoys higher convergence speed, better performance, and model transferability. However, as the ant colony algorithm has the shortcomings of long convergence time and is easy to get into the local optimal solution, our model may exist the same problem in some degree, although an adaptive pseudorandom ratio selection rule and a parallel computing mechanism are adopted in the simulation to alleviate the two shortcomings.

Conclusion
Reconfiguration decision-making is one of the key processes to reconfigurable security protocol design, which determines the performance of security protocol. This article combines the idea of hierarchical reinforcement learning and collaborative behavior of biological populations and proposes a reconfiguration decision-making model named HiCoACR based on hierarchical coordinated ant colony algorithm. It takes the advantages of reinforcement learning and population collaboration and handles the RDMP commendably. The method of this article may also be applied to decision-making problem such as military task programming, route planning. In addition, in order to eliminate the problems of long convergence time and falling into local optimal solution, we will keep on improving our model in the future study.