Identifying Missing and Spurious Interactions in Directed Networks

Recent years, the studies of link prediction have been overwhelmingly emphasizing on undirected networks. Compared with it, how to identify missing and spurious interactions in directed networks has received less attention and still is not well understood. In this paper, we make use of classical link prediction indices for undirected networks, adapt them to directed version which could predict both the existence and direction of an arc between two nodes, and investigate their prediction ability on six real-world directed networks. Experimental results demonstrate that those modified indices perform quite well in directed networks. Compared with bifan predictor, some of them can provide more accurate predictions.


Introduction
Network is an effective and efficient tool to describe realworld complex systems [1,2], such as social, biological, traffic, and information systems, where nodes represent individuals, proteins, airports, web pages, and so forth and links denote the relations and interactions between them.The study of complex networks is an important growing field that attracts lots of attention from different branches of science.While making great efforts to understand the structural features and evolutionary mechanism of networks, scientists gradually realize that the inaccuracy and incompleteness of data sets is a significant obstacle to the research [3,4].To address this issue, link prediction algorithms have been adopted to extract the missing information, identify spurious interactions, and reconstruct network.
The problem of link prediction aims at estimating the likelihood of the existence of a link between two nodes in a given network based on the observed links [5], and it has a wide range of applications in the real-world.For example, in online social networks, very likely but not-yet-existent links can be recommended as promising friendship, which can help users in finding new friends; in biological networks, compared to blindly checking all possible protein-protein interactions, accurate prediction of the most likely existent ones can dramatically reduce the experimental cost; in ecommerce, with the help of recommendation systems, sellers enhance their sales by watching customers' purchases and recommending other goods to them in which they may be interested [6]; in security domain, link prediction methods could be used to assist identifying groups of terrorist or criminals [7].
Since the link prediction problem is relevant for various domains, lots of algorithms have been proposed to solve it.The most widely used link prediction indices are the local similarity measures [8][9][10][11][12], for example, Common Neighbors, Jaccard index, Adamic-Adar index, and Resource Allocation index, which only require consideration of the local structure of the networks.They are favored for low complexity, low time consumption, and relatively high accuracy especially when the network is highly clustered [13].
Algorithm based on statistical inference is another branch in the study of link prediction, in which some organizing principles of the network structure, like hierarchical organizations or community structures, are often presupposed [7,14,15].Using Bayes theorem, people can infer the underlying structure from observed network and take advantage of the knowledge to reliably identify both missing and spurious interactions.Usually, the performance of this type of method is more accurate and robust, but it has an obvious drawbackthe high computational complexity makes it infeasible for large-scale networks.
All the algorithms mentioned above are designed for undirected networks.However, in the real-world asymmetric interactions are widespread such as worldwide web, food web, neural network, email network, and citation network.Unfortunately, there has been little research focusing on how to identify missing and spurious interactions in directed networks.Until very recently, a new mechanism for the local organization of directed networks, potential theory, has been proposed [16].Combining the potential theory with the clustering and homophily mechanisms, bifan structure which consists of 4 nodes and 4 directed links is deduced to be the most favored local motif structure and could be used directly as a well-performed missing link predictor in directed networks.
In this paper, we focus on the link prediction problem in directed networks.Our contribution is twofold: (1) Instead of proposing a brand-new directed link predictor, we make the best of classical prediction indices of undirected networks.We extend the most representative measures to directed version, to meet the requirement of predicting both the existence and direction of an arc between two nodes.(2) To investigate the performance of these modified indices, we design simulation experiments on five real biological and technological directed networks, the results of which vividly demonstrate their prediction ability.
The remainder of this paper is organized as follows.In Section 2, we describe the link prediction problem in directed networks and the standard metric for performance evaluation.Then, we present how the adapted indices work in asymmetric networks and access their prediction ability in Sections 3 and 4, respectively.Finally, the conclusion is drawn in Section 5.

Problem Description and Evaluation Metric
Given a directed network (, ), where  and  are sets of nodes and directed links, respectively, multiple links and selfconnections are not allowed.The fundamental task of a link prediction algorithm is to give a rank of all nonobserved links in the set  \ , where  is the universal set containing all ||(|| − 1) possible directed links.The mainstream method is to assign each nonobserved link a score and the one with higher score ranks ahead.The top ranked links are regarded as the most likely missing (or future) interactions.
To evaluate the algorithmic performance, the observed links  is divided into two parts: the training set   is treated as known information, while the probe set   is used for testing and no information therein is allowed to be used for prediction.Clearly,  =   ∪   and   ∩   = 0.
We adopt metric AUC [17] (area under the receiver operating characteristic curve) to quantify the prediction accuracy.It evaluates the predictor performance according to the whole list and can be interpreted as the probability that a randomly chosen missing link (i.e., a link in   ) is given a higher score than a randomly chosen nonexistent link (i.e., a link in \).Among  independent comparisons, if there are   times the missing link having a higher score and   times they are the same; the AUC value is Similarly, the link prediction algorithm can also be used to identify spurious interactions.In this case, the task is changed to score all the observed links in  and rank them in an ascending order; then the top ranked links are suspected as the most likely spurious interactions.To test the algorithmic performance,    =  ∪    is treated as the "hypothetical observed" links, where    is made up of some randomly selected nonexistent links from \.Now the AUC represents the probability that a randomly chosen spurious link (i.e., a link in ) is given a lower score than a randomly chosen real link (i.e., a link in ).
In a word, for both predicting missing links and identifying spurious links cases, the higher the AUC value is, the better the prediction algorithm works.

Directed Link Predictors
In this section, we first review some classical prediction indices for undirected network and present how to extend them to directed version.
(1) Preferential Attachment (PA).This is a basic link prediction algorithm corresponding to the preferential attachment phenomena in many real-world networks [18,19].Preferential attachment means that the more connected a node is, the more likely it is to receive new links.Nodes with higher degree have stronger ability to grab links added to the network.So PA index defines that the probability that a link will connect  and  is proportional to the number of their neighbors; that is, where  are not equal, so we can take advantage of this asymmetry to extend PA index.For an arc  →  in directed networks, the PA score could be defined as Actually, the meaning of ( 3) is quite straightforward: from the perspective of information dissemination, the more outgoing neighbors of  imply the more channels to get the message out; the more incoming neighbors  attracts, the stronger its ability to gather information, so it is more likely to build a connection from  to .
(2) Common Neighbors (CN).The common neighbors refer to the nodes that are connected with both  and .This concept is very important that nearly all the local similarity measures are based on it.Triadic closure principle in social network analysis states that if two people A and B have a common friend C, they are very likely to be friends too; thus these three nodes form a closed triangle ABC.For undirected networks, it is natural to assume that the more common neighbors two nodes share, the more similar they are and the more likely to build a connection between themselves.So the CN index is defined as It is not hard to find that  CN  = (A 2 )  , where A 2 = A ⋅ A. A is the adjacency matrix, the element of which is defined as and obviously A  = A  .However, in directed networks, adjacency matrix Ã becomes asymmetric; that is, Now the meaning of ( Ã2 )  changes, which equals the number of different two-step paths from  to ; in other words, ( Ã2 )  measures how many "transit nodes" in the network that could forward the message got from  to .The more numbers of such "transit nodes, " the more likely to build a directed arc  → .Thus the extension of CN index in directed networks could be naturally defined as As shown in Figure 1, from the analysis of geometry, CN indices could be interpreted as the number of triadic closure in undirected networks, while in directed networks this index measures the formation of feedforward structure.
In fact, (7) is not the only way to extend CN index.Other alternative indicators include the following.
From Figure 2 we can see that co-citation index and bibliographic coupling index cannot present different judgement for  →  and  → .The existence possibility for these two arcs will always be the same, so co-citation index and bibliographic coupling index are not suitable for link prediction in directed networks.And we choose feedforward structure instead of feedback structure as the indicator, because the empirical research demonstrates that the former one is more popular in real-world networks.Milo et al. [20] studied the network motifs in networks from biochemistry, neurobiology, ecology, and engineering.They found that feedforward structure appears in all the observed network.
(3) Stochastic Block Model (BM) [14].Stochastic block model is one of the most general network models, where nodes are partitioned into groups and the probability that two nodes are connected depends only on the groups to which they belong.
In undirected networks, given a block model  = (, Q), in which  is the partition of nodes into groups and the matrix Q = (  ) describes the probabilities of the linkage between groups, then the likelihood of the observed network is where A is the adjacency matrix,   is the number of links between nodes in groups  and , and   is the maximum number of such links.Using Bayes theorem, the reliability of an individual link is where P is the space of all partitions,  is the number of distinct group pairs, and  is a normalizing constant In directed networks, the interaction between two nodes are no longer reciprocal, which implies that the underlying International Journal of Distributed Sensor Networks matrix Q becomes asymmetric; that is,   determines the probability of nodes in group  linking to the ones in group , while   limits the chance of links from group  to .   and   are not necessarily the same.In addition, the links of different direction between two groups should be counted separately.Thus, the likelihood of the directed network structure is Notice that  *  now becomes the number of directed arcs from groups  to , and  *  is the corresponding maximum number of such links.Consider Among all the link prediction measures for undirected networks, the above three indices are chosen to be adapted for the following reasons.Firstly, the design philosophies of them are different from each other, and they all have their own advantages.CN index follows the nature intuition and has wide range of applications in practice.PA index reflects the mechanism of rich-get-richer and is superior for its least information requirement, while BM method takes into account network community structures, rests on solid mathematical foundations, and returns excellent results.Secondly, those three indices are fairly representative, so their modification methods are enlightening and could be extended to other indices.For example, inspired by the modified CN index, other local similarity measures can also be extended in a similar way.Table 1

illustrates a few common local similarity indices and their corresponding extensions.
To analyze the performance of those modified measures, next we introduce bifan predictor as a comparison index.Bifan structure is found to be quite widespread in real-world [20] and proven to be a well-performed predictor for directed networks.
(4) Bifan Predictor [16].The potential theory is proposed as a microscopic organizing principle for directed networks, which assumes that each directed link corresponds to a decrease of a unit potential and subgraphs with definable potential values for all nodes are preferred.Combining the clustering and homophily mechanisms with potential theory, it is deduced that bifan subgraph is the most favored local structure in directed networks (see Figure 3).This special structure could be directly used as a link predictor, with the assumption that a link that can generate more bifan subgraphs is more significant and is thus of a higher probability to appear.So the bifan index is determined by how many bifan subgraphs that link  →  could generate, which can be calculated by

Result
4.1.Data Sets.Our experiments are carried out on five realworld directed networks drawn from biological and technological fields.Those networks are pretty representative and are often employed in simulation experiments to validate the effectiveness of the proposed scheme.(i) FoodWeb1 (FW1) [21]: a network of food web represents the predator-prey relations between 69 species living in Everglades Graminoids during wet season.(ii) FoodWeb2 (FW2) [22]: a network of food web consists of 97 species living in Mangrove Estuary during wet season.(iii) FoodWeb3 (FW3) [23]: a network of food web consists of 128 species living in Florida Bay during wet season.(iv) C. elegans (CE) [24]: it is the neural network of the nematode worm C. elegans, in which an edge joins two neurons if they are connected by either a synapse or a gap junction.(vi) Political blogs (PB) [25]: it is a directed networks of hyperlinks between weblogs on US political blogs.
For convenience, we eliminate all the loops and multiedges of the above networks.The basic topological features of these five networks are summarized in Table 2.

Experimental Results.
In this section, we investigate the prediction ability of the indices presented in Section 3.
Comparisons of algorithms' accuracy are displayed in Figures 4 and 5.The -axis denotes fraction  = |  |/|| for missing interactions ( = |   |/|| for spurious interactions), ranging from 0.05 to 0.95, and the interval is set to be 0.05.The dashed line indicates the baseline accuracy when the score of each link is got by pure chance.Each AUC value is obtained by averaging 100 independent realizations, except for the PB, of which the implementation times are 50.

Comparison of Local Similarity
Measures.Figure 4 illustrates the experimental results for 10 extended local similarity indices.It is worth noting that PA index, which is regarded as the worst predictor in undirected networks, surprisingly performs quite well in the experiments.It outperforms all the other local similarity measures in identifying both missing and spurious links.Hence the relatively good prediction ability together with the least information requirement makes PA index more competitive in directed networks.
Note that when applied in finding missing links, the performance of other nine local similarity measures based on common neighbors is nearly the same, especially in CE and PB networks.It is because all these extended indices are on the basis of feedforward structure, if and only if there are some "transit nodes" between  and ; the score of arc  →  is larger than 0. So when the average shortest distance ⟨⟩ is relatively large, the score of lots of arcs will be zero for all these measures, which leads to the little difference in their prediction ability.

Comparison of Different Kinds of Measures.
In Figure 5, we choose PA and CN as representatives of local similarity measures and compare their performance with BM and bifan.It is obvious that when applied in finding missing links, BM outperforms CN and PA and works even better than bifan predictor in most networks.The advantage of BM to others is usually remarkable except for network PB, in which the performances of BM, bifan, and PA nearly keep pace with each other.This outcome demonstrates that community structure also plays an important role in directed networks.Compared with other three measures utilizing only the local structure information, BM index which takes advantage of the organizing principles of the whole network could get more accurate prediction results.
Figure 5 also shows that the prediction accuracy of all four indices decline with the increases of , but the rate of decrease is varying.Compared with the linear decrease of CN index, the trend of AUC for others is slowly downward until it reaches a "turning point, " after which the remaining links in   are not enough to infer the underlying structure of network, so the prediction accuracy drops sharply.For example, in all FoodWebs, the prediction accuracy falls dramatically after  reaches 0.75.In network PB, 0.65 could be regarded as the "turning point, " before which the AUC values for PA, BM, and bifan nearly remain unchanged, but after that the prediction accuracy drops quickly.Note that in all five networks the AUC of bifan is even worse than PA when the value of  exceeds some value.Bifan index is defined as the number of bifan subgraphs that link  →  could  generate.With the increase of , more and more links are removed from   , and it is more and more difficult to build bifan structure, which leads to the decline of discrimination for   *  →  .While the increase of  does not affect the PA index that much, so the prediction accuracy of bifan is lower than PA when lots of the links are removed.
As is illustrated in Figure 5, in identifying spurious interactions, BM and bifan are the two best predictors which are followed by PA, while CN index still has the worst overall performance as before.Note that, for each index, different from the results in finding missing links, AUC values are relatively high and are basically not changed with the increases of .However, high accuracy is not sufficient for spurious link detection algorithms.If just a few unexpected important links are incorrectly removed, the structural and dynamical properties of the network may change dramatically [26].So when applied in identifying spurious links, the prediction indices should be evaluated meticulously; however this problem is beyond the scope of this paper.

Conclusion
This paper studied how to identify both missing and spurious interactions in directed networks.We have showed how to extend classical link prediction indices to directed version, making them able to predict both the existence and direction of an arc between two nodes.Simulation experiments on five real-world directed networks have vividly demonstrated the prediction ability of those modified measures.The purpose of this paper is not to present a better directed link predictor, but to give an example of how to use the valuable knowledge of undirected networks to solve the problems in directed networks.For directed networks, the direction of links is a double-edged sword.It provides additional valuable information for link prediction but also leads to greater difficulty-how to determine the direction of the unknown links is a tough problem.The extended CN index made use of the asymmetric similarity to determine the link direction, while BM benefited from asymmetric likelihood of the linkage between groups.In brief, asymmetry is the key.However, prediction methods based solely on asymmetry are incomplete, and other natures of the directed networks should be taken into consideration.For example, in neural networks and some technological networks, information or resources are often collected by basal nodes, transmitted through directed links, and delivered to top nodes eventually.Such kind of macroscopic flow direction may help to determine the unknown link direction.
On the other hand, each prediction index has its own advantages and is suitable for different kind of networks.How to choose the most appropriate algorithm according to the features of the networks is also a problem worth study.
is the set of node 's neighbors and   = |Γ()| denotes the degree of it.In directed networks, taking the link direction into consideration, Γ out () = { | (, ) ∈ } is regarded as the set of outgoing neighbors, while Γ in () = { | (, ) ∈ } is the group of incoming neighbors.And  out  = |Γ out ()|,  in  = |Γ in ()| represents the out-degree and in-degree of node , respectively.Usually,  out  and  in

Figure 2 :
Figure 2: Alternative indicators for CN index.

Figure 3 :
Figure 3: Bifan structure consists of 4 nodes and 4 directed links.All the links are equivalent to each other and nodes are of two different potentials-the potentials of two black nodes are a unit higher than that of the grey ones.

Table 1 :
Some local similarity measures and their corresponding extensions for directed networks.

Table 2 :
The basic topological features of the five directed networks.|| and || are the number of nodes and links. in max and  out max are the maximum of in-degree and out-degree of all nodes, ⟨⟩ is the average degree of the network, ⟨⟩ is the average shortest distance between pair nodes, and  is the clustering coefficient of the directed networks.