Digraph Spectral Clustering with Applications in Distributed Sensor Validation

In many sensor networks, the performance of sensors varies significantly over time, due to changes in the surrounding environment, device hardware, and so forth. Hence, monitoring sensor status is essential in sensor network maintenance. Spectral clustering has been employed as an enabling technique to solve this problem. However, traditional spectral clustering is developed for undirected graphs, and the naive generalization to directed graphs by symmetrizing the adjacency matrix leads to a loss of network information, and thus cannot efficiently detect bad sensor nodes when applied to sensor validation. In this paper, we develop a generalized digraph spectral clustering method. Instead of simply symmetrizing the adjacency matrix, our method takes the network circulation into consideration while clustering the sensors. Extensive simulation results demonstrate that our method outperforms the traditional spectral clustering method, improving the bad node detection accuracy by 19% to 41%.


Introduction
Sensor networks, as an enabling technique, have been deployed in many scenarios that are hard for human beings to reach, for example, wilderness areas, oceans, and battlefields. These sensor networks serve the important purpose of collecting information to help people understand and monitor unreachable regions. Due to various unpredictable reasons, for example, mechanical problems, malfunctioning, damage, switching to a reduced-function mode due to low battery power, and being compromised, sensor performance may degrade over time. Thus a periodic validation of sensor status is needed. However, in many cases, it is not possible to physically reach the sensor network to find the problematic sensor nodes. Hence, a self-validation method is more practical, where sensor nodes validate their goodness by monitoring the signals received from their neighbors.
In the literature, the spectral clustering method is introduced as a key technique to identify those "bad sensor nodes" [1,2]. However, because the traditional spectral clustering algorithm only works on symmetric matrices, [1] symmetrizes the asymmetric connectivity matrix among the sensor nodes and then applies the traditional spectral clustering algorithm. This symmetrization leads to a loss of significant information about the directed connections between sensor nodes, which makes the bad node detection highly inaccurate.
An $n \times n$ square adjacency matrix $A$ represents a finite graph $G = (V, E, A)$ with $n = |V|$ vertices, where $V$ is the vertex set and $E$ denotes the collection of all directional edges. Each entry $a_{ij}$ of the adjacency matrix $A$ represents the weight (or conductance) from vertex $i$ to vertex $j$.
For undirected graphs, associated with a symmetric adjacency matrix $A$, random walk theory has been extensively studied in the literature, such as the reversible Markov chain theory. Chung and Yau [3] define a normalized Laplacian matrix with the degree matrix of the undirected graph and provide several ways to derive the discrete Green's function. The effective resistance defined in [4][5][6] is, up to a constant factor, the commute time of a random walk on the undirected graph and can be computed by the pseudoinverse of the graph Laplacian.
In contrast, for directed graphs, many of the key properties listed above no longer hold. Due to the asymmetry of the adjacency matrix $A$ in directed graphs, the in-degree and out-degree distributions are no longer equal to the stationary distribution.
International Journal of Distributed Sensor Networks
In this paper, we focus on strongly connected graphs, which correspond to irreducible Markov chains, and develop a digraph spectral clustering algorithm to solve the sensor node validation problem. The main contributions of this paper are summarized below.
(i) To our knowledge, this is the first work to investigate sensor node validation as a digraph spectral clustering problem. We develop theoretical results that introduce a digraph spectral clustering algorithm without losing the information of directed links among sensors.
(ii) By evaluating our algorithm and the traditional undirected-graph-based spectral clustering algorithm on randomly generated large-scale synthetic data, we show a significant improvement in sensor node detection accuracy.
This paper is organized as follows. Section 2 formally defines the sensor node validation problem. In Section 3, we introduce the generalized digraph spectral clustering algorithm. In Section 4, we provide extensive evaluation results that demonstrate the efficiency and effectiveness of our proposed digraph spectral clustering algorithm. We briefly introduce the related work in Section 5. We conclude our paper and outline future work in Section 6.

Problem Definition
Consider a sensor network with $n$ static sensor nodes, denoted as $V = \{v_1, \ldots, v_n\}$. We assume that the network is strongly connected; that is, every sensor $v_i$ can reach every other sensor $v_j$ in a finite number of hops. The sensors periodically ping their one-hop neighbors, where the received signal strength (RSS) values reported by a sensor to one of its neighbors correlate with the matching degree of their antenna polarizations. Thus, using the RSS as the goodness of a connection between two sensor nodes, denoted as $a_{ij}$, the weighted adjacency matrix $A = [a_{ij}]$ captures the connectivity of sensor nodes in the sensor network. We denote by $G = (V, E, A)$ the directed graph established by the sensor network, with $V$ as the set of all sensor nodes, $E$ as the set of all directed edges between sensor nodes, and $A$ as the weighted adjacency matrix.
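As a minimal sketch of the strong-connectivity assumption above, the following Python snippet checks whether every sensor can reach every other sensor in a directed, weighted adjacency matrix. The function names and the three-node example matrix are hypothetical and serve only for illustration; a positive entry is assumed to mean that a directed link exists.

```python
from collections import deque


def reachable(adj, start):
    """Return the set of nodes reachable from `start` along positive-weight edges."""
    n = len(adj)
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if adj[u][v] > 0 and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def is_strongly_connected(adj):
    """A digraph is strongly connected iff node 0 reaches all nodes
    both in the graph itself and in the graph with every edge reversed."""
    n = len(adj)
    if n == 0:
        return True
    reversed_adj = [[adj[j][i] for j in range(n)] for i in range(n)]
    return len(reachable(adj, 0)) == n and len(reachable(reversed_adj, 0)) == n


# Hypothetical 3-sensor network: entry [i][j] is the RSS-based weight from i to j.
A = [
    [0.0, 0.9, 0.0],
    [0.0, 0.0, 0.7],
    [0.4, 0.0, 0.0],
]
connected = is_strongly_connected(A)  # the directed cycle 0 -> 1 -> 2 -> 0
```

Two breadth-first searches suffice here because reaching all nodes from one node, and being reached from all nodes by that same node, together imply mutual reachability for every pair.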
Assume that properly working sensors of similar properties, such as radio and antenna characteristics, node environments, and power usage, will report similar measurements of RSS. Thus, those working sensors are called "good sensors," and other sensors that do not report proper measurements are called "bad sensors." We illustrate this terminology using a simple example, where sensors are indexed by their antenna orientations. In this case, nearby sensors are those which have similar antenna polarizations. Suppose that RSS values reported by a sensor to one of its neighbors correlate with the matching degree of their antenna polarizations. Then, nearby sensors, which are in good working condition, are expected to report similar RSS measurements on the same neighbor.
The sensor node validation problem aims to find those "bad sensors" by detecting anomalous connection patterns received from "bad sensors." To be precise, to identify bad sensors, we need to solve the problem of determining whether a sensor belongs to a cluster of nearby sensors. We assume that there are plentiful sensors so that clusters of nearby sensors will be large. A sensor is considered a bad one if it is in a small unique cluster or in a small out-of-place component of a large cluster. As illustrated in [1], this can be achieved by applying a spectral clustering algorithm. However, different from [1], in reality, the adjacency matrix $A$ is, in general, asymmetric; namely, $a_{ij}$ is not necessarily equal to $a_{ji}$. Thus the traditional spectral clustering algorithm developed for undirected graphs cannot be directly applied to solve our sensor node validation problem. In this work, we provide theoretical results to generalize the spectral clustering algorithm to the digraph case and apply our clustering algorithm to sensor node validation. In the next section, we will introduce our digraph clustering algorithm.

Digraph Spectral Clustering Algorithm
In this section, we develop a digraph spectral clustering algorithm for strongly connected digraphs by introducing the generalized objective function L. We first present the generalized objective function, followed by the design of the digraph spectral clustering algorithm.
Detailed descriptions for the extension are presented in the following section.

Design of Digraph Spectral Clustering Algorithm.
In this section, we will generalize the spectral clustering algorithm using the generalized random walk theory on digraphs.
Spectral clustering is one of the most popular modern clustering algorithms, with wide applications in distributed computing systems. For example, in [7,8], spectral clustering is used to estimate the wireless transmission cost in wireless networks.
So far, most published spectral algorithms operate on symmetric matrices. However, there are important cases in which data points have pairwise relationships that are not symmetric. Link delivery probabilities in wireless networks, economic transactions, and internet communications are often asymmetric. The commonly used approach for spectral clustering of link data is to obtain a symmetric matrix $\tilde{A}$ from the original adjacency matrix $A$ and then to apply spectral clustering techniques to $\tilde{A}$. Typical transformations used in the literature include $\tilde{A} = A + A^T$ and $\tilde{A} = A^T A$. Zhou et al. [9] design a symmetric combinatorial Laplacian matrix by symmetrizing the original unnormalized Laplacian of the digraph. There are two serious problems with these methods. First, in a digraph, given a partition $\{S, \bar{S}\}$ of $V$, the commonly used edge cuts, that is, $\mathrm{cut}(S) = \mathrm{cut}(S, \bar{S}) = \sum_{i \in S, j \in \bar{S}} a_{ij}$, are not symmetric; that is, $\mathrm{cut}(S) \neq \mathrm{cut}(\bar{S})$. These approaches cannot distinguish the directional variations in digraphs. Second, they all apply symmetrization to either the adjacency matrix or the Laplacian matrix of the digraph, and thus the results obtained in fact represent the characteristics of the transformed undirected graphs instead of the original directed graphs. We are now in a position to introduce how our generalized normalized Laplacian matrix can be used to address the above two problems in digraph-based spectral clustering algorithms.
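The two problems can be illustrated on a toy digraph. The sketch below (the four-node weight matrix and the helper names are hypothetical) shows that the ordinary edge cut depends on the side from which it is measured, and that the sum symmetrization of a digraph and of its edge-reversed counterpart coincide, so the direction information cannot be recovered after symmetrization.

```python
def cut(adj, subset):
    """Total weight of the edges leaving `subset`: sum of adj[i][j], i in subset, j outside."""
    n = len(adj)
    subset = set(subset)
    return sum(adj[i][j] for i in subset for j in range(n) if j not in subset)


def symmetrize_sum(adj):
    """Sum symmetrization: entry [i][j] becomes adj[i][j] + adj[j][i]."""
    n = len(adj)
    return [[adj[i][j] + adj[j][i] for j in range(n)] for i in range(n)]


# Hypothetical weighted digraph: a directed cycle 0 -> 1 -> 2 -> 3 -> 0.
A = [
    [0, 3, 0, 0],
    [0, 0, 2, 0],
    [0, 0, 0, 1],
    [5, 0, 0, 0],
]
# The same graph with every edge reversed.
A_rev = [[A[j][i] for j in range(4)] for i in range(4)]

cut_S = cut(A, {0, 1})      # only edge 1 -> 2 leaves {0, 1}: weight 2
cut_Sbar = cut(A, {2, 3})   # only edge 3 -> 0 leaves {2, 3}: weight 5

sym = symmetrize_sum(A)
sym_cut_S = cut(sym, {0, 1})     # 7, identical from both sides
sym_cut_Sbar = cut(sym, {2, 3})  # 7
same_after_symmetrization = symmetrize_sum(A_rev) == sym  # True
```

The directed cuts (2 versus 5) disagree, while the symmetrized graph reports a single value and cannot tell the original digraph apart from its reversal.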
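Given a strongly connected directed graph $G = (V, E, A)$, $P = D^{-1} A = [p_{ij}]$ is the transition probability matrix of the corresponding irreducible Markov chain, where $D$ is the diagonal out-degree matrix. A circulation $F$ is defined as a function that maps each directed edge $(i, j) \in E$ to a nonnegative real value, $F : E \to \mathbb{R}^{+} \cup \{0\}$, such that, for each vertex $j$,
$$\sum_{i : (i, j) \in E} F(i, j) = \sum_{k : (j, k) \in E} F(j, k).$$
In [10], Chung proves that $F_\Pi(i, j) = \pi_i p_{ij}$, where $\pi_i$ is the stationary probability of vertex $i$, is in fact a circulation function of the digraph $G$.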
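The reason $F_\Pi$ is a circulation can be verified in one line. Assuming, as in [10], that $\pi$ denotes the stationary distribution of the transition matrix $P = [p_{ij}]$ (so that $\sum_i \pi_i p_{ij} = \pi_j$), and using the fact that each row of $P$ sums to one, the in-flow and out-flow at any vertex $j$ agree:

```latex
\sum_{i} F_\Pi(i, j)
  = \sum_{i} \pi_i \, p_{ij}
  = \pi_j
  = \pi_j \sum_{k} p_{jk}
  = \sum_{k} F_\Pi(j, k).
```

Terms with $p_{ij} = 0$ contribute nothing, so the sums may equivalently be restricted to the edges of the digraph.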
The circulation essentially states that, for each random walk state, the in-flow traffic is equal to the out-flow traffic, even though the digraph edge weights are not symmetric. Therefore, given a graph partition $\{S, \bar{S}\}$ of $V$, one can easily check that the circulations between $S$ and $\bar{S}$, defined as $F_\Pi(S) = \sum_{i \in S, j \in \bar{S}} \pi_i p_{ij}$, are symmetric; that is, $F_\Pi(S) = F_\Pi(\bar{S})$. This nice property of the circulation motivates us to redefine the graph cut for directed graphs in terms of the circulation.
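The symmetry of the circulation across a partition can be checked numerically. The following is a minimal sketch under stated assumptions: the four-node weight matrix is hypothetical, the transition matrix is obtained by row normalization, and the stationary distribution is approximated by power iteration on the lazy chain (I + P)/2, which has the same stationary distribution as P but is guaranteed to be aperiodic.

```python
def transition_matrix(adj):
    """Row-normalize the weighted adjacency matrix (each row sums to one)."""
    result = []
    for row in adj:
        total = sum(row)
        result.append([w / total for w in row])
    return result


def stationary_distribution(P, iters=5000):
    """Approximate pi with pi = pi * (I + P) / 2 repeated `iters` times."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        step = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        pi = [0.5 * pi[j] + 0.5 * step[j] for j in range(n)]
    return pi


def circulation_cut(P, pi, subset):
    """Circulation leaving `subset`: sum of pi[i] * P[i][j], i in subset, j outside."""
    n = len(P)
    subset = set(subset)
    return sum(pi[i] * P[i][j] for i in subset for j in range(n) if j not in subset)


# Hypothetical strongly connected digraph with asymmetric weights.
A = [
    [0, 3, 1, 0],
    [1, 0, 2, 0],
    [0, 0, 0, 1],
    [5, 0, 0, 0],
]
P = transition_matrix(A)
pi = stationary_distribution(P)
flow_out = circulation_cut(P, pi, {0, 1})   # circulation from {0, 1} to {2, 3}
flow_back = circulation_cut(P, pi, {2, 3})  # circulation from {2, 3} to {0, 1}
```

On this example the raw edge cuts are 3 and 5, whereas `flow_out` and `flow_back` agree up to numerical error, which is the property that makes the circulation usable as a symmetric cut measure.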
(5) Use the $k$-means algorithm on the rows of $U$, and cluster $V$ into $C_1, \ldots, C_k$.
Algorithm 1: Spectral clustering algorithm for digraphs.
This is the standard form of a trace minimization problem, and the Rayleigh-Ritz theorem [12] tells us that the solution is given by choosing $U$ as the matrix which contains, as columns, the first $k$ eigenvectors corresponding to the $k$ smallest eigenvalues of L. Now, we need to convert the real-valued solution matrix back to a discrete partition. We use the standard $k$-means algorithm [13] on the rows of $U$ and get the clusters $C_1, \ldots, C_k$.
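The overall pipeline (eigenvectors of a digraph Laplacian followed by k-means on the rows) can be sketched as follows. This is an illustrative approximation, not the authors' exact procedure: since the earlier steps of Algorithm 1 are not reproduced here, the sketch substitutes Chung's symmetric digraph Laplacian [10], I - (Pi^{1/2} P Pi^{-1/2} + Pi^{-1/2} P^T Pi^{1/2})/2, for the matrix L, and uses a small hand-written k-means with a deterministic farthest-point initialization. All function names and the six-node example are assumptions.

```python
import numpy as np


def stationary_distribution(P, iters=10000, tol=1e-12):
    """Power iteration on the lazy chain (I + P)/2, which shares P's stationary distribution."""
    n = P.shape[0]
    lazy = 0.5 * (np.eye(n) + P)
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        nxt = pi @ lazy
        if np.abs(nxt - pi).sum() < tol:
            return nxt
        pi = nxt
    return pi


def chung_laplacian(A):
    """Assumed stand-in for L: Chung's symmetric digraph Laplacian."""
    P = A / A.sum(axis=1, keepdims=True)   # P = D^{-1} A
    s = np.sqrt(stationary_distribution(P))
    M = (s[:, None] * P) / s[None, :]      # Pi^{1/2} P Pi^{-1/2}
    return np.eye(len(A)) - 0.5 * (M + M.T)


def kmeans(X, k, iters=100):
    """Plain Lloyd iterations with farthest-point initialization (illustrative choice)."""
    centers = [X[0]]
    for _ in range(1, k):
        dist = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[dist.argmax()])
    centers = np.array(centers)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for c in range(k):
            if np.any(labels == c):
                new_centers[c] = X[labels == c].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def digraph_spectral_clustering(A, k):
    L = chung_laplacian(np.asarray(A, dtype=float))
    _, vecs = np.linalg.eigh(L)   # eigenvalues are returned in ascending order
    U = vecs[:, :k]               # eigenvectors of the k smallest eigenvalues
    return kmeans(U, k)


# Hypothetical network: two dense groups of three sensors joined by two weak directed links.
A = np.zeros((6, 6))
for group in ([0, 1, 2], [3, 4, 5]):
    for a in group:
        for b in group:
            if a != b:
                A[a, b] = 1.0
A[2, 3] = 0.05
A[5, 0] = 0.05
labels = digraph_spectral_clustering(A, 2)
```

On this example the two dense groups are expected to receive different labels; which group gets label 0 depends only on the initialization.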
How to choose $k$ is a general problem for all clustering algorithms. The goal is to choose the number $k$ such that the eigenvalues $\lambda_1, \ldots, \lambda_k$ of L are all very small, but $\lambda_{k+1}$ is relatively large. One proposed heuristic is to use the eigengap of L to compute $k$. We will address this problem as part of our future work.
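A minimal sketch of the eigengap heuristic mentioned above is given below. It assumes that the eigenvalues of L are already available as a list of real numbers and simply picks the number of clusters at the largest gap between consecutive sorted eigenvalues; the function name and the example spectrum are hypothetical.

```python
def choose_k_by_eigengap(eigenvalues, max_k=None):
    """Return the k that maximizes lambda_{k+1} - lambda_k over the sorted eigenvalues."""
    lam = sorted(eigenvalues)
    if len(lam) < 2:
        raise ValueError("need at least two eigenvalues to compute a gap")
    limit = len(lam) - 1 if max_k is None else min(max_k, len(lam) - 1)
    # With 0-based indexing, lam[k] - lam[k - 1] is the gap after the k smallest eigenvalues.
    gaps = [lam[k] - lam[k - 1] for k in range(1, limit + 1)]
    return gaps.index(max(gaps)) + 1


# Hypothetical spectrum: three small eigenvalues followed by a clear jump.
spectrum = [0.0, 0.02, 0.03, 1.4, 1.5, 1.6]
k = choose_k_by_eigengap(spectrum)  # 3
```

The optional `max_k` argument caps the search, which may be useful when only the smallest few eigenvalues have been computed.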

Evaluation
In this section, we conduct extensive evaluations on the performance of our proposed sensor validation algorithm.

Evaluation Settings.
We consider a sensor network with $n = 1000$ sensor nodes and randomly generate the topologies among them such that the network is strongly connected. After generating a topology, we randomly generate RSS values among sensor pairs, such that nearby sensors maintain similar antenna polarizations. Then, we randomly choose a set of $m < 1000$ sensors to be the bad nodes and significantly change their RSS values to their neighbors. Then, by applying our digraph spectral clustering algorithm and the traditional undirected-graph-based spectral clustering algorithm on the symmetrized adjacency matrix, we compare their detection accuracy. Given $m$ as the number of "bad nodes," we denote the number of correctly detected "bad nodes" as $m_d$ and use $r = m_d / m$ to evaluate the detection accuracy.
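The detection accuracy used in the evaluation, that is, the number of correctly detected bad nodes divided by the total number of bad nodes, can be computed as in the sketch below. The function name and the node identifiers in the example are hypothetical.

```python
def detection_accuracy(true_bad, detected):
    """Fraction of the truly bad nodes that appear in the detected set."""
    true_bad = set(true_bad)
    if not true_bad:
        raise ValueError("the set of bad nodes must be non-empty")
    correctly_detected = true_bad & set(detected)
    return len(correctly_detected) / len(true_bad)


# Hypothetical run: four bad nodes, three nodes flagged, two of them correctly.
accuracy = detection_accuracy({3, 7, 9, 12}, {3, 9, 20})  # 0.5
```

Note that this ratio does not penalize good nodes that are flagged by mistake (node 20 in the example), so it measures recall of the bad nodes only.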
In the evaluation, we vary the size of the bad node set from 50 to 500, and the degree of RSS change from 10% to 50%, where the degree of RSS change indicates the percentage change relative to the original RSS values. For each configuration setting, we run the simulation 100 times with randomly generated topologies and a set of randomly chosen bad nodes, to reduce the effect of randomness on the results. Below, we present our evaluation results.

Evaluation Results.
Figure 1 shows how the number of "bad" nodes affects the detection accuracy when using our digraph clustering algorithm and the traditional spectral clustering algorithm. We observe that, as the number of "bad" nodes increases, the detection accuracy decreases almost linearly for both our digraph spectral clustering algorithm and the traditional spectral clustering algorithm. This happens because the more "bad nodes" exist, the harder it is for the clustering algorithms to detect them: malfunctioning nodes may coexist and dominate a neighborhood of sensor nodes, which makes the detection harder. Overall, our method consistently outperforms the traditional method with 19% to 41% more detection accuracy.
Figure 2 shows how the change ratio on RSS affects the detection accuracy when using our digraph clustering algorithm and the traditional spectral clustering algorithm. We observe that, as the change rate on RSS increases from 0.1 to 0.5, the detection accuracy of both our digraph spectral clustering method and the traditional spectral clustering method increases, because larger changes on RSS lead to higher dissimilarity between "bad" nodes and "good" nodes. Moreover, our method achieves 22% to 34% more detection accuracy than the traditional method.

Related Work
In this paper, we develop a generalized digraph spectral clustering algorithm for sensor status validation in distributed sensor networks. The related work on sensor node status validation has been discussed in the previous sections; in this section, we primarily introduce the state-of-the-art spectral clustering methods. Spectral clustering is mainly employed in the data mining and machine learning areas [14][15][16][17][18][19], where a few studies have attempted to extend the spectral clustering algorithms to the digraph setting, for example, [7][8][9][19][20][21][22][23][24][25]. However, when applied to the sensor status validation problem, these works have two fundamental drawbacks: (1) loss of information by symmetrizing the adjacency matrix or Laplacian matrix and (2) asymmetric cuts in digraphs.

The Loss of Information by Symmetrization of 𝐴 or 𝐿.
The algorithms proposed in [19,20,24] symmetrize the adjacency matrix or Laplacian matrix, so the traditional spectral clustering algorithm (for undirected graphs) [] can be used. However, by applying symmetrization to either the adjacency matrix or the Laplacian matrix of the digraph, they in fact apply existing undirected-graph-based spectral clustering algorithms to the transformed undirected graphs, with $\tilde{A} = A^T + A$, $\tilde{A} = A^T A$, and so forth, instead of the original directed graphs. Therefore, the results lose some information from the original digraph, and the clustering obtained will not be accurate. Many papers have explicitly addressed this problem, such as [23,24].
In particular, Zhou et al. [9] define the same cut function as $\mathrm{CCut}(S, \bar{S}) = \sum_{i \in S, j \in \bar{S}} \pi_i p_{ij}$ and obtain good performance. However, the proposed algorithm still has problems. First, they use the symmetrized Laplacian as the objective function, which results in losing information from the original digraph. Second, they propose the symmetrized Laplacian and cut definition as heuristics and do not explicitly give the circulation interpretation of the algorithm.

Asymmetric Cuts in Digraphs.
The directionality of directed links is crucial information, and [23][24][25] define the cluster cut in an asymmetric fashion. In a digraph, given a partition $\{S, \bar{S}\}$ of $V$, the edge cuts defined in many papers (for instance, $\mathrm{cut}(S) = \mathrm{cut}(S, \bar{S}) = \sum_{i \in S, j \in \bar{S}} a_{ij}$) are not symmetric; that is, $\mathrm{cut}(S) \neq \mathrm{cut}(\bar{S})$. These approaches cannot distinguish the directional difference in digraphs.

Conclusion
In this paper, we propose a generalized digraph spectral clustering algorithm for validating sensor status in distributed sensor networks. The proposed DSC algorithm considers the network flow circulation while performing the sensor node clustering; thus it preserves the directed link information, which is lost in the traditional spectral clustering method. In our extensive simulations, the digraph spectral clustering algorithm achieves 19% to 41% more detection accuracy than the traditional spectral-clustering-based method.
There exist several future directions. First, we plan to explore the applicability of the digraph spectral clustering algorithm in other scenarios, for example, social network community detection. Second, when the sensor network size scales up, a real-time statistical query for the number of "bad" nodes becomes time consuming. In this case, we consider applying sampling techniques [26][27][28][29][30][31][32][33][34] to perform fast and accurate estimation. Last but not least, we are interested in applying our digraph spectral clustering method to detect node and link failures in large-scale cloud computing environments [35][36][37][38].

Figure 1: The impact of number of "bad" nodes.

Figure 2: The impact of change rate on RSS.