The Quality of Sampling from Geographic Networks

We empirically investigate the factors affecting the quality of information obtained by randomly sampling nodes from a network embedded in two-dimensional space. The motivation for this work is that wireless and other physical networks do have their nodes embedded in space, although analyses of random walks on such networks often consider only the link structure and ignore node locations. Of independent interest is the measure we propose to evaluate the quality of sampling: the rate of decrease in the area of the largest empty circle remaining.


Introduction
Sampling information from the nodes of a network is a well-studied and important problem. It has many applications in wireless sensor networks, where it may be necessary to monitor the data held by the sensor nodes (e.g., temperature sensors dispersed in a forest to detect forest fires). In very large nonphysical networks such as online social networks, random sampling is useful as a mechanism by which the social network application can gather user statistics. In the physical internet, random sampling may also be used to gather network and user statistics, which, among other applications, may help detect suspicious activity in a subnetwork. The uses and applications of sampling (random or otherwise) from networks (physical or otherwise) are too numerous to list [1][2][3][4][5].
While nonrandom and biased sampling algorithms may perform well under certain conditions, the performance of random sampling from a network, as implemented via a random walk, is indicative of the network-related factors that affect the sampling quality, and it has many advantages due to obliviousness. For example, random walks have no critical points of failure and are completely local, requiring no global information. Moreover, due to the lack of bias in the method, a random walk can be used as a kind of control to test how the network-related factors affect the sampling quality and efficiency. The mixing time of a random walk is the analytical measure of the time it takes for a random walk to reach a truly random sample. Technically, it is the worst-case time taken to reach the stationary distribution from an arbitrary starting node [5,6]. For a given network, the mixing time is a property of the edge connectivity of the network's nodes, has fundamental relationships with the network's resilience and eigenvalues, and has been well studied for many graph classes. It is well known, for example, that graphs whose edge relationships are chosen randomly, such as Erdős-Rényi and random regular graphs, exhibit optimal mixing time [7][8][9], whereas graphs whose edge relationships are determined via local properties, such as grids and random geometric graphs, exhibit poor mixing time [1,2]. Therefore, in general, the randomness of the edge connectivity can be expected to improve the sampling quality whenever the network is defined only via the edge connectivity matrix (called the adjacency matrix) and sampling quality is measured by how quickly random nodes are sampled.
However, many real-world networks are defined not solely by their adjacency matrix of edge connectivity but also by other relevant node-specific information, such as the location of a sensor node in space for a wireless sensor network, or user-specific information for an online social network. The importance of the location of a node should not be underestimated, particularly for applications in which a physical quantity must be measured by sensor nodes. For example, if sensor nodes are distributed across a forest for an application in which a forest fire must be detected via temperature and humidity measurements, the location of the nodes from which the measurements are sampled is critical. Less critical instances of the importance of node locations may include sampling opinion data from a social network, as unintentionally ignoring a large contiguous geographic region may yield inaccurate overall predictions. As many physical networks in general and sensor networks in particular exist on a two-dimensional surface (without loss of generality), in this work we treat location-based information as particularly relevant. Therefore, for geographic networks, we must reformulate the meaning of sampling quality and efficiency, as sampling a node uniformly at random may no longer be the most appropriate indicator of quality sampling of the information residing in the geographic space. Rather, as motivated above, we wish to further ensure that large contiguous regions of space are not ignored as the sampling proceeds. As such, we make the reasonable assumption that there is no point in space which is of zero importance compared to any other point, which is a weaker assumption than all points in space having equal importance for the sampling. As the measure of a finite set of (visited) points is always zero compared to the measure of the space, one must still specify which kinds of spatial regions count as unvisited regions.
Figure 1 illustrates different kinds of empty spatial regions that may be defined on the same point set. The image of Figure 1(b) illustrates why restricting to convex regions is necessary, as one may otherwise always draw a huge but nonconvex space-filling curve that avoids any given finite point set. The image of Figure 1(a) further demonstrates why considering empty regions that are similar in both dimensions is important, as it is otherwise very easy to embed large convex areas that are very long in one dimension but small in the other. Therefore, in this work, we propose a natural measure based on circular regions, as illustrated by Figure 1(c), due to the convexity and symmetry of the shape in addition to the elegant computability of the related computational geometry problem via the well-studied Delaunay triangulation [10]. We rely on the empty circle property of the Delaunay triangulation: the circumcircle of any Delaunay triangle, that is, the circle passing through its three vertices, contains no other point of the input set in its interior. Each such circle is therefore an unexplored region. We take the following as a measure of the quality of the random walk based sample: the rate at which the area of the largest empty circle (LEC) diminishes (the LEC is defined only on visited points of the sampling procedure, e.g., visited points of the simple random walk). A Delaunay triangulation with its largest empty circle is illustrated in Figure 2. Whereas measuring largest empty circles is a well-studied computational geometry problem with applications to meshes, we are, to our knowledge, the first to apply it to the measurement of sampling quality from geographic networks.
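To make the measure concrete, the following is a minimal sketch that finds the largest empty circumcircle of a small point set by brute force. It is not the Delaunay-based procedure used in the experiments: it simply tests every triple of points against the empty circle property, which is far slower but needs no geometry library. The function names and the tolerance `eps` are illustrative assumptions, and the sketch does not restrict circle centers to the sampled region.

```python
import math
from itertools import combinations


def circumcircle(a, b, c):
    """Return (center_x, center_y, radius_squared) of the circle through
    a, b, c, or None when the three points are (nearly) collinear."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy, (ax - ux) ** 2 + (ay - uy) ** 2


def largest_empty_circle_area(points, eps=1e-9):
    """Area of the largest circumcircle with no input point strictly inside.

    Brute force over all triples (assumption: small inputs only); a triple
    whose circumcircle is empty corresponds to a Delaunay triangle.
    """
    best = 0.0
    for a, b, c in combinations(points, 3):
        circle = circumcircle(a, b, c)
        if circle is None:
            continue
        ux, uy, r2 = circle
        # Empty circle property: no point lies strictly inside the circle.
        if all((px - ux) ** 2 + (py - uy) ** 2 >= r2 - eps for px, py in points):
            best = max(best, math.pi * r2)
    return best


corners = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
before = largest_empty_circle_area(corners)               # circle of radius sqrt(2)
after = largest_empty_circle_area(corners + [(1.0, 1.0)])  # circle of radius 1
```

Visiting the center point halves the largest empty area in this toy example, which is the kind of decrease the proposed measure tracks over the course of a walk.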
In addition to proposing a natural new measure of sampling quality relevant to geographically embedded networks in particular, we use our measure to compare which network-related factors affect the geographic sampling quality. Although our measure of sampling quality is explicitly geometrically defined, it is also dependent on purely graph-theoretic sampling properties, as the sampling process considered is still network dependent. Therefore, we evaluate the mixing time as measured via eigenvalues, to see how the edge connectivity structure affects the geographic sampling quality. We emphasize, however, that one novel aspect of this work is that we also measure how the node distribution structure affects the geographic sampling quality by performing experiments in which node locations are permuted while keeping edge connectivity the same. The different network types we consider are parametrized with respect to connectivity and expected mixing times, with eigenvalue measurements also explicitly represented in Table 1. Of course, given the desired applicability of our results, we focus on parametrized versions of geometrically defined graphs characteristic of purely wireless or hybrid sensor networks. In particular, we consider classes of $k$-grid graphs at one extreme, representative of the connectivity structure of purely wireless networks, and random regular graphs at the other extreme, representing randomly wired networks with good mixing properties. For the $k$-grids, we further consider (i) randomizing (i.e., randomly permuting) the node locations in space while keeping links unchanged and (ii) randomizing the links. We note that permuting node locations of the $k$-grids in space (i) is for the purpose of measuring the effect of the transformation on the quality of sampling, though neither transformation (i) nor transformation (ii) results in a realistic wireless scenario.
We finally analyze the quality of sampling for small world networks as described by Kleinberg [11], which are representative of hybrid networks [2,12,13], and consider which parametrization processes of such networks result in the best geographic sampling quality even when controlling for degree.
Thus, we are able to determine the relative effect of node distribution versus edge connectivity on the geographic sampling quality. Our results indicate that, as expected, both randomly permuting node locations and randomizing edge connectivity significantly improve the geographic sampling performance even when applied separately. The effect of the random permutation of the node locations for any fixed network was also surprisingly substantial. Regarding the relative effect of the node permutation versus the randomization of edges, a small but consistent effect is seen favoring the edge randomization towards geographic sampling quality for smaller networks. In order to determine how important the distinction in the relative effects of node distribution versus edge connectivity is, we performed the same experiments on much larger networks and found the distinction amplified. These results indicate that whereas randomizing node locations for the same network indeed has a significantly positive effect on geographic sampling quality, the edge connectivity nonetheless remains the most important factor in sampling quality even when restricted to geographical measures and geographical networks.
In this paper we expand greatly on the ideas previously introduced in [14]. Section 2 gives definitions and preliminaries on graphs, random walks, and connectivity. Section 3 gives details of the experimental setup, and Section 4 summarizes the results. Finally, conclusions are presented in Section 5.

Definitions and Preliminaries
In talking about a network, we are referring to the graph that represents it. A general graph is defined by its node set $V$ and its edge set $E$ and is usually denoted as $G = (V, E)$. For $n = |V|$, without loss of generality we may take $V = \{1, 2, \ldots, n-1, n\}$. The edge set $E \subseteq V^2$ defines the direct connectivity relationships between the nodes. We say that node $u$ neighbors node $v$, or in other words that node $u$ is adjacent to node $v$, iff there exists an edge $\{u, v\}$ in $E$. We consider undirected graphs, where the edge relationship is symmetric. The degree of a node in a graph is the number of nodes that are adjacent to it. A graph is regular if every node has the same degree, and if that degree is $d$ then we can also call such a graph $d$-regular. In the case of stochastically generated networks, it is also useful to characterize a graph which may not be regular but whose degree distribution is tightly concentrated about its average degree, and we refer to such a graph as being almost regular. A common and useful representation of a graph $G$ is its adjacency matrix $A$, which is a matrix such that $A[i, j] = 1$ iff there exists an edge $\{i, j\}$ in $E$ and $A[i, j] = 0$ otherwise.
In the case of a geographically embedded network, we may also specify the ordered set of coordinates $C \subseteq \mathbb{R}^2$. Such a network may be denoted as $G = (V, E, C)$, with $c_1 = \langle x_1, y_1 \rangle$ specifying the $x$ and $y$ coordinates of the first node, $c_2 = \langle x_2, y_2 \rangle$ the $x$ and $y$ coordinates of the second node, and so forth. Two common types of geographically embedded networks are $k$-grids and random geometric graphs. A two-dimensional $m \times m$ $k$-grid is formed by placing $m^2$ nodes exactly in integer lattice positions and directly connecting any two nodes which are at Manhattan distance at most $k$ apart. A random geometric graph is defined by parameters $n$ and $\rho$ and is formed by distributing $n$ nodes uniformly at random into the square region and connecting any two nodes which are at distance at most $\rho$ apart. Both $k$-grids and random geometric graphs are almost regular graphs, and both are also badly mixing (very much not rapidly mixing) except for significantly large average degree (i.e., degree at least as large as a constant root of $n$) [1,2].
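As a short worked check of the degrees that appear in the experiments, write $k$ for the Manhattan distance threshold of the grid (a symbol assumed here for illustration). An interior node has exactly $4j$ lattice points at Manhattan distance exactly $j$, so its degree is

```latex
\deg(k) = \sum_{j=1}^{k} 4j = 2k(k+1),
\qquad \deg(1) = 4, \quad \deg(2) = 12, \quad \deg(3) = 24 .
```

Nodes near the boundary of the lattice have fewer neighbors, which is one reason such grids are almost regular rather than regular.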
Small world graphs are constructed against a lattice according to a formula with $p$, $q$, and $r$ as parameters. The graphs begin as $k$-grids, with each node having short-range connections to all nodes within $p$ lattice steps. The small world is characterized by the addition of $q$ long-range connections per node, which are generated randomly, with probability proportional to $\ell^{-r}$, where $\ell$ is the lattice distance between the two connected points and $r$ is a fixed exponent [11]. Thus, longer random connections become less likely as $r$ increases.
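The construction just described can be sketched as follows. This is an illustrative generator under stated assumptions, not the code used for the experiments: the names `small_world`, `m`, `p`, `q`, and `r` are chosen here, the lattice is not wrapped at its boundary, and long-range contacts are drawn with replacement, so a node may occasionally gain fewer than `q` new links.

```python
import random


def manhattan(a, b):
    """Lattice (Manhattan) distance between two grid positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def small_world(m, p, q, r, seed=0):
    """Build an m x m lattice graph with local links up to distance p and
    q long-range links per node, chosen with weight distance**(-r).

    Returns a dict mapping each node (i, j) to the set of its neighbors.
    With q = 0 the result is a plain grid with distance threshold p.
    """
    rng = random.Random(seed)
    nodes = [(i, j) for i in range(m) for j in range(m)]
    adj = {v: set() for v in nodes}

    # Short-range links: every lattice point within Manhattan distance p.
    for (i, j) in nodes:
        for di in range(-p, p + 1):
            reach = p - abs(di)
            for dj in range(-reach, reach + 1):
                u = (i + di, j + dj)
                if u != (i, j) and u in adj:
                    adj[(i, j)].add(u)
                    adj[u].add((i, j))

    # Long-range links: q contacts per node, probability proportional to
    # distance**(-r); r = 0 makes every other node equally likely.
    if q > 0:
        for v in nodes:
            others = [u for u in nodes if u != v]
            weights = [manhattan(v, u) ** (-r) for u in others]
            for u in rng.choices(others, weights=weights, k=q):
                adj[v].add(u)
                adj[u].add(v)
    return adj


grid_1 = small_world(9, 1, 0, 0)       # plain grid, threshold 1
grid_2 = small_world(9, 2, 0, 0)       # plain grid, threshold 2
hybrid = small_world(9, 1, 4, 0, seed=1)  # local links plus random links
```

Because each node both chooses random contacts and is chosen by others, the average degree of the hybrid graph exceeds the local degree by roughly twice the number of contacts chosen per node.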

Random Walk and Connectivity Preliminaries.
A simple random walk on a graph is a memoryless stochastic process which starts at an arbitrary initial node $v_0$ and proceeds to a uniformly randomly chosen neighbor of the current node at each time step. For $d$-regular graphs, the random walk process is defined by a Markov chain whose transition matrix is identical to the normalized adjacency matrix, namely, $A$ multiplied by $1/d$. Such a Markov chain is called rapidly mixing if it converges to its stationary distribution, corresponding to sampling a "truly random node", in optimal, namely, asymptotically logarithmic time [5,6]. Graphs which are known to be rapidly mixing include random edge models, such as Erdős-Rényi graphs of at least logarithmic average degree and random regular graphs for any degree at least 3 [8,9]. The difference between the largest and the second largest eigenvalue (in absolute value) of a graph's normalized adjacency matrix, referred to as the spectral gap, is well known to be indicative of whether or not the graph is rapidly mixing, with larger spectral gaps pointing to better mixing properties [6]. We formalize the above in the following.
Let Markov chain $\mathcal{M} = (\Omega, P)$ correspond to the natural random walk on a graph $G = (V, E)$. For any node $v \in V$, let $d(v)$ denote the degree of $v$, that is, the number of neighbors of $v$ in $G$, and let $P(v, u) = 1/d(v)$ for $(v, u) \in E$ and $0$ otherwise. In linear algebraic terms, the process is an application of $P$ to the current distribution vector $\mathbf{v}_t$ of step $t$, where the initial distribution vector $\mathbf{v}_0$ is concentrated completely at an arbitrary node:
$$\mathbf{v}_{t+1} = \mathbf{v}_t P.$$
In such terms, the stationary distribution of $\mathcal{M}$, if such exists, is the unique probability vector $\pi$ such that
$$\pi P = \pi.$$
The stationary distribution, being a fixed point vector that remains unchanged under the operator $P$, is also the distribution to which the random walk converges, regardless of the starting point, given that $G$ is connected and nonbipartite (which is guaranteed by any odd length cycle):
$$\lim_{t \to \infty} \mathbf{v}_0 P^t = \pi.$$
Moreover, when the underlying graph is regular, the stationary distribution is the uniform distribution [15], and this statement also remains true asymptotically when $G$ is almost regular; namely, when the degree of every node is $\Theta(f(n))$ for the same function $f$, every entry of the stationary distribution is $\Theta(1/n)$. Therefore, for almost regular graphs, it is clear that the random walk samples efficiently at stationarity, and the faster the random walk on a regular graph converges to stationarity, the greater its load-balancing qualities are (this refers to how fairly each node is sampled). This rate of convergence to stationarity is called the mixing time.
To define mixing time, we must first introduce the relevant notion of distance over time. Let $x$ be the state at time $t = 0$ and denote by $P^t(x, \cdot)$ the distribution of the states at time $t$. The variation distance at time $t$ with respect to the initial state $x$ is defined to be [16]
$$\Delta_x(t) = \max_{S \subseteq \Omega} \left| P^t(x, S) - \pi(S) \right|.$$
Note that when the state space $\Omega$ is finite it can be verified that [1]
$$\Delta_x(t) = \frac{1}{2} \sum_{y \in \Omega} \left| P^t(x, y) - \pi(y) \right|.$$
Now we may formally define the mixing time as the following function [16]:
$$\tau_x(\epsilon) = \min \{ t : \Delta_x(t') \le \epsilon \text{ for all } t' \ge t \}.$$
A chain $\mathcal{M}$ is considered rapidly mixing iff $\tau_x(\epsilon)$ is $O(\mathrm{poly}(\log(n/\epsilon)))$. Clearly, as the name indicates, for a random walk to be used for efficient sampling (according to its stationary distribution), it should be rapidly mixing.
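The definitions above can be exercised on a toy graph with a few lines of code. The sketch below evolves the walk's distribution exactly, evaluates the variation distance through the half-sum form for finite state spaces, and reports the first step at which it drops to a threshold. All names (`step`, `stationary`, `variation_distance`, `mixing_time`) are illustrative assumptions, and returning the first such step relies on the standard fact that the variation distance to stationarity never increases over time.

```python
def step(dist, adj):
    """One step of the simple random walk: each node splits its probability
    mass equally among its neighbors."""
    new = {v: 0.0 for v in adj}
    for v, mass in dist.items():
        share = mass / len(adj[v])
        for u in adj[v]:
            new[u] += share
    return new


def stationary(adj):
    """Stationary distribution of the simple random walk on a connected,
    nonbipartite undirected graph: degree divided by twice the edge count."""
    two_m = sum(len(nb) for nb in adj.values())
    return {v: len(nb) / two_m for v, nb in adj.items()}


def variation_distance(dist, pi):
    """Half the L1 distance between two distributions on the same nodes."""
    return 0.5 * sum(abs(dist[v] - pi[v]) for v in pi)


def mixing_time(adj, start, eps, max_steps=10**6):
    """First step t at which the walk started at `start` is within eps of
    stationarity (None if not reached within max_steps)."""
    pi = stationary(adj)
    dist = {v: 0.0 for v in adj}
    dist[start] = 1.0
    for t in range(max_steps + 1):
        if variation_distance(dist, pi) <= eps:
            return t
        dist = step(dist, adj)
    return None


triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
# Triangle with one pendant node attached to node 0 (not regular).
pendant = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
t_mix = mixing_time(triangle, start=0, eps=0.1)
```

On the triangle the distance starts at 2/3 and halves at every step, so the threshold 0.1 is first met at step 3.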
As the stationary distribution $\pi$ is defined to be such that $\pi P = \pi$, it corresponds to the eigenvalue $\lambda_0 = 1$ of $P$. Let the eigenvalues of $P$ in decreasing order of absolute value be $1 = \lambda_0 \ge |\lambda_1| \ge \cdots \ge |\lambda_{n-1}| \ge -1$. For a finite, connected, nonbipartite Markov chain of the type considered in this work, the rate of convergence to $\pi$, which is captured by the mixing time, is governed by the difference between the first and second eigenvalues, namely, the spectral gap, which is $1 - |\lambda_1|$ [16]. The following theorems establishing these relationships imply that an inverse polylogarithmic spectral gap establishes polylogarithmic (i.e., fast) mixing time.

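Theorem 1. For an ergodic Markov chain (ergodicity is guaranteed by the chain being finite, connected, and nonbipartite, as we have in this work), the quantity $\tau_x(\epsilon)$ satisfies
$$\tau_x(\epsilon) \le \frac{1}{1 - |\lambda_1|} \left( \ln \frac{1}{\pi(x)} + \ln \frac{1}{\epsilon} \right), \qquad \max_{x \in \Omega} \tau_x(\epsilon) \ge \frac{|\lambda_1|}{2 \left( 1 - |\lambda_1| \right)} \ln \frac{1}{2\epsilon}.$$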
As we will measure mixing properties via spectral gap, the above relationships are important.
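For small graphs the spectral gap can be estimated without a linear algebra library. The sketch below is one possible approach, assumed here for illustration and not necessarily how the eigenvalues in Table 1 were obtained: it runs power iteration on the walk's transition operator after removing the component along the constant eigenvector, applying the operator twice per round so that eigenvalues of opposite sign and equal magnitude do not cause oscillation. The function name, iteration count, and seed are hypothetical.

```python
import math
import random


def spectral_gap(adj, iters=300, seed=0):
    """Estimate 1 - |lambda_1| for the simple random walk on an undirected
    connected graph given as {node: list_of_neighbors}."""
    nodes = list(adj)
    two_m = sum(len(adj[v]) for v in nodes)
    pi = {v: len(adj[v]) / two_m for v in nodes}  # stationary distribution

    def deflate(f):
        # Remove the component along the constant eigenvector (eigenvalue 1).
        mean = sum(pi[v] * f[v] for v in nodes)
        return {v: f[v] - mean for v in nodes}

    def apply_walk(f):
        # (P f)(v) is the average of f over the neighbors of v.
        return {v: sum(f[u] for u in adj[v]) / len(adj[v]) for v in nodes}

    def norm(f):
        # Norm weighted by pi, under which P is self-adjoint.
        return math.sqrt(sum(pi[v] * f[v] ** 2 for v in nodes))

    rng = random.Random(seed)
    f = deflate({v: rng.uniform(-1.0, 1.0) for v in nodes})
    lam_sq = 0.0
    for _ in range(iters):
        size = norm(f)
        if size == 0.0:
            break
        f = {v: x / size for v, x in f.items()}
        f = deflate(apply_walk(apply_walk(f)))
        lam_sq = norm(f)  # converges to |lambda_1| squared
    return 1.0 - math.sqrt(lam_sq)


k4 = {i: [j for j in range(4) if j != i] for i in range(4)}  # complete graph
c5 = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}        # 5-cycle
gap_k4 = spectral_gap(k4)
gap_c5 = spectral_gap(c5)
```

The complete graph on four nodes has gap 2/3, while the 5-cycle, whose links are purely local, has the smaller gap $1 - \cos(\pi/5)$; this mirrors the contrast between random and locally defined edge structures discussed in this section.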
Finally, we note some of the well-known mixing (or nonmixing) properties of the graphs considered in this work: $k$-grids for any constant and polylogarithmic $k$ are known to exhibit very poor mixing properties and a poor spectral gap [1,17,18]. On the other hand, random $d$-regular graphs are expanders w.h.p. for any $d \ge 3$ and are therefore rapidly mixing with an excellent spectral gap [8,9]. The mixing properties of small world graphs vary significantly, depending to a large degree on the parameters used in their construction.

Experimental Setup
The experiments were performed on networks of two sizes: 2704 nodes and 10,404 nodes. For each graph type and node location distribution considered, a random walk was performed on the network, and every 10 steps the largest empty circle (largest circular unvisited region) was calculated via Delaunay triangulation of the visited nodes. The random walk was continued until all nodes in the network were visited, and the empty circle calculations were plotted for comparison. The results can be seen in Section 4.
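The measurement loop described above can be sketched as follows. The sketch is a simplified stand-in rather than the experimental code: `measure` is a hypothetical callback that would compute the largest empty circle of the visited nodes' coordinates, and here it is replaced by a count of unvisited nodes so that the example stays self-contained.

```python
import random


def random_walk_snapshots(adj, start, measure, every=10, seed=0):
    """Run a simple random walk until every node has been visited.

    Every `every` steps (and once more at the end) record the pair
    (step, measure(visited)), where `visited` is the set of nodes seen so far.
    """
    rng = random.Random(seed)
    current = start
    visited = {start}
    history = []
    step = 0
    while len(visited) < len(adj):
        # Move to a uniformly random neighbor (sorted for reproducibility).
        current = rng.choice(sorted(adj[current]))
        visited.add(current)
        step += 1
        if step % every == 0:
            history.append((step, measure(visited)))
    if step % every != 0:
        history.append((step, measure(visited)))
    return history


# Toy network: a cycle on 12 nodes. The stand-in measure counts unvisited
# nodes; in the experiments it would be the largest empty circle area.
cycle = {i: [(i - 1) % 12, (i + 1) % 12] for i in range(12)}
history = random_walk_snapshots(cycle, start=0,
                                measure=lambda seen: len(cycle) - len(seen))
```

Plotting the second component of `history` against the first gives a curve of the same kind as those compared in Section 4, with the recorded measure never increasing as the walk proceeds.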
The graph types considered were generated as follows. Both Grid and Grid Perm are $k$-grids of two dimensions, with $k = 1$ and $k = 2$ where indicated; however, Grid Perm permutes the locations of the grid points while retaining the same link structure. Grid PermRecon, on the other hand, keeps the nodes at the grid locations (as in the $k$-grid) but chooses links randomly, maintaining almost regularity at an average degree of 4 for $k = 1$ (and 12 for $k = 2$). The latter constraint allows us to control for degree when comparing the networks.
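The two perturbations of the grid can be illustrated with a short sketch. The text does not spell out how the random links are drawn, so the second function below is only an assumption: it picks distinct node pairs uniformly at random until the target average degree is reached, and it does not check that the result is connected. The function names are hypothetical.

```python
import random


def permute_locations(pos, seed=0):
    """Grid Perm style perturbation: shuffle which node sits at which
    location while leaving the link structure untouched.

    `pos` maps node -> (x, y); the returned dict uses the same nodes and
    the same multiset of coordinates.
    """
    rng = random.Random(seed)
    nodes = list(pos)
    coords = [pos[v] for v in nodes]
    rng.shuffle(coords)
    return dict(zip(nodes, coords))


def random_links(nodes, avg_degree, seed=0):
    """Grid PermRecon style perturbation (assumed procedure): keep node
    locations and draw n * avg_degree / 2 distinct undirected edges
    uniformly at random."""
    rng = random.Random(seed)
    nodes = list(nodes)
    target = len(nodes) * avg_degree // 2
    edges = set()
    while len(edges) < target:
        u, v = rng.sample(nodes, 2)
        edges.add((min(u, v), max(u, v)))
    return edges


# 4 x 4 lattice with nodes numbered row by row.
pos = {4 * i + j: (i, j) for i in range(4) for j in range(4)}
perm_pos = permute_locations(pos, seed=1)
edges = random_links(pos, avg_degree=4, seed=1)
```

Comparing a walk on the original links with `perm_pos` against a walk on `edges` with `pos` separates the effect of node placement from the effect of edge structure, which is the comparison the experiments are designed to make.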
The graphs labeled Random DAVG4 and Random DAVG12 are constructed by choosing both point locations and links randomly, maintaining average degrees of 4 and 12, respectively. The small world graphs are labeled according to their parameters $p$ (lattice distance threshold for local links), $q$ (number of randomly chosen distant links), and $r$ (the fixed clustering exponent). Table 1 compares the eigenvalues of the networks examined. Note that Grid and Grid Perm have identical spectral gaps because both networks have identical edge connectivity (and adjacency matrices). Note further that because experiments were performed on the normalized matrices, all spectral gaps must be less than 1. The comparative results of the table are consistent with the theoretically known facts that $k$-grids exhibit poor mixing time and spectral gap whereas the random edge models have a significantly better spectral gap.

4.1. $k$-Grids, Perturbed $k$-Grids, and Random $d$-Regular Graphs.
Figure 3 plots the rate of decrease of the largest empty circle areas during the random walks, for the $k$-grid and random graphs with 2704 nodes.
All networks outperform the unperturbed $k$-grids. Interestingly, even the $k$-grid with $k = 2$ (and average degree of 12) is significantly outperformed by the location-randomized and edge-randomized models of degree 4. In particular, there is a substantial improvement in the sampling quality for the Grid Perm model over the unperturbed $k$-grid, though those two are the same graph. More surprising is the similarity in the performance of the Grid Perm model with the Random DAVG model of the same average degree. Upon closer examination of Figure 3, it is also apparent that the randomization of edge choices while maintaining node locations at regularly spaced grid points (Grid PermRecon) still consistently beats the Grid Perm model, in which edge selection is not randomized. In fact, further experiments reveal that the randomization of the edge connections yields a higher quality of sampling than randomization of the grid's node locations alone. This comparison is shown for two graphs of 10,404 nodes in Figure 4. Figure 5 shows the rate of decrease of the largest empty circle areas for $k$-grid and random graphs with 10,404 nodes. As with the 2704-node graphs, the nonrandomized $k$-grids converge at a much slower rate than all other graph types. Figure 6 represents an attempt to show more clearly the results for the $k$-grid graphs of size 2704. We show the size of the largest unsampled area (as measured via the largest empty circle) decreasing logarithmically as the number of steps increases. Here it is clear that the Grid PermRecon graphs perform the best, with $k = 2$ beating all and $k = 1$ second. In the case shown, the graphs with the worst sampling quality are, as expected, the simple Grid with $k = 1$ and, unexpectedly, the Grid Perm with $k = 1$.

Small World Graphs.
It has been shown that the mixing time of geometrically embedded graphs, such as those representing purely wireless networks, is improved by the addition of sparse random links [19,20]. This results in hybrid models characteristic of small world graphs [12,13,21]. It is of interest to verify empirically that the geographic sampling quality is also improved by varying the randomness-inducing parameters of small world graphs. For each small world graph, the degree of randomness is determined by the $q$ and $r$ parameters. Higher $q$ indicates the presence of more randomly chosen links, and lower $r$ indicates that those links will cover longer distances.
[Figure: unexplored area plotted against random walk step.]
The spectral gaps for four randomly generated small world graphs are shown in Table 1. The graphs are labeled by their $p$-$q$-$r$ values. As expected, larger spectral gaps are obtained by graphs with more random links (larger $q$) and longer random link distances (smaller $r$).
For purposes of comparison, contrasting graphs of degrees 12 and 24 were generated in different ways. The 12-degree graphs had $p$, $q$, and $r$ of 2-0-0 (all lattice connections, no random connections) and 1-4-0 (4 lattice and 4 random connections for each node). The 24-degree graphs had $p$, $q$, and $r$ of 3-0-0 (we note that this case is also a $k$-grid) and 2-6-0. As is shown in Figure 7, the 12-degree graph with more random connections exhibited even higher sampling quality than the 24-degree graph with no randomness. We note that the $r = 0$ Kleinberg model that we have taken is similar to the Watts-Strogatz small world model and allows us to test the extreme cases of adding random links versus completely geometrically defined links (where $q = 0$).
Figure 7: Small world graphs. A 24-degree graph (red) converges more slowly than a 12-degree graph (light blue) with increased randomness. A high degree does not necessarily guarantee good mixing properties.

Conclusion
We have proposed a natural new measure of sampling quality relevant to geographically embedded networks in particular and used our measure to compare what network related factors affect the geographic sampling quality. We have demonstrated use of the rate of decline of the area of the largest empty circle to evaluate the efficiency with which a geographic network can be traversed by a random walk. A high rate of decline was generally correlated with graphs with high spectral gap and low mixing time. However, even for graphs with low mixing time, randomization of node locations while retaining identical connectivity structure also significantly affected the sampling quality. A greater amount of randomness in either node placement or edge connections resulted in a higher quality of sampling from geographic networks.
We compared the quality of sampling of geographic networks by noting the rate of shrinkage of the largest circumscribed circle of the Delaunay triangulation of the sampled points. Various types of networks were used, including completely nonrandom $k$-grids, node and edge based perturbations of the $k$-grids, random networks with regular degrees, and small world networks with varying randomness and degrees.
We started by first permuting the node locations and then randomizing the edge connections of the $k$-grids, to determine the relative effect of node distribution versus edge connectivity on the quality of sampling. The addition of randomness applied either way resulted in a substantially more rapid decrease in the unexplored geographic area compared to the unperturbed $k$-grids. Generally, the edge randomization resulted in better quality sampling than node randomization, although the differences were not necessarily great. Interestingly, the differences were greater with networks of greater size.
Spectral gap information was calculated for the examined graphs and was shown to correlate with the quality of sampling. As expected, graphs with higher spectral gaps exhibited a faster rate of largest circle decline. This was particularly important with small world graphs. The graphs with the highest amount of randomness, as represented by their $q$ and $r$ parameters, resulted in a better sample that was more quickly obtained via a random walk. Such graphs also had better mixing properties, as expected. In fact, a nonrandom graph of degree 24 performed significantly worse than a randomized small world graph of only degree 12.