Game-Theoretic Camera Selection Using Inference Tree Method for a Wireless Visual Sensor Network

In a wireless visual sensor network consisting of wireless, battery-powered, and field-of-view (FoV) overlapping and stationary visual sensors, trade-offs exist between extending network lifetime and enhancing its sensing accuracy. Moreover, aggregating individual inferences from each sensor is essential to generate a globally consistent inference, because these individual inferences can be biased by noise or other unexpected conditions. Those challenges can be addressed by reducing the amount of data transmission among the sensors and by activating, in a timely manner, only a desirable camera subset for given targets. In this paper, we initialize an optimal data transmission path among visual sensors using the inference tree method, which is vital for collecting individual inferences and building a global inference. Based on the optimal data transmission path, we model the camera selection problem in a cooperative bargaining game. In this game, based on the serial dictatorial rule, camera sensors cooperatively attempt to raise the overall sensing accuracy by sequentially deciding their own mode between “sleep” and “active” in descending order of their bargaining power. Simulated results demonstrate that our proposed approach outperforms other alternatives, resulting in reduced resource overhead and improved network lifetime and sensing accuracy.


Introduction
There has been an increasing necessity to extract relevant information about multiple targets moving around inside wide areas for surveillance purposes. Moreover, these requirements must be fulfilled in a cost-efficient manner. A visual sensor is primarily equipped with an image sensing device, several processing units, communication facilities, and a set of batteries. This composition is very suitable for surveillance because of advantageous characteristics such as a wide monitoring area, rich visual information, and human-friendly data. Following the development of these inexpensive, powerful, and easily deployable visual sensors, wireless visual sensor networks (WVSN) consisting of wireless, battery-powered, and field-of-view (FoV) overlapping and stationary visual sensors have been widely employed for surveillance in public places [1,2]. Compared with other types of wireless sensors, visual sensors are impacted more by their limited bandwidth, lifespan, computation, and storage capabilities, because they contend with high-dimension data sets containing rich information generated from images [3]. Thus, it is necessary to initialize an optimal data transmission path that reduces the amount of data transmission among the sensors for a global inference, and to efficiently activate, in a timely manner, only the cameras whose collective coverage of the given targets is optimal. The latter task is referred to as camera selection (CS).
In this paper, we initialize an optimal data transmission path among visual sensors utilizing the inference tree method, which is a key component in aggregating individual inferences and building a global inference with minimized transmissions [2]. Based on the optimal transmission path, every visual sensor can exchange data with other sensors. Additionally, each sensor can autonomously switch its mode between "sleep" (in sleep mode, the sensor stops capturing data; it will continue to transmit data) and "active" using only local knowledge, during advanced target analysis beyond basic tracking. This local reasoning is feasible under the practical assumption that FoV overlapping cameras can directly communicate with each other; additionally, the view of a target is shared only between neighboring cameras. Thus, a camera's local knowledge is sufficient to measure its sensing contribution towards the global sensing accuracy, considering its neighbors' contributions. As discussed in [4,5], multiple cameras can cooperatively bargain for an optimal collective coverage. Our proposed approach utilizes the serial dictatorial rule, in which preferred cameras are prioritized to select their mode in earlier steps to achieve efficient computation. The remainder of this paper is organized as follows. Section 2 introduces and discusses several advanced CS solutions; Section 3 describes the inference tree method used to initialize the data transmission path; Section 4 models a CS problem as a cooperative game; Section 5 describes our proposed serial dictatorial rule-based solution that enables every camera to select its optimal mode using its local knowledge; Section 6 simulates and analyzes our approach with respect to several network performance metrics and resource overhead; Section 7 compares our solution with representative alternatives on the same metrics; and Section 8 concludes the paper.

Related Work
Although CS-related studies are scarce at this time, several considerable research efforts have proposed comparable solutions, mainly based on greedy selection (GS) or the potential game (PG).
The GS-based approaches select, one at a time, the camera that best satisfies the criteria of interest, until a certain number of cameras, heuristically determined in advance, has been selected. For example, a candidate camera may be selected based on the extent to which it improves the current visual hull and thus reduces occlusion [6,7]; candidate cameras may also be selected based on the degree to which their images differ from the images of already-selected cameras, with the goal of producing varied images that reduce redundancies [8]. Although this selection process may provide richer information about targets, it does not adaptively cope with the dynamics in the targets' locations, because the number of selected cameras is statically fixed.
For seamless tracking, the PG studies assign a camera to every target based on the maximum utility between the camera and the associated target. A camera's utility for a target is quantified by how large the camera can record the target and its face in stationary camera networks [4], or by the degree to which the camera can closely observe the target in active camera networks [5]. Until the set of probabilities describing how effectively each target is tracked by each camera converges, the set of utilities and the probability set are alternately updated, each influencing the other. This one-for-one selection can efficiently produce results for tracking or locating given targets; however, it may be insufficient for producing advanced target information, such as multitarget interaction analysis.
All of the discussed techniques have approached a CS solution in a centralized manner, with global knowledge of every camera and every target. To obtain such global knowledge, high bandwidth consumption will necessarily occur at a central operator in centrally controlled networks or at every camera sensor in distributed controlled networks. To avoid such dissipation, our approach employs only limited knowledge; however, it aims to perform similarly to, or improve, the previously mentioned alternatives.

Inference Tree Method for Initializing Transmission Path
A WVSN can be described as displayed in Figure 1. Figure 1(a) displays the configuration of a WVSN, Figure 1(b) illustrates a physical topology based on wireless connection reachability (or geometric proximity), and Figure 1(c) represents a logical topology based on the FoV overlapping constraint. In this paper, however, we assume that FoV overlapping cameras can directly communicate with each other. Thus, the physical topology graph can be ignored. This assumption is feasible in practical applications and enables us to focus on the data transmission and camera selection challenges.
As displayed in Table 1 [9], power consumption costs are higher for a broadcasting network compared with a unicasting network. In order to minimize data transmissions for building a global inference in a WVSN, we need to convert the logical topology from a broadcasting/multicasting network to a unicasting network. To initialize a transmission path as a unicasting network, we use the inference tree method [2]; the process is illustrated in Figure 2. When a logical topology is provided as an input, a weight value is calculated for each edge based on the amount of FoV overlapping between two visual sensors. The result of this initial step is a weighted graph. From the weighted graph, a maximum spanning tree is produced, and the center node of the maximum length path is selected as a root node. Utilizing the root node selected in the second step, the weighted graph is converted into a minimum depth tree employing a breadth-first search algorithm. The result of this step is an initial inference tree. Lastly, it is optimized with up and down actions necessary to build a balanced tree. We utilized the final resulting tree as an optimal data transmission path for the WVSN.
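As a minimal sketch of the steps above (assuming simple numeric overlap weights and omitting the final up/down balancing optimization), the tree initialization can be outlined in Python:

```python
from collections import deque

def max_spanning_tree(nodes, edges):
    """Kruskal's algorithm over descending weights; edges = [(weight, u, v)]."""
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    tree = {n: set() for n in nodes}
    for w, u, v in sorted(edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:                       # keep the edge only if it joins two components
            parent[ru] = rv
            tree[u].add(v)
            tree[v].add(u)
    return tree

def bfs_farthest(tree, start):
    """Return a farthest node from start and the path from start to it."""
    prev, queue, last = {start: None}, deque([start]), start
    while queue:
        last = queue.popleft()             # BFS pops nodes in nondecreasing distance
        for nb in tree[last]:
            if nb not in prev:
                prev[nb] = last
                queue.append(nb)
    path, node = [], last
    while node is not None:
        path.append(node)
        node = prev[node]
    return last, path[::-1]

def inference_tree(nodes, edges):
    """Max spanning tree -> center of the longest path as root -> BFS min-depth tree."""
    tree = max_spanning_tree(nodes, edges)
    end, _ = bfs_farthest(tree, nodes[0])  # one endpoint of the longest path
    _, path = bfs_farthest(tree, end)      # the longest (diameter) path itself
    root = path[len(path) // 2]            # its center node becomes the root
    parents, queue = {root: None}, deque([root])
    while queue:                           # breadth-first search gives a minimum-depth tree
        u = queue.popleft()
        for nb in tree[u]:
            if nb not in parents:
                parents[nb] = u
                queue.append(nb)
    return root, parents
```

Here `inference_tree` returns each camera's parent on the unicast transmission path, so a sensor forwards its local inference only to its parent rather than broadcasting.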
The results from our implementation are displayed in Figure 3. Figure 3(a) illustrates that there is a significant reduction in the number of data transmissions for both the leaf and internal nodes. Figure 3(b) illustrates that energy consumption for data transmissions has decreased for all nodes. Figures 3(a) and 3(b) confirm that the inference tree method is effective for initializing an optimal data transmission path in a WVSN.

Cooperative Game for CS
Consider n cameras, statically deployed, and m targets, randomly moving inside a geographical area. As previously stated, we also assume that any two neighboring cameras are able to communicate with each other if their FoVs overlap. The locations of the cameras are initially calibrated and remain fixed. The locations of the targets are updated by any object localization algorithm of [10], utilizing the most recently recorded images at each time instant, and are provided to the targets' associated cameras. Whenever new locations are provided, the expected target locations at the next time instant are also estimated by the extended Kalman filter as in [11]. At this point, every camera is aware of the set of its observable targets, that is, the targets expected to move into its FoV.
Figure 4 illustrates the parameter notations in a camera's FoV containing two targets. The safe region sr is the set of all 2D points inside the dotted-line square, where a target is observed safely enough. The unsafe distance of a target not located in sr is the distance from the center of sr to the target's location. The distinction angle of two targets is the included angle between the camera-to-target vectors of the two targets.
Letting the mode of each camera be 0 for sleep (in sleep mode, the sensor stops capturing data; it will continue to transmit data) or 1 for active, we modeled our defined CS problem in the form of a classic normal game given as ⟨player, action, utility⟩ in [12]. If the utility function is identical for every player, the players are encouraged to cooperate to maximize their shared utility [13]. These conditions are analogous to a CS whose objective, from the point of view of a game designer, is to obtain a minimal set of active cameras that achieves a sensing accuracy for the given targets equal to the accuracy provided by the entire network [14]. To enable every camera to autonomously select its mode toward this objective, we instantiate the game form with the cameras as players, their modes as actions, and the global utility, equally shared within the entire network, as the utility, quantified as follows. Given the mode set of the cameras able to observe each target, (1) quantifies the global sensing accuracy for the given targets by summing, over the targets, the extent to which each target is well observed by its associated cameras, that is, its global target utility. This value generally becomes larger as more cameras are active, but not necessarily: a small number of images may omnidirectionally cover a target, while images that are too similar, produced by closely located and similarly oriented cameras, redundantly dissipate the resources needed to transmit and process them. By capping each global target utility at 1 to restrict such redundancies, as in (3), the global target utility of a target is constructed from the individual target utilities {tu} obtained by each of its associated cameras. The target utility of a target by a camera represents the smallest likelihood that the target is sufficiently observed without any occlusion in the camera's FoV, according to the camera's mode.
This likelihood is determined utilizing three values, as in (4): the camera's mode, that is, whether or not it is active; the safety value, the degree to which the target can be safely observed by the camera; and the distinction value, the degree to which the target is occluded by other targets in the camera's FoV. As stated in Figure 4, a target can be safely observed if it is located in the camera's safe region sr; otherwise, its safety value decreases inversely as its unsafe distance becomes greater, as in (5). To measure the degree to which a target is occluded in a camera's FoV, we define the distinction probability between the images of two targets simultaneously taken by the camera as in (6), whose inputs are the distinction angle of the two targets defined in Figure 4 and a scaling factor, because a greater distinction angle results in a greater reduction of the occlusion between the two targets. Under these definitions, both the global target utility and each target utility take values between 0 and 1, indicating that even a single camera can provide a thorough observation of a target if the camera safely monitors the target without occlusion, as we have assumed.
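The utility structure of (3)-(7) can be sketched numerically as follows; the 1/(1 + ud) safety falloff and the linear, 90-degree-normalized distinction probability are our illustrative assumptions (chosen so that a scaling factor of 2 saturates at a 45-degree distinction angle, matching our simulation settings), not the paper's exact forms:

```python
def safety(in_safe_region, unsafe_distance):
    """(5): full safety inside the safe region sr; outside, the value decays
    inversely with the unsafe distance (the 1/(1 + ud) form is assumed)."""
    return 1.0 if in_safe_region else 1.0 / (1.0 + unsafe_distance)

def distinction(angle_deg, s=2.0):
    """(6): distinction probability for two targets separated by angle_deg in a
    camera's FoV; the scaling factor s saturates the value at 90/s degrees."""
    return min(1.0, s * angle_deg / 90.0)

def target_utility(mode, in_sr, ud, angles_to_others, s=2.0):
    """(4): smallest likelihood that a target is observed without occlusion."""
    if mode == 0:                          # a sleeping camera contributes nothing
        return 0.0
    occ = min((distinction(a, s) for a in angles_to_others), default=1.0)
    return safety(in_sr, ud) * occ

def global_target_utility(per_camera_tus):
    """(3): capped at 1 to discourage redundant, near-identical views."""
    return min(1.0, sum(per_camera_tus))

def camera_utility(target_utils):
    """(7): a camera's contribution, summed over its observable targets."""
    return sum(target_utils)
```

For instance, under these assumed forms an active camera whose target sits outside sr at unsafe distance 1, with a 22.5-degree distinction angle to an occluder, yields a target utility of 0.5 × 0.5 = 0.25.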
We now discuss the extent to which an individual camera contributes to the global utility. The camera utility of a camera evaluates the degree of its contribution to observing its set of observable targets, according to its mode, by summing the target utilities of every target in that set, as in (7). The equations from (4) to (7) reflect the following characteristics that only camera sensors possess.
(i) Every camera can be assumed to have, in its FoV, its own safe region where any target is observable in sufficient detail to give desirable information with minimal distortion.
(ii) Owing to the 3D-to-2D projection of imaging, the occlusion among multiple targets in a camera's FoV obstructs the extraction of the targets' information [3,4].
(iii) The interaction analysis among multiple targets provides more relevant information about the targets beyond their locations [15].
While (i) and (ii) are, respectively, represented by (5) and (6) within (4), more relevant information is provided by cameras with a higher camera utility, which becomes greater as the camera observes more targets or as its associated target utilities grow, as in (7). Subsequently, every camera selects its mode to cooperatively maximize its payoff, the global utility, with the least number of active cameras while satisfying condition (3). The mode selection process, discussed in the following section, is greatly enhanced by taking the camera utilities into account.

Serial Dictatorial Rule-Based Bargaining Solution
According to [13], the serial dictatorial rule is a sequence of dictatorial rules conducted by individual players whose exercising order is statically arranged by their bargaining powers. By evaluating the camera utilities provided in (7), we consider that a camera has greater bargaining power if it observes a greater number of targets, in a less occluded manner, in the corresponding safe region. Given that all cameras possessing greater bargaining power have already determined their modes and a camera must presently select a mode, it will select the mode maximizing its payoff, according to the dictatorial rule, under the assumption that the other cameras that have not yet determined their modes are in sleep mode [14]. This bargaining process is serially performed until every camera determines its mode while communicating with neighboring cameras as follows.

Order Cameras by Their Camera Utilities.
Given the estimated locations of its observable targets, every camera computes the target utilities of each of those targets and its own camera utility, assuming it is active, and subsequently transmits them to its neighboring cameras. Thereafter, each camera obtains its position in the dictatorial ordering list of itself and its neighbors, arranged in descending order of their camera utilities, while initializing the mode set for the list to all-sleep.

Select the Current Mode.
Prior to mode selection, every camera waits for all the modes of its more bargaining-powerful neighbors to be announced, updating its mode set accordingly, if it is not the first on the list. Otherwise, it instantly decides its mode by (8), which, for each of its observable targets, evaluates the target's associated mode subset with only the deciding camera's mode varied. A camera will decide to be active only if its contribution improves the total of the global target utilities over its observable targets. Specifically, a camera covers its FoV only if at least one target in its FoV is not yet sufficiently observed, indicating that it will not be responsible for targets already well covered by other cameras. Subsequent to this mode decision, the camera announces its selected mode; accordingly, every neighbor updates its mode set with the new mode and, for rapid computation, drops every target whose global target utility already reaches 1 under the updated mode set. This local reasoning with limited knowledge soundly and completely maximizes the global utility, per the following theorem.
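The bargaining loop can be sketched as follows; for illustration we assume a single global ordering and a caller-supplied `marginal_gain` function standing in for (8), rather than the truly distributed per-neighborhood exchange:

```python
def serial_dictatorial_modes(camera_utilities, marginal_gain):
    """camera_utilities: {camera_id: cu} used as bargaining power.
    marginal_gain(cam, modes): increase of the summed global target utilities
    over cam's observable targets if cam turns active, given the modes decided
    so far (undecided cameras are assumed asleep, per the dictatorial rule).
    Returns the bargained mode set {camera_id: 0 or 1}."""
    order = sorted(camera_utilities, key=camera_utilities.get, reverse=True)
    modes = {c: 0 for c in order}          # undecided cameras default to sleep
    for cam in order:                      # descending bargaining power
        if marginal_gain(cam, modes) > 0:  # per (8): activate only if it helps
            modes[cam] = 1
    return modes

# Toy example: two cameras share one target; 'A' alone saturates the capped
# global target utility, so 'B' stays asleep (all numbers are hypothetical).
def gain(cam, modes):
    contrib = {"A": 1.0, "B": 0.6}
    covered = sum(contrib[c] for c, m in modes.items() if m == 1)
    return min(1.0, covered + contrib[cam]) - min(1.0, covered)
```

Running `serial_dictatorial_modes({"A": 0.9, "B": 0.4}, gain)` activates only camera A, mirroring how a less powerful camera declines responsibility for an already well-covered target.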

Theorem 1. For every camera i with its mode a_i, it always holds that gu(a*) ≥ gu(a*_{−i}), where gu is the global utility, a* is the bargained mode set of every camera, −a_i is the opposite mode of a_i, a*_{−i} is the mode set where only a_i is replaced by −a_i in a*, and a_{i,j} is camera i's assumed mode set of itself and its neighbors associated with target j during the bargaining process.
Proof. Equation (10) holds by the definition of the global utility, and (11) is derived because a change of camera i's mode affects only the global target utilities of the targets it can observe. As the bargained mode set and its mode-flipped counterpart can, respectively, be restated in terms of camera i's assumed mode sets, we must demonstrate that (12) always holds for every target of camera i for our claim. To make the claim easier to follow, we restate the target argument as an if-then rule: by the nature of our bargaining process, each camera's payoff under the bargained mode set is always at least its payoff under the mode-flipped counterpart; given this condition, (12) also always holds for every camera.
Subsequently, we verify this claim for both its soundness and completeness.
Soundness. When a camera decides to be active while assuming that every less bargaining-powerful neighboring camera sleeps, it believes that it can improve the global target utility of at least one of its observable targets. Let us consider the set of targets whose global target utilities are actually raised by this camera. The difference resulting from the camera's mode change is given by (13). Because any of the less bargaining-powerful neighbors could become active and contribute to the improvement of the concerned global target utilities, the bargained mode set is likely to contain an equal or greater number of active cameras than the assumed mode set. Thus, (12) is valid. In addition, (13) is always greater than or equal to 0 by the definition of the target utility, which leads to the conclusion that the global utility of the bargained mode set is at least that of the mode-flipped counterpart.
Completeness. We demonstrate that a bargained mode set satisfying (12) always exists under our bargaining process.
Consider any bargained mode set in which every mode is deterministically decided by its associated camera at its turn, without actual knowledge of the entire set. For every camera, regardless of its decided mode, the entire bargained mode set for any target will have at least as many active cameras as the camera's assumed mode set, because each camera assumes that its less bargaining-powerful neighbors adopt the sleep mode, as discussed above. Therefore, the difference resulting from a mode change in the bargained mode set is always equal to or greater than that in the assumed mode set. In summary, the bargained mode set can be efficiently and deterministically obtained by (8), and any such set satisfies (12) for every camera.
Therefore, the mode selected by (8) with limited knowledge for every camera optimizes the global utility, by Theorem 1.

Simulations and Analyses of Our Approach
In this section, we evaluate our proposed serial dictatorial rule-based game for CS utilizing several network performance metrics in different simulations and quantitatively analyze the complexity required to achieve each step of our approach's design.

Simulated Performance Analysis.
Our proposed approach has been simulated in the following multicamera and multitarget environment.

(ii) Every camera is provided with its own safe region and FoV center point in advance.
(iii) The scaling factor for every camera is fixed to 2, because we assume that any two targets having a distinction angle of 45 or more degrees in the camera's FoV can be perfectly and separately monitored.
(iv) Every camera is initially charged with 1000 power bars and only dissipates a bar per time instance if it is active.
(v) The maximum neighboring density for every camera is set to 7 as in Figure 5(a) or 3 as in Figure 5(b).
(vi) In a scenario, 10 or 20 targets freely move inside the area at speeds below 15 cells per time instance, until the network lifetime has expired.
(vii) The target locations at the previous time instance obtained by any image-based localization of [10] are always announced to their associated cameras. After receiving this location information, every camera computes the expected locations of its observable targets at the current time instance by the extended Kalman filter as in [11]. To add authenticity to the simulation, we implement imperfect localization by adding a small amount of noise to our target localization.
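As a sketch of step (vii), a linear constant-velocity Kalman predict step stands in for the extended Kalman filter of [11]; the diagonal covariance update and the process-noise level q are illustrative assumptions:

```python
def kalman_predict_1d(pos, vel, p_var, v_var, dt=1.0, q=0.1):
    """Constant-velocity Kalman predict for one axis of a target's cell location.
    State is (pos, vel) with assumed-diagonal covariance (p_var, v_var);
    q is an assumed process-noise level modeling imperfect localization."""
    new_pos = pos + vel * dt                   # x' = F x with F = [[1, dt], [0, 1]]
    new_p_var = p_var + dt * dt * v_var + q    # var(pos + dt*vel) for independent terms
    new_v_var = v_var + q
    return new_pos, vel, new_p_var, new_v_var
```

A camera would run this once per axis at each time instance to estimate where each observable target will be next, with the growing variance reflecting the added localization noise.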
To focus on the network performance of interest, we vary only the neighboring densities and target cardinalities across simulations, whereas the energy and scaling specifications of the cameras are unchanged. In this environment, our approach is evaluated by observing the number of cameras that are active on average, #Active; the number of cameras that are redundantly active, #RActive; the number of targets that are missed at a time instance, #Missing; and the amount of time that the camera network survives, Lifetime. To assist in understanding the first three metrics, we present two successful examples and a single failed example, with a neighboring density of 7 and 10 targets, in Figure 6. Generally, distantly located targets are well covered as in Figure 6(a); however, Cameras 3 and 5 of Figure 6(b) collectively monitor their closely located observable targets, because those targets could be occluded in the FoV of Camera 5. For both cases, #Active is 7. As in Figure 6(c), cameras may miss targets, as Cameras 7 and 9 do (#Missing = 1), or be redundantly active, as Camera 15 is (#RActive = 1), because of wrongly given previous, or differently estimated current, target locations. Table 2 lists the average simulated values of our four primary performance metrics (to two or more significant figures) over 100 random scenario tests. The following claims can be derived from the results.

Claim 1. On average, six cameras for 10 targets and nine cameras for 20 targets are active. Accordingly, Lifetime for 10 targets is longer than that for 20 targets.
Claim 2. With a larger neighboring density, #RActive and #Missing are smaller. More neighboring cameras can synthesize more accurate target locations by exchanging different pieces of information, and the full network coverage with a density of 7 is wider than with a density of 3.

Claim 3. The smaller number of active cameras in all cases somewhat extends the network lifetime, from 1000 to 1481.

Claim 4. Because of the low #RActive and #Missing values in all cases, our approach can work despite the localization errors that necessarily occur in any existing localization technique.

Complexity Analysis.
Because we consider wireless cameras, we must discuss the resource overhead required by our proposed method. For each time instance in our design, energy consumption occurs according to the operations listed in Table 3.
The complexities of (1) and (5) depend on the models or algorithms the camera sensors employ, and we leave them unspecified. We emphasize that these two operations are conducted only by active cameras, and our proposed approach activates, on average, only 57% (=9.1/16) of the cameras, even in the worst case. The computational complexity of (2) is given in [16]. Computing each target utility, each camera utility, and each global target utility is polynomial in the number of neighboring cameras and observable targets, as is the computation for (3). Given such utilities, a camera determines its mode for (4) by a search over its observable targets. Conversely, the communication complexity of (6) to (8) is linear in the number of neighbors, because a camera exchanges at most a constant number of pieces of information with each neighbor. The energy consumption of camera sensors is dominated by (1) and (5) because of the significant size of image data [3,17]. This supports our simple assumption about power consumption, that only active cameras monotonously consume a single power bar per time instance, and indicates that our approach can be considered fairly competitive if targets are not highly crowded in any of the FoVs.

Comparison Work
Utilizing the same environments employed in our simulations, we simulated the representative alternatives, PG [4,5] and GS [6-8], while measuring the four network performance metrics, #Active, #RActive, #Missing, and Lifetime. For GS in particular, we heuristically assumed, based on the #Active values of Table 2, that the optimal numbers of active cameras for 10 targets and 20 targets are, respectively, 6 and 9. Because these numbers are not consistently optimal, we examined three GS tests for each of the two target sets, {5-GS, 6-GS, 7-GS} for 10 targets and {8-GS, 9-GS, 10-GS} for 20 targets. Table 4 lists the network performance results of the five different approaches, on average, after 100 tests. A smaller number of active cameras typically results in a longer network lifetime; correspondingly, the GS variants with fewer active cameras provided the network with longer life. However, their #RActive and #Missing results were worse than the analogous results from our approach and PG. Compared with our approach, PG produces similar results for every factor. However, PG requires each camera to compute the required utilities for every camera and to communicate with every other camera, which consumes greater resources than our approach. Therefore, we emphasize that our approach generally provides more advantageous trade-offs between {#Active, Lifetime} and #Missing, and between Lifetime and resource overhead, compared with the alternative approaches.
As representative instances of this comparison process, we provide simulation screenshots of the five approaches in one simulation with a neighboring density of 3 and 20 targets, as in Figure 7. Aside from unobservable targets, our approach covers every target with 10 active cameras, whereas PG activates one redundant camera, Camera 15. As previously stated, because GS cannot adaptively select the number of active cameras, it misses targets as in Figures 7(b) and 7(c), or additionally activates redundant cameras as in Figure 7(d).

Conclusion
In this paper, we addressed the trade-offs between extending network lifetime and enhancing sensing accuracy. To minimize the energy consumption of data transmission while aggregating individual inferences to build a global inference, we utilized the inference tree method to initialize an optimal data transmission path, and we demonstrated that it is very effective for reducing both the number of data transmissions and energy consumption. We modeled CS as a cooperative bargaining game, in which every participating camera serially optimizes the global utility, employing only local knowledge, based on the serial dictatorial rule. The simulated results demonstrated that our approach extends network lifetime and performs accurately even with limitedly accurate target locations. Moreover, our approach is energy-efficient for uncrowded targets, compared with the representative conventional alternatives.