Distributed Compressed Video Sensing in Camera Sensor Networks

With the proliferation of video devices ranging from low-power visual sensors to mobile phones, video sequences captured by these simple devices must be compressed cheaply and reconstructed by relatively more powerful servers. For such scenarios, distributed compressed video sensing (DCVS), which combines distributed video coding (DVC) and compressed sensing (CS), has been developed as a signal-sensing and compression approach for video signals. In DCVS, video frames are compressed separately into a small number of measurements, while the interframe correlation is exploited by the joint recovery algorithm. In this paper, a new DCVS joint recovery scheme using side-information-based belief propagation (SI-BP) is proposed to exploit both the intraframe and interframe correlations; it is particularly effective over error-prone channels. The scheme is designed over two frame signal models, the mixture Gaussian (MG) model and the wavelet hidden Markov tree (WHMT) model. Simulation results on two video sequences show that the SI-BP-based DCVS scheme is error resilient when the measurements are transmitted through noisy wireless channels.


Introduction
Current video coding paradigms, such as MPEG and ITU-T H.26x, are traditionally designed for applications that follow the so-called "broadcast" model, as shown in the left part of Figure 1. The video sequence is encoded only once, with high complexity, at a powerful server, and the compressed stream is then distributed to and decoded repeatedly on many cheap and simple user devices. The MPEG and H.26x standards therefore both pair a complex encoder with a lightweight decoder.
However, with the proliferation of video devices ranging from low-power visual sensors to camera phones, visual applications have developed beyond this broadcast model. Video processing in camera sensor networks, which are composed of spatially distributed smart camera devices capable of processing images or videos of a scene from a variety of viewpoints, is closer to a "multiple-access" model, as shown in the right part of Figure 1. In this scenario, video devices with limited battery power and storage memory need to send their captured video streams to a monitoring server. High compression efficiency is also required because of the limits on wireless bandwidth and transmission power. These requirements are diametrically opposed to those addressed by MPEG and H.26x. Lightweight and efficient encoding technologies have therefore been developed for such multiaccess applications.
The first step toward the multiaccess model was to exploit the interframe statistics at the decoder only, an approach known as distributed video coding (DVC) [1]. The information-theoretic basis of DVC is distributed source coding (DSC) [2], which states that correlated sources can be encoded separately and decoded jointly using the correlation between them; furthermore, separate encoding can achieve the same coding efficiency as joint encoding. DSC has attracted considerable attention with the growth of wireless sensor networks, where the correlated sources captured by sensors must be encoded without communication among the sensors and decoded jointly at the sink node. The Slepian-Wolf theorem [2] for lossless coding and the Wyner-Ziv theorem [3] for lossy coding are the most important theoretical foundations of DSC. The first practical DSC strategy was proposed in [4], which exploits the potential of the Slepian-Wolf theorem by introducing channel codes. The statistical dependence between the two sources is modeled as a virtual correlation channel, and one source is used as side information to help decode the other.
DVC combines DSC methods with traditional intraframe video coding systems so as to shift the complicated motion search operations from the encoder to the decoder, which suits the "multiple-access" model well. A classical framework called power-efficient, robust, high-compression, syndrome-based multimedia coding (PRISM) was proposed in [5]. It encodes the P frames following the same procedure as for the I frames but at lower rates, while motion search is used at the decoder to estimate the side information from the neighboring recovered I frames. Based on the side information, the P frames can be reconstructed using DSC decoding algorithms.
Girod et al. [6] investigated DVC in a similar form but used turbo codes instead of the trellis codes in [5]. The scheme in [7] tackled multiview correlations using DVC methods. DVC not only provides a lightweight video encoder but also contributes to the low-cost protection of traditional video coding. The layered Wyner-Ziv video coding system [8] achieved robust video transmission by adding Wyner-Ziv bitstream layers as enhancement layers.
Another recent direction toward a lightweight encoder is to exploit the sparsity of the intraframe signal, known as compressed sensing (CS) [9][10][11][12][13][14]. In traditional intraframe coding, a large number of pixel values are first transformed, and then only the important low-frequency coefficients are entropy coded while the other coefficients are discarded. In CS, we denote an N-dimensional vector as $X = (x_1, \ldots, x_N) \in \mathbb{R}^N$, which can be represented as $X = \Psi\theta$, where $\Psi$ is an orthogonal N-by-N matrix and $\theta \in \mathbb{R}^N$ is the projection of $X$ onto the basis $\Psi$. If $\theta$ has only K nonzero entries, with $K \ll N$, we say that $X$ is K-sparse with respect to the representing matrix $\Psi$. If $\theta$ is approximately sparse, with K large entries, $X$ is called compressible.
As long as an M-by-N sensing matrix $\Phi$ that is incoherent with the representing matrix $\Psi$ can be found, the signal $X$ can be sampled as $Y = \Phi X$, where $Y = (y_1, \ldots, y_M) \in \mathbb{R}^M$, $M < N$, is the measurement vector. From the measurements, the signal can be recovered by solving the $\ell_p$-norm minimization problem
$$ \hat{\theta} = \arg\min_{\theta} \|\theta\|_p \quad \text{subject to} \quad Y = \Phi\Psi\theta. \tag{1} $$
The solution to this optimization program is $\hat{\theta}$, and the estimated signal is $\hat{X} = \Psi\hat{\theta}$. According to compressive sensing theory [9][10][11][12][13][14], recovery succeeds with probability close to one if M satisfies $M \sim 2K \log(N/K)$ [14].
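The sampling step Y = ΦX and the measurement-count rule of thumb can be illustrated with a minimal sketch. It is not the scheme used in this paper: the signal is taken to be sparse in the identity basis (Ψ = I), Φ is a dense i.i.d. Gaussian matrix, and the logarithm is assumed to be natural; the function names are hypothetical.

```python
import math
import random


def required_measurements(n: int, k: int) -> int:
    """Rule-of-thumb count M ~ 2K log(N/K), rounded up.

    Assumption: natural logarithm; the text does not state the base.
    """
    if not 0 < k < n:
        raise ValueError("need 0 < k < n")
    return math.ceil(2 * k * math.log(n / k))


def gaussian_sensing_matrix(m: int, n: int, rng: random.Random) -> list:
    """Dense i.i.d. Gaussian M-by-N matrix (an illustrative choice of Phi)."""
    scale = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, scale) for _ in range(n)] for _ in range(m)]


def sparse_signal(n: int, k: int, rng: random.Random) -> list:
    """K-sparse vector: K random positions get Gaussian values, the rest are 0."""
    x = [0.0] * n
    for idx in rng.sample(range(n), k):
        x[idx] = rng.gauss(0.0, 1.0)
    return x


def measure(phi: list, x: list) -> list:
    """Compute Y = Phi X as a plain matrix-vector product."""
    return [sum(p * v for p, v in zip(row, x)) for row in phi]


rng = random.Random(0)
n, k = 256, 8
m = required_measurements(n, k)       # number of measurements, M < N
x = sparse_signal(n, k, rng)          # sparse in the identity basis (assumed)
y = measure(gaussian_sensing_matrix(m, n, rng), x)
```

Recovery of x from y is deliberately left out; it requires a solver for the minimization problem, which this sketch does not attempt.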
Sparsity [11] is one of the essential factors of CS which guarantees the signal can be compressively sampled, and the other is the incoherence [12] which provides the sensing matrices. Usually, random matrices are largely incoherent with any fixed representing matrix. In the CS methodology, the image or video signals are usually sparse on the basis of discrete cosine transform (DCT) matrix and discrete wavelet transform (DWT) matrix [15], where the frame signal can be directly compressed to a small number of measurements. Thus, CS may greatly improve the efficiency and decrease the complexity of intraframe compression. CS has been applied in image coding by many researchers [16,17], and [18,19] discussed the modified transforming, blocking and quantizing methods for video coding according to the feature of CS measurements.
Combining DVC and CS is a main research direction for video compressed sensing, the so-called distributed compressed video sensing (DCVS). In [20], the authors first reconstructed the difference between frame signals using the ordinary gradient projection for sparse reconstruction (GPSR) algorithm [21] and then recovered each signal. GPSR is essentially a gradient projection algorithm applied to a quadratic programming formulation of (1), in which the search path at each iteration is obtained by projecting the negative-gradient direction onto the feasible set. The scheme in [22] tried to exploit the correlation between random measurements of signals using the Wyner-Ziv method, but such a cascaded design needs two sets of encoders and decoders, which increases the complexity. Taking the side information into account, the scheme in [23] modified the initialization and the stopping criteria within GPSR recovery. The idea is similar to ours, but its recovery algorithm, which uses basis pursuit, differs from ours. It should also be noted that none of these schemes considers transmission noise in the measurements.
In this paper, a novel DCVS scheme using side-information-based belief propagation (SI-BP) is proposed to exploit both the interframe and intraframe correlations of a video sequence. Each frame is compressed separately using structured sensing matrices. The P frames, however, have a lower compression ratio, because they can be jointly decoded using SI-BP, where the reconstructed adjacent I frames are used to generate the side information. The SI-BP algorithm is derived from Bayesian inference, so the frame signal model is crucial to the performance of the DCVS scheme. Two signal models are introduced, the mixture Gaussian (MG) model and the wavelet hidden Markov tree (WHMT) model. Because the recovery method is based on Bayesian inference, our scheme is resilient to the noisy transmission channels that are inevitable in wireless networks. Although a recovery method based on belief propagation (BP) was introduced into CS in [24], it is designed only for a single signal, not for image or video signals.
The proposed DCVS scheme has three advantages. First, the system structure of DVC is adopted so that the motion search is moved from the encoder to the decoder, reducing the complexity of the encoder. Second, a coding-theory-like sensing and recovery scheme based on Bayesian inference is proposed, using the SI-BP algorithm, which is quite different from the optimization-based recovery schemes in prior work. Third, the MG and WHMT models of wavelet coefficients are exploited to recover signals from the standpoint not only of sparsity but also of statistical distribution. This fuller use of the properties of the video frame signal makes our scheme robust to noise-prone transmission channels.
The rest of the paper is organized as follows. Section 2 introduces the two frame signal models, MG and WHMT. Section 3 presents the details of the proposed DCVS scheme, consisting of separate compression and joint recovery algorithms. Section 4 presents and discusses the simulation results. Finally, Section 5 gives concluding remarks and directions for future work.

Frame Signal Models
In our SI-BP-based DCVS scheme, the correlations between frames are used to initialize the decoding iterations, so the correlation models strongly affect the performance of the decoder. Generally, the frame signal of a video is compressible in the DCT or DWT basis, and the transform coefficients constitute a compressible signal with a particular structure. In this section, we discuss the effectiveness of the MG model and the WHMT model.

Mixture Gaussian Model.
The MG model has been shown to be a simple yet effective model of real sparse signals in image processing and inference problems. The wavelet coefficients of a video frame can be regarded as an approximately K-sparse signal. According to their magnitudes, the coefficients are divided into K large elements and N − K small elements. The signal is thus modeled as a two-state MG, where the large elements ("large" state) and small elements ("small" state) are drawn from the Gaussian distributions $\mathcal{N}(0, \sigma_1^2)$ and $\mathcal{N}(0, \sigma_0^2)$, respectively, with $\sigma_1^2 > \sigma_0^2$. The probability density function of an element of the frame signal is therefore
$$ f(\theta) = \frac{K}{N}\,\mathcal{N}(\theta; 0, \sigma_1^2) + \left(1 - \frac{K}{N}\right)\mathcal{N}(\theta; 0, \sigma_0^2). $$
Information-theoretic bounds on the performance of CS for particular sparse representations have long been a focal point of investigation [25][26][27]. For such a two-state MG sparse signal, [25] derived rate-distortion bounds under the mean squared error (MSE) distortion measure. A simple upper bound on D(R) is obtained using an adaptive two-step code: first, the pattern of "large" and "small" states, which obeys a Bernoulli distribution, is encoded; then the Gaussian-distributed values of the two states are encoded separately. This construction upper bounds the distortion-rate function of the DWT coefficients. A lower bound on D(R) is found by considering the Markov chain $S \to X \to \hat{X}$, where S is a hidden discrete variable describing the sparsity pattern.
These results are for a single DWT vector; for video coding, the correlation between the I frame and the P frame should also be exploited. We therefore consider the rate-distortion bound for a sparse signal X with side information Y. Under the correlation model X = Y + Z, where Y and Z are independent sparse signals, a lower bound follows from the Wyner-Ziv coding results [25] and holds with asymptotic equality (denoted $\doteq$) as D → 0. These results on the rate-distortion bounds of DWT coefficients are the theoretical foundation of our proposed scheme.
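As an illustration of the two-state MG model, the following sketch evaluates the mixture density and draws synthetic coefficients. It assumes that each coefficient is in the "large" state independently with probability K/N (so the number of large elements is K only on average), and the function names and parameter values are hypothetical.

```python
import math
import random


def gaussian_pdf(x: float, sigma: float) -> float:
    """Zero-mean Gaussian density with standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def mg_pdf(x: float, k: int, n: int, sigma1: float, sigma0: float) -> float:
    """Two-state mixture density; the 'large' state has weight K/N."""
    p_large = k / n
    return (p_large * gaussian_pdf(x, sigma1)
            + (1.0 - p_large) * gaussian_pdf(x, sigma0))


def sample_mg(n: int, k: int, sigma1: float, sigma0: float,
              rng: random.Random) -> list:
    """Draw N coefficients; each is 'large' with probability K/N (assumed i.i.d.)."""
    p_large = k / n
    return [rng.gauss(0.0, sigma1 if rng.random() < p_large else sigma0)
            for _ in range(n)]


# Illustrative parameters only; the paper chooses K, sigma1, sigma0 empirically.
theta = sample_mg(n=1024, k=100, sigma1=10.0, sigma0=1.0, rng=random.Random(0))
```

Because the two components share a zero mean, the mixture is sharply peaked at zero with heavy tails, which is what makes it a reasonable stand-in for compressible wavelet coefficients.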
It should be noted that the MG model is not specific to video frame signals but is a universal model for any sparse signal. In other words, it does not fully capture the particular features of video frame signals. We therefore adopt a second model, the WHMT model, which is introduced in the next subsection.
Figure 2: The quad-tree structure of wavelet transform coefficients.

Wavelet Hidden Markov Tree Model.
The DWT coefficients of an image or a video frame have a quad-tree structure [15], which was well studied before the advent of CS. Recent wavelet-based CS methods [28,29] incorporate this prior information into CS reconstruction, and we also introduce it into our joint recovery scheme. Figure 2 shows the tree structure of the 3-level DWT coefficients of an image. The coefficients are decomposed with high-pass (H) and low-pass (L) filters at each level. $HL_l$, $LH_l$, and $HH_l$ denote the sub-band directions at level l = 1, 2, 3, and $LL_3$ is the basic representation of the image. Analysis of the DWT coefficient values shows that they tend to persist across scales within each sub-band. We can therefore construct a quad tree in which the coefficients at the highest level l = L (L is the number of levels) are called the "roots", with scale s = 1, and the coefficients at the lowest level l = 1 are the "leaves", with scale s = L. The coefficients in $LL_L$ are assigned scale s = 0.
The DWT coefficients are modeled as a WHMT, which generalizes the two-state MG model. The two hidden states of the WHMT are again the "large" state and the "small" state, and the coefficient values in each state are drawn from a Gaussian distribution $\mathcal{N}(0, \sigma_1^2)$ or $\mathcal{N}(0, \sigma_0^2)$. However, the transition probability of a non-root coefficient between the two states is conditioned on the state of its parent coefficient. If the parent coefficient is in the "small" state, its children are in the "small" state with probability close to one. If the parent coefficient is in the "large" state, both the "large" and the "small" states are possible for its children. A root coefficient at scale s = 1 has a high probability of being in the "large" state, while a coefficient at scale s = 0 must be in the "large" state. Thus, the prior probability of the ith coefficient at scale s is given as
$$ p(\theta_{(s,i)}) = \pi_{(s,i)}\,\mathcal{N}(0, \sigma_{s1}^2) + \left(1 - \pi_{(s,i)}\right)\mathcal{N}(0, \sigma_{s0}^2), $$
where $\pi_{(s,i)}$ is the transition probability to the "large" state of coefficient $\theta_{(s,i)}$, and $\sigma_{s1}$ and $\sigma_{s0}$ are the standard deviations in the "large" state and the "small" state at scale s, respectively. In particular, the setting of $\pi_{(s,i)}$ follows from the tree structure:
$$ \pi_{(s,i)} = \begin{cases} \pi_{s1}, & \text{if } \omega_{p(s,i)} \text{ is in the "large" state}, \\ \pi_{s0}, & \text{if } \omega_{p(s,i)} \text{ is in the "small" state}, \end{cases} $$
where $\omega_{p(s,i)}$ denotes the parent coefficient of $\theta_{(s,i)}$, and $\pi_{s1}$ and $\pi_{s0}$ are the transition probabilities when the parent coefficient is in the "large" state and the "small" state, respectively.
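The parent-conditioned state transitions can be made concrete with a small generative sketch of one quad tree. It is a simplification under stated assumptions: a single root, four children per node, and the same transition probabilities at every scale (the model above allows them to vary with s); all names and values are hypothetical.

```python
import random


def sample_whmt_states(levels: int, pi_root: float, pi_large_parent: float,
                       pi_small_parent: float, rng: random.Random) -> list:
    """Sample hidden states (True = 'large') scale by scale down one quad tree.

    states[0] holds the single root (scale s = 1); each node has 4 children.
    """
    states = [[rng.random() < pi_root]]
    for _ in range(1, levels):
        children = []
        for parent_is_large in states[-1]:
            # The child's probability of being 'large' depends on its parent.
            p = pi_large_parent if parent_is_large else pi_small_parent
            children.extend(rng.random() < p for _ in range(4))
        states.append(children)
    return states


def sample_whmt_coefficients(states: list, sigma_large: list, sigma_small: list,
                             rng: random.Random) -> list:
    """Draw each coefficient from the zero-mean Gaussian of its state and scale."""
    return [[rng.gauss(0.0, sigma_large[s] if is_large else sigma_small[s])
             for is_large in scale_states]
            for s, scale_states in enumerate(states)]


rng = random.Random(0)
states = sample_whmt_states(levels=3, pi_root=0.9, pi_large_parent=0.6,
                            pi_small_parent=0.02, rng=rng)
coeffs = sample_whmt_coefficients(states, sigma_large=[20.0, 10.0, 5.0],
                                  sigma_small=[2.0, 1.0, 0.5], rng=rng)
```

Setting both transition probabilities to K/N at every node removes the dependence on the parent, which corresponds to the MG special case.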

Implementation of DCVS Using SI-BP
In this section, we focus on the implementation of DCVS with SI-BP, including the separate compression and joint recovery algorithms. The framework of DCVS is shown in Figure 3. A frame group is assumed to consist of one I frame and two P frames. The I frames $X_I$, $X_{I+1}$ and the P frames $X_P$, $X_{P+1}$ are all transformed into their DWT coefficients, denoted by $\theta_I$, $\theta_{I+1}$ and $\theta_P$, $\theta_{P+1}$, respectively. The coefficients of the I frames and P frames are sampled by sensing matrices $\Phi_I$ of size $M_I \times N$ and $\Phi_P$ of size $M_P \times N$, respectively, where $M_P < M_I < N$. The measurements $Y_I$, $Y_{I+1}$, $Y_P$, and $Y_{P+1}$ are transmitted through an AWGN channel with noise $W \sim \mathcal{N}(0, \sigma_W^2)$, and the received versions are denoted by $\tilde{Y}_I$, $\tilde{Y}_{I+1}$, $\tilde{Y}_P$, and $\tilde{Y}_{P+1}$. At the more powerful decoder, the coefficients of the two I frames are first reconstructed as $\hat{\theta}_I$ and $\hat{\theta}_{I+1}$, so that the side information $\theta_{SI}$ can be constructed via interpolation. The P frames are then reconstructed as $\hat{\theta}_P$ and $\hat{\theta}_{P+1}$ with the help of the side information $\theta_{SI}$ via the SI-BP algorithm. Finally, the reconstructed coefficients are inversely transformed into $\hat{X}_I$, $\hat{X}_{I+1}$, $\hat{X}_P$, and $\hat{X}_{P+1}$ to recover the original video.
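Two of the steps in this framework, the AWGN channel and the construction of side information from the two reconstructed I frames, can be sketched as follows. The text only says that the side information is obtained "via interpolation"; the plain weighted average of coefficient vectors used here is an assumption made for illustration, as are the function names.

```python
import random


def awgn_channel(measurements: list, sigma_w: float, rng: random.Random) -> list:
    """Add independent zero-mean Gaussian noise to every measurement."""
    return [y + rng.gauss(0.0, sigma_w) for y in measurements]


def interpolate_side_information(theta_prev: list, theta_next: list,
                                 weight_next: float = 0.5) -> list:
    """Weighted average of two reconstructed I-frame coefficient vectors.

    Assumption: element-wise linear interpolation without motion compensation.
    """
    if len(theta_prev) != len(theta_next):
        raise ValueError("coefficient vectors must have the same length")
    return [(1.0 - weight_next) * a + weight_next * b
            for a, b in zip(theta_prev, theta_next)]


rng = random.Random(0)
received = awgn_channel([3.0, -1.5, 0.25], sigma_w=15.0, rng=rng)
theta_si = interpolate_side_information([0.0, 2.0], [2.0, 4.0])
```

With two P frames between the I frames, unequal weights such as 1/3 and 2/3 would be a natural variation, but that choice is not specified in the text.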

Separate Compression.
The BP and SI-BP recovery algorithms operate on a bipartite graph, so the sensing [...] For the WHMT model, because different parts of the tree are of unequal importance, the LDSM is redefined with layer-dependent column weights, where $d_{v,l}$ is the number of nonzero elements in the columns corresponding to the coefficients of layer $l$, $l = 1, 2, \ldots, L$.
In other words, the LDSM is written in a layered format, where $N_l$ is the number of coefficients of layer $l$. For both the MG and the WHMT models, the measurements of the frames are generated as
$$ Y_I = \Phi_I \theta_I, \qquad Y_P = \Phi_P \theta_P. $$
The compression ratios are calculated as
$$ R_I = M_I / N, \qquad R_P = M_P / N. $$
[...] and check nodes have prior information in SI-BP. The correlation between the side-information frame and the current P frame to be decoded can be modeled as a virtual channel. The side information $\theta_{SI}$ is therefore used to initialize the variable nodes, and the measurements $Y_P$ are used to correct the errors caused by the correlation noise and the transmission noise.
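To make the separate compression step concrete, here is a minimal sketch of sensing with a sparse matrix of fixed row weight. It is not the LDSM construction of this paper: the nonzero values are assumed to be ±1, column positions are drawn uniformly at random, the layer-dependent column weights are not enforced, and the compression ratio is taken to be M/N. All names are hypothetical.

```python
import random


def build_sparse_sensing_matrix(m: int, n: int, row_weight: int,
                                rng: random.Random) -> list:
    """Return M sparse rows, each a dict {column: value} with row_weight nonzeros.

    Assumption: nonzeros are +1/-1 at uniformly random columns.
    """
    return [{col: rng.choice((-1.0, 1.0))
             for col in rng.sample(range(n), row_weight)}
            for _ in range(m)]


def sense(matrix: list, theta: list) -> list:
    """Compute Y = Phi theta, touching only the nonzero entries of each row."""
    return [sum(value * theta[col] for col, value in row.items())
            for row in matrix]


def compression_ratio(m: int, n: int) -> float:
    """Ratio of measurements to coefficients (assumed definition R = M / N)."""
    return m / n


rng = random.Random(0)
n = 256                                   # coefficients in one 16 x 16 block
phi_p = build_sparse_sensing_matrix(m=51, n=n, row_weight=20, rng=rng)
theta_p = [rng.gauss(0.0, 1.0) for _ in range(n)]
y_p = sense(phi_p, theta_p)
```

Each row corresponds to a check node and each column to a variable node of the bipartite graph, so a fixed row weight gives every check node the same degree.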
To clarify the iterative BP decoding algorithms, we first give the following assumptions.
(iii) $q_{j,k}(x)$ represents the message sent from variable node $\theta_j^F$ to check node $y_k^F$, and $r_{k,j}(x)$ represents the message sent from check node $y_k^F$ to variable node $\theta_j^F$.
(iv) $C(j)$ is the set of check nodes connected to variable node $\theta_j^F$, and $R(k)$ is the set of variable nodes connected to check node $y_k^F$.
The decoding algorithms for the MG model and the WHMT model are similar and differ only in the initialization. Moreover, the WHMT model degenerates to the MG model when all $\pi_{(s,i)}$ are set equal to K/N, where K is the sparsity of the frame. We therefore discuss only the WHMT model in what follows.
Each variable node of the I frames is initialized with a pdf, which is sent to the corresponding check nodes. For the P frames, the variable node is initialized with a pdf in which the side information $s_j$ is used as the prior mean; this choice is determined by the maximum a posteriori probability (MAP) estimate. The remaining steps in the recovery of I frames and P frames are the same, and they are listed below.
The check node $y_k^F$ receives the messages sent from the variable nodes in the previous half-iteration and then calculates the message to be transmitted back to the variable nodes. In this calculation, $\theta_{k(l)} \in R(k)$ and $h_{k(l)}$ is the nonzero element in row k, $l = 1, 2, \ldots, d_c^F$; $w(\theta)$ is the pdf of the channel noise; and $\Delta(m_k^F)$ is a function of the measurement $m_k^F$ that adjusts the message. The function $r_{k,k(l)}(\theta)$ is the final message sent from check node $y_k^F$ to variable node $\theta_{k(l)}$. The variable node $\theta_j^F$ receives the messages sent from its connected check nodes and calculates the message to be transmitted back to each check node. [...] The iteration is repeated for the desired number of iterations. Finally, the pdf of each variable node is obtained, and the MAP estimate is used to determine the value of each variable node. In the SI-BP algorithm, the side information $\theta_{SI}$ is used to initialize the pdfs of the variable nodes and affects the pdfs in each iteration. This initialization is more reasonable than the uniform use of zeros when there is no side information. The deviation of $\theta_{SI}$ from $\theta_P$ is gradually corrected by the measurements $Y_P$ during the iterations.
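The role of the side information as a prior mean, and of the final MAP decision, can be illustrated for a single variable node. This is a deliberately reduced sketch, not the SI-BP message passing itself: it assumes a two-state Gaussian mixture prior centered on the side information s_j, one direct noisy observation of the coefficient (a check node with a single neighbor and unit weight), and a grid search for the posterior maximum. All names and parameter values are hypothetical.

```python
import math


def gaussian_pdf(x: float, mean: float, sigma: float) -> float:
    """Gaussian density with the given mean and standard deviation."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def map_estimate(side_info: float, measurement: float, pi_large: float,
                 sigma1: float, sigma0: float, sigma_w: float,
                 grid_step: float = 0.01) -> float:
    """Grid-search MAP estimate of one coefficient.

    Prior (assumed): mixture of N(side_info, sigma1^2) and N(side_info, sigma0^2).
    Likelihood (assumed): measurement = coefficient + N(0, sigma_w^2) noise.
    """
    half_width = 5.0 * max(sigma1, sigma_w)
    lo = min(side_info, measurement) - half_width
    hi = max(side_info, measurement) + half_width
    steps = int(round((hi - lo) / grid_step))
    best_x, best_post = lo, -1.0
    for i in range(steps + 1):
        x = lo + i * grid_step
        prior = (pi_large * gaussian_pdf(x, side_info, sigma1)
                 + (1.0 - pi_large) * gaussian_pdf(x, side_info, sigma0))
        posterior = prior * gaussian_pdf(measurement, x, sigma_w)  # unnormalized
        if posterior > best_post:
            best_x, best_post = x, posterior
    return best_x


# A clean measurement dominates; a noisy one leaves the estimate near s_j.
clean = map_estimate(0.0, 4.0, pi_large=0.1, sigma1=5.0, sigma0=1.0, sigma_w=0.5)
noisy = map_estimate(0.0, 4.0, pi_large=0.1, sigma1=5.0, sigma0=1.0, sigma_w=10.0)
```

In the full algorithm the likelihood term is replaced by the product of the messages arriving from all connected check nodes, but the final decision has the same form: pick the value that maximizes the resulting pdf.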

Simulation Results and Analysis
For the simulations, we select the commonly used YUV 4:2:0 video sequences "Coastguard" and "Foreman" to test the performance of the DCVS algorithm, where the former is in QCIF format and the latter is in CIF format. Only the Y frames are processed. For the CIF sequence, a frame is divided into 32 × 32 blocks, which are encoded individually; for the QCIF sequence, the block size is 16 × 16. These settings are chosen as a tradeoff between SI-BP efficiency and complexity. The group of pictures (GOP) consists of one I frame and two P frames. The forward and backward I frames are reconstructed first to generate the side information by interpolation for decoding the P frames. To evaluate the SI-BP-based recovery scheme against traditional convex optimization recovery schemes, we compare our scheme with the DCVS algorithm of [23], which uses GPSR [21]. Both the two-state MG model and the WHMT model are simulated in our scheme. For the MG model, the sparsity K and the standard deviations $\sigma_1$ and $\sigma_0$ are decided empirically. The LDSM is designed with $d_c = d_c^I = d_c^P = 20$, and the variable node degrees are determined by the compression ratios. The WHMT model is trained before coding using the software in [30]. Performance is reported as the average peak signal-to-noise ratio (PSNR) at different average compression ratios $R_{avg}$, where $R_{avg} = (R_I + 2 R_P)/3$. For example, $R_{avg} = 0.3$ is achieved by setting $R_I = 0.5$ and $R_P = 0.2$. When $R_I$ and $R_P$ both increase by 0.1, $R_{avg}$ increases by 0.1 as well.
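The two evaluation quantities can be computed as in the sketch below. The average compression ratio follows the GOP structure of one I frame and two P frames; the PSNR uses the usual definition with a peak value of 255, which is an assumption for 8-bit luma samples. The function names are hypothetical.

```python
import math


def average_compression_ratio(r_i: float, r_p: float, p_frames: int = 2) -> float:
    """R_avg for a GOP of one I frame and p_frames P frames."""
    return (r_i + p_frames * r_p) / (1 + p_frames)


def psnr(original: list, reconstructed: list, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized sample lists.

    Assumption: 8-bit samples, so the peak value is 255.
    """
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)


r_avg = average_compression_ratio(0.5, 0.2)   # the example setting from the text
```

Averaging the per-frame PSNR values over a sequence then gives the average PSNR plotted against R_avg.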

The Noiseless Channel Case.
First, the ideal (noiseless) transmission channel is considered. Figure 4 compares the proposed SI-BP schemes with the GPSR scheme on the two video sequences "Foreman" and "Coastguard". The SI-BP scheme based on the WHMT model performs better than both the SI-BP scheme based on the MG model and the GPSR scheme. This is because the WHMT model is specific to video frame signals and relies only on statistical properties of the signal that can be obtained by training. In addition, the irregular density distribution of the LDSM protects the important low-layer DWT coefficients, so the WHMT-based scheme performs better than the MG-based one. For the MG model, the PSNR on the Coastguard sequence is worse than that of the GPSR scheme, whereas on the Foreman sequence it is better. This is because the MG model relies on both the sparse and the statistical properties of the signal, and the SI-BP recovery algorithm depends heavily on the accuracy of the signal model. We infer that the sparsity property plays the more important role in the noiseless channel case.

The Noisy Channel Case.
Next, the PSNR performance of the three schemes is simulated over an error-prone channel, where the noise standard deviation $\sigma_W$ of the AWGN channel is set to 15. The results are shown in Figure 5. For the "Foreman" CIF sequence, the SI-BP schemes based on the WHMT and MG models both outperform the GPSR scheme, by at least 6 dB. For the "Coastguard" QCIF sequence, the SI-BP scheme based on the MG model still outperforms the GPSR scheme, by up to 2.5 dB. The GPSR scheme keeps almost the same PSNR across all compression ratios: its performance does not improve with more measurements because the GPSR recovery algorithm has little ability to resist measurement errors. In contrast, the noise does not degrade the performance of the proposed SI-BP scheme. Because of the unequal protection provided by the irregular LDSM, the performance of SI-BP based on the WHMT model is about 0.5 dB higher than that based on the MG model. Figure 6 further demonstrates the resilience of the SI-BP scheme to noisy measurements. The PSNR of the SI-BP scheme based on the WHMT and MG models and that of the GPSR scheme are compared at the average compression ratio $R_{avg} = 0.4$ as the standard deviation of the channel noise varies. When the noise is very small, the GPSR scheme is comparable to our schemes, as in the noiseless case, but it degrades rapidly as the noise variance increases. The proposed DCVS scheme using SI-BP, by contrast, achieves stable PSNR performance whether the frame model is WHMT or MG. We therefore conclude that SI-BP-based DCVS is better suited to practical applications, since channel noise is inevitable in wireless networks.

Conclusions and Future Works
In this paper, a novel DCVS scheme using side-information-based belief propagation (SI-BP) is proposed for the multiaccess model of video processing. Each video frame signal is compressed to its measurements separately, and the intraframe and interframe correlations are exploited at the joint decoder. The SI-BP recovery algorithm is based on Bayesian inference, which is quite different from the optimization-based recovery schemes in prior CS work, and it is error resilient when the measurements are transmitted through noise-prone channels.
The proposed DCVS scheme shifts the complexity of video coding to the decoder and provides a lightweight and efficient encoder for constrained camera sensors. The decoding algorithm based on SI-BP is better suited to practical applications, where transmission noise is inevitable. In the future, we will further improve the performance of SI-BP-based DCVS in noiseless scenarios and extend the results of this paper to other video and image analysis tasks, such as motion tracking.