Masking of temporal activity for video quality control, measurement and assessment

Every video stream possesses temporal redundancy that depends on the amount of motion present in it. Ample motion in a video sequence can cause distorting artifacts; to avoid them, motion or temporal activity that is not noticeable to the human eye in real time can be masked. Artifacts such as blockiness and blurriness are introduced into a video sequence as soon as it undergoes compression, and they become more intense as temporal activity increases. In this paper, an algorithm is proposed to mask temporal activity that is unnoticeable to the human eye, using a temporal masking coefficient (q), in order to bring down distortion levels. The quality of the video sequence can be adjusted by varying the q parameter, thus controlling its overall quality index. Frames are extracted from the video sequence, and displacement or motion vectors are calculated from consecutive frames using a bi-directional block matching algorithm. These motion vectors are used to estimate the amount of motion between consecutive frames of the same scene. The video sequences used for this purpose are in H.264 format. Temporal masking is performed on a video sequence with and without the use of motion vectors. Structural similarity index and peak signal-to-noise ratio are the quality metrics used to assess the performance of the proposed algorithm. A bit rate saving of 1.2% was achieved by the proposed algorithm at q = 1 in contrast to the standard H.264/Advanced Video Coding.


Introduction
In recent times, the use of multimedia information has increased exponentially. An enormous amount of data is available on the Internet, and there is not enough bandwidth to transmit it over wireless channels; because of these limitations of modern communication media, a compression process is required to fit a video into the available bandwidth. This can introduce disturbing artifacts into a coded video stream. Video quality depends on the amount of motion present in the coded stream, and every video possesses artifacts after compression that degrade its quality dramatically. 1 Compression processes are classified into two categories: lossy and lossless. Both remove unnecessary information to reduce the size of an image or video. 2 Each person can perceive the distortion differently depending on the information masked during the process. Motion or temporal masking (TM) can help adjust the relevance-distortion trade-off before quantifying its total amount.
Researchers over the years have introduced different algorithms to verify the compression ratio of coded video sequences. 3 Video quality assessment metrics have also been designed to assess the quality of video after compression. Wang et al. 4 proposed a video quality metric using a full reference video at the receiver end as an efficient video quality assessment tool. The High-Efficiency Video Coding (HEVC) standard has also been used to reduce the bit rate: by comparing HEVC and Advanced Video Coding (AVC), Tan et al. 5 demonstrated that about 59% of the bit rate can be saved at the same visual quality compared to the conventional AVC standard. Video sequences contain an array of frames, and researchers have used a present and an upcoming frame to predict motion. 4,6,7 An extensive amount of work has been done on the block matching algorithm (BMA) to estimate the motion present in a video. 6,8-10 Temporal activity is an important feature of a video sequence, and it is imperative to know its quantity in order to apply appropriate compression algorithms. 11-15 The discrete cosine transform (DCT) is also used in the literature 16-18 to convert an incoming frame into the frequency domain, where the information is quantized. Motion vectors (MVs) between each pair of consecutive frames are required to determine individual frame quality.
In this paper, the process of motion masking is applied to acquired MVs using the TM parameter (q). Our aim is to reduce the bit rate required to transmit video information over a wired or wireless channel. In this regard, MVs are used to reduce temporal redundancy by introducing the TM parameter (q). Masked motion is small and can be optimized. The MVs and the residual image are the two main components required to reconstruct the encoded frame at the receiving end. The MVs are coded in such a way that any motion smaller in magnitude than the q parameter is reduced to zero and thus requires only a single bit to encode, as discussed in section ''Proposed TM parameter (q) for motion masking.'' Motion within the frames is non-linear, and an object can move anywhere in the XY plane; for this reason, MVs can take negative values, as discussed in section ''H.264 video coding model with TM.'' With the proposed algorithm, the bit rate is reduced by up to 13% for the Sky video sequence at q = 4 without losing visual quality, as discussed in section ''Results.'' An opinion score (OS) is also included in section ''Results,'' obtained from viewers who were asked to watch the given video sequences and share their responses for a subjective analysis of the proposed algorithm. Five different video sequences are used for this purpose, and all of them scored more than 50% in OS, as given in Table 7, which validates the quality index and further demonstrates the value of the proposed algorithm. The video data set used in this paper is provided by Seshadrinathan et al. 19,20

H.264 video coding model with TM

A block diagram of the encoder of the H.264 video compression model, with the TM process included for motion reduction, is displayed in Figure 1. The H.264 encoder acquires an input video signal s(r,c,n), where r and c represent the row and column indices and, since it is a video sequence, n represents the frame number.
The feedback signal ŝ(r, c, n), also known as the predicted signal, is subtracted from the input signal to produce an error signal e(r,c,n), which may contain far less information than the original input frame. The error signal is a difference or residual image obtained by subtracting the feedback signal from the input signal. The error signal is subjected to domain transformation, widely known as the DCT, in which the residual signal is transformed into the frequency domain so that high-frequency components can be removed. 18 Only low-frequency components contribute to the reconstruction of the image or video; thus, by introducing quantization, it is possible to remove high-frequency components effectively without compromising video quality. ê(r, c, n) is obtained after quantizing the DCT-transformed frame, and it is a key component required to reconstruct the frame in order to obtain MVs; hence, it is encoded using lossless coding into a binary bit-stream which can be transmitted over a wireless channel. ê(r, c, n) is also de-quantized and transformed back using the inverse discrete cosine transform (IDCT) to recover the error response e(r, c, n); due to quantization it is not fully recovered, but most of its components are unchanged. The error signal e(r, c, n), along with the motion-compensated frame ŝ(r, c, n), is used to reconstruct the original frame s(r, c, n); this is the signal applied at the input but, because of quantization, it is not fully restored, although it has a correlation index close to unity. s(r, c, n) is stored and treated as the previous frame s(r, c, n − 1).
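The transform-and-quantize step of the residual path can be sketched as follows. This is a minimal illustration using an orthonormal 2-D DCT and a single uniform quantization step; the parameter `qstep` is an illustrative assumption, not the H.264 integer transform or its quantizer:

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix (C @ C.T is the identity)."""
    k = np.arange(n)
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C[0, :] /= np.sqrt(2.0)
    return C

def encode_residual(block, qstep=16):
    """2-D DCT of a residual block followed by uniform quantization;
    most high-frequency coefficients round to zero."""
    C = dct_matrix(block.shape[0])
    return np.round(C @ block @ C.T / qstep).astype(int)

def decode_residual(qcoeffs, qstep=16):
    """De-quantize and inverse-transform (IDCT) back to the pixel domain;
    the residual is recovered only approximately because of quantization."""
    C = dct_matrix(qcoeffs.shape[0])
    return C.T @ (qcoeffs * qstep) @ C

# Demo on a random 8 x 8 residual block (illustrative data).
rng = np.random.default_rng(0)
residual = rng.integers(-20, 20, size=(8, 8)).astype(float)
qcoeffs = encode_residual(residual)
rec = decode_residual(qcoeffs)
```

Without the rounding step the transform is perfectly invertible; the loss in the loop comes entirely from quantization, which is what makes `rec` only an approximation of `residual`.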
Both the upcoming (new) frame and the previous (stored) frame are used to acquire MVs with the help of the BMA. TM is applied to the acquired MVs to obtain temporally masked displacement vectors (TMDV), which are used in conjunction with the stored frame s(r, c, n − 1) to make a frame prediction denoted by ŝ(r, c, n); this is also called the temporally masked displacement-compensated frame (TMDCF), and the prediction process is temporally masked displacement-compensated prediction (TMDCP). The whole process repeats until the video is completed; the initial frame is transmitted without undergoing motion compensation and thus becomes the reference frame for the next one.
To reconstruct or predict the frame at the decoder end, only two components are required: the first is the MVs acquired between the present frame s(r,c,n) and the previous frame s(r, c, n − 1); the error or difference signal e(r,c,n) is the other component needed for reconstruction. Both components undergo entropy coding to convert the information into the binary bit-stream prepared for transmission or storage. At the decoding end, the whole process is repeated in reverse: the received binary bit-stream is decoded back into the two encoded components, regaining the MVs along with the difference frame. Figure 2(c) shows the difference image that can be transmitted. Figure 2(d) displays the DCT-applied image with its quantized response, which holds the lossy information needed to reconstruct the image at the decoder end. Two consecutive frames possess only a small amount of temporal activity, which can be coded as a simple difference. The DCT-applied image in Figure 2(d) contains a large amount of redundant information that can be discarded, saving a considerable amount of space: the transform generates high-frequency components with values close to zero, and quantization of the DCT output further reduces these near-zero values to exactly zero. Table 1 represents a quantized macro-block of 8 × 8 pixels. The zeros in Table 1 represent the amount of redundant information the image possesses; the number of zeros in the DCT-applied image determines the space required to store it.
Low-frequency components are located in the top-left part of the 8 × 8 macro-block given in Table 1, and the rest of the values are high-frequency components reduced to zero by quantization. Only the low-frequency components contribute to the reconstruction of the frame at the decoder end; the remaining values are unnecessary and can be coded cheaply, occupying less memory in storage. They also require a lower bit rate if they are to be transmitted over a wireless channel or uploaded to the Internet.
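The link between the number of zeros and coding cost can be made concrete with a short sketch. The block contents below are hypothetical, chosen only to mimic the shape of Table 1 (non-zero low-frequency values in the top-left, zeros elsewhere):

```python
import numpy as np

# Hypothetical quantized 8 x 8 macro-block: a few surviving low-frequency
# coefficients in the top-left corner, all high frequencies already zero.
qblock = np.zeros((8, 8), dtype=int)
qblock[:2, :2] = [[52, -7], [9, 3]]

def zero_fraction(block):
    """Fraction of zero coefficients: a rough proxy for how cheaply the
    block can be run-length/entropy coded."""
    return np.count_nonzero(block == 0) / block.size
```

Here 60 of the 64 coefficients are zero, so almost the entire block can be represented by a short run of zeros rather than 64 individual values.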
Bi-directional MV estimation using BMA

BMA is a process for identifying the movement of an object across two consecutive frames; 13 it is also considered the most widely used motion-detection method among researchers in the field of video coding. 11,14 A macro-block that holds a certain object in one frame will be slightly displaced in the next frame due to the object's motion. The algorithm searches the past frame for a macro-block similar to the one in the next or upcoming frame.

Table 1. DCT-applied 8 × 8 macro-block.

The bi-directional motion estimation process requires both the past frame and the next frame to predict the present frame. It is imperative that the search macro-block be the same size as the block in the reference frame; otherwise, there will be a mismatch error. BMA requires a specific matching criterion to identify the matching block in the previous or next frame. BMA is computationally intensive and requires substantial hardware resources, yet the amount of motion between adjacent frames is usually small; it would therefore be a waste of computing resources to search the entire image for a specific macro-block. The search is instead limited to a search window that is smaller than the whole frame, as displayed in Figure 3. Figure 3(a) shows an 8 × 8 macro-block containing an object that has moved to another location in the next frame in Figure 3(b). The red arrow in Figure 3(b) represents the MV that holds the displacement information of the macro-block. The frame in Figure 3(a) has a resolution of M1 × N1 and the frame in Figure 3(b) a resolution of M2 × N2. The search area is given in equation (1). Equation (2) defines the mean absolute difference (MAD), where dx and dy are the displacement vectors; MAD is considered the best matching criterion in the research community. MAD(dx, dy) is calculated for each candidate macro-block, which makes it possible to identify the most closely matched block, as expressed in equation (3). The frames in Figure 4(a) and (c) are used to predict the frame in Figure 4(b). In this case, both forward and backward predictions are necessary because the red object in the past frame is in front of the blue object, hiding a small part of it. The hidden part is recovered from the next frame, in which the red object has passed and revealed the hidden part of the blue object.
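The search procedure described above can be sketched as a full search over a small window, scored by MAD. The window radius and block size below are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def mad(block, candidate):
    """Mean absolute difference between a macro-block and a candidate block."""
    return np.mean(np.abs(block.astype(float) - candidate.astype(float)))

def best_match(ref, block, top, left, search=4):
    """Full search within a (2*search + 1)^2 window around (top, left) in the
    reference frame; returns the displacement (dy, dx) minimising MAD."""
    h, w = block.shape
    best, best_cost = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            r, c = top + dy, left + dx
            if r < 0 or c < 0 or r + h > ref.shape[0] or c + w > ref.shape[1]:
                continue  # candidate block falls outside the frame
            cost = mad(block, ref[r:r + h, c:c + w])
            if cost < best_cost:
                best_cost, best = cost, (dy, dx)
    return best
```

Restricting the loop bounds to the search window is exactly what keeps the cost tractable: an exhaustive whole-frame search would evaluate MAD at every pixel offset, while this version checks only (2·search + 1)² candidates per block.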

Proposed TM parameter (q) for motion masking
MVs are the most essential components required for video coding. They hold the motion activity and are used to reconstruct the required frame. Since video is the input to the proposed algorithm, time is a third dimension in addition to rows and columns. The acquired MVs are of two types: one denotes the displacement in the X direction and the other in the Y direction. These MVs give the location of a displaced object in the next frame relative to the previous frame. The decoder at the receiving end aligns the macro-blocks using these MVs and reconstructs the frame.
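The decoder-side alignment step can be sketched as follows: each macro-block of the predicted frame is copied from the previous frame at the position indicated by its MV. This is a minimal whole-pel sketch (the block size and border clipping are simplifying assumptions; real codecs handle sub-pel motion and borders far more carefully):

```python
import numpy as np

def compensate(prev, mvx, mvy, bs=8):
    """Build a predicted frame by shifting each bs x bs macro-block of the
    previous frame by its (mvx, mvy) motion vector; displacements that
    would read outside the frame are clipped to the border."""
    h, w = prev.shape
    pred = np.zeros_like(prev)
    for by in range(0, h, bs):
        for bx in range(0, w, bs):
            dy = int(mvy[by // bs, bx // bs])
            dx = int(mvx[by // bs, bx // bs])
            sy = int(np.clip(by + dy, 0, h - bs))
            sx = int(np.clip(bx + dx, 0, w - bs))
            pred[by:by + bs, bx:bx + bs] = prev[sy:sy + bs, sx:sx + bs]
    return pred
```

With all-zero MVs the prediction is simply the previous frame, which matches the intuition that a static scene needs only the residual signal.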
The human eye is unable to follow and identify low motion activity in a video sequence because high motion activity superimposes it. This feature of the human visual system can be exploited by reducing the low motion activity to save the bit rate required for its transmission. In this paper, a rule is proposed to reduce low motion activity, expressed in equations (4) and (5). The q parameter is used for TM; it is a variable quantity with a range of 1 to 4. If the value of q is set greater than 4, significant motion activity may be masked, which could dampen the viewer's experience. Equation (4) masks the MV in the X direction, while equation (5) masks the MV in the Y direction. Table 2 represents a sample of 6 × 6 extracted MV values from the frames given in Figure 2, and Table 3 represents the 6 × 6 motion-masked block to which the proposed rule in equations (4) and (5) is applied. For Table 3, q is set equal to 1; hence, all values in the MV field of Table 2 smaller than 1 are masked to zero. The same scenario is repeated for q up to the value of 4; in that case, all values smaller than 4 are masked to zero, as given in Table 4.
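The masking rule can be sketched directly from the description above (MV components smaller in magnitude than q are zeroed; the sample vector values here are hypothetical, not the entries of Table 2):

```python
import numpy as np

def mask_motion(mvx, mvy, q=1):
    """Temporal masking rule sketched from equations (4) and (5): motion-
    vector components whose magnitude is below the threshold q are set to
    zero; components at or above q pass through unchanged."""
    return (np.where(np.abs(mvx) < q, 0, mvx),
            np.where(np.abs(mvy) < q, 0, mvy))

# Hypothetical 2 x 2 MV fields for illustration.
mvx = np.array([[0.4, -2.0], [3.5, 0.9]])
mvy = np.array([[1.2, -0.3], [4.0, 2.5]])
```

At q = 1 only the sub-pixel wobble disappears; at q = 4 nearly the whole field is zeroed, which is why larger q saves more bit rate but risks masking visible motion.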
Results

Figure 5(a) represents a reconstructed frame without the TM parameter (q), and Figure 5(c) is the extracted MV field used to reconstruct it. Figure 5(b), (e), and (f) show the frames reconstructed using the TM parameter (q) with q = 4, 2, and 1, respectively, and Figure 5(d), (g), and (h) show the corresponding MV fields. The small blue arrows in each MV field point in the direction of the moving object contained in the macro-block. The MV field in Figure 5(d) has the least temporal activity, with most values reduced to zero, since the TM parameter (q) is equal to 4. If the value of q is reduced to 2 or 1, the MV field retains more temporal activity, as shown in Figure 5(g) and (h). White spaces in the MV fields of Figure 5(c), (d), (g), and (h) are zeros, meaning there is no motion activity in that section of the frame. Figure 6(a) is a frame reconstructed using the standard H.264/AVC format, and Figure 6(c) is its MV field; Figure 6(b) is a frame reconstructed using TM at q = 4, and Figure 6(d) is its MV field. The algorithm is applied to the five different video sequences listed in Tables 5 and 6 to increase the data set. The performance of the proposed algorithm is measured and compared with standard H.264/AVC video using PSNR and SSIM as quality metrics. For the Sky video sequence, the lower threshold values of PSNR and SSIM are 35 dB and 0.8, respectively, which define the acceptable range of the quality index. The bit rate figures show the number of bits saved by the proposed algorithm, as can be seen in Figure 7(a). With q set to 1 at a bit rate of 1750 kbps, 1.2% of the bit rate is saved while the PSNR is maintained at 38.72 dB, which is extremely close to the 38.81 dB of the standard H.264/AVC video coding.
SSIM for the proposed algorithm at q = 1 is 0.934, while it is 0.941 for the standard H.264/AVC. In a real-time application such as this one, the difference is negligible. If the value of q is greater than 4, the difference between standard H.264 and the proposed algorithm becomes prominent; thus, the range of q is set from 1 to 4. At q = 4, the PSNR value is 35.3 dB. The values in Table 5 for the TM parameter q = 1 are very close to those of the standard H.264/AVC format. At q = 2 and 4, these values differ slightly, but at the same time more bit rate is saved at the cost of some quality, as displayed in Table 5. This difference is extremely small and cannot be noticed in video sequences played at 30 frames per second (fps). For example, the PSNR and SSIM values of the Traffic video sequence are 33.96 dB and 0.931, respectively, when it is compressed using the standard H.264/AVC format, as shown in Table 6. After implementing the proposed TM algorithm, the PSNR and SSIM at q = 1 are 33.2 dB and 0.9, respectively, while saving 1.9% of the bit rate at 1750 kbps, as shown in Table 6. This difference is almost negligible when a video is played at 30 or 60 fps. Table 7 gives the OS, which was collected from viewers who were shown 20 videos of 10 s each. For each video sequence in Table 5, four different videos were displayed: for instance, video 1 (standard H.264/AVC format of the Sky sequence), video 2 (TM applied at q = 1), video 3 (TM applied at q = 2), and video 4 (TM applied at q = 4). Viewers were told that video 1, video 5, video 9, video 13, and video 17 were the reference videos. The opinions of 50 people were collected to generate the OS. It can be seen from Table 7 that video 2 scored 83%, meaning that 83% of viewers identified this video as a reference video or were unable to tell the difference between them.
The reference video was played first, and viewers were then asked to comment on how closely each of the three compressed videos resembled the reference.
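The PSNR values reported in Tables 5 and 6 follow the standard definition; a minimal sketch for 8-bit frames (the 255 peak value is the usual assumption for 8-bit video, and frames are assumed to be same-shaped numpy arrays):

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference frame and a
    reconstructed frame; identical frames give infinite PSNR."""
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)
```

A value around 35 dB, the lower threshold used above for the Sky sequence, corresponds to a root-mean-square pixel error of only a few levels out of 255, which is why the reconstructions at q = 1 are visually indistinguishable from the standard H.264/AVC output.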

Conclusion
Compression is the most important and essential part of video coding. It is the key component that makes it possible for multimedia information to be delivered within a given bit rate; the communication channel can be wired or wireless. In this paper, an algorithm is proposed that applies TM to MVs to reduce the bit rate. A video sequence is composed of multiple individual frames, and consecutive frames in the same sequence retain a large portion of information from their predecessors while possessing temporal activity. Low temporal activity between consecutive frames, insignificant to the human eye in real time, is masked using the proposed TM parameter (q) of section ''Proposed TM parameter (q) for motion masking.'' By varying q, it is possible to control the bit rate required for transmission. SSIM and PSNR are the two metrics used to assess the performance of the proposed algorithm. At a bit rate of 1750 kbps, the SSIM was found to be 0.934 at q = 1, compared with 0.941 for standard H.264/AVC. This difference is almost negligible to the human eye when the video plays in real time, and in return 1.2% of the bit rate is saved. The acceptable threshold of PSNR is 35 dB; below this threshold, the differences between frames become more prominent. For this reason, the range of q is set from 1 to 4. The proposed algorithm performs slightly better in terms of bit rate saving as the q parameter is varied.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.