Hardware Architecture for Real-Time Computation of Image Component Feature Descriptors on a FPGA

This paper describes a hardware architecture for real-time image component labeling and the computation of image component feature descriptors. These descriptors are object related properties used to describe each image component. Embedded machine vision systems demand a robust performance and power efficiency as well as minimum area utilization, depending on the deployed application. In the proposed architecture, the hardware modules for component labeling and feature calculation run in parallel. A CMOS image sensor (MT9V032), operating at a maximum clock frequency of 27 MHz, was used to capture the images. The architecture was synthesized and implemented on a Xilinx Spartan-6 FPGA. The developed architecture is capable of processing 390 video frames per second of size 640 × 480 pixels. Dynamic power consumption is 13 mW at 86 frames per second.


Introduction
Computation of regional descriptors based on gray value, color, texture, and geometrical features for segmented image components is a basic step for many machine vision systems.The requirement for the detection of image objects and their feature calculation can be seen in a huge number of applications, for example, medical science [1], optical navigation [2], thin film inspection [3], and smart cameras [4], as well as in industrial process monitoring [5].Hardware architecture design for machine vision systems is thus of great importance and is also an active research topic.The architecture developed in this case ensures a high frame speed, low latency, modularization, and also low power consumption.The developed architecture is suitable for smart camera applications, where only refined and processed information is sent over a low bandwidth communication channel rather than sending whole video frames.This smart camera can form part of a visual sensor network in order to send processed results to any other node or to the base station.
Machine vision algorithms are often divided into the following steps as shown in Figure 1 [6].Video is acquired from an image sensor at image acquisition.Image objects are extracted from the preprocessed video data at Segmentation.
During Labeling, pixels belonging to the same image component are assigned a unique label.During Feature extraction, an image component is described, for example, in terms of region features such as ellipse, square, or circle parameters.Components can also be described in terms of gray value features such as mean gray value or position.These features are sometimes also referred to as descriptors.Feature information can then be used for the Classification of image components.Information with regards to recognized objects in the camera's observation area can then be transmitted through the camera's output using, typically a low bandwidth.As typical reactive systems, such as robot navigation or surveillance of industrial processes, demand a high frame speed and low latency, it is necessary for there to be highly parallelized hardware architecture in order to provide rapid computation of image component features.Field Programmable Gate Arrays (FPGAs) are thus chosen as the preferred computation platform as they possess massive parallelism, on chip memories and arithmetic units.
In this work, the main focus is on image component labeling and feature extraction.Component labeling assigns a unique label to each connected component in an image.The classical 2-pass labeling of an image component is described in [7] and is shown in Figure 2. The pixel neighborhood is analyzed for the 8-connectivity.The neighboring pixel either gives its label to the currently processed pixel or, alternatively, labels are recorded as equivalent, which is referred to as label merging.In Figure 2, in the third row, assigning label to 2nd pixel means that it has two neighboring pixels, labeled 2 and 3.The smaller label is assigned to the current pixel, that is, 2 and labels 2 and 3 are recorded as equivalents.If there is no labeled pixel in the neighborhood, a new label will be assigned to the currently processed pixel [8].From a typical image, many equivalent labeled pairs merge and these pairs form linked lists in the equivalence table .Resolving equivalent labels means that all the merged labels, which are linked to the same list, must be replaced by one unique label [9].The second scan then assigns the unique label to each pixel in the image component.In our implementation, only equivalent labels are stored and resolved but the resolved labels are not written back to the image, further referred to as single pass labeling in this paper.The equivalence table can be implemented in a memory, where all memory cells are initiated to their own address [10].The label merging is illustrated by means of a graph in Figure 3.It shows the merging of four label pairs from the example in Figure 2.
After the decision is made that labels 2 and 3 are equivalent, the contents of label 3 can then be altered by means of the content of label 2. Thus, this is the starting point for label 3 in relation to its equivalence to label 2. At the end of each frame these equivalences must be resolved to the root of the link list.Five labels in a linked list can be seen after the labeling of the pair (4,5).The final graph shows how this linked list has been resolved so that all labels of the same image component should point to a unique value.
The key design metrics that were considered while developing the architecture for component labeling and feature extraction for real time machine vision systems are as follows: (i) high frame speed and low latency, (ii) modularization, that is, a clear separation between component labeling and the extraction of different image component feature descriptors, (iii) parameterizeable (frame size, label code word length, etc.), (iv) low power consumption.
The developed architecture is suitable for an application that requires a high frame speed, low power consumption, and programmability.Battery operated smart camera nodes for surveillance [11,12] and camera based navigation for blind persons [13] require low power architecture as well as programmability in order to incorporate any changes depending on the external environment.Camera based industrial process monitoring requires a high frame speed [5].Our developed architecture addresses the issue of modularity and there is a separation between the image component labeling and feature extraction.This work presents three basic image component feature descriptors, namely, Centre Of Gravity (COG), area, and bounding box.It is possible that future work will be able to extend this architecture with more complex descriptors, for example, geometrical shape, color, and texture.Our belief is that this work is a valuable scientific contribution in those areas for which power consumption or speed of a machine vision system is of the highest importance.
This paper is organized as follows.Section 2 describes the related work.Section 3 describes the hardware architecture for image component labeling and feature extraction.Section 4 discusses the memory accesses and storage requirement for different features.Section 5 explains memory implantation and the results with regards to area and simulated power consumption.Section 6 explains the performance and device utilization results and Section 7 is a discussion of the results.

Related Work
This paper presents new hardware architecture for connected component analysis and there is also a requirement to highlight its relevance by referring to applications and state of the art research within the computer vision community.With regards to the recognition of image objects, labeled image components (blobs) are described based on features such as texture, color, or geometrical shape.This paper includes hardware models for three basic feature descriptors, namely, COG, area, and bounding box.More complex descriptors are often derived as histograms accumulated over described image regions [14].The architecture presented in this paper can be extended by means of an additional hardware module to support the accumulation of histograms.An Edge Histogram Descriptor (EHD) can be accumulated to describe the shape of the regions [15].Local Directional Pattern (LDP) and Local Binary Pattern (LBP) are coded descriptors regarding the difference in gradients and gray values used to describe the texture of the image regions [16,17].LBP was used by Mohammed and Yampolskiy in relation to face recognition [17].Bounding box, COG, and area of blobs were used by Yoon et al. in relation to the recognition of a car's license plate and its characters [18].A Gabor filter bank was used to segment fabric defects on textile as blobs [19] and this is a good example of a real-time application that could benefit from additional analysis of geometrical shape, position, and size of defect regions.Chan and Pun present work on character recognition based on connected component analysis [20].Successful experiments have been conducted by us using connected component analysis with regards to the reading of signs for speed limits [21].
Connected component analysis is very much dependent on an efficient method for segmenting images into regions.Too many spurious regions will increase the workload on the processors and will also increase the requirements with regards to memory storage.Luis-García et al. defined a success score for the evaluation of efficiency of segmentation [22].Cao et al. provide an example regarding how a color gradient tuned to a specific pair of colors can be exploited for the efficient segmentation of road signs [23].We have published similar work for the segmentation of colored labels [24].
For real-time image component labeling and feature extraction, several methods and proposed architectures were studied.The first connected component labeling method was proposed by Rosenfeld and Pfalz [25].The labeling was performed in Raster scan order, which is the same as the output from a camera.However, this method requires a large global table for equivalent label pairs.Parallel hardware architecture was proposed in [26], utilizing two processing units on an FPGA and in this case, 2 pass labeling was used.Image component labeling is performed for  2 clock cycles for an  ×  image.The working clock frequency achieved was 80 MHz.This corresponds to, approximately, 256 VGA sized frames per second which are only able to be processed for a labeling operation.
Aggressive approaches were presented in [9,27,28], which only use one labeling pass and by which it is proposed that the image features can be calculated.The main emphasis of their work was to minimize the memory usage and they have, additionally, reported the worst case scenario for a number of labels.A row buffer is used to store the label assigned to a previous row and a merger table is used to retain a record of the equivalences.The resolving of the merged labels is conducted during the horizontal blanking period.The maximum clock frequency achieved was 40.63 MHz for an image of size 640 × 480.The architecture proposed in this case is, in the main, based on their novel work.The similarity to Bailey et al. work is that the authors also used single pass labeling approach to avoid memory requirements for image storage.In addition the proposal is to compute feature descriptors in well-separable hardware modules.The labeling process continues on incoming pixel, while the feature extraction continues uninterrupted in separate hardware modules.This approach allows for an easy inclusion of combinations of descriptors depending on the application.The massive memory bandwidth on the FPGA is efficiently exploited when each descriptor is accumulated in separate sets of blockram memories.
There are still differences that are worth mentioning.Firstly, we resolve merged labels after scanning each video frame and no additional clock cycles are required for row or frame synchronization, thus increasing both the throughput and frame speed.It should be kept in mind that the row and frame synchronization overhead is a mechanism inherited from the blanking of the old Cathode Ray Tubes (CRT).This unwanted overhead only reduces the throughput of a digital video system.Secondly, the proposal is to compute feature descriptors in well-separable hardware modules.The labeling process continues on incoming pixel, while the feature extraction continues uninterrupted in separate hardware modules.This approach allows for an easy inclusion of combinations of descriptors depending on the application.The massive memory bandwidth on the FPGA is efficiently exploited when each descriptor is accumulated in separate sets of blockram memories.Secondly, the architecture was developed at RT level rather than using higher level abstraction model, for example, Handel C.

International Journal of Distributed Sensor Networks
Real time component labeling with boundary tracking was described in [10].Two-pass labeling is used such that a locally labeled image must be stored in the FIFO memory, thus increasing the latency, power, and hardware utilization.
FPGA implementation of component labeling is presented in [29] in which they use Handel C as the development language and off-chip memories to store the images, which is not a power efficient technique.Another image component labeling implementation in Handel-C is described in [30].Handel-C can support the parallel execution of a program but a more optimized hardware design can be achieved by using HDL.Handel-C implementation of an encryption protocol is compared with a VHDL implementation by Mylonas et al. [31] and here, the authors reported that a VHDL based design requires fewer resources and can be operated at a higher clock speed as compared to that for a Handel-C design.
Component labeling using contour tracking was implemented in [32].The proposed architecture is not suitable for real time processing, and the required extra memories increase the power consumption.Another labeling implementation in Handel-C is described in [30], but this is not a power-efficient implementation.
Image component labeling on a one-dimensional DSP array was described in [33].In this case, the majority of the execution time is spent on interprocessor communication, and it can also be determined that the latency for the implementation is quite high.In relation to this a 128 × 128 image size was used.The execution time for binary images takes hundreds of milliseconds and for a gray scale image it is in thousands of milliseconds.This execution time leads to a low frame rate and is thus not suitable for real time applications.
No research discovered, which is related to connected component analysis on FPGA, focused on a comprehensive investigation of power and speed.

Hardware Architecture
This section describes the hardware architecture for image component labeling and feature extraction being proposed as suitable for implementation on an FPGA.Image component labeling assigns unique labels to different image components in an image as described in Section 1. Pixels are assigned labels at 5, based on the neighboring labels in 6 to 9 as shown in Figure 4(a).A delay line of one FIFO-buffer and two registers holds the recursive data dependency arising from the neighborhood of previously labeled pixels, as shown in Figure 4(c).The length of the FIFO buffer is   − 2 elements where   equals the length of one image row.The kernel for labeling and feature extraction is shown in Figure 6.
The Labeler assigns the label codes.The label pairs (A, B) are sent to the equivalence table whenever neighboring labels are found to be equal and which must thus be merged.Label merging is targeted to either Table S or T dependent on either an odd or even frame (O/E).The resolving of linked lists of labels is thus frame interleaved with labeling and label merging.When the resolver is ready, the final steps of feature extraction can be performed.This final calculation of features will require access to the resolved equivalence table.At the same time as the labeler is assigning label codes to different component pixels, the data of that particular pixel is accumulated in the data table S or T for feature extraction.When the labels are resolved in an equivalence table, the data present for the different label codes of same image component is resolved.There are multiple instances of data tables S and T in Figure 6, which indicates the different features that are desired to be, computed for example, area, bounding box, and so forth.
The architecture has no dependency on row synchronization for either assigning labels or label merging.Data required for the extraction of features accumulates in parallel with the labeling process and more Data tables can be added for additional features.
In this paper the computation of three basic image component features, namely, COG, area, and bounding box has been implemented.The Centre of Gravity of an image component  can be defined as [34] In (1),   ,   are the centroid of the image object and   ,   are a pixel belonging to the image component .(  ,   ) is the intensity value of the pixels in the image component .This intensity value can, for the simplest case, be a binary value.However, if a high subpixel precision is desired, the bit-width for intensity should be selected to be sufficiently large such that the quantization noise is small in comparison with other noise sources [34].
The area of an image component equals the sum of the pixels it contains.The area of the image component  is given in (2), where   ,   are the indices of the pixels belonging to the image component : The bounding box is the minimum rectangle enclosing an object.It is parameterized by its minimum and maximum coordinates in the row and column directions, that is, coordinates of the upper left corner (ULC), and the lower right corner (LRC).An example of a bounding box is shown in Figure 4(b), where   ,   are the indices of the pixels belonging to the image component : (3) 3.1.Image Component Feature Calculation.The architecture for image component labeling and feature calculation is described in Figure 6.The computation of the image component features is controlled by a sequencer, as shown in Figure 7.The sequencer is controlling all the feature extractions and is also responsible for sending the extracted feature data to, typically, a classification step.The COG unit accumulates the numerators and denominators and performs serial divisions for the final COG output.This accumulation of pixel data in the COG, area and bounding box units are conducted in parallel with the labeling process.When the equivalence table is resolved, the Table ready signal becomes active and the sequencer starts processing the data from those regions belonging to the same image objects.The sequencer is able to perform this data processing based on the resolved equivalences read from the equivalence table, as shown in Figure 5.After adding up the data from the different regions of the same component in the COG unit, the numerators and denominators are ready for division.After the completion of the division, a compute done signal is sent back to the Sequencer, which then enables Feature Strobe, to send the COG value to communicate to the external host.For the area and bounding box parameters, the valid data is ready after resolving the data of the same image components with different codes.The sequencer scans the equivalence table for all image objects until all objects' features are computed and transmitted.

Memory Access for Equivalence Table
This section describes how memory accesses are managed for the equivalence table during the merging of labels.Label merging is performed in such a manner that linked lists of equivalent labels are mapped onto a linear memory array as described in Figure 3.The equivalence table is fed with two equivalent labels,  and , to be merged.(i) Initialize all memory cells to its own address at the start of every frame.
(ii) If labels  and  are equivalent, read the contents of the memory cells *  and *  addressed by  and .
This algorithm requires three memory accesses for every pair of merged labels.In addition, the merging of the two labels is required to be performed in one single clock cycle in order to keep the system clock frequency low as this is essential based on the dominance of the high power consumption in the clock nets [35].The conclusion to be drawn from this is that it is necessary to pipeline the merging algorithm and introduce a triple port memory.At the first clock cycle, the contents of the memory cells  and  are read from memory.However, if any of those cells are scheduled to be written in the next pipeline step, the memory content is copied directly from the pipeline.A memory write occurs at the second clock cycle.The triple port memory is synchronous, thus providing a clock cycle delay for readwrite operations.

Memory Storage Requirements.
In this section the discussion will be in relation to the memory requirement for different modules used in our design.According to Section 3 and (1), numerators N and denominators D are accumulated in the memory storage.The worst case scenario for the COG calculation will be at the point when a large image component of size  ×  pixels is present at the right bottom corner of the video frame, where the row and column have their maximum values, as shown in Figure 8(a).The frame size is  and there are  rows and columns.The maximum pixel intensity  max is assumed to be a power of two since it is given by the bit-width assigned for the representation of the gray level intensity  of the pixels.For the memory storage estimation, the assumption is that the worst case object has the maximum intensity for all its pixels.Firstly, the maximum integer value  Num for the center of the numerator row is calculated, assuming that the row dimension is the largest.From (1) it can be concluded that Substituting ( 6) into log 2 of ( 4) where  Num is thus the number of bits required to store one single numerator.For these experiments, the following values have been used for , , , and  max :  = 640,  = 480,  = 170,  max = 255.For these values, the value of  Num ≤ 32.This means that, for the maximum image component of size 170 × 170, 32 bits are required in order to accumulate the numerator into one single memory cell.The depth of the memory for all data tables is chosen to be 1024; thus, a total of 32 Kb of memory is required.
The memory requirement for the denominator in the COG unit is given in (8).For object size of 170 × 170, the denominator will require 23 bits and a total of 23 Kb memories.The memory requirement,   , for the area unit depends on the maximum object size.For an object of size  × , this is provided in (10).Consider the following: For  = 170, 15 bits are required to store the area of one object and a total of 15 Kb is required.For the bounding box unit, the storage requirement,   depends on the maximum image dimension chosen for the feature calculation: For an image size of 640 × 480, 10 bits are used to store the maximum row or column positions and a total of 10 kb memory is required for each corner parameter of the bounding box.For all the feature descriptors, two data tables are being International Journal of Distributed Sensor Networks 7  maintained such that the estimated memory requirement will be doubled.The depth of the tables was chosen to be 1024.The data table depth depends on the maximum number of image components and their complexity.The number of image components is related to the segmentation process that separates the image objects from the background.Too many spurious image components will increase the workload on the processors and will thus increase the requirements on the memory storage.The method for segmentation should thus be selected and evaluated with regards to its success in finding the components of interest [22].The images from the blood sample of a malarial patient are studied in [36].The segmented image in this study shows approximately hundred segmented objects.Real-time speed traffic sign detection and its classification were studied in [21].The number of image components has been studied at different distances from a speed sign under different illumination conditions and, in this case, color segmentation was used to separate the speed sign from the background.The results from this study showed that, at most, 90 image components are present in the image after segmentation and thresholding.The number of image components after the optimization of colored labels positioned in an office environment was reported in [24].No more than 500 components were segmented for high resolution images which possessed a complex background.Another study on magnetic particle detection from machine oil images is presented in [37].The author reported the size and number of particles in this study.Over an operational period of 3 months the author noted that there were approximately 24 particles which had a size greater than 4 um.The object size of 170 × 170 is chosen as an example for the selection of the storage requirement.This object size is sufficient for the application stated above.For applications where the object size is greater than 170 × 170, an appropriate feature data width can be selected.
A good segmentation will result in rather few objects which is necessary in real time system in order to reduce the computational burden.The table depth of 1024 is sufficient for the above presented applications.The depth can be customized to a greater extent, based on the nature of the specific application and the nature of the objects of interest.

Experiments and Results
Experiments and results on implementation of triple-port memories, utilization of FPGA area, power consumption, maximum system throughput, latency, and functional verification are reported in this section.Power consumption is both measured on a hardware prototype as well as compared with simulations.The hardware prototype is shown in Figure 13.This prototype is a Digilent Atlys Spartan-6 board [38], connected with a CMOS imaging sensor from Aptina, MT9V032 [39].
International Journal of Distributed Sensor Networks

Memory Implementation.
From Section 4 it is evident that the label merging process requires three memory accesses per clock cycle, that is, one write and two reads.The equivalence tables in the labeling process are implemented as triple port memories.The Xilinx tool set duplicates the memory storage during synthesis; that is, it uses two block RAMs.It is possible to use multitap delay lines if these memory accesses are regular and occur within the same clock cycle.However, for the equivalence table, the memory accesses are irregular and an independently addressed multiport memory is required in order to implement three memory accesses in one single clock cycle.Another approach in relation to providing an improved memory bandwidth would be to increase the clock frequency and to serialize the memory accesses.However, the dynamic power dissipation is then expected to increase dramatically, since the clock nets dissipate the majority of the dynamic power [35].
In this case, an experiment has been performed which involved doubling the clock speed using a dual port block RAM and serialized memory accesses in comparison with the use of triple port memory being implemented as duplicated block RAMs.The memory bandwidth was selected to be 103 MB/sec for both cases.Synthetic data was used with regards to the reading and writing in both cases of memory storage.Block RAMs were enabled only during clock cycles when memory was accessed for writing or reading.The results for the dynamic power consumption are summarized in Figure 9.The results show that clock nets are power hungry and increasing the clock frequency to achieve a certain bandwidth consumes more power than using additional block RAMs.

FPGA Area Utilization and Power Consumption.
The utilization of logic cells and power consumption are two important design metrics for an FPGA based hardware system.Power simulations have been conducted for different configurations based on different label code word lengths.The hardware architecture was implemented using a Digilent Atlys Spartan-6 board, which has the capacity of monitoring the power consumption on the different available voltage rails.The FPGA core is connected to the 1.2 v rail.The monitors are based on Linear Technology's LTC2481C sigma-delta analog-to-digital converters that return 16-bit samples for each channel [38].The clock frequency for the experiments was chosen to be 27 MHz.
The experiments were firstly performed on component labeling alone in order to determine how different label code word lengths affect the area and power consumption.Assigning the number of bits for label codes must be based on the estimated maximum number of image components and their complexity.Figures 10 and 11 show the results as reported by the design tool set, Xilinx Integrated Service Environment (ISE), after post place and route simulations.The power simulations were conducted using the Power tool included in ISE.Experiments have also been performed for dynamic power consumption based on different numbers of objects for labeling along with a COG calculation.4, 8, 16, and 30 objects of 6000 pixels were chosen in relation to this task.These three cases correspond to 8, 16, 31, and 58 percent of the total image area.The power consumption was shown to increase as the number of objects increased.The results of the simulated and measured values are shown in Figure 12.The measured values show an increase in dynamic power dissipation as the number of objects increased.

Functional Verification, Performance, and Device
Utilization.The hardware architecture was modeled in VHDL and this model was simulated at the register transfer level.The architecture was implemented using a Diligent Spartan-6 board as shown in Figure 13.The experiments were conducted using a variety of input stimulus and two sample images are shown in Figure 14.The first image shows the segmented image of magnetic particles in the fluid of a hydraulic machine.The presence of magnetic particles in the fluid demonstrates the deterioration inside the machine.The oil particles are correctly detected in the image and their features are extracted in a correct manner.The number of magnetic particles and their occupied area can provide an indication in relation to the status of the machine.The second image shows a speed sign and, in this case, the objects have been correctly counted and the extracted features can assist with regards to its correct classification.
Both the device utilization and power consumption for the whole architecture, as shown in Figure 6, are presented in Table 1.The pixel clock frequency was set to 27 MHz at a frame speed of 86 frames per second.Ten bits are assigned for label codes, 34 bits are used for the numerator and 19 for the denominator in the COG calculation, and 17 bits are used for the area of each component and 10 bits for each of the bounding box parameters.The Spartan-6 device that was selected for implementation is based on a 45 nm CMOS process, which is optimized for low power and cost.Table 1 shows the simulated power consumption for different elements of the FPGA, the total power consumption, the maximum clock frequency on which the whole system can be operated, and the number of block RAMs and the FPGA slices' utilization.For comparison, a measured value for the dynamic power consumption is also reported.This measured value was derived after firstly measuring the total power and then subtracting another measured value for the static power consumption.

Quantitative Comparison with Other Architectures.
In this section a quantitative comparison of the proposed architecture is presented with already published work.The comparison includes the maximum achievable frame speed and the power consumption results.The comparison is presented for the labeling unit only, in order to provide a fair correspondence of the results for the maximum frame speed and power consumption associated with related published work.This result is reported for the Xilinx Virtex-II pro-FPGA so as to compare it appropriately with the work of Bailey et al.The result from the maximum achievable clock speed is presented in Table 2.The maximum frame speed can be calculated based on the maximum clock speed values.For the proposed architecture, there is no dependence on a row blanking period and for a frame size of 640 × 480, the maximum frame speed is around 420 frames per second.The work presented by Bailey et al. [9,27,28] performs the label resolution during a horizontal blanking period.Their stack size can accommodate 128 merges.If one clock cycle is required to resolve one label pair, then at least 128 clock cycles are necessary in order to resolve the whole table .The maximum frame speed that can be achieved based on a horizontal blanking period and he reported maximum clock frequency is around 110 frames per second.Kim performed image component labeling and boundary tracing using the Xinlinx Spartan-3 FPGA and the maximum frame speed reported in this paper was at 72 frames per second.The presented results show that the proposed architecture in this paper is more robust as compared to the architectures presented in [9,10,27,28].

Latency.
Latency is important in time critical machine vision systems and in a video frame it is data dependent as the pixel data originating from the sensor array is sequential in nature.The latency of computation for a labeling process can be defined as follows.Def: latency is the time from arrival of the first pixel belonging to one object until the time point when the whole object is assigned one unique set of labels.
The latency of our proposed architecture is for the worst case scenario and is equal to one video frame.The latency value reported in [9,27,28] is equal to one image row.This claim of latency is only true if an image object occupies one row.In this scenario, the labeled object can be obtained after International Journal of Distributed Sensor Networks  a row blanking period.It is very unlikely to discover that such an object only occupies one row.For example, if there is a "U" shaped object that extends from row 100 to 200, as shown in Figure 8(b), then the labeling process for this object cannot be completed until row number 201 has been scanned.Even if the equivalences are resolved during the horizontal blanking period, the labeling process will be complete only at the end of the image object.In this case, the latency of the labeling process, as defined above, is equal to 101 image rows.In the worst case scenario, an image object extends from the top to the bottom of an image frame.In this case, an object cannot be labeled until the end of the frame, thus producing the latency of one image frame.The claim presented in [9,27,28] is that the latency of one row is valid for only one specific case of input objects.The latency of the architecture presented in [10] is also equal to one-image frame.

Power Consumption.
Low power consumption is important for battery operated machine vision systems [11][12][13].There is no reported data on power consumption in the work of Bailey et al. [9,27,28].However, the row blanking period used to resolve equivalences does have an impact on the frame speed.Since our proposed architecture does not have any dependence on a row blanking period, it is possible to achieve the same frame speed at a lower clock frequency.This lower clock frequency will result in lower power consumption as the dynamic power consumption is directly proportional to the operating frequency.
Post place and route simulation have been performed to show this effect more clearly and the results for power  consumption with a zero clock cycles row blanking period and 128 clock cycles row blanking period will be presented.The frame speed chosen for study is 86 frames per second.To achieve 86 frames per second it is necessary to have a 27 MHz clock frequency with zero blanking periods, while for 32 MHz it requires a 128 clocks cycle blanking period.The result in Table 3 clearly shows that achieving the same frame speed with 128 cycles for the row blanking period requires a higher clock frequency as compared to the design without  any dependence in relation to the row blanking period.The power consumption will also subsequently increase as the clock frequency is increased.

Discussion
In this work, hardware architecture is presented in which the image component features are computed in parallel along with single-pass labeling of connected image components.Since there is no requirement for intermediate storage of image data, the latency and overall memory utilization are maintained at a minimum.The characteristic of the developed architecture is that the majority of the data required to calculate the image features is collected in the data tables along with the labeling process.This single-pass labeling and simultaneous data accumulation are similar to that presented previously by Bailey et al. [9,27,28].When the equivalences are resolved in the equivalence table, data, which is stored on different codes, is accumulated and postprocessed into feature descriptors which are used as the input for subsequent object recognition.Our work has similarity with Bailey et al. [9,27,28] in a way that single pass labeling is used and feature extraction is in separate hardware modules.Parallel memory accesses within multiple hardware modules to banks of block ram memories will efficiently exploit the massive memory bandwidth available on the FPGA.In addition, the separation of computations is desirable from the perspective of the RT-level modeling of machine vision systems.It allows for different combinations of hardware modules for the parallel computation of an object perimeter, ellipse features, feature descriptors which are computed as histograms, and so forth, to be included into the connected component analysis.These feature descriptors are all relevant in relation to the connected component analysis used for applications such as those presented in [18,20,21,36,37].We believe that modularity and a clear separation between the descriptor computations and the labeling make the presented architecture suitable for design automation.It is also believed that a CAD tool can be developed to read a high level system description where a user can include feature descriptors of his/her own choice.From this system description, the CAD tool can be made to generate synthesizable VHDL code.The Embedded Development Kit (EDK) of Xilinx is a similar source code generator for the modeling of general signal processing systems.The EDK allows a user to add IP components of his/her own choice and generate synthesizable VHDL code for Xilinx platforms.Visual Applets is a tool for graphical modeling and for the synthesis of machine vision systems on an FPGA [40].
Our work differs in scheduling of resolving merged label if it is compared with that of Bailey et al. [9,27,28].The presented architecture resolves labels at the end of each video frame, while Bailey et al. present a method for resolving labels during row synchronization by exploiting the additional clock cycles during the blanking periods.Row-and frame synchronization is inherited from the old analog TV systems using CRT displays.Blanking periods are an unnecessary overhead for digital video systems and will thus reduce the system throughput.The architecture presented in this paper operates independently of blanking periods and will thus allow for higher frame speeds but at the cost of longer latency.Latency is a very important issue for real time systems and, in particular, for applications such as optical navigation and modern optical sensor based video games.
The system should be fast enough to give result for decision making as quick as possible.Our architecture provides this low latency.It is almost equal to one image frame period.The latency value can be reduced to 2.5 msec if operated at maximum clock frequency and if avoid synchronization is overhead.This performance achievement is due to the massive parallelism available in FPGAs.
Area and power consumptions for component labeling together with the computation of COG, area, and bounding box are summarized in Table 1.Only 12% of the available slices and 7% of block RAMs are used.One block ram is used for the delay of   − 2 labels from the previous row.Four block rams RAMB16 [41] are used for two equivalence tables, as there are two tables but they are triple port memories and a synthesis tool duplicate memory that uses two block RAMs.Our investigation of triple port memories shows that parallelization is more power efficient than serialization of memory accesses.Figure 9 shows the use of two dual port block RAM versus a single dual-port block RAM in combination with an increased clock speed to achieve the same memory bandwidth.
In order to store 34-bit value from one of centroid, two 16-Kbit block rams (RAMB16) are required.Four RAMB16 are required to maintain two such data tables for odd and even frames.RAMB16 can accommodate up to eighteen bits if no parity bits are used.For the denominator data of 23 Kb, one RAMB16 and one RAMB8 are required.Two RAMB16s are required for the area calculation and eight RAMB16s are required for one bounding box parameter.In total, 25 RAMB16s and two RAMB8s are used for the whole architecture.This can be viewed in the synthesis results reported in Table 1.
The maximum clock frequency computed by the tool set is 120 MHz which means that it is possible to process up to 390 frames per second for image component labeling and feature extraction.This frame speed is calculated without including the synchronization overhead present in a video stream.If controls of camera parameters are limited, the suggestion is to employ an asynchronous FIFO buffer on the input video stream to store the incoming pixel data and to reduce the output clock speed such that blanking periods with nonvalid pixels are removed.This means that the pixel clock frequency for the computations on the FPGA can be lower than for the pixel clock connected to the image sensor.Maintaining the FPGA pixel clock frequency to be as low as possible is efficient for reducing the dynamic power consumption.This becomes clear from Figure 12 showing that the clock nets are the dominant source of power dissipation.A smart camera system usually has a good control of all sensor parameters through a register programming model.This will allow for the synchronization overhead of the sensor readout to be reduced to a minimum.For the Aptina sensor MT9V032 used for experiments in this paper, the overhead can be reduced to only a few clock cycles per frame [39].Values for the dynamic power reported in Table 1 and Figure 12 are measured for a sensor readout having a synchronization overhead of 16 percent.
Figure 12 shows the increased power consumption while increasing the number of image objects.This is due to the higher power consumption in block RAMs, signal nets, and logic while memory transactions are occurring.The block RAMs are only activated when data transactions to or from the memory occur.If block RAMs remain enabled at all times, then their power consumption becomes even greater than for the clock nets.The fact that the power dissipation stemming from the clock nets is dominant offers an explanation as to why the total dynamic power is only increasing by 36 percent while the amount of data to be accumulated for feature descriptors is increasing by 650 percent in Figure 12.The dominant power dissipation from the clock nets also offers an explanation as to why the serialization of memory accesses is less power efficient in comparison with-using triple port memories.The reason for this is because the serialization of memory accesses for increased memory bandwidth also requires an increase in the clock frequencies.This is shown in Figure 9.The dominant static power, as shown in Table 1, can be reduced by selecting a more appropriate device since for the current setup of the architecture only 12% of the available slices are being used.

Conclusion
The hardware architecture presented in this work is suitable for the efficient implementation of machine vision systems.This architecture supports robust, high speed, low latency, and low power smart camera applications.
Configuring the imaging sensor for a reduced synchronization overhead can either increase the maximum frame speed and, simultaneously, reduce the latency or reduce the power consumption at a maintained frame speed.
The majority of the dissipated dynamic power stems from the clock nets.Aggressive parallelization of computation and memory accesses maintaining the clock nets at a lower frequency would appear to be a good strategy with regards to achieving a low dynamic power for DSP systems on an FPGA.Static power can be controlled by selecting an FPGA device which has a size that matches the size of the application.

Figure 4 :
Figure 4: (a) Neighborhood of pixels, (b) bounding Box parameters, and (c) labeling kernel and delay line.

Figure 6 :
Figure 6: Architecture of image feature calculation.

Figure 7 :Figure 8 :
Figure 7: Image feature computation module running in parallel.

Figure 10 :
Figure 10: Percentage LUT utilization versus label code word length for labeling unit.

Figure 11 :
Figure 11: Power consumption versus label code word length for labeling unit.

Figure 12 :
Figure 12: (a), (b), (c), and (d) Simulated power consumption with increasing number of objects and (e) measured power with increasing number of objects.

Figure 13 :
Figure 13: Experimental Setup for verification of architecture.

Figure 14 :
Figure 14: Examples of input stimuli used for functional verification.

Table 1 :
Power and area utilization for whole design.

Table 2 :
Comparison of maximum clock frequency and resource utilization.