Evaluation of convolutional neural networks for the classification of falls from heterogeneous thermal vision sensors

The automatic detection of falls within environments where sensors are deployed has attracted considerable research interest due to the prevalence and impact of falling people, especially the elderly. In this work, we analyze the capabilities of non-invasive thermal vision sensors to detect falls using several architectures of convolutional neural networks. First, we integrate two thermal vision sensors with different capabilities: (1) low resolution with a wide viewing angle and (2) high resolution with a central viewing angle. Second, we include fuzzy representation of thermal information. Third, we enable the generation of a large data set from a set of few images using ad hoc data augmentation, which increases the original data set size, generating new synthetic images. Fourth, we define three types of convolutional neural networks which are adapted for each thermal vision sensor in order to evaluate the impact of the architecture on fall detection performance. The results show encouraging performance in single-occupancy contexts. In multiple occupancy, the low-resolution thermal vision sensor with a wide viewing angle obtains better performance and reduction of learning time, in comparison with the high-resolution thermal vision sensors with a central viewing angle.


Introduction
The most recent studies of the World Bank estimated that the number of elderly people is increasing and is expected to double again worldwide by 2050. As the average age of the population continues to rise, elderly people continue to suffer from chronic conditions such as dementia, hypertension, diabetes, and gait issues. 1,2 Fall detection is a major challenge in the area of public health care, especially for the elderly, and reliable surveillance is a necessity to mitigate the effects of falls. 3 In this context, an alarming 42% of people aged 70 and above are involved in falls annually, with 37.3 million of those requiring medical attention as a result of their severity. 1,4 Accidental falls experienced by elderly people are a prominent cause of hospitalization, death due to the injuries sustained, and reduced independence. Several risk factors exist in relation to falls in older adults, with studies mainly identifying physical frailty, poor balance, unsteady pace, poor muscle strength, and cognitive impairment. The prevalence of falls also escalates as age increases, particularly when combined with other risk factors such as chronic disease, poor sleep patterns, and diminished vision. 5 Ambient assisted living (AAL) is becoming an important consideration to provide assistive technologies aimed at sustaining independence, well-being, and quality of life. 6 There has also been a growing need to promote and support "aging in place" due to demographic issues, increasing health care costs, a shortage of caregivers, and the fundamental fact that a large portion of elderly people prefer to remain independent in their own homes for as long as possible. 7 These issues open up new research avenues for tracking activities related to elderly people's daily routine, specifically with the aim of guaranteeing their safety.
Over the last decade, interest in ubiquitous computing technologies has provided researchers with ample opportunities to design monitoring and intervention systems that could provide continuous 24/7 real-time monitoring in environments with sensors, with the goal of improving the quality of life of elderly people. 8 To evaluate the proposed methodology, a case study on fall detection is presented using data collected by two different thermal vision sensors (TVSs) and multiple convolutional neural networks (CNNs) in two different smart labs: the smart lab of Ulster University 9 and the smart lab of the University of Jaén. 10 Moreover, the data set design for data collection includes single-occupancy as well as multi-occupancy scenes.
The article is structured as follows. The "Related works" section reviews related research in the fields of TVSs and fall detection. The methodology for the evaluation of CNNs to classify the shapes of falls from heterogeneous TVSs is presented in the "Methodology" section. The experimental setup of the case study and a discussion of the results are presented in the "Experimental setup" section. Finally, conclusions and ongoing works are discussed in the "Conclusions and ongoing works" section.

Related works
The automatic detection of falls within AAL scenarios has attracted considerable research interest due to the prevalence and impact of falls in the elderly, making it a crucial research area. 3 Impact-related accidents in indoor environments such as falls and collisions have been identified and studied in an attempt to avoid falls or reduce aid response time. 11 Fall detection approaches in AAL scenarios are divided into two categories: wearable/ambience sensors and vision sensors. 3,12 In approaches based on wearable/ambience sensors, sensors are attached either to an inhabitant under observation (wearable sensors or smartphones) or to objects that make up the environment where the activity takes place (dense sensing). These approaches work with time series of state changes and/or various parameter values that are usually processed through data fusion, probabilistic, or statistical analysis methods and formal knowledge technologies. 13 The main benefit of the wearable or ambience sensor is its cost efficiency. However, two main disadvantages of this kind of sensor are intrusiveness and a fixed relative relation with the object or the inhabitant, from which the sensor can be easily disconnected. Furthermore, installation and setup can be complex. For these reasons, this kind of device is not a very good choice for the elderly. 3 Approaches based on vision sensors exploit computer vision techniques like feature extraction, structural modeling, movement segmentation, action extraction, and movement tracking to analyze visual observations for pattern recognition. 13 In recent years, the number of approaches in this category has increased due to the fact that video cameras are commonly included in the wearable technologies or systems we use daily. 3 Previously, general vision sensors entailed disadvantages concerning privacy and ethics. With the emergence of TVSs, these disadvantages can be mitigated, making TVSs an excellent alternative for solutions aimed at the elderly.
Exploring state-of-the-art fall detection systems, we found recent studies within vision sensor-based approaches. In the proposal presented in Bromiley et al., 14 the image stream from the thermal detector is monitored. To do so, the extracted features include horizontal and vertical gradients, aspect ratio, and the centroid angle to the horizontal axis of the bounding box. Falls were confirmed when the angle reached a value below 45°. A fall detection system was proposed in De Miguel et al. 15 based on a low-cost device comprising an embedded computer, such as a Raspberry Pi, and a camera, obtaining good performance values (i.e. 96% sensitivity), comparable to other systems using more expensive and more powerful hardware. An approach for unobtrusive indoor fall detection by an infrared (IR) thermal array sensor was proposed in Hayashida et al. 16 The main innovation of this proposal was to perform the fall detection within the sensor node by a computationally inexpensive algorithm which notifies the server only when a fall has occurred. A method was proposed in Rougier et al. 17 to detect falls by analyzing human shape deformation during a video sequence. A shape matching technique was used to track the person's silhouette along the video sequence. The shape deformation is quantified from these silhouettes based on shape analysis methods. In Asbjorn and Jim, 18 data collected from a ceiling-mounted 80 × 60 thermal array were combined with an ultrasonic sensor device. This approach monitored activities, recognizing the location and posture of an individual. In Taramasco et al., 19 a non-invasive monitoring system for fall detection in older people was presented, using very low-resolution thermal sensors to classify a fall and then alert the care staff. Furthermore, the authors analyzed the performance of three recurrent neural networks for fall detection: long short-term memory (LSTM), gated recurrent unit, and bi-LSTM.
Finally, a methodology based on CNNs to detect falls from non-invasive TVSs was presented in Medina-Quero et al. 20 with data augmentation techniques. The results show encouraging performance in single-occupancy contexts, with up to 92% accuracy, but a 10% reduction in accuracy in multiple-occupancy contexts.
Another work related to our proposal but without the application to fall detection was presented in Bayareh et al., 21 studying the diabetic foot by means of a Raspberry Pi as an embedded system and the Lepton-Flir Development Kit as an IR sensor. The IR sensor was characterized to measure the superficial temperature of the human skin radiometrically.
Most of the proposed vision-based approaches lack flexibility due to the fact that these approaches are often case-specific, depending on different scenarios and TVSs.
In this article, we present a methodology to analyze the capabilities of non-invasive TVSs 22 to detect falls by means of several architectures of CNNs in different scenarios. We propose the use of the CNNs because they have provided excellent results in multiple areas such as speech recognition, 23 image classification, 24 or gas classification. 25 The learning process with CNNs requires a large amount of data. 26 Therefore, it is necessary to collect multiple images from different inhabitants, orientations, and cases, which takes a great effort. This process could make customization and configuration in different contexts hugely difficult. This disadvantage can be overcome by data augmentation to enlarge the number of learning cases from a limited set 27 and therefore reduce over-fitting. 28 Similar approaches have been proposed in recent works, 29,30 where the selection of images from objects in a small number of human-annotated examples is then projected in the environmental background to provide new synthetic examples, as well as in thermal vision data sets. 20 In our proposal, the two studied TVSs have different capabilities. The first TVS has low resolution with a wide viewing angle and the second one has high resolution and a central viewing angle. Three types of CNN are adapted for each TVS in order to evaluate the impact of the architecture on fall detection performance. Furthermore, a large data set is generated from a set of few images as a data source, by using ad hoc data augmentation, that is, increasing the original data set size by generating new synthetic images.
Finally, we propose to include fuzzy representation of thermal information to compute the fuzzy color of human temperature. 31 Including fuzzy processing of TVS data provides (1) a filter for irrelevant information, (2) reduction of noise from non-feasible values, 32 and (3) scaling and focusing of the relevant data range for the CNN kernels during the learning process. The use of a fuzzy approach has been demonstrated as a successful tool to reduce uncertainty in multiple applications. 33-36

Methodology
In this section, we describe the methodology applied. First, in the "TVSs for analyzing fall detection" section, we describe the TVSs evaluated in this work. Second, in the "Fuzzy representation of thermal information" section, we define a fuzzy representation of thermal information to improve the performance of the fall detection. Third, in the "Data augmentation" section, we detail an ad hoc data augmentation for fall detection in the stage prior to learning. Fourth, in the "Design of the CNN" section, we describe several configurations of CNNs evaluated for each TVS.

TVSs for analyzing fall detection
In this work, we have integrated two TVSs with different capabilities to evaluate their performance in analyzing fall detection:
Low resolution with a wide viewing angle: in this case, we deployed the TVS Heimann HTPA 32 × 31, 37 which provides thermal vision with a 32 × 31 matrix, where each value defines a heat point of temperature. An effective factory calibration is integrated in the device, with no distortion by the fish-eye lens. 38 The data are collected from the TVS by means of a twisted Ethernet cable which is connected to the local area network. The middleware SensorCentral 39 integrates the TVS as a sensor source, providing the thermal sensor data within a Web Service in JSON format.
High resolution with a central viewing angle: in this case, we deployed the Lepton LWIR module included in the FLiR Dev Kit, 40 which provides thermal resolution with an 80 × 60 matrix. In addition, a Raspberry Pi 41 was used in order to collect the information from the TVS 42 in real time.
In a formal definition, each TVS provides a matrix $M_{w,h}$ formed by an array of numbers $m_{i,j}$, each of which represents a heat point of temperature. The dimensions of the matrix are defined by width $w$ and height $h$.
Figure 1 shows the sensors deployed and evaluated in this work.

Fuzzy representation of thermal information
The data collected by the TVS represent the heat temperature in a matrix of points. In order to provide a visual representation, a transformation function to gray scale values is required. In this work, we propose to define a fuzzy set to represent a fuzzy color 43 of human temperature by means of a membership function $\mu_M(m_{i,j})$, which relates the temperature values $m_{i,j}$ to a degree of relevance between 0 and 1. In order to describe the fuzzy set straightforwardly, the shape of the membership function is given by a trapezoidal function which is defined by a lower limit $l_1$, an upper limit $l_4$, a lower support limit $l_2$, and an upper support limit $l_3$ (see TS in the "Abbreviations" in the appendix). Including fuzzy processing of TVS data provides (1) a filter for non-relevant information, (2) the reduction of noise from non-feasible values, 32 and (3) scaling and focusing of the relevant data range for the CNN kernels during the learning process. In Figure 2, we show an example of the application of fuzzy representation.
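The trapezoidal membership function described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names `trapezoid` and `fuzzy_image` are hypothetical, and the thresholds used in the usage note are made-up values rather than sensor calibrations.

```python
def trapezoid(x, l1, l2, l3, l4):
    """Membership degree of x in [0, 1] for limits l1 <= l2 <= l3 <= l4."""
    if l2 <= x <= l3:
        return 1.0                      # plateau: fully relevant value
    if x <= l1 or x >= l4:
        return 0.0                      # outside the limits: filtered out
    if x < l2:
        return (x - l1) / (l2 - l1)     # rising edge between l1 and l2
    return (l4 - x) / (l4 - l3)         # falling edge between l3 and l4


def fuzzy_image(matrix, l1, l2, l3, l4):
    """Apply the membership function to every heat point of a thermal matrix."""
    return [[trapezoid(m, l1, l2, l3, l4) for m in row] for row in matrix]
```

With the hypothetical limits (20, 30, 38, 42), `fuzzy_image([[10, 25], [34, 40]], 20, 30, 38, 42)` maps the background value 10 to 0.0, the plateau value 34 to 1.0, and the two edge values to 0.5, which is the scaled range a CNN would then receive.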

Data augmentation
In this section, we propose the augmentation and enlargement of the image data from the original data set by means of image transformations. Thus, the innovation of our proposal is based on the creation of a new, larger set of synthetic images to train the model. In this work, we have included the following image transformations (translation, rotation, and scale) to augment the original image data set:
Translation: the original image is relocated within a maximal window size $[t_x, t_y]^+$ by using a random process, which generates a random translation $[t_x, t_y]$, with $t_x \in [0, t_x^+]$ and $t_y \in [0, t_y^+]$.
Rotation scale: the rotations are provided by two methods. First, the translated image is flipped horizontally and vertically by using a random process, which applies each transformation to a percentage of cases, defined by $w_H$ and $w_R$, respectively. Second, a rotation and scale transformation is defined by a maximal rotation angle $\alpha^+$ and a scale factor $s^+$, which generates a random rotation with an angle $\alpha \in [0, \alpha^+]$ and a random scale $s \in [1 - s^+, 1 + s^+]$. These transformations are then applied about the center of the image. Because the rotated image can exceed the original image size, a random scale of the image is also applied.
Figure 3 shows examples of the new synthetic images generated to extend the data set.
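The augmentation steps described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, images are lists of rows, uncovered pixels are padded with a `fill` value, and nearest-neighbour sampling is assumed for the rotation and scale step. The default parameters mirror the values reported later for the low-resolution sensor.

```python
import math
import random


def translate(img, tx, ty, fill=0.0):
    """Shift the image tx columns right and ty rows down, padding with fill."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + ty, x + tx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = img[y][x]
    return out


def flip(img, horizontal, vertical):
    """Mirror the columns (horizontal) and/or the rows (vertical)."""
    rows = img[::-1] if vertical else img
    return [row[::-1] if horizontal else list(row) for row in rows]


def rotate_scale(img, angle, scale, fill=0.0):
    """Rotate by angle (radians) and scale about the image center."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: locate the source pixel of each output pixel.
            dx, dy = (x - cx) / scale, (y - cy) / scale
            ix = int(round(cos_a * dx + sin_a * dy + cx))
            iy = int(round(-sin_a * dx + cos_a * dy + cy))
            if 0 <= iy < h and 0 <= ix < w:
                out[y][x] = img[iy][ix]
    return out


def augment(img, t_max=(3, 3), w_h=0.5, w_r=0.5,
            a_max=math.pi / 2, s_max=0.1, rng=random):
    """Generate one synthetic image: translation, random flips, rotation and scale."""
    out = translate(img, rng.randint(0, t_max[0]), rng.randint(0, t_max[1]))
    out = flip(out, rng.random() < w_h, rng.random() < w_r)
    angle = rng.uniform(0.0, a_max)
    scale = rng.uniform(1.0 - s_max, 1.0 + s_max)
    return rotate_scale(out, angle, scale)
```

Calling `augment` repeatedly on each original image, for example with `rng=random.Random(seed)` for reproducibility, yields as many synthetic variants as needed. In practice an array library would replace the explicit loops.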

Design of the CNN
In this section, we describe several CNN architectures to classify the falls sustained by inhabitants. The two TVS devices show wide differences regarding technical characteristics and development purposes. For this reason, they are integrated within systems with different computing performance.
Regarding the low-resolution TVS, in our case a Heimann HTPA, the thermal sensor collects a smaller matrix of heat points, which can be integrated in low-cost boards with low computing performance. For this purpose, three CNN configurations to classify falls with this kind of device are evaluated:
CNN$_2^0$: a CNN with two kernel layers and a configuration optimized for the MNIST data set. 44
CNN$_2^+$: a CNN with two kernel layers and a finer granularity configuration of kernels.
CNN$_3$: a CNN with three kernel layers.
These three CNN configurations have been previously identified as suitable structures for fall detection, 20 and their details are shown in Table 1.
Regarding the high-resolution TVS, in our case an FLiR Dev Kit and a Raspberry Pi, the matrix of heat points is larger, requiring deeper CNN configurations to classify falls. In this work, we propose three CNN configurations:
CNN$_4$: a CNN with four kernel layers and a deeper configuration than the previous ones (see Table 2).
Alex$_5$: based on the configuration of AlexNet, a large, deep CNN with five kernel layers and high performance in image classification. 28
Res + Inc: a deeper CNN with 10 kernel layers which integrates two techniques to reduce the high-dimensional hyper-parameter tuning by means of deeper architectures:
Inception, which includes multiple-sized kernels operating on the same layer. 45 In this work, we integrate 3 × 3 and 1 × 1 convolutions.
Residual, which integrates residual blocks with the same topology, ending with an identity shortcut that connects outputs from lower layers as inputs to upper layers. 46 The residual blocks include 3 × 3 and 1 × 1 convolutions for a given input and output size, which is defined for each layer: res_block([in, out]).
The CNN architectures for the high-resolution TVS are shown in Table 2.
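As a reminder of the two building blocks in their standard formulation (not the exact layer configuration of Table 2), a residual block adds its input $\mathbf{x}$ to the output of its stacked convolutions $\mathcal{F}$ through the identity shortcut, and an Inception-style module concatenates the outputs of parallel kernels of different sizes along the channel axis. The symbols $\mathcal{F}$, $\mathbf{y}$, and $\mathbf{z}$ are introduced here for illustration only.

```latex
\[
  \mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x},
  \qquad
  \mathbf{z} = \operatorname{concat}\bigl(
      \operatorname{conv}_{1 \times 1}(\mathbf{x}),\;
      \operatorname{conv}_{3 \times 3}(\mathbf{x})
  \bigr)
\]
```

The sum in the first expression requires $\mathcal{F}(\mathbf{x})$ and $\mathbf{x}$ to have the same shape. How the shortcut is handled when the input and output sizes of res_block([in, out]) differ is not detailed in this section.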

Experimental setup
In this section, we detail the experimental setup of the case study carried out to evaluate the fall detection methodology using data collected by two different TVSs and multiple CNNs.
The data collection design to detect falls was divided into single-occupancy and multi-occupancy. In single-occupancy, we included three subcategories: (1) empty room, (2) one person standing/walking, and (3) one fallen person. In multi-occupancy, we added two new subcategories: (4) two to three people standing/walking and (5) one fallen person with another person standing/walking. The image data from three participants were collected with the two thermal sensors. While the data were being collected, each person adopted several natural positions to simulate falls, and also walked around the vision area of the TVS to capture walking.

Description of case studies
The first case study was carried out in the Smart Lab of Ulster University 9 (https://www.ulster.ac.uk/research/institutes/computer-science/groups/smart-environments). The experiment was carried out in the hall of the Smart Lab. Three participants (one woman, two men) were involved in collecting data in the hall, using a TVS installed on the ceiling. The participants were 1.72, 1.68, and 1.83 m tall. The field of view of the TVS in the hall was delimited by a square bounding box of 3.5 × 3.5 m (12.25 m²).
The second case study was carried out in the UJAmI smart lab of the CEATIC (Center for Advanced Studies in Information Technology and Communication) of the University of Jaén (Spain) 10 (http://ceatic.ujaen.es/ujami/). The experiment was also carried out in the hall of the Smart Lab; analogously, three participants (one woman, two men) were involved in collecting data in the hall, using a TVS installed on the ceiling. The participants were 1.88, 1.64, and 1.70 m tall. The field of view of the TVS in the hall was delimited by a rectangular bounding box of 2.5 × 2.0 m (5.0 m²).
To evaluate the two data sets, each was divided into 90% for training and 10% for testing using 10-fold cross-validation.
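The 90%/10% split per fold can be sketched as follows. This is a generic 10-fold partition in plain Python with a hypothetical helper name, `k_fold_indices`. The shuffling seed is an assumption, and how the original and synthetic images are assigned to folds is not stated in this section.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each sample appears in exactly one test fold, and the reported accuracy would then be aggregated over the ten train/test rounds.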

Evaluation of low-resolution TVS with wide viewing angle
In this section, we detail the results achieved with the three types of CNNs and the performance of the fuzzy representation of thermal information to detect falls from thermal vision images. From the original data set, we include the following data augmentation steps:
Translation: the original images have been translated within a maximal window size $[t_x, t_y]^+ = [3, 3]$.
Rotation scale: first, each image is flipped horizontally and vertically with random probabilities $w_H = 0.5$ and $w_R = 0.5$, respectively, that is, horizontally in half of the cases, and vertically in the other half. Second, a rotation and scale transformation is defined by a maximal rotation angle $\alpha^+ = \pi/2$ and scale $s^+ = 0.1$. We note that this configuration provides random rotation in all quadrants and angles.
Crop-scale: we compute a final centered image with a window size of 28 pixels, $[s_x, s_y] = [28, 28]$, in order to fit the bounding box of the smart lab for the case scene.
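The final crop step listed above can be sketched as a centered window extraction. This is a minimal sketch, assuming the 32 × 31 sensor frame is stored as 31 rows of 32 values and that the crop is a plain centered window; any rescaling implied by the name "crop-scale" is not shown, and `center_crop` is a hypothetical helper name.

```python
def center_crop(img, size_x=28, size_y=28):
    """Return the centered size_y x size_x window of a list-of-rows image."""
    h, w = len(img), len(img[0])
    if size_y > h or size_x > w:
        raise ValueError("crop window is larger than the image")
    top = (h - size_y) // 2     # rows discarded above the window
    left = (w - size_x) // 2    # columns discarded left of the window
    return [row[left:left + size_x] for row in img[top:top + size_y]]
```

For a 31-row by 32-column frame, this drops one row at the top, two at the bottom, and two columns on each side, leaving the 28 × 28 input used by the CNN configurations for the low-resolution sensor.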
Evaluation of the best CNN configuration. In this section, we present the results from the low-resolution, wide viewing angle TVS, which was evaluated previously in Medina-Quero et al. 20 to detect the best CNN configuration. In Table 3, we include the data for the single- and multi-occupancy data sets. CNN$_3$ provides the best results in classifying falls, with up to 91% accuracy in single-occupancy contexts and a 6% reduction in accuracy for multi-occupancy.
Evaluation of fuzzy representation of thermal information. In this section, we evaluate performance when applying fuzzy representation to the raw data of the matrix of heat points. To define the fuzzy set which represents human temperature, we have included the following trapezoidal membership function (TR is described in the "Abbreviations" in the appendix): $\mu_M = TR([l_1, l_2])$, where $l_1 = 219$ and $l_2 = 252$ correspond to the average temperature collected by the TVS from background and human presence, respectively. These parameters can be straightforwardly computed from a few samples in the tuning stage of the system.
In order to provide a symmetrical evaluation, both with fuzzy representation and raw data, a new augmented data set has been computed and the performance of the best configuration, CNN$_3$, has been analyzed for both cases on the same augmented data. In Table 4, we show the results of the single- and multi-occupancy data with raw and fuzzy representation, including the evolution of accuracy while learning in Figure 4. In Figure 5, we also include a confusion matrix for the best model in single-occupancy contexts.
Table 4. Results of single- and multi-occupancy data with raw and fuzzy representation for the best configuration CNN$_3$.

Evaluation of high-resolution TVS with central viewing angle
In this section, we detail the results of the three types of CNNs to detect falls from the high-resolution, central viewing angle TVS. From the original data set, we include the following data augmentation and fuzzy steps, applied to the data prior to learning:
Translation: the original image is translated within a maximal window size $[t_x, t_y]^+ = [7, 7]$.
Rotation scale: first, each image is flipped horizontally and vertically with random probabilities $w_H = 0.5$ and $w_R = 0.5$, respectively, that is, horizontally in half of the cases, and vertically in the other half. Second, a rotation and scale transformation is defined by a maximal rotation angle $\alpha^+ = \pi/2$ and scale $s^+ = 0.3$.
Fuzzy configuration: $\mu_M = TR([l_1, l_2])$, where $l_1 = 8150$ and $l_2 = 8405$, which is provided as a suitable device configuration in previous works. 42
In Table 5, we include the data for the single- and multi-occupancy data sets for each CNN configuration proposed in this work. In addition, the evolution of accuracy while learning is shown in Figure 6. In Figure 7, we also include a confusion matrix for the best model in multi-occupancy contexts.
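The accuracy values and confusion matrices reported in these evaluations can be derived from model predictions as sketched below. This is a generic computation in plain Python with hypothetical function names. The integer label encoding (for example 0 for empty room, 1 for standing/walking, 2 for fallen) is an assumption made for illustration.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count predictions: rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0
```

Off-diagonal cells show which categories are confused with each other, for instance a fallen person predicted as standing/walking, which the overall accuracy alone does not reveal.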

Discussion
In this work, two TVS devices, with (1) low resolution and a wide viewing angle and (2) high resolution and a central viewing angle, together with data processing stages and different CNN architectures, are proposed to classify human falls in single- and multi-occupancy contexts.
First, high performance is obtained in single-occupancy scenarios, achieving over 90% accuracy for both devices. For the low-resolution, wide viewing angle TVS, we evaluate the impact of including fuzzy representation of thermal information against previous results. Fuzzy representation notably increases learning speed and accuracy: with CNN$_3$, accuracy increases by 5%, reaching 97.2%, in single-occupancy contexts and by more than 7.5%, reaching 94.3%, in multi-occupancy contexts. This highlights the value of pre-processing the thermal data to improve both the performance and the learning time of CNN models.
Furthermore, despite the capabilities of CNNs to extract visual features, the initial processing of information, such as fuzzy representation, is key to obtaining encouraging results. In the case of the high-resolution TVS, different CNN architectures have been evaluated, obtaining the best performance with the configuration Alex$_5$, based on AlexNet, 28 with 93.8% accuracy. We also note learning time is up to 10 times longer in Alex$_5$ than in CNN due to the differences in the size of the matrix data (from 28 × 28 to 60 × 80).
Second, notable performance is obtained in multi-occupancy; the results show a difference in accuracy of 2.3% and 7.7% between the best and second-best models in single- and multi-occupancy, respectively, for the high-resolution TVS, but a wide difference in performance is noted between the wide viewing angle and the central viewing angle TVS. For the low-resolution TVS with a wide viewing angle, the best performance is achieved using CNN$_3$ with an accuracy of 94.3%. For the high-resolution device, Alex$_5$ provides the best result but with an unremarkable accuracy of 77.8%, derived from the conflicting images collected in a very limited space. Regarding the performance difference, we also note the reduction of learning steps in high-resolution approaches due to the increase in learning time.
In this sense, a larger data set and a longer learning time could improve these approaches, but this is outside the aim of this work, where straightforward methods for agile deployment are proposed. It is noteworthy that one of the key reasons for this low performance derives from differences in the vision area between the two devices (12.25 and 5.0 m², respectively). The conflicting images we collected of standing and fallen people in the multi-occupancy context represent a greater visual challenge in limited spaces.

Conclusions and ongoing works
In this work, we have evaluated two TVSs with different capabilities, located on the ceiling of a smart environment, to classify the shapes of falls. Two case studies, in the Smart Lab of Ulster University (UK) and in the Smart Lab of the University of Jaén (Spain), are examined. Several CNN configurations are evaluated for each TVS. A low-resolution TVS with a wide viewing angle using fuzzy representation of thermal information provides outstanding performance in single- and multi-occupancy contexts.
In future works, we will analyze the impact of temporal sequences in dynamic data sets with fall detection in natural conditions using Deep Learning approaches on temporal models, such as LSTMs.