Toward navigation ability for autonomous mobile robots with learning from demonstration paradigm: A view of hierarchical temporal memory

Learning from demonstration, an important component of imitation learning, is a paradigm by which robots learn new tasks. Applied to navigation, it lets the robot acquire a navigation task from a human teacher’s demonstration. Based on research on the human neocortex, in this article we present a learning from demonstration navigation paradigm from the perspective of hierarchical temporal memory theory. As a form of end-to-end learning, the demonstrated relationship between perception data and motion commands is learned and predicted by using hierarchical temporal memory. The framework first perceives images to obtain the corresponding category information; the categories, together with depth and motion command data, are then encoded as a sequence of sparse distributed representation vectors. The sequential vectors are the inputs used to train the navigation hierarchical temporal memory. After training, the navigation hierarchical temporal memory stores the transitions of the perceived images, depth, and motion data so that future motion commands can be predicted. The performance of the proposed navigation strategy is evaluated in real experiments and on public data sets.


Introduction
Learning from demonstration (LfD), an important topic in imitation learning, 1 is a paradigm by which a robot learns new tasks. It is inspired by the fact that human beings learn new skills and gain experience under the guidance of human experts. In contrast to the traditional scenario, LfD does not require analytically programming a detailed behavior; instead, it allows users to "teach" the robot how to perform new tasks by showing them. By observing more demonstrations and repetitions, the robot acquires the means to perform new skills. Consider how LfD applies to navigation: a person visiting an unknown place for the first time follows a tour guide from the starting position to the destination. Once the person remembers the path that the guide showed, he or she has learned how to reach the destination in that place. Just as a human learns navigation behavior in this way, a robot can acquire a navigation task from a human teacher's demonstration. This natural way of communicating between the human teacher and the robot learner removes the complex coupling of perception and planning in the navigation process, and therefore LfD for autonomous navigation has become an attractive topic in robotics.

Related work
Comprehensive surveys of LfD are given by Argall et al. 2 and Billard et al. 3 These works note, respectively, that LfD can follow machine learning and computational neuroscience approaches. Nehaniv and Dautenhahn 4 analyzed four key issues of LfD, of which "What to imitate" and "How to imitate" are, in our opinion, the two most important for the navigation task.
Learning the relationship between perceptual information and actions is dominant in the literature. We call this end-to-end learning. Previous works differ in how they represent this relationship.
The first paradigm learns the mapping between perceptual data and action commands [5][6][7][8] directly. To learn this mapping, De Rengervé et al. 5 used an artificial neural network to recognize places from the panorama. The recognized places, combined with odometry and compass data, are used to learn the motion commands with Gaussian mixture models. Similarly, Choi et al. 7 leveraged Gaussian process regression, another statistical technique, to obtain the navigation policy from sequences of sensor data and action pairs. This method also accepts demonstrations from casual or novice users, not only from experts. The associations between percepts and actions can be described by a set of fuzzy rules, 6 and the predictive sequence learning (PSL) algorithm 6 is used to learn these associations and to predict expected sensor events in response to executed control commands. In addition, with PSL and simulation theory, the robot can generate the experience of novel sequences of events according to the learned relationships. 8 [...] [9][10][11][12] Learning this cost function is implemented by the LEArning to seaRCH algorithm, [9][10][11] which is a suitable technique for imitating a nonlinear cost function, and by the Optimal Rapidly-exploring Random Trees planner. 12 Suleman and Awais 13 proposed finding a translatable map function between teachers' and learners' actions by means of shared circuits model theory, a comprehensive and multidisciplinary theory that explains imitation and other related social functions. Konidaris et al. 14 described a value function as the cost linking the trajectory segments/chains and the sequential motion commands, and applied the constructing skill tree algorithm, which combines the advantages of hierarchical reinforcement learning and a statistical change-point detection algorithm, to learn this value function.

Why hierarchical temporal memory
As futurist Ray Kurzweil described in his book, 15 the neocortex contains a hierarchy of pattern recognition circuits that are responsible for most aspects of human thought. He also explains that a design for a digital neocortex, if one existed, could be used to create the same capabilities as the human brain. Hierarchical temporal memory (HTM) theory, 16 first proposed by Hawkins, 17 is one implementation of Kurzweil's view of a digital neocortex. It attempts to model the brain at a functional level rather than at a neuronal or molecular level. HTM is a bioinspired model that captures the predominant characteristics of the neocortex. It mimics the neocortex's abilities of learning, inference, and prediction from sequential input patterns represented in sparse distributed form and can therefore describe a complex model of the world. Additionally, HTM uses sparse distributed representations (SDRs) to represent complex input data, which lends HTM considerable flexibility. This is similar to the idea that the brain is a recursive probabilistic fractal whose line of code is represented within the 30-100 million bytes of compressed code in the genome. 15
The core of Kurzweil's book is the pattern recognition theory of mind. Its main idea is that the hierarchical structure acts as a pattern recognizer and serves not just for sensing the world but for nearly all aspects of thought. [...][20] The reasons stated above indicate that HTM is a promising approach for implementing an LfD-based navigation task. Therefore, in this study, we designed an LfD navigation paradigm from the viewpoint of HTM. It is also a form of end-to-end learning. The relationship between perception data and motion commands is learned and predicted by using HTM. The framework first perceives images to obtain the corresponding category information; the categories, together with depth and motion command data, are then encoded as a sequence of SDR vectors. The sequential vectors are the inputs used to train the navigation HTM (Nav-HTM). After training, the Nav-HTM stores the transitions of the perceived images, depth, and motion data so that future motion commands can be predicted. The performance of the proposed navigation strategy is evaluated in real experiments and on public data sets. The aim of this work is not to appraise the literature above but to provide a promising solution from the viewpoint of mimicking the capabilities of the neocortex.

Materials and methods
As a memory system, HTM is essentially a type of neural network. It first models the cells, interconnects and arranges the cells in columns, organizes the columns in a two-dimensional (2-D) array to constitute the HTM region, and finally establishes a hierarchical neural network, as shown in Figure 1. The network learns from time-varying inputs. These inputs have the format of an SDR, which is either transformed from the environmental sensory data by an encoder or received from the outputs of the lower-level region. The HTM network is trained by a simple learning algorithm, namely, the cortical learning algorithm (CLA). It learns and stores sets of distributed input pattern sequences (including sensory or sensory-motor patterns) and their transitions in the hierarchical organization through spatial and temporal pooling. With the remembered sequences and transitions, the HTM network performs inference (i.e. recognition) and prediction for newly arriving inputs. The proposed HTM-based LfD navigation system follows the HTM workflow and is illustrated in Figure 2. A detailed explanation of HTM and SDR and their properties can be found in the technical reports. 16 We describe the contents that are crucial to our application in the following section.

HTM network model
The HTM network is composed of numerous interconnected HTM cells, which are organized in columns. HTM cells capture the most important capabilities of biological neurons and, as shown in Figure 3, have more complex structures than conventional artificial neurons.
A typical HTM cell has three output states: the active state, activated from feed-forward input; the predictive state, activated from lateral input; and the inactive state. The HTM cells in one column share a single proximal dendrite segment (closest to the cell body), and each cell has a list of distal dendrite segments (farther from the cell body). The proximal dendrite segment receives all feed-forward inputs, including the environmental sensory data and the outputs of the lower-level region, via active synapses marked by green dots. These active synapses have a linear additive effect at the cell body. Distal dendrite segments receive the lateral inputs from nearby cells through active synapses marked by blue dots. Figure 3 shows that each distal dendrite segment is a threshold detector. A segment is activated if the number of active synapses on it is above a threshold $Th_{seg}$. An OR operation over all active distal dendrite segments puts the associated cell into the predictive state. Synapses of the HTM cells have binary weights and are formed from a set of potential synapses, which are axons that are sufficiently close to a dendrite segment and may become synapses. For the proximal dendrite, the potential synapses connect to a subset of all inputs to a region; for the distal dendrites, the potential synapses come predominantly from nearby cells in the region. Each potential synapse is assigned a scalar value ranging from 0 to 1. This scalar value is called the permanence, and it represents the closeness or degree of connection between an axon and a dendrite segment. A larger permanence yields a stronger connection. If the permanence is above a threshold $Th_{per}$, the potential synapse becomes a valid synapse, and the weight of this valid synapse is set to 1. The cell body receives the inputs of synapses from the proximal and distal segments and provides two outputs along the axon: one is the active state, which is sent horizontally to adjacent cells, and the other is the OR of the active and predictive states, which is sent to the cells of the next region.
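The threshold and OR behavior of the distal segments described above can be sketched as follows. This is a minimal illustration rather than the NuPIC implementation: the function names, the data layout (a segment as a list of `(presynaptic_cell, permanence)` pairs), and the numeric values chosen for $Th_{per}$ and $Th_{seg}$ are all assumptions made for the example.

```python
# Hypothetical threshold values; the article names the thresholds
# Th_per and Th_seg but the numbers below are illustrative only.
TH_PER = 0.5  # a potential synapse is valid when its permanence exceeds this
TH_SEG = 2    # a segment fires when more than this many valid synapses are active


def segment_active(segment, active_cells):
    """segment: list of (presynaptic_cell_id, permanence) pairs."""
    # Valid synapses have binary weight 1; invalid ones contribute nothing.
    count = sum(1 for cell, perm in segment
                if perm > TH_PER and cell in active_cells)
    return count > TH_SEG


def cell_predictive(distal_segments, active_cells):
    """A cell is predictive if ANY of its distal segments is active (OR)."""
    return any(segment_active(seg, active_cells) for seg in distal_segments)


seg_a = [(1, 0.9), (2, 0.8), (3, 0.7), (4, 0.2)]  # synapse from cell 4 is not valid
seg_b = [(5, 0.9), (6, 0.9)]
print(cell_predictive([seg_b, seg_a], {1, 2, 3, 4}))  # True: seg_a has 3 active valid synapses
```

Because the weights are binary, only the count of active valid synapses matters; the permanence values influence the result solely through the validity threshold.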
Because perception and action are integrated in the HTM network, the distal dendritic input can also be an external input. That is, lateral connections between cells will typically be turned off in sensorimotor inference.

Sparse distributed representation
SDR is an efficient way of organizing information in the HTM. "Sparse" indicates that only a small percentage of the large population of interconnected cells is active at one time. "Distributed" indicates that the active cells are spread out across the region and are involved in representing the activity of the region. 16 In HTM, the binary SDR produced by an encoder is used because the binary representation is more biologically plausible and highly computationally efficient. Although the number of possible inputs is greater than the number of possible representations, the binary SDR does not cause a practical loss of information because the SDR has the following crucial properties.
Semantic overlap: Each cell can be thought of as capturing some "feature" in the inputs; therefore, every active cell in an SDR has a semantic meaning assigned from the structure in the inputs. Different active cells at different columns in a region can produce exponential combinations of representation for the various inputs, even if any two inputs look similar. SDR possesses the property of mapping similar inputs to similar representations, which can be identified by comparing the overlap of bits with

$$\mathrm{overlap}(x, y) \equiv \lVert x \wedge y \rVert \tag{1}$$

where $x$ and $y$ are binary SDRs of input vectors or the stored vectors in a region; $\lVert \cdot \rVert$ is the vector length operator, which is simply the total number of "1" bits; and $\wedge$ denotes the bitwise AND operator.
Union: Given a set of SDRs, they can be reliably stored in a single fixed representation by the OR operation following equation (2). This is important for HTM, as it holds a dynamic set of elements and underlies the prediction process in the temporal pooling. As such, a fixed set of cells and connections can operate on a dynamic list, and the union is also used to represent invariance or to check a given prediction by searching the union containing its SDR.
In equation (2), $\vee$ is the bit-wise OR operator.
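The two SDR properties can be illustrated with a short sketch. It assumes SDRs are stored as equal-length Python lists of 0/1 integers; the function names are hypothetical and are not taken from NuPIC.

```python
def overlap(x, y):
    """Number of positions where both binary SDRs hold a 1, as in equation (1)."""
    if len(x) != len(y):
        raise ValueError("SDRs must have the same length")
    return sum(a & b for a, b in zip(x, y))


def union(sdrs):
    """Bit-wise OR of a non-empty collection of equal-length binary SDRs."""
    result = [0] * len(sdrs[0])
    for sdr in sdrs:
        result = [a | b for a, b in zip(result, sdr)]
    return result


def contained_in_union(x, u):
    """True if every 1 bit of x is also set in the union u."""
    return overlap(x, u) == sum(x)


x = [1, 0, 1, 1, 0, 0]
y = [1, 1, 0, 1, 0, 0]
z = [0, 0, 0, 0, 1, 1]
u = union([x, y])
print(overlap(x, y))             # 2
print(u)                         # [1, 1, 1, 1, 0, 0]
print(contained_in_union(z, u))  # False
```

A containment check against a union can report a match for a vector that was never stored once many SDRs have been merged, which is why sparsity of the individual SDRs matters in practice.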

CLA dynamic process
The CLA is a mechanism for explaining the operation of a single region of the neocortex. It has a simple framework and mathematical description. The HTM uses the CLA dynamic process to learn the spatial and temporal variability commonly occurring in sequential input data and then to make predictions. The typical CLA is composed of two subprocesses: spatial and temporal pooling. They are explained in detail in the following subsections.
Spatial pooling. The essential function of spatial pooling is to form an SDR of the inputs. When an input appears on a region, each bit in the input signal is assigned only to a subset of columns. The number of columns is determined by $p_{pot}$, which is the percentage of inputs that a column can be connected to within the column's potential radius $r_{pot}$.
The potential synapses associated with the proximal dendrites on these columns are activated when their permanence values are above a threshold $Th_{syn\_per}$. The number of active synapses is multiplied by a boost factor (bf), which is dynamically determined by how often a column is active relative to its neighbors. This is the overlap phase, as shown in equation (3), where $x_{in}^{t}$ is the input SDR vector at time $t$, $sdr_c$ is the stored SDR in column $c$, $bf_c$ is the boost factor for column $c$, and $ol_{min}$ is the minimum overlap.
The columns with the highest activations after boosting disable a fixed percentage of the columns within an inhibition radius. The result of the inhibition is a sparse set of active columns, which are the inputs of the temporal pooling subprocess in the same region. The inhibition process is given by equation (4), where $C_{act}(t)$ is the set of active column indices at time $t$ and $LA_{min}$ is the minimal number of winning columns.
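The overlap and inhibition phases can be sketched as below. This is a simplified illustration under stated assumptions, not the article's equations (3) and (4): inhibition is applied globally rather than within an inhibition radius, overlaps below the minimum are set to zero before boosting, and all function and parameter names are hypothetical.

```python
def column_overlaps(x_in, connected, boost, ol_min):
    """Boosted overlap score of every column.

    x_in:      list of 0/1 input bits.
    connected: per column, the set of input indices with valid synapses.
    boost:     per-column boost factors.
    ol_min:    minimum raw overlap; smaller overlaps are treated as zero.
    """
    scores = []
    for c, inputs in enumerate(connected):
        raw = sum(x_in[i] for i in inputs)
        scores.append(raw * boost[c] if raw >= ol_min else 0.0)
    return scores


def inhibit(scores, n_winners):
    """Global inhibition (an assumption): keep the n_winners best non-zero columns."""
    ranked = sorted(range(len(scores)), key=lambda c: (-scores[c], c))
    return sorted(c for c in ranked[:n_winners] if scores[c] > 0)


x_in = [1, 1, 0, 1, 0, 0, 1, 0]
connected = [{0, 1, 3}, {2, 4, 5}, {0, 6}, {1, 2, 3, 6}]
boost = [1.0, 1.0, 2.0, 1.0]
scores = column_overlaps(x_in, connected, boost, ol_min=2)
print(scores)              # [3.0, 0.0, 4.0, 3.0]
print(inhibit(scores, 2))  # [0, 2]
```

In the example, column 2 has the smallest raw overlap of the three responding columns but wins because of its boost factor, which is the effect boosting is meant to have on rarely active columns.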
A Hebbian-like learning procedure is implemented for each of the active columns. Permanence values of synapses aligned with active input bits are increased, and those aligned with inactive input bits are decreased, as represented in equation (5), where $ps_j$ denotes the $j$th potential synapse in active column $c$, and its permanence value is denoted by $pm_c^{ps_j}$. $pm_{syn\_inc}$ and $pm_{syn\_dec}$ are the increment and decrement permanence values, respectively. The changes in permanence values make some synapses become valid or invalid accordingly. Simultaneously, the bf and the inhibition radius are both updated according to equation (6), where $ADC_{avg}$ (active duty cycle) is a sliding average that represents how often column $c$ has been active after inhibition, for example, over the last 500 iterations. $ADC_{min}$ represents the minimum desired firing rate for column $c$. $f_{bf}$ is the update function, which linearly interpolates the bf between the points $(0, bf_{max})$ and $(ADC_{min}, 1)$, as shown in Figure 4. In general, the bfs for all columns are updated simultaneously. To update the inhibition radius, the number of inputs to which a column is connected (denoted by $CS_{avg}$) is first determined, and this number is then multiplied by the total number of columns that exist for each input (denoted by $PI_{col}$). For multiple dimensions, the aforementioned calculations are averaged over all dimensions of inputs and columns.
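A minimal sketch of the permanence update and the boost update function is given below. The clamping of permanences to the range 0 to 1 follows the definition of permanence given earlier; holding the boost factor at 1 once the duty cycle reaches the minimum is an assumption, as are all names and numeric values.

```python
def update_permanences(perms, x_in, inc, dec):
    """Hebbian-like update for one active column.

    perms: dict mapping input index -> permanence of that potential synapse.
    Synapses on active input bits are strengthened, the others weakened.
    """
    for i in perms:
        delta = inc if x_in[i] else -dec
        perms[i] = min(1.0, max(0.0, perms[i] + delta))
    return perms


def boost_factor(adc_avg, adc_min, bf_max):
    """Linear interpolation between (0, bf_max) and (adc_min, 1).

    Assumption: columns at or above the minimum duty cycle are not boosted.
    """
    if adc_min <= 0 or adc_avg >= adc_min:
        return 1.0
    return bf_max - (bf_max - 1.0) * adc_avg / adc_min


perms = update_permanences({0: 0.5, 1: 0.5, 2: 0.02}, [1, 0, 0], inc=0.1, dec=0.05)
print(perms)
print(boost_factor(0.0, 0.1, 10.0))  # 10.0
print(boost_factor(0.2, 0.1, 10.0))  # 1.0
```

In the example, the synapse on input 2 is clamped at 0 rather than becoming negative, and a column that has never been active receives the maximum boost.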
Temporal pooling. The key to CLA is the ability to learn and predict how the patterns in the world change over time and how these changes have a sequential structure that reflects transitions of the real world. Temporal pooling is more complex than spatial pooling because it combines the learning and inference procedures. It consists of three phases, and its inputs are the $C_{act}(t)$ obtained from the spatial pooling dynamic.
Phase 1: Determining the active state of cells. For each active column obtained in spatial pooling, the cells that were put into a predictive state at the previous time are activated (see equation (7)). Simultaneously, the distal dendrite segment on each of these cells is marked as active when the number of synapses is over a threshold $Th_{act}$. The learning cells are chosen by equation (10). Additionally, if a segment was activated from the learning cells during the previous time, the cell to which this segment connects is set as the learning cell (see equation (8)).
If no cell is in a predictive state, all of the cells in the column are activated, as defined in equation (9). In this case, the segment that has the largest number of active synapses is found in column $c$ of cell $i$ at time $t-1$, and the cell to which this segment connects is chosen as the learning cell. If no cell has such a segment, we select the cell that has the fewest segments as the learning cell (see equation (10)). In phase 1, the resulting set of active cells represents the current input in the context of prior inputs.
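Phase 1 can be sketched as follows. The sketch covers only the activation rule (predicted cells fire, otherwise the whole column fires) and the fallback choice of a learning cell by fewest segments; it is not a transcription of equations (7) to (10), and the function names and the representation of cells as `(column, cell)` pairs are assumptions.

```python
def active_cells_phase1(active_columns, prev_predictive, cells_per_column):
    """Return the set of (column, cell) pairs that become active.

    prev_predictive: set of (column, cell) pairs that were predictive at t-1.
    """
    active = set()
    for c in active_columns:
        predicted = {(c, i) for i in range(cells_per_column)
                     if (c, i) in prev_predictive}
        if predicted:
            # The input was anticipated: only the predicted cells fire.
            active |= predicted
        else:
            # Unanticipated input: every cell in the column fires.
            active |= {(c, i) for i in range(cells_per_column)}
    return active


def fallback_learning_cell(segment_counts):
    """Index of the cell with the fewest distal segments (lowest index on ties)."""
    return min(range(len(segment_counts)), key=lambda i: (segment_counts[i], i))


cells = active_cells_phase1([0, 2], {(0, 1), (1, 0)}, cells_per_column=3)
print(sorted(cells))                      # [(0, 1), (2, 0), (2, 1), (2, 2)]
print(fallback_learning_cell([2, 0, 1]))  # 1
```

Column 0 was predicted, so a single cell represents the input in its learned context; column 2 was not, so all of its cells become active.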
For the perception-action integration case, there is an optional "Learn-On-One-Cell (LOOC)" 21 hysteresis mode. This mode is switched on in the following situation: when a column is not predicted but is activated by the sensory input, cells that were previously selected as the learning cell still act as the learning cell at the current time. If no such cell exists, the learning cell is again determined by equation (10). If the LOOC mode is triggered, a copy of the motor signal is added to the input of the distal dendrites. [...] On column $c$ of cell $i$, the current active segment is added to the update list $SU_c^i(t)$, which will be used in phase 3. To extend the prediction back in time, another distal dendrite segment, the one that had the largest number of active synapses at the previous time, is also added to the update list.
Phase 3: Updating synapses. Similar to the synapse updates of the proximal dendrite in the spatial pooling dynamic, whenever a distal dendrite segment becomes active, the permanence values of its associated potential synapses are modified by the Hebbian rule, but only if the cell correctly predicted the feed-forward input. Thus, the synapse permanence values for the segments in the update list are reinforced positively or negatively. Finally, a vector representing the OR of the active and predictive states of all cells in a region becomes the input to the next region in the hierarchy. With the prediction, the HTM network can estimate approximately when the inputs will likely arrive next as well as invoke and separate the motor information.

Results
To examine the performance of the HTM-based navigation strategy, we designed two experiments using the TurtleBot 2 mobile robot in a typical indoor environment of our department. One is a simple navigation task in a typical indoor office environment. The robot carried two motion sensors, odometry and a gyro, and moved at translational and rotational speeds of 200 cm s⁻¹ and 20° s⁻¹, respectively. The perceptual image data were acquired from a Kinect RGB-D camera mounted on top of the robot. In these two experiments, we stored RGB images of size 640 × 480 every second. To make the computation efficient, the depth information within a region of interest (ROI) was extracted. The ROIs were selected as individual 64 × 48 rectangles around the image center. Simultaneously, the motion data, including the translational and rotational speeds, were collected from the onboard motion sensors. The RGB-D and motion information were combined for HTM network training and prediction. The other experiment uses a public data set of outdoor environments to further evaluate our proposed navigation method.
The HTM was designed based on the open-source project NuPIC (available at https://github.com/numenta/nupic), and its settings were identical for both experiments. The network has a hybrid structure. As shown in Figure 5, the image data were first processed by a separate vision HTM (VHTM) network, which is an earlier version of the HTM implementation, and its output, combined with the depth and motion data, was encoded and sent to the upper one-region Nav-HTM network for motion prediction. We treated the VHTM as a recognition system and set it up as a four-region network. Each region has the form of a 2-D cell matrix. The input region has 640 × 480 cells, which is equal to the image size; region 1 is an 80 × 80 cell matrix, region 2 is 10 × 10, region 3 is 2 × 2, and region 4 has only one cell, which is also the output cell for the recognized category. For the Nav-HTM, the number of cells in each column was set to 32, and the number of columns was set to 2048 (arranged as 64 × 32 in a 2-D plane). This configuration maintains the diversity of SDR inputs and a low probability of a false match between any two SDR inputs. We applied a scalar encoder 16 to organize the motion data as two 256-bit one-dimensional (1-D) SDR vectors and a custom encoder to represent the depth data as an 8-bit 1-D SDR vector. For the output of the VHTM, we also used a scalar encoder to encode the image category as a 16-bit 1-D SDR vector. As shown in equation (13), all encoded 1-D SDR vectors were integrated into a 1024-bit binary string, where the image category and depth bits constitute the perception bits and the wheel velocities are the motion bits. This binary string is sent to the Nav-HTM network for training and prediction. We set three valid bits out of 16 for the scalar encoder of the image category. The position of the 1s represents the category to which the input image belongs. For example, the encoded SDR 0111000000000000 indicates that the input image is in category 1, whereas 0011100000000000 indicates that it is in category 2. The length of the image category bits is designed for our evaluation cases, and it can be tuned according to different experimental settings. The custom encoding mechanism for the depth bits is determined by the minimal distance extracted within the ROI. If the minimal distance is less than a threshold, that is, 40 cm in our experiments, the least significant bit of the depth bits is set to 1.
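The category and depth encodings can be sketched as below. The rule that category k places its three active bits at positions k to k+2 is inferred from the two examples in the text, and treating the rightmost bit as the least significant depth bit is an assumption; the function names are hypothetical.

```python
def encode_category(category, n_bits=16, w=3):
    """Encode an image category as n_bits with w consecutive 1s.

    Assumption inferred from the two examples in the text: the run of 1s
    starts at bit index `category` (counting from 0 on the left).
    """
    if not 0 <= category <= n_bits - w:
        raise ValueError("category does not fit in the encoding")
    bits = ["0"] * n_bits
    for i in range(category, category + w):
        bits[i] = "1"
    return "".join(bits)


def encode_depth(min_distance_cm, threshold_cm=40.0, n_bits=8):
    """Set the least significant bit when an obstacle is closer than the threshold.

    Assumption: the least significant bit is the rightmost one.
    """
    bits = ["0"] * n_bits
    if min_distance_cm < threshold_cm:
        bits[-1] = "1"
    return "".join(bits)


print(encode_category(1))   # 0111000000000000
print(encode_category(2))   # 0011100000000000
print(encode_depth(35.0))   # 00000001
print(encode_depth(120.0))  # 00000000
```

Adjacent categories share two of their three active bits under this scheme, so the Nav-HTM would see neighboring category codes as similar inputs.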

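[Equation (13): layout of the 1024-bit input string, made up of the perception bits (image category and depth), the motion bits, and the reserved bits.]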
The motion bits consist of two groups of speeds on the wheels, one for translational speed and the other for rotational speed. Because the maximal translational and rotational velocities of the TurtleBot 2 are 70 cm s⁻¹ and 110° s⁻¹, respectively, we set the velocity range for both translation and rotation to [−50, 50] (cm s⁻¹ or ° s⁻¹) based on practical considerations, where "−" indicates the negative direction. In our experiments, we defined forward movement and leftward turning as positive for translational and rotational velocities, respectively. Twenty-one bits in each 256 bits of the action encoder are set as valid bits, and 20 cm s⁻¹ and 20° s⁻¹ are both encoded as the same representation. The reserved bits are intended for additional sensor information, such as an accelerometer. The CLA dynamic process parameters described in the previous section are listed in Table 1.
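The velocity encoding and the assembly of the 1024-bit input can be sketched as follows. The bucket arithmetic imitates the usual sliding-window scalar encoding but is not copied from NuPIC, and the field order inside the 1024-bit string is an assumption, since the layout of equation (13) is not reproduced here. All names are hypothetical.

```python
def encode_scalar(value, v_min=-50.0, v_max=50.0, n_bits=256, w=21):
    """Sliding-window scalar encoding: w consecutive 1s inside n_bits."""
    value = min(v_max, max(v_min, value))   # clip to the encoder range
    n_positions = n_bits - w + 1            # possible start positions
    start = int(round((value - v_min) / (v_max - v_min) * (n_positions - 1)))
    return "0" * start + "1" * w + "0" * (n_bits - start - w)


def assemble_input(category_bits, depth_bits, trans_bits, rot_bits, total=1024):
    """Concatenate the fields and pad with reserved zero bits (order assumed)."""
    payload = category_bits + depth_bits + trans_bits + rot_bits
    if len(payload) > total:
        raise ValueError("fields do not fit in the input string")
    return payload + "0" * (total - len(payload))


trans = encode_scalar(20.0)  # 20 cm/s
rot = encode_scalar(20.0)    # 20 deg/s: same range, hence the same code
sdr = assemble_input("0111000000000000", "00000001", trans, rot)
print(len(sdr), sdr.count("1"))  # 1024 46
```

With the sizes given in the text, the four fields occupy 16 + 8 + 256 + 256 = 536 bits, which leaves 488 reserved bits in the 1024-bit string.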

The case study on "Department hallway dataset"
In this experiment, a human teleoperated the robot in the corridor with a joystick for the demonstration. The robot started moving beside a door and stopped in front of a cabinet in an office room. During the navigation, the robot met three typical objects: an open door, a closed door, and a chair (see Figure 6). We designed a set of simple action strategies: the robot goes through the open door, stops 40 cm in front of the closed door, and turns left at a distance of 40 cm from the chair. The hand-measured environment map is shown in the accompanying figure, where the predefined navigation route is marked by the arrow lines and the robot and several captured environment scenes are also displayed.
Five sets of data were recorded in two separate demonstration runs. Each included 140 RGB-D images and motion data. We used the first 140 captured demonstration samples to train the vision and Nav-HTM networks. After the training, the remaining groups of data were sent to the trained networks for offline evaluation. Offline validation is a batch test: the images collected at all sampling times were first sent to the VHTM to obtain a batch of image category information; the image categories, depth, and motion data sampled at $t_i$ ($i = 1, \ldots, 139$) were then sent to the Nav-HTM; and finally the Nav-HTM output its predicted inputs at $t_j$ ($j = 2, \ldots, 140$). The motion commands can be split from these predictions. The offline evaluation results obtained with the second demonstrated data set are shown in Table 2 and Figure 8. Table 2 shows that the VHTM outputs for all testing data sets are identical to our desired values, which keeps the inputs to the Nav-HTM network valid. Figure 8 lists the one-step-ahead sequential action predictions of the wheel translational and rotational speeds. The predicted commands for the next sampling time are consistent with the actual ones captured by the motion sensors. In particular, when a command switch occurs (highlighted by the black arrows in Figure 8), the prediction mechanism still works well and produces correct motions. These offline results demonstrate that our proposed navigation method provides correct motion predictions for the different perceived environmental input data.
In the online examination, the RGB images captured in real time were sent to the trained VHTM network, and the depth data were fed to the Nav-HTM network. Only the motion data taken at the first sampling time were sent to the Nav-HTM network. The Nav-HTM itself predicts a command for the next sampling time according to the current RGB-D and motion data. The predicted action is executed and fed back to the Nav-HTM, where it is integrated with the new RGB-D data so that the next action prediction can be generated. Figure 9 shows the online navigation route compared with the demonstrated route. The current route (red line) recreates the learned route (blue line). The difference between the two lines is caused by odometer noise and the accumulated error of dead reckoning. This result suggests that our proposed approach can be used for online autonomous LfD navigation. In fact, once the robot starts to move, it maintains the velocities received at the initial time, and therefore the motion data fed back at every sampling time are used to update the previous actions. The motion data learned in the demonstration process are remembered in the Nav-HTM, and they are treated as the reference for the predicted actions. If the prediction is abnormal, these stored actions can be used for anomaly detection, which will be discussed in the "Conclusion" section.
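The closed loop of the online examination can be sketched as follows. The trained Nav-HTM is replaced here by a hand-written stand-in that only mirrors the three action strategies of this case study, so the sketch shows the data flow (perceive, predict, execute, feed back) and nothing about the network itself. All names, the command tuples, and the scene values are hypothetical.

```python
def run_online(predict_next, perceive, first_motion, n_steps):
    """Feed each predicted command back as the motion input of the next step."""
    motion = first_motion  # only the first motion sample comes from outside
    executed = []
    for t in range(n_steps):
        category, depth_cm = perceive(t)
        motion = predict_next(category, depth_cm, motion)
        executed.append(motion)  # the robot would execute this command
    return executed


def toy_predictor(category, depth_cm, motion):
    """Stand-in for the trained Nav-HTM; returns (translational, rotational)."""
    if category == "closed_door" and depth_cm < 40:
        return (0, 0)    # stop in front of the closed door
    if category == "chair" and depth_cm < 40:
        return (0, 20)   # turn left near the chair
    return motion        # otherwise keep the current command


scenes = [("open_door", 150), ("chair", 35), ("closed_door", 30)]
commands = run_online(toy_predictor, lambda t: scenes[t], (20, 0), len(scenes))
print(commands)  # [(20, 0), (0, 20), (0, 0)]
```

The point of the loop is that, after the first step, the motion input is never read from the sensors as a command source; it is whatever the predictor produced one step earlier.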
The computational platform is a laptop with a Pentium M 1.73 GHz processor and 2 GB of RAM. The time for training the Nav-HTM network is 80.9 s, whereas the VHTM training time is much [...] terms of computational time, it is reasonable to use the proposed method for real-time LfD navigation tasks.
The case study on "Barcelona Robot Lab Dataset"
The Barcelona Robot Lab Dataset (this data set is available at http://www.iri.upc.edu/research/webprojects/pau/datasets/BRL/index.php) is used in this section to further evaluate the performance of the proposed navigation paradigm. This data set, intended for benchmarking robust outdoor navigation algorithms in the robotics community, covers 10,000 m² of the UPC Nord Campus in Barcelona and includes information from multiple sensors. The data of interest in this article are a time-stamped sequence of action/motion commands from the odometry, rich three-dimensional (3-D) laser data, and the sequential stereo images obtained with the custom-built 3-D scanner. Since the trajectories (i.e. the demonstrations) of days 1 and 2 are different, it is not practical to train the HTM network with the data of day 1 and test it with those of day 2. In this article, we used only the day 1 data to validate our navigation method. The training set comprises the data obtained at the odd sampling times ($t_s = 1, 3, 5, \ldots, n$; $n = 649$, where $t_s$ is the sampling time and $n$ is the total number of data), that is, the training data are selected at every second sampling time; in addition, the data corresponding to the motion command switches have to be included in the training set. The stereo images are the inputs of the VHTM, the velocities are from the odometry, and the depth is extracted from the stereo images within a 128 × 96 ROI (the size of the original image is 1280 × 960). After the HTM network is trained, the online motion prediction process, similar to that of the first experiment, is executed for every sampling time. The difference between this online experiment and the first one is that the image and depth data are not captured in real time. We send the stereo images and related depth data to the HTM network frame by frame according to the time stamp. With this configuration, the robustness of our proposed navigation method can be further examined. Figure 10(a) and (b) shows the predicted motion commands compared with the actual commands of the data set. Unlike the results in Figure 8, there are errors between the predicted and actual commands.
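The selection of the training set can be sketched as follows. The sketch assumes that "the data corresponding to the motion command switches" means the first sample at which the command differs from the previous one; the function name and the toy command sequence are hypothetical.

```python
def select_training_times(commands):
    """Return the 1-based sampling times used for training.

    commands: list of motion commands, one per sampling time 1..n.
    Odd sampling times are always kept; a sampling time is also kept when
    its command differs from the previous one (assumed meaning of a switch).
    """
    n = len(commands)
    chosen = set(range(1, n + 1, 2))
    for t in range(2, n + 1):
        if commands[t - 1] != commands[t - 2]:
            chosen.add(t)
    return sorted(chosen)


commands = ["F", "F", "L", "L", "L", "F", "F"]
print(select_training_times(commands))  # [1, 3, 5, 6, 7]
```

In the toy sequence, the switch at time 3 already falls on an odd sampling time, whereas the switch at time 6 is added explicitly.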
In the first experiment, all the data were used for training, whereas only a part of the data is selected as the training set in this experiment; the sequential commands predicted from the partial demonstration data therefore contain errors. However, the time interval between training data is short, and, in particular, the data corresponding to the motion command switches sometimes follow the data grabbed at the odd time sample. This makes the training set almost continuous. In the online experiment, most motion commands and stereo images have already been used in the training procedure, and the input data sent to the HTM networks are recalled one by one from the data set rather than acquired from the real sensors, so they lack the sensor uncertainty. Therefore, the robot poses calculated from the motion commands have small accumulated errors. The mean and variance for the translational and rotational commands are $\mu_{tran} = 0.0077$, $\sigma_{tran} = 1.08$ and $\mu_{rot} = -0.24$, $\sigma_{rot} = 0.021$, respectively. These errors have little influence on the robot pose estimation, which is illustrated in Figure 11. The predicted navigation routine (dash-dot line with circle marker) is close to the demonstrated robot poses (dash line with cross marker). Table 3 lists the precision and recall rates of our proposed method compared to the PIRF-Nav 2.0 algorithm. 22 For the PIRF-Nav 2.0, we used the first motion command to calculate the initial robot pose and then estimated each subsequent pose according to the next motion command and stereo image data. The errors between the poses estimated by the two methods and the practical pose computed from the motion commands of the day 1 data set are obtained. With these errors, the mean and variance for the robot pose can be calculated, as shown in Table 3. The recall rate is the average detection rate at the loop-closure parts, which are marked in Figure 12. For our proposed method, loop-closure recognition is implemented by the VHTM module. The comparison shows that the recall rate of our proposed method is slightly higher than that of PIRF-Nav 2.0 at a similar robot pose precision. These results indicate that our proposed LfD navigation can also be applied to a complex outdoor environment.
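The training-set selection described for this experiment (every second sample, plus every sample at which the motion command switches) can be illustrated with a minimal sketch. The function name, the `(translational, rotational)` tuple layout, and the use of exact equality to detect a switch are assumptions made for illustration only; they are not taken from the original implementation.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout of one motion command: (translational, rotational).
Command = Tuple[float, float]


def select_training_indices(commands: Sequence[Command], step: int = 2) -> List[int]:
    """Return the sorted indices of the samples kept for training.

    Keeps every `step`-th sample and, in addition, every sample whose
    command differs from the previous one (a motion command switch).
    """
    keep = set(range(0, len(commands), step))   # regular subsampling
    for k in range(1, len(commands)):
        if commands[k] != commands[k - 1]:      # assumed switch criterion
            keep.add(k)
    return sorted(keep)
```

For a demonstration whose command changes at an odd sample index, that index is kept next to its even neighbours, which is why the resulting training set is close to continuous around the switches.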

Anomaly detection
There is an important issue to be considered in the online evaluation. If the predicted actions deviate from those expected, the robot is likely to fail in autonomous tasks, such as navigation in our experimental environment. In NuPIC terms, this situation is referred to as an anomaly. It is valuable to detect anomalies in real time for many applications. CLA takes the anomaly likelihood computed from an anomaly score, a powerful anomaly detection analysis approach, to address this problem. 23 The anomaly likelihood enables the CLA to provide a metric representing the degree to which each record of the input sequence is predictable. It is relative to the data stream rather than an absolute measurement of abnormal behavior and is thus a critical reference to detect whether the pattern with a high anomaly score is actually anomalous. Anomaly likelihood creates an average of the error score and then compares the current average error to a distribution of what the average error has been over the past data stream. This allows us to identify anomalies based on probability. As shown in Figure 13, if the anomaly likelihood is in the green section, this suggests that the record is normal. If it is in the red section, the record shows an abnormal value, which indicates that the pattern is a novel one not seen in any sequence. The yellow section indicates that the pattern is somewhat unusual and that we do not have high confidence. In our application, we consider a pattern anomalous if its likelihood is in the yellow section. Based on the concept of anomaly detection, we calculated the anomaly likelihood for each predicted action in the online navigation experiment. If the anomaly likelihood of any action is above a predefined probability threshold $P_{Th\_ano}$ (0.90 in our experiment, i.e. the probability or accuracy of the green section is 90%, which is equivalent to a 1.65σ tolerance interval for a normal distribution), we apply a simple action retrieval strategy: the remembered action sequence stored in Nav-HTM is recalled to replace the action that has the high anomaly likelihood. The retrieved action is treated as the prediction for the next time step.
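The idea of comparing the current average error with the distribution of past average errors can be sketched as follows. This is an illustrative approximation under stated assumptions, not the NuPIC implementation: the class name, the window sizes, the Gaussian model of past averages, and the neutral value 0.5 returned before enough history exists are all hypothetical choices.

```python
import math
from collections import deque
from statistics import mean, pstdev


class AnomalyLikelihood:
    """Sketch: how unusual is the recent average anomaly score?"""

    def __init__(self, short_window: int = 10, long_window: int = 200,
                 min_sigma: float = 1e-6) -> None:
        self.short = deque(maxlen=short_window)    # recent raw anomaly scores
        self.history = deque(maxlen=long_window)   # past short-term averages
        self.min_sigma = min_sigma                 # avoids division by zero

    def update(self, anomaly_score: float) -> float:
        self.short.append(anomaly_score)
        current_avg = mean(self.short)
        if len(self.history) < 2:
            likelihood = 0.5                       # assumed neutral start-up value
        else:
            mu = mean(self.history)
            sigma = max(pstdev(self.history), self.min_sigma)
            z = (current_avg - mu) / sigma
            # Standard normal CDF, i.e. 1 - Q(z), written with erfc.
            likelihood = 0.5 * math.erfc(-z / math.sqrt(2.0))
        self.history.append(current_avg)
        return likelihood
```

A likelihood close to 1 means that the recent average error is far above what the stream has produced so far, which is the case that a threshold such as 0.90 is meant to flag.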
We did not detect any abnormal predicted actions in the online navigation experiment above. To validate the performance of the proposed action retrieval strategy, we added an impulse noise with an amplitude of 15 on the 65th predicted translational speed. The anomaly likelihood for this predicted action is 0.954, which is over 0.90. We replaced this anomalous speed with the stored speed and sent it back to Nav-HTM as the prediction for the next time. With this replacement procedure, the following predicted actions after the 65th sampling time were correctly maintained. Because the CLA prediction mechanism in our experiment is one step ahead, we only retrieved one predicted action. If a multistep ahead prediction mechanism is adopted, the number of action retrievals is determined by the number of prediction steps and anomaly likelihoods.
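The one-step action retrieval rule can be summarized in a short sketch. The function name and the representation of the remembered action sequence as a plain list are assumptions for illustration; how Nav-HTM actually stores and recalls the demonstrated actions is not shown here.

```python
from typing import Sequence

P_TH_ANO = 0.90  # probability threshold used in the experiment


def retrieve_if_anomalous(predicted: float, likelihood: float,
                          stored: Sequence[float], step: int,
                          threshold: float = P_TH_ANO) -> float:
    """Return the stored action when the prediction is flagged as anomalous.

    `stored` stands in for the remembered action sequence (hypothetical
    list representation); `step` is the zero-based index of the action.
    """
    if likelihood > threshold:
        return stored[step]      # replace the anomalous prediction
    return predicted             # keep the prediction unchanged
```

With a likelihood of 0.954, as in the injected-noise test, the corrupted speed is discarded and the stored speed is passed on as the prediction for the next time step.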

New image encoder
In the present study, we used the earlier generation of HTM implementation to design a VHTM network so that the obtained images could be recognized or classified as a special category, and we further encoded the categories. However, this implementation mechanism has some disadvantages. The learning algorithm of the old generation HTM is a partial CLA, which only includes the key CLA components, that is, spatial and temporal pooling, and has simpler learning dynamics. Additionally, the old generation HTM has no concept of encoders, no complete cell structure, and only a one-cell-per-column network. All of these factors negatively impact the learning performance, making this process only suitable for solving the pattern recognition problem. Hence, the VHTM is not an image encoder but rather a classifier system. Additionally, incorporating two different generations of HTM under different compiling platforms is a complex programming task. In our experiments, we transferred large parts of the old generation HTM code to the new HTM compile platform. However, the compiling platform transformation decreases the computational efficiency.
To address the problems above, it is necessary to design a new encoder to convert the image data to SDR. In our previous work, 24 we attempted to use a visual vocabulary technique to encode the images. Unfortunately, it cannot always maintain the sparse distributed property. A promising work is from Rinkus' research. 25 He proposed a hierarchical sparse distributed coding and quantum computing technique, which has been successfully used to solve the visual processing problem. The future work of our present study can be directed to address how to integrate Rinkus' work into the current CLA algorithm.

Biological evidence for action prediction
The actions incorporated into the perceived inputs can contribute to predicting the future consequences of the current actions. This is an important cognitive function in the perception-action integration system, which has been examined by Knoblich and Flach. 26 They also proved that this type of prediction becomes more accurate when one obtains the knowledge from one's own actions rather than those of others. Their research provides the biological evidence to support the action prediction mechanism of HTM and its application for robot navigation tasks. However, the current HTM only implements a simple consequence prediction. It provides a sequence of predicted actions, including one-step or multistep predictions, but does not consider the potential information behind these predictions. From a biological viewpoint, the present version of HTM does not link the perceptual input with the action system to predict the future outcome of actions, 26 that is, it does not explain the perception of intentionality for goal-related actions 27 or implement the understanding of the intention hidden in the sequential predicted actions. 28 Additionally, how the predicted actions guide the future perception process is not considered. Therefore, both of these issues will be the topics of our future work.

Conclusion
This study is the first attempt to explore the perception-action integration from the view of HTM, which mimics the substantial functions of the human neocortex. The main concept is that sequential perceptual information combined with motion data simultaneously contributes to predicting one-step future actions. The perceived images were first sent to a VHTM network to obtain corresponding categories. The categories were then incorporated with depth and motion data to be encoded as a sequence of 1-D SDR vectors. By using spatial and temporal pooling dynamics of CLA, the sequential vectors were treated as the inputs to train the Nav-HTM network; after the training, the Nav-HTM stored the transitions between the perceived images, depth, and motion so that the future actions could be predicted.
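As an illustration of how a category, a depth value, and a motion command might be combined into one 1-D binary vector, a minimal sketch is given below. It is not the encoder used in this study: the bit widths, the value ranges (depth in [0, 5], translational speed in [0, 1], rotational speed in [-1, 1]), the number of categories, and all function names are hypothetical.

```python
from typing import List


def encode_scalar(value: float, min_val: float, max_val: float,
                  n_bits: int = 64, w: int = 8) -> List[int]:
    """Encode a scalar as a block of `w` consecutive active bits.

    Nearby values share active bits; distant values share none.
    """
    value = min(max(value, min_val), max_val)          # clip to the range
    n_buckets = n_bits - w + 1
    start = int(round((value - min_val) / (max_val - min_val) * (n_buckets - 1)))
    return [1 if start <= i < start + w else 0 for i in range(n_bits)]


def encode_category(index: int, n_categories: int, w: int = 8) -> List[int]:
    """Encode a category as its own non-overlapping block of `w` bits."""
    if not 0 <= index < n_categories:
        raise ValueError("category index out of range")
    bits = [0] * (n_categories * w)
    bits[index * w:(index + 1) * w] = [1] * w
    return bits


def encode_record(category: int, depth: float,
                  v_tran: float, v_rot: float) -> List[int]:
    """Concatenate the four encodings into one 1-D vector (assumed ranges)."""
    return (encode_category(category, n_categories=10)
            + encode_scalar(depth, 0.0, 5.0)
            + encode_scalar(v_tran, 0.0, 1.0)
            + encode_scalar(v_rot, -1.0, 1.0))
```

With these illustrative sizes, each record becomes a 272-bit vector with 32 active bits; a sequence of such vectors is the kind of input that the spatial and temporal pooling stages consume.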

Figure 1. Structure of a typical HTM neural network. HTM: hierarchical temporal memory.

Figure 3. Components of an HTM cell. HTM: hierarchical temporal memory.

Figure 4. Function for updating the bf. bf: boost factor.
represents the active state of cell $i$ in column $c$ at time $t$ given the current feed-forward input and previous temporal context; $nl_i^c(t)$ and $np_i^c(t-1)$ are the learning and predictive state of cell $i$ in column $c$ at time $t$ and $t-1$, respectively; and $sga_i^c(t-1)$ represents the active segment on cell $i$ in column $c$ at time $t-1$. Similarly, $sgl_i^c(t-1)$ is the segment activated by the learning cell at time $t-1$. If multiple segments are active, sequence segments are given preference. $n_c$ is the number of cells in column $c$.
Phase 2: Forming a prediction based on the input in the context of prior inputs. Following phase 1, according to equation (11), the cells with active segments are admitted to the predictive state unless they are already active due to feed-forward input. $np_i^c(t)$ represents the predictive state of cell $i$ in column $c$ at time $t$. All of the predictive cells form the prediction of the region
$$np_{i,\,c \in C_{act}(t)}^{c}(t) = 1, \quad \text{if } sga_{i,\,c \in C_{act}(t)}^{c}(t) = 1$$
… $pm_{dec})$, if $np_i^c(t) = 0$ and $np_i^c(t-1) = 1$ … represents the $j$th synapse permanence value of a segment on column $c$ of cell $i$, and $pm_{inc}$ and $pm_{dec}$ are the incremented and decremented permanence values in temporal pooling dynamics, respectively.
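Phase 2 can be illustrated with a small sketch in which a cell enters the predictive state when at least one of its segments is active, unless the cell is already active from feed-forward input. The data layout, the function names, and the two thresholds (`connected_perm`, `activation_threshold`) are assumptions for illustration; the criterion used here for an active segment (enough connected synapses onto currently active cells) follows the usual CLA description rather than the exact equations of this article.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]            # (column c, cell index i)
Synapse = Tuple[Cell, float]      # (presynaptic cell, permanence value)
Segment = List[Synapse]


def segment_active(segment: Segment, active_cells: Set[Cell],
                   connected_perm: float = 0.5,
                   activation_threshold: int = 2) -> bool:
    """A segment is active if enough connected synapses see active cells."""
    n_active = sum(1 for pre, perm in segment
                   if perm >= connected_perm and pre in active_cells)
    return n_active >= activation_threshold


def predictive_cells(segments: Dict[Cell, List[Segment]],
                     active_cells: Set[Cell],
                     connected_perm: float = 0.5,
                     activation_threshold: int = 2) -> Set[Cell]:
    """Return the cells that enter the predictive state at this time step."""
    predicted: Set[Cell] = set()
    for cell, cell_segments in segments.items():
        if cell in active_cells:
            continue                      # already active from feed-forward input
        if any(segment_active(s, active_cells, connected_perm,
                              activation_threshold) for s in cell_segments):
            predicted.add(cell)
    return predicted
```

The union of the returned cells plays the role of the region's prediction for the next input.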

Figure 5. Structure of hybrid HTM network for perception-action application. HTM: hierarchical temporal memory.
longer (370.7 s). The online evaluation process, which consists of loading the trained networks, encoding RGB-D and action data, implementing spatial and temporal pooling, and predicting the output, takes 0.27 s. The cost of validation is considerably lower than that of training because training is a batch process. Categorizing all of the RGB images accounts for nearly half of the training time. In comparison, only one frame of image, depth, and motion data has to be processed in online evaluation; hence, the time cost is reduced considerably. Considering the results in

Figure 7. Hand-measured map and predefined navigation routine.

Figure 6. Typical objects in the simple experiment setting.

Figure 8. Offline evaluation results of the predicted actions.

Figure 9. Navigation routine in online evaluation.

Figure 10. The errors between the practical and predicted motion commands.

Figure 11. Online evaluation results of the robot poses.

Figure 12. The loop-closure parts of day 1 data set.

Table 1. Parameters of the CLA dynamic process.

Table 2. Offline evaluation results for VHTM.