Combining learned and analytical models for predicting action effects from sensory data

One of the most basic skills a robot should possess is predicting the effect of physical interactions with objects in the environment. This enables optimal action selection to reach a certain goal state. Traditionally, dynamics are approximated by physics-based analytical models. These models rely on specific state representations that may be hard to obtain from raw sensory data, especially if no knowledge of the object shape is assumed. More recently, we have seen learning approaches that can predict the effect of complex physical interactions directly from sensory input. It is, however, an open question how far these models generalize beyond their training data. In this work, we investigate the advantages and limitations of neural-network-based learning approaches for predicting the effects of actions based on sensory input, and we show how analytical and learned models can be combined to leverage the best of both worlds. As the physical interaction task, we use planar pushing, for which a well-known analytical model and a large real-world dataset exist. We propose using a convolutional neural network to convert raw depth images or organized point clouds into a suitable representation for the analytical model and compare this approach with using neural networks for both perception and prediction. A systematic evaluation on a very large real-world dataset shows two main advantages of the hybrid architecture. Compared with a pure neural network, it significantly (i) reduces the required training data and (ii) improves generalization to novel physical interactions.

I. INTRODUCTION

We approach the problem of predicting the consequences of physical interaction with objects in the environment based on raw sensory data. Traditionally, interaction dynamics are described by a physics-based analytical model [24,18,26] which relies on a certain representation of the environment's state. This approach has the advantage that the underlying function and the input parameters of the model have physical meaning and can therefore be transferred to problems with variations of these parameters. It also makes the assumptions underlying the model transparent. However, defining such models for complex scenarios and extracting the required state representation from raw sensory data may be very hard.
More recently, we have seen approaches that successfully replace the physics-based models with learned ones [27,4,20,15,3]. While often more accurate than analytical models, these methods still assume a predefined state representation as input and do not address the problem of how it may be extracted from the raw sensory data. Some neural network based methods instead simultaneously learn a representation of the sensory input and the associated dynamics from large amounts of training data, e.g. [5,2,22,8]. They have shown impressive results in predicting the effect of physical interactions. In [2], the authors argue that a network may benefit from choosing its own representation of the input data instead of being forced to use a predefined state representation. The underlying function of the dynamics and the state representation are however not intuitively understandable and cannot be mapped to physical quantities. It is thus unclear how these models could be transferred to similar problems. Neural networks also often have the capacity to memorize their training data [25] and learn a mapping from inputs to outputs instead of the underlying function. This can make perfect sense if the training data covers the whole problem domain. However, when data is sparse (e.g. because a robot learns by experimenting), the question of how to generalize beyond the training data becomes more important.
Our hypothesis is that using prior knowledge from existing physics-based models can provide a way to reduce the amount of required training data and at the same time ensure good generalization beyond the training domain. In this paper, we thus investigate using neural networks for extracting a suitable representation from raw sensory data that can then be consumed by an analytical model for prediction. We compare this hybrid approach to using a neural network for both perception and prediction and to the analytical model applied on ground truth input values.
As example physical interaction task, we choose planar pushing. For this task, a well-known physical model [18] is available as well as a large, real-world dataset [24] which we augmented with simulated images. Given a depth image of a tabletop scene with one object and the position and movement of the pusher, our models need to predict the object's position in the given image and its movement due to the push. Although the state-space of the object is rather low-dimensional (2D position plus orientation), pushing is already a quite involved manipulation problem: The system is under-actuated and the relationship between the push and the object's movement is highly non-linear. The pusher can slide along the object and dynamics change drastically when it transitions between sticking and sliding-contact or makes and breaks contact.
Our experiments show that despite relying on depth images to extract position and contact information, all our models perform similarly to the analytical model applied to the ground truth state. Given enough training data and evaluated inside its training domain, the pure neural network implementation performs best and even outperforms the analytical model baseline significantly. However, when it comes to generalization to new actions, the model-based approaches are much more accurate. Additionally, we find that the proposed combined approach needs significantly less training data than the neural network model to arrive at a high prediction accuracy.
In summary, the contributions of this paper are: (i) combining a neural network for perception with an analytical dynamics model and training it end-to-end; (ii) demonstrating the advantages of this approach over using a neural network for learning both perception and prediction on a real-world physical interaction task; (iii) augmenting an existing dataset of planar pushing with depth and RGB images and additional contact information. The code for this is available online.

II. RELATED WORK

A. Models for Pushing
Analytical models of quasi-static planar pushing have been studied extensively in the past, starting with Mason [19]. Goyal et al. [9] introduced the limit surface to relate frictional forces with object motion, and much work has been done on different approximate representations of it [10,11]. In this work, we use a model by Lynch et al. [18], which relies on an ellipsoidal approximation of the limit surface.
More recently, there has also been a lot of work on data-driven approaches to pushing [27,4,20,15,3]. Kopicki et al. [15] describe a modular learner that outperforms a physics engine for predicting the results of 3D quasi-static pushing, even when generalizing to unseen actions and object shapes. This is achieved by providing the learner not only with the trajectory of the object's global frame, but also with multiple local frames that describe contacts. The approach however requires knowledge of the object's pose from an external tracking system, and the learner does not place the contact frames itself. Bauza and Rodriguez [3] train a heteroscedastic Gaussian Process that predicts not only the object's movement under a certain push, but also the expected variability of the outcome. The trained model outperforms an analytical model [18] given very few training examples. It is, however, specifically trained for one object, and generalization to different objects is not attempted. Moreover, this work, too, assumes access to the ground truth state, including the contact point and the angle between the push and the object's surface.
B. Learning Dynamics Based on Raw Sensory Data

Many recent approaches in reinforcement learning aim to solve the so-called "pixels to torque" problem, where the network processes images to extract a representation of the state and then directly returns the required action to achieve a certain task [17,16]. Jonschkowski and Brock [13] argue that the state representation learned by such methods can be improved by enforcing robotic priors on the extracted state, such as temporal coherence. Compared to what we propose here, this is an alternative way of including basic principles of physics in a learning approach. While policy learning requires understanding the effect of actions, the above methods do not acquire an explicit dynamics model. We are interested in learning such an explicit model, as it enables optimal action selection (potentially over a larger time horizon). The following papers share this aim.
Agrawal et al. [2] consider a learning approach for pushing objects. Their network takes as input the pushing action and a pair of images: one before and one after a push. After encoding the images, two different network streams attempt to predict (i) the encoding of the second image given the first and the action and (ii) the action necessary to transition from the first to the second encoding. Simultaneously training for both tasks improves the results on action prediction. The authors do not enforce any physical models or robotic priors. As the learned models operate directly on image encodings instead of physical quantities, we cannot compare the accuracy of the forward prediction part (i) to our results. SE3-Nets [5] process dense 3D point clouds and an action to predict the next point cloud. For each object in the scene, the network predicts a segmentation mask and the parameters of an SE3 transform (linear velocity, rotation angle and axis). In newer work [6], an intermediate step is added that computes the 6D pose of each object before predicting the transforms based on this more structured state representation. The output point cloud is obtained by transforming all input pixels according to the transform of the object they correspond to. The resulting predictions are very sharp, and the network is shown to correctly segment the objects and determine which are affected by the action. An evaluation of the generalization to new objects or forces was however not performed.
Our own architecture is inspired by this work. The pure neural network we use to compare to our hybrid approach can be seen as a simplified variant of SE3-Nets, that predicts SE2 transforms (see Sec. III-B). Since we define the loss directly on the predicted movement of the object, we omit predicting the next observation and the segmentation masks required for this. We also use a modified perception network, which relies mostly on a small image patch around the robot's end-effector.
The work of Finn et al. [8] is similar to [5]; it explores different ways of predicting the next frame from a sequence of actions and RGB images using recurrent neural networks.
Visual Interaction Networks [22] also take temporal information into account. A convolutional neural network encodes consecutive images into a sequence of object states. Dynamics are predicted by a recurrent network that considers pairs of objects to predict the next state of each object.

C. Combining Analytical Models and Learning
The idea of using analytical models in combination with learning has also been explored in previous work. Degrave et al. [7] implemented a differentiable physics engine for rigid body dynamics in Theano and demonstrated how it can be used to train a neural network controller. In [21], the authors significantly improve Gaussian Process learning of inverse dynamics by using an analytical model of robot dynamics with fixed parameters as the mean function or as a feature transform inside the covariance function of the GP's kernel. Neither work, however, covers visual perception. Most recently, Wu et al. [23] used a graphics and physics engine to learn to extract object-based state representations in an unsupervised way: Given a sequence of images, a network learns to produce a state representation that is predicted forward in time using the physics engine. The graphics engine is used to render the predicted state, and its output is compared to the next image as a training signal. In contrast to the aforementioned work, we not only combine learning and analytical models, but also evaluate the advantages and limitations of this approach.

III. PREDICTING THE EFFECTS OF PUSHING ACTIONS
Our aim is to analyse the benefits of combining neural networks with analytical models. We compare to models that exclusively rely on either approach. As a test bed, we use planar pushing, for which a well-known analytical model and a real-world dataset are available. In this section, we introduce the analytical model and the different network architectures.

A. An Analytical Model of Planar Pushing
We use the analytical model of quasi-static planar pushing that was devised by Lynch et al. [18]. It predicts the object's movement v_o given the pusher velocity u, the contact point c and associated surface normal n as well as two friction-related parameters l and µ. The problem is illustrated in Figure 1, which also contains a list of symbols. Note that this model is still approximate and far from perfectly modelling the stochastic process of planar pushing [24].
Predicting the effect of a push with this model has two stages: First, it determines whether the push is stable ("sticking contact") or whether the pusher will slide along the object ("sliding contact"). In the first case, the velocity of the object at the contact point will be the same as the velocity of the pusher. In the sliding case however, the pusher's movement can be almost orthogonal to the resulting motion at the contact point. We call the motion at the contact point the "effective push velocity" v_p. It is the output of the first stage. Given v_p and the contact point, the second stage then predicts the resulting translation and rotation of the object's centre of mass.
Stage 1: Determining the contact type and computing v_p: To determine the contact type (slipping or sticking), we have to find the left and right boundary forces f_l, f_r of the friction cone (i.e. the forces for which the pusher will just not start sliding along the object) and the corresponding torques m_l, m_r. The opening angle α of the friction cone is defined by the friction coefficient µ between pusher and object. The forces and torques are then computed as

α = arctan(µ)    (1)
f_l = R(α) n,    f_r = R(−α) n    (2)
m_l = c̄ × f_l,    m_r = c̄ × f_r    (3)

where R(α) denotes a rotation matrix by the angle α and c̄ = c − o is the contact point relative to the object's centre of mass o.
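For concreteness, Eqs. (1)-(3) can be computed in a few lines of NumPy. This is a minimal sketch with our own variable names, not code from the paper; the normal n is assumed to be a 2D vector in the object plane.

```python
import numpy as np

def friction_cone(n, c_bar, mu):
    """Boundary forces f_l, f_r of the friction cone and their torques m_l, m_r.

    n     : 2D contact normal
    c_bar : 2D contact point relative to the object's centre of mass
    mu    : friction coefficient between pusher and object
    """
    alpha = np.arctan(mu)                        # opening angle, Eq. (1)
    def rot(a):                                  # 2D rotation matrix R(a)
        return np.array([[np.cos(a), -np.sin(a)],
                         [np.sin(a),  np.cos(a)]])
    f_l, f_r = rot(alpha) @ n, rot(-alpha) @ n   # boundary forces, Eq. (2)
    m_l, m_r = np.cross(c_bar, f_l), np.cross(c_bar, f_r)  # torques, Eq. (3)
    return f_l, f_r, m_l, m_r
```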
To relate the forces to object velocities, [18] uses an ellipsoidal approximation of the limit surface. To simplify notation, we use the subscript b to refer to quantities associated with either the left (l) or right (r) boundary forces. v_{o,b} and ω_{o,b} denote the linear and angular object velocity, respectively. v_{p,b} are the push velocities that would create the boundary forces. They span the so-called "motion cone".
ω_{o,b} only acts as a scaling factor; since we are interested in the direction of v_{p,b} and not in its magnitude, we fix it to a constant.

To compute the effective push velocity v_p, we need to determine the contact case: If the push velocity u lies outside of the motion cone, the contact will slip. The resulting effective push velocity then acts in the direction of the boundary velocity v_{p,b} that is closer to the push direction. Otherwise, the contact is sticking and we can use the pusher's velocity as the effective push velocity, v_p = u. When the norm of n is zero (e.g. due to a wrong prediction of the perception network), we set v_{p,b} to zero.

Stage 2: Using v_p to predict the object's motion: Given the effective push velocity v_p and the contact point c̄ relative to the object's centre of mass, we compute the object's linear and angular velocity as

v_{o,x} = ((l² + c̄_x²) v_{p,x} + c̄_x c̄_y v_{p,y}) / (l² + c̄_x² + c̄_y²)
v_{o,y} = (c̄_x c̄_y v_{p,x} + (l² + c̄_y²) v_{p,y}) / (l² + c̄_x² + c̄_y²)
ω_o = (c̄_x v_{o,y} − c̄_y v_{o,x}) / l²

The object will of course only move if the pusher is in contact with it. To use the model also in cases where no force acts on the object, we introduce the contact indicator variable s. It takes values between zero and one and is multiplied with v_p to switch off responses when there is no contact. We allow s to be continuous instead of binary to give the model a chance to react to the pusher making or breaking contact during the interaction.
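The complete two-stage prediction can then be sketched as follows. This is our own NumPy reconstruction under the stated assumptions, not the authors' code: the motion-cone membership test is implemented as a non-negative-combination check, and the magnitude chosen in the sliding case (the projection of u onto the closer boundary direction) is one plausible choice that the text above does not spell out.

```python
import numpy as np

def rot(a):
    """2D rotation matrix."""
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

def predict_push(u, c_bar, n, l, mu, s=1.0):
    """Two-stage analytical prediction of the object's linear and angular velocity.

    u     : 2D pusher velocity
    c_bar : 2D contact point relative to the object's centre of mass
    n     : 2D contact normal (may be zero if perception fails)
    l, mu : friction-related model parameters
    s     : contact indicator in [0, 1]
    """
    if np.linalg.norm(n) < 1e-9:          # no valid contact normal: no motion
        return np.zeros(2), 0.0

    # Stage 1: friction cone boundary forces and torques (Eqs. 1-3).
    alpha = np.arctan(mu)
    f_b = [rot(alpha) @ n, rot(-alpha) @ n]
    m_b = [np.cross(c_bar, f) for f in f_b]

    # Boundary velocities of the motion cone: under the ellipsoidal limit-surface
    # approximation the object twist is proportional to (f, m / l^2), and the
    # pusher velocity at the contact is v_o + omega x c_bar. Only the direction
    # matters, so the common scale factor is dropped.
    perp = np.array([-c_bar[1], c_bar[0]])
    v_pb = [f + (m / l**2) * perp for f, m in zip(f_b, m_b)]

    # Sticking if u is a non-negative combination of the two boundary velocities.
    A = np.stack(v_pb, axis=1)
    sticking = False
    if abs(np.linalg.det(A)) > 1e-9:
        w = np.linalg.solve(A, u)
        sticking = bool(np.all(w >= 0.0))

    if sticking:
        v_p = u
    else:
        # Sliding: move along the boundary velocity closer to the push direction.
        # The magnitude used here (projection of u onto that direction) is one
        # plausible choice; see the caveat in the lead-in above.
        b = max(v_pb, key=lambda v: u @ v / (np.linalg.norm(v) + 1e-9))
        b_hat = b / (np.linalg.norm(b) + 1e-9)
        v_p = (u @ b_hat) * b_hat

    v_p = s * v_p                          # contact indicator gates the push

    # Stage 2: ellipsoidal limit-surface model of Lynch et al.
    cx, cy = c_bar
    denom = l**2 + cx**2 + cy**2
    vx = ((l**2 + cx**2) * v_p[0] + cx * cy * v_p[1]) / denom
    vy = (cx * cy * v_p[0] + (l**2 + cy**2) * v_p[1]) / denom
    omega = (cx * vy - cy * vx) / l**2
    return np.array([vx, vy]), omega
```

Writing the model with elementary operations like this is also what makes it usable as a differentiable layer in the hybrid networks below; the non-negative-combination test is simply a numerically convenient way of comparing the push direction against the cone boundaries.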
Discussion of Underlying Assumptions: The analytical model is built on three simplifying assumptions: (i) quasi-static pushing, i.e. the force applied to the object is big enough to move it but not to accelerate it; (ii) the pressure distribution of the object on the surface is uniform and the limit surface of frictional forces can be approximated by an ellipsoid; (iii) the friction coefficient between surface and object is constant.
The analysis performed in [24] shows that assumptions (ii) and (iii) are frequently violated by real-world data. Assumption (i) holds for push velocities below 50 mm/s. In addition, the contact situation may change during pushing (as the pusher may slide along the object and even lose contact), such that the model's predictions become increasingly inaccurate the further ahead it needs to predict in one step.

B. Combining Neural Networks and Analytical Models
We now introduce the three neural network variants that we will analyse in this paper. All architectures share the same first network stage that processes raw depth images and outputs a lower-dimensional encoding and the object's position. Given this output, the pushing action (the pusher's movement u and its position p) and the friction parameters µ and l (we provide these as inputs since friction-related information cannot be obtained from single images; estimating it from sequences is considered future work), the second part of these networks predicts the object's linear and angular velocity v_o. This predictive part differs for the three network variants. While two of them (simple, hybrid) use variants of the analytical dynamics model established in Sec. III-A, the third (neural) has to learn the dynamics with a neural network. In all three variants, the prediction part has about 1.8 million trainable parameters.
We implement all our networks as well as the analytical model in TensorFlow [1], which allows us to propagate gradients through the analytical model just like through any other layer.
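As a small illustration (ours, written with the TF2 eager API rather than the TF1 graphs the paper likely used), the second stage of the model can be expressed with standard TensorFlow ops, and tf.GradientTape then yields gradients with respect to its inputs, e.g. a contact point regressed by the perception network.

```python
import tensorflow as tf

def stage2_tf(v_p, c_bar, l):
    """Second stage of the analytical model written with differentiable TF ops."""
    cx, cy = c_bar[..., 0], c_bar[..., 1]
    denom = l**2 + cx**2 + cy**2
    vx = ((l**2 + cx**2) * v_p[..., 0] + cx * cy * v_p[..., 1]) / denom
    vy = (cx * cy * v_p[..., 0] + (l**2 + cy**2) * v_p[..., 1]) / denom
    omega = (cx * vy - cy * vx) / l**2
    return tf.stack([vx, vy, omega], axis=-1)

# Gradients flow through the model into whatever produced c_bar, so a perception
# network that regresses c_bar can be trained end-to-end through the model.
c_bar = tf.Variable([[0.03, -0.01]])
v_p = tf.constant([[0.02, 0.0]])
with tf.GradientTape() as tape:
    loss = tf.reduce_sum(stage2_tf(v_p, c_bar, l=0.05) ** 2)
print(tape.gradient(loss, c_bar))
```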
1) Perception: The architecture of the network part that processes the image is depicted in Fig. 2. We assume that the robot knows the position of its end-effector. Therefore, we extract a small (80×80 pixel) image patch ("glimpse") around the tip of the pusher. If the pusher is close to the object, the contact point and the normal to the object's surface can be estimated from this smaller image. The position of the object is estimated from the full image. Taken together, this is all the information necessary to predict the object's movement.
The glimpse is processed with three convolutional layers with ReLU non-linearity, each followed by max-pooling and batch normalization [12]. The full image is processed with a sequence of four convolutional and three deconvolution layers, of which the last has only one channel. This output feature map resembles an object segmentation map. We use a spatial softmax [16] to get the pixel location of the object's centre.
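A possible tf.keras sketch of this perception stage is shown below. The text does not give filter sizes, channel counts or the full-image resolution, so all of those are placeholders; the spatial softmax is implemented as the softmax-weighted expectation of pixel coordinates.

```python
import tensorflow as tf
from tensorflow.keras import layers

def spatial_softmax(feature_map):
    """Expected pixel location of a single-channel feature map (B, H, W, 1)."""
    h, w = feature_map.shape[1], feature_map.shape[2]
    probs = tf.nn.softmax(tf.reshape(feature_map, [-1, h * w]))
    ys, xs = tf.meshgrid(tf.linspace(0.0, 1.0, h), tf.linspace(0.0, 1.0, w),
                         indexing="ij")
    x = tf.reduce_sum(probs * tf.reshape(xs, [1, -1]), axis=1)
    y = tf.reduce_sum(probs * tf.reshape(ys, [1, -1]), axis=1)
    return tf.stack([x, y], axis=-1)            # normalised image coordinates

def conv_block(x, filters):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    return layers.BatchNormalization()(x)

glimpse_in = layers.Input((80, 80, 1))          # depth patch around the pusher tip
g = glimpse_in
for f in (16, 32, 64):                          # three conv/pool/batch-norm blocks
    g = conv_block(g, f)
glimpse_code = layers.Flatten()(g)              # encoding used by the prediction part

image_in = layers.Input((120, 160, 1))          # full depth image (placeholder size)
x = image_in
for f in (16, 32, 32, 32):                      # four convolutional layers
    x = layers.Conv2D(f, 3, strides=2, padding="same", activation="relu")(x)
for f in (32, 16, 1):                           # three deconvolutions, last has one channel
    x = layers.Conv2DTranspose(f, 3, strides=2, padding="same")(x)
object_pos = layers.Lambda(spatial_softmax)(x)  # pixel location of the object centre

perception = tf.keras.Model([glimpse_in, image_in], [glimpse_code, object_pos])
```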
Initial experiments showed that not using the glimpse strongly decreased performance for all networks. We also found that using both the glimpse and an encoding of the full image for estimating all physical parameters was disadvantageous: Using the full image increases the number of trainable parameters in the prediction network and adds no information that is not already contained in the glimpse.
2) Prediction: a) Neural network only (neural): Figure 3 a) shows the prediction part of the variant neural, which uses a neural network to learn the dynamics of pushing. The input to this part is a concatenation of the output from perception with the action and the friction parameter l. The network processes this input with three fully-connected layers before predicting the object's velocity v_o. All intermediate fully-connected layers use ReLU non-linearities; the output layers do not apply a non-linearity.

The variant hybrid instead uses the prediction part to regress the inputs of the full analytical model from Sec. III-A (the contact point, the surface normal and the contact indicator s) and then applies the model to obtain v_o.

The variant simple serves as a middle ground between the two other options: It still contains the main mechanics of how an effective push at the contact point moves the object, but leaves it to the neural network to deduce the effective push velocity from the scene and the action. This gives the model more freedom to correct for possible shortcomings of the analytical model. We expect these to manifest mostly in the first stage of the model, as small errors can have a big effect there when they influence whether a contact is estimated as sticking or slipping. The second stage of the analytical model does not specify how the input action relates to the object's movement, and simple therefore allows us to evaluate the importance of this particular aspect of the analytical model.
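The following sketch illustrates the two kinds of prediction heads. It is our own illustration with placeholder layer widths: the neural head regresses v_o directly with three fully-connected layers, while a hybrid-style head regresses the analytical model's inputs, which would then be passed through the differentiable model from Sec. III-A.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Concatenated input: glimpse encoding, object position, action (p, u) and l.
features = layers.Input((256,))                    # placeholder feature size

# neural: three fully-connected ReLU layers regress v_o = (v_x, v_y, omega) directly.
h = features
for units in (256, 256, 128):
    h = layers.Dense(units, activation="relu")(h)
v_o = layers.Dense(3)(h)                           # no non-linearity on the output
neural_head = Model(features, v_o)

# hybrid: a head of the same kind instead regresses the analytical model's inputs
# (contact point, contact normal, contact indicator s), which are then fed through
# the differentiable analytical model (see the sketch in Sec. III-A).
g = layers.Dense(256, activation="relu")(features)
contact_point = layers.Dense(2)(g)
contact_normal = layers.Dense(2)(g)
contact_indicator = layers.Dense(1, activation="sigmoid")(g)
hybrid_head = Model(features, [contact_point, contact_normal, contact_indicator])
```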
3) Training: For training we use the Adam optimizer [14] with a learning rate of 10⁻⁴ and a batch size of 32 for 75,000 steps. The loss L penalizes the Euclidean distance between the predicted and the real object position in the input image (pos), the Euclidean error of the predicted object translation (trans), the error in the magnitude of the translation (mag) and the error in the angular movement (rot) in degrees (instead of radians, to ensure that all components of the loss have the same order of magnitude):

pos = ‖o − ô‖₂
trans = ‖[v_{o,x}, v_{o,y}] − [v̂_{o,x}, v̂_{o,y}]‖₂
mag = | ‖[v_{o,x}, v_{o,y}]‖₂ − ‖[v̂_{o,x}, v̂_{o,y}]‖₂ |
rot = (180/π) |ω_o − ω̂_o|

where [v_{o,x}, v_{o,y}] denotes the linear object velocity and hats mark predicted quantities. We use weight decay with λ = 0.001.

When using the variant hybrid, a major challenge is the contact indicator s: In the beginning of training, the direction of the predicted object movement is mostly wrong. s therefore receives a strong negative gradient, causing it to decrease quickly. Since the predicted motion is effectively multiplied by s, a low s results in the other parts of the network receiving small gradients, which greatly slows down training. We therefore add the error in the magnitude of the predicted velocity to the loss to prevent s from decreasing too far.
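Read literally, the loss and optimizer setup could look as follows; this is our interpretation of the description above, with hats denoting predictions.

```python
import math
import tensorflow as tf

def push_loss(o, o_hat, v, v_hat, w, w_hat):
    """pos + trans + mag + rot, with rotation converted from radians to degrees.

    o: object position in the image, v: linear object velocity [v_x, v_y],
    w: angular velocity; *_hat are the corresponding predictions.
    """
    pos = tf.norm(o - o_hat, axis=-1)
    trans = tf.norm(v - v_hat, axis=-1)
    mag = tf.abs(tf.norm(v, axis=-1) - tf.norm(v_hat, axis=-1))
    rot = tf.abs(w - w_hat) * 180.0 / math.pi
    return tf.reduce_mean(pos + trans + mag + rot)

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)   # batch size 32, 75k steps
# Weight decay with lambda = 0.001 is applied as an L2 penalty on the trainable weights.
```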

IV. DATA
We use the MIT Push Dataset [24] for our experiments. It contains object pose and force recordings from real robot experiments, where eleven different planar objects are pushed on four different surface materials. For each object-surface combination, the dataset contains about 6000 pushes that vary in the manipulator's ("pusher") velocity and acceleration, the point on the object where the pusher makes contact and the angle between the object's surface and the push direction. Pushes are 5 cm long and data was recorded at 250 Hz.
As this dataset does not contain RGB or depth images, we render them using OpenGL and the mesh data supplied with the dataset. In this work, we only use the depth images; RGB will be considered in future work. A rendered scene consists of a flat surface with one of four textures (for the four surface materials), on which one of the objects is placed. The pusher is represented by a vertical cylinder with no arm attached. We use a top-down view of the scene, such that the object can only move in the image plane and the z-coordinate of all scene components remains constant. This simplifies the application of the analytical model by removing the need for an additional transform between the camera and the table. Figures 4 and 5 show the different objects and example images. We also annotated the dataset with all information necessary to apply the analytical model, so that it can be used as a baseline. The code for annotation and rendering images is available online at https://github.com/mcubelab/pdproc.

We construct our datasets for training and testing from a subset of the Push Dataset. As the analytical model we use does not take the acceleration of the pusher into account, we only use push variants with zero pusher acceleration. We do, however, evaluate on data with high pusher velocities, which break the analytical model's quasi-static assumption (in Sec. V-E). One data point in our datasets consists of a depth image showing the scene before the push is applied, the object's position before and after the push, and the pusher's initial position and movement. The prediction horizon is 0.5 seconds in all datasets. Section V contains more details about the datasets used for each experiment.
We use data from multiple randomly chosen timesteps of each sequence in the Push Dataset. Some of the examples thus contain shorter push motions than others, as the pusher starts moving with some delay or ends its movement during the 0.5 second time window. To achieve more visual variance and to balance the number of examples per object type, we sample a number of transforms of the scene relative to the camera for each push. Finally, about a third of our dataset consists of examples where we moved the pusher away from the object, such that the object is not affected by the push movement.
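A rough sketch of how such an example might be assembled from one recorded push sequence is given below (rendering of the depth image is omitted); all names and the augmentation ranges are hypothetical.

```python
import numpy as np

def make_example(times, poses, pusher_pos, rng, horizon=0.5):
    """Assemble one training example from a recorded push sequence.

    times: (T,) timestamps in seconds, poses: (T, 3) object pose [x, y, theta],
    pusher_pos: (T, 2) pusher position; all structure here is hypothetical.
    """
    t0 = rng.uniform(times[0], times[-1] - horizon)      # random start inside the push
    i0, i1 = np.searchsorted(times, [t0, t0 + horizon])
    i1 = min(i1, len(times) - 1)
    # random planar transform of the scene relative to the camera (augmentation)
    phi = rng.uniform(0.0, 2.0 * np.pi)
    shift = rng.uniform(-0.05, 0.05, size=2)
    R = np.array([[np.cos(phi), -np.sin(phi)], [np.sin(phi), np.cos(phi)]])
    to_cam = lambda p: R @ p + shift
    return {
        "object_pos_before": to_cam(poses[i0, :2]),
        "object_pos_after": to_cam(poses[i1, :2]),
        "object_rotation": poses[i1, 2] - poses[i0, 2],
        "pusher_pos": to_cam(pusher_pos[i0]),
        "pusher_motion": R @ (pusher_pos[i1] - pusher_pos[i0]),
    }
```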

V. EXPERIMENTS AND RESULTS
In this section, we test our hypothesis that using an analytical model for prediction together with a neural network for perception improves data efficiency and leads to better generalization than using neural networks for both perception and prediction. We evaluate how the networks' performance depends on the amount of training data and how well they generalize to (i) pushes with new pushing angles and contact points, (ii) new push velocities and (iii) unseen objects.

A. Metrics
For evaluation, we compute the average Euclidean distance between the predicted and the ground truth object translation (trans) and position (pos) in millimetres as well as the average error on object rotation (rot) in degrees. As our datasets differ in the overall object movement, we report errors on translation and rotation normalized by the average motion in the corresponding dataset.
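In code, these metrics amount to the following (a minimal NumPy sketch; translation and position in millimetres, rotation in degrees):

```python
import numpy as np

def report(trans, trans_hat, rot, rot_hat, pos, pos_hat):
    """Mean errors plus translation/rotation errors normalised by the average motion."""
    trans_err = np.mean(np.linalg.norm(trans - trans_hat, axis=-1))
    rot_err = np.mean(np.abs(rot - rot_hat))
    pos_err = np.mean(np.linalg.norm(pos - pos_hat, axis=-1))
    return {
        "trans": trans_err, "rot": rot_err, "pos": pos_err,
        "trans_norm": trans_err / np.mean(np.linalg.norm(trans, axis=-1)),
        "rot_norm": rot_err / np.mean(np.abs(rot)),
    }
```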

B. Baselines
We use three baselines in our experiments. All of them use the ground truth input values of the analytical model (action, object position, contact point, surface normal, contact indicator and friction coefficients) instead of depth images, and thus only need to predict the object's velocity, but not its initial position. If the pusher makes contact with the object during the push, but is not in contact initially, we use the contact point and normal from when contact is first made and shorten the action accordingly. Note that this gives the baseline models a big advantage over architectures that have to infer the input values from raw sensory data.
The first baseline is just the average translation and rotation over the dataset. This is equal to the error when always predicting zero movement, and we therefore name it zero. The second (physics) is the full analytical model evaluated on the ground truth input values. In addition to the networks described in Section III, we also train a neural network (neural dyn) on the (ground truth) input values of the analytical model. It uses the same architecture of fully-connected layers for prediction as neural. This allows us to evaluate whether neural benefits from being able to choose its own state representation.

C. Data efficiency
The first hypothesis we test is that combining the analytical model with a neural network for perception reduces the required training data as compared to a pure neural network.
Data: We use a dataset that contains all objects from the MIT Push dataset and all pushes with a velocity of 20 mm/s and split it randomly into training and test set. This results in about 190k training examples and about 38k examples for testing. To evaluate how the networks' performance develops with the amount of training data, we train the models on different subsets of the training split with sizes from 2500 to the full 190k. We always evaluate on the full test split. To reduce the influence of dataset composition, especially on the small datasets, we average results over multiple different datasets of the same size.
Results: Figure 6 shows how the errors in predicted translation, rotation and object position develop with more training data and Table I contains numeric values for training on the biggest and smallest training split. As expected, the combined approach of neural network and analytical model (hybrid) already performs very well on the smallest dataset (2500 examples) and beats the other models including the neural dyn baseline, which uses ground truth state representation, by a large margin. It takes more than 20k training examples for the other models to reach the performance of hybrid, where predicting rotation seems to be harder to learn than translation.
Despite having to rely on raw depth images instead of the ground truth state representation, all three models perform at least close to the physics baseline when using the full training set. However, only the pure neural network is able to significantly improve on the baseline. This shows that using the analytical model prevents hybrid from fitting the training data perfectly, since the model itself is not perfect and does not allow for overfitting to noise in the training data. Neural has more freedom for fitting the training distribution, which however also increases the risk of overfitting. The variant simple, which only uses the second stage of the model, seems to combine the disadvantages of both approaches, as it needs much more training data than hybrid, but is quickly outperformed by the pure neural network.
The comparison of neural and the baseline neural dyn shows that despite having access to the ground truth data, neural dyn actually performs worse than neural on the full dataset. This seems to agree with the theory of [2] that training perception and prediction end-to-end and letting the network choose its own state representation, instead of forcing it to use a predefined state, may be beneficial for neural learning.

D. Generalization to new pushing angles and contact points
The previous experiment showed the different models' performance when testing on a dataset with a distribution very similar to the training set. Here, we evaluate the performance of the three networks on held-out push configurations that were not part of the training data. Note that while the test set contains combinations of object pose and push action that the networks have not encountered during training, the pushing actions and object poses themselves do not lie outside of the value range of the training data. This experiment thus tests the models' interpolation abilities.

Data:
We again train the networks on a dataset that contains all objects and pushes with a velocity of 20 mm/s. For constructing the test set, we collect all pushes with (i) pushing angles of ±20° and 0° to the surface normal (independent of the contact point) and (ii) a set of contact points illustrated in Figure 4 (independent of the pushing angle).
The remaining pushes are split randomly into training and validation set, which we use to monitor the training process. There are about 114k data points in the training split, 23k in the validation split and 91k in the test set.
Results: As Table II shows, hybrid performs best at predicting the object's velocity for pushes that were not part of the training set. Although still close, none of the networks can outperform the physics baseline on this test set. Note that the difficulty of the test set in this experiment differs from the one in the previous experiment, as can be seen from the different performance of the physics baseline: Due to the central contact points and small pushing angles, the test set contains a high proportion of pushes with sticking contact, for which the object's movement is similar to the pusher's movement. This makes it hard to compare the results between Table I and Table II in terms of absolute values.
With more than 100k training examples, we supply enough data for the pure neural model to clearly outperform the combined approach and the baseline in the previous experiment (i.e. when the test set is similar to the training set, see Figure 6). The fact that neural now performs worse than hybrid and physics indicates that its advantage over the physics baseline may not come from learning a more accurate dynamics model. Instead, it probably memorizes specific input-output combinations that the analytical model cannot predict well, for example due to noisy object pose data. In contrast to hybrid, simple again does not seem to profit from the simplified analytical model for generalization and performs similarly to neural.
If we supply less training data, the difference between hybrid and the other networks is again much more pronounced: Hybrid achieves 20.3 % translation and 43.8 % rotation error, whereas neural lies at 38.7 % and 63.4 %, respectively.

E. Generalization to Different Push Velocities
In this experiment, we test how well the networks generalize to unseen push velocities. In contrast to the previous experiment, the test actions have a different value range than the actions in the training data, so we are now looking at extrapolation. As neural networks are usually not good at extrapolating beyond their training domain, we expect the model-based network variants to generalize better to push velocities not seen during training.
Data: We use the networks that were trained in the first experiment (V-C) on the full (190k) training set. The push velocity in the training set is thus 20 mm/s. We evaluate on datasets with push velocities ranging from 10 mm/s to 150 mm/s. Results: Results are shown in Figure 7. Since the input action does not influence perception of the object's position, we only report the errors on the predicted object motion.
At higher velocities, we see a very large difference between the performance of our combined approach and the pure neural network. Neural's predictions quickly become very inaccurate, with the error on predicted rotation rising to more than 88 % of the error of always predicting zero movement. The performance of hybrid, on the other hand, remains almost constant over the different push velocities and declines only slightly more than that of the physics baseline.
The reason for the decrease in physics' performance at higher velocities is that the quasi-static assumption is violated: For pushes faster than 50 mm/s, the object gets accelerated and can continue sliding even after contact with the pusher is lost.
Simple, neural and neural dyn all get worse with increasing velocity, but simple degrades much less when predicting rotations. The reason for this is that all three architectures struggle mostly with predicting the correct magnitude of the object's translation and not so much with predicting the translation's direction. By using the second stage of the analytical model, simple has information about how the direction of the object's translation relates to its rotation, which results in much more accurate rotation predictions. The advantage of hybrid for extrapolation lies in the first stage of the analytical model, which allows it to scale its predictions according to the action's magnitude. This is in essence a multiplication operation. However, a general multiplication of inputs cannot be expressed using only fully-connected layers (as used by simple, neural and neural dyn), because fully-connected layers essentially perform weighted additions of their inputs. So instead of learning the underlying function, the networks are forced to resort to memorizing input-output relations for the magnitude of the object motion, which explains why extrapolation does not work well.
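A deliberately simplified toy example (ours, not from the paper) illustrates this point: the best purely additive fit of a multiplicative relationship is accurate inside the training range of push speeds but fails far outside it, while the multiplication itself extrapolates by construction. Networks with ReLU layers can approximate products locally, but face the same problem outside the training range.

```python
import numpy as np

rng = np.random.default_rng(0)
speed = rng.uniform(15.0, 25.0, size=(1000, 1))        # training range around 20 mm/s
direction = rng.uniform(-1.0, 1.0, size=(1000, 1))
target = speed * direction                             # true multiplicative relationship

# Best affine fit (weighted addition plus bias), analogous to what a stack of
# fully-connected layers computes between its non-linearities.
X = np.hstack([speed, direction, np.ones_like(speed)])
w, *_ = np.linalg.lstsq(X, target, rcond=None)

for s in (20.0, 100.0):                                # inside vs. far outside training range
    d = 0.5
    pred = (np.array([s, d, 1.0]) @ w)[0]
    print(f"speed {s:5.1f}: affine fit {pred:7.2f}   true {s * d:7.2f}")
```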

F. Generalization to Different Objects
The last experiment tests how well the networks generalize to unseen object shapes and how many different objects the networks have to see during training to generalize well.
Data: We train the networks on three different datasets: with one object (butter), with two objects (butter and hex), and with three objects (butter, hex and one of the ellipses or triangles). The datasets with fewer objects contain more augmented data, such that each contains about 35k training examples. As test sets, we use one dataset containing the three ellipses and one containing all triangles. While this is less training data than in the previous experiments, it should be sufficient for the pure neural network to perform as well as hybrid, since the test sets contain only few objects.
Results: The results in Figure 8 show that neural is consistently worse than the other networks when predicting rotations. It also improves most notably when one example of the test objects is in the training set. For space reasons, we omitted results for predicting translation, where differences are less pronounced. The different models do not differ much when predicting position, which is not surprising, since they share the same perception architecture.
In general, all models perform surprisingly well on ellipses, even if they only had access to data from the butter object. Reaching the baseline's performance on triangles is however only possible with a triangle in the training set. Predicting the object's position is most sensitive to the shapes seen during training: It generalizes well to ellipses (which have similar shape and size as the butter or hex object), but the error for localizing triangles is a factor of ten higher than for ellipses.

VI. DISCUSSION AND CONCLUSION
In this paper, we considered the problem of predicting the effect of physical interaction from raw sensory data. We compared a pure neural network approach to a hybrid approach that uses a neural network for perception and an analytical model for prediction. Our test bed involved pushing of planar objects on a surface, a non-linear, discontinuous manipulation task for which we have both millions of data points and an analytical model. We observed two main advantages of the hybrid architecture. Compared to a pure neural network, it significantly (i) reduces the required training data and (ii) improves generalization to novel physical interaction. This improved generalization is a result of the analytical model limiting the hybrid architecture's ability to (over-)fit the training data. However, it comes at the price of not being able to outperform the analytical model. A pure neural network based approach can beat both the hybrid approach and the analytical model (with ground truth input values) if trained on enough data. This however only holds when evaluating on actions encountered during training and does not transfer to new push configurations, velocities or object shapes. The challenge in these cases is that the distributions of the training and test data differ significantly.
To enable the hybrid approach to improve on the prediction accuracy of its analytical model, we already experimented with learning an additional additive error correction term. Preliminary results suggest that this kind of model needs slightly more data than hybrid, but can then improve on its results while retaining the ability to generalize to very different test data provided by the analytical model.
In the hybrid approach, the analytical model is especially helpful for extrapolation tasks, since it provides multiplication operations for scaling the output according to the input action. This kind of mathematical operation is hard to learn for fully-connected architectures and requires many parameters and training examples to cover a large value range. Avoiding learning such operations, e.g. by providing them to the network or by transferring the problem to the log domain, thus promises not only to improve the results but also to reduce the number of neurons necessary to achieve a certain accuracy.
In perception, on the other hand, the strengths of neural networks can be well exploited to extract the input parameters of the analytical model from raw sensory data. By training end-to-end through a given model, we can avoid the effort of labelling data with the ground truth state. Using the state representation of the analytical model also has the advantage that the network's predictions can be visualized and interpreted. On the other hand, our results suggest that a pure neural network for perception and prediction might benefit from being free to choose its own state representation.
It may however be hard to find an accurate analytical model for some physical processes, and not all existing models are suitable for our approach, as they need to be, for example, differentiable everywhere. Especially the switching dynamics encountered when the contact situation changes proved to be challenging, and more work needs to be done in this direction.
In the future, we want to extend our work to more complex perception problems, like different camera viewpoints, RGB images or multiple objects. By working on sequences instead of one-step predictions, one could enforce constraints like temporal consistency, infer latent variables like friction, or use temporal cues like optical flow as additional input.