A Dataset of Daily Interactive Manipulation

Robots that succeed in factories stumble over the simplest daily tasks that humans take for granted, because a changing environment makes those tasks exceedingly difficult. Aiming to teach robots to perform daily interactive manipulation in a changing environment using human demonstrations, we collected our own dataset of interactive manipulation. The dataset focuses on the position, orientation, force, and torque of objects manipulated in daily tasks. It includes 1,593 trials covering 32 types of daily motions and 1,596 trials of pouring alone, as well as helper code. We present our dataset to facilitate research on task-oriented interactive manipulation.


Introduction
Robots excel in manufacturing, which requires repetitive motion with little fluctuation between trials. In contrast, humans rarely complete any daily task by repeating exactly what was done last time, because the environment may have changed. We aim to teach robots daily manipulation tasks using human demonstrations so that they are able to fulfill them in a changing environment. To learn how humans finish a task by manipulating an object and interacting with the environment, we need 3-dimensional motion data of the objects involved in fine manipulation motions, as well as data that represent the interaction.
Most of the currently available motion data are in the form of vision, i.e., RGB videos and depth sequences (for example, Fathi et al. (2012), Rohrbach et al. (2012), Shimada et al. (2013), Das et al. (2013), Kuehne et al. (2014), Fathi et al. (2011), Rogez et al. (2014)), which are of little or no direct use for our purpose. Nevertheless, certain datasets exist that do include motion data. The Slice & Dice dataset Pham and Olivier (2009) includes the 3-axis acceleration of cooking utensils used while salads and sandwiches are prepared. The 50 Salads dataset Stein and McKenna (2013) includes the 3-axis acceleration of more cooking utensils than Slice & Dice, involved in salad preparation. The CMU-MMAC dataset de la Torre et al. (2009) includes motion capture and 6-degree-of-freedom (DoF) inertial measurement unit (IMU) data of human subjects while the subjects make dishes. The IMUs record acceleration in (x, y, z, yaw, pitch, roll). The Actions-of-Making-Cereal dataset Pieropan et al. (2014) includes 6-DoF pose trajectories of the objects involved in cereal making, estimated from RGB-D videos. The TUM Kitchen dataset Tenorth et al. (2009) includes motion capture data of human subjects while the subjects set tables. The OPPORTUNITY dataset Roggen et al. (2010) includes the 3-D acceleration and 2-D rotational velocity of objects. The Wrist-Worn-Accelerometer dataset Bruno et al. (2014) includes the 3-axis acceleration of the wrist while the subject performs daily activities. The Kinodynamic dataset Pham et al. (2016) includes the mass, inertia, linear and angular acceleration, angular velocity, and orientation of the objects, but the manipulation exists in its own right and does not serve to finish a task.
The aforementioned datasets are less than ideal in that 1) calculating the position trajectory from acceleration may be inaccurate due to accumulated error, 2) the motions of objects are not always emphasized or even available, and 3) not all the activities are fine manipulations that serve to finish tasks. Having identified those deficiencies, we collected a dataset ourselves that includes the 3-dimensional position, orientation, force, and torque of tools/objects being manipulated to fulfill certain tasks. The dataset is potentially suitable for learning either motion Huang and Sun (2015) or force Lin et al. (2012) from demonstration, for motion recognition Subramani et al. (2017); Aronson et al. (2016) and understanding Aksoy et al. (2011); Paulius et al. (2018); Flanagan et al. (2006); Soechting and Flanders (2008), and is potentially beneficial to grasp research Sun (2016, 2015a,b); Sun et al. (2016). The dataset consists of two parts. The first part contains 1,593 trials that cover 32 types of motions. We chose fine motions that people commonly perform in daily life and that involve interaction with a variety of objects. We reviewed existing motion-related datasets Huang and Sun (2016); Bianchi et al. (2016) to help us decide which motions to collect.
The second part contains the pouring motion alone. We collected it to help with motion generalization to different environments. We chose pouring because 1) pouring is found to be the second most frequently executed motion in cooking, right after pick-and-place Paulius et al. (2016), and 2) we can easily vary the environment setup of the pouring motion by switching among different materials, cups, and containers. The pouring data contain 1,596 trials of pouring 3 materials from 6 cups into 10 containers.
We collected both parts of the data using the same system. We describe the pouring data specifically in Sec. 10.
The dataset aims to provide position and orientation (PO) and force and torque (FT); nevertheless, it also provides RGB and depth vision with smaller coverage. Table 1 shows the number of trials and the count of each modality for each motion. The minimum number of trials for each motion is 25. Table 2 shows the coverage of each modality throughout the entire dataset, where the coverage has a range of (0, 1], and a coverage of 1 means the modality is available for every trial. The lower coverage of the vision modality is due to filming permission restrictions.

Hardware
On a desk surface, we use blue masking tape to enclose a rectangular area which we refer to as the working area, within which we perform all the motions. We aim a PrimeSense RGB+depth camera at the working area from above. We started collecting PO data using the OptiTrack motion capture (mocap) system and soon afterwards replaced OptiTrack with the Patriot mocap system. Both systems provide 3-dimensional PO data despite their difference in technology. Patriot includes a source and a sensor. The source provides the reference frame, with respect to which the PO of the sensor is calculated. We use an ATI Mini40 force and torque (FT) sensor together with the Patriot PO sensor. To attach both the FT sensor and the PO sensor to a tool, we use a cascading structure that can be represented as (tooltip + adapter + FT sensor + universal handle + PO sensor), where "+" means "connect". The end result is shown in Fig. 1. A tool in general consists of a tooltip and a handle. We disconnect the tooltip from the stock handle, insert the tooltip into a 3D-printed adapter, and glue them together. Then we connect the adapter to the tooling side of the FT sensor using screws. We 3D-print a universal handle and connect it to the mounting side of the FT sensor using screws. At the end of the universal handle we mount the PO sensor using screws. In some cases, we track the object in addition to the tool; to do that, we attach a second PO sensor to the object, as shown in Fig. 2. Each tooltip is provided with a separate adapter. Since the tooltip and the adapter are glued together, a tool is equivalent to "tooltip + adapter". Fig. 3 shows the tools that we have adapted.

Coordinate frames
To track a tool using OptiTrack, we need to define the ground plane and define the tool as a trackable. The ground plane is set by aligning a right-angle set tool to the bottom left corner of the working area. The trackable is defined from a set of selected markers and is assigned the same coordinate frame, with the origin being the centroid of the markers. This is shown in Fig. 5.
Patriot contains a source that supports up to two sensors. The source provides the reference frame for the sensors, as shown in Fig. 6. We define the base point of the tool to be the center of the tooling side of the FT sensor, as shown in Fig. 4. The translation from the PO sensor to the base point of the tool is [14.3, 0, 0.7] cm, expressed in the frame of the PO sensor.
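As an illustration, the Python sketch below transforms a Patriot reading into the position of the tool's base point using the offset above. The function names are ours, and the rotation order (yaw, then pitch, then roll, about z, y, x) is a common convention that should be verified against Fig. 11.

```python
import numpy as np

# Offset from the PO sensor to the base point of the tool,
# expressed in the PO sensor's own frame (centimeters).
SENSOR_TO_BASE_CM = np.array([14.3, 0.0, 0.7])

def ypr_to_rotation(yaw, pitch, roll):
    """Rotation matrix for yaw-pitch-roll angles in degrees, applied in
    Z-Y-X order (a common convention; verify against Fig. 11)."""
    y, p, r = np.radians([yaw, pitch, roll])
    Rz = np.array([[np.cos(y), -np.sin(y), 0],
                   [np.sin(y),  np.cos(y), 0],
                   [0, 0, 1]])
    Ry = np.array([[np.cos(p), 0, np.sin(p)],
                   [0, 1, 0],
                   [-np.sin(p), 0, np.cos(p)]])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(r), -np.sin(r)],
                   [0, np.sin(r),  np.cos(r)]])
    return Rz @ Ry @ Rx

def base_point(sensor_xyz_cm, yaw, pitch, roll):
    """Position of the tool's base point in the source frame (centimeters)."""
    R = ypr_to_rotation(yaw, pitch, roll)
    return np.asarray(sensor_xyz_cm) + R @ SENSOR_TO_BASE_CM
```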
The FT sensor and the PO sensor are connected through the universal handle. The groove on the universal handle is orthogonal to both the x-y plane of the FT sensor and the y-z plane of the PO sensor. The relationship between the local frames of the FT sensor and the PO sensor is shown in Fig. 7.
Figure 7. Top view of the FT sensor with its local frame, the universal handle, and the PO sensor with its local frame. ⊗ means into the paper plane, and ⊙ means out of the paper plane.

Calibrate FT
Definition 1. The level pose of the universal handle is a pose in which the groove of the handle faces up, and in which the y-z plane of the FT sensor, or equivalently the x-y plane of the PO sensor, is parallel to the desk surface.
Definition 2. An average sample is the average of 500 FT samples.
The FT sensor has non-zero readings when it is static with the tool installed on it. We calibrate the FT sensor, i.e., make the readings zero, before we collect any data. We hold the handle in a level pose (Definition 1) and take an average sample (Definition 2), which we set as the bias FT_b. We subtract the bias from each FT sample before saving the sample: FT_t ← FT_t − FT_b. We calibrate the FT sensor each time we switch to a new tool.
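A minimal sketch of this calibration, assuming a hypothetical read_ft() that returns one 6-axis (fx, fy, fz, tx, ty, tz) sample:

```python
import numpy as np

def read_ft():
    # Stand-in for the real sensor driver call; returns one 6-axis
    # FT sample (fx, fy, fz, tx, ty, tz).
    return np.random.normal(0.0, 0.01, size=6)

def average_sample(n=500):
    # Definition 2: an average sample is the average of 500 FT samples.
    return np.mean([read_ft() for _ in range(n)], axis=0)

# With the handle held in a level pose (Definition 1):
ft_b = average_sample()

# During collection, subtract the bias before saving each sample:
ft_t = read_ft() - ft_b
```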

Modality Synchronization
Different modalities run at different frequencies and therefore need synchronization, which we achieve by using time stamps. We use Microsoft QueryPerformanceCounter (QPC) to query time stamps with millisecond precision.
When we start the collection system, we query the time stamp and set it as the global start time t_0. Then we start each modality as an independent thread, so that the modalities run simultaneously and do not affect each other. For each sample, a modality queries the time stamp t through QPC and sets the elapsed time since t_0 as the time stamp for that sample:

t_s = t − t_0.    (1)
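A simplified Python sketch of this scheme; time.perf_counter() stands in for QPC (on Windows it is in fact backed by QueryPerformanceCounter), and the modalities shown are placeholders:

```python
import threading
import time

t0 = time.perf_counter()  # global start time t_0, queried once at startup

def save(name, t_s, sample):
    # Placeholder writer; the real system appends to a csv file.
    print(f"{name},{t_s:.3f},{sample}")

def record_modality(name, read_sample, n_samples):
    """Run one modality in its own thread, stamping each sample with
    the elapsed time since t0 as in Eq. (1)."""
    for _ in range(n_samples):
        sample = read_sample()
        t_s = time.perf_counter() - t0  # Eq. (1): t_s = t - t_0
        save(name, t_s, sample)

# Each modality runs as an independent thread.
modalities = [("ft", lambda: 0.0, 5), ("po", lambda: (0.0,) * 6, 5)]
threads = [threading.Thread(target=record_modality, args=m) for m in modalities]
for th in threads:
    th.start()
for th in threads:
    th.join()
```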

Data Format
Figure 9. The structure of a non-OptiTrack csv data file: the global starting time, followed by the elapsed time relative to the global starting time and the data.
Figure 10. Formats of the columns for PO for one and two sensors.
Figure 11. The relationship between the axes and yaw-pitch-roll for the Patriot sensor listed in Fig. 10.
Patriot expresses the orientation using yaw-pitch-roll (w-p-r), which is depicted in Fig. 11, and OptiTrack uses a unit quaternion (qx, qy, qz, qw). If we only use one trackable but have defined two in OptiTrack, we disable the inactive one by setting all 7 columns for that trackable to -1, i.e., the 7 columns for the inactive trackable would be (-1, -1, -1, -1, -1, -1, -1). Patriot samples at 60 Hz; its x-y-z has unit centimeter and its yaw-pitch-roll has unit degree. OptiTrack samples at 100 Hz, and its x-y-z has unit meter.
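As an illustration of the layout in Fig. 9, here is a minimal Python sketch for reading such a csv file; the exact layout should be verified against the provided helper code, which we recommend using instead:

```python
import csv

def read_trial(path):
    """Read a non-OptiTrack csv file: the first row holds the global
    starting time; each following row holds the elapsed time and the data.
    A sketch only; prefer the dataset's own Python helper code."""
    with open(path) as f:
        rows = list(csv.reader(f))
    t0 = float(rows[0][0])  # global starting time
    samples = [(float(r[0]), [float(v) for v in r[1:]]) for r in rows[1:]]
    return t0, samples      # list of (elapsed_time, data) pairs
```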

Using the data
We provide MATLAB code that visualizes the PO data for OptiTrack as well as Patriot, as shown in Fig. 12. The visualizer displays the trail of the base point of the tool (Fig. 4), and of the object if applicable, as the motion is played as a 3D animation. The user can also manually slide through the motion forward or backward and jump to a particular frame.
The FT and PO csv files have multiple formats, and we provide Python code that extracts the FT and PO data from each trial given the path of the root folder. Although we have explained the format of the csv files of the FT and PO data in Sec. 7, we highly recommend using our code to get the FT and PO data in order to avoid errors.
Each modality is sampled at its own frequency, so using multiple modalities together requires aligning them using the time stamps; one or more modalities need upsampling or downsampling.
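For example, a minimal sketch that resamples one 1-D signal onto the time stamps of another modality by linear interpolation:

```python
import numpy as np

def resample(src_t, src_x, dst_t):
    """Linearly interpolate a 1-D signal src_x, sampled at times src_t,
    onto the time stamps dst_t of another modality."""
    return np.interp(dst_t, src_t, src_x)

# e.g., upsample 60 Hz Patriot x-positions onto the FT time stamps:
# x_on_ft_clock = resample(po_time, po_x, ft_time)
```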

Known issue
The PO data recorded using OptiTrack contain occasional flickering and stagnant frames, caused by OptiTrack's dependence on the line of sight. This issue is not present in the data collected with Patriot.
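One simple heuristic (ours, not part of the helper code) for flagging stagnant OptiTrack frames before further processing:

```python
import numpy as np

def stagnant_mask(positions, eps=1e-9):
    """Mark frames whose position is numerically identical to the
    previous frame; a simple heuristic for OptiTrack dropouts."""
    p = np.asarray(positions)                       # shape (N, 3)
    d = np.linalg.norm(np.diff(p, axis=0), axis=1)  # frame-to-frame motion
    return np.concatenate([[False], d < eps])
```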

The pouring data
We want to learn to perform a type of motion from its PO and FT data and to generalize it, i.e., execute it in a different environment. Thus, we need data that show how the motion varies across multiple different environments. Since pouring is the second most frequently executed motion in cooking Paulius et al. (2016), it is worth learning. Also, collecting pouring data that contain different environment setups is easy thanks to the convenience of switching materials, cups, and containers. Therefore, we collected the pouring data.
The pouring data include FT, Patriot PO, and RGB videos (no depth). We collected the data using the same system as described above. In the following, we explain what has not yet been covered and what differs from the description above.
The physical entities involved in a pouring motion include the material to be poured, the container from which the material is poured, which we refer to as the cup, and the container into which the material is poured, which we refer to as the container. The pouring data contain 1,596 trials of pouring water, ice, and beans from six different cups into ten different containers. Cups are considered tools and are installed on the FT sensor through 3D-printed adapters.
A second PO sensor is taped on the outer surface of the container just below the mouth.
We collect the FT data differently from above. When the cup is empty, we hold the handle in a level pose (Definition 1) and take an average sample (Definition 2), which we call "FT empty". Then we fill the cup with the material to an amount we desire, hold the handle in a level pose, and take an average sample, which we call "FT init". Then we pour, during which we take however many samples are needed (not average samples), which we call "FT". After we finish pouring, we hold the handle in a level pose and take an average sample, which we call "FT final". In summary, we save four kinds of FT data files: three contain one average sample each (FT empty, FT init, FT final), and one contains regular samples (FT). We do not consider bias.
Figure 13. The organization of the pouring data, where the red text is verbatim.
The organization of the data is shown in Fig. 13. The pouring data can be used to learn how to pour in response to the sensed force of the cup. The force is a nonlinear function of the physical properties of the cup and the material, the speed of pouring, the current pouring angle, the amount of material remaining in the cup, and other possibly related physical quantities. Huang and Sun (2017) show an example of modeling such a function using a recurrent neural network and generalizing the pouring skill to unseen cups and containers.
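As a hedged illustration, the sketch below loads the three average-sample files and estimates the weight of the material. The file names are hypothetical, and the claim that gravity acts along the FT sensor's x-axis in the level pose follows from Definition 1 but should be verified against Fig. 7.

```python
import numpy as np

def load_avg(path):
    # Load a one-row average-sample csv: (fx, fy, fz, tx, ty, tz).
    return np.loadtxt(path, delimiter=",")

# Hypothetical file names; see Fig. 13 for the actual organization.
ft_empty = load_avg("ft_empty.csv")
ft_init = load_avg("ft_init.csv")
ft_final = load_avg("ft_final.csv")

# In the level pose the y-z plane of the FT sensor is parallel to the
# desk (Definition 1), so the material's weight appears on the x-axis.
material_weight = abs(ft_init[0] - ft_empty[0])  # filled material, in N
poured_weight = abs(ft_init[0] - ft_final[0])    # poured material, in N
```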

Conclusion & Future work
We have presented a dataset of daily interactive manipulation. The dataset includes 32 types of motions, and provides position and orientation, and force and torque for every motion trial. In addition, to support motion generalization to different environments, we chose the pouring motion and collected corresponding data. We plan to extend the collection to other types of motions in the future.