Cloth manipulation based on category classification and landmark detection

Cloth manipulation remains a challenging problem for the robotics community. Recently, there has been increased interest in applying deep learning techniques to problems in the fashion industry. As a result, large annotated data sets for cloth category classification and landmark detection were created. In this work, we leverage these advances in deep learning to perform cloth manipulation. We propose a full cloth manipulation framework that performs category classification and landmark detection based on an image of a garment, followed by a manipulation strategy. The process is performed iteratively to achieve a stretching task where the goal is to bring a crumpled cloth into a stretched-out position. We extensively evaluate our learning pipeline and present a detailed evaluation of our framework on different types of garments in a total of 140 recorded and publicly available experiments. Finally, we demonstrate the benefits of training a network on augmented fashion data over using a small robotics-specific data set.


Introduction
Grasping and manipulation of rigid objects have been studied extensively. [1][2][3][4] In contrast, deformable object manipulation has received relatively little attention due to the challenges posed by the complexity of modeling, tracking and control. 5 Manipulating clothing items is particularly difficult, as classical control approaches that require modeling the objects' dynamics are only applicable in restrictive settings. 6 Learning-based and data-driven approaches that do not rely on specific models are a viable alternative for tasks that involve highly deformable objects. 7 Clothing items are one example of highly deformable objects and have been used in applications such as grasp point detection, 8,9 folding, [10][11][12] sorting, 13 unfolding, 14 dressing, 12 and classification. [15][16][17] There is also an increased interest in using deep learning techniques for online shopping and e-commerce in the fashion industry, addressing problems such as clothing category classification, fashion landmark detection, image retrieval and similarity-based recommendations. Following the creation of large-scale fashion data sets, [18][19][20] significant progress has been made in fashion image analysis. Deep learning-based models have achieved significant performance gains in clothing category classification, 19,[21][22][23] item recommendation 24,25 and retrieval. 19,26

1 KTH Royal Institute of Technology, Stockholm, Sweden. 2 ETH (Eidgenössische Technische Hochschule) Zürich, Zürich, Switzerland. * Oscar Gustavsson and Thomas Ziegler contributed equally to this article.

We present an extension of our earlier work on clothing category classification and fashion landmark detection. 27 While the fashion industry often considers structured data, such as a human wearing clothes facing the camera, the data in robotic applications is less structured and can contain images of upside-down, crumpled clothing items. We built upon the progress made in fashion image analysis and proposed a network architecture and training procedure on the large-scale fashion data set DeepFashion. 19 Our model was capable of generalizing well to the noisy, poorly controlled conditions encountered in robotic clothing manipulation. We introduced elastic warping, a novel image augmentation method that uses random displacement fields to create authentic-looking clothing configurations resembling the more challenging ones encountered in robotic manipulation. Furthermore, we incorporated rotation invariance and attention mechanisms in order to handle difficult configurations faced in robotic manipulation.
In the work presented here, we extend our earlier work 27 and present a full robotic manipulation framework that classifies different clothing items in a robotic manipulation context and uses the detected landmarks to manipulate the garments. The contributions are: (i) A robotic cloth manipulation framework based on category classification and landmark detection. (ii) An extensive experimental evaluation on a real robot (140 recorded experiments (https://cloth-manipulation-landmarks.github.io/cml-web/)). (iii) An extended analysis of the effect of the elastic warping method parameters. (iv) A comprehensive description of all parts of the framework, including a more detailed description of the underlying network architecture.

Related work
The release of large-scale fashion data sets has sparked increased interest in the computer vision community in the analysis of fashion images, addressing clothing recognition, 19,21-23 recommendation, 25 retrieval 26 and fashion landmark localization. 19,23,28,29 Liu et al. 19 propose a multi-branch network for simultaneous classification, retrieval and landmark localization, and in Liu et al., 20 they demonstrate refinement of landmark localization. The works of Wang et al. 29 and Liu and Lu 30 are examples of deep fashion grammar networks for combined clothing category classification and landmark localization.
Image data used by the robotics community differs significantly from that commonly used in retail applications. The items are either spread out or crumpled on a flat surface, 13,15 or they are in a hanging state when grasped by a robotic gripper. 9,[31][32][33] The robotics community has mostly focused on task-specific, handcrafted feature extraction, such as edges and corners 34 and wrinkles. [35][36][37] Due to the 3D nature of the manipulation task, the use of physics and volumetric simulators is more common in robotics. 16,38 Recent methods 9,32,33 use convolutional neural networks (CNNs) instead of handcrafted features for classification.
Our previous work focused on category classification and landmark localization, which is extended here into a complete robotic cloth manipulation pipeline. The network has an architecture similar to the ones proposed by Wang et al. 29 and Liu and Lu, 30 but has been extended to handle the more challenging clothing configurations present in robotic applications. Our method does not require generating a specific labeled data set with predefined grasp points. Instead, it leverages the existing labeled landmarks present in recent fashion data sets and generalizes to images taken in a robotic lab.

Method
We first formulate the problem of category classification and landmark prediction to be used in a cloth manipulation pipeline. We introduce two image augmentation methods that perturb clothing configurations in such a way that they become more representative of the configurations encountered during a robotic cloth manipulation task. We then give a detailed description of the proposed network and describe how the gained knowledge is used in the downstream cloth manipulation task.

Problem formulation
Our goal is to simultaneously predict the landmark locations $L$ and the category classification $C$ for a given image $I$. The landmarks are defined as $L = \{(x_k, y_k)\}_{k=1}^{n_L}$, where $(x_k, y_k)$ is the $k$th pixel coordinate position in $I$ and $n_L$ the total number of landmarks per image. The category classification $C \in [0, 1]^{n_C}$ satisfies $\sum_{i=1}^{n_C} C_i = 1$, where $n_C$ is the number of categories, depending on the data set used. Using $L$ and $C$, the goal is to further manipulate the garment in a way that minimizes the landmark position error $L_{\mathrm{err}} = \hat{L} - L_{\mathrm{template}}$, where $L_{\mathrm{template}}$ is the desired landmark position.

Image augmentation
The two proposed image augmentations are image rotation and elastic warping. To augment an image together with its landmarks, we define the image before the transformation as the input image $I$ and the image after the transformation as the transformed image $\tilde{I}$. In both cases, $w$ and $h$ stand for the width and height of the image, respectively.
The transformation can be represented as a mapping of the pixels,

$$\forall (\tilde{x}, \tilde{y}) \in [1, w] \times [1, h]: \quad \tilde{I}(\tilde{x}, \tilde{y}) = I(x(\tilde{x}, \tilde{y}), y(\tilde{x}, \tilde{y})), \quad (1)$$

where $x, y$ are the pixel locations in the input image $I$ and $\tilde{x}, \tilde{y}$ the pixel locations in the transformed image $\tilde{I}$. The clothing landmark locations $L = \{(x_k, y_k)\}_{k=1}^{n_L}$ are a set of $n_L$ specific pixel coordinates in the input image $I$.
When $x(\tilde{x}, \tilde{y})$ and/or $y(\tilde{x}, \tilde{y})$ are non-integer, interpolation is needed. We apply the commonly used bilinear interpolation 39 in such a case.
Rotation. Rotating images is often used to increase the performance in classification and/or detection tasks. 40 When clothing items lie on a flat surface, they can be in any orientation. We hence randomly sample an angle $\theta$ in the range $[0, 2\pi]$ for each rotation.
Elastic warping. Our proposed elastic warping method is similar to the elastic deformation proposed in Simard et al. 39 but is further extended to produce realistic, task-specific images and to allow for landmark detection.
The deformation is created by generating two random displacement fields $\Delta x(\tilde{x}, \tilde{y})$ and $\Delta y(\tilde{x}, \tilde{y})$. The whole augmentation is performed in four steps. First: sample $n_S$ pixel positions uniformly in the transformed image, $S = \{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^{n_S}$. Second: for each pixel location $(\tilde{x}_i, \tilde{y}_i) \in S$, sample a random displacement from a uniform distribution $U(-a, a)$: $\Delta x(\tilde{x}_i, \tilde{y}_i) \sim U(-a, a)$, $\Delta y(\tilde{x}_i, \tilde{y}_i) \sim U(-a, a)$. All other entries in the displacement fields are set to 0. Third: smooth both displacement fields with a Gaussian filter of standard deviation $\sigma$. Fourth: apply the mapping of equation (1) with $x(\tilde{x}, \tilde{y}) = \tilde{x} + \Delta x(\tilde{x}, \tilde{y})$ and $y(\tilde{x}, \tilde{y}) = \tilde{y} + \Delta y(\tilde{x}, \tilde{y})$.
The strength of the distortion can be adjusted by the number of initially displaced pixels $n_S$, the scaling $a$ of the uniform distribution and the smoothness $\sigma$ of the Gaussian filter. We use $n_S = 3$, $a = 500$ and $\sigma = 40$ in our experiments. Figure 1 shows some examples when using this configuration. While the presented elastic warping method cannot, for example, emulate folded configurations, the possibility to adjust the distortion with three hyperparameters gives a wide range of data-augmentation possibilities. Note that a too high $n_S$ can lead to undesirable image artifacts.
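The four steps above can be sketched in a few lines of NumPy/SciPy. This is a hypothetical re-implementation, not the authors' code; the function name and the use of `gaussian_filter` and `map_coordinates` (bilinear sampling via `order=1`) are our choices:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_warp(image, n_s=3, a=500.0, sigma=40.0, rng=None):
    """Sketch of elastic warping: n_s seed pixels receive uniform
    displacements in [-a, a], the sparse fields are smoothed with a
    Gaussian filter (std sigma), and the image is resampled bilinearly."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    dx = np.zeros((h, w))
    dy = np.zeros((h, w))
    # Step 1 & 2: sparse random displacements at n_s random pixels.
    ys = rng.integers(0, h, n_s)
    xs = rng.integers(0, w, n_s)
    dx[ys, xs] = rng.uniform(-a, a, n_s)
    dy[ys, xs] = rng.uniform(-a, a, n_s)
    # Step 3: Gaussian smoothing of both displacement fields.
    dx = gaussian_filter(dx, sigma)
    dy = gaussian_filter(dy, sigma)
    # Step 4: each output pixel (x~, y~) samples input pixel (x~+dx, y~+dy).
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    warped = map_coordinates(image, [yy + dy, xx + dx], order=1, mode="nearest")
    return warped, dx, dy
```

Returning the displacement fields alongside the warped image makes the subsequent landmark warping step possible.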
Landmark warping. The displacement fields indicate where a pixel in the transformed image was located in the input image. Due to the random nature of these fields, no inverse exists. Therefore, it is not trivial to know if and where the pixels of the input image are found in the transformed image. As our goal is to preserve the correct position of the landmarks defined in the input image, we describe an efficient method for retrieving the landmark positions in the transformed image.
For every landmark position $L_k = (x_k, y_k)$, we find $n$ possible pixels in the transformed image $\tilde{I}$ which originated at or near the position of the landmark in the input image $I$:

$$X = \operatorname{argmin}_{-n,\; \forall (\tilde{x}, \tilde{y}) \in [1, w] \times [1, h]} \operatorname{sort} |\tilde{x} + \Delta x(\tilde{x}, \tilde{y}) - x_k|,$$
$$Y = \operatorname{argmin}_{-n,\; \forall (\tilde{x}, \tilde{y}) \in [1, w] \times [1, h]} \operatorname{sort} |\tilde{y} + \Delta y(\tilde{x}, \tilde{y}) - y_k|,$$

where $\operatorname{argmin}_{-n}$ returns the $n$ smallest values from a sorted set. Note that both $X$ and $Y$ contain coordinate pairs $(\tilde{x}, \tilde{y})$. The value of $n$ depends on the image size and the chosen parameters $n_S$, $a$ and $\sigma$ in the elastic warping. We use $n = 200$ in our experiments. To get the transformed landmark $\tilde{L}_k$, we need to find the coordinate pair $(\tilde{x}^*, \tilde{y}^*)$ that is either present in both $X$ and $Y$ or the coordinate pair in $X$ with the closest neighbor in $Y$.
We use the fact that the pixel coordinates are unique integer values and create a hash table for all coordinate pairs in one set. One can then check, for each pair in the other set, whether its key exists in the hash table, which reduces the time complexity for finding exactly matching coordinate pairs to $O(n)$.
If the hash table does not return a valid value, no exact match exists between $X$ and $Y$. In this case, one can create a kd-tree ($O(n \log n)$) for all coordinate pairs in $Y$ and use kd-tree search 41 to find the nearest neighbor for the coordinate pairs in $X$.
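The candidate selection plus hash-set/kd-tree matching can be illustrated as follows. This is a sketch under our own assumptions (flat pixel indices as hash keys, `scipy.spatial.cKDTree` for the nearest-neighbor fallback), not the paper's implementation:

```python
import numpy as np
from scipy.spatial import cKDTree

def warp_landmark(dx, dy, x_k, y_k, n=200):
    """Locate landmark (x_k, y_k) in the warped image given the
    displacement fields dx, dy (each of shape (h, w)).

    X holds the n pixels whose source x-coordinate x~+dx is closest to
    x_k, Y the analogue for y. An exact match in both sets is found via
    a hash set; otherwise a kd-tree gives the closest pair."""
    h, w = dx.shape
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    ex = np.abs(xx + dx - x_k).ravel()
    ey = np.abs(yy + dy - y_k).ravel()
    X = np.argpartition(ex, n)[:n]   # flat indices of n best x-candidates
    Y = np.argpartition(ey, n)[:n]
    y_set = set(Y.tolist())          # O(1) membership per lookup
    exact = [i for i in X if i in y_set]
    if exact:
        i = exact[0]
        return int(i % w), int(i // w)
    # No exact match: nearest neighbour between the two candidate sets.
    pts_X = np.stack([X % w, X // w], axis=1)
    pts_Y = np.stack([Y % w, Y // w], axis=1)
    tree = cKDTree(pts_Y)
    _, j = tree.query(pts_X)
    best = np.argmin(np.linalg.norm(pts_X - pts_Y[j], axis=1))
    return int(pts_X[best, 0]), int(pts_X[best, 1])
```

With zero displacement fields, a landmark maps (up to candidate ties) onto itself, which makes the method easy to sanity-check.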

Network architecture
The main network architecture is loosely based on the VGG-16 42 network structure, similar to the networks proposed in Wang et al. 29 and Liu and Lu. 30 The structure can be seen in Figure 2(a). Compared to the base VGG-16 network, several structural changes are included: rotation invariance layers, a landmark localization branch and attention branches for classification.
Rotation invariance. As mentioned before, variation in orientation occurs more often in a robotic cloth manipulation task. In order to account for this, we replace the 2D convolutions in the conv1 to conv4 layers with Averaged Oriented Response Convolutions (A-ORConvs). They produce enriched feature maps with the orientation information explicitly encoded. 43 A-ORConvs are an improvement of the Oriented Response Convolutions (ORConvs) initially proposed in Zhou et al. 44 These convolution blocks use Averaged Active Rotating Filters (A-ARFs) and Active Rotating Filters (ARFs), respectively. Both are 5D tensors of size $n_O \times n_I \times w_f \times h_f \times N$, where $n_O$ is the number of output channels, $n_I$ the number of input channels, $w_f$ and $h_f$ are the width and height of the filter and $N$ is the number of filter orientations. This means that in ARFs, for each materialized filter, $N - 1$ immaterialized rotated copies of the same filter are present. Therefore, during forward propagation one ARF produces a feature map of $N$ channels with orientation information encoded. Depending on the orientation of the input image, a different copy of the filter has the highest response. A-ORConvs improve over ORConvs by reducing the risk of gradient explosion during training: the update uses the mean value of the gradients from all rotated copies instead of their sum.
In our network (Figure 2(b)), we use the A-ORConvs with four orientation channels (i.e. $N = 4$). We use the same filter size and the same total number of channels when replacing the standard 2D convolutions in the conv1 to conv4 layers. This means that the effective number of parameters of the A-ORConvs is only a quarter of that of the normal convolution blocks. In order to create rotation-invariant features, a Squeeze-ORAlign (S-ORAlign) layer 43 is used to find the main response channel. The S-ORAlign is inspired by the Squeeze-and-Excitation (SE) block: 45 first, a squeeze operation is performed by global average pooling; then the main orientation channel is found via a maximum function; finally, all channels are cyclically rotated such that the main response channel is in the first position.
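The squeeze-find-rotate logic of S-ORAlign can be sketched in NumPy. The tensor layout (orientation axis first) and function name are our assumptions for illustration, not the layer's actual implementation:

```python
import numpy as np

def s_or_align(feat):
    """Sketch of a Squeeze-ORAlign step on features of shape (N, C, H, W),
    where N is the number of filter orientations (assumed layout).

    Squeeze via global average pooling, pick the orientation with the
    strongest mean response, then cyclically rotate the orientation axis
    so that channel comes first, yielding rotation-invariant features."""
    squeezed = feat.mean(axis=(1, 2, 3))   # one scalar per orientation
    main = int(np.argmax(squeezed))        # dominant orientation channel
    return np.roll(feat, -main, axis=0)    # cyclic shift: main channel first
```

Because the shift is cyclic, a rotated input (which excites a different orientation channel) still ends up with its dominant response in the first position.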
Landmark localization branch. The landmark localization branch is the same as proposed in Liu and Lu. 30 The branch structure is depicted in Figure 2(c). It uses transposed convolutions 46 to produce heatmaps for all landmarks. The transposed convolutions allow for an upsampling of the S-ORAlign features $F$ of dimension $w_f \times h_f \times n_O$, where $w_f$ and $h_f$ are the width and height of the feature map and $n_O$ is the number of output channels, back to the original input image size. Given the features $F$, a $1 \times 1$ convolution is applied to reduce the number of channels in the feature map, yielding $F^{(1)}_L$. Then three blocks of two $3 \times 3$ convolutions followed by a $4 \times 4$ transposed convolution are utilized. The padding and stride of the transposed convolution are 1 and 2, respectively. Hence, such a block upsamples the feature maps by a factor of two while the number of channels is reduced by a factor of two. Finally, a $1 \times 1$ convolution with a sigmoid activation is used to convert the $F^{(4)}_L$ feature map into the predicted heatmaps $\hat{M}$ of dimension $w_f \times h_f \times n_L$, where $n_L$ is the number of landmarks (which corresponds to the maximum number of landmarks in any category considered).
The landmark localization branch can be trained separately from the classification. Let $M_k \in [0,1]^{w_f \times h_f}$ and $\hat{M}_k \in [0,1]^{w_f \times h_f}$ denote the ground-truth heatmap and the predicted heatmap for the $k$th landmark, respectively. The landmark localization branch is trained using the pixel-wise mean squared difference

$$\mathcal{L}_{\mathrm{lm}} = \frac{1}{n_B} \sum_{i=1}^{n_B} \sum_{k=1}^{n_L} \| \hat{M}^i_k - M^i_k \|^2_2,$$

where $n_B$ is the total number of training samples. The ground-truth heatmap $M^i_k$ is generated by adding a 2D Gaussian filter at the corresponding location $L^i_k$. Given a sample $i$, the predicted coordinates $\hat{L}^i_k$ for the $k$th landmark correspond to the maximal value in the predicted heatmap, $\hat{L}^i_k = \operatorname{argmax}_{(x, y)} \hat{M}^i_k(x, y)$. If there is more than one maximum per landmark, one of them is chosen at random.
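The heatmap construction and the argmax decoding can be illustrated as follows. This is a sketch under our own assumptions: the Gaussian width `sigma=2.0` and peak normalization are illustrative choices, not values from the paper:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def landmark_heatmap(h, w, x_k, y_k, sigma=2.0):
    """Ground-truth heatmap: a 2D Gaussian centred on landmark (x_k, y_k).
    Normalised so the peak value is 1 (an assumed convention)."""
    m = np.zeros((h, w))
    m[y_k, x_k] = 1.0            # delta at the landmark location
    m = gaussian_filter(m, sigma)
    return m / m.max()

def heatmap_to_landmark(m, rng=None):
    """Predicted landmark = location of the heatmap maximum; ties are
    broken at random, as in the paper."""
    rng = np.random.default_rng() if rng is None else rng
    ys, xs = np.where(m == m.max())
    i = rng.integers(len(ys))
    return int(xs[i]), int(ys[i])
```

Decoding the ground-truth heatmap recovers the original landmark exactly, which is a useful unit test for any heatmap-based branch.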
Attention branch. The attention branch can be seen as a union of spatial attention 47 and channel attention. 45 The attention learns a saliency weight map $A$ of the same size as the S-ORAlign features $F$. Inspired by the attention modules proposed in Wang et al., 29 the spatial attention itself contains two types of attention, a landmark attention $A^L_{\mathrm{spatial}}$ and a category attention $A^C_{\mathrm{spatial}}$. Thus, the attention branch is designed as a three-branch unit; two branches for the spatial attention $A^L_{\mathrm{spatial}}, A^C_{\mathrm{spatial}}$ (Figure 2(d)) and one for the channel attention $A_{\mathrm{channel}}$ (Figure 2(f)). These are combined in a factorized manner as

$$A = (A^L_{\mathrm{spatial}} + A^C_{\mathrm{spatial}}) \times A_{\mathrm{channel}}.$$

Spatial attention - Landmark. Clothing landmarks represent functional regions of clothing and provide useful information about an item. The predicted heatmaps $\{\hat{M}_k\}_{k=1}^{n_L}$ are used to guide attention to the functional clothing regions. The weight map is created by downsampling the predicted heatmaps by a factor $d$, followed by a max-pooling operation. This attention is learned in a supervised manner since it is directly derived from the predicted heatmaps.
Spatial attention - Category. Since the landmark attention only covers corner points of a clothing item, an additional spatial attention is used that focuses more on the clothing center. The category attention (Figure 2(e)) is modeled using a U-Net structure. 48 Given the S-ORAlign features $F$, a $1 \times 1$ convolution is applied to convert the features into $F^{(1)}_A$, followed by the U-Net contracting path; the number of feature channels doubles at every contracting step. Then a $1 \times 1$ convolution and a $4 \times 4$ transposed convolution are applied, generating the features $F^{(4)}_A$. This is followed by the U-Net expanding path, which consists of two $4 \times 4$ transposed convolutions. The input of each transposed convolution is a concatenation of the output from the previous transposed convolution and the corresponding feature map from the contracting path. The number of feature channels halves at every expanding step. At the end, a $1 \times 1$ convolution is used to convert the channels to the same number as in the S-ORAlign features. The downsampling to a low resolution of $7 \times 7$ gives the spatial attention a large receptive field in the feature map $F$. Upsampling is then used to obtain a weight map of the same size as $F$. The model learns the important regions of an image by itself. In contrast, our landmark attention receives the ground-truth heatmaps $M$, which resemble the landmark attention, during training.
Channel attention. The channel attention (Figure 2(f)) is implemented via a Squeeze-and-Excitation block. 45 A squeeze operation creates $S$, an embedding of the global distribution of the channel-wise feature responses in $F$. This channel descriptor is created using average pooling,

$$S(c) = \frac{1}{w_f h_f} \sum_{x=1}^{w_f} \sum_{y=1}^{h_f} F(x, y, c),$$

where $F(\cdot, \cdot, c)$ is the feature map of the $c$th channel and $n_O$ the number of output channels. Then an excitation operation is performed on the channel-wise aggregated feature map to create the channel attention. Following the proposal in Hu et al., 45 a bottleneck is created using two fully connected layers with a reduction rate $r$.
To refine the attention, an additional $1 \times 1$ convolution layer is added afterwards. This is motivated by the fact that the spatial and channel attention are not mutually exclusive but have a co-occurring, complementary relationship. 49 Afterwards, a tanh function is used to shrink the attention values into the range $[-1, 1]$.

Output architecture. Given $A$, we weight the S-ORAlign features $F$ as $U = (1 + A) \odot F$, where $\odot$ denotes the Hadamard product and $1$ is a tensor of ones. Hence, features where $A(\cdot, \cdot, \cdot) \in [-1, 0)$ are reduced and features where $A(\cdot, \cdot, \cdot) \in (0, 1]$ are increased. Our attention incorporates semantic and global information into the network, helping it to focus on important regions in the images. The features $U$ are then fed into the conv5-1 layer. The rest of the network follows the VGG-16 structure.
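The factorized combination and the feature weighting can be sketched numerically. This is an illustrative simplification under our own assumptions: we place the tanh directly on the combined attention and skip the learned refinement convolution; the $(H, W, C)$ layout is also assumed:

```python
import numpy as np

def apply_attention(F, A_l, A_c, A_ch):
    """Combine the three attentions and weight the features (sketch).

    A = tanh((A_l + A_c) * A_ch) is squashed into (-1, 1); the output
    U = (1 + A) ⊙ F then suppresses features where A < 0 and amplifies
    features where A > 0. F, A_l, A_c have shape (H, W, C); A_ch is
    broadcast per channel with shape (1, 1, C)."""
    A = np.tanh((A_l + A_c) * A_ch)
    return (1.0 + A) * F   # Hadamard product with the all-ones offset
```

With zero spatial attention the features pass through unchanged, since $\tanh(0) = 0$ gives a weight of exactly one.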

Manipulation framework
Our manipulation framework, shown in Figure 3, consists of the deep neural network described in the previous section. It takes an image of the current scene containing a garment, and outputs the estimated landmarks $\hat{L}$ as well as the predicted class $\hat{C}$. This is taken in by the manipulation algorithm described in the following section. The manipulation strategy is then executed and the new state of the garment is passed to the network again. This process is repeated until the desired template configuration is achieved or the process terminates.

Figure 3. Overview of our cloth-manipulation framework. The deep neural network takes the current state of the garment, identifies the class and estimates the landmark positions. The manipulation strategy is then decided given the template. After execution, the new state is fed into the network and the process continues until the task is successfully performed.

Manipulation strategy
The implemented algorithm consists of two parts: an analysis step and a manipulation step. The analysis step detects the landmarks and the clothing category. Based on the certainty of the landmarks, a mode of operation for the manipulation step is selected, and based on the category, a template is selected. The manipulation step has two modes of operation: landmark placement, where a landmark is picked and placed at its position in the template, or stochastic stretching, where a random point on the edge of the clothing is picked and placed a distance outward. The intuition behind the two modes is that if the method is not confident that it has identified the right category and landmarks, the clothing item needs to be further spread out to make an identification of the category/landmarks feasible. After each execution of the manipulation step, the analysis step is repeated, followed by another manipulation step, until the clothing has reached the final state described by the template.
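The alternation between the two steps can be summarized as a control loop. The three callables and the `max_steps` cap are hypothetical stand-ins for the framework's components, added here only to make the flow concrete:

```python
def manipulation_loop(analyse, place_landmark, stochastic_stretch,
                      max_steps=20):
    """High-level control flow of the strategy (sketch).

    analyse() returns (done, landmark), where landmark is None when no
    landmark is certain enough. The loop alternates analysis and
    manipulation until the template is matched or max_steps is reached."""
    for _ in range(max_steps):
        done, landmark = analyse()
        if done:                   # all landmarks match the template
            return True
        if landmark is not None:   # confident: pick-and-place the landmark
            place_landmark(landmark)
        else:                      # uncertain: spread the garment out
            stochastic_stretch()
    return False
```

The step cap is a practical safeguard for a real robot; the paper's algorithm terminates when all landmarks match the template within tolerance.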
A template $T$ consists of the indices of the relevant landmarks, their positions in the world coordinate frame, and their corresponding weights. We use the following notation: $i \in \mathrm{ind}_T$ indicates that $i$ is a relevant landmark index in $T$, $p^T_i$ the position of the landmark with index $i$, and $w^T_i$ the weight of landmark $i \in \mathrm{ind}_T$. A landmark with index $i$ matches template $T$ with tolerance $\epsilon$, written $i \in_\epsilon T$, if $i \in \mathrm{ind}_T$ and $\| p^T_i - p_i \| \le \epsilon$, where $p_i$ (obtained from the landmark localization branch) is the current location of the landmark with index $i$.

Analysis step
An image is taken and transformed to match an image taken from a virtual camera located right above the clothing, removing perspective distortion and rotation and improving the accuracy of the proposed network. The virtual camera is placed such that the bottom of the clothing in the template is parallel to the bottom of the image. When initializing the algorithm, the homography $H$ between the two camera frames is computed, and subsequently used in the analysis step to transform every point $(x_i, y_i)$ in the captured image to a point $(x'_i, y'_i)$ via $k (x'_i, y'_i, 1)^T = H (x_i, y_i, 1)^T$, where $k$ is a scalar.
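Applying a homography in homogeneous coordinates amounts to a matrix-vector product followed by dividing out the scalar $k$. A minimal sketch (the function name is ours; in practice a routine such as OpenCV's perspective transform does the same):

```python
import numpy as np

def apply_homography(H, x, y):
    """Map pixel (x, y) through the 3x3 homography H:
    k * (x', y', 1)^T = H @ (x, y, 1)^T, then divide by the third
    component to eliminate the scale factor k."""
    p = H @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]
```

The inverse mapping used later in the analysis step is obtained the same way with `np.linalg.inv(H)`.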
The contour of the garment is found using the OpenCV 50 implementation of Suzuki et al., 51 and a bounding box containing the region of interest is determined. The region of interest is passed through the network, yielding the class $\hat{C}$, the location $(x'_{i,\mathrm{lm}}, y'_{i,\mathrm{lm}})$ of each landmark in the transformed image, and a distribution $p_i(j, k)$ for each landmark. All landmarks that lie outside of the contour are brought to the nearest contour vertex. If the certainty reported by the network for the category is above a certain threshold and there exists a template for class $\hat{C}$, the template $T$ for class $\hat{C}$ is selected. If no template has been selected yet, and the certainty is not high enough or there is no template for class $\hat{C}$, the manipulation step is configured to do stochastic stretching and the analysis step terminates. The covariance matrix $\Sigma_i$ is computed for each landmark, and the position is transformed back to the original image using the inverse homography, $k (x_i, y_i, 1)^T = H^{-1} (x'_{i,\mathrm{lm}}, y'_{i,\mathrm{lm}}, 1)^T$, where $k$ is a scalar. The landmarks are finally transformed to three-dimensional points in the world coordinate frame by assuming that all landmarks lie in the plane coinciding with the table. They are then compared with the template to form a set $\mathcal{L}_{\mathrm{err}} = \{ i \mid i \notin_\epsilon T \}$ of all landmarks that do not match their location in the template. If $|\mathcal{L}_{\mathrm{err}}| = 0$, the algorithm terminates. The final part of the analysis step determines the certainty of the landmarks to select between the two modes of operation. As a measure of uncertainty $U_i$ for landmark $i$, the weighted maximum eigenvalue of the covariance matrix, $U_i = w^T_i \lambda_{\max}(\Sigma_i)$, is used. If $\min_{i \in \mathcal{L}_{\mathrm{err}}} U_i$ is below a certain threshold, landmark $\operatorname{argmin}_{i \in \mathcal{L}_{\mathrm{err}}} U_i$ is used for landmark placement in the manipulation step; otherwise, the manipulation step is configured to do stochastic stretching.
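The final mode selection of the analysis step can be sketched as follows. The dictionary-based data structures and the return convention are our assumptions for illustration:

```python
import numpy as np

def select_action(landmark_errs, covariances, weights, threshold):
    """Choose between landmark placement and stochastic stretching.

    U_i = w_i * lambda_max(Sigma_i) scores each mismatched landmark i;
    the most certain one is placed if its score is below the threshold,
    otherwise the garment is stretched stochastically.
    landmark_errs: set of mismatched landmark indices (L_err);
    covariances, weights: dicts keyed by landmark index (assumed layout)."""
    if not landmark_errs:
        return ("done", None)  # |L_err| = 0: the template is matched
    scores = {i: weights[i] * np.linalg.eigvalsh(covariances[i])[-1]
              for i in landmark_errs}      # eigvalsh sorts ascending
    best = min(scores, key=scores.get)     # most certain landmark
    if scores[best] < threshold:
        return ("place", best)
    return ("stretch", None)
```

`eigvalsh` returns eigenvalues in ascending order, so its last entry is $\lambda_{\max}$ of the symmetric covariance matrix.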

Manipulation step
Depending on the result of the analysis step, the manipulation step does either landmark placement or stochastic stretching.
Landmark placement: The landmark $i$ with the lowest uncertainty, as selected by the analysis step, is picked and placed at its location $p^T_i$ in the template. Stochastic stretching: If the analysis step could not determine any certain landmarks, the manipulation step does stochastic stretching. A contour around the clothing in the transformed image is found using the OpenCV 50 implementation of Suzuki et al., 51 and a random vertex $v' = (v'_x, v'_y)$ on the contour is chosen. The contour's centroid $c$ is computed to determine a destination point $p' = (p'_x, p'_y)$ outside of the contour as $p' = v' + a \frac{v' - c}{\|v' - c\|}$, where $a$ is the distance to displace the vertex outward, and a source point $s' = v' - b \frac{v' - c}{\|v' - c\|}$, where $b$ is the distance to displace the source point inward. The points $s'$ and $p'$ are transformed into two 3D points $s_w$ and $p_w$ in the world coordinate frame. The robot then picks the point $s_w$ and places it at $p_w$. An advantage of the stochastic stretching approach is that it does not depend on a good state representation in the first place and is therefore applicable to nearly any configuration.
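The pick and place points for stochastic stretching can be computed in a few lines. A minimal sketch assuming 2D image-plane points; the distances `a` and `b` and the function name are illustrative, not the paper's values:

```python
import numpy as np

def stretch_points(vertex, centroid, a=0.05, b=0.02):
    """Compute the source (pick) and destination (place) points for
    stochastic stretching. The destination lies a distance a outward
    from the contour vertex along the centroid-to-vertex direction,
    the source a distance b inward along the same direction."""
    v = np.asarray(vertex, dtype=float)
    c = np.asarray(centroid, dtype=float)
    d = (v - c) / np.linalg.norm(v - c)   # unit outward direction
    return v - b * d, v + a * d           # (source s', destination p')
```

Picking slightly inside the contour makes the grasp more likely to catch fabric, while placing outside the contour pulls the garment outward.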

Data sets
In this section we introduce all data sets used for the training and evaluation.

CTU Color and Depth Image Dataset
The CTU Color and Depth Image Dataset of Spread Garments (CTU data set (https://github.com/CloPeMa/garment_dataset)) 52 is designed for testing and benchmarking garment segmentation and recognition. This data set exemplifies the unstructured clothing configurations often found in robotic cloth manipulation: the garments are not only spread out flat but also exhibit wrinkles and a wide variety of orientations. The data set contains 1372 images taken from a top view of 17 different clothing items divided into 9 categories. We manually labeled landmark positions in each image. We use this data set to train our network, to evaluate its performance on the more challenging clothing configurations typical in robotics, and to evaluate the effect of our augmentation methods when training purely on the DeepFashion data set.

In-Lab data set
While the CTU data set is much closer to a real robotic cloth manipulation task than the DeepFashion data set, we created a small In-Lab data set that is even more typical of robotic tasks. It contains 117 images from 6 different clothing categories (i.e. Tank, Tee, Sweater, Hoody, Jacket, Jeans). To highlight the robotic component, each item is held by two robotic arms at predefined grasping points (i.e. shoulders and waist). The robotic arms are then moved to nine different configurations for each item. This results in the robotic arms occluding parts of the garments, making the data set more challenging. Furthermore, the background is not uniform and is partially cluttered. We annotated the images with the same landmarks as in the DeepFashion data set and extracted a similar bounding box around each item. We use this data set to evaluate the performance of our network on previously unseen items in a realistic lab environment.

Garments used for manipulation experiments
To evaluate the manipulation algorithm, seven garments from different categories and with different visual features were used. Three of them are in the category 't-shirt': one grey with a black star-shaped pattern, one with colorful stripes and one with a wide dark region. Furthermore, we used one grey sweater with colorful thin stripes, a dark and a white blouse, a pair of orange shorts and a pair of blue jeans.

Learning experiments
This section describes the experiments designed to evaluate the performance of our network and learning procedure, together with their results.

Pretraining on the DeepFashion data set
We use the same settings as in the literature 19,29,30 for training and evaluation. The training set contains 209,222 images, while the validation set holds an additional 40,000 images. The test set (used for the final evaluation) is composed of the remaining 40,000 images.
We use the normalized error (NE) 20 as the landmark localization error measure. This is the $\ell_2$ distance between the predicted and ground-truth landmarks in normalized coordinates. For the category and attribute classification, the top-k classification accuracy is used.
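One common way to compute the NE metric is to divide pixel coordinates by the image width and height before taking the $\ell_2$ distance. A sketch under that assumption (the exact normalization used in the benchmark may differ):

```python
import numpy as np

def normalized_error(pred, gt, w, h):
    """Normalized error (NE): mean l2 distance between predicted and
    ground-truth landmarks after normalizing pixel coordinates by the
    image width and height."""
    pred = np.asarray(pred, dtype=float) / np.array([w, h])
    gt = np.asarray(gt, dtype=float) / np.array([w, h])
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)))
```

For example, a prediction that is 10 pixels off horizontally in a 100-pixel-wide image contributes an error of 0.1 for that landmark.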
Before training, the images are cropped to their bounding boxes. We train our model with and without our proposed data augmentation steps, whereas the evaluation is always performed without augmentation. All implementation details can be found in the code base on our project website.

Experiments on CTU data set
We perform two types of experiments on the CTU data set. In the first experiment, we analyze the inference performance of our network, solely trained on the entire DeepFashion data set. This is done in order to be able to evaluate the usefulness of the proposed data augmentation methods. In the second experiment, we evaluate the performance of our network when trained and evaluated on the CTU data set.
Experimental setup. In order to use both the approximately 5 times larger DeepFashion data set and the CTU data set, we need to resolve the difference in category annotation, as they do not exactly overlap. If an item has a collar, it is categorized as polo in the CTU data set even though it might look more like a jacket than a polo shirt to a human. Furthermore, the CTU data set distinguishes between long- and short-sleeve items, whereas DeepFashion does not (e.g. t-shirt and t-shirt-long can both be in the Tee category). We combine the categories as follows: bluse = (Blouse), hoody = (Hoodie, Sweater), pants = (Jeans, Jeggins, Joggers, Leggins), polo = (Tee, Button-Down), polo-long = (Button-Down, Henley, Jacket), skirt = (Skirt), t-shirt = (Tee), t-shirt-long = (Cardigan, Sweater, Tee). Note that since the DeepFashion data set does not contain any towels, we ignore them in these experiments.
For the second experiment, we split the CTU images randomly into train, validation and test sets (787, 240 and 270 images, respectively). Both experiments are compared to the publicly available implementation of Liu and Lu. 30 We train both models with the same augmentation methods (i.e. no augmentation, elastic warping (EW), rotation (R), and rotation & elastic warping (R & EW)) to make the comparison as fair as possible.
Performance evaluation. The results of landmark prediction and category classification on the CTU data set with pretrained models are shown in Tables 1 (top) and 2, respectively. First, we note that the benefit of training with rotated images becomes apparent. That rotations boost the performance is not surprising, considering that the CTU data set contains images taken in a wide variety of orientations, whereas in the DeepFashion data set all items of clothing are upright. Adding elastic warping further increases the landmark prediction performance in all cases except the one where training was performed on DeepFashion without rotation. The overall classification accuracy of 85% shows that our model is able to generalize well even when trained on a data set with significantly different configurations (e.g. items of clothing worn by persons), compared to the 56% reached by Liu and Lu. 30 The results of the second experiment, trained and evaluated on the CTU data set, are shown in Table 1 (bottom). Note that landmark predictions are, as expected, significantly better when learned on the original data set. In this case, elastic warping seems to especially boost the performance when no rotations are used. We hypothesize that this is connected to the data set composition and size. We omit the category classification results on the CTU data set since all the tested models achieve 100% accuracy.
We conclude that adding elastic warping as a data augmentation method improves the performance in most of the evaluated cases. Our network outperforms the one proposed by Liu and Lu 30 when trained with the same augmentation methods in both experiments. This indicates that state-of-the-art methods are unlikely to generalize well to more challenging robotics-focused data sets.

Experiments on In-Lab data set
We leverage our In-Lab data set to investigate the performance of the network trained solely on the DeepFashion data set and subsequently used to classify images taken in a robotic lab environment.
The results for landmark prediction and category classification are shown in Tables 3 and 4, respectively. Some landmark predictions are exemplified in Figure 4. Interestingly, the hoody item is almost always misclassified, with the exception of the model employing the elastic warping method. Furthermore, the long-sleeved t-shirt (Figure 4, top row, middle) is often classified as a sweater. With these two challenging items, the best accuracy we achieve is 78.63%; without them, the accuracy increases to 93.33%. Due to the limited size of our data set, these two items have a significant impact. As the data set is very limited in size, elastic warping can also have a negative effect, as can be seen, for instance, in the drop in classification accuracy for the class Jacket. Nevertheless, the combination of rotation and elastic warping leads to the best overall performance. The results of the landmark localization also show that our network is able to perform well even when an image contains parts of the robot. We exemplify this useful behaviour in a short video on our project website, where the landmarks are continuously detected while the garment is being folded, despite the robotic arms occluding large parts of the garment.

Elastic warping parameters
We investigated the effect of the elastic warping parameters α and σ by evaluating the performance over a wider range of values. Table 5 shows the results for our model using different α and σ parameters for the elastic warping, for the methods trained on the DeepFashion data set (top) and the CTU data set (bottom). We compare three additional parameter combinations (α ∈ {100, 150, 200} with σ = 10) against the previous model (α = 500 and σ = 40). We observe that the model with α = 100 and σ = 10 performs best, but individual landmarks are affected differently by the variation of the elastic warping parameters. This shows that, with enough tuning, it is possible to find more suitable elastic warping parameters that improve the generalization towards the target distribution. We are therefore confident that the elastic warping data augmentation method can be used to improve generalization towards a real-world scenario as well.
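For reference, a minimal sketch of the elastic warping augmentation with the two parameters α (displacement scale) and σ (smoothing of the random displacement field), in the style of Simard et al.'s classic elastic distortion; this is an assumed implementation, not the paper's exact code:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_warp(image, alpha=150.0, sigma=10.0, rng=None):
    """Randomly displace pixels with a smoothed random field.

    sigma controls how locally correlated the deformation is,
    alpha how strong it is -- the two parameters varied in Table 5.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    # Smoothed random displacement fields, one per image axis
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    y, x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.array([y + dy, x + dx])
    if image.ndim == 2:
        return map_coordinates(image, coords, order=1, mode="reflect")
    # Warp every channel with the same displacement field
    return np.stack(
        [map_coordinates(image[..., c], coords, order=1, mode="reflect")
         for c in range(image.shape[-1])], axis=-1)
```

The same displacement field would also have to be applied to the landmark annotations when augmenting training data.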

Robotic experiments
The algorithm described in the 'Manipulation strategy' section was implemented on a Baxter robot. The proposed network is used to perform cloth manipulation with the aim of stretching garments. The robot is presented with garments in different predefined starting states, and the evaluation criterion is whether it can bring the garment into the state described by its template. An experiment ends either when the state is within the tolerance of the template or when it is manually terminated.
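The iterative analyse-then-move loop can be sketched as follows. The callables `capture`, `detect_landmarks`, `get_template` and `pick_and_place` stand in for the camera, the network and the robot primitives, and the restriction to landmarks not yet in place is our assumption; the names are hypothetical, not the paper's API:

```python
import numpy as np

TOLERANCE = 0.06  # the tolerance eps used in the experiments

def stretch(capture, detect_landmarks, get_template, pick_and_place,
            max_steps=50):
    """One stretching episode: analyse the garment, place the most
    certain landmark at its template position, repeat until the mean
    landmark error is within tolerance (sketch, assumed interfaces)."""
    for _ in range(max_steps):
        image = capture()
        landmarks, uncertainties = detect_landmarks(image)
        template = get_template(image)
        errors = np.linalg.norm(landmarks - template, axis=1)
        if errors.mean() <= TOLERANCE:
            return True  # garment matches its template
        # Pick the most certain landmark among those not yet in place
        misplaced = np.flatnonzero(errors > TOLERANCE)
        i = misplaced[np.argmin(np.asarray(uncertainties)[misplaced])]
        pick_and_place(landmarks[i], template[i])
    return False
```

With simulated primitives the loop converges in one placement per misplaced landmark; on the real robot each `pick_and_place` is of course imperfect, which is why the analysis step is repeated after every move.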

Experimental setup
Cloth manipulation was tested with the initial states folded hem, folded collar, folded sleeves, folded waist, folded legs and crumbled, see Table 6. All experiments were repeated five times with a model trained on the DeepFashion data set with rotation and with elastic warping parameters α = 150 and σ = 10. To extensively evaluate the performance of stretching with landmarks, the templates were selected manually for the first set of experiments. To compare the performance of the full algorithm as described in the 'Manipulation strategy' section, the experiments folded hem and folded collar for the garments Tee (stars) and Tee (stripes) were repeated with classification activated (five trials each). As a final set of experiments, the starting configuration folded hem for Tee (stars) was performed using a model trained on the CTU data set with α = 150 and σ = 10, again with five trials (template set manually). The limits for class and landmark certainty and the weights were chosen empirically, see Table 7. A tolerance of ε = 0.06, a lower limit on class certainty of 0.4, and an upper limit on landmark uncertainty of 3500 were used for all experiments. Templates for the classes Blouse, Tee, Sweater, Jeans and Shorts were available at all times.
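The empirically chosen limits combine into a single success check. The function below is a hypothetical reconstruction from the thresholds listed above (the name and call signature are ours, not the paper's):

```python
import numpy as np

# Limits from the experimental setup above
EPS = 0.06                       # tolerance on the mean landmark error
MIN_CLASS_CERTAINTY = 0.4        # lower limit on class certainty
MAX_LANDMARK_UNCERTAINTY = 3500  # upper limit on landmark uncertainty

def within_template(landmarks, template, class_certainty,
                    landmark_uncertainty):
    """Declare the garment stretched only when the predictions are
    trusted and the mean landmark-to-template distance is within eps."""
    if class_certainty < MIN_CLASS_CERTAINTY:
        return False  # class prediction not trusted
    if landmark_uncertainty > MAX_LANDMARK_UNCERTAINTY:
        return False  # landmark prediction not trusted
    err = np.linalg.norm(
        np.asarray(landmarks) - np.asarray(template), axis=1).mean()
    return bool(err <= EPS)
```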

Manipulation results
The success rates for the experiments are shown in Table 6. The results are summarized in Table 8, where 'manually terminated' is the proportion of failures that were terminated manually because the solution was not advancing, 'closeness' indicates how close the manually terminated experiments were to being solved, 'closeness stdev' is the standard deviation of the closeness number, 'bad move' is the proportion of failures that were terminated manually because of the robot making an irrecoverable bad move, and 'false success' is the proportion of failures that the algorithm reported as a success. The results of running the complete algorithm with classification are shown in Table 9, where 'no class' means that the algorithm was manually terminated because no class was ever determined. The experiments on the CTU data set resulted in zero successes. Recordings of all experiments are available on the website.
The closeness is computed as the mean error of the landmarks compared to the template. Each time the landmarks are measured in the analysis step, the mean distance to their positions in the template is computed. The minimum of these values during one execution of an experiment, the minimum mean error, is taken as the measure of closeness to the solution for that execution. The mean of all minimum mean errors, taken across all executions of all experiments for a class that resulted in manual termination, is reported as 'closeness' in Table 8.
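The definition above can be written compactly. Here `runs` is assumed to be a list of executions, each a sequence of N×2 landmark-measurement arrays (our shapes, chosen for illustration):

```python
import numpy as np

def closeness(runs, template):
    """Mean over executions of the minimum (over time) mean
    landmark-to-template distance -- the 'closeness' of Table 8."""
    minima = []
    for run in runs:  # one execution = a sequence of measurements
        errs = [np.linalg.norm(lm - template, axis=1).mean()
                for lm in run]
        minima.append(min(errs))  # best state reached in this run
    return float(np.mean(minima))
```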
The manipulation strategy has no notion of the order in which the landmarks should be placed; it merely picks the one with the least uncertainty. This leads to the common failure scenario of placing a landmark in a way that moves the clothing into a state that is much harder to solve. Another failure reason is incorrect output of the network. For some configurations of the clothing, the network is overconfident in the position of the landmarks, making the manipulation strategy perform a bad move. The extent of this depends on the garment being used, and the effect can also appear when the manipulation strategy tries to place the landmarks in a bad order, resulting in a challenging state as discussed previously.

Discussion and limitations
The method was sensitive to lighting conditions and to the color of the garment. A garment with a color similar to the background proved problematic, as can be seen for Tee (dark) in Table 8. Garments with smaller parts of a similar color to the background, or with an overall slightly similar color, could cause problems in the contour detection: the detected contour would miss part of the garment and present an incomplete image to the network (see Figure 5). The dependence on lighting can be observed for the garments Sweater and Jeans in Table 8, where the success rate is low and the rate of manual termination is high. Wrinkles and small displacements could influence the detected class, with the effects largely determined by the lighting conditions. The model trained on the CTU data set performed poorly, with no successes at all, even though the CTU data set is more similar to the application than the DeepFashion data set.
This indicates that the smaller size of the CTU data set has led to overfitting and showcases the importance of using large-scale data sets and data augmentation methods like elastic warping, combined with more robust manipulation strategies.
The experiments show the potential of using landmark placement for robotic cloth manipulation. As can be seen in Table 6, some simple cases had a high success rate, and there is a possibility of solving the hard initially crumbled state. Furthermore, the method has a low rate of false successes: it can accurately determine whether the garment is stretched.

Conclusion and future work
We presented a complete cloth manipulation framework based on category classification and landmark detection. We use a large publicly available fashion image data set, together with a data augmentation method called elastic warping, to train a network for garment classification and landmark detection for robotic manipulation applications. We thoroughly evaluate the performance of the network and the effects of elastic warping, and show that the parameters of the method can be tuned to fit a desired target distribution. Furthermore, we perform a wide set of real-world robotic experiments where the goal is to stretch the garment from different starting configurations, and provide all experimental videos on our supplementary website. This extensive evaluation highlights the need for more robust preprocessing methods than the contour detection used here, which is susceptible to varying lighting conditions and prone to errors when the garment color is similar to the background. Finally, by comparing the performance of our method trained on a large-scale fashion data set with that trained on a robotic-specific data set, we show the inadequacy of using smaller data sets for robotic purposes when dealing with novel clothing items. In future work, we plan to incorporate the learning component also in the manipulation step, formulate more robust manipulation strategies, combine the stretching step with the manipulation step, and investigate the effect of occlusions.

Authors' note
Oscar Gustavsson and Thomas Ziegler contributed equally to this article.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.