Task space adaptation via the learning of gait controllers of magnetic soft millirobots

Untethered small-scale soft robots have promising applications in minimally invasive surgery, targeted drug delivery, and bioengineering, as they can directly and non-invasively access confined and hard-to-reach spaces in the human body. For such potential biomedical applications, the adaptivity of the robot control is essential to ensure the continuity of operations, as task environment conditions show dynamic variations that can alter the robot's motion and task performance. The applicability of conventional modeling and control methods is further limited for soft robots at the small scale owing to their kinematics with virtually infinite degrees of freedom, inherent stochastic variability during fabrication, and changing dynamics during real-world interactions. To address the challenge of controller adaptation to dynamically changing task environments, we propose a probabilistic learning approach for a millimeter-scale magnetic walking soft robot using Bayesian optimization (BO) and Gaussian processes (GPs). Our approach provides a data-efficient learning scheme that finds the gait controller parameters while optimizing the stride length of the walking soft millirobot using a small number of physical experiments. To demonstrate the controller adaptation, we test the walking gait of the robot in task environments with different surface adhesion and roughness, and medium viscosity, which aim to represent possible conditions for future robotic tasks inside the human body. We further utilize the transfer of the learned GP parameters among different task spaces and robots and compare their efficacy in improving data-efficient controller learning.


Introduction
Soft robots are composed of highly deformable soft materials exhibiting programmable shape change, mechanical compliance, and high degrees of freedom, which are hard to achieve using rigid materials (Majidi, 2014). The easier access to novel fabrication methods further allows the engineering of stimuli-responsive soft materials that enable new functionalities for soft robots at multiple length scales (Shen et al., 2020). Biology remains a source of inspiration for the design, control, and behavior of soft robots (Laschi et al., 2016) and provides templates for new application areas in multi-terrain locomotion (Calisti et al., 2017), adaptive manipulation (Hughes et al., 2016), sensing (Iida and Nurzaman, 2016), human-assistive wearable systems (Walsh, 2018), and biomedicine (Cianchetti et al., 2018). Soft robots also enable safe human-robot physical interaction due to their physical compliance and the mechanical dampening of excess forces (Polygerinos et al., 2017), which otherwise require additional computational effort in conventional robotic systems (Haddadin et al., 2017). Small-scale (i.e., ≤ 1 cm) untethered soft robots have further potential application areas in medicine owing to their ability to access enclosed small spaces non-invasively (Sitti, 2018) and the embodiment of functionalized materials enabling targeted drug delivery, diagnostics, and surgery (Cianchetti et al., 2018).
Despite their exciting potential and new capabilities, soft robots face challenges that arise from the nature of their soft materials, such as having virtually infinite degrees of freedom, being prone to fabrication-dependent performance variabilities that cause more significant effects at the smaller scale, and nonlinear material behavior (e.g., hysteresis, creep). Moreover, physical interactions of soft-bodied robots with their operation environment, such as a solid or fluid operation medium, are very hard to model due to complex fluid-structure interactions, soft body dynamics, and contact mechanics. The combination of these aspects renders the application of conventional modeling and control methods challenging for soft robots, especially for untethered systems at the small scales (Rus and Tolley, 2015). One of the most widely used methods is the employment of constant curvature (CC) models that utilize the well-established beam theories to model the kinematics and dynamics of axisymmetrically bending soft robotic systems (Della Santina et al., 2020; Webster and Jones, 2010). Alternatively, analytical approaches using Cosserat rod models (Renda et al., 2018) and geometrically exact models have been suggested for continuum robots (Grazioso et al., 2019). Simulation techniques build upon these modeling methods as in the finite-element methods (FEMs), which construct continuum robot structures using a chain of rigid elements connected with tunable spring-damper mechanisms (Chenevier et al., 2018; Goury and Duriez, 2018). Numerical approaches using voxel-based representations (Hiller and Lipson, 2014) and discrete differential geometries (DDGs) (Huang et al., 2020) improve the computation time of soft robotic simulations at the expense of nonlinear dynamics precision. These models and simulation tools typically allow the implementation of static and dynamic controllers for continuum robots at larger scales (Thuruthel et al., 2018).
However, the physical application of these closed-loop controllers depends on the continuous sensing of body deformations from embedded sensors, highly responsive actuators, and computationally heavy model solutions, which are conditions that may not be met for untethered soft robots at the small scales (Rich et al., 2018). Therefore, the soft robotic platforms that successfully employ the analytical models at the small scales still depend on either open-loop (Lu et al., 2018; Ren et al., 2019; Wu et al., 2019) or manually applied (Kim et al., 2019) controllers. In particular, for those robots targeting medical applications, their dynamically changing and deformable task environments, fabrication-based variations, and material degradation over prolonged use significantly alter their robotic function performances and pose challenges for the conventional control strategies (Sitti, 2018). The combination of these challenges makes machine learning-based, adaptive, and data-efficient (i.e., using as few experiments as possible) control methods more desirable for untethered small-scale soft robots.
Data-driven machine learning methods may provide alternative solutions for the design and control of soft robots in the absence of analytical or numerical models that describe their underlying kinematics, dynamics, and functions (Chin et al., 2020). One common approach is to learn these models by gathering data from robot experiments and training a neural network (NN) architecture (Bern et al., 2020; Hyatt et al., 2019; Thuruthel et al., 2019). However, the need for data efficiency, i.e., the ability to learn from only a few experimental trials, presents a core challenge for such methods (Chatzilygeroudis et al., 2019). In contrast, Bayesian optimization (BO) (Ghahramani, 2015; Shahriari et al., 2015) allows for the maximization of a performance function using a small number of physical experiments. BO typically employs Gaussian processes (GPs) (Rasmussen and Williams, 2006) as a probabilistic model of the latent objective function. Although no explicit dynamics model is needed, GPs allow for incorporating information as probabilistic priors, thus reducing the experimental data requirements. There are emerging examples that demonstrate the application of this approach to optimize the locomotion performance of robots at different length scales (Calandra et al., 2016; Liao et al., 2019; Marco et al., 2020; Yang et al., 2018). Despite its potential, there are only a few examples that apply this method to address the controller challenge for untethered soft robots, such as in the gait exploration of a tensegrity system (Rieffel and Mouret, 2018) and the optimization of an undulating motion of a microrobot (von Rohr et al., 2018).
For cases where the training and testing domains show differences in terms of features or data distribution, transfer learning (TL) methods may provide further improvements in data-efficient learning and adaptation to new test cases (Pan and Yang, 2009). Within the BO applications that employ GPs, the prior knowledge can be transferred as GP priors (Raina et al., 2006) and hyperparameters (Perrone et al., 2019) from the trained domains to provide predictive information about the unknown features and distributions in the new test domains. In robotics, TL is typically employed as the transfer of the models of kinematics and dynamics between simulated and physical platforms of conventional rigid robotic systems, such as manipulators (Devin et al., 2017; Makondo et al., 2018), humanoids (Delhaisse et al., 2017), and quadrotor platforms (Helwa and Schoellig, 2017). However, the application of TL to soft robotic systems is still in its infancy (Schramm et al., 2020).
In our recent work in Culha et al. (2020), we demonstrated the controller learning of walking soft millirobots using BO and GPs and showed the improvement of learning efficiency by means of transferring prior mean information between robots as a TL application example. We followed the initial example by von Rohr et al. (2018), which designed a learning scheme by comparing different GP priors and BO settings on a semi-synthetic dataset representing the estimated gait controller space and used this estimation to optimize the one-dimensional crawling gait of a light-driven soft microrobot. In our work in Culha et al. (2020), we adopted the magnetic soft millirobots from Hu et al. (2018), which lacked sufficient predictive kinematic models and were therefore controlled with an open-loop system whose multi-dimensional parameters were tuned manually. We showed that these robots suffered performance inconsistencies due to fabrication reproducibility issues, material degradation over prolonged experiments, and environmental disturbances, which limited the derivation of a deterministic kinematic model and the application of relevant model-based controllers. Therefore, we applied BO and GPs to directly learn the controller parameters while optimizing the stride length performance of these robots and employed TL methods to improve learning efficiency using a small number of physical experiments.
In this study, we extend our previous work in Culha et al. (2020) and provide an in-depth analysis of using BO and GPs (in particular for TL) to directly and efficiently learn the controller parameters of the magnetic soft millirobots' walking gait on task spaces emulating biomedical application environments (Figure 1). First, we introduce our new automated and closed-loop experimental platform that can run the robot learning experiments repeatedly and reliably to eliminate the influence of any human intervention, which caused further material degradation and consequent performance inconsistencies in Hu et al. (2018) and Culha et al. (2020). We start by using an exhaustive search on the two-dimensional (2D) gait controller parameter space of the millirobot and generating benchmark datasets that show the stride length performances of three different robots on three different walking surfaces. We use this benchmark data to learn the optimum gait controllers using BO and GPs, and then to compare the influence of four different TL methods on the improvement of learning efficiency. We choose the best performing TL method from these experiments and use it with the BO and GPs to learn the walking gait controller parameters on a wide range of task spaces. We test our robots on task spaces with different surface roughness and friction, and liquid medium viscosity to emulate the conditions inside the human body for future target operations. Our results reveal that direct controller learning with BO and GPs allows for adaptation to different task spaces for small-scale untethered soft robots that are prone to fabrication-, material-, and interaction-dependent performance variabilities. We also show that the effective use of TL improves this adaptation by exploring a larger set of successful walking gait controllers within a limited number of physical experiments despite the significantly changing task space conditions.
The methodology we present in this study can be used for controlling future small-scale soft robot applications for medical operations that require a data-efficient controller learning system and quick adaptation to changing task environments. In summary, the main contributions of our work are:

1. demonstration of a data-driven optimization tool (i.e., BO) that can efficiently learn the gait controllers of a small-scale untethered robot whose performance is prone to fabrication-, material-, and physical interaction-based variabilities;

2. successful testing of the walking gait on three different task spaces that emulate, e.g., dynamic environments inside the human body, and the adaptation of the robot controller parameters to these environments in a small number of experiments;

3. implementation of an automated experimental platform that runs and evaluates the physical learning experiments repeatably and reliably without human intervention or simulated environments;

4. comparison and evaluation of four different TL methods, within the context of GP hyperparameters, on the learning efficiency of BO on the small-scale soft robots;

5. generation of five benchmark datasets consisting of the exhaustively parsed controller parameter space involving 3,750 different physical experiments for three different robots and three different walking surfaces that would allow further comparison between different optimization methods, which is available at https://github.com/sozgundemir/softrobotwalkingdataset

Figure 1. (a) Fabrication process of the magneto-elastomeric soft millirobot: the robot, which is composed of non-magnetized ferromagnetic microparticles homogeneously distributed inside a silicone elastomer sheet, is rolled around a cylindrical rod and magnetized with a |B| = 1.8 T field (red arrow) at a 45° angle with respect to the y-axis (see Hu et al. (2018) for details on the fabrication method). The unfolded robot maintains a periodic magnetization profile along its body (blue arrows). (b) Photo of the magnetic actuation and imaging experimental setup with six electromagnetic coils and two high-speed cameras that allow real-time evaluation of the robot's walking gait performance. (c)-(e) Walking gait control parameters during a single period of motion (a sample case of 1/f = 90 ms and f = 11 Hz is normalized to 0-1 on the abscissa): (c) the magnetic field B is controlled on the y-z plane and shown with its y and z components; (d) the magnetic field orientation changes from α1 to α2; and (e) the magnitude reaches B_max. (f) In a single actuation period of the walking gait, the magnetic soft millirobot follows four consecutive gait states shown with the photos: (1) relaxed, (2) front-stance, (3) double-stance, (4) back-stance, and (1) relaxed again (shown with the numbers above and below the figures).

The organization of this paper is as follows. We describe the design of the robotic system, the walking gait, and the properties of the task environments in our experiments in Section 2. Section 3 describes the learning approach with the details on the BO, GP, and TL methods. In Section 4, we present the experiments on generating the benchmark datasets, comparing the TL methods, and learning the walking gait in different task environments. We discuss the experimental results and conclude our work in Section 5.

Robot design and fabrication
We follow the methods and materials reported in Hu et al. (2018) and our previous work (Culha et al., 2020) and fabricate three magnetic soft millirobots with a 1:1 body mass ratio of Ecoflex 00-10 (Smooth-On Inc.) and neodymium-iron-boron (NdFeB) ferromagnetic microparticles with around 5 μm diameter (MQP-15-7, Magnequench). We place this pre-polymer mixture on a methyl methacrylate plate and cut the robots out of the cast using a high-resolution laser cutter (LPKF Protolaser U4) after the polymer is cured. Our robots have the final dimensions of length L = 3.7 mm, width w = 1.5 mm, and height h = 185 μm, as shown in Figure 1(a). We separately fold the robots around a cylindrical rod with a circumference equal to L and magnetize them within a magnetic field with a magnitude of 1.8 T and an orientation of 45° measured counterclockwise from the y-axis. Once the robots are unfolded from the rod, the magnetic particles maintain their magnetization orientation, forming a circular profile along the longitudinal axis of the robot body. We use these robots (i.e., robots 1, 2, and 3), which have the same nominal material properties and dimensions, in the rest of our experiments.

Walking gait definition
The walking gait of our robot is composed of four consecutive quasi-static states that are inspired by planar quadrupedal bounding (Alexander, 1984) and caterpillar inching motion (Trimmer and Lin, 2014). These states are depicted as (1) relaxed, (2) front-stance, (3) double-stance, and (4) back-stance, as shown in Figures 1(c)-(f). We control four parameters to generate the walking gait: the maximum magnetic field magnitude (B_max), the frequency of the actuation cycle (f), and two magnetic field orientation angles (α1 and α2) measured counterclockwise from the y-axis. The plots in Figures 1(c)-(e) show the change of the control parameters during a single period of the motion for B_max = 10 mT, f = 11 Hz, α1 = 30°, and α2 = 60°, which are adopted from the hand-tuned parameters reported in Hu et al. (2018). At the beginning of a single gait period, the robot starts at a relaxed state for 0 ≤ |B| ≤ 4 mT. The robot tilts forward when α = α1 and |B| increases from 4 mT to B_max = 10 mT. While |B| remains constant at B_max, the orientation of the magnetic field changes from α1 to α2, causing the robot to initially switch to the double-stance state and then to the back-stance state when α = α2. Then, |B| decreases while keeping the orientation of the magnetic field constant, and the robot gradually switches back to the relaxed state. For |B| < 4 mT, the robot assumes the relaxed state, and a single period of walking actuation ends when |B| = 0 mT. We reset B at the end of every gait cycle to avoid jerky motion when α changes back from α2 to α1 between cycles. In our experiments, the relaxed state is never skipped, but its duration changes according to f. The consecutive images from a single walking gait period are shown in Figure 1(f).
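To make the sequence concrete, the field schedule over one gait period can be sketched as a function of the normalized phase. This is a minimal illustration only: the piecewise-linear ramps and the equal phase durations are our assumptions, and the function name is ours; only the state sequence follows the gait definition above.

```python
def gait_field(t, B_max=10.0, alpha1=30.0, alpha2=60.0):
    """Return (|B| in mT, alpha in degrees) at normalized gait phase t in [0, 1).

    Illustrative sketch: ramps are linear and the three phases have assumed
    durations; only the ordering of the gait states follows the text.
    """
    t = t % 1.0
    if t < 0.25:
        # relaxed (|B| < 4 mT), then front-stance: |B| ramps 0 -> B_max at fixed alpha1
        return B_max * t / 0.25, alpha1
    if t < 0.75:
        # double-stance -> back-stance: |B| held at B_max, alpha sweeps alpha1 -> alpha2
        return B_max, alpha1 + (alpha2 - alpha1) * (t - 0.25) / 0.5
    # back to relaxed: |B| falls B_max -> 0 at fixed alpha2; the cycle ends at |B| = 0
    return B_max * (1.0 - (t - 0.75) / 0.25), alpha2
```

Sampling this function at the actuation rate, scaled by the gait frequency f, would reproduce the qualitative waveforms of Figures 1(c)-(e).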

Actuation and feedback setup
We place our magnetized soft robot along the y-axis of the magnetic coil setup consisting of three orthogonal pairs of custom-made electromagnets (see Figure 1(b)) that can generate a 3D uniform magnetic field within a 4 × 4 × 4 cm³ workspace with a maximum value of 15 mT. We modulate the magnetic field on the y-z plane that coincides with the center of the test environment by controlling the electric currents running through the electromagnetic coils with six motor driver units (SyRen25) and an Arduino microcontroller, which runs at a 1.2 kHz operation frequency to compute the analog readings received from the current sensors, control the motor drivers, and perform all necessary communications with the master PC for each single control cycle of 0.83 ms. We regularly calibrate the magnetic actuation matrix inside the workspace, i.e., the mapping between the applied electric current and the generated magnetic field, to maintain reliable and repeatable experiments.
We track the robot's gait using two high-speed cameras (Basler acA2040-90uc, shown in Figure 1(b)). The first camera, running at 120 frames per second (fps), is placed orthogonal to the axis of robot motion (i.e., the y-z plane of the controller). A tracking algorithm, whose pseudo-code is given in Appendix B, uses this camera to detect and evaluate the robot's motion to identify if the robot is moving according to the walking gait definition given in Section 2.2. The second camera, running at 60 fps, has an isometric view of the test scene and is used to measure the distance traveled by the robot following the perspective correction of the captured image. In every experiment, we calculate the stride length of the robot by tracking the average distance covered by its center of mass in five consecutive steps. At the end of every experiment, the robot is moved back to its original starting position automatically with the tracking and the actuation commands. See Extension 1 for the gait detection, position tracking, and repositioning for robot 3 walking on paper.
The learning process and image processing run on a master PC, and all communication tasks between different elements of the robotic system (e.g., image capture and electric current control) are executed on a Robot Operating System (ROS) architecture, which allows our system to be scalable for further extensions. The automated experimental platform implemented in our work allows the physical experiments to be executed with minimum human intervention, therefore reducing human-based disturbances on the robot and the test surfaces. Without these interactions, which can cause significant alterations to the soft millirobots, the physical learning experiments can be maintained repeatably and reliably.

Task environments
In this study, we use a wide range of different task environments to test the efficacy of our adaptive learning strategy in comparison with the limited surface experiments in Culha et al. (2020). Our goal is to emulate the in-air and liquid-immersed surface walking environments that a magnetic soft millirobot might experience during future medical operations inside the human body. To capture some of the characteristic properties of the target tissues and body fluids, we fabricate different task spaces and vary their: (1) surface adhesion, (2) surface roughness, and (3) liquid medium viscosity properties. For each of these properties, we experimentally identify the range of values that allow successful walking gaits and systematically test the robots in these specific ranges.
We fabricate a set of flat substrates with different surface adhesion strengths by using different materials. This set of substrates consists of paper, polystyrene (PS), and modified polydimethylsiloxane (PDMS). PDMS substrates are prepared by mixing Sylgard 184 (Dow Corning) with its curing agent at a 10:1 ratio, degassing, and curing at 90 °C for 1 hour. PDMS is modified by adding ethoxylated polyethylenimine (80% solution, Sigma Aldrich) prior to mixing and curing to increase its adhesive properties (Jeong et al., 2016). A volume of 0 μl (PDMS-0), 1 μl (PDMS-1), and 2 μl (PDMS-2) of polyethylenimine solution was added per 1 g of silicone elastomer base. The adhesion between the walking surfaces and the robot is measured in a TA Instruments Discovery HR-2 rheometer with a custom-built adhesion setup (at loading and unloading speeds of 50 μm s⁻¹). The resulting set of flat substrates yielded walking surfaces with adhesion strengths between 1 and 10 kPa, as shown in Figure 2(a).
Walking surfaces with changing roughness were prepared by replicating the surface texture of different grits of sandpaper. First, we fabricate the negative molds of the original surfaces by pressing a glass plate with a layer of uncured vinylsiloxane polymer (Flexitime medium flow, Heraeus Kulzer GmbH) onto the original surfaces. After curing the molds for 5 minutes at room temperature, we removed them and produced positive replicas of the original surfaces by casting clear epoxy (EpoxAcast 690, Smooth-On Inc., 10:3 ratio by weight) onto the mold, with another glass plate pressed on top. At the end of 24 hours of curing time, we removed the positive replicas from the molds. According to the surface profile examination of the replicated surfaces (Keyence VKX260K), the surface roughness values R_q (root-mean-square height) ranged between 7.5 and 77.2 μm, as shown in Figure 2(b). Last, we submerged the robots in different Newtonian fluids while walking on a flat surface to investigate the effect of bulk liquid medium viscosity on the walking performance. The Newtonian fluids were prepared by mixing different ratios of water and glycerol, and their viscosity was measured in a TA Instruments Discovery HR-2 rheometer. We analyze task environments with medium viscosity ranging from 1 to 90 cP, as shown in Figure 2(c).

Learning approach
We adapt the learning approach from our previous work (Culha et al., 2020) that aims to optimize the walking gait controller parameters to maximize the stride length S of the robot. Here, we define the reward function as

S : Θ → R, θ ↦ S(θ),

which maps the parameter set θ = [B_max, f, α1, α2] to scalar reward values (i.e., the stride length performance of a robot). According to the definition of the reward function, we formulate the parameter learning as the (global) optimization problem

θ* = argmax_{θ ∈ Θ} S(θ),

where Θ denotes the complete search space, θ is the parameter set, and S(θ) is the average stride length performance of the robot for a given θ.
We define the range of the controller parameters based on the findings in Hu et al. (2018) and the physical limitations of our magnetic actuation setup. Accordingly, B_max is defined between 5 and 12 mT, and the walking frequency f ranges from f_min = 0.5 Hz to f_max = 20 Hz. We limit α1 and α2 to [0°, 70°] and [30°, 90°], respectively, and select values that satisfy α2 > α1 to generate the walking gait in Figures 1(c)-(f). We use a step size of 1 mT for |B|, 5° for each α, and a variable step size of 0.5 Hz for f < 2 Hz and 2 Hz for f ≥ 2 Hz, which yield a total number of 15,600 possible parameter values in Θ.
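This discretization can be reproduced with a short enumeration (a sketch; the variable names are ours), which confirms the 15,600 admissible parameter sets:

```python
from itertools import product

B_max_vals = list(range(5, 13))                       # 5 .. 12 mT in 1 mT steps
f_vals = [0.5, 1.0, 1.5] + list(range(2, 21, 2))      # 0.5 Hz steps below 2 Hz, 2 Hz steps from 2 to 20 Hz
alpha1_vals = list(range(0, 71, 5))                   # 0 .. 70 deg in 5 deg steps
alpha2_vals = list(range(30, 91, 5))                  # 30 .. 90 deg in 5 deg steps

# Keep only combinations that satisfy alpha2 > alpha1 (required for a valid walking gait).
grid = [(B, f, a1, a2)
        for B, f, a1, a2 in product(B_max_vals, f_vals, alpha1_vals, alpha2_vals)
        if a2 > a1]
print(len(grid))  # 15600
```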

Gaussian processes
The magnetic soft millirobots in our paper did not have accurate models for kinematics or dynamics (i.e., we demonstrated the model inaccuracy of the original work of Hu et al. (2018) in our previous study in Culha et al. (2020)); therefore, it is necessary to approximate the reward function based on the data collected from physical experiments. However, the physical data has inherent uncertainty owing to the noise in the measurements and the variations during the experiments. To include these uncertainties in the model, overcome the sparsity in the data, and make probabilistic predictions at unobserved locations, we represent the reward function S(θ) using GPs following the study in von Rohr et al. (2018):

S(θ) ~ GP(m(θ), k(θ, θ′)),

where m(θ) is the prior mean and k(θ, θ′) is the kernel function defining the covariance between S(θ) and S(θ′) for θ, θ′ ∈ Θ. However, as S(θ) can only be measured with noise, we define the observed stride length S̃ as

S̃(θ_i) = S(θ_i) + n_i,

where n_i is zero-mean Gaussian noise with variance σ_n² for each measurement i.
During one run of BO, the GP model is sequentially updated with S̃(θ) observed from experiments. We define one learning run as a run of BO until the desired stopping criterion is satisfied (e.g., a fixed number of experiments is reached).
From the experimental data D = {θ_i, S̃(θ_i)}_{i=1}^{N}, where N denotes the number of experiments in D, the stride length of the robot for an unobserved θ can be predicted using the posterior mean and variance

μ(θ) = m(θ) + k(θ)ᵀ (K + σ_n² I)⁻¹ (S̃ − m),

σ²(θ) = k(θ, θ) − k(θ)ᵀ (K + σ_n² I)⁻¹ k(θ),

where k(θ) = [k(θ, θ_1), …, k(θ, θ_N)]ᵀ, K is the covariance matrix with entries K_ij = k(θ_i, θ_j), m = [m(θ_1), …, m(θ_N)]ᵀ, and S̃ = [S̃(θ_1), …, S̃(θ_N)]ᵀ. We select the squared exponential as the kernel function in the GPs, which is defined in Duvenaud et al. (2011) for multi-dimensional cases as

k(θ, θ′) = σ_f² exp(−(1/2) Σ_c (θ_c − θ′_c)² / l_c²),

where l_c ∈ R^{d_c} is the vector of length scales that defines the rate of variation in the modeled function for each dimension of the parameter space, i.e., l_θ for θ ∈ {B_max, f, α1, α2}. Roughly speaking, long length scales are used to model slowly varying functions and short length scales are used to model quickly varying functions. The signal variance σ_f² describes the width of the distribution, e.g., a high σ_f² means higher uncertainty in the predictions of the unobserved θ.
We implement the GP model in our experiments using the libraries provided by GPy (GPy, 2012).
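For illustration, the posterior equations above can be written in plain NumPy. This is a minimal sketch with our own function names, assuming a constant prior mean; the actual experiments use GPy, which additionally handles hyperparameter optimization and numerical stabilization.

```python
import numpy as np

def se_kernel(X1, X2, sigma_f=1.0, lengthscales=1.0):
    """Squared-exponential (ARD) kernel:
    k(x, x') = sigma_f^2 * exp(-0.5 * sum_d ((x_d - x'_d) / l_d)^2)."""
    diff = (X1[:, None, :] - X2[None, :, :]) / lengthscales
    return sigma_f**2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))

def gp_posterior(X, y, X_star, sigma_f, lengthscales, sigma_n, prior_mean=0.0):
    """Posterior mean and variance at query points X_star (shape (M, d)),
    given noisy observations y at training inputs X (shape (N, d))."""
    K = se_kernel(X, X, sigma_f, lengthscales) + sigma_n**2 * np.eye(len(X))
    k_star = se_kernel(X_star, X, sigma_f, lengthscales)          # shape (M, N)
    alpha = np.linalg.solve(K, y - prior_mean)
    mu = prior_mean + k_star @ alpha                              # posterior mean
    var = sigma_f**2 - np.sum(k_star * np.linalg.solve(K, k_star.T).T, axis=1)
    return mu, var                                                # posterior variance
```

Near an observed point the predictive variance collapses toward the noise level, while far from all data it recovers the prior variance σ_f², which is what drives exploration in the BO runs.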

Bayesian optimization
We use BO to select the parameter set θ_next to be tested in the next step of the learning run using the acquisition function a_acq(θ) as

θ_next = argmax_{θ ∈ Θ} a_acq(θ).

In this study, we choose the expected improvement (EI) as the acquisition function a_acq(θ) due to its better performance compared with its alternatives, as demonstrated in von Rohr et al. (2018). EI seeks the parameter set for the next step where the expected improvement in the reward function is the highest compared with the previously collected data and is defined in Jones et al. (1998) as

EI(θ) = E[max(0, S(θ) − S̃(θ*))],

where S̃(θ*) is the highest reward function value collected so far. The analytical solution of Equation (10) is given in Brochu et al. (2010) as

EI(θ) = (μ(θ) − S̃(θ*) − ξ) Φ(Z) + σ(θ) φ(Z),

where Φ and φ are the Gaussian cumulative density and probability density functions, respectively. The term Z is described as Z = Z(θ) = (μ(θ) − S̃(θ*) − ξ)/σ(θ), with μ(θ) and σ(θ) computed from Equations (5) and (6). The two terms in Equation (11) define the exploitation and the exploration weights of the BO, respectively. The balance between these two terms is controlled by the hyperparameter ξ. As ξ gets higher, BO focuses more on exploration and seeks the next parameter set in regions with high prediction uncertainty. In contrast, as ξ gets lower, BO focuses more on exploitation and selects the next parameter set within a close range of already explored regions. As the goal of our study is to adapt to task spaces by increasing the likelihood of finding more controller parameter sets that yield successful walking gaits under uncertainty, we choose ξ = 0.1 to increase the exploration tendency of the BO in our experiments.
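The closed-form expression for EI can be computed directly from the GP posterior at a candidate point (a sketch with our own function name; the defaults mirror the ξ = 0.1 choice above):

```python
import math

def expected_improvement(mu, sigma, best_y, xi=0.1):
    """EI acquisition at one candidate point, given the GP posterior mean mu
    and standard deviation sigma there, and best_y the best observed reward."""
    if sigma <= 0.0:
        return 0.0                                  # no uncertainty, no expected improvement
    z = (mu - best_y - xi) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))            # Gaussian CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)     # Gaussian PDF
    # exploitation term + exploration term
    return (mu - best_y - xi) * Phi + sigma * phi
```

Selecting the next experiment then amounts to evaluating this function over the candidate grid and taking the argmax. Note how a larger posterior standard deviation inflates EI, which is the mechanism that ξ amplifies.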

Transfer learning
In this study, we compare four different methods of TL on our walking gait experiments: (1) transfer of all GP hyperparameters, σ_n², σ_f², and the length scales l_c for each dimension of the parameter space, i.e., l_θ for θ ∈ {B_max, f, α1, α2}; (2) transfer of only the length scales l_c; (3) transfer of the prior mean information m(θ); and (4) the hybrid combination of the length scales l_c and m(θ).
3.3.1. Transfer of GP hyperparameters. The choice of the types and values of the GP hyperparameters influences the regression of the GP (Chen and Wang, 2018), and their transfer from prior models can change the dynamics of the learning process (Patacchiola et al., 2020; Wang et al., 2020). The hyperparameters we choose to investigate as a part of the GPs in this study are the noise in the collected data σ_n², the signal variance σ_f², and the length scales l_c. We start the BO learning by initializing σ_n² to the maximum variance found in the repeated experimental results in our previous work in Culha et al. (2020), and setting the signal variance σ_f² to the square of half of the body length of the robot so that the highest possible reward value (i.e., L = 3.7 mm) remains inside the 95% confidence interval of the prior. We also set the length scale values l_θ to a quarter of the total range of each corresponding parameter. After starting the BO runs with these initial values, we use the log marginal likelihood (Rasmussen and Williams, 2006)

log p(S̃ | θ_{1:N}, l_c, σ_n², σ_f²) = −(1/2) S̃ᵀ (K + σ_n² I)⁻¹ S̃ − (1/2) log |K + σ_n² I| − (N/2) log 2π

to simultaneously optimize the GP hyperparameters based on the collected data during the learning runs. We use these estimates of the selected hyperparameters as one of the TL methods in the following experiments in Section 4.3.
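The zero-mean log marginal likelihood above can be evaluated with a standard Cholesky routine (a sketch; GPy performs this maximization internally when fitting hyperparameters):

```python
import numpy as np

def log_marginal_likelihood(K, y, sigma_n):
    """Zero-mean GP log marginal likelihood of observations y, given the
    kernel matrix K of the training inputs and noise std sigma_n."""
    n = len(y)
    Ky = K + sigma_n**2 * np.eye(n)
    L = np.linalg.cholesky(Ky)                    # Ky = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha                      # data-fit term
            - np.sum(np.log(np.diag(L)))          # complexity term: 0.5 * log|Ky|
            - 0.5 * n * np.log(2.0 * np.pi))      # normalization constant
```

Hyperparameter estimation then maximizes this quantity over (l_c, σ_n², σ_f²), with K recomputed from the kernel at each step; the resulting estimates are the quantities transferred between learning runs in this TL method.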
3.3.2. Transfer of prior mean information. In addition to the kernel, the prior mean m(θ) must be chosen at the beginning of a BO run. Often, a constant zero mean (i.e., m = 0) is the default choice as an uninformed prior mean function for maximization problems (Chen and Wang, 2018). For the millirobot learning problem herein, we investigate the transfer of information from previous learning runs by setting the prior mean to the posterior mean of a previously trained GP model, such as from a different robot. In this way, we can approximately transfer the topology of the target function between different test scenarios, which is reasonable as long as the differences between the robots and the environments do not significantly alter the function shape.

3.3.3. Hybrid transfer. The previous methods can be combined so that both the optimized estimates of the GP hyperparameters and the prior mean information are transferred between the BO experiments. In this study, we also investigate the combination of the estimated length scales l_c and the prior mean information m(θ) and their transfer between the test cases in Section 4.3.
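With a zero-mean GP library, the prior-mean transfer of Section 3.3.2 can be implemented via the standard residual trick: fit the target-task GP to the residuals of the source model's posterior mean, then add that mean back at prediction time. The sketch below uses our own function names and is not the authors' implementation.

```python
import numpy as np

def residual_targets(source_mean, X_target, y_target):
    """Subtract the transferred prior mean m(theta), i.e., the source model's
    posterior mean, from the target observations; a zero-mean GP is then fit
    to these residuals."""
    m = np.array([source_mean(x) for x in X_target])
    return y_target - m

def predict_with_transfer(source_mean, residual_predict, X_query):
    """Add the transferred prior mean back onto the residual GP's prediction
    to obtain the posterior mean in the target task."""
    m = np.array([source_mean(x) for x in X_query])
    return m + residual_predict(X_query)
```

If the source and target reward landscapes share their topology, the residuals are small and nearly flat, so the target GP needs far fewer observations, which is the mechanism behind the data-efficiency gains reported later.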

Experimental results
Our study aims to use BO and GPs to demonstrate adaptation to different task spaces while experimentally optimizing the stride length of soft millirobots whose walking performances are prone to fabrication-, material-, and interaction-based reproducibility issues that cannot be successfully predicted with kinematic models. In that sense, we focus on exploring a variety of walking patterns under changing task space conditions rather than continuously optimizing a specific walking gait performance. Accordingly, we design the experiments to highlight the influence of BO, GPs, and TL methods on increasing the average rate of finding successful walking gaits, i.e., gaits strictly following the consecutive states described in Section 2.2 that may also yield sub-optimal stride length performances, within a limited number of learning runs, instead of only finding the optimum controller parameters.
We begin by using an exhaustive search approach to generate benchmark datasets for the walking gaits in five different test scenarios using our millirobots in Section 4.1. Here, we limit the controller parameter space to two dimensions and only explore the α_1 and α_2 parameters while experimenting with three robots on a flat paper surface and with one robot (i.e., robot 3) on two additional walking surfaces. The results of the exhaustive search show the overall structure of the walking gait function over the range of the two controller parameter values. This statistical information serves as benchmark data to compare the learning efficacy of the BO and the different TL methods in Sections 4.2 and 4.3. We choose the most effective TL method guided by the experiments on the benchmark datasets and apply it to a new set of task adaptation experiments in Section 4.4. Here, we use a single robot on three different task spaces with a wide range of changing surface adhesion, surface roughness, and medium viscosity. In these experiments, we expand the controller space back to four dimensions (i.e., B_max, f, α_1, and α_2) and optimize the walking gait with the BO and the chosen TL method within a limited number of learning runs.

Generation of the walking gait benchmark datasets
In our previous work (Culha et al., 2020), we observed that the soft millirobots we adopted from Hu et al. (2018) experienced additional material degradation over long, repeated experiments that altered their gait performances. While we investigate the influence of the BO and TL methods on the improvement of learning efficiency in Sections 4.2 and 4.3, we want to minimize this material degradation effect on the walking gait. That is why we use an exhaustive search approach here and generate five different benchmark datasets that cover the walking gait function space necessary for the BO and TL investigations. To this end, we test three different robots (i.e., robots 1, 2, and 3) on a flat paper surface and a single robot (robot 3) on two additional surfaces: PDMS-0 and a P800-grit sandpaper replica.
To explore the walking gait function space in these five test cases, we constrain the controller space to two dimensions by using a constant B_max = 10 mT and f = 1 Hz, and changing α_1 and α_2. We choose values from the range [0°, 70°] for α_1 and [30°, 90°] for α_2 with a step size of 5° that meet the condition α_2 > α_1, which consequently generates 150 different controller parameter pairs. For each of these pairs, we repeat the experiments five times, generating 750 physical experiments for each test case, and report the results in Figures 3(a)-(e). The constrained dimensions reduce the necessary experiments from 390,000 (i.e., 5 repetitions for each test case using 15,600 parameter value sets) to 3,750 for the exhaustive search and significantly reduce the possible material degradation over prolonged experiments.
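The constrained 2D search grid described above can be enumerated directly, and the pair and experiment counts quoted in the text follow from it:

```python
# Enumerate the discrete (alpha_1, alpha_2) grid with the constraint alpha_2 > alpha_1.
alpha_1 = range(0, 71, 5)            # [0, 70] deg in 5-deg steps (15 values)
alpha_2 = range(30, 91, 5)           # [30, 90] deg in 5-deg steps (13 values)
pairs = [(a1, a2) for a1 in alpha_1 for a2 in alpha_2 if a2 > a1]

print(len(pairs))                    # 150 controller parameter pairs
print(len(pairs) * 5)                # 750 experiments per test case (5 repetitions)
print(len(pairs) * 5 * 5)            # 3,750 experiments across all five test cases
```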
In these experiments the robots do not necessarily follow the gait definition in Section 2.2; therefore, we describe the resulting values as ''displacement measurements,'' S′. To filter out the non-walking gaits of the robot from this dataset, which can be seen at the beginning of Extensions 2, 3, and 4 for different task spaces, we use our gait tracking feedback system defined in Section 2.3 and evaluate every test result to penalize the α controller pairs that do not generate the desired walking gait. Accordingly, we obtain the average stride length performances, S̄, of the successful gaits in Figures 3(f)-(j). These results show that the stride length performances of the successful walking gaits are limited to S̄ ≈ 2.4 mm, and the penalized motions with higher displacements (S′ > S̄) do not comply with the walking gait definition. The statistical information collected from the physical experiments in these five test cases (shown in Figures 3(f)-(j)) constitutes the benchmark datasets that we use in the following BO and TL investigations. The optimum controller parameter sets found by the exhaustive search in the 2D function space are reported in Table 1. We use the mean and standard deviation values of these exhaustive search results to sample the stride length of the robots for the given α controller parameter sets while comparing the performances of different TL methods. The benchmark data are available online at https://github.com/sozgundemir/softrobotwalkingdataset.

Learning the walking gait with the ''standard'' BO
We initially test the walking gait learning with a BO approach on the benchmark datasets, where the prior mean information is set to zero (i.e., m(θ) = 0) and the four hyperparameters are set to the initial values described in Section 3.3.1 without inheriting any other prior information. We utilize this approach, which we refer to as ''standard BO'' in the rest of this study, for the two controller parameters α_1 and α_2 and apply it separately to the benchmark datasets for the five different test cases (i.e., robots 1-3 on a flat paper surface, and robot 3 on the PDMS-0 and P800 surfaces). To be able to directly use the mean and standard deviation information from the exhaustive search results, we configure our BO to explore the same discrete controller parameter set space used in Section 4.1. We perform 100 independent learning runs, each involving 100 iteration steps, for the five test cases. One iteration step of a learning run involves three steps: (1) the BO selects a new parameter set θ that maximizes the acquisition function based on the GP model; (2) for the selected controller parameter pair, the corresponding stride length performance is sampled from the normal distribution defined by the mean and standard deviation values found in the relevant benchmark dataset; and (3) the learning system updates the GP model using this sampled data and prepares for the next iteration step.
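The three steps above can be sketched as a benchmark-driven BO loop. The benchmark lookup table below is synthetic, and the upper confidence bound (UCB) form of the acquisition function is an illustrative assumption; the paper's actual acquisition function is given in its equation (11).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
grid = np.array([(a1, a2) for a1 in range(0, 71, 5)
                 for a2 in range(30, 91, 5) if a2 > a1], dtype=float)

# Hypothetical benchmark statistics: mean and std of stride length per pair,
# standing in for the exhaustive-search tables of Section 4.1.
bench_mean = np.exp(-((grid[:, 0] - 30) ** 2 + (grid[:, 1] - 60) ** 2) / 800.0) * 2.4
bench_std = np.full(len(grid), 0.1)

X_obs, S_obs = [], []
gp = GaussianProcessRegressor(kernel=RBF(length_scale=[17.5, 15.0]), alpha=0.01)
for step in range(20):
    if X_obs:
        gp.fit(np.array(X_obs), np.array(S_obs))
        mu, sigma = gp.predict(grid, return_std=True)
    else:
        mu, sigma = np.zeros(len(grid)), np.ones(len(grid))
    idx = int(np.argmax(mu + 2.0 * sigma))                 # (1) maximize acquisition (UCB)
    sample = rng.normal(bench_mean[idx], bench_std[idx])   # (2) sample from benchmark stats
    X_obs.append(grid[idx]); S_obs.append(sample)          # (3) update the GP model

best = max(S_obs)
```

Because evaluations are drawn from the benchmark statistics rather than from hardware, such a loop can be repeated cheaply, which is how the 100-run comparisons in this section are obtained.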
We report the median of the learning results with the upper and lower quartiles in Figure 4. These values represent the normalized gait performance Ŝ = S̄ / S_exh, where S_exh is the mean of the stride length performances of the robots for the best α controller parameters reported in Table 1. For the first four of the five test cases, the standard BO approach finds the optimum gait controller parameters in fewer than 25 iterations on average over 100 independent learning runs. For robot 3 walking on the P800 surface, the BO finds approximately 74% (Ŝ ≈ 0.74) of the optimum stride length performance, as shown in Figure 4(e). The results show that the standard BO starts the learning without any prior information and occasionally finds controller parameter sets that yield Ŝ = 0 in the first 15 iteration steps. The non-monotonic optimization results in these initial steps are expected due to the statistical exploration nature of our BO approach. After 25 iteration steps, the BO maintains the exploration in the close vicinity of the best controller parameters found so far for all the test cases (i.e., Ŝ ≈ 1 for robots 1-3 on paper and robot 3 on PDMS-0, and Ŝ ≈ 0.74 for robot 3 on P800). The variation of the generated walking gait performances is bounded by the standard deviations reported in Table 1. The exact normalized performance results for the standard BO approach can also be seen in Table 2.

Comparison of transfer learning methods on the benchmark datasets
In this study, we extend our previous investigation on the role of TL in learning efficiency (Culha et al., 2020) and compare four different methods while optimizing the gait controllers of our soft millirobots. Similar to Section 4.2, we apply our BO learning to the benchmark datasets generated in Section 4.1, where the controller parameter space is limited to two dimensions with α_1 and α_2. However, unlike the ''standard'' BO, here we initialize the learning runs of the robots in all test cases with different types of transferred prior information.

Table 1. Best performing α controller parameter sets found by the exhaustive search and the corresponding stride length results S_exh (mm) (average ± SD) for each test case.

4.3.1. HP-4 transfer. We initially transfer all four GP hyperparameters (i.e., the noise in the collected data σ_n^2, the signal variance σ_f^2, and the length scales ℓ_c^θ for θ ∈ {α_1, α_2}) that are optimized with the log marginal likelihood estimation in (12) from the learning runs of the source robot to all other five test cases. Here, we initialize the BOs with this prior information and start the learning experiments. The normalized stride length performances, Ŝ, of the BO learning with this TL method (depicted as ''HP-4'') are shown in comparison with the ''standard'' BO approach in the first column of Figure 5. These results show that the transfer of all four hyperparameters improves the learning performance by decreasing the number of iteration steps needed to find the optimum controller parameters within these experiments. For robot 1 on paper and robot 3 on P800, the BO reaches the performances achieved by the ''standard'' approach in less than half of the iteration steps, as shown in Figure 5(a) and (q), respectively. For robots 2 and 3 on paper, and robot 3 on PDMS-0 (as seen in Figure 5(e), (i), and (m)), the BO with this TL method does not reach the performances previously achieved by the ''standard'' BO. After finding the optimum performances, the BO maintains the exploration in their close vicinity for all the test cases.
4.3.2. HP-ℓ_c transfer. Second, we investigate the transfer of only the two length scale hyperparameters ℓ_c^θ for θ ∈ {α_1, α_2} optimized by the source robot. We run the BO learning experiments after all the length scale parameters are initialized with the transferred ℓ_c values. The results in the second column of Figure 5 show that this TL method (depicted as ''HP-ℓ_c'') worsens the learning performance, as it increases the number of iteration steps the BO needs to find the optimum performance parameters for all the cases. However, in comparison with the previous ''HP-4'' method, the transfer of only the length scale hyperparameters allows the BO to explore the gait performances achieved by the standard BO approach.

Fig. 4. Performance of the standard BO for (a) robot 1 on paper, (b) robot 2 on paper, (c) robot 3 on paper, (d) robot 3 on PDMS-0, and (e) robot 3 on P800. The stride length performances are normalized with respect to S_exh given in Table 1 for easier comparison between tests. Each figure shows the statistical results as the median and the upper and lower quartiles from 100 independent BO runs, each consisting of 100 iterations.

4.3.3. Mean transfer. As the third method, we transfer the posterior mean information, m(θ), from the source robot as the prior information for the robots in the test cases. Here, we set all of the other GP hyperparameters to the initial values described in Section 3.3.1 and restart the BO learning experiments. The results in the third column of Figure 5 show that when the BO learning starts with the prior mean information (depicted as ''Mean''), it achieves the average gait performances found by the standard BO (Ŝ ≈ 1) in fewer iteration steps for the first four test cases. Although the median of the achieved Ŝ remains close to the standard BO results for robot 3 on P800, this method also allows the exploration of the controller parameter sets that yield performances close to the optimum results from the exhaustive search, as shown in Figure 5(s).

4.3.4. Hybrid transfer. Finally, we adopt a hybrid approach and transfer the posterior mean information, m(θ), together with the two length scale hyperparameters ℓ_c^θ for θ ∈ {α_1, α_2}; the results are shown in the last column of Figure 5 (depicted as ''Hybrid''). The hybrid transfer improves the learning run performance by decreasing the number of iteration steps needed to find the optimum performing parameter sets found by the standard BO. Similar to the results in Figure 5(s), the hybrid transfer also allows robot 3 on P800 to investigate parameter regions that yield performances close to the optimum results from the exhaustive search.

The comparative performance results of the standard BO and the four TL methods for each of the five test cases are reported in Figure 6 and Table 2. Owing to the statistical and explorative nature of the BO, the stride length performances do not monotonically increase at every consecutive iteration step (as also visible in Figure 5). That is why, to provide a clear comparison between our methods in Figure 6, we identify the iteration steps that show the ''best so far'' performance during the learning run of each approach. The first row, Figure 6(a)-(e), compares these methods in terms of the normalized error of the achieved stride length performances during the BO learning runs (e = 1 − Ŝ). For the first four cases, it can be seen that except for the ''HP-4'' method, the other three transfer methods and the standard BO manage to consistently explore the optimum stride length performances (e ≈ 0), as shown in Figure 6(a)-(d). The ''HP-4'' method does not allow the BO to find the optimum controller parameters, and the results remain approximately 20% below these optimum values for three of the test cases, as shown in Figure 6(b)-(d). The ''HP-ℓ_c'' method shows performances similar to the standard BO in terms of the initial and the final performance error.
Both the ''Mean'' and ''Hybrid'' transfer methods allow the BO to generate comparable final performances, while the mean transfer method typically starts with significantly lower error compared with the hybrid approach. For the last test case (i.e., robot 3 on P800, Figure 6(e)), we can see that the standard BO approach only manages to explore gaits that reach 74% (Ŝ ≈ 0.74) of the optimum gait performance. Every TL method increases the performance yield of the BO learning, with the ''Mean'' transfer outperforming all other methods in terms of the final performance by exploring gaits that have Ŝ ≈ 0.91, or e ≈ 0.09, on the surface of the P800-grit sandpaper. The comparison between the standard BO and the TL methods in terms of the achieved Ŝ performances is given in detail in Table 2.

Fig. 6. (a)-(e) Comparison of the normalized stride length performance errors of the standard BO and the TL methods with respect to the exhaustive search results (Table 1). Each figure shows the ''best-so-far'' performance results over 100 iterations. (f)-(j) Comparison of the resulting convergence steps of the standard BO and the TL methods. The convergence steps of the TL methods are calculated as the iteration step achieving the performance level of the standard BO approach. The maximum iteration step of 100 is reported for the cases that cannot reach the standard BO level.
The second row, Figure 6(f)-(j), represents the learning efficiency comparison in terms of the iteration steps needed for the standard BO to explore its best performance and for the four TL methods to achieve standard BO level performance. We describe this exploration performance with ''convergence steps,'' which are calculated by finding the first iteration whose performance value stays within the 5% band of the average of the remaining steps. Normally, a monotonic convergence is not expected from the statistical and explorative BO learning. However, viewing the learning runs with the ''best-so-far'' evaluation method allows us to represent the iteration steps required to achieve comparable performance, and to capture the relative data efficiency of the TL methods. Accordingly, we can see that the ''HP-4'' method fails to achieve standard BO level performance for three cases (Figure 6(g)-(i)), as its convergence steps are equal to the number of iteration steps in the experiments. The ''HP-ℓ_c'' method shows results comparable with the standard BO for all the cases (Figure 6(f)-(j)) in terms of the convergence steps. In comparison, both the ''Mean'' and ''Hybrid'' TL methods attain the standard BO level performance significantly faster (i.e., in fewer convergence steps) in all the test cases. The details of the convergence steps are given in the last column of Table 2.
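The ''best-so-far'' evaluation and the convergence-step criterion can be sketched as follows. This reflects our reading of the criterion described above: the convergence step is the first iteration whose best-so-far value stays within a 5% band of the mean of the remaining best-so-far values.

```python
import numpy as np

def best_so_far(performances):
    """Running maximum of a learning run's performances."""
    return np.maximum.accumulate(np.asarray(performances, dtype=float))

def convergence_step(performances, band=0.05):
    """First iteration whose best-so-far value is within `band` of the
    average of the remaining best-so-far values (1-indexed steps)."""
    curve = best_so_far(performances)
    for i in range(len(curve)):
        tail_mean = curve[i:].mean()
        if abs(curve[i] - tail_mean) <= band * tail_mean:
            return i + 1
    return len(curve)

# Toy learning run: stride lengths (mm) over eight iterations.
run = [0.0, 0.4, 0.3, 1.9, 1.7, 2.0, 2.0, 2.0]
print(best_so_far(run))       # running maximum of the run
print(convergence_step(run))  # 4: step 4 (1.9) is within 5% of the tail mean
```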
As the ''Mean'' TL method outperforms the standard BO and the other three TL methods by finding better performing parameter sets in fewer iterations, we select it for the task space adaptation experiments in Section 4.4.

Adaptation to task spaces
The task environment is more susceptible to dynamic changes than the robot morphology, especially for medical operations inside the human body. Therefore, a quick adaptation of the robot controller is important to maintain successful robot task handling. In the following experiments, we investigate the learning efficiency of our BO approach while focusing on the three physical properties that may dynamically change during the walking task of our soft millirobots in future in vivo operations, which are (1) surface adhesion, (2) surface roughness, and (3) medium viscosity.
Here, we expand the controller parameter space exploration back to four dimensions by including the magnetic field magnitude B_max and the actuation frequency f. To efficiently learn the controller parameters in this higher-dimensional search space, we utilize the prior mean transfer (i.e., ''Mean'' TL), which is shown to be the best performing TL method for our experimental scenario in Section 4.3. However, as the source robot in the previous experiments explored only two dimensions of the controller space, we generate the posterior mean m_post(θ) required for the following experiments by performing a new set of physical experiments with robot 3 on a flat paper surface. For these experiments, we run the standard BO for 156 iteration steps (i.e., 156 different controller parameter sets), which corresponds to 1% of the complete controller parameter space with the step sizes given in Section 3. We use the posterior mean optimized at the end of these learning runs as the prior mean information for all the task space experiments. We only test robot 3 in the following task space adaptation experiments.
Again, we compare the learning efficiency of the standard BO with the prior mean transfer method on all the task spaces defined in Section 2.4. Here, the objective of these learning runs is to adapt to dynamic task spaces and learn the optimized controller parameters in as few experiments as possible, especially for future medical operations. To find the number of learning steps sufficient for the BO to find the desired walking gaits, we used the results in Figures 4, 5, and 6, which involve approximately 250,000 data points. These results show that the BO finds the optimized controller parameters that generate desired walking gaits consistently in fewer than 20 steps for different robots and walking surfaces. Therefore, we limit a learning run to 20 experiments (i.e., iteration steps) and perform three independent learning runs with the same initial conditions, yielding 60 experiments in total. One iteration of a learning run involves five steps: (1) the BO selects a new parameter set θ that maximizes the acquisition function based on the GP model; (2) the microcontroller initiates the physical experiment and regulates the magnetic field based on the selected θ; (3) the cameras record the robot's motion and measure the average stride length performance S̄ after running the gait tracking system; (4) the learning system updates the GP model using the newly collected data from the experiment; and (5) the robot returns to its initial position for the next experiment.
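The five steps above can be sketched as a hardware-in-the-loop episode. All hardware interfaces here (`set_field`, `track_gait`, `return_home`) are hypothetical stand-ins for the coil and camera systems, and the two callbacks abstract the BO/GP machinery; this shows the structure of a run, not the paper's actual control code.

```python
def learning_run(bo_suggest, gp_update, set_field, track_gait, return_home,
                 n_steps=20):
    """One task-space learning run of n_steps physical experiments."""
    history = []
    for step in range(n_steps):
        theta = bo_suggest()            # (1) maximize the acquisition function
        set_field(theta)                # (2) actuate B_max, f, alpha_1, alpha_2
        stride = track_gait()           # (3) measure average stride length S
        gp_update(theta, stride)        # (4) update the GP model
        return_home()                   # (5) reset the robot pose
        history.append((theta, stride))
    return history
```

Running three such episodes with the same initial conditions reproduces the 60-experiment budget per task space described above.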

4.4.1. Surface adhesion.
We initially test the robot on five surfaces with the different adhesion strengths reported in Figure 2(a). Figure 7(a)-(d) shows the walking gait performances during the three independent learning runs on the two ends of the adhesion range: paper (1.34 kPa) and PDMS-2 (11.02 kPa). The left column (Figure 7(a) and (c)) shows the learning runs with the standard BO approach, and the right column (Figure 7(b) and (d)) shows the learning runs with the prior mean transfer method. The difference between these figures shows that the TL method improves the learning runs by finding more of the controller parameters that yield positive walking gait performances. In addition, the BO with the TL manages to explore these parameters in the earlier steps of the learning runs compared with the standard BO approach. We represent the general influence of the mean transfer method on all the adhesion surfaces with the standard interquartile range (IQR) method in Figure 7(e). In this figure, the horizontal lines represent the median of the generated S̄ in 60 experiments for each surface. These lines are surrounded by boxes that show the upper and lower quartiles. The error bars show the extremes in terms of the highest and lowest S̄ performances, and the circles represent the outlier S̄ performances. The results for the learning with the standard BO and the BO with the chosen TL method are given side by side for each test surface in Figure 7(e), and the exact values are listed in Table 3. For example, we can see that the learning runs with the standard BO on the PS, PDMS-0, and PDMS-1 surfaces typically explore controller parameters that do not yield successful walking gaits. In comparison, the BO with the TL method explores numerous successful controller parameters on these surfaces.
The increased median lines in Figure 7(e) provide evidence that the prior mean transfer improves the BO learning by increasing the number of explored controller parameters that generate successful walking gaits within a limited number of experiments. See Extension 2 for a comparison between the walking gaits on paper and PDMS-2, and Extension 5 for the details of the independent learning runs for all the surface adhesion experiments.
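The IQR box summaries used in Figure 7(e) (and later in Figures 8(e) and 9(e)) can be reproduced with standard percentile statistics. The stride lengths below are made-up values for illustration; the 1.5×IQR whisker rule is the conventional box-plot definition and an assumption about the paper's plotting setup.

```python
import numpy as np

def iqr_summary(samples, whisker=1.5):
    """Median, quartiles, whisker extremes, and outliers for a box plot."""
    s = np.asarray(samples, dtype=float)
    q1, med, q3 = np.percentile(s, [25, 50, 75])
    iqr = q3 - q1
    lo, hi = q1 - whisker * iqr, q3 + whisker * iqr
    inliers = s[(s >= lo) & (s <= hi)]
    outliers = s[(s < lo) | (s > hi)]
    return {"median": med, "q1": q1, "q3": q3,
            "whisker_low": inliers.min(), "whisker_high": inliers.max(),
            "outliers": outliers}

# Made-up stride lengths (mm) from one surface's 60-experiment pool.
stats = iqr_summary([0.0, 0.3, 0.8, 1.1, 1.2, 1.4, 1.6, 3.9])
```

For this toy data the median is 1.15 mm and the single high stride of 3.9 mm falls outside the upper whisker, so it would be drawn as an outlier circle.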

4.4.2. Surface roughness.
Next, we test the same robot on five surfaces with the increasing roughness properties reported in Figure 2(b). The walking gait performances achieved during the learning runs for the two extreme surfaces, the P800-grit (R_q = 7.488 μm) and P60-grit (R_q = 77.195 μm) sandpaper replicas, are shown in Figure 8(a)-(d). Similar to the surface adhesion experiments, the mean transfer method improves the learning performance of the standard BO by increasing the number of explored parameter sets that generate non-zero walking gait performances. The comparative performances of the standard BO and the mean transfer method for all the roughness test surfaces are reported in Figure 8(e) and Table 4. See Extension 3 for a comparison between the walking gaits on P800 and P60, and Extension 6 for the details of the independent learning runs for all the surface roughness experiments.

4.4.3. Medium viscosity.
Finally, we test our robot walking in eight different media with the changing viscosities reported in Figure 2(c), and report the results for the two extreme cases, cP1 and cP90, in Figure 9(a)-(d). It can be seen that the ability to explore a wide range of controller parameter sets that generate walking gaits decreases for both BO approaches in the high-viscosity fluids (> 35 cP). However, the mean transfer method still manages to increase the number of successful sets compared with the standard BO for all the media. The overall performances of the standard and mean transfer approaches are reported as box plots in Figure 9(e) and Table 5. These results are consistent with the other two test sets in that the TL method allows the BO to explore more of the controller sets that generate walking gaits despite the changing task space properties. See Extension 4 for a comparison between the walking inside cP1 and cP90, and Extension 7 for the details of the independent learning runs for all the medium viscosity experiments.

Discussion
The displacement measurements from the exhaustive search experiments in Figures 3(a)-(c) show that even though three identical robots are tested with the same controller parameters on the same surface, they generate different walking gait performances. These initial results also confirm the observations related to performance repeatability in our previous work (Culha et al., 2020). Moreover, the influence of the task space on the robot performance can be seen clearly in Figures 3(h)-(j), where the adhesion and roughness differences between the surfaces are reflected. These observations support the necessity of a data-efficient controller learning system that is robust to the robot performance variabilities caused by the material, fabrication, and task environment of small-scale, medical-oriented, untethered soft robots.

Our choice to apply BO and GPs to directly learn the controller parameters of our soft millirobot is based on three aspects. First, BO offers the data-efficient optimization of complex, continuous black-box functions that do not have a closed-form definition. This feature addresses our challenges of not having a deterministic kinematic model for our robot and of needing to achieve optimized walking gaits in a small number of experiments. Second, the investigated function can be represented with GPs within BO, which allows capturing the noise and unknown disturbances. As our robot inherits fabrication-, material-, and interaction-based performance disturbances, GPs provide us with a walking gait function representation that includes these variances. Finally, BO is a global optimization tool that avoids getting stuck at local optima, which is important for exploring the parameter space of the investigated function. Combined with the appropriate TL method, this feature allowed us to overcome local optima while adapting the controller parameters to different robots and task spaces. While we do not claim that BO is the best or only suitable optimization technique here, these three aspects successfully address the controller and modeling challenges of our robotic system and make BO the method of choice for our application.
The experimental results in this study show that our approach of using BO with GPs and TL methods allows data-efficient (i.e., using as few experiments as possible) controller learning that achieves adaptation to different task spaces within a wide range (i.e., on the scale of an order of magnitude) of surface and medium properties. Our main goal is to allow the learning system to explore the controller parameter space to find more of the parameter sets that generate successful walking gaits in response to changing task environments. For this purpose, we configured our BO to favor exploration over exploitation. That is why we do not focus on finding the optimum walking gait controller parameters for each robot or task space in our experiments. Consequently, our current approach does not establish a straightforward correlation between the change of the controller parameters and the changing robot and task conditions. The comparative results between the standard BO and the TL methods show that both approaches can find sub-optimum parameter sets owing to the statistical nature of the learning method; the corresponding results for the task space adaptation experiments are given in Appendix C. However, we propose that TL methods may allow the system to explore a larger portion of the function space in fewer physical experiments, hence achieving data efficiency in learning.
In terms of experimental learning efficiency, the transfer of the prior mean information outperformed the other TL methods in our experiments. The transferred information allowed the BO to start the parameter exploration within the regions of the function space where high-performance results were expected. Therefore, the BO explored the parameter sets that generate optimum walking gaits much faster (see Extension 8 for a sample comparison of parameter selection with the standard BO and the mean transfer method). We see the same effect for the test case of robot 3 walking on P800 in Figure 5(s). Here, this TL method allowed the exploration of the regions with higher expected results and surpassed the exploration boundary of the standard BO. The larger variance in the stride length performances explored by this TL method is caused by this exploration tendency. In comparison, we see that the HP-4 method failed to explore the controller parameters that yield optimum gaits because of the transfer of the signal variance parameter σ_f^2. When σ_f^2 was optimized for a single robot (i.e., robot 3 on paper in our case), it resulted in a smaller value than that used in the standard BO, which eventually decreased the exploration weight in (11). Thus, the BO that started with this transferred parameter value focused more on the exploitation of known regions as soon as it found a parameter set generating non-zero performance, and avoided exploring the unobserved regions for the remaining iteration steps of the learning runs. We see the influence of this hindered exploration in the first column of Figure 5. The transfer of the length scale hyperparameters in HP-ℓ_c was typically ineffective in our test cases because the length scale is bound to the robot geometry and magnetic profile, and we used robots with the same geometries in the experiments.
The hybrid transfer approach, which also included the transfer of the length scale hyperparameters, showed performances similar to the prior mean transfer. These similar results also show the ineffectiveness of the length scale hyperparameters and the dominance of the prior mean information. Although the hybrid approach could be extended by also transferring the variance terms σ_n^2 and σ_f^2 along with the prior mean information, we did not include these alternatives in the TL comparison experiments for two reasons. First, the negative effect of the signal variance σ_f^2 on the exploration capability was already shown and, second, σ_n^2, i.e., the noise in the collected dataset, remained the same for all test cases as we used the same hardware setup for all our experiments. The values of the hyperparameters can also be seen in Appendix C. The choices of the kernels (Wilson, 2014) and hyperparameters (Chen and Wang, 2018) could also be replaced with other methods; however, the systematic analysis of their influence on learning performance is beyond the scope of our current study. In the learning experiments that compare the standard BO and the four TL approaches, we chose to represent the learning performance with the median and IQR instead of the mean and standard deviation (as seen in Figure 4), since the IQR is a robust measure of scale that is less sensitive to outliers in the data. Moreover, unlike the standard deviation, the IQR can represent the skewness in the distribution of the walking performance results, which becomes more apparent as the performance values approach the ends of the possible performance ranges. In addition to its advantages in representing the statistical distribution, the IQR does not report any unachievable result according to the gait definition in Section 2.2.

Conclusion and future work
In this study, we have investigated the use of BO with GPs to experimentally learn the controller parameters for the walking gait of a magnetic soft millirobot. We have created benchmark datasets consisting of 750 experimental results using an exhaustive search to find the walking gait function space for five different test cases. We then used these datasets to compare the effectiveness of four different TL methods that complement the standard BO learning. In these experiments, which involve 104 learning steps for each test case, we have shown that the transfer of the prior mean information improved the BO learning performance the most, both by increasing the number of explored sub-optimum controller parameters and by decreasing the number of required experiments. Based on these findings, we also applied BO learning together with the prior mean transfer method on different task spaces with changing surface adhesion, surface roughness, and medium viscosity. We have shown that controller learning with a BO that utilizes prior mean transfer adapts to new task spaces in a data-efficient way, exploring the function space of the robot in fewer experiments to find a larger group of controller parameters that yield successful walking gaits.
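As an illustration of the mechanism, the following is a minimal one-dimensional sketch in plain NumPy of UCB-based BO with and without a transferred prior mean; the stride-length response and the transferred mean are stubbed toy functions, not our robot models or experimental data:

```python
import numpy as np

def source_prior_mean(x):
    # Hypothetical transferred prior: the GP posterior mean learned on a source
    # robot/task, stubbed as a bump marking the region that performed well there.
    return 0.8 * np.exp(-((x - 0.6) ** 2) / 0.02)

def target_stride(x):
    # Hypothetical stride-length response (mm) of the target robot; its optimum
    # lies near the region the transferred mean points to.
    return float(np.exp(-((x - 0.65) ** 2) / 0.01))

def bo_run(prior_mean, n_iters=5, kappa=2.0, ls=0.1, sf2=1.0, noise=1e-4):
    # UCB-based Bayesian optimization over one normalized controller parameter.
    X = np.linspace(0.0, 1.0, 201)
    k = lambda a, b: sf2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)
    xs, ys = [], []
    for _ in range(n_iters):
        if xs:
            Xt = np.array(xs)
            yt = np.array(ys) - prior_mean(Xt)          # regress residuals about the prior mean
            Kinv = np.linalg.inv(k(Xt, Xt) + noise * np.eye(len(Xt)))
            Ks = k(Xt, X)
            mu = prior_mean(X) + Ks.T @ Kinv @ yt
            var = np.maximum(sf2 - np.diag(Ks.T @ Kinv @ Ks), 0.0)
        else:                                           # no data yet: fall back to the prior
            mu, var = prior_mean(X), np.full_like(X, sf2)
        x_next = X[np.argmax(mu + kappa * np.sqrt(var))]
        xs.append(x_next)
        ys.append(target_stride(x_next))
    return ys

ys_zero = bo_run(lambda x: np.zeros_like(x))  # standard BO (zero prior mean)
ys_tl = bo_run(source_prior_mean)             # BO with a transferred prior mean
print(f"first-trial stride, standard BO   : {ys_zero[0]:.3f}")
print(f"first-trial stride, mean transfer : {ys_tl[0]:.3f}")
```

With a zero prior mean, the first UCB query is arbitrary (the acquisition is flat), whereas the transferred mean steers the very first experiment into the high-performing region, which is the data-efficiency effect described above.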
Our approach is not limited to walking gait learning; it can further be applied to different locomotion and manipulation controllers for soft robots (Chin et al., 2020). In the future, studies focusing on small-scale fabrication with higher magnetization resolution may address the fabrication reproducibility issues (Alapan et al., 2020; Kim et al., 2018; Xu et al., 2019). However, especially for robots designed for biomedical operations, interaction with the dynamic task environment may still degrade the robot's material and performance. For such scenarios, a data-efficient controller learning system may adapt the optimum controller parameters to these changes in the robot. For example, such an approach may be applied to endoscopic soft robots within or outside the gastrointestinal (GI) tract (Son et al., 2020; Yim et al., 2014) using a small number of trials. Our study can be further extended to include design parameters, such as the magnetic particle density of our robots, and to guide task-oriented design strategies for future soft mobile robots. Our approach can also be used to reveal design guidelines that improve the kinematic models of small-scale robots while utilizing CC approximations (Webster and Jones, 2010), analytical models (Renda et al., 2014), and FEMs (Largilliere et al., 2015). However, as the BO we are using is an episodic algorithm, meaning that each suggested parameter set must first be evaluated in an experiment, adapting it to design optimization will require the experiments to be run either in a simulation environment or in an automated rapid fabrication system that can be integrated within the actuation architecture.
The systematic comparison of our experimental approach to alternative optimization and control methods supported by simulations, such as intelligent trial and error (Cully et al., 2015), evolutionary algorithms (Kriegman et al., 2020), or policy gradients (Sehnke et al., 2010), is beyond the scope of our current study but is an interesting task for future work. We believe that the benchmark datasets made available in this study can be used to compare these different methods. Our long-term vision is to build fully autonomous systems that can control, track, evaluate, and optimize soft robots operating in changing complex real-world environments with minimum human involvement.

Appendix C: GP hyperparameters and controller parameter sets
Tables 6-9 list the exact values of the GP hyperparameters used in the TL experiments and the controller parameter sets used in the experiments for the adaptation to task spaces, and Table 10 lists the nomenclature used throughout the article. [Algorithm step: label the robot's motion as walking or not by comparing RobotState throughout the actuation sequence with the four states in Figure 1(f).]

Table 9. Best-performing controller parameter sets and the corresponding stride length averages for the changing viscosity values of the test medium within 20 physical experiments in 3 independent learning runs (60 experiments in total), with and without utilizing the prior information.

Medium type | Controller parameters: B_max (mT), … | Stride length S (mm) (average ± SD)