

Introduction
Numerical simulations are of key importance for the understanding of the behavior of plasmas in a nuclear fusion device. The fundamental model in plasma physics is a kinetic description by a phase space distribution function solving the Vlasov-Maxwell equation.
State-of-the-art kinetic simulations for magnetic confinement fusion are built upon the so-called gyrokinetic model, a reduced model in a five-dimensional phase space. Since the increase in parallel computing power has rendered the solution of the fully kinetic Vlasov equation in six-dimensional phase space feasible, interest in solving the 6d Vlasov equation has arisen recently (cf. e.g. Muñoz et al. (2015); Grošelj et al. (2017); Kuley et al. (2015); Miecnikowski et al. (2018)). For the solution of the Vlasov equation, both grid-based and particle-based methods are commonly used. We distinguish two classes of grid-based methods, Eulerian solvers based on finite volume or discontinuous Galerkin on the one hand, and, on the other hand, semi-Lagrangian methods that update the solution by evolution along the characteristics using interpolation. The latter class of methods has the advantage that usually no time step restriction by a Courant-Friedrichs-Lewy (CFL) condition needs to be imposed. Recently, progress has been made in developing efficient implementations and algorithms for the solution of the Vlasov-Poisson problem with a particle-in-cell method on GPUs (Hariri et al., 2016), semi-Lagrangian methods based on compressed grids (Kormann, 2015;Kormann and Sonnendrücker, 2016;Guo and Cheng, 2016;Einkemmer and Lubich, 2018) and finite differences on adaptive grids (Deriaz, 2015). Solvers for the six-dimensional Vlasov equation have also been developed in the space plasma community. In particular, Yoshikawa, Yoshida, Umemura and coworkers have presented Vlasov-Poisson and Vlasov-Maxwell solvers in six-dimensional phase space based on semi-Lagrangian methods (Tanaka et al., 2017;Yoshikawa et al., 2013). Another example is the hybrid Vlasov-Maxwell (HVM) code based on finite differences (Mangeney et al., 2002;Cerri et al., 2018).
Typically, particle-in-cell methods are used for high-dimensional simulations due to their favorable scaling with the dimensionality. Moreover, the particle pusher is embarrassingly parallel. On the other hand, particle-in-cell methods suffer from numerical noise. Due to the high memory footprint of six-dimensional grids, an efficient parallelization that scales to high-performance computer systems is essential for the success of grid-based algorithms for the solution of a fully kinetic description of a plasma. In this paper, we focus on efficient parallelization schemes for a semi-Lagrangian discretization of the Vlasov-Poisson equation in six-dimensional phase space (neglecting magnetic effects). For a Vlasov-Poisson equation on a four-dimensional phase space, two parallelization schemes have been discussed in the literature: a domain partitioning scheme with patches of four-dimensional data blocks (Crouseilles et al., 2009) as well as a remapping scheme (Coulaud et al., 1999). The idea of the remapping scheme is to work with two different domain partitions, each of which keeps a subset of the dimensions sequential on each processor. While the latter strategy is very well adapted to a semi-Lagrangian method combined with dimensional splitting, its parallel scalability is hampered by an all-to-all communication pattern. For domain decomposition schemes, time step restrictions typically need to be introduced since the interpolation stencil needs to be localized at the boundaries of the computational domains (cf. Yoshikawa et al. (2013, Sec. 2.4) and Crouseilles et al. (2009)). This restriction is particularly severe for magnetized plasmas in a strong guide field, where particles perform a fast gyromotion around the magnetic field lines, causing strong non-locality in the velocity domain. In this paper, we therefore propose the use of a rotating velocity grid. Compared to three-dimensional problems, where domain decomposition schemes are very successful, a domain decomposition in six dimensions requires a much larger degree of communication because the surface-to-volume ratio of a d-dimensional data block increases with the dimension d. A gyrokinetic code, on the contrary, has the additional advantage that the fifth dimension of the model is merely a parameter in the gyrokinetic Vlasov equation, so that communication in the Vlasov step is reduced to four dimensions. This can be particularly advantageous when parallelization over several islands of a supercomputer is considered: If the parameter dimension is the only one split between islands, the comparably slow communication over island barriers can be avoided. Good scaling properties for a semi-Lagrangian solver of the gyrokinetic model have, for instance, been demonstrated for the GYSELA code. A further challenge compared to the gyrokinetic model is the reduced complexity of the fully kinetic equations, which lowers the arithmetic intensity and renders the code more memory-bound. In order to reduce the burden of the high demand on MPI data exchange, we propose a blocking scheme to overlap communication and computation in the advection steps.
The outline of the paper is as follows: In the next section, we introduce the physical model, the semi-Lagrangian method including the rotating mesh for the background magnetic field, and the parallelization schemes. Moreover, we discuss the impact of the parallelization on the interpolation step in the semi-Lagrangian scheme in Sec. 3. In Sec. 4, we describe our implementation of the domain partitioning scheme, followed by a discussion of our performance optimizations in Sec. 5. Sec. 6 compares Lagrange interpolation of various orders and demonstrates the effect of the rotating grid, followed by a numerical demonstration of the scalability of our new implementation in Sec. 7. Finally, Sec. 8 concludes the paper.
The self-consistent field for electrons in a neutralizing ion background can be computed from the following Poisson equation,

−∆φ(x, t) = 1 − ρ(x, t), E(x, t) = −∇φ(x, t), ρ(x, t) = ∫ f(x, v, t) dv. (1)

Here, f denotes the probability density of a particle in phase space defined by position x ∈ D ⊂ R^3 and velocity v ∈ R^3, E denotes the electric field, φ the electric potential, and ρ the charge density. The magnetic field B_0 is supposed to be either zero or a constant background field aligned with the x_3 axis. Generally, the spatial domain is defined by the geometry of a tokamak or similar fusion device. In this paper, we restrict ourselves to common benchmark problems on a periodic box. The distribution has a Maxwellian shape in velocity such that it follows an exponential decay for large values of the velocity. We therefore truncate the computational domain in velocity space to a box and close the system with (artificial) periodic boundary conditions. The characteristic curves of the Vlasov equation are defined by the following system of ordinary differential equations (ODE),

dX/dt = V, dV/dt = E(X, t) + V × B_0. (2)

Let us denote by X(t; x, v, s), V(t; x, v, s) the solution of the characteristic equations (2) at time t with initial conditions X(s) = x and V(s) = v. Given an initial distribution f_0 at time t_0, the solution at time s > 0 is given by

f(x, v, s) = f_0(X(t_0; x, v, s), V(t_0; x, v, s)), (3)

since the distribution function is constant along the characteristic curves.

The semi-Lagrangian method for Vlasov-Poisson
To numerically compute the solution of the Vlasov equation, we use the so-called semi-Lagrangian method. We introduce a six-dimensional grid to discretize the phase space. In each time step, the equations for the characteristics are solved for each grid point backwards in time from time t_{m+1} to time t_m with ∆t = t_{m+1} − t_m small. Then equation (3) is used with s = t_{m+1} and t_0 = t_m to find the solution at time t_{m+1} for each grid point. Since f^(m) is only known on the grid points, some interpolation method is needed to approximate the value of f^(m) at the origin of the characteristics, (X(t_m; x_i, v_j, t_{m+1}), V(t_m; x_i, v_j, t_{m+1})). In this general form, the solution with a semi-Lagrangian method requires the solution of a system of ODEs as well as interpolation.
To efficiently solve the characteristic equations, Cheng and Knorr (1976) proposed a splitting method for the Vlasov-Poisson equation (B_0 = 0) reduced to a 2d (1x-1v) phase space that splits the x and v advections. This yields the following algorithm: Given f^(m) and E^(m) at time t_m, we compute f^(m+1) at time t_m + ∆t for all grid points (x_i, v_j) as follows:
1. v advection over half a time step ∆t/2 using the electric field E^(m).
2. x advection over a full time step ∆t.
3. Computation of the charge density and solution of the Poisson equation, yielding E^(m+1).
4. v advection over half a time step ∆t/2 using E^(m+1).
Note that the electric field is constant during each v advection step. Therefore, the advection coefficients are independent of v for the v advection and independent of x for the x advection, and the characteristics are therefore given analytically. In order to avoid three-dimensional interpolation, we use a cascade interpolation scheme that replaces the three-dimensional interpolation by three successive one-dimensional interpolations. Moreover, we can use a first-same-as-last implementation that merges step 4 of time step m with step 1 of time step m + 1 into a single full v advection step.
As a consequence, the main building block of the split-step semi-Lagrangian discretization of the Vlasov-Poisson problem is one-dimensional interpolation on one-dimensional stripes of the six-dimensional domain. Moreover, the interpolation step on the individual stripes has a special form: The function needs to be interpolated at a shifted value of each grid point, and the value of this shift is constant for the whole stripe. For a stripe of length n with grid points x_i, i = 1, . . . , n, we compute

f_new(x_i) = f_old(x_i + α), i = 1, . . . , n. (4)

Since α is independent of x_i, the interpolation formula is the same in each grid cell, which can be exploited for a vectorized implementation. Advections can also be reduced to one-dimensional interpolation for the Vlasov-Maxwell equation using the backward substitution method introduced by Schmitz and Grauer (2006). In this paper, we focus on the Vlasov-Poisson problem. However, we include strong background magnetic fields and discuss in the next section how they can be integrated into the split-step semi-Lagrangian scheme.
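To illustrate the structure that is exploited for vectorization, the following minimal Fortran sketch updates one stripe with a stencil whose offset and weights are constant along the stripe (a sketch only: names are illustrative, and the periodic index wrap stands in for the halo handling of the distributed code):

subroutine advect_stripe(n, f, gamma, p, w)
  ! Update f according to f_new(x_i) = f_old(x_i + alpha), alpha = (gamma + beta)*dx,
  ! with stencil offset gamma and weights w(-p:p) = l_j(beta) identical for all i,
  ! so that the loop over i vectorizes.
  implicit none
  integer, intent(in)    :: n, gamma, p
  real(8), intent(in)    :: w(-p:p)
  real(8), intent(inout) :: f(n)
  real(8) :: fold(n)
  integer :: i, j
  fold = f
  do i = 1, n
     f(i) = 0.0d0
     do j = -p, p
        f(i) = f(i) + w(j) * fold(modulo(i - 1 + gamma + j, n) + 1)
     end do
  end do
end subroutine advect_stripe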

Split-step semi-Lagrangian method on a rotating mesh
In a magnetic confinement fusion device, the background magnetic field is strong compared to the self-consistent fields and causes a rapid motion around the field lines, the so-called gyromotion. Often the gyromotion is the fastest time scale in the system, and we do not want to resolve this time scale accurately. On the other hand, the rotation around the magnetic axis (here the x_3 axis) causes non-locality in the perpendicular velocity plane (the (v_1, v_2) plane), which is difficult to handle when working with distributed grids.
We therefore use a rotating grid that follows the circular gyromotion of the characteristics around the magnetic axis. Moving grids for Vlasov simulations have previously been discussed by Sonnendrücker et al. (2004).
To this end, we define a logical grid that coincides with the physical grid at the initial time. Let us define the rotation matrix D(t) that describes the rotation of the gyromotion in the (v_1, v_2) plane, where B_0 = B x̂_3. The logical grid then follows the fast gyromotion with gyroperiod ω_c = 2π/B. For a velocity w on the logical grid, the physical value of the velocity at time t_m is given by v(w) = D(t_m)w. Furthermore, let us denote by f^(m) the distribution function on the physical grid at time t_m and by g^(m) the distribution function on the logical grid at time t_m. In the advection steps, we always solve the characteristic equations in physical coordinates and then transform the resulting velocity coordinates back to the logical grid. We again split the x and v advections and solve the corresponding characteristic equations separately, starting at (x, v) at time t_0. This yields the following advection steps on the rotating grid:
v advection: For a given velocity v at time t_{m+1}, the origin of the corresponding characteristic at time t_m is displaced by a vector that does not depend on v. For the v advection, we therefore work with different physical grids at times t_m and t_{m+1}. To find the representation of g^(m+1) at a point w of the logical grid, we first transform to the representation on the physical grid, use the characteristic equation to express it in terms of the solution at time t_m, and finally transform back to the logical grid at time t_m. Note that the resulting displacement D(t_m)^{-1}∆tE on the logical grid does not depend on w. In our implementation, we compute D(t_m)^{-1}∆tE and then use successive one-dimensional interpolations along the three velocity coordinate axes of the logical grid.
x advection: In this step, the transformation between the physical and the logical grid does not change, and the spatial displacement is simply ∆t times the physical velocity v(w) = D(t)w associated with the logical grid point w. Compared to the case with a fixed grid, the displacements in x_1 and x_2 now depend on both v_1 and v_2 and on time. On the other hand, the displacements are still independent of x and we can use successive 1d interpolations along the three coordinate axes.
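A minimal Fortran sketch of the mapping between the logical and the physical velocity grid; we assume here that D(t) is a rotation by the angle θ = 2πt/ω_c about the x_3 axis, the sense of rotation being a convention that depends on the signs of charge and field:

subroutine logical_to_physical(t, omega_c, w, v)
  ! v = D(t) w: rotate the logical velocity w into the physical frame.
  ! The inverse map (physical to logical) applies the transpose, i.e. -theta.
  implicit none
  real(8), intent(in)  :: t, omega_c, w(3)
  real(8), intent(out) :: v(3)
  real(8) :: theta, c, s
  theta = 2.0d0 * acos(-1.0d0) * t / omega_c   ! acos(-1) = pi
  c = cos(theta)
  s = sin(theta)
  v(1) =  c * w(1) + s * w(2)
  v(2) = -s * w(1) + c * w(2)
  v(3) =  w(3)
end subroutine logical_to_physical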

Domain partitioning
Due to the curse of dimensionality, the memory requirements of a grid in the six dimensional phase space are rather high already for coarse resolutions. Therefore, a distributed numerical solution of the problem is inevitable. Two choices of domain partitioning are considered in this paper: • Domain decomposition (Crouseilles et al., 2009): The domain is decomposed into patches of 6d data blocks, each representing a separate part of the domain. The patches are mapped to a 6d logical grid of processors. For interpolations next to the domain boundary, halo cells with a width determined by the interpolation stencil have to be introduced and filled in advance with data from neighboring processors. This classical approach is widely used in parallelizations of lower-dimensional physics and engineering problems, e.g. 2d or 3d computational fluid dynamics.
• Remap (Coulaud et al., 1999): Two decompositions of the domain are introduced, the first one distributing the domain only over the velocity dimensions (keeping the spatial dimensions local to each processor) and the second one distributing the domain only over the spatial dimensions (keeping the velocity dimensions local to each processor). For x advection steps, we use the first decomposition and for the v advection steps the second. In between the steps, the data is redistributed between the two decompositions using an all-to-all communication pattern.
The first strategy has the clear advantage over the remap method that the complexity of the communication pattern is reduced dramatically. On the other hand, the remap scheme is very well adapted to the split-step semi-Lagrangian method since the one-dimensional interpolations can be performed locally once the remapping has been applied. For the domain decomposition method the one dimensional stripes are usually distributed over separate domains. This makes the implementation more complicated and introduces an artificial time-step restriction (similar to a CFL condition) since the interpolation needs information of the function around the shifted data point x i + α in (4).

Solution of Poisson's equation
The focus of this article is on the distributed solution of the 6d Vlasov equation. However, in addition we need to solve the 3d Poisson problem (1). Since this problem is only three-dimensional, the compute time spent on its solution is almost negligible. For this reason, we use a pseudo-spectral solver based on fast Fourier transforms (FFTs) for the solution of the Poisson equation and remap the solution between domain decompositions that are local along the direction where FFTs are performed. In case a full 6d domain decomposition is used (i.e. when the widths of all dimensions of the logical grid of processors are greater than 1), there are several subgroups of processors that span the whole x or v domain, respectively. As a first step, we compute the charge density by a reduction along the velocity dimensions. This involves an all-to-all communication among groups of MPI processes with equal spatial domain. Then, the Poisson equation is solved on each subgroup of processors that spans the whole x domain. By solving the same Poisson problem in each subgroup, we avoid another communication step for redistribution of the computed electric field.
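For reference, on the periodic box the pseudo-spectral solve reduces to simple algebraic operations on the Fourier coefficients of (1); the constant ion background only affects the k = 0 mode, which is fixed by setting the mean of φ to zero:

|k|^2 φ̂(k) = −ρ̂(k) for k ≠ 0,   φ̂(0) = 0,   Ê(k) = −i k φ̂(k),

followed by inverse FFTs to obtain φ and E on the grid.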
Finally, we also include a diagnostics step in our simulations that computes scalar quantities like mass, momentum, and energy and thus involves reduction steps over the full 6d array.

Interpolation on distributed domains
To compute the interpolated values, we can either use nodal interpolation formulas like Lagrange interpolation or global interpolants like spline interpolation. For simulations of the Vlasov-Poisson problem, cubic spline interpolation is most popular since it is well balanced between accuracy and efficiency. In combination with a domain decomposition, we however have to deal with the fact that the stripes are distributed between several processors, rendering a global interpolant impractical due to the required data exchange. Local splines as e.g. considered for the 4D Vlasov-Poisson equation by Crouseilles et al. (2009) are an interesting alternative. However, in this paper, we focus on local Lagrange interpolation.
Let us recall the special structure of the interpolation task arising from our 1d advections: The new value at grid point x_i is given by the interpolated value at x_i + α for some displacement α that is constant over the whole stripe. Let us decompose the displacement α into a multiple γ ∈ Z of ∆x and a remainder β ∈ [0, 1), i.e. α = (γ + β)∆x.
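A minimal Fortran sketch of this decomposition (variable names are illustrative):

subroutine split_displacement(alpha, dx, gamma, beta)
  ! Split alpha into alpha = (gamma + beta)*dx with integer gamma and
  ! beta in [0,1); floor() makes this work for either sign of alpha.
  implicit none
  real(8), intent(in)  :: alpha, dx
  integer, intent(out) :: gamma
  real(8), intent(out) :: beta
  gamma = floor(alpha / dx)
  beta  = alpha / dx - real(gamma, 8)
end subroutine split_displacement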
Depending on the sign of α, the origins of the characteristics for points close to the boundary of the local domain are displaced into a part of the domain that is stored on a neighboring process. Since the interpolation formula needs to be centered around x_i + α, this yields an additional need for halo data points on one side of the domain. In order to keep the data transfer limited and regular, we need to impose a CFL-like condition to restrict the displacement. The number of additional halo points needs to be kept small since each additional halo point requires the exchange of a five-dimensional facet of the six-dimensional hyperrectangle.

Fixed-interval Lagrange interpolation
Lagrange interpolation with a stencil that is fixed around the original data point x_i is a very simple interpolation formula for distributed domains. In this case, the interpolation formula with an odd number q of points is given by

f(x_i + α) ≈ Σ_{j=−(q−1)/2}^{(q−1)/2} ℓ_j(α/∆x) f(x_{i+j}),

where ℓ_j(·) denotes the Lagrange basis polynomial associated with the node x_{i+j}. Figure 1 illustrates on which data points a five-point stencil is based. In this case, we have a static data exchange pattern where (q−1)/2 points are needed from the processors on the left and on the right. On the other hand, we need to require |α| ≤ ∆x for stability, i.e. the scheme is rather restrictive on the time step.
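For equispaced nodes, the basis polynomials can be evaluated directly. A small Fortran sketch (s = α/∆x denotes the normalized displacement; this is an illustration, not the actual SeLaLib routine):

pure function lagrange_weights(q, s) result(w)
  ! Lagrange basis weights l_j(s) for an odd q-point stencil on equispaced
  ! nodes at the integer offsets j = -(q-1)/2, ..., (q-1)/2.
  implicit none
  integer, intent(in) :: q
  real(8), intent(in) :: s
  real(8) :: w(-(q-1)/2:(q-1)/2)
  integer :: j, k
  do j = -(q-1)/2, (q-1)/2
     w(j) = 1.0d0
     do k = -(q-1)/2, (q-1)/2
        if (k /= j) w(j) = w(j) * (s - k) / real(j - k, 8)
     end do
  end do
end function lagrange_weights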

Centered Lagrange interpolation
As an alternative, we consider the Lagrange interpolation for an even number q of points centered around the displaced point x_j + α. Then, the interpolated values are given as

f(x_j + α) ≈ Σ_{i=−q/2+1}^{q/2} ℓ_i(β) f(x_{j+γ+i}),

with α = (γ + β)∆x as above. The choice of the data points for the interpolation stencil is illustrated in Figure 2 for a four-point formula. In this case, we need to exchange a layer of max(q/2 − γ, 0) points with the processor on the left and a layer of max(q/2 + γ − 1, 0) points with the processor on the right. As long as 1 − q/2 ≤ γ ≤ q/2, i.e. |α| ≤ (q/2)∆x, the total number of points that need to be exchanged per stripe is always q − 1. However, γ changes depending on the value of the other coordinates, and for v advections also with time. Therefore, the communication pattern is different for different stripes. On the other hand, for Vlasov-Poisson the largest displacements usually appear for the x advections on grid points with high velocities.
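In code, the required halo widths follow directly from γ and q (a minimal sketch; names are illustrative):

subroutine centered_halo_widths(q, gamma, n_left, n_right)
  ! Number of halo points needed from the left and right neighbor for a
  ! q-point centered Lagrange stencil (q even) with integer cell offset gamma.
  implicit none
  integer, intent(in)  :: q, gamma
  integer, intent(out) :: n_left, n_right
  n_left  = max(q/2 - gamma, 0)
  n_right = max(q/2 + gamma - 1, 0)
end subroutine centered_halo_widths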
For an x_d advection, d = 1, 2, 3, the displacement α = ∆t v_d is very simple and, for constant time steps, constant in time. In case we require |α| ≤ (q/2)∆x (to retain the minimal communication), the domain can be split into q blocks of different ranges of v_d with the same interpolation stencil and, hence, the same data exchange pattern. This way, we can relax the time step restriction but at the same time keep a regular pattern of both data exchange and computations for the advections. Note that this is only true for the case without background magnetic field. The rotation of the grid with the magnetic field changes the displacement of the x advection steps on the logical mesh.
Having detailed the mathematical background, we now turn to a discussion of the aspects and challenges of an efficient implementation and parallelization.

Implementation and parallelization
Representing the six-dimensional distribution function on a grid consumes a large amount of memory. At the same time the arithmetic intensity of a Vlasov solver is relatively low compared to, for instance, a gyrokinetic solver due to the less complex structure of the equations. Therefore, the number of compute nodes needed for a simulation is determined mostly by the memory requirements.
For our implementation, we use object-oriented Fortran (2003), the Message Passing Interface (MPI) for distributed-memory parallelism, and OpenMP directives and runtime functions to add shared-memory parallelism. The developments were made within the framework of the library SeLaLib. The 6d distribution function is discretized on a 6d Fortran array. Since a central idea of our method is dimensional splitting, the advection algorithms exclusively work on one-dimensional stripes of the 6d data. Since these stripes are non-contiguous in memory for any direction but the first dimension, non-contiguous 1d slices are copied from the 6d array into a contiguous buffer before the interpolated values are computed. The performance of these strided memory accesses can be greatly improved by cache blocking, as is discussed in section 4.2.
For the domain-decomposition-based parallelization approach, the 1d stripes are distributed over multiple processes. The following section discusses how the halo data is stored in order to perform the interpolations.

Distributed-memory parallelism
The domain decomposition approach requires a layer of halo cells around the processor-local data points in order to be able to conveniently compute the interpolants. In the discussion below as well as in our Fortran-based implementation, we consider global indices used locally in each MPI process.
A straightforward way to handle the halos would be to include the cells in the 6d array of the distribution function. Synonymously, this can be regarded as working with 6d arrays that overlap between neighboring processors. Padding the 6d array with the halo cells has the advantage that the data is laid out contiguously in memory in the first dimension stripe-by-stripe, i.e. the interpolation routines can work directly on the array in this special case. Stripes along higher dimensions are conveniently accessed via Fortran-style linear indexing; however, one has to keep in mind that the elements are laid out in memory in a strided fashion. It is important to avoid Fortran array slicing operators, which cause temporary arrays to be used. Performance can be improved dramatically by implementing a cache blocking scheme using 2d buffer arrays, see below. Moreover, by using halos there is no need for special treatment of periodic boundary conditions during the advection step.
[Figure 3: Schematic diagram of a data layout example, simplified to two dimensions and distributed over 9 MPI processes labeled p0 . . . p8. The square tiles represent the processor-local parts, the blue and cyan blocks the halo data needed for the advections along the first and the second dimension. The red square shows the data array stored by processor p4 for the two different data layouts. Layout (a) allocates the halo cells as part of the data array; layout (b) allocates the halos as independent arrays. For layout (a), the gray blocks in the corners are unused.]
A second possibility is to allocate the halo cells separately from the 6d array of the distribution function, i.e. there is no index overlap on the 6d array between neighboring processes. Note that in this situation the halo buffers are identical to the MPI receive buffers. Figure 3 shows these two possibilities. In a 6d array which includes the halo cells, one also has to allocate the corner points (displayed gray in Figure 3(a)), even though they are not used by any of the 1d interpolators. As is well known, the volume of a hypercube mostly concentrates in the corners; it is therefore desirable to avoid memory allocation there. Moreover, in case the algorithm uses a different number of halo cells on different blocks of data, the 6d array has to include the maximum number of halo cells in any direction. Therefore we have chosen the second approach, using halo cells allocated separately from the 6d array of the distribution function. It avoids the aforementioned disadvantages and is in particular superior with respect to memory efficiency. Moreover, we can also exploit the fact that we only need the halo cells along one dimension at a time. Once allocated, the halo buffers can be reused.
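A minimal Fortran sketch of the chosen layout (b) with separately allocated halo buffers (sizes and names are purely illustrative and do not reflect the actual SeLaLib data structures):

program halo_layout_sketch
  implicit none
  integer, parameter :: n = 8, w = 3            ! local points per dimension, halo width
  type :: halo_t
     real(8), allocatable :: left(:), right(:)  ! flat buffers of size w*n**5
  end type halo_t
  real(8), allocatable :: f6d(:,:,:,:,:,:)      ! core array without index overlap
  type(halo_t) :: halo
  allocate(f6d(n, n, n, n, n, n))
  allocate(halo%left(w * n**5), halo%right(w * n**5))
  ! The halo buffers double as MPI receive buffers and are reused for the
  ! advections along all six dimensions, since only one direction is needed
  ! at a time.
end program halo_layout_sketch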
If we assume an MPI-process-local grid of size N^6 and a halo width of w cells on each side of the domain, the basic memory requirements for the domain decomposition scheme are N^6 + 2wN^5 (for the core 6d array and one halo buffer of size wN^5 on each side). Note that an additional send buffer of the size of a single halo buffer is needed. In a neighbor-to-neighbor communication, two data blocks of size wN^5 need to be communicated for each one-dimensional advection step. For the remap scheme, on the other hand, two copies of the local N^6 buffer are needed and, in addition, MPI send and receive buffers. Between each block of x and v advections, data blocks of size (1/p)N^6 have to be communicated to each of the p − 1 other MPI processes, i.e. a total fraction of (p − 1)/p of the local data block is sent. In practice the actual memory requirements may be even higher due to MPI-internal buffers. Table 1 compares the theoretical memory requirements for a typical process-local number of points per dimension for the two memory layouts. Note that the remap parallelization uses two 6d data arrays for the two remap data layouts. The comparably low memory consumption of the domain decomposition implementation is especially advantageous on systems where fast memory is a scarce resource, e.g. on certain manycore chips.
Moreover, based on the numbers from Table 1, one can give a straightforward estimate of the resolution possible on a cutting-edge HPC system with ∼ 100 GB of memory per two-socket node. Considering the domain decomposition algorithm and putting two MPI processes per node with 32^6 to 40^6 points each, a grid size of 128^6 to 160^6 would fit on 2048 nodes. Further increasing the resolution, e.g. in velocity space, problems of size 128^3 · 256^3 to 160^3 · 320^3 would fit on 16384 nodes.

Data access for 1d interpolations
On the distributed domain, the advection along a dimension takes the form shown in Algorithm 1, using the example of x_3. Note that for advections along the dimensions 2 to 6 we have to deal with increasingly large strides when the 1d interpolation buffers are filled, causing a severe performance penalty due to cache misses, as confirmed by profiling.
The cache efficiency and the resulting performance can be greatly improved by a cache-blocking scheme similar to the schemes used to accelerate, e.g., dense linear algebra operations. The blocking is based on a 2d buffer array. Interpolations are performed along the first (contiguous) dimension. In the second dimension the array is large enough to store at least a cache line of data. Algorithm 2 summarizes the loop rearrangements. A similar blocking has been implemented for all advection steps. For the advections along x_2 to x_6, we extract the one-dimensional stripes in blocks along the first dimension.
  Copy strided data to contiguous send buffer;
  MPI communication of halos;
  for i6 do
    for i5 do
      for i4 do
        for i2 do
          for i1 do
            Copy 1d stripe over i3 from 6d array and halo buffers into buffer;
            Interpolation along x3;
            Copy 1d stripe with updated values back to 6d array;
          end
        end
      end
    end
  end
Algorithm 1: Advection along x_3 without cache blocking.
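The following Fortran sketch illustrates the blocking idea behind Algorithm 2 for the x_3 advection (a simplified illustration: halo handling and the actual interpolation are omitted, and names and the blocking size are made up for this sketch):

subroutine advect_x3_blocked(f, n, nblk)
  ! 2d buffer: the first dimension is the interpolation direction (i3), the
  ! second holds nblk consecutive i1 values (at least a cache line), so that
  ! the strided accesses to the 6d array reuse whole cache lines.
  implicit none
  integer, intent(in)    :: n(6), nblk
  real(8), intent(inout) :: f(n(1), n(2), n(3), n(4), n(5), n(6))
  real(8), allocatable   :: buf(:, :)
  integer :: i1, i2, i3, i4, i5, i6, b, nb
  allocate(buf(n(3), nblk))
  do i6 = 1, n(6)
   do i5 = 1, n(5)
    do i4 = 1, n(4)
     do i2 = 1, n(2)
      do i1 = 1, n(1), nblk
        nb = min(nblk, n(1) - i1 + 1)
        do i3 = 1, n(3)                 ! gather: the inner loop is contiguous in memory
          do b = 1, nb
            buf(i3, b) = f(i1 + b - 1, i2, i3, i4, i5, i6)
          end do
        end do
        ! ... 1d interpolation of each column buf(1:n(3), b) at the shifted
        !     points, using halo data near the domain boundary (omitted) ...
        do i3 = 1, n(3)                 ! scatter the updated values back
          do b = 1, nb
            f(i1 + b - 1, i2, i3, i4, i5, i6) = buf(i3, b)
          end do
        end do
      end do
     end do
    end do
   end do
  end do
  deallocate(buf)
end subroutine advect_x3_blocked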

Shared memory parallelization
Both implementations, remap and domain partitioning, are carefully parallelized using OpenMP directives and runtime functions to exploit the shared-memory architecture of prevalent multicore CPUs using threads, in addition to the distributed-memory parallelization which employs MPI processes. A significant advantage of introducing a hybrid parallelization in addition to MPI is that it allows us to reduce the memory consumption and the communication volume significantly. Instead of running one MPI process per available processor core, each allocating and exchanging halo cells, it is preferable to launch only one or two MPI processes per socket, each with a proportionate number of threads pinned to the cores of that socket. All threads thereby share the halo cells.
Let us illustrate this effect with a simple numerical example. Consider a 64^6 simulation using 7-point Lagrange interpolation that shall be run on 64 compute nodes with 64 cores each, implying a grid size of 32^6 points per node. If a plain MPI setup is chosen, each node would run 64 MPI processes with a local grid size of 16^6 points, for a total of 4096 MPI processes. On the other hand, we might consider an (extreme) hybrid setup running only 1 MPI process per node with a local grid size of 32^6 points. The 64 processes of the plain MPI setup would allocate 11 GB of memory per node for the distribution function and the halo cells, and communicate 36 GB per time step, 18 GB of which go beyond the node over the network. In contrast, the hybrid setup would require 9.5 GB per node and communicate 18 GB over the network. We conclude that, first, it may be advantageous to use as few MPI processes as possible from the memory and communication point of view. Second, while the hybrid setup eliminates intra-node communication, the inter-node communication volume stays the same compared to the plain MPI case, with larger message sizes though.
A potential disadvantage of a naive hybrid approach is that a significant fraction of the threads would be idle during the data-intensive halo exchanges. However, by introducing an advanced pipelining scheme we are able to hide the communication times to a large extent, as is discussed in the following section.

Performance optimization
Performance optimization work aims at maximizing the node-level performance simultaneously with the parallel scalability, which are conflicting goals to some degree. In the scope of this paper, we target recent x86_64 systems with multi-core or many-core CPUs as found in the vast majority of today's HPC systems. Details on the systems under closer consideration are given in Table 2 in the following section where performance results will be discussed. Fig. 4 shows a breakdown of the costs of the various operations involved during a time step of the domain decomposition solver running on a single core without any parallelism. The profile is clearly dominated by the advection computations ("A") including the Lagrange interpolation. Going from the direction of the first to that of the sixth dimension, the cost of the advection monotonically increases. This effect is caused by the fact that memory accesses become more and more strided. It is important to note that the effects of the striding are already mitigated by the cache blocking scheme which preserves a cache line, once loaded. Moreover, the prefetch efficiency of the processor appears to deteriorate with increasing strides. The preparation of the halo buffers ("H"), which also involves copies from strided into contiguous memory (however of much less data compared to step "A"), is by far less time-consuming. However, neighbor-to-neighbor MPI communication is included in "H", which becomes important when the parallelization spans multiple nodes. Finally, the Poisson-solve step and the diagnostics account for roughly 4 and 2 percent of the total runtime, respectively.
[Figure 4: Cost breakdown of one time step with Lagrange interpolation on a single Haswell core, where the letter "A" denotes an advection, "H" a halo-exchange operation, "P" the Poisson solve, and "D" the diagnostics computation; the number indicates the direction.]

Performance profile
In addition to simple and lightweight timing facilities, we used the tools Amplifier and Advisor from the Intel Parallel Studio XE package to obtain low-level information on the performance and limitations of various parts of the code. Based on that information, code improvements were implemented, the most important of which are detailed below.

Single-core performance
Without considering MPI communication, the major factor limiting the performance of both 6d Vlasov implementations, remap and domain decomposition, is the fact that the vast majority of the memory accesses, except the ones along the first dimension, are strided. A cache blocking scheme mitigates this issue significantly, as illustrated by the performance numbers below. Nevertheless the code remains memory-bound. The increase in run time from "A1" to "A6" in Fig. 4 reflects the increasingly strided memory accesses.
In addition, SIMD vectorization is a key factor to achieve performance on modern CPUs. While in early days (SSE2) only a factor of two was lost when vectorization was ignored for double-precision operations, the potential loss has grown to a factor of 4 (AVX) or 8 (AVX512) on more recent CPU models. We have implemented Lagrange interpolation routines such that the compiler is able to generate vectorized code which we verified carefully using compiler reports and performance tools. Arrays are aligned to 64 byte boundaries, though the large 6d array of the distribution function is not padded in order not to waste memory. In general, the compiler is able to generate vectorized code for most of the loops.
Running 100 time steps of a 16^6 (32^6) case with 7-point Lagrange interpolation on a single Skylake core, the domain decomposition code achieves a floating point operation rate of 3.9 (2.8) GFLOP/s, which translates to 2.5 (2.5) GFLOP/s on the Haswell core. This value represents about 6.8% of the Haswell core's theoretical peak performance. Note that the measurement covers the complete run including startup and shutdown phases and includes inevitable memcopy operations that do not perform any FLOPs at all. Around 90% of the floating point instructions issued are vectorized. These numbers once more illustrate the main performance challenge of 6d Vlasov codes, resulting from memory-boundedness due to strided memory accesses in combination with a moderate arithmetic intensity.
Finally, to quantify the effect of the cache blocking algorithm: with cache blocking disabled, the aforementioned test runs with 100 time steps take in total about a factor of 2.4 longer on the Haswell core for both cases. The higher the dimension along which the interpolation is performed, the more effective and important the cache blocking becomes in general, accelerating certain parts of the code, such as the loop over x_6, by up to a factor of 20, as measured using performance profilers.

Node-level performance
A modern compute node provides several cores that are organized in non-uniform memory access (NUMA) domains, such that groups of cores share L3 caches and the memory channels these cores can access fastest. Optimizing for the NUMA domains by careful process and thread pinning at runtime turns out to be important. Typically, MPI processes are pinned to sets of cores on sockets, and threads are pinned to individual physical cores from these sets. When overlapping communication and computation as introduced in the following subsection, it turns out to be advantageous to pin only the processes, constraining the threads to NUMA domains, and, in addition, to use hyperthreads.
As the typical structure of the code is a 6-fold nested loop, a standard loop-based OpenMP parallelization strategy proved rather successful. From benchmark measurements at various resolutions, it was concluded that the runtime is minimized when the outer two loops are collapsed into a single one to increase the granularity of the parallel workload, and when static loop iteration scheduling is used. Each thread is able to benefit from cache blocking and vectorization in the inner loops. Virtually every workload in the code is parallelized using this technique. As a result, the application scales well over a single node in pure OpenMP, as shown in Sec. 7.
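A minimal sketch of this loop-level strategy for the x_3 advection (the loop body is only indicated; names are illustrative):

subroutine advect_x3_omp(f, n1, n2, n3, n4, n5, n6)
  implicit none
  integer, intent(in)    :: n1, n2, n3, n4, n5, n6
  real(8), intent(inout) :: f(n1, n2, n3, n4, n5, n6)
  integer :: i1, i2, i4, i5, i6
  ! Collapse the two outermost loops to increase the granularity of the
  ! parallel workload; static scheduling keeps the work distribution regular.
  !$omp parallel do collapse(2) schedule(static)
  do i6 = 1, n6
    do i5 = 1, n5
      do i4 = 1, n4
        do i2 = 1, n2
          do i1 = 1, n1
            ! gather the stripe over i3, interpolate, scatter back
            ! (cache-blocked as sketched above)
          end do
        end do
      end do
    end do
  end do
  !$omp end parallel do
end subroutine advect_x3_omp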

Distributed-memory performance
Semi-Lagrangian 6d Vlasov solvers are unavoidably intense in terms of memory and data communication volume (cf. Sec. 4.1). For the domain decomposition approach, typical sizes of single MPI messages are in the range of O(0.1)-O(1) GB, while modern interconnects achieve a bandwidth of up to ≈ 10 GB/s per node. It is therefore important to mitigate the cost of the data transfers as much as possible: by careful planning of the process grid, by overlapping communication and computation, and in addition by means of data reduction.
As outlined in Algorithm 2, each advection step starts with MPI communication in order to fill the halo buffers in the neighbor processes, before the interpolations are performed. In hybrid-parallel setups the initial MPI communication would only keep a single thread busy while all the other threads were idle. The trend towards systems with increasingly more cores per socket suggests to use multiple threads per MPI process in order to overlap ("pipeline") the communication with useful computation.

Simultaneous communication and computation.
In order to implement pipelining of communication and computation, we block the data along a dimension different from the one we intend to interpolate along, and perform data exchange and computation simultaneously on the resulting independent blocks, as illustrated in Fig. 5. Here, we consider the Lagrange interpolation with fixed interval because in this case we do not have to handle additional blocking due to asymmetric data exchange. Moreover, in order to avoid a second layer of blocking, we do not consider blocks with different halo patterns. In any case, the overhead introduced by not minimizing the halo widths for some blocks is less problematic when the communication is overlapped with computations.
For each data block, the advection computation consists of three steps: copy the (generally non-contiguous 6d) data into a contiguous buffer (C); MPI sendrecv() communication (M); computation of the interpolated values (I).
For each data block, these three steps need to be performed in the given order, but there is no dependency between different blocks. Nevertheless, we have to enforce some ordering in order to avoid a capacity overload of resources such as the maximum number of simultaneous hardware threads. Given the 3-fold structure of the advection, we propose a straightforward pipelining scheme using 3 lanes as shown in Fig. 6.
In order to keep the overhead of the start-up and the final phase as small as possible, we overlap communication and computation of advections for different dimensions as well.
To give an example let us start with the x 1 advection. Once we have reached the communication step ("M") of its last block, we can already start to copy ("C") and communicate ("M") the data needed for the following x 2 advection in the first data block-provided that the x 2 dimension is contiguous in each block.

Implementation details on the pipelining scheme.
Our pipelining implementation relies on a thread-safe MPI library and on OpenMP threads, requirements that are met by most modern compilers and libraries. The steps "C", "M", and "I" can be regarded as tasks with interdependencies.
In the following, we provide details on the implementation, referring to Algorithm 3. Initially, a list of all blocks is built, where each list element contains meta data such as block indices, the number of nested threads for the steps "C" and "I", and two OpenMP locks, one for the "C" step and one for the "M" step. In the scheme proposed in the following, the orchestration of the tasks is explicitly controlled using a parallel region with a fixed number of three threads, which uses internal loops with static scheduling and a chunk size of 1 such that the ordering is deterministic. For N blocks, thread i, with i = 0, 1, 2, manages the work on the blocks ℓ = i + 1 + 3k, k = 0, . . . , ⌊(N − i − 1)/3⌋, performing the steps "C", "M", and "I" one after the other. In the steps "C" and "M", we use nested OpenMP parallelism to make use of all the available threads. Apart from managing locks and calling the tasks, the outer loop does nothing. In order to make best use of the network resources, we ensure that the resources are dedicated to one "C" and one "M" operation at a time, i.e. only a single communication "M" is allowed to take place at a time. Therefore, thread i initially locks all the "C" and "M" operations of thread mod(i + 1, 3) and only releases the lock on "M_{ℓ+1}"/"C_{ℓ+1}" when it has finished "M_ℓ"/"C_ℓ". As a result, the pipelining scheme with overlapping tasks as shown in Fig. 6 arises.
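The following Fortran sketch illustrates the locking scheme in a stripped-down form (three threads, placeholder work routines, no nested parallelism and no actual MPI calls; a schematic illustration rather than the actual SeLaLib implementation):

program pipeline_sketch
  use omp_lib
  implicit none
  integer, parameter :: nblocks = 9           ! illustrative number of data blocks
  integer(omp_lock_kind) :: lock_c(nblocks), lock_m(nblocks)
  integer :: l, tid

  do l = 1, nblocks
     call omp_init_lock(lock_c(l))
     call omp_init_lock(lock_m(l))
  end do

  !$omp parallel num_threads(3) private(tid, l)
  tid = omp_get_thread_num()
  ! Pre-lock the "C" and "M" steps of the successor of each of this thread's
  ! blocks (they belong to thread mod(tid+1,3)); the very first block stays free.
  do l = tid + 1, nblocks, 3
     if (l + 1 <= nblocks) then
        call omp_set_lock(lock_c(l + 1))
        call omp_set_lock(lock_m(l + 1))
     end if
  end do
  !$omp barrier
  do l = tid + 1, nblocks, 3
     call omp_set_lock(lock_c(l))             ! wait until "C" of block l-1 is done
     call omp_unset_lock(lock_c(l))
     call copy_block(l)                       ! "C": pack strided data into send buffer
     if (l + 1 <= nblocks) call omp_unset_lock(lock_c(l + 1))
     call omp_set_lock(lock_m(l))             ! wait until "M" of block l-1 is done
     call omp_unset_lock(lock_m(l))
     call exchange_block(l)                   ! "M": neighbor-to-neighbor halo exchange
     if (l + 1 <= nblocks) call omp_unset_lock(lock_m(l + 1))
     call interpolate_block(l)                ! "I": compute the interpolated values
  end do
  !$omp end parallel

contains
  subroutine copy_block(l)
    integer, intent(in) :: l
    ! placeholder for gathering the halo data of block l into a contiguous buffer
  end subroutine copy_block
  subroutine exchange_block(l)
    integer, intent(in) :: l
    ! placeholder for the MPI sendrecv() of the halo data of block l
  end subroutine exchange_block
  subroutine interpolate_block(l)
    integer, intent(in) :: l
    ! placeholder for the 1d interpolations on all stripes of block l
  end subroutine interpolate_block
end program pipeline_sketch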
At runtime, we have to specify two parameters that influence the performance of the implementation: the number of blocks and the number of threads used by each of the "C" and "I" tasks. The smaller the blocks are chosen the shorter the start-up and finishing phases become during which communication and computation cannot be overlapped. On the other hand, the blocks should not be chosen too small in order to keep the overhead low. As the second parameter, one needs to decide how to assign the available threads to the "C" and "I" steps. Note that the first "C" and the last "I" step may use more threads since they do not fully overlap with other operations. It turns out that the use of simultaneous multithreading ("hyperthreading") in concert with the pipelining is beneficial. A detailed discussion including benchmark results is presented in Section 7.
An obvious choice for the advections in x is to block the 6d distribution function along the 6th dimension, v_3, which corresponds to the slowest varying loop index, resulting in blocks that are laid out contiguously in memory. Thus our pipelining scheme combines the three x advection steps without any inner global synchronization point. In the same way, the block of v advections can be combined; here, the blocking of the data is performed along x_3.

Optimization of the halo-exchange communication.
As pointed out before, the 6d Vlasov solver is a highly communication-intensive application, rendering good large-scale parallel scalability a difficult goal to achieve. From our experiments, we find that the implementation can scale well even when spanning multiple islands of an HPC system, provided that the process grid is laid out in an optimal way and communication and computation are pipelined.
The HPC systems under consideration have the network topology of a pruned fat tree. An island of the system is designed such that each node of the left half of the island is theoretically able to communicate with the corresponding node of the right half of the island simultaneously at the same bandwidth, termed the bisectional bandwidth. Going beyond an island, the bisectional bandwidth is reduced. Here, the blocking factor is decisive, which is, e.g., 1:4 on a system used below, meaning that the inter-island bisectional bandwidth is only a quarter of the intra-island one. We are therefore challenged with at least four levels of interconnection speeds between MPI processes, namely communication via shared memory on the same socket, intra-node communication via shared memory between different sockets, intra-island and inter-island communication via the network.
In addition to the pruned fat tree topology, there are other topologies, e.g., high dimensional torus interconnects. We would expect the 6d domain decomposition algorithm to benefit highly from such networks because a reduced bisectional bandwidth is avoided, whereas on a pruned fat tree it turns out to be one of the major challenges to scalability.
It is crucial to optimize the Vlasov solver, and to choose the problem setup, for the network topology of the machine as well as possible, so that the largest messages travel over fast communication paths. This is discussed in the following.

Process grid optimization.
The ordering of the 6d process grid can be chosen such that communication between remote processes is kept as small as possible. A batch system lays out the MPI processes in a certain pattern, often placing the ranks consecutively on the nodes, one island after the other. When constructing the 6d process grid, the MPI ranks are placed in row-major (C) order, indexing processes p like p[i, j, k, l, m, n], with n being the fastest varying index. Consequently, the neighboring processes are closest in x_6, direction n, whereas they become increasingly separated on the network towards x_1, direction i. Depending on the shape of the process-local part of the 6d grid and on the (potentially different) interpolation orders along x and v, the process grid layout can be chosen such that inter-island communication is minimized. Despite this optimization, certain directions still communicate mainly between islands. In our experiments, it turned out to be beneficial to transpose the grid such that neighboring processes are closest in x_1. In this case, the processor groups that solve the Poisson problem are close. On the other hand, the reductions over velocity to compute ρ combine more remote processes. When overlapping communication and computations, this ordering is especially advantageous since the communication is more expensive the larger the stripe, i.e. the longer it takes to gather the data for the advection.
To go one step further, we have experimented with communication patterns between islands that are more evenly balanced over the course of a time step. This is done by mapping consecutive MPI ranks onto blocks of 2^6 processes, i.e. by rearranging the 6d process grid completely. Our implementation uses known hostname schemes of HPC systems to perform the rearrangement. As a consequence of such a rearrangement, the advection in each direction would perform some fraction of its communication between islands, which can mitigate situations where, without rearrangement, only one or a few directions communicate between islands and do not hide well behind computation. However, within the scope of this paper we did not enable such blocking because it did not turn out to be beneficial, the reason being that the per-process computational workload (the local partition of the 6d hypercube) was typically too small in relation to the halos.
It is important to point out that these process grid optimizations do not reduce the total amount of data that is to be communicated. We will turn towards possibilities to reduce the communicated data volume in the following section.

Reduction of the communication volume.
The MPI messages sent between neighbor processes to fill the halos use by default 8-byte-wide double precision numbers. We have implemented the option to halve the communication volume and time by sending these messages in single precision, leading to an improved parallel scalability especially in inter-island scenarios. However, one has to keep in mind that an additional error is introduced into the computation, which requires careful validation. We therefore consider single-precision messages an experimental tool.
When running the pipelined code in a hybrid fashion it turns out that, depending on the balance between threads and processes, a fraction of the threads is idle waiting for communication to finish. To continue along the lines of single-precision messages, we have therefore experimented with floating point compression in order to further reduce the communication volume and time while utilizing the threads that would otherwise be idle. Naturally, a simple stream compression algorithm is not suitable to compress floating point data; rather, a lossy algorithm tailored towards numerical data is required. The ZFP floating point compression algorithm, of which an implementation is freely available as a library, compresses blocks of 64 double precision numbers, taking the desired precision as a parameter (Lindstrom, 2014). The compression ratio achieved depends on the similarity of the numbers in the blocks. Adjusting the precision to match single precision, we typically observe a compression factor of ∼ 4 for the halo data, improving parallel scalability over islands significantly. The compression step itself, however, is rather expensive, as will be shown briefly in the following. The compression option might become more relevant in the future when the per-node computational power continues to grow faster than the network bandwidth.
In section 7, we present performance studies, detailing on the various challenges and the respective optimization approaches to tackle them.

Numerical experiments
In this section, we consider representative benchmark problems with and without a background magnetic field and discuss accuracy and CFL-like conditions for the various Lagrange interpolations on distributed grids.

Vlasov-Poisson simulations
We first consider two test problems with no magnetic field, weak Landau damping, where the chosen resolution yields rather good accuracy, and a bump-on-tail instability with larger phase-space error, and compare the accuracy of the Lagrange interpolators of various order.
The initial value of the weak Landau damping problem is a Maxwellian velocity distribution with a small perturbation of the spatial density. The grid resolution is chosen to be 16^3 × 64^3 and we consider the error in the electrostatic potential energy in the time interval [0, 30] compared to a reference simulation on a grid with 20^3 × 80^3 points and a time step of ∆t = 0.005 with Lagrange interpolation of order 8 in space and 7 in velocity.
The bump-on-tail test case starts from a velocity distribution with a bump in the tail and a small perturbation of the spatial density; it is solved on a mesh with 32 points per direction until time 15. The reference solution is computed on a mesh with 40 points per direction, a time step of ∆t = 0.0125, and Lagrange interpolation of order 8 in space and 7 in velocity.
Figure 7 shows the error in the electrostatic energy as a function of the time step for various orders of the Lagrange interpolator for the weak Landau damping example. Comparing the curves, we see that the error for the largest time step ∆t = 0.3 is almost the same for all considered interpolation formulas, i.e. the temporal error dominates. Since we use a second-order Strang splitting method, the error decreases proportionally to ∆t^2 as we reduce the time step, until the interpolation error starts to dominate. The lower the order of the interpolation stencil, the earlier this happens. Note that, once we have reached a time step where interpolation errors dominate, a further reduction of the time step may even yield an increase of the error.
For this example, the displacement of the x advection exceeds one cell size for time steps above 0.13. Therefore, fixed Lagrange interpolation (odd order) is only stable for time steps smaller than 0.13. On the other hand, centered Lagrange interpolation shows good results for time steps beyond this value. The results also show that, for accuracy reasons, the time step should be chosen such that the displacement is on the order of the cell size or a bit above. Hence, a combination of centered Lagrange interpolation with blocked communication as described in Section 3.2 for the x advections and fixed Lagrange interpolation for the v advections is an efficient choice for the Vlasov-Poisson equation.
Figure 8 shows the results for the bump-on-tail test case. In this case, the spatial error is larger, which is why time steps larger than ∆t ≈ 0.073 (the maximum time step for which the displacement is bounded by one cell size) are advantageous. As in the previous case, we see that order 5 does not give satisfactory results and that Lagrange interpolation of order 6 in x and 7 in v gives the best results. Note also that the absolute error is shown in the figure and that the maximum value of the electric field energy in the considered interval [0, 15] is about 205.
Figure 9 shows the accuracy as a function of the total CPU time for a simulation on a single node of the DRACO cluster with 32 MPI processes for the Landau case. We can see that the computing time increases with the order, but the increase is very small. On the other hand, the increasing number of halo cells for increasing order will impact the performance more strongly when MPI communication between nodes is involved.

Simulations with rotating mesh
As an example of a simulation with a strong background magnetic field B_0, let us consider an initial value given by a Maxwellian in velocity with a small perturbation of the spatial density of amplitude α and wave numbers k_⊥ in the perpendicular and k_∥ in the parallel direction. In this case, the gyroperiod is given by ω_c = 2π/B. In order to understand the time-step restrictions of our distributed-memory parallelization with the rotating grid, we estimate the maximum displacement in the various advection steps. The displacement of an x_1 advection over a time step ∆t is bounded by ∆t·sqrt(v_{1,max}^2 + v_{2,max}^2). Here, we use v_{i,max} to denote the (in modulus) largest value of the velocity on the computational grid. For the displacement of the x_2 advection, the same estimate can be derived. The displacement of the x_3 advection is given by ∆t·v_3 and can thus be bounded by v_{3,max}∆t. The displacement of the velocity advections depends on the electric field. If the electric field is damped in time, we can estimate it by the field induced by the initial perturbation, which is of magnitude αk_⊥/(k_⊥^2 + k_∥^2) in the perpendicular and αk_∥/(k_⊥^2 + k_∥^2) in the parallel direction. Let us consider the following parameters: B = 20π (ω_c = 0.1), k_⊥ = k_∥ = 0.5, α = 0.01. Using 16 grid points along each spatial dimension and 32 points along the velocity dimensions as well as a velocity domain limited to [−6, 6], the grid spacing takes the values ∆x_i = 4π/16 = π/4 and ∆v_i = 12/32 = 0.375. The linear dispersion relation gives ω = 1.1900 − 0.2843i as the coefficient in the temporal Laplace transform, i.e. the electric field is damped. Figure 11 shows the solution with time step ∆t = ω_c/20 compared to the solution with frequency and damping rate predicted by the dispersion relation. We see that the simulation is in good agreement with the dispersion relation, except for the first oscillation, as usual.
The maximum displacement along the x_1 and x_2 directions is bounded by ∆t·6√2 and hence the displacement is restricted to one cell size if ∆t ≤ 9.256·10^−2, i.e. slightly below one period of the gyration. On the other hand, the displacement of the velocity advections is bounded by 0.01∆t, such that the displacement is smaller than the grid size for all ∆t ≤ 37.5. Figure 10 shows the electrostatic energy from simulations with various time steps. In the x advection steps, we use centered Lagrange interpolation with 6 points and in the v advections a fixed 7-point Lagrange interpolation formula. Note that we use symmetric halo cells of sufficient size for the x_1, x_2 directions since a static blocking as in the pure Vlasov-Poisson case is not possible on the rotating mesh. The results show that rather good results can be obtained for time steps on the order of the gyroperiod. In particular, we can use time steps close to ω_c/2, which would yield a completely nonlocal stencil if we did not rotate the mesh. However, the time step cannot be a multiple of ω_c, since then the magnetic field cancels out in the propagator on the rotating mesh.
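For reference, the two bounds follow directly from the grid spacings given above:

∆t · 6√2 ≤ ∆x_1 = π/4  ⟺  ∆t ≤ π/(24√2) ≈ 9.26 · 10^−2,
0.01 · ∆t ≤ ∆v_1 = 0.375  ⟺  ∆t ≤ 37.5.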

Performance benchmarks
To systematically compare and evaluate the performance and the scalability of the remap and the domain decomposition implementations, a series of runs was performed, going from a single compute node to an HPC cluster, and further up to multiple islands of a supercomputer. For most of our basic tests and during iterative performance optimization work, we used a two-socket Intel Haswell-type node, the building block of the DRACO HPC cluster at MPCDF. Moreover, an Intel Xeon Phi KNL node was available, running in "flat" mode with the 16 GB of MCDRAM available as a separate memory domain; all runs performed on the KNL used the fast MCDRAM exclusively. Finally, we present large-scale runs performed on the SandyBridge partition of the SuperMUC HPC system of the Leibniz Supercomputing Centre, covering up to 8 islands with 64k physical cores. Table 2 provides details on the specifications of the compute nodes. The Landau damping test case was chosen for each run. We consider 7-point Lagrange interpolation for both the x and v advections, since it proved to be a good choice in terms of accuracy and flexibility according to the numerical comparison presented in the previous section. Note that the 6-point centered Lagrange interpolation has the same halo width and therefore shows similar performance. Any wall clock time given refers to the computation of 5 time steps. The initial time step is excluded from the time measurement to compensate for the initialization overhead of the MPI library, which can be significant.
To compile and link the code, recent versions (16 and 17) of the Intel compiler were used throughout this work, with the optimization flags set to -O3 -xHost -ipo-separate -qopenmp. On DRACO and on the Xeon Phi node, Intel MPI is used, whereas on SuperMUC, IBM PE is used. We first look at the performance on a single node, focusing on both process (MPI) and thread (OpenMP) parallelism.

Node-level performance
To evaluate the performance of the OpenMP-based parallelization in comparison to MPI we present scalings on a single compute node in this section.
Each run was repeated 5 times to average out small variations between individual runs, which were found to be on the order of up to 5%. A problem size of $32^6$ was chosen since it is a regular hypercube and represents a reasonable (though moderately large) per-socket workload on the relevant systems, the net size of the distribution function being 8 GB in double precision.
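The quoted memory footprint follows directly from the grid size (with $1\,\mathrm{GiB} = 2^{30}\,\mathrm{B}$):
$$32^6 \cdot 8\,\mathrm{B} = 2^{30} \cdot 8\,\mathrm{B} = 8\,\mathrm{GiB} \approx 8.6\cdot 10^{9}\,\mathrm{B}.$$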
Moreover, a $32^6$ hypercube fits completely into the high-bandwidth memory of the KNL node, at least for the domain decomposition implementation mainly considered in this paper. The same or a comparable per-socket workload will be used for the large-scale runs presented below. The aim of this section is to show that the domain decomposition implementation delivers excellent parallel performance on a single node with both MPI and OpenMP. This feature paves the way to efficient hybrid-parallel large-scale simulations on many-core distributed-memory HPC clusters; such benchmark runs will be discussed in the next section. Figure 12 shows strong scalings on the Haswell-type node. For both the domain decomposition and the remap case, a scan in the number of MPI tasks keeping the number of threads fixed to 1 and, for comparison, a scan in the number of OpenMP threads with a single MPI process are shown.
For each series the optimum pinning strategy is chosen, which has been determined experimentally a priori. The MPI processes are pinned in a round-robin fashion between the two sockets. In this way, the memory bandwidth of both sockets is used as soon as more than one process is run, and only MPI messages are transferred via the link between the sockets. The threads for the OpenMP scan, on the other hand, are pinned in a compact fashion to physical cores, filling the first socket completely before going to the second socket. The compact OpenMP thread pinning strategy reduces the traffic over the link between the two NUMA domains significantly. Note that a thread-aware first-touch memory allocation and handling is not possible when traversing the 6d array of the distribution function in all 6 directions. Fig. 12 shows that for the domain decomposition runs both the MPI and OpenMP results are very similar and scale virtually ideally up to 8 cores. They continue to scale well up to the full 32 physical cores of the node. Adding hyperthreads does not improve either case; the wall clock times even rise marginally, indicating that the CPU pipelines are already used efficiently.
The remap code performs significantly worse than the domain decomposition implementation. Running on the full node with 32 MPI processes, it is about a factor of 2.5 slower. Moreover, it scales worse, achieving a speedup in MPI of about 11.2 compared to about 18.8 for the domain decomposition code. With OpenMP, the remap code does not scale well, simply because the remap operation takes place within the MPI library as a (trivial) all-to-all operation which is inherently not threaded. The latter finding is one of the main reasons why, for a fair comparison, we mostly present pure MPI or only weakly hybrid runs (using 2 hyperthreads per process, for example) in the sections on medium- and large-scale runs below when comparing the remap to the domain decomposition codes.
For comparison, and to include results from a many-core platform, we show numbers for the identical $32^6$ test case from domain decomposition runs performed on the Xeon Phi node. As seen in Fig. 13, the strong scaling curves follow ideal scaling up to 16 cores in both MPI and OpenMP, and continue to scale well up to the 64 available physical cores. Scaling further into the hyperthreading regime gives some benefit with 2 threads per core; going to 4 threads per core, the performance even deteriorates slightly. The speedups reported here are approximately 33.0 for the best MPI case on 64 cores, and approximately 59.7 for the best OpenMP case running on 128 hyperthreads. Note that the remap code requires more than 16 GB of memory in the given setup and, hence, could not be considered on the KNL node; given the results on the Haswell node, it is unlikely to outperform the domain decomposition code on the KNL. For similar reasons, we could not run more than 64 MPI processes using hyperthreads, which would not make sense from a technical point of view either. While plain MPI and plain OpenMP strong scalings are well-suited tools to probe the parallelization efficiency individually, the HPC application user is interested in the smallest time to solution, which typically requires choosing a mix of both parallelization strategies at runtime; we focus on this next. Table 3 compares the node-level performance of the domain decomposition runs between the Haswell and the Xeon Phi node. Based purely on the ranked timings, the performance advantage of the KNL node relative to the HSW node is roughly 1.4. Comparing the fastest runs with more than one MPI process, which would be the practically more relevant case, the advantage of the KNL shrinks to only a factor of 1.05.
In summary, both the distributed-memory and the shared-memory parallelization of the domain decomposition implementation were found to scale very well on a single node. OpenMP threads and MPI processes can virtually be used interchangeably when running the domain decomposition implementation. This is an important finding for the larger setups on many compute nodes that rely on hybrid parallelization, e.g. running one process per socket and keeping all available cores busy with threads, as presented in the following section. In practice, however, it may not be beneficial for a given setup to arbitrarily swap threads and processes, due to implicit changes in the partitioning of the numerical grid and the process grid which may degrade the performance in some cases.

Parallel performance

Medium-scale runs.
In this section, we focus on the parallel performance of the 6d Vlasov code on distributed-memory HPC clusters. We mainly consider the domain decomposition implementation but draw some comparisons to the remap code as well.
First, we discuss medium-sized scaling runs performed on the DRACO HPC machine at MPCDF, mainly to investigate the scalability of the hybrid implementation. DRACO consists of Haswell nodes as detailed in Table 2 in the previous section. The nodes are grouped into islands of 32 machines and are connected via an FDR14 InfiniBand network. The maximum job size is 32 nodes with 1024 physical cores in total, supporting up to 2048 simultaneously active threads. Note that the minimum number of MPI processes for which the domain decomposition implementation communicates with all process neighbors via MPI messages is 64, which is therefore taken as the baseline for all scalings (see the sketch below).
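For illustration, the following minimal sketch (using mpi4py, which is not necessarily the library or setup of the production code) builds such a 6d process grid; with 64 processes the balanced factorization is $2\times2\times2\times2\times2\times2$, i.e. every dimension is split and every halo exchange involves an MPI message:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
nprocs = comm.Get_size()

# Factor the process count into a 6d process grid; with 64 processes this
# yields 2 x 2 x 2 x 2 x 2 x 2, the smallest grid in which every dimension
# is actually split between processes.
dims = MPI.Compute_dims(nprocs, 6)
cart = comm.Create_cart(dims, periods=[True] * 6, reorder=False)

# Ranks of the lower/upper neighbor in each of the six (periodic) dimensions,
# as needed for the halo exchanges.
neighbors = [cart.Shift(d, 1) for d in range(6)]
if cart.Get_rank() == 0:
    print("process grid:", dims)
```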
To include the domain decomposition code with pipelining, i.e. with overlapping of computation and communication, into the comparisons in a fair way, we allow each process to use two hyperthreads even in a plain MPI run, since the pipelining scheme relies on having multiple active threads per process to work properly. For the other two codes under consideration, the effect of the two hyperthreads is negligible, as shown in the previous section on single-node performance.
We now turn towards a strong scaling before the more relevant weak scaling scenario is considered in greater detail. Fig. 14 shows strong scalings in plain MPI and also for hybrid setups, going from 64 cores on two nodes up to the full island size of 32 nodes. The test setup uses a resolution of $32^4\cdot 64^2$, thereby following up on the resolution of $32^6$ per CPU socket used previously for the single-node tests. The remap implementation is significantly slower than both domain decomposition codes; at the same time it scales best, achieving a speedup of about 8.6 out of an ideal 16, closely followed by the domain decomposition code at about 8.3. On the full island, i.e. on 32 nodes, the domain decomposition code is faster by a factor of about 2.8 compared to the remap code, a result very similar to the outcome on the full node. Starting out at virtually the same wall clock time as the domain decomposition case, the pipelining code scales worse, illustrating the overhead of the pipelining scheme when the workload per MPI process decreases.
The hybrid parallelization turns out to be efficient when going beyond a single compute node as well. Fig. 14 shows strong scalings for hybrid runs of the domain decomposition codes in direct comparison to the MPI runs, reducing the number of MPI processes by a factor of 2 or 4 and increasing the number of threads by the same factor. Note that hyperthreads are used in this example, i.e. two threads share the same physical core. As can be seen, the hybrid curves follow the plain MPI curves very closely. For the largest runs, the hybrid case with 8 threads per MPI process performs best for the domain decomposition code without pipelining.
An important effect induced by the multidimensional domain decomposition has to be recalled at this point to explain the deterioration of the strong scaling. As the global problem size is kept constant, the fraction of the distribution function that has to be communicated between neighbors grows during the strong scaling due to the decrease in local grid size. This effect can be mitigated to some degree by performing hybrid runs. For the remap implementation, on the other hand, the total volume of communicated data stays constant, whereas the number of MPI messages increases quadratically with the number of processes. These strong scaling curves are mainly given for reasons of completeness: since high memory demands in combination with low arithmetic intensity are the main challenges when dealing with the 6d Vlasov-Poisson problem, we turn towards the weak scaling properties of the codes, which are more relevant to physics simulations on large grids.

Figure 15: Weak scaling on DRACO using 7-point Lagrange interpolation, keeping a local problem size of $32^6$ points per CPU socket. Note that the largest runs have a global problem size of $64^6$ points for the distribution function.

Fig. 15 shows a weak scaling, starting on two nodes and keeping a workload of $32^6$ points per CPU socket. Looking at the plain MPI runs (with 2 hyperthreads per process to enable comparison to the pipelining implementation), the remap code is significantly outperformed by the domain decomposition codes, with the pipelining implementation being the fastest and scaling best. The parallel efficiencies are approximately 0.88 for the pipelining implementation, 0.79 for the non-pipelining one, and only 0.55 for the remap code. On the full island, the pipelining code is about a factor of 4.6 faster than the remap code.
In addition to the plain MPI cases, the plot shows a selection of hybrid cases, exchanging processes for threads but operating on the same workload for comparability. The graph yields the important finding that hybrid setups running 4 or 8 threads per process may in fact be the best choice for the pipelining code, which in this situation is able to utilize several threads for each of the advection and buffering tasks in parallel while performing MPI communication at the same time, thereby hiding parts of the communication behind useful computation. In addition, the pipelining implementation performs very well with hybrid parallelization over a broad range of configurations. Even running only two MPI processes per socket (labeled "16 tpp") turns out to be very close to the case with one MPI process per physical core (labeled "2 tpp"), with less aggressive hybrid setups located in between. However, reducing further to only a single process per socket (labeled "32 tpp") causes the time to solution to increase significantly, because a single simultaneous MPI operation per socket is not sufficient to saturate the network interface. Finally, it has to be recalled that each hybrid configuration uses a different, individually optimized process grid on top of the same numerical problem setup, which leads to different communication patterns that also contribute to the variation observed between plain MPI and hybrid runs.
Concerning the weak scaling, the rise in the wall clock time is to a significant degree attributed to the fact that, as the system size is increased, halos in more and more spatial directions are communicated via the network instead of being shared via the memory on the node. An illustration and in-depth explanation will be given in the following section dealing with scalings on SuperMUC, using a component-wise breakdown of the run time.
To illustrate the behavior and performance of the pipelining implementation, task-style time traces are presented in Fig. 16. In Sec. 5.4, the pipelining scheme was explained by means of the schematic Fig. 6, which is hereby complemented with timing data. The run under consideration uses 64 MPI processes in total, employs two processes per node (one per socket), and uses OpenMP threads to keep all cores busy. The plot shows timings obtained from the control loop which implements the pipelining scheme using three lightweight threads, indicated here by three lanes. Temporal overlap is possible between the lanes within the x advection and the v advection blocks, where the number of nested threads for each task is set such that at most the number of available hardware threads is utilized. As can be seen from Fig. 16, the communication operations shown in black overlap quite well with useful computation being performed at the same time, though some minor jitter is present. The fact that the v advections and the x advections cannot be overlapped, due to their blocking along different dimensions, becomes visible from the empty lanes at the transition from v (blue) to x (red). The configuration of the process grid in this example is $2^6$, with a local numerical grid of $16\cdot 32^5$ and a global grid of $32\cdot 64^5$. This implies that there is more communication along the first dimension, which can be seen from the first four black blocks of each x advection phase in Fig. 16 being wider than the rest. Overall, the computations of the v advections (blue) are more expensive than those of the x advections, which is due to strided memory accesses, as was already shown in Fig. 4. Note also that the time spent in the Poisson step is comparably large and can vary quite a bit. The reason is that it contains the reduction step, which requires the synchronization of comparably many processes; hence, potential imbalances due to system jitter during the advection steps are included in the Poisson timing. In the following section, we discuss runs on a supercomputer at large scale, also shedding more light on the effects that limit the parallel scalability.
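To convey the basic idea of overlapping halo communication with block-wise computation, the following strongly simplified sketch uses plain nonblocking point-to-point MPI calls (mpi4py) instead of the control-loop and threading machinery described above; the function names, the block list, and the halo layout along the last axis are hypothetical placeholders, not the production implementation:

```python
from mpi4py import MPI
import numpy as np

def advect_blocks(blocks, left, right, halo, comm, compute_block):
    """Process a list of blocks: while block k is interpolated, the halo
    exchange for block k+1 is already in flight (nonblocking MPI)."""

    def start_exchange(k):
        blk = blocks[k]
        send_lo = np.ascontiguousarray(blk[..., :halo])    # lower boundary layer
        send_hi = np.ascontiguousarray(blk[..., -halo:])   # upper boundary layer
        recv_lo = np.empty_like(send_lo)
        recv_hi = np.empty_like(send_hi)
        reqs = [comm.Isend(send_lo, dest=left),  comm.Isend(send_hi, dest=right),
                comm.Irecv(recv_lo, source=left), comm.Irecv(recv_hi, source=right)]
        # return the send buffers as well so they stay alive until completion
        return reqs, send_lo, send_hi, recv_lo, recv_hi

    pending = start_exchange(0)                  # prefetch halos of the first block
    for k in range(len(blocks)):
        reqs, _, _, recv_lo, recv_hi = pending
        nxt = start_exchange(k + 1) if k + 1 < len(blocks) else None  # overlap
        MPI.Request.Waitall(reqs)                # halos of block k must be in place
        compute_block(blocks[k], recv_lo, recv_hi)   # interpolation on block k
        pending = nxt
```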

Large-scale runs.
To further evaluate the performance of the domain-decomposition-based solver, we present runs performed on the SuperMUC HPC system of the Leibniz Supercomputing Centre.
On the so-called phase 1 partition of the SuperMUC system, each node is equipped with two Intel E5-2680 CPUs (SandyBridge-EP) with 8 physical cores each, supporting two threads per core (cf. Table 2); the nodes are grouped into islands. In the following, we present scalings going up to 8 islands with 64k physical cores and 128k hardware threads. The resolution of the test case considered for the weak scaling on SuperMUC is $32^4\cdot 16^2$ per socket, which is chosen, for reasons of the available memory per node, a factor of four smaller than the size used per Haswell socket on DRACO. Note that the setup implies a local resolution of a $16^6$ hypercube per process for plain MPI runs with 8 processes per 8-core socket, as shown below in Table 4, which details the grid parameters involved in the scaling. Starting from 64 processes on 4 nodes guarantees that MPI communication takes place in all 6 dimensions from the beginning. Fig. 17 shows weak scalings for the remap and the two domain decomposition codes. There is one MPI process per physical core with 2 hyperthreads each; these extra hardware threads are enabled in order to include the pipelining implementation in the comparison.
The results show that both domain decomposition codes scale quite well. The plain domain decomposition implementation turns out to be faster than the pipelining one up to 512 MPI processes, where the latter takes over. Going beyond the island boundaries leads to an increase in the wall clock times, which is moderate at two islands but becomes more significant at 4 or 8 islands. When scaling up, the increase in the wall clock time is mainly caused by the fact that an increasingly larger fraction of the halo data is exchanged over the network between nodes and finally between islands as the problem size is increased. A run time analysis of the building blocks of the code, discussed below, sheds more light on this issue.
Compared to the domain decomposition codes, the remap code is slower, as already known from the previous tests, and in addition it scales slightly worse. More details on its scalability limitations are given below. Due to a restriction in the implementation, the Poisson step currently prevents the use of more than 4k MPI processes.
In addition, the weak scaling plot in Fig. 17 contains data from two experiments aimed at increasing the parallel scalability by reducing the size of the MPI halo messages. The first experiment (labeled "32bit halos") simply sends the halo data in single precision, thereby halving the communication volume. The wall clock time is significantly reduced and the scalability is improved to some degree, especially when going from one to two islands. The second experiment (labeled "zfp") goes one step further by applying lossy compression to the halo data, effectively reducing it to about a quarter of its size while preserving single-precision accuracy. This comes at a significant computational cost but at the same time leads to a virtually flat scaling with nearly 100% parallel efficiency, even when crossing island barriers. Interestingly, at 4096 MPI processes, the run time of the remap code is already very close to that of the domain decomposition code with zfp compression enabled. As hardware standards evolve, it is to be expected that the penalty of the zfp compression becomes less and less important for hybrid runs on present and future systems with greater relative compute power compared to the network bandwidth.
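As an illustration of the first experiment, the following minimal sketch (assuming mpi4py; the function name and halo layout are hypothetical) down-casts a halo slab to single precision before sending and promotes it back on the receiving side; the zfp variant would additionally compress the single-precision buffer before the exchange:

```python
import numpy as np
from mpi4py import MPI

def exchange_halo_32bit(halo_out, comm, dest, source):
    """Send a double-precision halo slab in single precision to halve the
    message volume, and promote it back to float64 after receiving."""
    send = np.ascontiguousarray(halo_out, dtype=np.float32)   # down-cast: half the bytes
    recv = np.empty_like(send)
    comm.Sendrecv(send, dest=dest, recvbuf=recv, source=source)
    return recv.astype(np.float64)                             # back to working precision
```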
Hybrid setups, i.e. swapping MPI processes for OpenMP threads in the domain decomposition code, did not turn out to be better than the plain MPI cases on SuperMUC with its comparably small number of cores per node, in contrast to DRACO as shown before. As argued already for the zfp compression, the hybrid feature will become more important in the future.
Let us now turn towards an in-depth analysis of the scaling of the building blocks of the domain decomposition code when performing the weak scaling. Fig. 18 shows a breakdown of the individual components of the implementation without overlap of communication and computation. Table 4 complements Fig. 18 with information on the grid configuration of the runs, essential for a better understanding of the behavior.
The advection computations (A1-A6) scale virtually ideally. However, the communication-intensive operations, such as the halo exchanges and the Poisson step, clearly show the effects of the network topology. In the runs performed, the grid of MPI processes is laid out in column-major order, i.e. MPI processes are closest in the 1st dimension and farthest apart in the 6th dimension, as shown in Table 4 for each problem size of the weak scaling.

Figure 18: Weak scaling of the building blocks of the implementation for the plain MPI domain decomposition run shown in Fig. 17, scanning from 64 up to 65536 MPI processes. The letter A indicates the advection computations, whereas H indicates the halo-exchange operations; for both, the dimension is given. The letter D represents the diagnostics and P the Poisson solver. In particular, PC denotes the charge density computation and PF the field solver component. The x advections and the v advections were grouped together. The transitions from 1 to 2, from 2 to 4, and from 4 to 8 islands are indicated by vertical lines.

In particular, Table 4 gives information on how far processes that are neighbors in the process grid in a logical sense are separated on the machine, in units of processes and nodes, respectively. Note that on this machine and with the configuration we are considering, there are 16 cores (and MPI processes) per node, such that any $x_1$-$x_2$ plane in any run discussed here fits into a node. This means that halo exchanges in the $x_1$ or $x_2$ direction take place in shared memory via the MPI library. In Fig. 18, this fact is evident from the virtually ideal scaling of the H1 and H2 curves. The operation H2 is likely faster because the batch system distributes the processes in a round-robin fashion between the two sockets, such that the H1 communication goes via the link between the two NUMA domains whereas the H2 communication takes place within a domain. On the contrary, neighbors in the 5th and 6th dimension communicate exclusively over the network in all runs under consideration, making H5 and H6 rather costly operations. As can be seen, the communication in $x_6$ is the main reason for the slowdown when crossing island barriers. The fraction of halo data communicated between islands rises from 25% on 2 islands to 50% on 4 islands and finally to 100% on 8 islands, as the "distance in nodes" values in Table 4 indicate for the $x_6$ communication. The H3 and H4 curves show the same behavior for the transition from intra-node to inter-node communication, as the rise at 256 and 1024 cores, respectively, indicates. The Poisson step is broken down into two parts, the reduction step (PC) and the actual solution of the Poisson problem (PF). The latter part, PF, is negligible and scales well (apart from some fluctuations, which are especially visible in this step due to the small absolute times). The reduction step, on the other hand, shows a similar increase as the v halo exchange steps, since it includes communication along all velocity dimensions.
The diagnostics computation D is initially negligible; however, at larger process counts its significance rises, especially when crossing the island boundaries. Note that the diagnostics is computed purely locally on each process, and only in a final step is an array of 12 double-precision numbers summed (reduced) globally to rank 0 via MPI, which causes the increase in time observed in Fig. 18.
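A minimal sketch of this final reduction step (assuming mpi4py; the variable names are hypothetical):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
# Per-process diagnostics (e.g. mass, momenta, energies): an array of 12
# doubles, summed onto rank 0 in a single collective call.
local_diag = np.zeros(12)      # filled locally from the process' part of the grid
global_diag = np.zeros(12) if comm.Get_rank() == 0 else None
comm.Reduce(local_diag, global_diag, op=MPI.SUM, root=0)
```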
Finally, it has to be pointed out that the runs could be performed only once; hence, some fluctuations are to be expected, e.g. in the D curve at 1k and 8k processes in Fig. 18 (which, for comparison, are not present in the D curve in Fig. 19 below).
For comparison, Fig. 19 shows the temporal breakdown of the building blocks of the remap code. Starting with 4 nodes (64 MPI processes) at a resolution of $32^6$, the remap operations R1 and R2 dominate over the compute-intensive advection computations from the beginning, and their share of the run time steadily increases thereafter. The advection computations A1-A6 scale ideally, as is to be expected. The Poisson solve step takes about the same time as the x advection computations, whereas the diagnostics is negligible in the range under consideration.

Figure 19: Weak scaling of the building blocks of the implementation for the remap run shown in Fig. 17. The letter A indicates the advection computations, whereas R1 and R2 indicate the remap operations involving all-to-all communication, with R1 mapping from the representation contiguous in V to the representation contiguous in X, and R2 the inverse. The letter D represents the diagnostics and P the Poisson solver. To improve the clarity of the plot, the X advections and the V advections were grouped together.

Intel manycore.
Finally, we have also run the same experiment on the Intel KNL partition of the Marconi Fusion cluster at CINECA. We report results from a weak scaling starting with a resolution of $32^6$ points on a single KNL node, distributed over 64 MPI processes and allowing 4 hyperthreads per MPI process. The configuration was chosen such that the local grid is a regular hypercube of $16^6$ points per MPI process. Fig. 20 shows the wall clock time for the weak scaling experiment, running the domain decomposition code with and without pipelining. A breakdown of the timings for the various steps shows a similar pattern as on SuperMUC, with virtually flat curves for the compute-only steps and increasing communication times as intra-node communication is replaced by inter-node communication.
The H6 communication becomes expensive starting at 32 nodes (2048 cores). At the same node count, the pipelining scheme becomes beneficial by partly hiding communication behind computation.

Summary and Conclusions
In this paper, we have addressed the design, performance optimization, and parallel scalability of a semi-Lagrangian Vlasov-Poisson solver in six-dimensional phase space. The algorithm relies on dimensional splitting, i.e. consecutive interpolation along each of the dimensions. We have considered two parallelization schemes, the remapping method and a 6d domain decomposition approach, the latter replacing the all-to-all-type communication of the former with a peer-to-peer (nearest-neighbor) communication pattern. The domain decomposition method turns out to be superior to the remap method with respect to memory footprint as well as per-core and parallel performance. Parallel scalability is demonstrated in a number of numerical experiments for various setups, going up to 64k physical cores on a supercomputer. By design, the domain decomposition uses halo cells, which in practice introduces restrictions on the time step. However, these are mitigated by using a one-sided blocked communication scheme for the Vlasov-Poisson case and a rotating mesh that follows the background magnetic field. Using pipelining of communication and computation, the domain decomposition method moreover allows the communication to be efficiently overlapped with useful computation.
So far, we have only addressed Lagrange interpolation, which suits distributed computations very well due to its locality. In the future, local splines or discontinuous Galerkin interpolation schemes will also be considered. Furthermore, extensions to Vlasov-Maxwell and more complex geometries are natural enhancements. Due to its high efficiency and parallel scalability, the new code enables simulations in six-dimensional phase space with grid sizes and resolutions large enough for computing relevant physics cases, and paves the way to systematically comparing gyrokinetic and fully kinetic simulations.