A CUDA fast multipole method with highly efficient M2L far field evaluation

Solving an N-body problem, electrostatic or gravitational, is a crucial task and the main computational bottleneck in many scientific applications. Its direct solution is a ubiquitous showcase example for the compute power of graphics processing units (GPUs). However, the naïve pairwise summation has O(N²) computational complexity. The fast multipole method (FMM) can reduce runtime and complexity to O(N) for any specified precision. Here, we present a CUDA-accelerated, C++ FMM implementation for multi-particle systems with r⁻¹ potential that are found, e.g., in biomolecular simulations. The algorithm involves several operators to exchange information in an octree data structure. We focus on the Multipole-to-Local (M2L) operator, as its runtime limits the overall performance. We propose, implement and benchmark three different M2L parallelization approaches. Approach (1) utilizes Unified Memory to minimize programming and porting efforts. It achieves decent speedups for only little implementation work. Approach (2) employs CUDA Dynamic Parallelism to significantly improve performance for high approximation accuracies. The presorted list-based approach (3) fits periodic boundary conditions particularly well. It exploits FMM operator symmetries to minimize both memory accesses and the number of complex multiplications. The result is a compute-bound implementation, i.e. performance is limited by arithmetic operations rather than by memory accesses. The complete CUDA-parallelized FMM is incorporated within the GROMACS molecular dynamics package as an alternative Coulomb solver.

The originally proposed FMM uses a spherical harmonics representation of the inverse distance r⁻¹ between particles. For distant interactions (far field), which can be strictly defined, it uses multipole expansions built from clustered particle groups. The expansions are shifted and then transformed into Taylor moments by applying linear operators in a hierarchical manner to achieve linear scaling with respect to the number of particles. The complexity of the operators is O(p⁴), where p is the order of the multipole expansion. White and Head-Gordon (1996) and Greengard and Rokhlin (1997) proposed rotational operators that align the transformation axis to reduce the operator complexity to O(p³). Cheng et al. (1999) used plane wave expansions to further reduce the complexity of the operators to O(p²); however, a few O(p³) translations are still required. The algorithm has been developed further to support oscillatory kernels e^{ikr}/r (Ying et al., 2004). Fong and Darve (2009) parametrized the inverse distance function using Chebyshev polynomials and proposed a "black-box" FMM, which uses a minimal number of coefficients to represent the far field.
In atomistic molecular dynamics (MD) simulations, Newton's equations of motion are solved for a system of N particles (Allen and Tildesley, 1989) in a potential that accounts for all relevant interactions between the atoms. The integration time step is limited to a few femtoseconds such that the fastest atomic motions can be resolved. To reach the time scales many biomolecules operate on, millions of time steps need to be computed (Bock et al., 2013; Lane et al., 2013; Paul et al., 2017; Schwantes et al., 2014). This can easily require weeks or even months of compute time even on modern hardware (Kutzner et al., 2019). Hence, to speed up the calculation of an MD trajectory, the execution time for each individual time step has to be reduced. This can be achieved with better algorithms, with special-purpose hardware (Shaw et al., 2014), by introducing heterogeneous parallelization, e.g. harnessing SIMD, multi-core, and multi-node parallelism (Hess et al., 2008; Páll et al., 2015), and by using GPUs (Páll and Hess, 2013; Salomon-Ferrer et al., 2013). Here, we utilize GPUs for our FMM implementation.
The electrostatic contribution to the inter-atomic forces is governed by Coulomb's law, where r_ij = x_i − x_j is the distance vector between atoms carrying partial charges q_i, q_j at positions x_i, x_j, ε₀ is the vacuum permittivity and ||·||₂ is the Euclidean norm. The calculation of nonbonded forces, i.e. Coulomb and van der Waals forces, is usually by far the most time-consuming part of an MD step. The van der Waals forces decay very quickly with distance r, so calculating them up to a cutoff distance suffices. The Coulomb forces, however, decay only quadratically with r, and the use of a finite Coulomb cutoff can therefore lead to severe simulation artifacts (Patra et al., 2003; Schreiber and Steinhauser, 1992). Direct evaluation of the electrostatic interactions in a typical biomolecular simulation system becomes prohibitive for two reasons. First, the O(N²) scaling of a direct evaluation hinders its usage already at small system sizes, e.g. for N ≈ 50,000 particles. Second, the usually employed periodic boundary conditions (PBC) make such a calculation impossible altogether. Biomolecular simulation therefore requires an efficient Coulomb solver that properly accounts for the full, long-range nature of the electrostatic interactions.
To this aim, several FMM implementations have been developed. A standard FMM was included as an electrostatic solver for the NAMD package (Board et al., 1992; Nelson et al., 1996). Ding et al. (1992a) proposed the Cell Multipole Method (CMM) to simulate polymer systems of up to 1.2 million atoms. In further work, they combined CMM with the Ewald method, showing a considerable speedup with respect to a pure Ewald treatment (Ding et al., 1992b). Niedermeier and Tavan (1994) introduced a structure-adapted multipole method. Eichinger et al. (1997) combined the structure-adapted multipole method with a multiple-time-step algorithm. Andoh et al. (2013) developed MODYLAS, an FMM adaptation for very large MD systems, and benchmarked it on the K computer using 65,536 nodes. Very recently, it was extended to support rectangular boxes. Yoshii et al. (2020) developed an FMM for MD systems with two-dimensional periodic boundary conditions. Shamshirgar et al. (2019) implemented a regularization method for improved FMM energy conservation. Gnedin (2019) combined fast Fourier transforms (FFTs) and the FMM for improved performance.
Considering efficient parallelization approaches, Gumerov and Duraiswami (2008) pioneered GPU implementations of the spherical harmonics FMM with rotational operators. Depending on accuracy, they achieved speedups of 30-70 with respect to a single CPU. Different GPU parallelization schemes for the "black-box" FMM (Fong and Darve, 2009) were implemented by Takahashi et al. (2012). Yokota et al. (2009) parallelized ExaFmm on a GPU cluster with 32 GPUs, achieving a parallel efficiency of 44% and 66% for 10⁶ and 10⁷ particles, respectively. Even more GPUs (256) were used by Lashuk et al. (2009) for a system of 256 million particles. Rotation-based Multipole-to-Local operators were efficiently parallelized on GPUs by Garcia et al. (2016). Task-based parallelization approaches to the FMM were proposed in Blanchard et al. (2015) and Agullo et al. (2016). A review of fast multipole techniques for the calculation of electrostatic interactions in MD systems can be found in Kurzak and Pettitt (2006). However, the early adoptions of FMMs in MD simulation codes were mostly superseded by particle mesh Ewald (PME) (Essmann et al., 1995) due to its higher single-node performance. As a result, PME currently dominates the field. It is based on the FFT, which inherently provides the PBC solution. Nevertheless, PME suffers from a scaling bottleneck when parallelized over many nodes, as the underlying FFTs require all-to-all communication (Board et al., 1999; Kutzner et al., 2007, 2014). In addition, large systems with nonuniform particle distributions become memory intensive, since PME evaluates the forces on a uniform mesh across the whole computational domain. In the era of ever-increasing parallelism and exascale computers, it is time to revisit the FMM, which does not suffer from the above mentioned limitations. To this end, we implemented and benchmarked a fully CUDA-parallel single-node FMM. Our implementation has been tailored for MD simulations, i.e.
it targets a millisecond-order runtime for one MD step by careful GPU parallelization of all FMM stages and by optimizing their flow to hide possible latencies. It was also meticulously integrated into the GROMACS package to avoid additional FMM-independent performance bottlenecks. Here, we present three different parallelization approaches. The implementation is based on the ScaFaCoS FMM (Arnold et al., 2013), which utilizes spherical harmonics to describe the r⁻¹ function. We use octree grouping to describe the interaction hierarchy. Such grouping is achieved by recursive subdivision of the cubic simulation box into eight equal subboxes. It has a major advantage: the far field operators can be precomputed for the whole simulation box, allowing for efficient parallelization. Additionally, the PBC computation becomes negligible, as the PBC operators reduce to a single operator application. Moreover, a strict error control of the approximation (Dachsel, 2010) can be applied.
Here, we focus on the CUDA parallelization of the Multipole-to-Local (M2L) operator, which limits the overall FMM far field performance the most. An overview of the parallelized FMM, including all stages and complete runtimes, can be found in Kohnke et al. (2020b).

The fast multipole method
We consider a system of N ≫ 1 particles. Following Hockney and Eastwood (1988), the challenge is to most efficiently evaluate the potentials

Φ(x_j) = Σ_{k=1, k≠j}^{N} q_k / ||x_j − x_k||₂ ,   (2)

where x_j and x_k are the positions of particles j and k, respectively, and q_k is the charge of the k-th particle. For a direct solution of Eq. (2), interactions between all pairs of particles (j, k) with j ≠ k need to be computed. This leads to two nested loops and O(N²) calculation steps.
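The two nested loops can be sketched in C++ as follows (a minimal illustration of the O(N²) direct sum; the struct and function names as well as the omitted 1/(4πε₀) prefactor are our choices, not the paper's code):

```cpp
#include <array>
#include <cmath>
#include <vector>

// Direct evaluation of Eq. (2): for each particle j, sum q_k / ||x_j - x_k||_2
// over all other particles k.
struct Particle { std::array<double, 3> x; double q; };

std::vector<double> directPotential(const std::vector<Particle>& ps) {
    const std::size_t n = ps.size();
    std::vector<double> phi(n, 0.0);
    for (std::size_t j = 0; j < n; ++j) {        // outer loop over targets
        for (std::size_t k = 0; k < n; ++k) {    // inner loop over sources
            if (j == k) continue;                // skip self-interaction
            const double dx = ps[j].x[0] - ps[k].x[0];
            const double dy = ps[j].x[1] - ps[k].x[1];
            const double dz = ps[j].x[2] - ps[k].x[2];
            phi[j] += ps[k].q / std::sqrt(dx * dx + dy * dy + dz * dz);
        }
    }
    return phi;  // O(N^2) work: two nested loops over all particle pairs
}
```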

Mathematical foundations
Expansion of the inverse distance between arbitrary particles x_j and x_k, j ≠ k, is performed in terms of spherical harmonics Y_lm and their complex conjugates Y*_lm, where θ and φ are the polar and azimuthal angle, respectively, and P_lm are the associated Legendre polynomials

P_lm(y) := (−1)^m (1 − y²)^{m/2} (d^m/dy^m) P_l(y),   (5)

where P_l are the ordinary Legendre polynomials. The normalized associated Legendre polynomials form an orthonormal set of basis functions on the surface of a sphere. The j-th and k-th dependent parts of the right-hand side of Eq. (3) are the chargeless multipole moments and the chargeless local moments, respectively. The moments weighted with the corresponding charges q_j and q_k can be summed, yielding charged multipole moments and charged local moments. This allows to evaluate the potential at arbitrary particle positions x_j, j = 0,...,J−1, due to a distant discrete charge distribution of K particles with positions x_k, k = 0,...,K−1, in terms of charged local moments and chargeless multipole moments. This calculation is referred to as far field. To achieve convergence in Eq. (3), ||x_j||₂ < ||x_k||₂ has to be fulfilled for all distinct index pairs j and k. Application of addition theorems for regular and irregular solid harmonics (Tough and Stone, 1977) yields translation and transformation operators for the expansions. The moments ω_lm of a multipole expansion about a common origin a of particles x_j, j = 0,...,J−1, can be translated to a new origin a′ with the Multipole-to-Multipole operator A.
Further, the multipole expansion ω(a) can be transformed into a local expansion μ(r) at r with ||r||₂ > ||x_j||₂, j = 0,...,J−1, by the Multipole-to-Local operator M. A local expansion can in turn be shifted to a new origin with the Local-to-Local operator C.

Algorithm
Applying the operators defined in the previous section requires truncation of Eq.
(3) to a finite multipole order p, which controls the accuracy of the solution approximation. The truncated expansions have a triangular shape with the indexing shown in Figure 1; the truncation yields ω, μ, A, C ∈ K^{p×p} := {(a_lm), l = 0,...,p; m = −l,...,l | a_lm ∈ C}.

Figure 1. Indexing of triangular shaped matrices. The used indexing scheme is based on standard matrix index notation: the first subscript is a row number, the second one is a column number, which can be negative. In case of (l, m) notation, l ≥ |m| and l ≤ p.

The multipole moments ω are distributed to all boxes of the octree by translating them level-wise from depth d to d − 1, d = D,...,1, with the A operator. Subsequently, during the Multipole-to-Local (M2L) stage, the multipole moments ω are transformed into local moments μ by applying the operator M. A detailed description follows in the next section. After M2L, in the Local-to-Local (L2L) stage, the local moments are shifted from the octree root to the leaves with the C operator. The interactions between particles occupying the same lowest-level boxes and between neighboring boxes (near field) are evaluated directly (P2P), Eq. (2). Finally, the far field forces and potentials are evaluated at the particle positions in the tree leaves. Additionally, the number of directly interacting boxes can be defined with a well separation criterion, which controls how many layers of adjacent boxes interact directly on the lowest tree depth. In the following we will only discuss the case with one well separated layer of boxes.

The Multipole-to-Local (M2L) transformation
We will now explain the M2L transformation and its execution hierarchy. Let D be a fixed depth of an octree and p a multipole order. The multipole or local expansions in box i at depth d = 0,...,D will be denoted by ω_i^d and μ_i^d, respectively. For each μ_i^d there exists an interaction set L_i^d with |L_i^d| = 208. Figure 4 shows the interaction set L_i^d, which contains the indices of the multipole expansions ω_j^d in all children boxes of the direct neighbors of μ_i^d's parent box. Direct neighbors share at least one common vertex, edge or face with each other. A particular μ_i^d is calculated from all ω_j^d, j ∈ L_i^d, omitting μ_i^d's direct neighbor boxes in order to satisfy Eq. (12). This results in 189 O(p⁴) M2L transformations. Let M be the set of level-wise operator sets M^d := {M^d_{j→i} | M transforms the j-th multipole moment to the i-th local moment at level d}. All M2L operations performed in the FMM octree apply operators from these sets. Figure 5 shows one O(p⁴) M2L transformation. It contains O(p²) dot products between an ω and a part of the corresponding operator M.
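The structure of one such O(p⁴) transformation can be sketched as plain CPU code (our illustration, not the paper's GPU kernel; the helper tri(l, m) = l² + l + m implements the full triangular indexing of Figure 1, and the update rule matches μ_lm += M_{l+j,m+k} ω_jk used later in the text):

```cpp
#include <complex>
#include <vector>

using cplx = std::complex<double>;

// Full triangular index for (l, m) with -l <= m <= l, as in Figure 1.
inline int tri(int l, int m) { return l * l + l + m; }

// One M2L transformation: every target moment mu_lm is a dot product of the
// source multipoles omega_jk with a shifted part of the operator M. omega has
// order p; the operator M has order 2p so that l+j and m+k stay in range.
std::vector<cplx> m2l(const std::vector<cplx>& omega,
                      const std::vector<cplx>& M, int p) {
    std::vector<cplx> mu(tri(p, p) + 1, cplx(0.0, 0.0));
    for (int l = 0; l <= p; ++l)
        for (int m = -l; m <= l; ++m)            // O(p^2) target moments
            for (int j = 0; j <= p; ++j)
                for (int k = -j; k <= j; ++k)    // O(p^2) dot product terms
                    mu[tri(l, m)] += M[tri(l + j, m + k)] * omega[tri(j, k)];
    return mu;
}
```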

Implementation
We focus our GPU parallelization efforts on the M2L operator, as it is the most time-consuming FMM far field operator (see Figure 3 above and Figure 12 in Kohnke et al., 2020b). In PBC, it requires 189 transformations per box, whereas both M2M and L2L, which translate the moments between different tree levels, require only a single transformation per octree box (except for the root box). Since these transformations are of the same complexity, M2L involves 94.5× the number of operations of M2M and L2L combined. The P2P near field computation, the second most time-consuming part, will not be discussed in this paper; it was optimized as laid out in Páll and Hess (2013). A proper choice of D and p allows to balance the near and far field contributions, which minimizes the overall runtime. In a parallel implementation, these stages can run concurrently.

CUDA implementation considerations
We will now briefly outline the CUDA programming model; see Nickolls et al. (2008) for details. A typical GPU consists of a few thousand cores that are grouped into larger units called multiprocessors. CUDA threads are organized in blocks. Threads within a block are grouped into subunits called warps, each consisting of 32 threads. For optimal performance, threads within the same warp should execute the same instruction, otherwise the execution is serialized. This type of parallelization is called Single Instruction Multiple Threads (SIMT). Once a block of threads is spawned, it occupies the multiprocessor until the respective computations are completed. Dynamic scheduling is performed warp-wise, thus thread blocks should consist of at least several warps to hide memory and arithmetic latencies within a multiprocessor. Blocks are organized in grids. Each block of a grid and each thread of a block is identified by its unique 1D, 2D or 3D index. The dimensions of the grid and the blocks can be chosen independently. To identify threads, CUDA provides the 3D structures gridDim, blockDim, blockIdx and threadIdx. We will use the following abbreviations: B_a := blockDim.a, G_a := gridDim.a, Bid_a := blockIdx.a and tid_a := threadIdx.a, a ∈ {x, y, z}. The hierarchy of threads described above affects the memory access and the communication between threads. Whereas all threads can access global memory, this access should be minimized as it has a latency of a few hundred cycles. Data can be shared within a block via shared memory. If no bank conflicts occur, fetching from shared memory is only slightly slower than register access (20-40 cycles). Synchronization of threads is possible only within a block. Since CUDA 6.0, threads within the same warp are able to share their content via the shuffle instruction by directly exchanging their registers.

Sequential FMM and data structures
Our CUDA implementation is based on a C++11 version of the sequential ScaFaCoS FMM (Arnold et al., 2013). It provides class templates with the possibility to use diverse memory allocators. With CUDA Unified Memory (Knap and Czarnul, 2019), the use of the original data structures became feasible by harnessing the C++ memory allocators.
To allow for an efficient manipulation of triangular shaped data (see Figures 1 and 5), we have implemented a dedicated triangular_matrix class that stores the moments and operators. It provides the indexing logic and utilizes a 1D vector of complex values (std::vector<complex>) for this purpose. For symmetry reasons, it suffices to store one half of the triangular matrix for the moments, as the entries on the left (m < 0) and right (m > 0) side are identical except for the signs. The signs are computed on the fly at negligible cost from the parity of the indices. The overall size of the matrices depends on the multipole order p. Exploiting the symmetry, (p² + p)/2 complex values are stored for the expansions and ((2p)² + 2p)/2 for the M operator.
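The half-storage idea can be sketched as follows (a minimal illustration only; the class and method names, and the concrete conjugation-plus-parity rule a_{l,−m} = (−1)^m · conj(a_{l,m}) for spherical-harmonics moments, are our assumptions rather than the internals of the paper's triangular_matrix class):

```cpp
#include <complex>
#include <vector>

using cplx = std::complex<double>;

// Half-stored triangular moment matrix: only entries with m >= 0 are kept.
// An entry with negative m is recovered from its mirrored counterpart via
// complex conjugation and a sign computed on the fly from the parity of m.
class TriangularHalf {
public:
    explicit TriangularHalf(int p)
        : data_((p + 1) * (p + 2) / 2, cplx(0.0, 0.0)) {}

    // 1D storage index for l = 0..p, 0 <= m <= l.
    int idx(int l, int m) const { return l * (l + 1) / 2 + m; }

    void set(int l, int m, cplx v) { data_[idx(l, m)] = v; }

    // Access with arbitrary sign of m; no extra storage for m < 0.
    cplx get(int l, int m) const {
        if (m >= 0) return data_[idx(l, m)];
        const double sign = (m % 2 == 0) ? 1.0 : -1.0;  // parity of |m|
        return sign * std::conj(data_[idx(l, -m)]);
    }
private:
    std::vector<cplx> data_;
};
```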
Let D be a fixed depth of the tree; thus, there are levels d = 0,...,D in the octree. For Multipole-to-Local operations, as described in The Multipole-to-Local (M2L) transformation section, an underlying tree implementation is needed. Listing 1 shows a very basic approach for traversing an octree of depth D. The function index(x, y, z, d) applies the lexicographic approach to compute a unique 1D box index in the octree,

index(x, y, z, d) = nb(d) + x + dim(d)·y + dim(d)²·z,

where dim(d) := 2^d is the number of boxes in each orthogonal direction and nb(d) := (8^d − 1)/7 is the number of boxes on all levels above depth d. The parent box index is easily obtained as index(x/2, y/2, z/2, d − 1).
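This indexing can be sketched in C++ (our helper functions; the exact ordering of the x, y, z offsets is an assumption consistent with the lexicographic scheme described above, and any fixed convention works if used consistently):

```cpp
#include <cstdint>

// Number of boxes per direction on level d of the octree.
inline std::int64_t dim(int d) { return std::int64_t(1) << d; }

// Number of boxes on all levels above d: 8^0 + ... + 8^(d-1) = (8^d - 1)/7.
inline std::int64_t nb(int d) {
    return ((std::int64_t(1) << (3 * d)) - 1) / 7;
}

// Unique lexicographic 1D index of box (x, y, z) on level d.
inline std::int64_t index(std::int64_t x, std::int64_t y, std::int64_t z, int d) {
    return nb(d) + x + dim(d) * y + dim(d) * dim(d) * z;
}

// Parent of box (x, y, z) on level d, one level up.
inline std::int64_t parent(std::int64_t x, std::int64_t y, std::int64_t z, int d) {
    return index(x / 2, y / 2, z / 2, d - 1);
}
```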
Listing 2 shows the sequential form of the M2L transformation. The first four for-loops (lines 3-9) traverse the octree as shown in Listing 1. omega and mu store the pointers to the triangular_matrix objects for the multipole and local moments, respectively. The next three for-loops determine all multipole expansions ω_j^d ∈ L_i^d that are needed for the calculation of μ_i^d, i = 1,...,8^d. Figure 6 shows the complete 2D operator set M for the well separation criterion w = 1. For each d = 1,...,D, the set M^d requires storing 343 pointers. A unique mapping function opindex(x, y, z) returns a 1D index for each M_j^d ∈ M^d, j = 0,...,342, d = 1,...,D.

Listing 1. Loops for traversing an octree in 3D space (pseudocode).

Listing 2. The loops that start the M2L operators in the octree traverse the whole tree, compute the M2L interaction set and launch one M2L translation for each interaction in the computed set (pseudocode).
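A possible realization of such a mapping can be sketched as follows (the concrete formula is our assumption, consistent only with the stated requirements: relative box offsets in {−3,...,3}³ mapped one-to-one onto 0,...,342 for the 343-element operator set M^d):

```cpp
// Unique 1D index for a relative box offset (x, y, z), each in -3..3.
// The 7*7*7 = 343 offsets cover the full M2L operator cube.
inline int opindex(int x, int y, int z) {
    return (x + 3) + 7 * (y + 3) + 49 * (z + 3);
}
```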

Three CUDA parallelization approaches
The previous section described the basic sequential FMM. We will now present three different parallelization approaches. Approach (1) is conceptually straightforward; nevertheless, it achieves decent speedups compared to a sequential CPU implementation with only minor parallelization work. It directly maps for-loops to CUDA threads, leaving the sequential program structure nearly unmodified. Approach (2) performs well for high accuracy demands (high multipole orders p ≳ 12, double precision), but it scales poorly for smaller p. Approach (3) minimizes the number of arithmetic operations by exploiting the symmetry of the M operator. It scales well in the broad range 0 ≤ p ≤ 20; however, it requires additional data structures to minimize bookkeeping and to utilize the symmetries.

Naïve parallelization approach (1).

The complete M2L operation in 3D requires 11 loops as shown in Listing 2. Listing 4 shows the comparison of the FMM loop structure and its naïve CUDA parallelization counterpart. Since CUDA provides a 3-component vector threadIdx to control the parallel execution of the threads, the main idea is to map the loops directly to the CUDA structures. To this end, we use a transformation between 1D and 3D indices: any sequence of n indices i = 0,...,n−1 can be transformed into n m-dimensional tuples of base-b digits (x_0,...,x_{m−1}) with x_j = (i/b^j) mod b, j = 0,...,m−1. As our FMM operates on cubic domains, the number of boxes is dim(d) in each orthogonal direction for depths d = 1,...,D. The loop over the M2L interaction set L is of a fixed size (6 × 6 × 6) on each depth d. The M2L operation contains four for-loops of size p. The iteration over the tree levels is performed by the CPU. Since the M2L operations are level independent, the kernels are spawned asynchronously for each level of the octree, enabling overlapped execution. The last for-loop in Listing 3 is performed sequentially by each thread. It accumulates partial sums of the complete dot product. This reduces the number of atomic writes by a factor of O(p).
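The index transformation above can be sketched as a small helper (our illustrative code; in the actual kernels the digits come directly from the threadIdx/blockIdx components instead of being computed):

```cpp
#include <array>

// Split a flat index i into M base-b digits: x_j = (i / b^j) mod b.
// This is the 1D-to-multi-dimensional index transformation used to map
// flat thread indices onto nested loop tuples.
template <int M>
std::array<int, M> decompose(int i, int b) {
    std::array<int, M> x{};
    int div = 1;
    for (int j = 0; j < M; ++j) {
        x[j] = (i / div) % b;  // j-th digit of i in base b
        div *= b;
    }
    return x;
}
```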
The naïve strategy allows a rapid FMM parallelization. Replacing the existing serial FMM loops with the corresponding CUDA index calculations leads to speedups that make the FMM algorithm applicable for moderate problem sizes. No additional data structures or code modifications are required. However, the achieved bandwidth and parallelization efficiency are still far from optimal on the tested hardware.

CUDA dynamic parallelism approach (2).
A substantial performance issue of the naïve approach is integer calculation, which introduces a significant overhead even for large p. For d = 1,...,D, p³ × 216 × 8^d threads are started, where each computes a valid pair of 3D source and target box indices to perform O(p) complex multiplications and additions μ_lm = μ_lm + M_{l+j,m+k} ω_jk. This leads to O(p³) redundant source and target box index computations. A possible mitigation of the expensive index computations is Dynamic Parallelism (Jones, 2012). It allows to spawn kernels recursively, which simplifies hierarchical calculations. The dynamic approach exploits Dynamic Parallelism to avoid the expensive bookkeeping calculations of the naïve approach.
The determination of μ_i^d for i = 0,...,8^d and d = 1,...,D is done on the host as given in Listing 1. To this aim, the octree is traversed in 3D to precompute the coordinates (x_μ, y_μ, z_μ) of μ_i^d and the origin coordinates (x_L, y_L, z_L) of L_i^d. Together with the 1D index i of μ_i^d, they are passed as arguments to a parent kernel spawned for each μ_i^d. Let P_J^d := {j_0,...,j_7} be the set of indices of all boxes at depth d contained in the parent box of ω_j^d. Listing 5 shows the parent kernel, which is engaged only in octree operations. To better utilize concurrency, it is started with B_x = B_y = B_z = 3 for the 6 × 6 × 6 interaction sets L_i^d. The parent kernel consists of threads that can be uniquely identified with (tid_x, tid_y, tid_z) tuples. Each thread precomputes one 3D source position (x_ω, y_ω, z_ω) of ω_{j′}^d, j′ ∈ P_J^d. Figure 7 illustrates the dynamic kernel. Each parent kernel spawns 26 child kernels. One child kernel computes eight O(p⁴) M2L transformations between one target μ_i^d and all ω_j^d, j ∈ P_J^d. Listing 6 shows the child kernel computations, which can be divided in two parts. In the first part, ω_j^d, j ∈ P_J^d and the operator M^d_{j→i} are determined. Since the indices of ω and M are provided by the parent kernel, the (2 × 2 × 2) grid facilitates a straightforward way to determine the eight different ω_j^d, j ∈ P_J^d, where j′ is the index passed by the parent kernel. The operators M_j^d are obtained correspondingly, by replacing dim(d) with 7 and j′ with the operator index passed by the parent kernel.
To decrease the number of global memory accesses, shared memory is used to cache ω and M. This is advantageous, since one M2L operation executes O(p⁴) steps on O(p²) data structures. The triangular shaped matrices are converted to 1D arrays in shared memory, allowing consecutive addressing in the for-loops performing the reduction step. The shared memory storage index s_i of each moment ω_lm ∈ ω_j^d is calculated as s_i = l² + l + m, where l := tid_y and m := tid_x. A similar approach holds for the operator M_j^d; however, since M contains O((2p)²) elements, threads need to be reused to write the elements into shared memory.
In our implementation, the direct neighbor operator is given the size p = 0. This allows to skip the remaining nearest neighbor interactions of μ_i^d, i.e. those with boxes in P_J^d \ L_i^d, by checking for the condition p = 0.
In the second part, the j_k_reduction() function computes the M2L operation. To mitigate the waste of threads due to the triangular shape of μ, ω and M, and to minimize the number of atomic global memory writes, each thread executes the two innermost loops, Eq. (22), sequentially. However, a straightforward approach leads to warp divergence, since threads that correspond to m > l indices of the target moments need to be skipped. Splitting the innermost loop such that it is partly performed by the m > l threads circumvents this issue. Figure 8 and Listing 7 show a possible splitting scheme of the M2L operation. It uses (p + 1)·(p + 2) threads, where some unique thread pairs are mapped to the same target index tuple (l, m) of a target element μ_lm ∈ μ_i^d. These compute distinct parts of the reduction.

Listing 5. Parent kernel in the dynamic approach. It determines valid ω coordinates and spawns child kernels performing M2L computations (pseudocode).

The described dynamic approach allows for further optimization, as it splits the computation in two independent parts. The parent kernels handle the octree position evaluation, whereas the child kernels implement the M2L computation. The efficiency of this approach is satisfactory for high multipole orders. The necessity of an efficient parallelization also for small p leads to the next approach.

Presorted list-based approach with symmetric operators (3).
In this approach, the FMM interaction pattern is precomputed for higher efficiency. Additionally, operator symmetries are exploited to reduce both the number of complex multiplications and the number of global memory accesses.

Octree interactions precomputation.
The pattern of interactions between the octree boxes is static, hence it can be precomputed and stored. This step does not need to be performance-optimal, as it is done only once at the start of a simulation that typically spans millions of time steps. In a PBC octree configuration, for each ω_i^d, d = 1,...,D, i = 0,...,8^d − 1, there exists an interaction set R_i^d with |R_i^d| = 208. It consists of all indices j of the local moments μ_j^d that a multipole ω_i^d contributes to. Note that the index sets R_i^d and L_i^d (defined in section The Multipole-to-Local (M2L) transformation) are identical.

Listing 6. Child kernel in the dynamic approach (pseudocode).

For higher efficiency, the sets R_i^d are precalculated and stored as lists R̃_i^d. Within the omega class, each ω_i stores 189 pointers to its targets μ_{j_k} and the corresponding pointers to the M2L operators. We make sure that the internal list orders preserve the validity of Eq. (23), so that it suffices to store direct pointers to the target moments and operators instead of their indices. We will use the index list notation R̃_i^d in the following. O(p²) CUDA blocks are spawned to handle the O(p⁴) interactions. The remaining O(p²) operations are executed sequentially by each thread. B_x = 8^{d−1} is the number of boxes on octree level d − 1. This value fits the CUDA architecture requirements for the blocksize particularly well, as it is always an even multiple of the warpsize. For d > 4, B_x exceeds the largest allowed blocksize of 1024, so we replicate kernel launches for consecutive ω in strides of size 1024. Let G_s, s = 0,...,7, denote the eight possible groups governed by the position of ω within the parent box. The 8^d pointers to ω are reshuffled such that eight consecutive sequences of 8^{d−1} pointers in memory belong to the same group G_s. Rigorously, for ω_i^d, where d = 1,...,D and i = 0,...,8^d, it holds that ω_{i_k}^d ∈ G_s for k = s·8^{d−1},...,(s + 1)·8^{d−1} − 1, s = 0,...,7. With reshuffled ω pointers, the CUDA parallelization proceeds as shown in Figure 10. One kernel is started for each G_s, s = 0,...,7. Each thread tid_x within a block evaluates one dot product (Eq. (22)). The pointers to the targets μ_{j_k}^d and to M_{j_k}^d are accessed with the precomputed and presorted lists R̃_i^d and M̃_i^d without any additional integer operation, hence j_k is determined by Bid_z. The particular moments (μ_lm)^d_{j_k}, j_k ∈ G_s, are evaluated in a parallel CUDA block with l = Bid_x and m = Bid_y. As these are block variables, skipping the m > l part does not lead to warp divergence.
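The assignment of a box to one of the eight groups G_s can be sketched via the parity of its coordinates (the parity-based formula is our assumption; any fixed child-ordering convention within the parent box yields eight such groups):

```cpp
// Group index s in 0..7 of a box, determined by its position within the
// parent box: the lowest bit of each coordinate says whether the box is
// the "low" or "high" child along that axis.
inline int groupOf(long x, long y, long z) {
    return int((x & 1) | ((y & 1) << 1) | ((z & 1) << 2));
}
```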
Additionally, only the relevant part of the triangular operator matrix (Figure 5) needs to be loaded into shared memory to be accessed by all threads tid_x within the block.
A further improvement of the kernel is gained by rearranging the moments in memory. Threads tid_x of an (l, m) block access the same moments ω_lm of consecutive ω_i, with i = tid_x. For warp-wise coalesced memory access, the arrangement of the moments in memory is switched from Array of Structures (AoS) to Structure of Arrays (SoA). The moments (ω_lm)_i^d, i = 0,...,8^d − 1, are stored in SoA triangular matrices such that for fixed l, m, the i-indexed elements are contiguous in memory.

Operator symmetry

The symmetry of the M2L operators emerges directly from their definition (Eqs. (5)-(6)). It allows to reduce the size of the operator set M^d, as shown in Figure 11. In 3D, the complete operator set spans a cube with the operators originating from its center to all 7³ − 3³ subcubes. The reduced operator set M̃^d contains the 56 M2L operators ω_i → μ_{j_x} (x = 0,...,55) of one of the octants. Let the octant of the cube with angular parameters θ, φ ∈ [0, ½π] in spherical coordinates be the reference octant. The generation of particular operator moments with the symmetry relations of the harmonic functions, where e^{imφ} = cos(mφ) + i sin(mφ) (26), yields three operator symmetry groups containing orthogonal operators that differ only by their signs. Figure 12 shows the symmetry groups in M̃^d.
To make efficient usage of the operator symmetry, the listsM d i for each ! i are again reordered such that The corresponding listsR i need to be resorted as well, to preserve Eq. (23). A bitset B is added to the M operator class to store signs of its elements. As these are complex values, it takes two bits to store the signs. The bitset is indexed in an array like manner with the most significant bit as the zero-th element. With u ¼ 2ðl 2 þ lÞ þ 2m, BðuÞ and Bðu þ 1Þ represents the sign of the real and complex part of M lm , respectively. Since bitsets are precomputed during the operator initialization phase, they do not introduce any performance degradation whereas their additional memory footprint is negligible. Figure 13 shows an example of an M2L computation with bitsets. A single operator access from global memory computes 1, 2, 4, or 8 target moments depending on the operator type T as given in Eq. (27). The target moments t g , with g ¼ 0; :::; b, b ¼ 1; 2; 4; 8 for any source ! are z x y Figure 12. Grouping of the M2L operators according to their symmetry properties. The reduced operator set as shown in black in the upper left panel (a 2D version is shown in the right panel of Figure 11) is sorted into three groups (red, blue, green) depending on whether the operator is aligned with an axis (red), or within one of the xy; xz; or yz planes (blue), or none of that (green). Each of the red operators (in x, y, and z direction) has one symmetrical counterpart (in Àx, Ày, and Àz direction, respectively). Each of the blue operators has four symmetrical counterparts each (one in each quadrant of the plane). Each of the remaining operators has eight symmetrical counterparts each (one in each octant of the cube).
(μ_lm)_{t_g} = Σ_{jk} (σ_{lm,jk})_{t_g} M_{l+j, m+k} ω_{jk},

where M_{l+j, m+k} are the elements of the reference operator M^{t_0} and the signs (σ_{lm,jk})_{t_g} of the split products are read from the bitsets. The sign function changes the sign of x by shifting the u-th bit of the bitset B to the leftmost bit position and subsequently evaluating x ⊕ B_shifted. This creates no warp divergence, since the sign change is the result of purely arithmetic and logical operations.
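The bitset indexing and the branchless sign change can be sketched in host C++ (a minimal sketch with hypothetical names; the actual kernel applies the same idea to the split real and imaginary products):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Bit index u of the sign of Re(M_lm); u + 1 indexes the sign of Im(M_lm).
inline unsigned signBitIndex(int l, int m) {
    return 2u * static_cast<unsigned>(l * l + l) + 2u * static_cast<unsigned>(m);
}

// Array-like bitset access with the MOST significant bit as element zero,
// here for a single 32-bit word B of the bitset.
inline std::uint32_t signBit(std::uint32_t B, unsigned u) {
    return (B >> (31u - u)) & 1u;
}

// Flip the sign of x iff bit u of B is set: shift the bit into the sign
// (most significant) position of the mask and XOR it into the bit
// pattern of x. No branch is taken, so warps stay convergent.
inline float applySign(float x, std::uint32_t B, unsigned u) {
    std::uint32_t xi;
    std::memcpy(&xi, &x, sizeof xi);   // reinterpret the float's bits
    xi ^= signBit(B, u) << 31;         // conditional sign flip via XOR
    std::memcpy(&x, &xi, sizeof x);
    return x;
}
```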
The constant size of the lists (see Eq. (28)) allows to implement the M2L kernel for the symmetry groups T_{a_x}, x = 1, 2, 3, 4, as a function template, resulting in a single kernel that efficiently treats the different groups. Listing 8 shows the kernel configuration for the different symmetry groups. For the different operator groups G_s, s = 0, ..., 7, the kernels are replicated. The computation of the index i of ω_i is straightforward, as the pointers to ω_i are contiguous for any G_s. For each symmetry group T_{a_x}, one kernel with distinct template parameters is started, where the first parameter describes the cumulative offset of the particular T_{a_x} in M̃_{d,i} and the second one is the number of symmetrical operators within the current T_{a_x}. The size of each T_{a_x} is set to that of G_s. The kernels are launched for all configurations T_{a_x} × G_s asynchronously to utilize concurrency.

Listing 8. Configuration and launches of the symmetric M2L kernel (pseudocode).

Listing 9 shows the implementation of the symmetric kernel. At the beginning, the reference operator M^{t_0} and the bitsets B^{t_g} of all orthogonal operators are loaded into shared memory. Depending on the group T_{a_x}, a different number of bitsets is loaded. The if statement that tests the value of the template parameter group_type is resolved at compile time. The second part of Listing 9 shows the implementation of the split complex multiplication. The doubly nested for-loop computes 1, 2, 4, or 8 target moments (μ_lm)_{t_g}, depending on the symmetry group T_{a_x}. The if statement within the innermost loop is resolved at compile time as well.
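The compile-time resolution of the group_type test can be illustrated with a plain C++ function template (names and return values are illustrative stand-ins, not the actual Listing 9 kernel):

```cpp
#include <cassert>

// Hypothetical stand-ins for the symmetry groups T_a: the number of
// symmetric target moments b produced per operator load.
enum GroupType { ONE_WAY = 1, TWO_WAY = 2, FOUR_WAY = 4, EIGHT_WAY = 8 };

// group_type mirrors the template parameter of the symmetric kernel.
// Because it is a compile-time constant, the compiler resolves the
// branches below during instantiation, so the generated code contains
// no runtime test and causes no warp divergence.
template <int group_type>
int targetsPerOperatorLoad() {
    if (group_type == ONE_WAY)  return 1;  // resolved at compile time
    if (group_type == TWO_WAY)  return 2;
    if (group_type == FOUR_WAY) return 4;
    return 8;
}
```

One instantiation per symmetry group then corresponds to the four asynchronous kernel launches of Listing 8.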

Benchmarks and discussion
We will now benchmark the performance and analyze the scaling behavior of the three different parallelization approaches described above. Figure 14 sketches the FMM scaling behavior with respect to the number of particles N, which is O(N) when the tree depth D is chosen properly. However, locally the FMM scales like O(n^2), with n being the average number of particles per box at the lowest level D. For a fixed multipole order p at constant depth, a fixed number of O(p^4) far field operations is performed. In the regime of small N, the O(p^4) far field part completely dominates the FMM runtime, which is therefore essentially independent of N. At some critical N, the scaling curve switches to a quadratic behavior, because the P2P computations start to dominate the overall runtime. To benefit from the optimal linear scaling for growing N, the depth needs to be chosen properly. Varying p affects the slope of the overall linear scaling.
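The depth selection can be illustrated with a toy cost model (the model and its coefficients are illustrative assumptions, not measured values): the far field work grows with the number of boxes as c_ff · p^4 · 8^D, while the near field P2P part scales as c_nf · N^2 / 8^D, since the n^2 pairwise work per box with n ≈ N / 8^D, summed over 8^D boxes, gives N^2 / 8^D.

```cpp
#include <cassert>
#include <cmath>

// Toy FMM cost model: far field plus near field at tree depth D.
// c_ff and c_nf are hypothetical, machine-dependent constants.
double fmmCost(double N, double p, int D,
               double c_ff = 1.0, double c_nf = 1e-3) {
    double boxes = std::pow(8.0, D);
    return c_ff * std::pow(p, 4) * boxes + c_nf * N * N / boxes;
}

// Depth minimizing the modeled cost for given N and p.
int bestDepth(double N, double p, int maxD = 8) {
    int best = 0;
    for (int D = 1; D <= maxD; ++D)
        if (fmmCost(N, p, D) < fmmCost(N, p, best)) best = D;
    return best;
}
```

Under this model the optimal depth grows with N (more particles favor deeper trees) and shrinks with p (expensive far field operators favor shallower trees), matching the qualitative behavior sketched in Figure 14.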

Benchmarking procedure
All performance tests were executed on a workstation with an Intel Xeon CPU E5-1620@3.60 GHz with 16 GB physical memory and a Pascal NVIDIA GeForce GTX 1080 Ti with 3584 CUDA cores. This GPU has a theoretical single precision peak performance of 11.6 TFLOPS and a maximal bandwidth of 484 GB/s. The device code was compiled with NVCC 9.1. All kernel timings were measured with the help of cudaEvents and represent the average runtime of 100 runs.
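The timing procedure can be sketched as follows (a minimal CUDA sketch; the kernel name and launch configuration are placeholders):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void m2lKernel() { /* placeholder for the measured kernel */ }

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int runs = 100;
    cudaEventRecord(start);
    for (int i = 0; i < runs; ++i)
        m2lKernel<<<128, 128>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);              // wait until all runs finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in ms
    printf("average kernel runtime: %f ms\n", ms / runs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Timing with events on the GPU avoids including host-side launch overhead and synchronization jitter in the measurement.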
In our performance comparisons we focus on D = 3, as it provides sufficient parallelism to get proper performance metrics, which are also valid for D = 4. For higher depths, the computation requires more kernel launches due to blocksize limitations, which decreases performance. On the tested GTX 1080 Ti GPU, a depth of D = 3 is suitable for particle counts of 4 × 10^4 to 3 × 10^5, whereas higher N requires D = 4 or D = 5 for optimal performance. The current implementation of the symmetric parallelization approach allows for a maximum depth of D = 5, at which up to N ≈ 1.2 × 10^6 particles can be handled efficiently. The limitation is caused by a memory optimization in which redundant pointers are stored to minimize the costs of scattered global memory writes. It can be switched off, allowing for D = 6 and system sizes up to N ≈ 10^8. On the tested Pascal GPU this optimization increases performance by about 10%, while on a Turing GPU its effect is negligible (Kohnke et al., 2020a).

Microbenchmarking
To evaluate the different parallelization approaches in the context of the underlying hardware, we estimated the GPU performance bounds for the M2L transformation. To this aim, we implemented two benchmarking microkernels, which execute exactly the same number of arithmetic operations and memory accesses as the M2L operation does. However, additional possible performance bottlenecks (Nickolls et al., 2008) such as noncoalesced memory accesses, shared memory bank conflicts, and atomic writes are eliminated. The microkernels were then used to determine the effective runtime bounds for our three different parallelization approaches. Figure 15 shows the absolute runtimes of the microkernels. To get the maximal theoretical throughput of the M2L kernel, we assumed the execution of three global memory accesses, eight bytes each, to perform one complex multiplication and one global addition, i.e. O(p^4) memory accesses for O(p^4) arithmetic operations. This results in a clearly memory-bound kernel with O(p^4) scaling in the examined range of p. For the lower bound we assumed an idealized scenario: for each box in the octree, O(p^2) memory accesses are performed for the moments and operators, whereas the data needed for the O(p^4) operations is assumed to be available in registers. The full O(p^4) memory access approach utilizes nearly the full bandwidth of the GPU, achieving 370 GB/s. However, the compute performance is only about 89 GFLOPS, which is ≈ 0.8% of the GPU's peak performance. The behavior of the second, O(p^2) memory access approach varies with p.

Figure 14. Qualitative sketch of the FMM scaling behavior. The optimal linear scaling (black dashes) with particle number N is achieved if and only if the tree depth D (as indicated by the colored numbers) is properly chosen for each N. For a constant D, for small N, FMM runtime is dominated by the far field computations, whereas for growing N, ultimately O(N^2) scaling results (red dashes).
For p < 10 we observe subquadratic scaling of the execution time, indicating that the O(p^4) arithmetic operations are fully hidden. For p > 10 the curve shifts to the O(p^4) regime, achieving up to 8 TFLOPS, i.e. 70% of the GPU's peak performance. Compute and memory utilization are balanced at p = 10, which is where the curve switches from the O(p^2) to the O(p^4) slope.

Performance comparison
We will now discuss the efficiency of the three proposed parallelization approaches. Figure 16 shows the absolute execution times of the different kernels for D = 3 and D = 4.

Naïve kernel.
In the whole p range, the naïve kernel's theoretical arithmetic intensity (see the Roofline model, Williams et al., 2009) is much smaller than the ratio R = 23.95 FLOPS/byte obtained from the GPU's FLOP rate (11.6 TFLOPS) divided by its memory transfer rate (484 GB/s). This indicates that the kernel is bandwidth limited. However, additional integer computation is required for the calculation of the 3D octree indices of the interaction sets L. Figure 17a shows that for the naïve kernel, much less than 10% of the issued instructions are useful floating point operations. A large computational overhead emerges from the high number of integer operations in the innermost for-loop. Here, each of the O(p) complex multiplications requires 31 integer additions, 16 integer multiplications, 9 modulo operations, and 11 integer divisions. In addition, performance is significantly reduced by warp divergence, since different threads in a block resolve the condition m > l (line 36 of Listing 4). This effect is labeled as no operation in Figure 17a. Avoiding warp divergence would require different kernels for each 0 ≤ p ≤ 20, since the mapping of indices to threads differs for each p. Figure 17b shows how well the GPU is utilized by the naïve kernel. As both memory and compute achieve roughly 50% of the maximal possible utilization, performance is likely limited by the latency of arithmetic or memory operations. The maximum number of achieved FLOPS (at p = 19) is at 2% of the GPU's capability, see Figure 18 (top). The effective bandwidth reaches nearly 500 GB/s, which is more than the maximum memory throughput of the GPU. As seen in Figure 16, the naïve kernel achieves runtimes similar to the memory-bound microkernel.

For p > 6, the kernel is even slightly faster than the memory-bound reference kernel, for the following reason. The fact that the innermost for-loop of the naïve kernel is executed sequentially allows cache reuse. Each element ω_lm can be reused 189 times for different M2L transformations, and each element of the operator M is reused 8^d times at tree depth d. This leads to a local cache throughput of roughly 3,500 GB/s, which approaches the maximal theoretically possible cache bandwidth of the GPU. Additionally, the achieved occupancy per Streaming Multiprocessor (SM) is at 46% of the possible maximum of 50% for this kernel configuration, a limit caused by the number of registers (64) used in the kernel.
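The reuse counts quoted above follow directly from the octree geometry; a quick check (the standard FMM interaction-list counting is assumed):

```cpp
#include <cassert>

// Well-separated M2L partners of one box: boxes inside the 6x6x6
// volume covered by the parent neighborhood, minus the 3x3x3 near field.
constexpr int interactionsPerBox() { return 6 * 6 * 6 - 3 * 3 * 3; }  // 189

// Distinct M2L translation offsets: the 7x7x7 cube of relative box
// positions minus the 3x3x3 near field (cf. the complete operator set).
constexpr int operatorSetSize() { return 7 * 7 * 7 - 3 * 3 * 3; }     // 316

// Number of boxes, and hence operator reuses, at tree depth d.
constexpr long boxesAtDepth(int d) {
    long b = 1;
    for (int i = 0; i < d; ++i) b *= 8;
    return b;
}
```

So each moment can serve 189 M2L transformations, and at D = 3 each operator element is applied once per each of the 512 boxes.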

Dynamic kernel.
This kernel utilizes Dynamic Parallelism to minimize the index computation overhead introduced by the naïve kernel. The (2, 2, 2) sizes of the child kernels allow for utilization of concurrency on the GPU, since the work in the child kernels is fully independent. On the underlying hardware, the maximum number of resident grids per device is limited to 32. Figure 19a shows the relative costs emerging from launching the child kernels, which become irrelevant only for large p. At small p, the launch latencies dominate the computation time, leading to the almost constant runtime for p ≤ 10 that can be seen in Figure 16 for the dynamic kernel. Figure 19b shows the instruction distribution for the dynamic kernel. From p ≈ 3 on, the fraction of floating point operations is significantly larger than for the naïve kernel. However, the large number of integer operations and warp divergence still limit the performance. Another issue is the small blocksize of the child kernels, which limits the SM occupancy for different p values. For p < 5, e.g., each block consists of only one warp. This limits the SM utilization, as 32 blocks but 64 warps can be executed simultaneously. For p ≥ 5, the occupancy is limited only by the register usage, achieving nearly 100% of the theoretically possible occupancy. Limiting the register usage, however, increases the local memory traffic and does not further enhance the performance. As shared memory usage is an essential part of the dynamic kernel, we tested how its utilization affects the overall performance. Figures 20a and 20b show the shared memory throughput and the GPU utilization of the dynamic kernel, respectively. For p < 12 the kernel is clearly compute-bound, hence the shared memory operations are fully hidden. At p = 12-15 we observe a balance between memory and compute operations. Shared memory is limiting only for p > 15; however, the achieved throughput of 6,000 GB/s is at the limit of the underlying GPU.
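The child-launch pattern can be sketched as follows (a minimal CUDA Dynamic Parallelism sketch with placeholder kernels; using (2, 2, 2) as the child block dimensions is our reading of the text, and the code must be compiled with -rdc=true):

```cuda
#include <cuda_runtime.h>

// Placeholder child kernel: one (2, 2, 2) block works on a fully
// independent chunk of M2L transformations.
__global__ void m2lChild() { /* ... */ }

// Parent kernel: each thread launches a child grid. Because the child
// work is independent, child grids may run concurrently, up to the
// limit of 32 resident grids on the benchmarked device.
__global__ void m2lParent() {
    dim3 childBlock(2, 2, 2);
    m2lChild<<<1, childBlock>>>();
}

int main() {
    m2lParent<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

The small child blocks are exactly what limits occupancy for small p, as discussed above: with one warp per block, only 32 of the 64 schedulable warps per SM can be resident.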
The overall performance of this kernel is considerably higher than that of the naïve kernel for p > 6, achieving a maximum of 1,600 GFLOPS at p = 20, see Figure 18 (top). Nevertheless, the memory and FLOPS peaks are reached only for p > 15. At p < 5, the dynamic kernel performs worse than the O(p^4) benchmarking microkernel, mainly due to the kernel launch latencies mentioned above.

Symmetric kernel.
The symmetric M2L kernel is composed of four subkernels started asynchronously, one for each symmetry group T_{a_x}, x = 1, 2, 3, 4 (see Eq. (27)). Figure 21 shows the achieved speedup compared to the standard implementation. Additionally, it also shows the speedups of each symmetric part T_{a_x}. As expected, the eight-way symmetric kernel performs best, followed by the four-way and then the two-way symmetric kernel. The overall speedup of the symmetric kernel combines the speedups of the subkernels. However, the achieved overall speedup is not directly proportional to the number of symmetrical counterparts, because the additional bit shifting and sign changing operations introduce a growing overhead for larger symmetry. In addition, register utilization is larger for the kernels with higher symmetry, harming the achieved SM occupancy. From here on, we will combine the metrics of all subkernels, referring to them as one symmetric kernel. These metrics take into account the overlapping execution of the symmetric subkernels. Figure 22a shows how the hardware is utilized by the symmetric kernel. This kernel is compute-bound over nearly the complete p range. As seen in Figure 22b, the fraction of floating point operations is significantly larger than in both previous approaches. Warp divergence is eliminated completely. The no operation part, resulting from pipeline latencies, becomes negligible for p > 6. For smaller p, the gridsizes ((p + 1)(p + 2)/2, 7, 1) and ((p + 1)(p + 2)/2, 21, 1) of the symmetric subkernels are too small to provide enough blocks to utilize the complete device, so that pipeline latencies become an issue. With larger p, however, hardware utilization increases. The register usage of the symmetric subkernels varies between 48% and 71% (not shown). Based on the kernel configuration, the theoretical maximum possible average SM occupancy over all subkernels is 57%. The kernels achieve 50% and are mainly limited by register usage.
Further occupancy optimization is unlikely to further increase performance markedly, as kernels with a large register usage do not require optimal occupancy (Volkov, 2010).
As Figure 18 (top) shows, the FLOP rate of the symmetric kernel is much higher in the range 1 < p < 16 compared to the other two kernels. However, the FLOP rates achieved by the symmetric and the dynamic kernel for p ≥ 15 are similar. Nevertheless, the symmetric scheme clearly outperforms the dynamic one when comparing absolute execution times, see Figure 16. Figure 18 (bottom) demonstrates that, compared to the non-symmetric kernel, the symmetric kernel needs fewer floating point operations for the M2L stage. Figure 23 shows the performance of the symmetric kernel relative to the compute-bound reference microkernel. It achieves roughly 30% of the reference microkernel performance in the range 2 < p < 21. Additionally, for both depths, the kernel shows ideal scaling, as this ratio remains nearly constant for p > 5. Hence, the scaling of the symmetric kernel follows that of the compute-bound reference microkernel: it achieves subquadratic scaling for p < 10 and switches slowly to the O(p^4) regime, which is limited only by arithmetic throughput (compare Figures 15 and 16).

Conclusions
Here, we have presented three different CUDA parallelization approaches for the Multipole-to-Local operator, whose runtime limits the overall FMM performance. The first approach preserves the sequential loop structure and does not require any special data structures. It makes use of CUDA Unified Memory to achieve decent speedups compared to a sequential CPU implementation. It is useful, e.g., for rapid prototyping or for simulation systems with small to moderate numbers of particles. However, it comes with a large computational overhead due to additional integer operations.
The second approach, which exploits CUDA Dynamic Parallelism, avoids this drawback and achieves very good performance at high accuracy demands, i.e. for large multipole orders. Its main drawback is a lack of performance at low multipole orders, for which the first scheme performs better.
The third approach uses abstractions of the underlying octree and interaction patterns to allow for enhanced, efficient GPU utilization, and it exploits the symmetries of the Multipole-to-Local operator. As a result, it scales perfectly with growing multipole order p, maintaining very good performance over the whole benchmarked multipole range (0 < p < 20). Our FMM implementation has been optimized for biomolecular simulations and has been incorporated into GROMACS as an alternative to the established PME Coulomb solver (Kohnke et al., 2020a). We anticipate that, thanks to the inherently parallel structure of the FMM, future multi-node multi-GPU implementations will eventually overcome the PME scaling bottlenecks (Board et al., 1999; Kutzner et al., 2007, 2014).

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project was supported by the DFG priority programme Software for Exascale Computing (SPP 1648). A special thanks goes to Jiri Kraus from NVIDIA who supported this project in the early stage of its development and to R. Thomas Ullmann who took part in writing the FMM-GROMACS interface and unit tests.