General framework for re-assuring numerical reliability in parallel Krylov solvers: A case of bi-conjugate gradient stabilized methods

Parallel implementations of Krylov subspace methods often help to accelerate the computation of an approximate solution of a linear system. However, such parallelization, coupled with asynchronous and out-of-order execution, often exposes the impact of the non-associativity of floating-point operations. These problems are amplified further when communication-hiding pipelined algorithms are used to improve the parallelization of Krylov subspace methods. Introducing reproducibility into the implementations mitigates these problems and yields more robust and correct solutions. This paper proposes a general framework for deriving reproducible and accurate variants of Krylov subspace methods. The proposed algorithmic strategies are reinforced by programmability suggestions to assure deterministic and accurate executions. The framework is illustrated on the preconditioned BiCGStab method and its pipelined modification, which is in fact a distinct method within the Krylov subspace family, for the solution of non-symmetric linear systems with message-passing. Finally, we verify the numerical behavior of the two reproducible variants of BiCGStab on a set of matrices from the SuiteSparse Matrix Collection and a 3D Poisson's equation.


Introduction
Solving large and sparse linear systems of equations appears in many scientific applications, ranging from circuit and device simulation, quantum physics, and large-scale eigenvalue computations to all sorts of applications that involve the discretization of partial differential equations (PDEs), as described by Barrett et al. (1994). In this context, Krylov subspace methods fulfill the role of standard linear algebra solvers (Saad, 2003). The Conjugate Gradient (CG) method can be considered a pioneer of such iterative solvers, operating on symmetric and positive definite (SPD) systems. Other Krylov subspace methods have been proposed to find the solution of more general classes of nonsymmetric and indefinite linear systems. These include the Generalized Minimal Residual method (GMRES) by Saad and Schultz (1986), the Bi-Conjugate Gradient (BiCG) method by Fletcher (1976), the Conjugate Gradient Squared (CGS) method by Sonneveld (1989), and the widely used BiCG stabilized (BiCGStab) method by Van der Vorst (1992), a smoother-converging version of the latter two. Moreover, preconditioning is usually incorporated in real implementations of these methods in order to accelerate their convergence and improve their numerical features.
One would expect the results of the sequential and parallel implementations of Krylov subspace methods to be identical, for instance, in the number of iterations, the intermediate and final residuals, as well as the sought-after solution vector. However, in practice, this is often not the case due to different reduction trees (the Message Passing Interface (MPI) libraries offer up to 14 different implementations of reduction), data alignment, instructions used, etc. Each of these factors impacts the order of floating-point operations, which are commutative but not associative, and therefore violates reproducibility. We aim to ensure identical and accurate outputs of computations, including the residuals/errors, as in our view this is a way to ensure robustness and correctness of iterative methods. In this case, the robustness and correctness have a threefold goal: reproducibility of the results, an accuracy guarantee, as well as sustainable (energy-efficient) algorithmic solutions.
The implementation of Krylov subspace methods on massively parallel systems reveals their scalability problems, mainly because the synchronization of global communications, especially the reductions, delays parallel executions. The most common solution has been the development of communication-avoiding and communication-hiding methods and, also, the use of new MPI functions to hide the communications, overlapping their execution with the computation of iterative methods. In Cools and Vanroose (2017), the authors propose a general framework for deriving pipelined Krylov subspace methods, in which the recurrences are reformulated to ease the parallelization. Again, these changes impact the robustness and correctness of the iterative methods.
In general, Krylov subspace methods are built from three components: the sparse matrix-vector multiplication Ax (SPMV), the DOT product between two vectors (x, y), and the scaling of a vector by a scalar followed by the addition of two vectors, y := αx + y (AXPY). If a block data distribution is used, only AXPY is performed locally; SPMV needs to get some elements from the other processes, using point-to-point or the MPI_Alltoallw() collective MPI operations, before completing the computation, and DOT products require communication and computation, for example, via the MPI_Allreduce() collective, among MPI processes. Although SpMV has the highest amount of floating-point operations (flops), at large scale DOT products become the most time-consuming component of Krylov subspace methods due to the required global communication. This justifies the use of pipelined versions of Krylov subspace methods.
In this paper, we aim to re-ensure reproducibility of Krylov subspace methods in parallel environments. Our contributions are the following:
• We propose a general framework for deriving reproducible Krylov subspace methods. We follow the bottom-up approach and ensure reproducibility of Krylov subspace methods via reproducibility of their components, including the global communication. We build our reproducible solutions on the ExBLAS (Collange et al., 2015) approach and its lighter version.
• Even when applying our reproducible solutions, we particularly stress the importance of arranging computations carefully so that they are executed deterministically, for example, avoiding possible compiler-driven replacements of a*b + c with the fused multiply-add (fma) operation, or postponing divisions in case of data initialization (i.e. divide before use). For instance, we provide customized AXPY(-like) operations using fma, which reduces the round-offs to one or two per AXPY(-like) operation. Furthermore, we refer to the 30-year-old but still up-to-date guide 'What every computer scientist should know about floating-point arithmetic' by Goldberg (1991).
• We optimize the SPMV implementation by reducing the number of elements received in each process, replacing MPI_Allgatherv() with MPI_Alltoallw(). In the reproducible versions of the dot product, we rely upon only one collective operation on the ExBLAS data, namely MPI_Allreduce(), instead of MPI_Reduce() plus MPI_Bcast().
• We verify the applicability and performance of the proposed methodology on the preconditioned BiCGStab (PBiCGStab) and the pipelined PBiCGStab method. We derive two reproducible variants of each method and test them on a set of large SuiteSparse matrices and a 3D Poisson's equation.
This journal article extends our previous conference paper (Iakymchuk et al., 2022). In particular, we include the pipelined preconditioned BiCGStab as another test case, optimize SpMV, and reduce the number of global collectives in the reproducible versions, as well as validate the implementations on larger matrices. Other information is also added to make the paper self-contained.
This paper is structured as follows. Section 2 reviews several aspects of computer arithmetic as well as the ExBLAS approach. Section 3 proposes a general framework for constructing reproducible Krylov subspace methods. Section 4 introduces the PBiCGStab and the pipelined PBiCGStab methods, describing their MPI implementation in detail. Later, we evaluate the two reproducible implementations of PBiCGStab and pipelined PBiCGStab in Section 5. Finally, Section 6 reviews related work, while Section 7 draws conclusions and outlines future directions.

Background
At first, we briefly introduce floating-point arithmetic, which consists of approximating real numbers by numbers that have a finite, fixed-precision representation. These are composed of a significand, an exponent, and a sign, x = ±d_0.d_1…d_{M−1} × b^e, where b is the basis (2 in our case), M is the precision, and e stands for the exponent that is bounded (e_min ≤ e ≤ e_max).
The IEEE 754 standard (IEEE Computer Society (2008)), created in 1985 and then revised in 2008 and in 2019, has led to a considerable enhancement in the reliability of numerical computations by rigorously specifying the properties of floating-point arithmetic. This standard is now adopted by most processors, thus leading to a much better portability of numerical applications. The standard specifies floating-point formats, which are often associated with precisions like binary16, binary32, and binary64, see Table 1. Floating-point representation allows numbers to cover a wide dynamic range that is defined as the absolute ratio between the number with the largest magnitude and the number with the smallest non-zero magnitude in a set. For instance, binary64 (double-precision) can represent positive numbers from 4.9 × 10^−324 to 1.8 × 10^308, so it covers a dynamic range of 3.7 × 10^631.
The IEEE 754 standard requires correctly rounded results for the basic arithmetic operations (+, −, ×, /, √, fma). This means that they are performed as if the result was first computed with infinite precision and then rounded to the floating-point format. The correct rounding criterion guarantees a unique, well-defined answer, ensuring bit-wise reproducibility for a single operation; correct rounding alone, however, is not necessary to achieve reproducibility. The computer arithmetic community has been paying increasing attention to the reproducibility problem. This has led to the inclusion of error-free transformations (EFTs) for addition and multiplication, which return both the exact result and the rounding error, into the 2019 revision of the 754 standard in order to assure numerical reproducibility of floating-point operations. These mechanisms, once implemented in hardware, will simplify reproducible algorithms, like the ones used in the ExBLAS by Iakymchuk et al. (2015), ReproBLAS by Demmel and Nguyen (2015), and OzBLAS by Mukunoki et al. (2019) libraries, and boost their performance.
There are two approaches that enable the addition of floating-point numbers without incurring round-off errors or with a reduced impact of these errors. The main idea is to keep track of both the result and the error during the course of computations. The first approach uses EFTs to compute both the result and the rounding error and stores them in a floating-point expansion (FPE), which is an unevaluated sum of p floating-point numbers whose components are ordered in magnitude with minimal overlap to cover the whole range of exponents. Typically, FPEs rely upon the traditional EFTs for addition, that is twosum (Knuth, 1969), and for multiplication, that is twoprod (Ogita et al., 2005). The code of these two operations is shown in Algorithm 1 and Algorithm 2, respectively. The second approach projects the finite range of exponents of floating-point numbers into a long vector, a so-called long (fixed-point) accumulator, and stores every bit there. For instance, Kulisch and Snyder (2011) proposed to use a 4288-bit long accumulator for the exact DOT product of two vectors composed of binary64 numbers; such a large long accumulator is designed to cover all the severe cases without overflows in its highest digit.
Algorithm 1: Error-free transformation for the summation of two floating-point numbers.
Algorithm 2: Error-free transformation for the product of two floating-point numbers.
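For reference, a minimal C++ rendering of these two EFTs (the branch-free twosum and an fma-based twoprod) could look as follows; the function signatures are illustrative rather than the ExBLAS interfaces.

#include <cmath>

// twosum (Knuth): s = fl(a + b) and e such that a + b = s + e exactly.
inline void twosum(double a, double b, double &s, double &e) {
  s = a + b;
  double z = s - a;
  e = (a - (s - z)) + (b - z);
}

// twoprod (fma-based): p = fl(a * b) and e such that a * b = p + e exactly,
// provided the hardware offers a correctly rounded fma.
inline void twoprod(double a, double b, double &p, double &e) {
  p = a * b;
  e = std::fma(a, b, -p);
}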
The ExBLAS project (Iakymchuk et al., 2015) is an attempt to derive a fast, accurate, and reproducible BLAS library by constructing a multi-level approach for these operations that is tailored to various modern architectures with their complex multi-level memory structures. On the one hand, this approach aims to be fast, ensuring performance similar to the non-deterministic parallel versions. On the other hand, it aims to preserve every bit of information before the final rounding to the desired format, to assure correct rounding and, therefore, reproducibility. Hence, ExBLAS combines long accumulators and FPEs into algorithmic solutions and efficiently tunes and implements them on various architectures, including conventional CPUs, Nvidia and AMD GPUs, and Intel Xeon Phi co-processors (for details we refer to Collange et al., 2015). Thus, ExBLAS assures reproducibility through correct rounding.
The cornerstone of ExBLAS is the reproducible parallel reduction, which is at the core of many BLAS routines. The ExBLAS parallel reduction relies upon FPEs with the twosum EFT and long accumulators, so it is correctly rounded and reproducible. In practice, the long accumulator is invoked only once per overall summation, which results in little overhead (less than 8%) when accumulating large vectors. Our interest in this paper is the DOT product of two vectors, which is a crucial fundamental BLAS operation. The EXDOT algorithm is based on the reproducible parallel reduction and the twoprod EFT: the algorithm accumulates both the result and the error of the twoprod EFT to the same FPEs and then follows the reduction scheme. We derive its distributed version with two FPEs underneath (one for the result and the other for the error) that are merged at the end of computations. These and the other routines, such as the matrix-vector product, triangular solve, and matrix-matrix multiplication, are distributed in the ExBLAS library.

General framework for reproducible Krylov solvers
This section provides the outline of a general framework for deriving a reproducible version of any traditional Krylov subspace method. The framework is based on two main concepts: 1) identifying the issues caused by parallelization and, hence, the non-associativity of floating-point computations and 2) carefully mitigating these issues primarily with the help of computer arithmetic techniques as well as programming guidelines. The framework was implicitly used for the derivation of the reproducible variants of the Preconditioned Conjugate Gradient (PCG) method in Iakymchuk et al. (2020a, 2020b).
The framework considers the parallel platform to consist of K processes (or MPI ranks), denoted as P_1, P_2, …, P_K. In this framework, the coefficient matrix A is partitioned into K blocks of rows (A_1, A_2, …, A_K), where each P_k stores one row-block with the k-th distribution block A_k ∈ R^{p_k×n}, and n = Σ_{k=1}^{K} p_k. Additionally, vectors are partitioned and distributed in the same way as A. For example, the residual vector r is partitioned as r_1, r_2, …, r_K and r_k is stored in P_k. Besides, scalars are replicated on all K processes.
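As an illustration only (the helper below is ours, not part of the framework's code), such a block-row partition can be set up by splitting the n rows into K contiguous blocks of p_k rows with n = p_1 + … + p_K; vectors reuse the same counts and offsets.

#include <vector>

struct RowPartition {
  std::vector<int> counts;   // p_k: number of rows owned by process k
  std::vector<int> offsets;  // first global row index of each block
};

RowPartition block_rows(int n, int K) {
  RowPartition part{std::vector<int>(K), std::vector<int>(K)};
  int base = n / K, rem = n % K, offset = 0;
  for (int k = 0; k < K; ++k) {
    part.counts[k] = base + (k < rem ? 1 : 0);  // spread the remainder rows
    part.offsets[k] = offset;
    offset += part.counts[k];
  }
  return part;
}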

Identifying sources of non-reproducibility
The first step is to identify the sources of non-associativity and, thus, non-reproducibility of the Krylov subspace methods in parallel environments. As can be verified in Figure 1, there are four common operations as well as message-passing communication patterns associated with them: the sparse matrix-vector product (SPMV), which requires some communication, via the Alltoallw collective, so that each process has the elements needed to perform its part of the computation; the DOT product with the Allreduce collective; the scaling of a vector followed by the addition of two vectors (AXPY and AXPY-like); and the application of the preconditioner. Hence, we investigate each of them.
In general, associativity and reproducibility are not guaranteed when the order of floating-point operations is perturbed in a parallel execution. For instance, invoking the MPI_Allreduce() collective operation cannot ensure the same result (its execution path) as it depends on the data, the network topology, and the underlying algorithmic implementation. Under these assumptions, AXPY(-like) and SPMV are associativity-safe as they are performed locally on local slices of data. The application of the preconditioner can also be considered safe, for example, for the Jacobi preconditioner, as long as all operations are reduction-free; more complex preconditioners will certainly raise an issue. Thus, the main source of non-determinism are the parallel reductions (steps S2, S6, and S7 in Figure 1).

Re-assuring reproducibility
We construct our approach for re-assuring reproducibility by primarily targeting DOT products and parallel reductions. Note that the non-deterministic implementation of the Krylov subspace method utilizes the DOT routine from a BLAS library like Intel MKL followed by MPI_Allreduce(). Thus, we propose to refine this procedure into three steps:
• exploit the ExBLAS and its lighter FPE-based versions to build reproducible and correctly rounded DOT products;
• extend the ExBLAS- and FPE-based DOT products to distributed memory by employing MPI_Allreduce(); this collective acts on either long accumulators or FPEs (see the sketch after this list). For the ExBLAS approach, we apply the regular reduction, since the long accumulator is an array of long integers. Note that we may need to carry out an extra intermediate normalization after the reduction of 2·2^{K−1} long accumulators, where K = 64 − 52 = 12 is the number of carry-safe bits per digit of the long accumulator. For the FPE approach, we define an MPI operation that is based on the twosum EFT. Thus, at this point, the choice of the reduction algorithm underneath MPI_Allreduce() does not have an impact on the computations as every bit of information is stored;
• rounding to double: for long accumulators, we use the ExBLAS-native Round() routine. To guarantee correctly rounded results of the FPE-based computations, we employ the NearSum algorithm from Rump et al. (2008). It is worth mentioning that the rounding operation is performed locally and does not require any communication. In the previous version of the code, as in Iakymchuk et al. (2022), we split the reduction into three steps: MPI_Reduce(), rounding, and MPI_Bcast(). This is no longer needed: by keeping control of the reduction operation, we eliminate the performance penalty of using two collectives with one extra synchronization.
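The following sketch illustrates the second step for the FPE-based variant under simplifying assumptions (fixed expansion size, no fall-back to long accumulators); the names fpe_accumulate and allreduce_fpe are placeholders and not the ExBLAS API.

#include <mpi.h>

constexpr int P = 8;  // FPE size assumed here

// twosum EFT: s = fl(a + b), e = rounding error
static inline void twosum(double a, double b, double &s, double &e) {
  s = a + b;
  double z = s - a;
  e = (a - (s - z)) + (b - z);
}

// Accumulate one value x into the expansion fpe[0..P-1]
static void fpe_accumulate(double *fpe, double x) {
  for (int i = 0; i < P; ++i) {
    double s, e;
    twosum(fpe[i], x, s, e);
    fpe[i] = s;
    x = e;
    if (x == 0.0) break;  // early exit once the error vanishes
  }
  // a full implementation must handle a non-zero leftover here
  // (e.g. fall back to the ExBLAS long accumulator)
}

// User-defined MPI reduction: merge the incoming expansions into the in-out ones
static void fpe_reduce(void *in, void *inout, int *len, MPI_Datatype *) {
  double *a = static_cast<double *>(in);
  double *b = static_cast<double *>(inout);
  for (int v = 0; v < *len; ++v)           // one expansion per "element"
    for (int i = 0; i < P; ++i)
      fpe_accumulate(b + v * P, a[v * P + i]);
}

// Reduce a local expansion into a globally replicated one
void allreduce_fpe(double local[P], double global[P], MPI_Comm comm) {
  MPI_Datatype fpe_type;
  MPI_Type_contiguous(P, MPI_DOUBLE, &fpe_type);
  MPI_Type_commit(&fpe_type);
  MPI_Op op;
  MPI_Op_create(&fpe_reduce, /*commute=*/1, &op);  // declared commutative for simplicity
  MPI_Allreduce(local, global, 1, fpe_type, op, comm);
  MPI_Op_free(&op);
  MPI_Type_free(&fpe_type);
}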
It is evident that the results provided by the ExBLAS DOT are both correctly rounded and reproducible. With the lightweight DOT, we also aim to be generic and, hence, we provide an implementation that relies on FPEs of size eight with the early-exit technique. This way, the working precision of the computations using FPEs is increased up to 8 × 52 bits, as mentioned in Hida et al. (2001) for the double-double arithmetic. Additionally, we add a check to the FPE-based implementations to cover the case when the condition number and/or the dynamic range is too large and we cannot keep every bit of information. Then, a warning is thrown, containing also a suggestion to switch to the ExBLAS-based implementation. Note, however, that these lightweight implementations are designed for moderately conditioned problems or moderate dynamic ranges in order to be accurate, reproducible, but also high performing, since the ExBLAS version can be very resource demanding, especially on small core counts. To sum up, if the information about the problem is known in advance, it is worth pursuing the lightweight approach.

Programmability effort
It is important to note that compiler optimization, and especially the usage of the fused multiply-add (fma) instruction, which performs a*b + c with extended precision and a single rounding at the end, may lead to non-deterministic results. For instance, in the SPMV computation, each MPI rank computes its dedicated part d_k of the vector d by multiplying a block of rows A_k by the vector p. Since the computations are carried out locally and sequentially, they are deterministic and, thus, reproducible. However, some parts of the code like a*b + c*d*e and a += b*c, present in the original implementation of PBiCGStab, may not always yield the same result (Wiesenberger et al., 2019). This is due to the fact that, for performance reasons, the C++ language standard allows compilers to change the execution order of this type of operation. It also allows merging multiplications and summations into fused multiply-add (fma) instructions. Hence, a compiler might translate a*b + c*d to two multiplications t1 = a*b and t2 = c*d and a subsequent summation t1 + t2; it might generate a single multiplication t = c*d with a subsequent fma(a, b, t), which gives a slightly different result; or it may even compute t = a*b first and then use fma(c, d, t). Thus, we advise instructing compilers to use fma explicitly via std::fma in C++11, assuming the underlying architecture supports fma. Another important observation is to carefully perform divisions and the initialization of data. For instance, the choice of b in the Krylov solvers is the value b = Ad, with d = (1/√N)(1, …, 1)^T. In this case, we suggest computing b = Ad for d = (1, …, 1)^T first and then scaling b by 1/√N, as we observed a slightly faster convergence (up to 7%) for the Krylov solver.
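As an illustration, one way to pin down a single evaluation order of a*b + c*d in C++11 is to write the fma explicitly; the function name below is ours, not from the original code.

#include <cmath>

// Evaluate a*b + c*d with an explicitly fixed order: one multiplication, then one fma.
double ab_plus_cd(double a, double b, double c, double d) {
  double t = c * d;            // always compute c*d first
  return std::fma(a, b, t);    // then a*b + t with a single rounding
}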

Reproducible BiCGStab
The classic Biconjugate Gradient Stabilized method (BiCGStab) by Van der Vorst (1992) was proposed as a fast and smoothly converging variant of the BiCG (Fletcher, 1976) and CGS (Sonneveld, 1989) methods. We present here the preconditioned BiCGStab (PBiCGStab) and the pipelined preconditioned BiCGStab (p-PBiCGStab), their design and implementation with the Message Passing Interface (MPI).
For both methods, we consider a linear system Ax = b, where the coefficient matrix A ∈ R^{n×n} is sparse with n_z nonzero entries; b ∈ R^n is the right-hand side vector; and x ∈ R^n is the sought-after solution vector. Additionally, and for simplicity, we integrate the Jacobi preconditioner (Saad (2003)) in our implementations, which is composed of the diagonal elements of the matrix (M = diag(A)). In consequence, the application of the preconditioner is conducted on a vector and requires an element-wise multiplication of two vectors.
As described in Section 3, the framework includes a reproducible implementation of the most common operations in a parallel implementation of a Krylov subspace method. Therefore, we next perform a communication and computation analysis of a message-passing implementation of the PBiCGStab solver. From there, we derive the reproducible version by following the guide from Section 3.
For clarity, hereafter we will drop the superscripts that denote the iteration count in the variable names. Thus, for example, x^j becomes x, where the latter stands for the storage space employed to keep the sequence of approximations x^0, x^1, x^2, … computed during the iterative process. Taking into account these considerations, we analyze the different computational kernels (S1-S12) that compose the loop body of a single PBiCGStab iteration in Figure 2.
4.1.1. Sparse matrix-vector product (S2, S6). This kernel needs as input operands the coefficient matrix A, which is distributed by blocks of rows, and the corresponding vector (p̂ or q̂), which is partitioned and distributed using the same partitioning as A. For simplicity, we just explain below how S2 is computed.
In theory, prior to computing this kernel, we would need to obtain a replicated copy of the distributed vector p̂ in all processes using MPI_Allgatherv(), denoted as p̂ → e, so that vector e would be the only array that is replicated in all processes. However, not all elements of e are required in all processes to compute the local SPMV, only those whose column indices appear in A_k but are not in p̂_k, denoted as e_k. Therefore, the communication pattern is defined by the matrix pattern and the matrix distribution, whereas the gathering of e_k can be done using MPI_Alltoallw(). As the matrix pattern and distribution do not change within the loop, the communication structures can be defined before the loop, simplifying the communication step.
The computation can then proceed in parallel, yielding the result vector s in the expected distributed state with no further communication involved. At the end, each MPI process owns the corresponding piece of the computed vector. To ensure the reproducibility of this computation, the local DOT products between the sparse rows of Â_k and e_k are based on fma, as outlined in Section 3.3.
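A minimal sketch of this local computation, assuming a CSR storage of the local block (the arrays rowptr, colind, and vals are illustrative names), is given below; each row is accumulated with fma in a fixed sequential order, so the result is deterministic.

#include <cmath>
#include <vector>

// Local SPMV d_k = A_k * e with fma-based row dot products (one rounding per nonzero).
void spmv_local(const std::vector<int> &rowptr, const std::vector<int> &colind,
                const std::vector<double> &vals, const std::vector<double> &e,
                std::vector<double> &d) {
  for (std::size_t i = 0; i + 1 < rowptr.size(); ++i) {
    double acc = 0.0;
    for (int j = rowptr[i]; j < rowptr[i + 1]; ++j)
      acc = std::fma(vals[j], e[colind[j]], acc);
    d[i] = acc;
  }
}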
Figure 2. Formulation of the PBiCGStab solver annotated with computational kernels and communication. The threshold τ_max is an upper bound on the relative residual for the computed approximation to the solution. In the notation, ⟨·, ·⟩ computes the DOT (inner) product of its vector arguments.
4.1.2. DOT products (S3, S7, S10, S11). The next kernel in the loop body is the DOT product in the step S3 between the distributed vectors r^0 and s. Here, each process P_k can concurrently compute a partial result ρ_k = ⟨r^0_k, s_k⟩ and, when all processes have finished this partial computation, these intermediate values have to be reduced into a globally-replicated scalar α := σ/(ρ_1 + ρ_2 + ⋯ + ρ_K). We can apply the same idea to the DOT products in the steps S7, S10, and S11, yielding a total of five process synchronizations (in MPI, via MPI_Allreduce()) since all scalars are globally-replicated. However, the number of synchronizations can be reduced to four, considering that the communications in S10 and S11 can be merged into a single MPI_Allreduce().
The easiest way to compute ρ_k is to call the DOT routine from the Intel MKL or similar libraries; however, this does not guarantee reproducibility even when fma is used internally. Thus, we enforce reproducibility by applying our two ExBLAS-based strategies, following the guidelines in Section 3.2.
4.1.3. AXPY(-like) vector updates (S4, S8, S9, S12). The next kernel is the AXPY-like kernel in the step S4, which involves the distributed vectors q, r, s and the globally-replicated scalar α. The operations in the steps S8, S9, and S12 follow the same idea because all scalars are globally-replicated. In this type of kernel, all processes can perform their local parts of the computation to obtain the result without any communication. While AXPY (y = αx + y) can directly rely on the MKL library routine, AXPY-like (z = αy + x) requires at least two routines in order to be implemented (SCAL/COPY + AXPY). Looking for a robust and correct solution, the use of MKL routines is a bad alternative since each one introduces a rounding error. Additionally, this alternative is more expensive because some vectors must be traversed more than once. Instead, we propose to rely on fma, which computes each element of the solution of both AXPY and AXPY-like with a single rounding and only one pass through the vectors. Therefore, in the reproducible versions, we provide our own implementations of SPMV, AXPY, and AXPY-like (not relying on any external BLAS libraries) and, hence, have overall control of the computations, assuring their correct rounding and reproducibility.
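Minimal sketches of such fma-based AXPY and AXPY-like kernels, with illustrative names, are shown below; each element is produced with a single rounding and a single pass over the vectors.

#include <cmath>
#include <cstddef>

// AXPY: y = alpha*x + y
void axpy_fma(std::size_t n, double alpha, const double *x, double *y) {
  for (std::size_t i = 0; i < n; ++i)
    y[i] = std::fma(alpha, x[i], y[i]);
}

// AXPY-like: z = alpha*y + x
void axpy_like_fma(std::size_t n, double alpha, const double *y,
                   const double *x, double *z) {
  for (std::size_t i = 0; i < n; ++i)
    z[i] = std::fma(alpha, y[i], x[i]);
}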

4.1.4. Application of the preconditioner (S1, S5). The kernel in the step S1 consists of applying the Jacobi preconditioner M, scaling the vector p by the diagonal of the matrix. Therefore, it can be executed in parallel by all processes because each of them stores a different set of the diagonal elements (those related to the piece of the matrix that it stores) and the corresponding set of vector elements. The same procedure can be applied in the step S5 to scale the vector q, resulting in q̂.
There is no routine in the MKL library implementing the element-wise product of two vectors; therefore, an ad hoc implementation has to be provided. Reproducibility is ensured if this code is based on fma and the order of operations is deterministic, as mentioned in Section 3.3.
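A minimal sketch of this element-wise kernel is given below; it assumes that the reciprocals of the diagonal entries are precomputed once before the loop (divide before use, cf. Section 3.3), and the names are illustrative.

#include <cstddef>

// Jacobi preconditioner application on the local slice: p_hat = M_k^{-1} * p_k,
// realized as an element-wise product with the stored inverse diagonal.
void apply_jacobi(std::size_t n_local, const double *inv_diag,
                  const double *p, double *p_hat) {
  for (std::size_t i = 0; i < n_local; ++i)
    p_hat[i] = inv_diag[i] * p[i];   // deterministic, no reduction involved
}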

Message-passing parallel p-PBiCGStab implementation
Cools and Vanroose (2017) propose two main steps for deriving the pipelined version of a Krylov subspace method:
• Communication-avoiding: the number of global communications is reduced by rearranging the original recurrences. Usually more terms appear in the new recurrences and, therefore, there are more vector operations.
• Hiding communications: since global communications are the most time-consuming component of Krylov subspace methods at large scale, the alternative to reduce their impact on the performance of parallel implementations is their simultaneous execution (overlapping) with SPMV. This technique is implemented using non-blocking collectives, such as MPI_Iallreduce(), which require the use of MPI_Wait() to check when the communication is complete.
The algorithmic description of the pipelined preconditioned BiCGStab (p-PBiCGStab) is presented in Figure 3. The loop body consists of two SPMV (S10 and S18), two preconditioner applications (S9 and S17), six DOT products (S8 ∪ S11 and S16 ∪ S19), 18 AXPY/AXPY-like operations (S1-S7 and S12-S15), and a few scalar operations (Cools and Vanroose (2017)). It is worth mentioning that the pipelined PBiCGStab may show a different convergence behavior compared to the standard PBiCGStab due to the different way the floating-point operations are performed and, thus, the round-off errors are propagated and accumulated differently.
The analysis of the computational kernels of this algorithm is very similar to the one described above for the parallelization of PBiCGStab in Section 4.1. The only difference is how the DOT products are implemented.
Although there are six DOT products in Figure 3, only two global synchronizations are required because more than one reduction is completed in each of them. Therefore, before the synchronization is initiated, the partial results related to the corresponding reductions have to be computed locally in each process. Then, the obtained values are stored in local vectors which are used to compute the global values using collectives. The overlapping requires the use of non-blocking collectives, which decompose each reduction into two steps: the first step (S8 and S16) executes MPI_Iallreduce(), starting the global communication, which continues while other steps are performed, for example, S9 and S10. When the global values have to be used, the second step has to be done, calling MPI_Wait(), since the execution can only continue once the global communication is completed. We follow here the 'golden rule' of non-blocking communication: start as soon as the data are available and wait right before they are needed.
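The pattern is sketched below with a plain MPI_SUM on doubles for brevity; the reproducible variants reduce FPEs or long accumulators with the custom operation of Section 3.2 instead, and all names besides the MPI calls are placeholders.

#include <mpi.h>

// Overlap of a packed reduction (here three partial dot products) with SPMV and
// the preconditioner: start the non-blocking collective, compute, then wait.
void pipelined_reduction_example(double local[3], double global[3], MPI_Comm comm) {
  MPI_Request req;
  // start the global reduction of the packed partial results (step S8/S16)
  MPI_Iallreduce(local, global, 3, MPI_DOUBLE, MPI_SUM, comm, &req);

  // ... overlap: apply the preconditioner and compute the SPMV here ...

  // complete the reduction right before the scalars are needed (step S11/S19)
  MPI_Wait(&req, MPI_STATUS_IGNORE);
  // ... use global[0..2] to update the recurrence scalars ...
}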

Experimental results
In this section, we report a variety of numerical experiments to examine the convergence, scalability, accuracy, and reproducibility of the original and the two reproducible versions of PBiCGStab and p-PBiCGStab. In our experiments, we employed IEEE 754 double-precision arithmetic and conducted them on dual Intel Xeon Gold 6240R CPU @2.4 GHz nodes with 48 cores and 384 GB of memory each at Fraunhofer ITWM. Nodes are connected with HDR Infiniband.

Evaluation on the SuiteSparse matrices
We carried out tests on a range of different linear systems from the SuiteSparse matrix collection on a single node using 2, 8, 16, 32, and 48 (full) cores. Table 2 lists the set of tested matrices with the number of rows/columns N and the number of nonzeros nnz. We aim to show the reproducibility, accuracy, and performance of our algorithmic implementations on matrices with various loads, that is, numbers of nonzeros, as well as complexities. The right-hand side vector b in the iterative solvers was always initialized to the product Ad, d = (1/√N)(1, …, 1)^T, where N is the number of rows/columns of A. However, in both the ExBLAS- and FPE-based versions, marked as ReproPBiCGStab in the table, we computed b = Ad, d = (1, …, 1)^T, and then scaled b by 1/√N. In all implementations, iterations were started with the initial guess x^0 = 0. The parameter that controls the convergence of the iterative process is ‖r^j‖_2/‖r^0‖_2 ≤ 10^−6. We want to specify that ‖r^j‖_2 = √|(r^j, r^j)|, since some works use ‖r^j‖_2 = √|(r^0, r^j)|.
Figure 3. Formulation of the pipelined PBiCGStab solver annotated with computational kernels and communication. The threshold τ_max is an upper bound on the relative residual for the computed approximation to the solution. In the notation, ⟨·, ·⟩ computes the DOT (inner) product of its vector arguments.
Table 2 reports the number of iterations required to reach the stopping criterion as well as the final true residual for PBiCGStab and ReproPBiCGStab; the latter marks both the ExBLAS- and FPE-based variants as they report identical results independently of the number of cores/MPI processes used. We also report the initial residual (‖r^0‖_2), which, in combination with the final true residual, can serve as an indicator of how the convergence unfolds. For the original version, we display the number of iterations on one (iter1) and 32 cores (iter32) as they differ. In fact, there is variability of the results between the other core counts too. Notably, the two reproducible variants show a tendency to deliver more reliable accuracy of the approximate result (the final true residual) and/or converge faster. For instance, the reproducible variants require significantly fewer iterations for the vas_stokes_2M, orsreg_1, rdb3200L, Transport, and tmt_unsym matrices. The reproducible variants are slightly slower for only two matrices, namely ML_Geer and atmosmodj. For the other matrices, mostly symmetric matrices, the results are comparable between the reproducible and non-reproducible versions in terms of the number of iterations; however, there is a fluctuation in the final true residual for the original non-deterministic version.
The table also reports the overhead of the reproducible versions against the original non-deterministic version as the normalized mean time on 48 MPI processes. The two reproducible versions perform well, with an overhead under 3x for the majority of the test matrices. The FPE version generally shows better performance than the ExBLAS version: for one third of the test matrices, the overhead is under 2x.
Table 3 shows similar results for the non-deterministic pipelined PBiCGStab and its reproducible variants. The tendency of the reproducible variants to converge faster and to deliver more reliable accuracy is preserved. For instance, the reproducible variants require a lower number of iterations for the atmosmodl, atmosmodm, atmosmodd, orsreg_1, tmt_unsym, and ML_Geer matrices; they require more iterations for four matrices. It is not unusual for rounding errors to cancel in stable algorithms (see Higham (2002), e.g. page 19), yielding faster convergence of the method. As a consequence, it may happen that a computation with more precision takes more time to converge than one with less precision. With the pipelined PBiCGStab, we were able to converge to the approximate solution of vas_stokes_2M under the tolerance of 10^−6. When the required tolerance is tightened to, for example, 10^−8 or beyond, the pipelined methods may not converge for vas_stokes_2M, ML_Geer, and tmt_unsym.

Table 2. Convergence of the PBiCGStab and its reproducible versions (ReproPBiCGStab, identical results reported for both) on a set of the SuiteSparse matrices. The initial guess is x^0 = 0. The number of iterations required to reach the tolerance of 10^−6 on the scaled residual, i.e. ‖r^j‖_2/‖r^0‖_2, is reported along with the corresponding true residual ‖b − Ax^j‖_2. iterX stands for runs on X MPI processes. The last two columns show the overhead of the reproducible versions with 48 cores/MPI processes, for example, 1.09x for the add32 matrix.
We leave this as future work and foresee investigating the correlation between the requested accuracy and the capabilities of the solvers, potentially employing healing techniques such as residual replacement as well as more advanced preconditioners.
In addition, the table exhibits the overhead of the reproducible pipelined BiCGStab variants against the original version on 48 MPI processes. The two reproducible versions show an overhead of about 3x for most of the tested SuiteSparse matrices. As for the standard BiCGStab method, the FPE version generally shows better performance than the ExBLAS one: three quarters of the cases exhibit an overhead under 3x; for the rest, the overhead never exceeds 4x.
Figure 4 presents the convergence history in terms of the residual computed at every iteration of both the standard and pipelined PBiCGStab methods. The two depicted matrices, namely orsreg_1 and tmt_unsym, represent the beneficial scenarios for the reproducible variants, when they reach the approximate solution in significantly fewer iterations than their non-deterministic counterparts. In fact, these results demonstrate a desired scenario in which the reproducible variants converge to the solution faster despite the more costly computations per iteration. In the case of these two matrices, which may not be generic, the standard and pipelined PBiCGStab non-deterministic variants require more iterations on various MPI process counts. Moreover, the number of iterations to reach the approximation of the solution fluctuates significantly among runs of the same non-deterministic variant on a different process count.

Table 3. Convergence of the pipelined PBiCGStab and its reproducible versions (p-ReproPBiCGStab, identical results reported for both) on a set of the SuiteSparse matrices. The initial guess is x^0 = 0. The number of iterations required to reach the tolerance of 10^−6 on the scaled residual, that is, ‖r^j‖_2/‖r^0‖_2, is reported along with the corresponding true residual ‖b − Ax^j‖_2. iterX stands for runs on X MPI processes. The last two columns show the overhead of the reproducible versions with 48 cores/MPI processes, for example, 1.07x for the add32 matrix.
Figure 5 demonstrates the strong scalability results (the problem is fixed while the number of allocated resources varies) for the original and both the ExBLAS- and FPE-based standard and pipelined PBiCGStab variants on the Queen_4147 matrix. The figures in the left column report the mean execution time for the entire loop of the solver among five samples, while the figures in the right column show the performance overhead of the reproducible versions. We select Queen_4147 due to its large number of nonzero elements, 316 million. As we observed, a smaller number of nonzeros leads to worse scalability, especially on large core counts, and to a higher overhead for the reproducible variants, but never more than 8x. For small matrices like orsreg_1, a lower number of cores is a preferable option to reach an approximation to the solution with sustainable resource utilization. In these experiments, MPI communication is performed within a node, most likely via shared memory. All variants show good scalability results for Queen_4147 with 28x (24x), 29x (29x), and 31x (31x) speedups on 48 MPI processes, when compared to the one-process runs, for the original, FPE, and ExBLAS variants of the standard PBiCGStab (pipelined PBiCGStab), respectively. The reproducible variants demonstrate higher/better speedups due to the extra floating-point operations. The overhead of the ExBLAS and FPE variants compared to the standard variant is reduced to nearly 2.5x and 2.3x, respectively, on 48 MPI processes; the pipelined versions exhibit a slightly higher overhead on small core counts. The scalability on the other matrices from Tables 2 and 3 shows variable patterns and overheads.
Note that the average execution time per loop for many matrices from Tables 2 and 3 is too small to benefit from distributed-memory computation: the potential performance gain from extra nodes is cancelled out by communication.

Scalability
We leverage a sparse SPD coefficient matrix arising from the finite-difference discretization of a 3D Poisson's equation with 27 stencil points. We perturb the matrix with the values 1.0 − 0.0001 below the central point to create the unsymmetric 27-point stencil, aka the e-type model (Cools and Vanroose, 2017). Given that the theoretical cost of PBiCGStab is t_c ≈ 4nnz + 26n floating-point arithmetic operations, where nnz denotes the number of nonzeros of the matrix and n its size, the execution time of the method is usually dominated by that of the SPMV kernel. Therefore, in order to analyze the weak scalability of the method, we maintain the number of nonzero entries per node. For this purpose, we modified the original matrix, transforming it into a band matrix, where the lower and upper bandwidths (bandL and bandU, respectively) depend on the number of nodes employed in the experiment; with 32 nodes, the bandwidth ranges between 100 and 3200. With this approach we can then maintain the number of rows/columns of the matrix equal to n = 4M (4,019,679), while increasing its bandwidth and, therefore, the computational workload proportionally to the hardware resources, as required in a weak scaling experiment.
The right-hand side vector b in the iterative solvers was always initialized to the product of A with a vector containing only ones; and the PBiCGStab iteration was started with the initial guess x^0 = 0. The parameter that controls the convergence of the iterative process was set to 10^−6.
Figure 6 reports the results of both strong and weak scaling for the reproducible variants against the original version. For the strong scaling, we fix the problem to 64M nonzeros and vary the number of nodes/cores used, while for the weak scaling the workload per node is kept constant at 4M nonzeros by varying the bandwidth with respect to the number of nodes involved; presumably, there is enough load to hide the impact of communication. We select the median time among five runs to limit the impact of outliers. We run the tests within a single allocation of 32 nodes to make sure that there are no additional unnecessary perturbations to the measured time. For the strong scaling tests, the standard and pipelined PBiCGStab variants show a similar convergence behavior. However, for the standard variants the global reductions are not overlapped with computations and may show a higher overhead in the case of the FPE reproducible version due to a more complex reduction operation. For the standard reproducible versions, the overhead on 32 nodes is 37.8% and 40.2% for the FPE and ExBLAS versions, respectively. For the pipelined reproducible versions, the performance penalties are similar, with 38.0% for the FPE version and 35.9% for the ExBLAS. The weak scalability experiments show the expected behavior, with a slightly declining execution time and overheads around 35%.

Accuracy and reproducibility
In addition, we derive a sequential version of the PBiCGStab from Figure 2 that relies on the GNU Multiple Precision Floating-Point Reliable (MPFR) library (Fousse et al. (2007)), a C library for multiple (arbitrary) precision floating-point computations on CPUs, as a highly accurate reference implementation. This implementation uses 2048 bits of accuracy for computing DOT products, 192 bits for the internal element-wise products, and performs correct rounding of the computed result to double precision. Table 4 reports the intermediate and final (except for the original version, which takes longer) scaled residual on each iteration of the PBiCGStab solvers for the orsreg_1 matrix, as in Table 2, under the tolerance of 10^−6 on eight MPI processes. We also add the results of the original code on one core/process to highlight the reproducibility issue. The results are presented with all digits using hexadecimal representation. We report only a few iterations; however, the difference is present in all iterations. The sequential MPFR version of PBiCGStab confirms the accuracy and reproducibility of the parallel ExBLAS and FPE variants by reporting an identical number of iterations, intermediate residuals, and both the final true and initial scaled residuals. However, the MPFR variant of PBiCGStab converges to the approximate solution in 4.04e-01 s, while the ExBLAS and FPE variants take 3.94e-02 and 3.33e-02 s (10.24x and 12.14x faster), respectively, on eight MPI processes; the overhead of MPFR is 2.14x and 2.68x against ExBLAS and FPE using one MPI process. The original version of PBiCGStab shows a discrepancy from a few digits on the initial iterations up to almost the entire number on the final iterations; the count of required iterations also differs from the reproducible and MPFR variants.
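For illustration, a high-precision reference DOT product in the spirit of this MPFR-based implementation could be written as follows; the function name and interface are ours, not the paper's code.

#include <mpfr.h>
#include <cstddef>

// Accumulate the dot product in 2048-bit precision and round to double only once.
double dot_mpfr(std::size_t n, const double *x, const double *y) {
  mpfr_t acc, prod;
  mpfr_init2(acc, 2048);
  mpfr_init2(prod, 2048);
  mpfr_set_d(acc, 0.0, MPFR_RNDN);
  for (std::size_t i = 0; i < n; ++i) {
    mpfr_set_d(prod, x[i], MPFR_RNDN);
    mpfr_mul_d(prod, prod, y[i], MPFR_RNDN);   // exact here: the precision is ample
    mpfr_add(acc, acc, prod, MPFR_RNDN);
  }
  double result = mpfr_get_d(acc, MPFR_RNDN);  // single final rounding to binary64
  mpfr_clear(acc);
  mpfr_clear(prod);
  return result;
}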
We extend our study of accuracy and reproducibility and provide more details on the execution time of the MPFR version of PBiCGStab by comparing it against the two reproducible versions, namely FPE and ExBLAS. Table 5 reports the execution time of the MPFR version and its overhead against the FPE and ExBLAS versions on a set of SuiteSparse matrices. On a single process, the MPFR version generally requires 2x more time. This gap grows with the number of parallel resources used. For instance, on 16 cores/MPI processes, the MPFR overhead can be as large as 40x compared to the FPE version, with identical accuracy for both. Hence, the reproducible versions are not only a faster way to obtain accurate and reproducible computations (e.g. for numerical verification), but they are also as accurate as the MPFR implementation of PBiCGStab.

Related work
To enhance reproducibility, Intel proposed the 'Conditional Numerical Reproducibility' (CNR) option in its Math Kernel Library (MKL). Although CNR guarantees reproducibility, it does not ensure correct rounding, meaning the accuracy is arguable. Additionally, the cost of obtaining reproducible results with CNR is high. For instance, for large arrays the MKL summation with CNR was almost 2x slower than the regular MKL summation on the Mesu cluster hosted at Sorbonne University (Collange et al., 2015). Demmel and Nguyen (2013, 2015) implemented a family of algorithms, originating from the works by Rump et al. (2008, 2010), for reproducible summation in floating-point arithmetic. These algorithms always return the same answer. They first compute an absolute bound of the sum and then round all numbers to a fraction of this bound. In consequence, the addition of the rounded quantities is exact; however, the computed sum using their implementations with two or three bins is not correctly rounded. Their results yielded roughly 20% overhead on 1024 processors (CPUs only) compared to the Intel MKL dasum(), but showed a 3.4x slowdown on 32 processors (one node). Ahrens, Nguyen, and Demmel extended their concept to a few other reproducible BLAS routines, distributed as the ReproBLAS library (http://bebop.cs.berkeley.edu/reproblas/), but only with the parallel reproducible reduction. Furthermore, the ReproBLAS effort was extended to a reproducible tall-skinny QR (Nguyen and Demmel (2015)).
The other approach to ensure reproducibility is ExBLAS, initially proposed by Collange et al. (2015). ExBLAS is based on combining long accumulators and floating-point expansions in conjunction with error-free transformations. This approach is presented in Section 2. Collange et al. (2015) showed that their algorithms for reproducible and accurate summation have an 8% overhead on 512 cores (32 nodes) and less than 2% overhead on 16 cores (one node). While ExSUM covers a wide range of architectures as well as distributed-memory clusters, the other routines primarily target GPUs. Exploiting the modular and hierarchical structure of linear algebra algorithms, the ExBLAS approach was applied to construct reproducible LU factorizations with partial pivoting (Iakymchuk et al., 2019). Mukunoki and Ogita presented their approach to implementing reproducible BLAS, called OzBLAS (Mukunoki et al., 2019), with tunable accuracy. This approach differs from both ReproBLAS and ExBLAS as it does not require implementing every BLAS routine from scratch but relies on high-performance (vendor) implementations. Hence, OzBLAS implements the Ozaki scheme (Ozaki et al., 2012) that follows the fork-join approach: the matrix and vector are split (each element is sliced) into sub-matrices and sub-vectors for secure products without overflows; then, the high-performance BLAS is called on each of these splits; finally, the results are merged back using, for instance, the NearSum algorithm. Currently, the OzBLAS library includes the dot product, matrix-vector product (gemv), and matrix-matrix multiplication (gemm). These algorithmic variants and their implementations on GPUs and CPUs (only dot) reassure reproducibility of the BLAS kernels as well as make the accuracy tunable up to correctly rounded results.
The proposed framework was implicitly used to derive the reproducible preconditioned Conjugate Gradient (PCG) variants with flat MPI (Iakymchuk et al., 2020b) and hybrid MPI plus OpenMP tasks (Iakymchuk et al., 2020a). The reproducible PCG variants were primarily verified on the 3D Poisson's equation with 27 stencil points, showing good scalability and a low performance overhead (under 30% for both the ExBLAS and lightweight variants) on up to 768 cores of the MareNostrum4 cluster.

Conclusions
Parallel Krylov subspace methods may lack reproducibility when implemented in parallel environments, as the results in Tables 2-4 confirm. Such numerical reliability is needed for debugging and validation & verification. In this work, we proposed a general framework for reconstructing reproducibility and re-assuring accuracy in any Krylov subspace method. Our framework is based on two steps: analysis of the underlying algorithm for numerical abnormalities, and addressing them via algorithmic solutions and programmability hints. The algorithmic solutions are built around the ExBLAS project, namely: ExBLAS, which effectively combines long accumulators and FPEs, and FPEs alone for the lightweight version. The programmability effort was focused on explicitly invoking fma instructions to avoid replacements by compilers; on customized, fma-based AXPY and AXPY-like operations instead of MKL or similar BLAS libraries; as well as on postponing divisions to the moment where they are required.
As test cases, we used the preconditioned standard and pipelined BiCGStab methods and derived two reproducible algorithmic variants of each of them. It is worth mentioning that the two BiCGStab methods are in fact different algorithms with different sets of operations, yielding non-identical computation paths and a divergent way in which rounding errors are propagated; this difference can be witnessed in the convergence history in Figure 4 even when using the reproducible variants. The reproducible variants deliver identical results of the standard and also the pipelined PBiCGStab, which are confirmed by its MPFR version, ensuring reproducibility in the number of iterations, the intermediate and final residuals, as well as the sought-after solution vector. We verified our implementations on a set of the SuiteSparse matrices, showing a performance overhead of nearly 2.0x for the ExBLAS- and FPE-based versions, with a noticeably lower overhead for the latter. Tests with the 27-point stencil on 32 nodes show a low performance overhead of 35%-40%. The code is available at https://github.com/riakymch/ReproPBiCGStab.
With this study we want to promote reproducibility by design through the proper choice of the underlying libraries as well as a careful programmability effort. For instance, a brief guidance would be: 1) for fundamental numerical computations, use reproducible underlying libraries such as ExBLAS, ReproBLAS, or OzBLAS; and 2) analyze the algorithm and make it reproducible by eliminating any uncertainties and non-deterministic orders of computation that may violate associativity, such as reductions, the use/non-use of fma, and divisions that can be postponed until actually needed. Additionally, we raise the question of whether bit-wise reproducible and correctly rounded results are strictly needed for iterative solvers, as the intermediate results are refined in subsequent iterations until the desired tolerance is reached and, thus, the full bit-wise results are not exploited. This becomes more evident with mixed-precision approaches, which we foresee to pursue.
Our future work is to investigate the residual replacement strategy in pipelined Krylov subspace solvers such as the pipelined PBiCGStab (p-PBiCGStab) (Cools and Vanroose (2017)) and to study whether the need for such a strategy can be alleviated by the higher precision provided by long accumulators and FPEs. We believe that there is potential in using the higher precision provided by long accumulators and FPEs in order to mitigate the different ways rounding errors are propagated as well as to cope with the attainable precision loss in p-PBiCGStab.

Figure 1.
Figure 1. Preconditioned Conjugate Gradient method with annotated BLAS kernels and message-passing communication.

Figure 4.
Figure 4. Residual history of the standard PBiCGStab and its reproducible variants (first row), and the pipelined PBiCGStab and its reproducible variants (second row); orsreg_1 results are shown in the first column, while tmt_unsym in the second column, see Table 2 for details on the matrices. Note that the last iteration is not shown.

Figure 5.
Figure 5. Strong scaling results of the standard PBiCGStab and its reproducible variants (first row), and the pipelined PBiCGStab and its reproducible variants (second row) for the Queen_4147 matrix, see Table 2; plots in the first column report the measured time, while plots in the second show the overhead.

Figure 6.
Figure 6. Strong (top row) and weak (bottom row) scalability of the reproducible PBiCGStab variants; the standard PBiCGStab results are shown in the left column of plots, while the pipelined PBiCGStab results in the right column.

Table 1.
Parameters for three IEEE arithmetic precisions.

Table 4.
Accuracy and reproducibility of the intermediate and final residual against the Multiple Precision Floating-Point Reliable (MPFR) library for the orsreg_1 matrix, see Table 2.

Table 5.
Execution time of the sequential MPFR version of PBiCGStab under the tolerance of 10^−6 and its comparison against the FPE and ExBLAS reproducible versions on a set of the SuiteSparse matrices, see Table 2; iterX stands for runs on X MPI processes and the values in the iterX columns show how many times FPE/ExBLAS is faster.