Reducing Global Synchronization in the Biconjugate Gradient Method




Starting from a specific implementation of the Lanczos biorthogonalization algorithm, an iterative process for solving systems of linear equations with a general non-Hermitian coefficient matrix is derived. Because of the orthogonalization in the underlying Lanczos process, the resulting iterative scheme involves inner products, which lead to global communication and synchronization on parallel processors. On massively parallel computers, these effects cause considerable delays that often prevent the implementation from scaling. In the proposed process, all inner product-like operations of an iteration step are independent, so that the implementation requires only a single global synchronization point per iteration. In exact arithmetic, the process is shown to be mathematically equivalent to the biconjugate gradient method. The efficiency of this new variant is demonstrated by numerical experiments on a PARAGON system using up to 121 processors.
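
As an illustration of why independent inner products allow a single synchronization point, the following sketch compares one reduction per inner product with one fused reduction for all of them. It is not the process derived in this work: the communicator `SimulatedComm`, the helpers `split`, `local_dot`, `separate_reductions`, and `fused_reduction`, and the block distribution of the vectors are all hypothetical names and assumptions made for this example, and the "ranks" are simulated inside a single Python process.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


class SimulatedComm:
    """Toy stand-in for a message-passing communicator (hypothetical, not a real library)."""

    def __init__(self) -> None:
        self.sync_points = 0  # number of global reductions performed so far

    def allreduce_sum(self, contributions: Sequence[Sequence[float]]) -> List[float]:
        """Element-wise sum over all ranks; each call counts as one global synchronization."""
        self.sync_points += 1
        return [sum(column) for column in zip(*contributions)]


def split(v: Vector, ranks: int) -> List[Vector]:
    """Distribute v over `ranks` simulated processes in contiguous blocks (an assumption)."""
    size = (len(v) + ranks - 1) // ranks
    return [v[r * size:(r + 1) * size] for r in range(ranks)]


def local_dot(x: Vector, y: Vector) -> float:
    """Partial inner product over the entries owned by one rank."""
    return sum(a * b for a, b in zip(x, y))


def separate_reductions(comm: SimulatedComm,
                        pairs: Sequence[Tuple[Vector, Vector]],
                        ranks: int) -> List[float]:
    """One global reduction per inner product, as when each result is needed
    before the next inner product can be formed."""
    results = []
    for x, y in pairs:
        xs, ys = split(x, ranks), split(y, ranks)
        partials = [[local_dot(xs[r], ys[r])] for r in range(ranks)]
        results.append(comm.allreduce_sum(partials)[0])
    return results


def fused_reduction(comm: SimulatedComm,
                    pairs: Sequence[Tuple[Vector, Vector]],
                    ranks: int) -> List[float]:
    """All inner products are independent, so each rank packs its partial sums
    into one buffer and a single global reduction delivers every result."""
    blocks = [(split(x, ranks), split(y, ranks)) for x, y in pairs]
    per_rank = [[local_dot(xs[r], ys[r]) for xs, ys in blocks] for r in range(ranks)]
    return comm.allreduce_sum(per_rank)
```

Both functions return the same inner products; only the counter `sync_points` differs, growing by one per inner product in the first case and by one in total in the second. On a real machine the fused buffer would be passed to a single collective reduction call, which is the cost that the counter stands in for here.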