Distributed Memory Reduction Operations in Presence of Process Desynchronization

Despite decades of exponential growth in computational power, humans continue to find new problems that eclipse available computational resources. This unrelenting pursuit of computational power has brought about supercomputers consisting of millions of individual computing units. Writing programs that efficiently utilize the computational power of such complex machines has turned out to be a major challenge. As of today, most high-performance computing (HPC) applications continue to be based on the distributed memory programming paradigm, through the use of the Message Passing Interface (MPI). One of the principal drivers behind the research in this dissertation was the coupling of the multi-scale, multi-physics iPIC3D space weather simulation with in-situ raytraced visualization for real-time simulation steering. This application was developed by the Leuven Intel ExaScale Lab as a research prototype for the type of HPC applications projected to run on exascale machines of the 2018-2020 timeframe. Due to the nature of the coupling, load imbalance and process desynchronization emerged as leading causes of performance shortcomings. In particular, image compositing and other collective reduction operations were identified as most critical to visualization performance. This dissertation proposes three new reduction algorithms that are resilient to process desynchronization. Two of the algorithms achieve this by using side information to preconstruct reduction schedules, while the third constructs dynamic reduction schedules at runtime. Together, these algorithms address both the case where the input data is atomic and the case where it can be arbitrarily segmented. Compared to non-blocking collective operations, imbalance-robust collectives have the advantage that they require no changes to legacy code and are fully transparent to the user. Finally, a new benchmarking suite capable of injecting customized process desynchronization was developed. The extensive experimental assessments presented in this dissertation indicate that the proposed algorithms conclusively outperform the state of the art.

File Type: pdf
File Size: 4 MB
Publication Year: 2016
Author: Marendic, Petar
Supervisors: Jan Lemeire, Peter Schelkens
Institution: Vrije Universiteit Brussel
Keywords: MPI, reduction, message passing, process skew, load imbalance, system noise, distributed memory, high performance computing, Infiniband, process arrival time