Providing malleable capabilities to MPI applications
FLEX-MPI is a runtime system that extends the functionalities of the MPI library by providing malleable and load balance capabilities to MPI applications. FlexMPI also includes an external controller that manages the execution of multiple (malleable) applications. This management includes capabilities for dynamically changing the application’s number of processes, application monitoring, process-core binding, and I/O scheduling.
FlexMPI source code is available in the download section. We provide the FlexMPI library and the external controller as well as a basic example of a MPI application that alternates CPU, communication and I/O phases. The code repository includes a user manual with a full description of the installation and execution processes.
The controller executes applications A, B and C with 2, 1 and 4 processes, respectively.
The controller creates 1 and to 2 new processes in applications B and C, respectively.
The controller changes the process-core binding of one process of Application C.
The controller activates the FlexMPI’s monitoring service for Application B.
Finally, the controller coordinates the I/O access to the storage subsystem.
To provide MPI applications of malleable capabilities that permit dynamically increase or decrease the number of processes at execution time.
To transparently perform the application load balance by redistributing the data when a malleable operation is carried on. Alternatively, load balance can also be triggered by external user commands.
To provide monitoring capabilities to the applications. The metrics collected include the execution time of each CPU, communication and I/O phase, the amount of data involved in the I/O operations, and other metrics (MIPS, FLOPS, cache misses, etc.) obtained using performance counters.
An external controller is included to collect monitoring information from the running applications. In addition, the controller is also able to send controls commands to the executing applications. These commands can be provided by the user or by third-party programs and include (among others) actions to spawn or remove processes, to perform process-core binding, and to change the performance counters used to monitor the application.
Flex-MPI includes an implementation of Clarisse’s I/O scheduling algorithm that coordinates the I/O of the executing applications in a transparent manner.
The following figure shows the framework architecture. FlexMPI library is executed with the running applications. FlexMPI receives commands from the external controller and collects and sends several application performance metrics. This is done by using PAPI library and internal timers. Note that FlexMPI includes many features not described in this summary. For a detailed description see the PhD. Thesis Optimization Techniques for Adaptability in MPI Applications.
The controller consists of three components: the monitoring, the management and the communication modules. The monitoring module receives and stores the application performance metrics, that include CPU, communication and I/O times. These metrics can subsequently be used by different optimization policies.
The management module includes a resource manager that assigns the application processes to the existing compute nodes. This is done when a new application is executed or when new processes are created (by means of malleability). The other component of this module is the I/O scheduler, that coordinates the application I/O providing exclusive I/O access to each application.
Finally, the communication module receives the user commands. A full description of these commands can be seen in FlexMPI User Manual. The communication interface is responsible of sending these control commands to FlexMPI library (running with the applications).
At Carlos III University of Madrid: David E. Singh, Jesús Carretero, Javier García Blas and Florín Isaila.
At Barcelona Supercomputing Center: María-Cristina Marinescu
At industry: Gonzalo Martín and Manuel Rodríguez-Gonzalo
This work has been partially funded by the Spanish Ministry of Science and Education under the contracts MEC 2011/00003/001, TIN2010-16497, TIN2013-41350-P and TIN2016-79637-P “Towards Unification of HPC and Big Data paradigms”. It was also partially supported by EU under the COST Program Action IC1305, Network for Sustainable Ultrascale Computing (NESUS).