Towards Unification of HPC and Big Data Paradigms


The global information technology ecosystem is in transition to a new generation of applications that require intensive data acquisition, processing, and storage. As a result of this shift towards data-intensive computing, there is a growing overlap between high-performance computing (HPC) and Big Data applications: many HPC applications produce Big Data, while Big Data is a growing consumer of HPC capabilities. The topic has attracted strong interest, as evidenced by papers and communications published in high-impact journals (Dongarra, Comm. ACM) and conferences (such as Supercomputing 2015 in Austin). Moreover, international groups have been created to address this challenge, such as the “Big Data & Extreme Computing Initiative”, which joins experts from the EU, the USA, and Japan.

The hypothesis of this project, in line with a current trend in scientific computing, is that interoperability and scaling convergence between the HPC and Big Data ecosystems are crucial to the future of both, and that their unification is essential to address a spectrum of major research domains.

Thus, the main goal of this project is to research new approaches that facilitate the convergence of the HPC and Big Data paradigms by providing common abstractions and mechanisms to improve scalability, data-locality exploitation, and adaptivity on large-scale computers. To achieve this goal, the project explores solutions to challenges at three levels of the software stack: the application layer, data management at the system-software layer, and the local node level. Cross-layer mechanisms will also be addressed for resource monitoring and discovery, and for exploiting parallelism at all levels: by spreading system information across layers, informed optimizations can be applied at every level, making global system optimizations possible and avoiding mismatches between layers.
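To make the cross-layer idea concrete, the following is a minimal sketch (all class and metric names are hypothetical, not part of the project's actual software) of how layers could publish metrics into a shared registry so that, for example, an application-level scheduler can combine storage-layer placement information with node-level load to make a data-aware decision:

```python
# Sketch (hypothetical names) of a cross-layer monitoring registry:
# each layer of the stack publishes metrics, and any other layer can
# query them to make informed, globally-aware decisions.

class CrossLayerMonitor:
    """Shared registry of metrics, keyed by (layer, node, metric)."""
    def __init__(self):
        self.metrics = {}

    def publish(self, layer, node, metric, value):
        self.metrics[(layer, node, metric)] = value

    def query(self, layer, node, metric, default=None):
        return self.metrics.get((layer, node, metric), default)


def pick_node(monitor, nodes, data_block):
    """Data-aware placement: prefer a node that already holds the
    block locally (storage-layer info), breaking ties by node load."""
    def cost(node):
        local = monitor.query("storage", node, f"holds:{data_block}", False)
        load = monitor.query("node", node, "load", 1.0)
        return (0 if local else 1, load)
    return min(nodes, key=cost)


# Example: the storage layer reports block placement, the node layer
# reports load, and the scheduler combines both views.
mon = CrossLayerMonitor()
mon.publish("storage", "n1", "holds:blockA", True)
mon.publish("node", "n1", "load", 0.9)
mon.publish("node", "n2", "load", 0.1)
# n1 is chosen despite higher load because it holds the data locally.
assert pick_node(mon, ["n1", "n2"], "blockA") == "n1"
```

The point of the sketch is the design choice it illustrates: a locality-first, load-second ordering only becomes possible when information from different layers is visible in one place, which is what the proposed cross-layer mechanisms aim to provide.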

The specific goals of this project are:

  •  To explore novel techniques and abstractions to allow HPC and Big Data applications to exploit parallelism, locality, elasticity, and adaptivity of the computing systems.
  •  To research new data management mechanisms that unify the storage paradigms currently used in HPC and Big Data.
  •  To investigate node-level techniques to efficiently exploit parallelism and data locality on both homogeneous and heterogeneous nodes.
  •  To evaluate the feasibility of the proposed solutions through relevant use cases (O4). Careful planning and coordination of the project activities, together with high-quality dissemination, will also be carried out to maximize the project's impact.

The solutions developed in the project are validated through benchmarks and four real-world use cases developed by the ARCOS group: RPCS, a railway electric power consumption simulator; EpiGraph, a large-scale simulator of the propagation of epidemic diseases; Fux-Sim, an X-ray simulator that reproduces different low-dose scan positions for later reconstruction with high-quality algorithms; and pHardi, a tool for the analysis of High Angular Resolution Diffusion Imaging data.


KEYWORDS: HPC, Big Data, data mapping abstractions, data-aware scheduling, parallelism exploitation.