A generic I/O architecture for data-intensive applications based on in-memory distributed cache

13 March, 2017

Abstract
The evolution of scientific computing towards data-intensive applications and the increasing heterogeneity of computing resources are posing new challenges to the requirements of the I/O layer. There is a need to recognize the close relationship between data-intensive scientific computing and the Big Data computing field. Exploiting synergies between both paradigms is necessary for achieving next-generation scientific computing breakthroughs. Moreover, to reach the desired unification, the new solutions should also be generic, portable, and extensible to future ultrascale systems. These systems are envisioned as large-scale complex systems that join parallel and distributed computing systems, reaching capacities two to three orders of magnitude larger than today’s systems.
Current trends in data-intensive scientific computing are based on the exploitation of cloud platforms and the utilization of workflow engines. On the one hand, the emergence of the cloud computing paradigm allows the deployment of execution environments for complex data reduction and analysis applications that can be fully customized, using virtually limitless resources on a pay-per-use basis. This new paradigm can be considered a competitor to classical HPC systems. However, future approaches will borrow the advantages of both HPC and cloud technologies.
On the other hand, workflow engines simplify the development and deployment of data-intensive applications on different infrastructures. The combination of these novel directions is changing the landscape of scientific computing as a popular way of providing high-throughput reduction and analysis capabilities.
This thesis presents a novel generic I/O architecture for data-intensive applications based on an in-memory distributed cache, targeting both I/O bottlenecks and the heterogeneity of computing resources. The proposed architecture pursues four main objectives: scalability, flexibility, portability, and performance.
In order to expose the potential performance improvements of our proposed solution in a wide range of scenarios and to demonstrate the feasibility of our generic design, we have deployed the proposed architecture in three different scenarios, ranging from a tightly coupled HPC infrastructure to loosely coupled infrastructures such as cloud platforms and mobile cloud computing environments.
Each case includes subtle adaptations to leverage the specific characteristics of the platform, along with in-depth performance evaluations using benchmarks and applications.
This extensive evaluation on multiple systems demonstrates that our solution makes better use of resources than existing state-of-the-art approaches, providing better performance and scalability in most cases.
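The abstract does not detail how the in-memory distributed cache is implemented. As a rough, hypothetical illustration of the general idea (not the thesis's actual design), the sketch below routes keys to node-local in-memory stores via consistent hashing, so that adding or removing a cache node only relocates a fraction of the cached data. All names (`DistributedCache`, `node-a`, etc.) are invented for this example.

```python
import hashlib
from bisect import bisect

class DistributedCache:
    """Hypothetical sketch of an in-memory distributed cache.

    Keys are mapped to nodes with consistent hashing: each node is
    placed on a hash ring at several points (virtual replicas), and
    a key is stored on the first node clockwise from its hash.
    """

    def __init__(self, nodes, replicas=64):
        self._ring = []                          # sorted (hash, node) pairs
        self._stores = {n: {} for n in nodes}    # per-node in-memory store
        for node in nodes:
            for r in range(replicas):
                self._ring.append((self._hash(f"{node}:{r}"), node))
        self._ring.sort()

    @staticmethod
    def _hash(key):
        # Stable hash so routing is deterministic across processes.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def _node_for(self, key):
        # First ring entry clockwise from the key's hash (wrapping around).
        idx = bisect(self._ring, (self._hash(key),)) % len(self._ring)
        return self._ring[idx][1]

    def put(self, key, value):
        self._stores[self._node_for(key)][key] = value

    def get(self, key):
        return self._stores[self._node_for(key)].get(key)
```

In a real deployment the per-node dictionaries would live in separate server processes reached over the network; here they are kept in one process purely to show the routing logic:

```python
cache = DistributedCache(["node-a", "node-b", "node-c"])
cache.put("block-42", b"payload")
assert cache.get("block-42") == b"payload"
```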

Project
www.arcos.inf.uc3m.es/sdamathecs

BibTex
@article{duro2016generic,
title={A generic I/O architecture for data-intensive applications based on in-memory distributed cache},
author={Duro, Rodrigo and Garc{\'\i}a Blas, Javier and Carretero P{\'e}rez, Jes{\'u}s},
year={2016}
}