Tutorials

 

ACCELERATING BIG DATA PROCESSING WITH HADOOP, SPARK AND
MEMCACHED ON MODERN CLUSTERS


Dhabaleswar K. (DK) Panda and Xiaoyi Lu.
Department of Computer Science and Engineering, The Ohio State University

PROGRAMMING DISTRIBUTED PLATFORMS WITH PyCOMPSs


Rosa M Badia and Javier Conejero.
Barcelona Supercomputing Center

Tutorials Description


ACCELERATING BIG DATA PROCESSING WITH HADOOP, SPARK AND
MEMCACHED ON MODERN CLUSTERS


Dhabaleswar K. (DK) Panda and Xiaoyi Lu.
Department of Computer Science and Engineering, The Ohio State University

 

Abstract

Apache Hadoop and Spark are gaining prominence in handling Big Data and analytics. Similarly, Memcached in Web 2.0 environments is becoming important for large-scale query processing. These middleware are traditionally written with sockets and do not deliver the best performance on datacenters with modern high-performance networks. In this tutorial, we will provide an in-depth overview of the architecture of Hadoop components (HDFS, MapReduce, RPC, HBase, etc.), Spark, and Memcached. We will examine the challenges of re-designing the networking and I/O components of these middleware for modern interconnects and protocols (such as InfiniBand, iWARP, RoCE, and RSocket) with RDMA, as well as for modern storage architectures. Using the publicly available software packages of the High-Performance Big Data (HiBD, http://hibd.cse.ohio-state.edu) project, we will provide case studies of the new designs for several Hadoop/Spark/Memcached components and their associated benefits. Through these case studies, we will also examine the interplay between high-performance interconnects, storage systems (NVM, SSD, HDD, and parallel file systems), and multi-core platforms to achieve the best solutions for these components.
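As a point of reference for the query-caching role of Memcached mentioned above, the following is a minimal cache-aside sketch in Python. It assumes a Memcached server on localhost:11211 and the pymemcache client library; the fetch_profile and query_backend names are hypothetical and used only for illustration. It shows the conventional socket-based client interface whose transport the RDMA-based designs discussed in the tutorial replace, and is not part of the HiBD packages.

    # Minimal cache-aside lookup (sketch): assumes a Memcached server on
    # localhost:11211 and the pymemcache client library.
    from pymemcache.client.base import Client

    cache = Client(('localhost', 11211))

    def fetch_profile(user_id, query_backend):
        """Return a profile string, consulting Memcached before the backing store."""
        key = 'profile:%s' % user_id
        value = cache.get(key)                 # fast path: served from the cache
        if value is None:
            value = query_backend(user_id)     # slow path: query the backing database
            cache.set(key, value, expire=300)  # keep the result warm for 5 minutes
        return value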

Targeted Audience and Scope

The tutorial content is planned for half a day. This tutorial is targeted at various categories of people working in the areas of Big Data, including high-performance Hadoop/Spark/Memcached, high-performance communication and I/O architecture, storage, networking, middleware, cloud computing, and applications. The specific audience this tutorial is aimed at includes:

  • Scientists, engineers, researchers, and students engaged in designing next-generation Big Data systems and applications
  • Designers and developers of Big Data, Hadoop, Spark and Memcached middleware
  • Newcomers to the field of Big Data who are interested in familiarizing themselves with Hadoop, Spark, Memcached, RDMA, and high-performance networking
  • Managers and administrators responsible for setting up next-generation Big Data environments and high-end systems/facilities in their organizations/laboratories

The content level will be as follows: 30% beginner, 40% intermediate, and 30% advanced. There is no fixed prerequisite. As long as the attendee has general knowledge of Big Data, Hadoop, Spark, Memcached, high-performance computing, networking and storage architectures, and related issues, he/she will be able to understand and appreciate the tutorial. The tutorial is designed in such a way that an attendee is exposed to the topics in a smooth and progressive manner, and it is organized as a coherent talk covering multiple topics.

Table of contents

  • Introduction to Big Data Applications and Analytics
  • Overview of MapReduce and Resilient Distributed Datasets (RDD) Programming Models (see the sketch after this list)
  • Architecture Overview of Apache Hadoop, Spark and Memcached
    • MapReduce and YARN
    • HDFS
    • Spark
    • RPC
    • HBase
    • Memcached
  • Overview of High-Performance Interconnects, Protocols, and Storage Architectures for Modern Datacenters
    • InfiniBand and RDMA
    • 10/40 GigE, iWARP and RoCE technologies
    • RSocket and SDP protocols
    • SSD-based storage
  • Challenges in Accelerating Hadoop, Spark and Memcached on Modern Datacenters
  • Overview of Benchmarks and Applications using Hadoop, Spark and Memcached
  • Basic Acceleration Case Studies and In-Depth Performance Evaluation
    • MapReduce over InfiniBand with RDMA, RAMDisk, SSD, and Lustre
    • HDFS over InfiniBand with RDMA and Heterogeneous Storage (RAMDisk, SSD, HDD, and Lustre)
    • Spark over InfiniBand with RDMA, SSD, and Lustre
    • RPC over InfiniBand with RDMA
    • HBase over InfiniBand with RDMA and SSD
    • Memcached over InfiniBand with RDMA, SSD, and Lustre
  • The High-Performance Big Data (HiBD) Project and Associated Releases
  • Other Activities for Accelerating Big Data Applications
  • Advanced Acceleration Case Studies
    • HDFS over InfiniBand with RDMA and NVM
    • MapReduce over InfiniBand with RDMA and NVM
    • MR-Advisor for Performance Tuning
    • Big Data over HPC Cloud
  • Conclusion and Q&A
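For attendees new to the RDD programming model referenced in the outline above, here is a minimal PySpark word-count sketch. It assumes a local Spark installation, and the input path is hypothetical; the shuffle-intensive map/reduce pattern it exhibits is the kind of workload targeted by the acceleration case studies.

    # Minimal RDD word count (sketch): assumes a local Spark installation;
    # the input path is hypothetical.
    from pyspark import SparkContext

    sc = SparkContext(appName="RDDWordCount")

    counts = (sc.textFile("/tmp/input.txt")            # build an RDD from the input file
                .flatMap(lambda line: line.split())    # map phase: emit individual words
                .map(lambda word: (word, 1))           # form (word, 1) key-value pairs
                .reduceByKey(lambda a, b: a + b))      # reduce phase: shuffle and sum counts

    for word, count in counts.take(10):                # pull a small sample to the driver
        print(word, count)

    sc.stop()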

Brief Bio of the Speakers 

Dr. Dhabaleswar K. (DK) Panda is a Professor of Computer Science at the Ohio State University. He obtained his Ph.D. in computer engineering from the University of Southern California. His research interests include parallel computer architecture, high-performance computing, communication protocols, file systems, network-based computing, and Quality of Service. He has published over 400 papers in major journals and international conferences related to these research areas. Dr. Panda and his research group members have been doing extensive research on modern networking technologies including InfiniBand, HSE, and RDMA over Converged Enhanced Ethernet (RoCE). His research group is currently collaborating with National Laboratories and leading InfiniBand and 10GigE/iWARP companies on designing various subsystems of next-generation high-end systems. The MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) open-source software package, developed by his research group, is currently being used by more than 2,725 organizations worldwide (in 83 countries). This software has enabled several InfiniBand clusters (including the 1st one) to get into the latest TOP500 ranking. More than 409,000 downloads of these libraries have taken place from the project’s site. These software packages are also available with the stacks for network vendors (InfiniBand and iWARP), server vendors, and Linux distributors. The RDMA-enabled Apache Hadoop, Spark, and Memcached packages, consisting of acceleration for HDFS, MapReduce, RPC, Spark, and Memcached, are publicly available from the High-Performance Big Data (HiBD) project site: http://hibd.cse.ohio-state.edu. These packages are currently being used by more than 205 organizations in 29 countries. More than 19,850 downloads have taken place from the project’s site. Dr. Panda’s research is supported by funding from the US National Science Foundation, the US Department of Energy, and several industry partners including Intel, Cisco, SUN, Mellanox, QLogic, NVIDIA, and NetApp. He is an IEEE Fellow and a member of ACM. More details about Dr. Panda, including a comprehensive CV and publications, are available here.

Dr. Xiaoyi Lu is a Research Scientist in the Department of Computer Science and Engineering at the Ohio State University, USA. His current research interests include high-performance interconnects and protocols, Big Data, the Hadoop/Spark/Memcached ecosystem, parallel computing models (MPI/PGAS), virtualization, and cloud computing. He has published over 80 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities (PC Co-Chair, PC Member, Reviewer, Session Chair) for academic journals and conferences. Dr. Lu is currently leading the research and development of RDMA-based accelerations for Apache Hadoop, Spark, HBase, and Memcached, as well as the OSU HiBD micro-benchmarks, which are publicly available from http://hibd.cse.ohio-state.edu. These libraries are currently being used by more than 205 organizations from 29 countries. More than 19,850 downloads of these libraries have taken place from the project site. He is a core member of the MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) project, and he is leading the research and development of MVAPICH2-Virt (high-performance and scalable MPI for hypervisor- and container-based HPC clouds). He is a member of IEEE and ACM. More details about Dr. Lu are available here.


PROGRAMMING DISTRIBUTED PLATFORMS WITH PyCOMPSs


Rosa M Badia and Javier Conejero.
Barcelona Supercomputing Center

Motivation

Task-based programming models have proven to be a suitable approach to exploit large-scale parallelism by enabling a data-flow execution model and avoiding global synchronization. COMPSs falls into this category: it is able to exploit the inherent concurrency of sequential applications and execute them on distributed platforms, including clusters. This is achieved by annotating parts of the code as tasks and building, at execution time, a task-dependence graph based on the data consumed/produced by the tasks. The COMPSs runtime is able to schedule tasks on computing nodes, taking into account factors such as data locality and node heterogeneity. The recently released Python binding (PyCOMPSs) opens a new door to executing Python scripts in parallel on distributed platforms.
The COMPSs environment consists of several components besides the language interface and runtime: an instrumented runtime able to generate post-mortem trace files that can be analyzed in detail with Paraver; a runtime monitor able to show the current task graph, the tasks being executed, the load on the different nodes, etc.; a graphical development interface (IDE) that helps with the deployment of applications (especially in environments such as clouds); and an integration with the Jupyter notebook to execute interactive applications in PyCOMPSs.

The objectives of the tutorial are: to give an overview of the PyCOMPSs/COMPSs task-based programming model syntax (how tasks are annotated and how tasks are synchronized); to demonstrate how to use PyCOMPSs/COMPSs to parallelize and run applications in clusters and clouds, with parallelization strategies and sample codes that illustrate how the framework works; and to give an overview of the COMPSs runtime (scheduling and policies, interface with execution platforms). The tutorial will be composed of presentations and hands-on exercises.
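To make the annotation and synchronization described above concrete, here is a minimal PyCOMPSs sketch, assuming a working PyCOMPSs installation; the increment and add functions and their values are hypothetical, chosen only for illustration. The @task decorator marks functions as tasks, calls to them are asynchronous and feed the task-dependence graph, and compss_wait_on is the synchronization point that retrieves a result.

    # Minimal PyCOMPSs sketch: task annotation and synchronization.
    # (Assumes a PyCOMPSs installation; functions and values are illustrative.)
    from pycompss.api.task import task
    from pycompss.api.api import compss_wait_on

    @task(returns=int)              # annotate the function as a task
    def increment(value):
        return value + 1

    @task(returns=int)
    def add(a, b):
        return a + b

    if __name__ == '__main__':
        # Each call is asynchronous; the runtime builds a task-dependence
        # graph from the data consumed/produced by the tasks.
        partials = [increment(i) for i in range(4)]
        total = partials[0]
        for p in partials[1:]:
            total = add(total, p)
        total = compss_wait_on(total)   # synchronize: wait for the final value
        print("Result:", total)         # 1 + 2 + 3 + 4 = 10

Such a script would typically be launched with the runcompss command on a cluster, or run interactively from the Jupyter notebook integration mentioned above.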

Table of contents

Roundtable – Presentation and background of participants

Session 1: Introduction to PyCOMPSs

Session 2: Local environment hands-on

Session 3: COMPSs runtime

Session 4: Execution in large scale platform hands-on

Conclusions

Brief Biography of the instructors 

Rosa M. Badia holds a PhD in Computer Science (1994) from the Technical University of Catalonia (UPC). She is the manager of the Workflows and Distributed Computing group at the Barcelona Supercomputing Center (BSC) and a Scientific Researcher of the Consejo Superior de Investigaciones Científicas (CSIC). She was involved in teaching and research activities at UPC from 1989 to 2008, where she was an Associate Professor. From 1999 to 2005 she was involved in research and development activities at the European Center of Parallelism of Barcelona (CEPBA). Her current research interests are programming models for complex platforms (from multicore and GPUs to Grid/Cloud). She has published more than 150 papers in international conferences and journals on the topics of her research. She has participated in a large number of European projects and is currently participating in the projects Euroserver, the Human Brain Project, EUBrazil CloudConnect, the BioExcel CoE, NEXTGenIO, MUG, EUBra-BIGSEA, TANGO, and mF2C, and she is a member of the HiPEAC2 NoE. She has delivered multiple tutorials at conferences and events about the programming models at BSC (OmpSs and PyCOMPSs/COMPSs).

Javier Conejero is a Senior Researcher at the Barcelona Supercomputing Center. He holds a Ph.D. degree in Computer Science from the University of Castilla-La Mancha. Javier worked at CERN for one year on WLCG software development and management. His current research interests are programming models, efficient exploitation of Cloud and HPC environments, and QoS. He has been the main developer of PyCOMPSs since his arrival at BSC in 2014. He is also contributing to the European projects NEXTGenIO and MUG.