Theory and Practice of Batch Processing of Big Data

Real-time data processing has drawn enormous attention lately; however, batch processing has always been an effective approach, and today the volume and variety of data are its major challenges. This course introduces the basic concepts of batch processing technologies and provides a strong theoretical foundation for batch-style data processing. It covers two widely used batch processing technologies: Apache Spark and Hadoop/MapReduce. The course includes practical work such as configuring Apache Spark on a standalone server or on a cluster, deploying jobs on a Spark cluster, and integrating Hadoop and Apache Mesos with Spark. It also provides hands-on knowledge of how to process jobs with MapReduce, including a detailed description of techniques for configuring MapReduce jobs and deploying and executing them on single-node and multi-node Hadoop clusters. During the course, a real-world Big Data batch processing scenario will be simulated.
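
As a taste of the practical work, here is a minimal sketch of the kind of batch job participants will deploy on a Spark cluster: a word count written against Spark's Java API. It assumes Spark 2.x; the master URL and HDFS paths are placeholders for your own cluster's values.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class WordCount {
        public static void main(String[] args) {
            // "spark://master:7077" and the HDFS paths below are placeholders;
            // point them at your own standalone master and data.
            SparkConf conf = new SparkConf()
                    .setAppName("BatchWordCount")
                    .setMaster("spark://master:7077");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");
                JavaPairRDD<String, Integer> counts = lines
                        // Split each line into words.
                        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                        // Pair each word with a count of one.
                        .mapToPair(word -> new Tuple2<>(word, 1))
                        // Sum the counts per word, in parallel across the cluster.
                        .reduceByKey(Integer::sum);
                counts.saveAsTextFile("hdfs:///data/output");
            }
        }
    }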

Course Objectives

  • Provide strong knowledge of the batch method for processing data.
  • Provide theoretical knowledge of processing massive-scale data in batch style.
  • Provide a solid understanding of the core components and architecture of Apache Spark and MapReduce.
  • Provide strong knowledge of fault tolerance in a distributed architecture.
  • Provide hands-on knowledge of how to configure and manage Apache Spark and Hadoop/MapReduce clusters (single-node and multi-node).
  • Provide practical knowledge of how to configure jobs, run them, and parallelize their execution in an Apache Spark cluster.
  • Provide hands-on knowledge of how to integrate other technologies with Spark, for instance MongoDB and Hadoop (a sketch of such an integration follows this list).
  • Provide hands-on knowledge of configuring and executing MapReduce jobs in a Hadoop cluster.
  • Present and describe the simulation of a real-world batch processing scenario.
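
To give a flavour of the integration objective above, the following is a minimal sketch of reading a MongoDB collection into Spark as a batch data source. It assumes the official MongoDB Spark Connector (mongo-spark-connector 2.x) is on the classpath and that the job is launched with spark-submit, which supplies the master URL; the connection URI, database, and collection names are placeholders.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.bson.Document;

    import com.mongodb.spark.MongoSpark;
    import com.mongodb.spark.rdd.api.java.JavaMongoRDD;

    public class MongoBatchJob {
        public static void main(String[] args) {
            // The URI names a placeholder database ("test") and
            // collection ("events"); replace them with your own.
            SparkConf conf = new SparkConf()
                    .setAppName("MongoBatchJob")
                    .set("spark.mongodb.input.uri",
                         "mongodb://localhost/test.events");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Load the collection as an RDD of BSON documents and run
                // an ordinary batch computation over it.
                JavaMongoRDD<Document> rdd = MongoSpark.load(sc);
                System.out.println("Documents processed: " + rdd.count());
            }
        }
    }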

What is in it for the Participants?

  • Learning the fundamental concepts of batch-style data processing.
  • Learning the theoretical aspects of batch processing of Big Data.
  • Learning the core components and architecture of Apache Spark and MapReduce.
  • Being able to implement, configure, and distribute jobs on a standalone Apache Spark server.
  • Being able to configure a multi-node Apache Spark cluster with Hadoop.
  • Being able to integrate storage technologies such as MongoDB and Neo4j with Apache Spark.
  • Being able to run jobs on multi-node clusters.
  • Being able to implement MapReduce jobs (written in Java) and distribute them on single-node as well as multi-node clusters (a minimal sketch follows this list).
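
The sketch referenced in the last item: the canonical word count implemented against Hadoop's MapReduce Java API. Once packaged as a JAR, the same code runs unchanged on a single-node or multi-node cluster (e.g. hadoop jar wordcount.jar MRWordCount /input /output); the input and output paths are supplied on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MRWordCount {

        // Mapper: emit (word, 1) for every token in the input line.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sum the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values,
                                  Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            // args[0] is the input path, args[1] the output path.
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(MRWordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }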