Batch-Style Processing of Big Data

Although real-time data processing has drawn enormous attention lately, batch-style processing has always been widely used in industry because enterprises very often process data in batches, for instance every 24 hours, every week, or every month. Today, the major challenges of batch processing are the volume and variety of data. This course covers the fundamental concepts of batch-style technologies and describes two widely used frameworks: Apache Spark and Hadoop/MapReduce. Practical work is included, such as configuring Apache Spark on a standalone server or on a cluster, deploying jobs on a Spark cluster, and integrating Hadoop and Apache Mesos with Spark. The course also provides hands-on experience in processing jobs with MapReduce, including how to configure MapReduce jobs and deploy and execute them on a single-node or multi-node Hadoop cluster. During the course, a real-world batch processing scenario will be simulated.
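
To give a flavour of the practical work, the following is a minimal sketch of a Spark batch job in Java, assuming Spark 3.x running against a local standalone master; the input and output paths (input.txt, output) are placeholders only.

    import java.util.Arrays;

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;
    import scala.Tuple2;

    public class BatchWordCount {
        public static void main(String[] args) {
            // Local standalone session; on a real cluster the master URL is
            // normally supplied through spark-submit rather than hard-coded.
            SparkSession spark = SparkSession.builder()
                    .appName("BatchWordCount")
                    .master("local[*]")
                    .getOrCreate();
            JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

            // Read the input file, split lines into words, and count occurrences.
            JavaRDD<String> lines = sc.textFile("input.txt");  // placeholder input path
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            counts.saveAsTextFile("output");                    // placeholder output directory
            spark.stop();
        }
    }

In the practical sessions such a job would typically be packaged as a jar and submitted with spark-submit, first on a standalone server and later on a cluster.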

Course Objectives

  • Provide strong knowledge of the basic concepts of batch processing technologies.
  • Provide a solid understanding of the core components and architecture of Apache Spark and MapReduce.
  • Provide strong knowledge of fault tolerance in distributed architectures.
  • Provide hands-on knowledge of how to configure and manage Apache Spark and Hadoop/MapReduce clusters (single-node and multi-node).
  • Provide practical knowledge of how to configure, run, and parallelize jobs in an Apache Spark cluster for batch data processing (see the sketch after this list).
  • Provide hands-on knowledge of how to integrate other technologies, such as MongoDB and Hadoop, with Apache Spark.
  • Provide hands-on knowledge of how to implement a solution for batch-style data processing.
  • Provide hands-on knowledge of configuring and executing MapReduce jobs in a Hadoop cluster.
  • Present and describe the simulation of a real-world batch processing scenario.
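
As an illustration of the cluster-related objectives above, the sketch below shows a Java Spark job pointed at a hypothetical standalone cluster master and an HDFS input path, with the degree of parallelism set explicitly; the master URL (spark://master-host:7077), the namenode address, and the file path are placeholders, and in practice the master is usually passed via spark-submit rather than hard-coded.

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;

    public class ClusterBatchJob {
        public static void main(String[] args) {
            // Hypothetical standalone-cluster master; normally supplied with
            // `spark-submit --master spark://master-host:7077 ...` instead.
            SparkSession spark = SparkSession.builder()
                    .appName("ClusterBatchJob")
                    .master("spark://master-host:7077")
                    .getOrCreate();
            JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

            // Read input from HDFS (Hadoop integration) with an explicit number of
            // partitions so the work is parallelized across the worker nodes.
            JavaRDD<String> records =
                    sc.textFile("hdfs://namenode:9000/data/events.log", 64);

            // A simple batch aggregation: count the records marked as errors.
            long errors = records.filter(line -> line.contains("ERROR")).count();
            System.out.println("Error records: " + errors);

            spark.stop();
        }
    }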

What is in it for the Participants?

  • Learning the core components and architecture of Apache Spark and MapReduce technologies.
  • Being able to implement Apache Spark jobs on a standalone Apache Spark server.
  • Being able to configure a multi-node Apache Spark cluster with Hadoop.
  • Being able to deploy Spark jobs on distributed Apache Spark clusters.
  • Being able to integrate storage technologies, including MongoDB and Neo4j, with Apache Spark.
  • Being able to deploy and run Spark jobs on multi-node clusters.
  • Being able to implement MapReduce jobs (written in Java) and deploy them on single-node as well as multi-node clusters (see the sketch below).
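
As a concrete reference for the last point, below is a minimal MapReduce word-count job in Java against the standard Hadoop MapReduce API; the class names are illustrative and the input and output HDFS paths are taken from the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emit (word, 1) for every token in the input line.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sum the counts emitted for each word.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input path
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output path
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Such a job is typically packaged as a jar and launched on the cluster with hadoop jar, e.g. hadoop jar wordcount.jar WordCount <input> <output>.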