Data Ingestion Concepts & Technologies

Today, enterprises tend to use data from blogs, web portals, microblogs (e.g., Twitter, Tumblr), and social media platforms (e.g., Facebook, Myspace). Collecting data from these heterogeneous sources is highly challenging because the data often stream at high speed, their volumes can be huge, and they come in many varieties. This course covers the concepts, methods, and technologies for acquiring massive-scale, diverse data that traverses data centers or the Web at high speed. The course delivers hands-on knowledge of how to acquire these data in real time from different sources, including social media, sensors, data dumps, and web portals. It also focuses on ingesting data into different operational platforms and storage systems. During the course, a real-world data acquisition and ingestion scenario will be simulated.

Course Objectives

  • Provide firm knowledge of the concepts of data acquisition and ingestion.
  • Provide a solid understanding of data sources, especially how the sources organize data.
  • Explain different ways of discovering relevant sources.
  • Describe various mechanisms of accessing data sources.
  • Explain different data acquisition methods, including real-time and batch approaches.
  • Provide in-depth knowledge of both batch and real-time data acquisition technologies.
  • Provide practical knowledge of data acquisition technologies, namely Apache Flume and Apache Kafka (a brief example follows this list).
  • Present and describe a simulation of a real-world data acquisition and ingestion scenario.
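
A flavour of the hands-on material is sketched below: a minimal real-time producer that pushes simulated sensor events to Apache Kafka using the kafka-python client. The broker address, topic name, and the choice of the kafka-python library are assumptions for illustration only, not prescribed course material.

    # Minimal sketch of real-time data acquisition with Apache Kafka (kafka-python client).
    # Broker address and topic name are illustrative assumptions.
    import json
    import time

    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",                       # assumed local broker
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Simulate a stream of sensor readings being pushed to a Kafka topic.
    for i in range(10):
        event = {"sensor_id": "s-001", "reading": 20.0 + i, "ts": time.time()}
        producer.send("sensor-readings", value=event)             # hypothetical topic name

    producer.flush()  # ensure buffered events reach the broker before exiting
    producer.close()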

What is in it for the Participants?

  • Gaining strong knowledge of data acquisition concepts and methods.
  • Understanding the core components and architectures of data acquisition technologies.
  • Learning how to gather and extract data from widely used data formats.
  • Being able to design and configure centralized and distributed data collectors for obtaining streaming data in real time.
  • Being able to develop a solution for collecting different types of data such as text, hashtags, audio, video, images, relational SQL data, and geospatial data.
  • Being able to design and implement centralized and distributed data collectors for obtaining data in batches.
  • Being able to design and implement a web crawler for obtaining data from web portals (see the crawler sketch after this list).
  • Being able to develop a solution to ingest streaming and batch data into an operational platform or storage (see the ingestion sketch following the crawler below).
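
For the web-crawling outcome above, the sketch below shows a minimal breadth-first crawler built only on the Python standard library. The seed URL, page limit, and politeness delay are illustrative assumptions; a production crawler would also need robots.txt handling, content deduplication, and more robust error recovery.

    # Minimal breadth-first web crawler sketch (standard library only).
    # Seed URL, page limit, and delay are illustrative assumptions.
    import time
    import urllib.request
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    class LinkExtractor(HTMLParser):
        """Collect href values from anchor tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_url, max_pages=10, delay=1.0):
        seen, queue, pages = {seed_url}, deque([seed_url]), {}
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except Exception:
                continue                                   # skip unreachable pages
            pages[url] = html
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                # Stay on the same host and avoid revisiting pages.
                if urlparse(absolute).netloc == urlparse(seed_url).netloc and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
            time.sleep(delay)                              # simple politeness delay
        return pages

    if __name__ == "__main__":
        collected = crawl("https://example.com")           # assumed seed URL
        print(f"Fetched {len(collected)} pages")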
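
For the ingestion outcome, a complementary sketch drains the same assumed Kafka topic and appends each event to local storage; a newline-delimited JSON file stands in for an operational platform or distributed store here, purely as an assumption for illustration.

    # Minimal sketch of ingesting a Kafka stream into storage (kafka-python client).
    # Topic, broker address, and the output file standing in for "storage" are assumptions.
    import json

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "sensor-readings",                    # hypothetical topic from the producer sketch
        bootstrap_servers="localhost:9092",   # assumed local broker
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        consumer_timeout_ms=5000,             # stop iterating when no new messages arrive
    )

    with open("ingested_events.jsonl", "a", encoding="utf-8") as sink:
        for message in consumer:
            # Each record is appended as one JSON line; a real pipeline would batch
            # writes and target HDFS, a database, or object storage instead.
            sink.write(json.dumps(message.value) + "\n")

    consumer.close()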