Data Ingestion Concepts & Technologies
Today, enterprises increasingly use data from blogs, web portals, microblogs (e.g., Twitter, Tumblr), and social media platforms (e.g., Facebook, Myspace). Collecting data from these heterogeneous sources is highly challenging: the data often stream at high speed, their volumes can be huge, and they come in a wide variety of formats. This course covers the concepts, methods, and technologies for acquiring massive-scale, diverse data that moves through data centers or across the Web at high speed. It delivers hands-on knowledge of how to acquire such data in real time from different sources, including social media, sensors, data dumps, and web portals. The course also covers ingesting data into different operational platforms and storage systems. During the course, a real-world data acquisition and ingestion scenario will be simulated.
Course Objectives
- Provide firm knowledge of the concepts of data acquisition and ingestion.
- Provide a solid understanding of data sources, especially how the sources organize data.
- Explain different ways of discovering relevant sources.
- Describe various mechanisms of accessing data sources.
- Explain different data acquisition methods, including real-time and batch methods.
- Provide in-depth knowledge of both batch and real-time data acquisition technologies.
- Provide practical knowledge of data acquisition technologies: Apache Flume and Apache Kafka.
- Present and describe a simulation of a real-world data acquisition and ingestion scenario.
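As a taste of the hands-on material, the sketch below shows how a single-node Apache Flume agent might be configured with a properties file: a source listening on a network port, a memory channel buffering events, and a sink that logs them. The agent and component names (a1, r1, c1, k1) and the port are illustrative choices, not fixed by the course.

```properties
# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for lines of text on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: write events to the agent's log (useful for testing)
a1.sinks.k1.type = logger

# Wire source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

In a production pipeline the logger sink would typically be replaced by an HDFS or Kafka sink, turning the same agent layout into a real ingestion path.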
What is in it for the Participants?
- Gaining strong knowledge of data acquisition concepts and methods.
- Understanding the core components and architectures of data acquisition technologies.
- Learning how to gather and extract data from widely used data formats.
- Being able to design and configure centralized and distributed data collectors for obtaining streaming data in real time.
- Being able to develop a solution for collecting different types of data, such as text, hashtags, audio, video, images, relational (SQL) data, and geospatial data.
- Being able to design and implement centralized and distributed data collectors for obtaining data in batches.
- Being able to design and implement a web crawler for obtaining data from web portals.
- Being able to develop a solution for ingesting streaming and batch data into operational platforms or storage.
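To illustrate the web-crawler skill above, here is a minimal link-extraction sketch using only the Python standard library. The names `LinkExtractor` and `extract_links` are illustrative, not part of any course material; a real crawler would add page fetching, URL deduplication, politeness delays, and robots.txt handling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute link URLs from an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

# A crawler frontier would fetch each page (e.g., with urllib.request),
# extract its links, and enqueue any not-yet-visited URLs.
page = '<html><body><a href="/about">About</a> <a href="https://example.org/x">X</a></body></html>'
print(extract_links(page, "https://example.com/"))
# → ['https://example.com/about', 'https://example.org/x']
```

The same pattern, with the parser swapped for a format-specific reader, applies to harvesting links or records from feeds and data dumps as well.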