Developing Big Data Applications with Hadoop, MapReduce and Spark

Overview:

This five-day hands-on training course delivers the key concepts and expertise developers need to build high-performance parallel applications with Hadoop, MapReduce, and Apache Spark. Participants will learn how to use Spark SQL to query structured data and Spark Streaming to perform real-time processing on streaming data from a variety of sources. Developers will also practice writing applications that use core Spark to perform ETL processing and iterative algorithms. The course covers how to work with large datasets stored in a distributed file system and how to execute Spark applications on a Hadoop cluster. After taking this course, participants will be prepared to face real-world challenges and build applications that enable faster, better decisions and interactive analysis, applied to a wide variety of use cases, architectures, and industries.

Duration: 

5 days

Intended Audience:

-       Students from IT, software, or data-warehouse backgrounds who want to move into the Big Data analytics domain

Course Outline

1.      Introduction To Big Data

  • Introduction and Relevance
  • Uses of Big Data analytics in industries such as Oil & Gas, Telecom, E-commerce, Finance, and Insurance
  • Problems with Traditional Large-Scale Systems

2.      Hadoop (Big Data) Eco-System

  • Motivation for Hadoop
  • Different types of projects by Apache
  • Role of projects in the Hadoop Ecosystem
  • Key technology foundations required for Big Data
  • Limitations and Solutions of existing Data Analytics Architecture
  • Comparison of traditional data management systems with Big Data management systems
  • Evaluate key framework requirements for Big Data analytics
  • Hadoop Ecosystem & Hadoop 2.x core components
  • Explain the relevance of real-time data
  • Explain how to use Big Data and real-time data as a Business planning tool

3.      Hadoop Cluster – Architecture – Configuration Files

  • Hadoop Master-Slave Architecture
  • The Hadoop Distributed File System - Concept of data storage
  • Explain different types of cluster setups (fully distributed, pseudo-distributed, etc.)
  • Hadoop cluster set up - Installation
  • Hadoop 2.x Cluster Architecture
  • A Typical enterprise cluster – Hadoop Cluster Modes
  • Understanding cluster management tools like Cloudera Manager and Apache Ambari

4.      Hadoop – HDFS & MapReduce (YARN)

  • HDFS Overview & Data storage in HDFS
  • Getting data into Hadoop from the local machine and back (data loading techniques)
  • MapReduce Overview (the traditional way vs. the MapReduce way)
  • Concept of Mapper & Reducer
  • Understanding the MapReduce program framework
  • Developing a MapReduce program in Java (basic)
  • Developing a MapReduce program with the Streaming API (basic)
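The mapper/reducer contract covered in this module can be sketched without a cluster. The following is an illustrative plain-Python simulation of the map, shuffle/sort, and reduce phases of a word count (the canonical Streaming-API example); it is not a Hadoop program, just the logic:

```python
# Illustrative only: the MapReduce word-count logic in plain Python.
# mapper() emits (key, value) pairs; the sort/groupby step stands in for
# Hadoop's shuffle; reducer() aggregates all values for one key.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit a (word, 1) pair for every word, as a Streaming-API mapper would.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Sum the counts for one key, as a reducer sees them after the shuffle.
    return (word, sum(counts))

def word_count(lines):
    # Map phase
    pairs = [kv for line in lines for kv in mapper(line)]
    # Shuffle/sort phase: bring identical keys together
    pairs.sort(key=itemgetter(0))
    # Reduce phase
    return dict(reducer(k, (c for _, c in grp))
                for k, grp in groupby(pairs, key=itemgetter(0)))

print(word_count(["big data big ideas", "data pipelines"]))
# → {'big': 2, 'data': 2, 'ideas': 1, 'pipelines': 1}
```

In a real Hadoop Streaming job, the mapper and reducer would be separate scripts reading from stdin and writing to stdout, with the framework performing the shuffle between them.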

5.      Data Integration Using Sqoop & Flume

  • Integrating Hadoop into an Existing Enterprise
  • Loading Data from an RDBMS into HDFS by Using Sqoop
  • Managing Real-Time Data Using Flume
  • Accessing HDFS from Legacy Systems

6.      Data Analysis Using Pig

  • Introduction to Data Analysis Tools
  • Apache Pig – MapReduce vs. Pig, Pig use cases
  • Pig's data model
  • Pig streaming
  • Pig Latin programs and execution
  • Pig Latin: relational operators, file loaders, GROUP operator, COGROUP operator, joins and COGROUP, UNION, diagnostic operators, Pig UDFs
  • Writing Java UDFs
  • Embedding Pig in Java
  • Pig macros
  • Parameter substitution
  • Using Pig to automate the design and implementation of MapReduce applications
  • Using Pig to apply structure to unstructured Big Data

7.      Data Analysis Using Hive

  • Apache Hive – Hive vs. Pig, Hive use cases
  • Discuss the Hive data storage principle
  • Explain the file formats and record formats supported by the Hive environment
  • Perform operations with data in Hive
  • Hive QL: Joining Tables, Dynamic Partitioning, Custom Map/Reduce Scripts
  • Hive Script, Hive UDF
  • Hive Persistence formats
  • Loading data in Hive - Methods
  • Serialization & Deserialization
  • Handling Text data using Hive
  • Integrating external BI tools with Hadoop Hive

8.      Data Analysis Using Impala

  • Impala & Architecture
  • How Impala executes Queries and its importance
  • Hive vs. Pig vs. Impala
  • Extending Impala with user-defined functions

9.      Introduction To Other Ecosystem Tools

  • NoSQL database – HBase
  • Introduction to Oozie

10.  Spark: Introduction

  • Introduction to Apache Spark
  • Streaming Data Vs. In Memory Data
  • MapReduce vs. Spark
  • Modes of Spark
  • Spark Installation Demo
  • Overview of Spark on a cluster
  • Spark Standalone Cluster

11.  Spark: Spark In Practice

  • Invoking Spark Shell
  • Creating the Spark Context
  • Loading a File in Shell
  • Performing Some Basic Operations on Files in Spark Shell
  • Caching Overview
  • Distributed Persistence
  • Spark Streaming Overview (example: streaming word count)
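The basic shell operations in this module follow Spark's transformation/action pattern: transformations such as map and filter are lazy, and nothing executes until an action pulls results through. Since a Spark shell may not be at hand, here is an illustrative plain-Python analogy using generators (the function names mirror RDD operations but are local stand-ins, not the PySpark API):

```python
# Illustrative only: lazy transformations vs. eager actions, mimicked with
# Python generators. Like an RDD, each "transformation" builds a pipeline;
# only the "action" (collect) forces computation.
def rdd_map(data, f):
    return (f(x) for x in data)          # lazy, like rdd.map(f)

def rdd_filter(data, pred):
    return (x for x in data if pred(x))  # lazy, like rdd.filter(pred)

def collect(data):
    return list(data)                    # action: forces evaluation

lines = ["spark makes clusters simple", "hadoop stores the data"]
words = rdd_map(lines, str.split)              # nothing computed yet
flat = (w for ws in words for w in ws)         # like flatMap
long_words = rdd_filter(flat, lambda w: len(w) > 5)
print(collect(long_words))
# → ['clusters', 'simple', 'hadoop', 'stores']
```

Caching, also covered above, exists precisely because of this laziness: a cached RDD keeps its computed partitions in memory so that repeated actions do not re-run the whole pipeline.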

12.  Spark: Spark Meets Hive

  • Analyze Hive and Spark SQL Architecture
  • Analyze Spark SQL
  • Context in Spark SQL
  • Implement a sample example for Spark SQL
  • Integrating Hive and Spark SQL
  • Support for JSON and Parquet file formats
  • Implementing data visualization in Spark
  • Loading of Data
  • Hive Queries through Spark
  • Performance Tuning Tips in Spark
  • Shared Variables: Broadcast Variables & Accumulators
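The two shared-variable types listed above can be sketched locally. In Spark, a broadcast variable is read-only data shipped once to every worker, and an accumulator is a counter workers add to and only the driver reads. The following illustrative plain-Python stand-ins (not the PySpark `sc.broadcast` / `sc.accumulator` objects themselves) show the roles each one plays:

```python
# Illustrative only: the roles of Spark's shared variables, mimicked locally.
lookup = {"ERROR": 3, "WARN": 2, "INFO": 1}    # would be sc.broadcast(lookup)

class Accumulator:                              # would be sc.accumulator(0)
    def __init__(self):
        self.value = 0
    def add(self, n):
        self.value += n

errors = Accumulator()

def score(line):
    # Each "task" reads the broadcast lookup and adds to the accumulator.
    level = line.split(":", 1)[0]
    if level == "ERROR":
        errors.add(1)                           # side-channel count
    return lookup.get(level, 0)                 # read-only lookup

log = ["INFO: start", "ERROR: disk", "WARN: slow", "ERROR: net"]
scores = [score(line) for line in log]
print(scores, errors.value)
# → [1, 3, 2, 3] 2
```

The design point this illustrates: the lookup table travels to the work, not the other way around, and the error count flows back to the driver without the tasks ever reading it.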

13.  Spark Streaming

  • Extracting and analyzing data from Twitter using Spark Streaming
  • Comparison of Spark and Storm – overview

14.  Spark Graphx

  • Overview of the GraphX module in Spark
  • Creating graphs with GraphX

15.  Introduction To Machine Learning Using Spark

  • Understanding the machine learning framework
  • Implementing some ML algorithms with Spark MLlib
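To make the module concrete, here is an illustrative plain-Python sketch of one of the simplest ML algorithms, linear regression trained by gradient descent. Spark MLlib distributes exactly this kind of iterative computation across a cluster; this local version just shows the algorithm itself:

```python
# Illustrative only: linear regression (y = w*x + b) fitted by gradient
# descent on mean squared error, in plain Python.
def fit_line(xs, ys, lr=0.01, steps=5000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of mean squared error with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Data generated from y = 2x + 1, so the fit should converge near w=2, b=1
w, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(round(w, 2), round(b, 2))
```

Iterative algorithms like this are where Spark shines over classic MapReduce: the working data stays cached in memory across iterations instead of being re-read from disk each pass.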

16.  Project

  • Consolidating all the learning from the course
  • Working on a Big Data project that integrates the key components covered
Delivery:

  • Online
  • At Ho Chi Minh City
  • At Ha Noi

