Developing Big Data Applications with Hadoop, MapReduce, and Spark

Overview:

This five-day hands-on training course covers the key concepts and skills developers need to build high-performance parallel applications with Hadoop, MapReduce, and Apache Spark. Participants will learn how to use Spark SQL to query structured data and Spark Streaming to process streaming data in real time from a variety of sources. Developers will also practice writing applications that use core Spark for ETL processing and iterative algorithms. The course covers how to work with large datasets stored in a distributed file system and how to execute Spark applications on a Hadoop cluster. After taking this course, participants will be prepared to face real-world challenges and build applications that support faster decisions, better decisions, and interactive analysis across a wide variety of use cases, architectures, and industries.

Duration:

5 days

Intended Audience:

- Students with an IT, software, or data warehouse background who want to move into the Big Data analytics domain

Course Outline

1. Introduction To Big Data

Introduction and Relevance
Uses of Big Data analytics in industries such as Oil & Gas, Telecom, E-commerce, Finance, and Insurance
Problems with Traditional Large-Scale Systems

2. Hadoop (Big Data) Eco-System

Motivation for Hadoop
Different types of projects by Apache
Role of projects in the Hadoop Ecosystem
Key technology foundations required for Big Data
Limitations of existing data analytics architectures and their solutions
Comparison of traditional data management systems with Big Data management systems
Evaluate key framework requirements for Big Data analytics
Hadoop Ecosystem & Hadoop 2.x core components
Explain the relevance of real-time data
Explain how to use Big Data and real-time data as a Business planning tool

3. Hadoop Cluster – Architecture – Configuration Files

Hadoop Master-Slave Architecture
The Hadoop Distributed File System - Concept of data storage
Explain different types of cluster setups (fully distributed, pseudo-distributed, etc.)
Hadoop cluster setup and installation
Hadoop 2.x Cluster Architecture
A Typical enterprise cluster – Hadoop Cluster Modes
Understanding cluster management tools such as Cloudera Manager and Apache Ambari

4. Hadoop – HDFS & MapReduce (YARN)

HDFS Overview & Data storage in HDFS
Moving data from the local machine into Hadoop and back again (data loading techniques)
MapReduce Overview (traditional approach vs. MapReduce approach)
Concept of Mapper & Reducer
Understanding MapReduce program Framework
Develop a MapReduce program using Java (Basic)
Develop a MapReduce program with the Streaming API (Basic)
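
As an illustration of the Mapper and Reducer concepts listed in this module, the following is a minimal word-count sketch in Python, not the course's own lab code. It assumes the line-oriented, tab-separated record convention commonly used with the Streaming API. The names mapper, reducer, and run_local are hypothetical, and run_local only imitates the framework's sort step on a single machine.

```python
from itertools import groupby


def mapper(lines):
    """Map phase: emit one tab-separated (word, 1) record per word."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Reduce phase: sum the counts of each run of identical keys.

    Relies on the records arriving sorted by key, which the framework's
    shuffle-and-sort step provides between the two phases.
    """
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


def run_local(lines):
    """Hypothetical local driver: map, sort (stand-in for shuffle), reduce."""
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = ["big data big cluster", "data node"]
    print(run_local(sample))
```

On a real cluster the two functions would typically live in separate scripts that read standard input and write standard output, with Hadoop performing the sort between them.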

5. Data Integration Using Sqoop & Flume

Integrating Hadoop into an Existing Enterprise
Loading Data from an RDBMS into HDFS by Using Sqoop
Managing Real-Time Data Using Flume
Accessing HDFS from Legacy Systems

6. Data Analysis Using Pig

Introduction to Data Analysis Tools
Apache Pig - MapReduce vs. Pig, Pig Use Cases
Pig's Data Model
Pig Streaming
Pig Latin Program & Execution
Pig Latin: Relational Operators, File Loaders, GROUP Operator, COGROUP Operator, Joins, Union, Diagnostic Operators, Pig UDFs
Writing Java UDFs
Embedding Pig in Java
Pig Macros
Parameter Substitution
Use Pig to automate the design and implementation of MapReduce applications
Use Pig to apply structure to unstructured Big Data

7. Data Analysis Using Hive

Apache Hive - Hive vs. Pig - Hive Use Cases
Discuss the Hive data storage principle
Explain the file formats and record formats supported by the Hive environment
Perform operations with data in Hive
HiveQL: Joining Tables, Dynamic Partitioning, Custom Map/Reduce Scripts
Hive Script, Hive UDF
Hive Persistence formats
Loading data in Hive - Methods
Serialization & Deserialization
Handling Text data using Hive
Integrating external BI tools with Hadoop Hive

8. Data Analysis Using Impala

Impala and Its Architecture
How Impala executes queries and why this matters
Hive vs. Pig vs. Impala
Extending Impala with User-Defined Functions

9. Introduction To Other Ecosystem Tools

NoSQL database - HBase
Introduction to Oozie

10. Spark: Introduction

Introduction to Apache Spark
Streaming Data vs. In-Memory Data
MapReduce vs. Spark
Modes of Spark
Spark Installation Demo
Overview of Spark on a cluster
Spark Standalone Cluster

11. Spark: Spark In Practice

Invoking Spark Shell
Creating the Spark Context
Loading a File in Shell
Performing Some Basic Operations on Files in Spark Shell
Caching Overview
Distributed Persistence
Spark Streaming Overview (Example: Streaming Word Count)

12. Spark: Spark Meets Hive

Analyze Hive and Spark SQL Architecture
Analyze Spark SQL
Context in Spark SQL
Implement a sample example for Spark SQL
Integrating Hive and Spark SQL
Support for JSON and Parquet File Formats
Implement Data Visualization in Spark
Loading of Data
Hive Queries through Spark
Performance Tuning Tips in Spark
Shared Variables: Broadcast Variables & Accumulators

13. Spark Streaming

Extract and analyze data from Twitter using Spark Streaming
Comparison of Spark and Storm – Overview

14. Spark GraphX

Overview of the GraphX module in Spark
Creating graphs with GraphX

15. Introduction To Machine Learning Using Spark

Understand the machine learning framework
Implement some of the ML algorithms using Spark MLlib

16. Project

Consolidate what has been learned throughout the course
Work on a Big Data project that integrates the key components
