Advanced Big Data (Comprehensive Course)


This advanced Big Data training gives attendees practical skills not only in Hadoop in detail but also in advanced analytics concepts using Python, Hadoop, and Spark. For extensive hands-on practice, candidates get access to a virtual lab and several assignments and projects. At the end of the program, candidates who successfully complete the projects provided as part of the training are awarded the Advance Data Science Certification.

This industry-relevant Big Data Analytics training blends analytics and technology, making it well suited to aspirants who want to develop Big Data skills and get a head start in Big Data Analytics.


Duration: 10 days

Intended Audience:

-       Students from IT, software, or data warehouse backgrounds who want to move into the Big Data Analytics domain

Course Outline

 Course 1 (4 days)

Data Science using Python 

1.      Introduction To Data Science

  • What is Data Science?
  • Why Python for data science?
  • Relevance in industry and need of the hour
  • How are leading companies harnessing the power of Data Science with Python?
  • Different phases of a typical Analytics/Data Science project and the role of Python
  • Anaconda vs. Python

2.      Python Essentials (Core)

  • Overview of Python- Starting with Python
  • Introduction to installation of Python
  • Introduction to Python Editors & IDEs (Canopy, PyCharm, Jupyter, Rodeo, IPython, etc.)
  • Understand Jupyter notebook & Customize Settings
  • Concept of Packages/Libraries - Important packages (NumPy, SciPy, scikit-learn, pandas, Matplotlib, etc.)
  • Installing & loading Packages & Namespaces
  • Data Types & Data objects/structures (strings, Tuples, Lists, Dictionaries)
  • List and Dictionary Comprehensions
  • Variable & Value Labels –  Date & Time Values
  • Basic Operations - Mathematical - string - date
  • Reading and writing data
  • Simple plotting
  • Control flow & conditional statements
  • Debugging & Code profiling
  • How to create classes and modules and how to call them
  • Scientific libraries used in Python for Data Science - NumPy, SciPy, pandas, scikit-learn, statsmodels, NLTK, etc.
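
As an illustration of the list and dictionary comprehension topic in this module, here is a minimal sketch. The sample list of tool names is invented for the example and is not part of the course material.

```python
# Hypothetical sample data, chosen only to illustrate the syntax.
words = ["hadoop", "spark", "hive", "pig"]

# List comprehension: upper-case the names longer than four characters.
long_words = [w.upper() for w in words if len(w) > 4]

# Dictionary comprehension: map each name to its length.
lengths = {w: len(w) for w in words}

print(long_words)
print(lengths)
```

Both forms build a new collection in a single expression, which is why they often replace short loops in data preparation code.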

3.      Accessing/Importing And Exporting Data Using Python Modules

  • Importing Data from various sources (CSV, TXT, Excel, Access, etc.)
  • Database Input (Connecting to database)
  • Viewing Data objects - subsetting, methods
  • Exporting Data to various formats
  • Important Python modules: pandas, Beautiful Soup

4.      Data Manipulation – Cleansing – Munging Using Python Modules

  • Cleansing Data with Python
  • Data Manipulation steps (Sorting, filtering, duplicates, merging, appending, subsetting, derived variables, sampling, Data type conversions, renaming, formatting, etc.)
  • Data manipulation tools (Operators, Functions, Packages, control structures, Loops, arrays, etc.)
  • Python Built-in Functions (Text, numeric, date, utility functions)
  • Python User Defined Functions
  • Stripping out extraneous information
  • Normalizing data
  • Formatting data
  • Important Python modules for data manipulation (Pandas, Numpy, re, math, string, datetime etc)

5.      Data Analysis – Visualization Using Python

  • Introduction to exploratory data analysis
  • Descriptive statistics, Frequency Tables and summarization
  • Univariate Analysis (Distribution of data & Graphical Analysis)
  • Bivariate Analysis (Cross Tabs, Distributions & Relationships, Graphical Analysis)
  • Creating Graphs (Bar/pie/line chart/histogram/boxplot/scatter/density, etc.)
  • Important Packages for Exploratory Analysis (NumPy Arrays, Matplotlib, seaborn, pandas, scipy.stats, etc.)

6.      Basic Statistics & Implementation Of Stats Methods In Python

  • Basic Statistics - Measures of Central Tendencies and Variance
  • Building blocks - Probability Distributions - Normal distribution - Central Limit Theorem
  • Inferential Statistics - Sampling - Concept of Hypothesis Testing
  • Statistical Methods - Z/t-tests (One sample, independent, paired), ANOVA, Correlation and Chi-square
  • Important modules for statistical methods: NumPy, SciPy, pandas
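
To make the one-sample t-test topic concrete, the sketch below computes the t statistic by hand. It deliberately uses only the Python standard library so the formula stays visible; the sample values and the helper name `one_sample_t` are assumptions made for this example. In practice `scipy.stats.ttest_1samp` would be the usual choice.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return the t statistic for H0: population mean equals mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    standard_error = sd / math.sqrt(n)
    return (mean - mu0) / standard_error


# Hypothetical measurements, invented for illustration.
sample = [5.1, 4.9, 5.3, 5.0, 5.2]
t_stat = one_sample_t(sample, mu0=5.0)
print(round(t_stat, 3))
```

The statistic is then compared with a t distribution with n - 1 degrees of freedom to obtain a p-value.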

7.      Python: Machine Learning – Predictive Modelling – Basics

  • Introduction to Machine Learning & Predictive Modeling
  • Types of Business problems - Mapping of Techniques - Regression vs. classification vs. segmentation vs. Forecasting
  • Major Classes of Learning Algorithms - Supervised vs. Unsupervised Learning
  • Different Phases of Predictive Modeling (Data Pre-processing, Sampling, Model Building, Validation)
  • Overfitting (Bias-Variance Trade-off) & Performance Metrics
  • Feature engineering & dimension reduction
  • Concept of optimization & cost function
  • Concept of gradient descent algorithm
  • Concept of Cross-validation (Bootstrapping, K-Fold validation, etc.)
  • Model performance metrics (R-square, RMSE, MAPE, AUC, ROC curve, recall, precision, sensitivity, specificity, confusion matrix)
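
The cost function and gradient descent topics in this module can be sketched as follows, assuming a simple linear model fitted by minimizing mean squared error. The function name, learning rate, step count, and data are illustrative choices, not values prescribed by the course.

```python
def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current fit.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE cost with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))
```

A learning rate that is too large makes the updates diverge, while one that is too small needs many more steps; this trade-off is the main tuning decision in the method.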

8.      Machine Learning Algorithms & Applications – Implementation In Python

  • Linear & Logistic Regression
  • Segmentation - Cluster Analysis (K-Means)
  • Decision Trees (CART/C5.0)
  • Ensemble Learning (Random Forest, Bagging & boosting)
  • Artificial Neural Networks(ANN)
  • Support Vector Machines(SVM)
  • Other Techniques (KNN, Naïve Bayes, PCA)
  • Introduction to Text Mining using NLTK
  • Introduction to Time Series Forecasting (Decomposition & ARIMA)
  • Important Python modules for Machine Learning (scikit-learn, statsmodels, SciPy, NLTK, etc.)
  • Fine-tuning the models using hyperparameters, grid search, pipelines, etc.

9.      Project – Consolidate Learnings

  • Applying different algorithms to solve business problems and benchmarking the results


Course 2

Big Data Analytics using Hadoop & Spark (6 days) 

1.      Introduction To Big Data

  • Introduction and Relevance
  • Uses of Big Data analytics in various industries such as Oil & Gas, Telecom, E-commerce, Finance, and Insurance
  • Problems with Traditional Large-Scale Systems

2.      Analytics value proposition for Oil & Gas

  • Introduction to Oil & Gas industry
  • Stages of oil & gas sector: Exploration – Finding Oil & Gas
  • Role of Analytics in Oil & Gas Industry
  • E&P Analytics

         - Reservoir characterization

         - Drilling optimization

         - Unconventional completions

         - Production forecasting

         - Well and well portfolio value optimization

         - Seismic analyses

  • Assets & Operations

         - Process regularity and facility integrity

         - Demand forecasting

         - Integrated operations and logistics

         - Operational risk/environment, health and safety (EH & S)

  • Energy Risk Management

         - Energy trading and risk management

         - Credit risk management

         - A consolidated view of your entire portfolio

         - Cash flow management

3.      Hadoop (Big Data) Eco-System

  • Motivation for Hadoop
  • Different types of projects by Apache
  • Role of projects in the Hadoop Ecosystem
  • Key technology foundations required for Big Data
  • Limitations and Solutions of existing Data Analytics Architecture
  • Comparison of traditional data management systems with Big Data management systems
  • Evaluate key framework requirements for Big Data analytics
  • Hadoop Ecosystem & Hadoop 2.x core components
  • Explain the relevance of real-time data
  • Explain how to use Big Data and real-time data as a Business planning tool

4.      Hadoop Cluster – Architecture – Configuration Files

  • Hadoop Master-Slave Architecture
  • The Hadoop Distributed File System - Concept of data storage
  • Explain different types of cluster setups (Fully distributed, Pseudo-distributed, etc.)
  • Hadoop cluster set up - Installation
  • Hadoop 2.x Cluster Architecture
  • A Typical enterprise cluster – Hadoop Cluster Modes
  • Understanding cluster management tools like Cloudera Manager/Apache Ambari

5.      Hadoop – HDFS & MapReduce (YARN)

  • HDFS Overview & Data storage in HDFS
  • Get data into Hadoop from a local machine and vice versa (Data Loading Techniques)
  • MapReduce Overview (Traditional way vs. MapReduce way)
  • Concept of Mapper & Reducer
  • Understanding MapReduce program Framework
  • Develop MapReduce Program using Java (Basic)
  • Develop MapReduce program with streaming API (Basic)
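
The mapper and reducer concept can be illustrated with a word count that runs in plain Python. This is a local simulation of the map, shuffle, and reduce phases under simplifying assumptions, not a Hadoop job; the function names and input lines are invented for the example.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


# Hypothetical input, standing in for lines of a file stored in HDFS.
lines = ["big data with hadoop", "big data with spark"]
mapped = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

On a real cluster the mapper and reducer run as separate tasks on different nodes, and the framework performs the grouping step across the network.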

6.      Data Integration Using Sqoop & Flume

  • Integrating Hadoop into an Existing Enterprise
  • Loading Data from an RDBMS into HDFS by Using Sqoop
  • Managing Real-Time Data Using Flume
  • Accessing HDFS from Legacy Systems

7.      Data Analysis Using Pig

  • Introduction to Data Analysis Tools
  • Apache Pig - MapReduce vs. Pig, Pig Use Cases
  • Pig's Data Model
  • Pig Streaming
  • Pig Latin Program & Execution
  • Pig Latin: Relational Operators, File Loaders, Group Operator, COGROUP Operator, Joins and COGROUP, Union, Diagnostic Operators, Pig UDF
  • Writing Java UDFs
  • Embedded Pig in Java
  • Pig Macros
  • Parameter Substitution
  • Use Pig to automate the design and implementation of MapReduce applications
  • Use Pig to apply structure to unstructured Big Data

8.      Data Analysis Using Hive

  • Apache Hive - Hive vs. Pig - Hive Use Cases
  • Discuss the Hive data storage principle
  • Explain the File formats and Records formats supported by the Hive environment
  • Perform operations with data in Hive
  • HiveQL: Joining Tables, Dynamic Partitioning, Custom Map/Reduce Scripts
  • Hive Script, Hive UDF
  • Hive Persistence formats
  • Loading data in Hive - Methods
  • Serialization & Deserialization
  • Handling Text data using Hive
  • Integrating external BI tools with Hadoop Hive

9.      Data Analysis Using Impala

  • Impala & Architecture
  • How Impala executes Queries and its importance
  • Hive vs. Pig vs. Impala
  • Extending Impala with User Defined functions

10.  Introduction To Other Ecosystem Tools

  • NoSQL database - HBase
  • Introduction to Oozie

11.  Spark: Introduction

  • Introduction to Apache Spark
  • Streaming Data vs. In-Memory Data
  • MapReduce vs. Spark
  • Modes of Spark
  • Spark Installation Demo
  • Overview of Spark on a cluster
  • Spark Standalone Cluster

12.  Spark: Spark In Practice

  • Invoking Spark Shell
  • Creating the Spark Context
  • Loading a File in Shell
  • Performing Some Basic Operations on Files in Spark Shell
  • Caching Overview
  • Distributed Persistence
  • Spark Streaming Overview (Example: Streaming Word Count)

13.  Spark: Spark Meets Hive

  • Analyze Hive and Spark SQL Architecture
  • Analyze Spark SQL
  • Context in Spark SQL
  • Implement a sample example for Spark SQL
  • Integrating Hive and Spark SQL
  • Support for JSON and Parquet File Formats
  • Implement Data Visualization in Spark
  • Loading of Data
  • Hive Queries through Spark
  • Performance Tuning Tips in Spark
  • Shared Variables: Broadcast Variables & Accumulators

14.  Spark Streaming

  • Extract and analyze data from Twitter using Spark Streaming
  • Comparison of Spark and Storm – Overview

15.  Spark GraphX

  • Overview of the GraphX module in Spark
  • Creating graphs with GraphX

16.  Introduction To Machine Learning Using Spark

  • Understand the Machine Learning framework
  • Implement some of the ML algorithms using Spark MLlib

17.  Project

  • Consolidate all the learnings
  • Working on Big Data Project by integrating various key components