Advanced Big Data (Comprehensive Course)

Overview:

This is our advanced Big Data training, in which attendees gain practical skills not only in Hadoop in detail but also in advanced analytics concepts using Python, Hadoop and Spark. For extensive hands-on practice, candidates get access to a virtual lab and several assignments and projects. At the end of the program, candidates are awarded the Advance Data Science Certification on successful completion of the projects provided as part of the training.

A fully industry-relevant Big Data Analytics training that blends analytics and technology, well suited to aspirants who want to develop Big Data skills and get a head start in Big Data Analytics.

Duration:

10 days

Intended Audience:

- Students with an IT, software or data warehouse background who want to move into the Big Data Analytics domain

Course Outline

Course 1 (4 days)

Data Science using Python

1. Introduction To Data Science

What is Data Science?
Why Python for data science?
Relevance in industry and need of the hour
How leading companies are harnessing the power of Data Science with Python
Different phases of a typical Analytics/Data Science project and the role of Python
Anaconda vs. Python

2. Python Essentials (Core)

Overview of Python- Starting with Python
Introduction to installation of Python
Introduction to Python editors & IDEs (Canopy, PyCharm, Jupyter, Rodeo, IPython, etc.)
Understand Jupyter notebook & Customize Settings
Concept of Packages/Libraries - Important packages(NumPy, SciPy, scikit-learn, Pandas, Matplotlib, etc)
Installing & loading Packages & Name Spaces
Data Types & Data objects/structures (strings, Tuples, Lists, Dictionaries)
List and Dictionary Comprehensions
Variable & Value Labels – Date & Time Values
Basic Operations - Mathematical - string - date
Reading and writing data
Simple plotting
Control flow & conditional statements
Debugging & Code profiling
How to create classes and modules and how to call them
Scientific distributions used in Python for Data Science - NumPy, SciPy, pandas, scikit-learn, statsmodels, NLTK, etc.
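
As an illustration of the list and dictionary comprehensions listed above, here is a minimal sketch. The `readings` data and the "high"/"low" threshold are invented for the example and are not part of the course material.

```python
# Hypothetical sample data, used only for illustration.
readings = [3, 8, 15, 22, 4]

# List comprehension: keep the even readings and square them.
even_squares = [x * x for x in readings if x % 2 == 0]

# Dictionary comprehension: map each reading to a coarse label.
labels = {x: ("high" if x >= 10 else "low") for x in readings}

print(even_squares)
print(labels)
```

Both forms build a new collection in a single expression, which is often easier to read than the equivalent explicit loop.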

3. Accessing/Importing And Exporting Data Using Python Modules

Importing Data from various sources (CSV, TXT, Excel, Access, etc.)
Database Input (Connecting to database)
Viewing Data objects - subsetting, methods
Exporting Data to various formats
Important Python modules: Pandas, Beautiful Soup
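
The import and export topics above can be sketched with the standard-library `csv` module alone. This is an assumption-laden illustration: the course itself names Pandas, and the CSV content, column names and pass mark below are invented.

```python
import csv
import io

# Hypothetical CSV content; io.StringIO stands in for a file on disk.
source = io.StringIO("name,score\nAn,7.5\nBinh,9.0\n")

# Import: read each row as a dictionary keyed by the header line.
rows = list(csv.DictReader(source))

# Convert the text field to a number and add a derived column.
for row in rows:
    row["score"] = float(row["score"])
    row["passed"] = row["score"] >= 8.0  # assumed pass mark

# Export: write the enriched rows back out as CSV text.
target = io.StringIO()
writer = csv.DictWriter(
    target, fieldnames=["name", "score", "passed"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)
exported = target.getvalue()
print(exported)
```

With real files, `open(path, newline="")` would replace the `io.StringIO` objects.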

4. Data Manipulation – Cleansing – Munging Using Python Modules

Cleansing Data with Python
Data Manipulation steps (sorting, filtering, duplicates, merging, appending, subsetting, derived variables, sampling, data type conversions, renaming, formatting, etc.)
Data manipulation tools (operators, functions, packages, control structures, loops, arrays, etc.)
Python Built-in Functions (Text, numeric, date, utility functions)
Python User Defined Functions
Stripping out extraneous information
Normalizing data
Formatting data
Important Python modules for data manipulation (Pandas, Numpy, re, math, string, datetime etc)
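
A minimal sketch of three cleansing steps from the list above: stripping extraneous whitespace, removing duplicates, and normalizing numeric values. The sample names and the helper functions `clean_name` and `min_max_normalize` are hypothetical, written only for this example.

```python
import re

# Hypothetical raw records, invented for illustration.
raw_names = ["  Alice ", "BOB\t", "alice", "Carol  Ann"]

def clean_name(value):
    """Trim, collapse internal whitespace, and title-case a name."""
    collapsed = re.sub(r"\s+", " ", value.strip())
    return collapsed.title()

def min_max_normalize(values):
    """Rescale numbers to the range [0, 1]; constant input maps to 0.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

# Clean, then drop duplicates while preserving first-seen order.
cleaned = list(dict.fromkeys(clean_name(n) for n in raw_names))
scaled = min_max_normalize([10, 20, 40])

print(cleaned)
print(scaled)
```

Min-max scaling is only one of several normalization schemes; it is used here because it needs no external packages.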

5. Data Analysis – Visualization Using Python

Introduction to exploratory data analysis
Descriptive statistics, Frequency Tables and summarization
Univariate Analysis (Distribution of data & Graphical Analysis)
Bivariate Analysis (Cross Tabs, Distributions & Relationships, Graphical Analysis)
Creating graphs (bar, pie, line chart, histogram, boxplot, scatter, density, etc.)
Important Packages for Exploratory Analysis (NumPy arrays, Matplotlib, seaborn, Pandas, scipy.stats, etc.)

6. Basic Statistics & Implementation Of Stats Methods In Python

Basic Statistics - Measures of Central Tendencies and Variance
Building blocks - Probability Distributions - Normal distribution - Central Limit Theorem
Inferential Statistics - Sampling - Concept of Hypothesis Testing
Statistical Methods - Z/t-tests (one sample, independent, paired), ANOVA, Correlation and Chi-square
Important modules for statistical methods: NumPy, SciPy, Pandas
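
To make the measures of central tendency and the one-sample t-test above concrete, here is a small sketch using only the standard library. The sample values and the hypothesized mean are invented, and only the t statistic is computed; a p-value would need the t distribution, for example from SciPy.

```python
import math
import statistics

# Hypothetical sample, invented for illustration.
sample = [12.0, 15.0, 11.0, 14.0, 13.0]
hypothesized_mean = 12.0

mean = statistics.mean(sample)      # central tendency
median = statistics.median(sample)
stdev = statistics.stdev(sample)    # sample standard deviation (n - 1)

# One-sample t statistic: (sample mean - hypothesized mean) / standard error.
standard_error = stdev / math.sqrt(len(sample))
t_statistic = (mean - hypothesized_mean) / standard_error

print(mean, median, stdev, t_statistic)
```

The statistic is then compared with a critical value for n - 1 degrees of freedom to decide whether to reject the null hypothesis.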

7. Python: Machine Learning – Predictive Modelling – Basics

Introduction to Machine Learning & Predictive Modeling
Types of Business problems - Mapping of Techniques - Regression vs. classification vs. segmentation vs. Forecasting
Major Classes of Learning Algorithms - Supervised vs. Unsupervised Learning
Different Phases of Predictive Modeling (Data Pre-processing, Sampling, Model Building, Validation)
Overfitting (Bias-Variance Trade off) & Performance Metrics
Feature engineering & dimension reduction
Concept of optimization & cost function
Concept of gradient descent algorithm
Concept of cross-validation (bootstrapping, K-fold validation, etc.)
Model performance metrics (R-square, RMSE, MAPE, AUC, ROC curve, recall, precision, sensitivity, specificity, confusion matrix)
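
The cost function and gradient descent concepts above can be illustrated with a one-variable linear regression. This is a sketch under stated assumptions: the data is generated from y = 2x + 1 without noise, and the learning rate and step count are arbitrary choices that happen to converge for this data.

```python
# Hypothetical data generated from y = 2x + 1 (no noise).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def mse(w, b):
    """Cost function: mean squared error of the line y = w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(learning_rate=0.05, steps=5000):
    """Repeatedly step w and b against the gradient of the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

w, b = gradient_descent()
print(w, b, mse(w, b))
```

A learning rate that is too large makes the updates diverge, while one that is too small converges slowly, which is why it is usually treated as a hyperparameter to tune.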

8. Machine Learning Algorithms & Applications – Implementation In Python

Linear & Logistic Regression
Segmentation - Cluster Analysis (K-Means)
Decision Trees (CART/C5.0)
Ensemble Learning (Random Forest, Bagging & boosting)
Artificial Neural Networks (ANN)
Support Vector Machines (SVM)
Other Techniques (KNN, Naïve Bayes, PCA)
Introduction to Text Mining using NLTK
Introduction to Time Series Forecasting (Decomposition & ARIMA)
Important Python modules for Machine Learning (scikit-learn, statsmodels, SciPy, NLTK, etc.)
Fine-tuning the models using hyperparameters, grid search, pipelines, etc.

9. Project – Consolidate Learnings

Applying different algorithms to solve business problems and benchmarking the results

Course 2 (6 days)

Big Data Analytics using Hadoop & Spark

1. Introduction To Big Data

Introduction and Relevance
Uses of Big Data analytics in various industries such as Oil & Gas, Telecom, E-commerce, Finance and Insurance
Problems with Traditional Large-Scale Systems

2. Analytics value proposition for Oil & Gas

Introduction to Oil & Gas industry
Stages of oil & gas sector: Exploration – Finding Oil & Gas
Role of Analytics in Oil & Gas Industry
E&P Analytics

- Reservoir characterization

- Drilling optimization

- Unconventional completions

- Production forecasting

- Well and well portfolio value optimization

- Seismic analyses

Assets & Operations

- Process regularity and facility integrity

- Demand forecasting

- Integrated operations and logistics

- Operational risk/environment, health and safety (EH & S)

Energy Risk Management

- Energy trading and risk management

- Credit risk management

- A consolidated view of your entire portfolio

- Cash flow management

3. Hadoop (Big Data) Eco-System

Motivation for Hadoop
Different types of projects by Apache
Role of projects in the Hadoop Ecosystem
Key technology foundations required for Big Data
Limitations and Solutions of existing Data Analytics Architecture
Comparison of traditional data management systems with Big Data management systems
Evaluate key framework requirements for Big Data analytics
Hadoop Ecosystem & Hadoop 2.x core components
Explain the relevance of real-time data
Explain how to use Big Data and real-time data as a Business planning tool

4. Hadoop Cluster – Architecture – Configuration Files

Hadoop Master-Slave Architecture
The Hadoop Distributed File System - Concept of data storage
Explain different types of cluster setups (fully distributed, pseudo-distributed, etc.)
Hadoop cluster set up - Installation
Hadoop 2.x Cluster Architecture
A Typical enterprise cluster – Hadoop Cluster Modes
Understanding cluster management tools such as Cloudera Manager and Apache Ambari

5. Hadoop – HDFS & MapReduce (YARN)

HDFS Overview & Data storage in HDFS
Moving data from the local machine into Hadoop and vice versa (data loading techniques)
MapReduce Overview (traditional way vs. MapReduce way)
Concept of Mapper & Reducer
Understanding MapReduce program Framework
Develop MapReduce Program using Java (Basic)
Develop MapReduce program with the streaming API (Basic)
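
The mapper and reducer concept above can be sketched in plain Python with the classic word count. This is an in-memory simulation, not Hadoop code: the `shuffle` function stands in for the grouping that the framework performs between the two phases, and all function names and input lines are invented for the example.

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Reduce phase: sum the counts for one word."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(k, v) for k, v in shuffle(pairs).items())

# Hypothetical input, standing in for lines of a file stored in HDFS.
lines = ["big data big ideas", "data drives ideas"]
counts = word_count(lines)
print(counts)
```

On a real cluster the mapper and reducer run in parallel on different nodes, and the framework handles the shuffle, sorting and fault tolerance.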

6. Data Integration Using Sqoop & Flume

Integrating Hadoop into an Existing Enterprise
Loading Data from an RDBMS into HDFS by Using Sqoop
Managing Real-Time Data Using Flume
Accessing HDFS from Legacy Systems

7. Data Analysis Using Pig

Introduction to Data Analysis Tools
Apache PIG - MapReduce Vs Pig, Pig Use Cases
PIG’s Data Model
PIG Streaming
Pig Latin Program & Execution
Pig Latin : Relational Operators, File Loaders, Group Operator, COGROUP Operator, Joins and COGROUP, Union, Diagnostic Operators, Pig UDF
Writing Java UDFs
Embedding Pig in Java
PIG Macros
Parameter Substitution
Use Pig to automate the design and implementation of MapReduce applications
Use Pig to apply structure to unstructured Big Data

8. Data Analysis Using Hive

Apache Hive - Hive Vs. PIG - Hive Use Cases
Discuss the Hive data storage principle
Explain the File formats and Records formats supported by the Hive environment
Perform operations with data in Hive
Hive QL: Joining Tables, Dynamic Partitioning, Custom Map/Reduce Scripts
Hive Script, Hive UDF
Hive Persistence formats
Loading data in Hive - Methods
Serialization & Deserialization
Handling Text data using Hive
Integrating external BI tools with Hadoop Hive

9. Data Analysis Using Impala

Impala & Architecture
How Impala executes Queries and its importance
Hive vs. PIG vs. Impala
Extending Impala with User Defined functions

10. Introduction To Other Ecosystem Tools

NoSQL database - HBase
Introduction to Oozie

11. Spark: Introduction

Introduction to Apache Spark
Streaming Data Vs. In Memory Data
MapReduce vs. Spark
Modes of Spark
Spark Installation Demo
Overview of Spark on a cluster
Spark Standalone Cluster

12. Spark: Spark In Practice

Invoking Spark Shell
Creating the Spark Context
Loading a File in Shell
Performing Some Basic Operations on Files in Spark Shell
Caching Overview
Distributed Persistence
Spark Streaming Overview (Example: Streaming Word Count)

13. Spark: Spark Meets Hive

Analyze Hive and Spark SQL Architecture
Analyze Spark SQL
Context in Spark SQL
Implement a sample example for Spark SQL
Integrating Hive and Spark SQL
Support for JSON and Parquet File Formats
Implement Data Visualization in Spark
Loading of Data
Hive Queries through Spark
Performance Tuning Tips in Spark
Shared Variables: Broadcast Variables & Accumulators

14. Spark Streaming

Extract and analyze data from Twitter using Spark Streaming
Comparison of Spark and Storm – Overview

15. Spark GraphX

Overview of the GraphX module in Spark
Creating graphs with GraphX

16. Introduction To Machine Learning Using Spark

Understand Machine learning framework
Implement some of the ML algorithms using Spark MLlib

17. Project

Consolidate all the learnings
Working on Big Data Project by integrating various key components
