Advanced Methods in Data Science and Big Data Analytics

Overview

This course provides practical, foundation-level training that enables immediate and effective participation in big data and other analytics projects. It introduces big data and the Data Analytics Lifecycle as a way to address business challenges that leverage big data. The course provides grounding in basic and advanced analytic methods and an introduction to big data analytics technology and tools, including MapReduce and Hadoop. Labs give students the opportunity to see how a practicing data scientist applies these methods and tools to real-world business challenges. The course takes an “open”, or technology-neutral, approach and includes a final lab that addresses a big data analytics challenge by applying the concepts taught in the course in the context of the Data Analytics Lifecycle. The course prepares the student for the Proven™ Professional Data Scientist Associate (EMCDSA) certification exam.

Duration: 40 hours

Prerequisite Knowledge/Skills

  • Completion of the Data Science and Big Data Analytics course
  • Proficiency in at least one programming language such as Java or Python

Audience

This course is intended for aspiring data scientists, data analysts who have completed the associate-level Data Science and Big Data Analytics course, and computer scientists who want to learn MapReduce and methods for analyzing unstructured data such as text.

Objectives

Upon successful completion of this course, participants should be able to:

  • Develop and execute MapReduce functionality
  • Gain familiarity with NoSQL databases and Hadoop Ecosystem tools for analyzing large-scale, unstructured data sets
  • Develop a working knowledge of Natural Language Processing, Social Network Analysis, and Data Visualization concepts
  • Use advanced quantitative methods, and apply one of them in a Hadoop environment
  • Apply advanced techniques to real-world datasets in a final lab

Course Outline

The content of this course is designed to support the course objectives.

- Module 1: MapReduce and Hadoop

  •  Lesson 1: The MapReduce Framework
  •  Lesson 2: Apache Hadoop
  •  Lesson 3: Hadoop Distributed File System
  •  Lesson 4: YARN

- Module 2: Hadoop Ecosystem and NoSQL

  •  Lesson 1: Hadoop Ecosystem
  •  Lesson 2: Pig
  •  Lesson 3: Hive
  •  Lesson 4: NoSQL - Not Only SQL
  •  Lesson 5: HBase
  •  Lesson 6: Spark

- Module 3: Natural Language Processing

  •  Lesson 1: Introduction to NLP
  •  Lesson 2: Text Preprocessing
  •  Lesson 3: TF-IDF
  •  Lesson 4: Beyond Bag of Words
  •  Lesson 5: Language Modeling
  •  Lesson 6: POS Tagging and HMM
  •  Lesson 7: Sentiment Analysis and Topic Modeling

- Module 4: Social Network Analysis

  •  Lesson 1: Introduction to SNA and Graph Theory
  •  Lesson 2: Most Important Nodes
  •  Lesson 3: Communities and Small World
  •  Lesson 4: Network Problems and SNA Tools

- Module 5: Data Science Theory and Methods

  •  Lesson 1: Simulation 
  •  Lesson 2: Random Forests
  •  Lesson 3: Multinomial Logistic Regression

- Module 6: Data Visualization

  •  Lesson 1: Perception and Visualization
  •  Lesson 2: Visualization of Multivariate Data