Big Data Architecture Workshop

Overview:

The Big Data Architecture Workshop (BDAW) is a learning event that addresses advanced big data architecture topics. BDAW brings technical contributors together in a group setting to design and architect solutions to a challenging business problem. The workshop first addresses big data architecture problems in general, and then applies the resulting concepts to the design of a challenging system.

Throughout this highly interactive workshop, participants apply concepts to real-world examples, which leads to detailed discussions. Participants learn techniques for architecting big data systems, not only from Cloudera’s experience but also from the experiences of fellow participants.

Delivery Method and Course Duration:

Classroom: 3 days

Intended Audience & Prerequisites:

To gain the most from the workshop, participants should have working knowledge of technologies such as HDFS, Spark, MapReduce, Hive/Impala, data formats, and relational database management systems. Detailed API-level knowledge is not needed, as there will not be any programming activities.

Participants will be divided into small groups to discuss the problems and develop solutions. Each group will select a spokesperson who will present the group’s findings to the workshop. There will not be any programming labs, but solutions implemented and deployed in the cloud will be demonstrated during the workshop.

Course Outline:

1. Introduction

2. Workshop Application Use Cases

Oz Metropolitan
Architectural questions
Team activity: Analyze Metroz Application Use Cases

3. Application Vertical Slice

Definition
Minimizing risk of an unsound architecture
Selecting a vertical slice
Team activity: Identify an initial vertical slice for Metroz

4. Application Processing

Real-time and near-real-time processing
Batch processing
Data access patterns
Delivery and processing guarantees
Machine Learning pipelines
Team activity: Identify delivery and processing patterns in Metroz, characterize response time requirements, identify machine learning pipelines

5. Application Data

Three V’s of Big Data
Data Lifecycle
Data Formats
Transforming Data
Team activity: Metroz Data Requirements

6. Scalable Applications

Scale up, scale out, scale to X
Determining if an application will scale
Poll: scalable airport terminal designs
Hadoop and Spark Scalability
Team activity: Scaling Metroz

7. Fault Tolerant Distributed Systems

Principles
Transparency
Hardware vs. Software redundancy
Tolerating disasters
Stateless functional fault tolerance
Stateful fault tolerance
Replication and group consistency
Fault tolerance in Spark and MapReduce
Application tolerance for failures
Team activity: Identify Metroz component failures and requirements

8. Security and Privacy

Principles
Privacy
Threats
Technologies
Team activity: Identify threats and security mechanisms in Metroz

9. Deployment

Cluster sizing and evolution
On-premises vs. cloud
Edge computing
Team activity: Select deployment for Metroz

10. Technology Selection

HDFS
HBase
Kudu
Relational Database Management Systems
MapReduce
Spark, including Spark Streaming, Spark SQL and Spark ML
Hive
Impala
Cloudera Search
Data Sets and Formats
Team activity: Technologies relevant to Metroz

11. Software Architecture

Architecture artifacts
One platform or multiple; Lambda architecture
Team activity: Produce a high-level architecture, select technologies, revisit the vertical slice
Vertical Slice demonstration

12. Wrap Up