Advanced Spark (Scala/Java)

Apache Spark is a powerful, open-source processing engine for data in a Hadoop cluster, often described as the next-generation successor to MapReduce and optimized for speed, ease of use, and sophisticated analytics. The Spark framework supports streaming data processing and complex, iterative algorithms, enabling applications to run up to 100x faster than traditional Hadoop MapReduce programs when working in memory.

This 5-day Spark course is aimed at developers who are encountering Spark for the first time and want to understand how to build Big Data products with Spark. The course enables participants to build complete, unified Big Data applications combining batch, streaming, and interactive analytics on all their data.

Developers will be able to write sophisticated parallel applications that support faster decisions, better decisions, and real-time actions across a wide variety of use cases, architectures, and industries.

The course has a practical focus, mixing presentation with in-depth hands-on labs and exercises.

Prerequisites

To benefit from this course, you should have programming experience with Scala or Python. The language of instruction is Scala. Basic Linux knowledge is expected.

Day 1

First Brush

Big Data: Why and What?

Introduction to Spark

Spark Installation and Modes of Operation

Spark shell

RDD Fundamentals

Transformations in RDD

Actions in RDD

Programming with Spark

Spark Fundamentals

Role of Spark Context

MapReduce in Spark

Day 2

RDDs

RDD API in Detail

Types of RDD (Pair/Key-Value RDD, Numeric RDD, JDBC RDD, etc.)

Creating RDDs from Different Data Sources (Parquet, Avro, JSON, JDBC)

Caching and Persistence

Caching Overview

Distributed Persistence

Parallel Programming

Partitions and Data Locality

Executing parallel operations

Advanced Concepts of RDD

Accumulators and Broadcast Variables

RDD Internals

Day 3

Spark SQL

Overview

Role of SQLContext

Running Spark SQL in Spark shell

Datasets

Overview

Creating Datasets

Differences between Data Frames and Datasets

Conversion from Data Frame to Dataset and Vice Versa

Data Frames

Introduction to Data Frames

Creating Data Frames

Transformations and Operations on Data Frames

Interoperating with RDDs

Spark Schedulers

Overview

Scheduling Across Applications

Scheduling Within Application

Day 4

Spark Streaming

Overview

Role of StreamingContext

Receivers

Streaming Applications

Spark MLlib

Data Types

Basic Statistics

Classification

Clustering

Pipelining

DStreams

Introduction

Operations in DStreams

Sliding Window Operations

Performance Tuning of DStreams

Stateful and Stateless Transformations in DStreams

Day 5

Clustering

Standalone

Configuration of SQLContext

Monitoring

Web UI

REST API

Tuning and Debugging

Data Serialization

Memory Management

Broadcasting Large Variables

Security

Event Logging

Encryption

SSL Configuration

Standalone Mode

Deployment

Submitting Applications

Spark Standalone

Amazon EC2

Logging

Q&A