Optimizations in Spark: RDD, DataFrames

5 Mar 2020
By Sarfaraz Hussain

This webinar has now ended. Please view the session recording above.

Developing Apache Spark jobs is the easier part of the process. The difficult part is executing them under full load, because each job behaves differently when it comes to performance. Spark programs often hit bottlenecks in CPU, network bandwidth, or memory usage, which stem from Spark's in-memory computation model.

In this webinar, we will look at how to run your job operations as efficiently as possible in Apache Spark. We will address common performance problems, including:

Inadequate transformations when working with the RDD API, where optimization is the developer's responsibility, unlike in SQL queries
Proper partitioning of data so that Spark can perform tasks optimally
Why DataFrames perform better than RDDs
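
The first point can be illustrated without a cluster. In Spark's RDD API, `groupByKey` followed by a sum sends every record through the shuffle, while `reduceByKey` combines values inside each partition before shuffling. The block below is a minimal plain-Python sketch of that difference, not Spark code: the function names and the sample partitions are hypothetical, and the "shuffle" is simply a count of the records that would have to cross the network.

```python
from collections import defaultdict


def shuffle_without_combine(partitions):
    """Toy model of a groupByKey-style shuffle: every record is shuffled."""
    shuffled = [record for part in partitions for record in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def shuffle_with_combine(partitions):
    """Toy model of a reduceByKey-style shuffle: pre-aggregate per partition."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value          # combine locally before the shuffle
        shuffled.extend(local.items())   # one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


# Hypothetical input: three partitions of (word, count) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("c", 1)],
    [("a", 1), ("c", 1), ("c", 1), ("c", 1)],
]

plain_totals, plain_records = shuffle_without_combine(partitions)
combined_totals, combined_records = shuffle_with_combine(partitions)

print(plain_totals == combined_totals)   # both strategies agree on the result
print(plain_records, combined_records)   # records shuffled by each strategy
```

Both strategies produce the same totals, but the pre-aggregating version shuffles fewer records. With the RDD API the developer has to make this choice explicitly, whereas the DataFrame API leaves such decisions to Spark's query optimizer.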

Here's the agenda for the webinar:

Spark Execution Model
Optimizing Shuffle Operations
Optimizing Functions
SQL vs RDD
Logical & Physical Plan
Optimizing Joins

Sarfaraz Hussain

Software Consultant

Sarfaraz Hussain is a Big Data enthusiast working as a Software Consultant, with over a year of experience. He works with technologies such as Spark, Scala, Java, Hive, and Sqoop, and has completed a Master of Engineering with a specialization in Big Data & Analytics. He loves to teach, is a keen fitness enthusiast, and heads to the gym when he's not coding.
