Optimizations in Spark: RDD, DataFrames

  • 5 Mar 2020
  • By Sarfaraz Hussain

Developing Apache Spark jobs is the easy part of the process; the difficulty comes in executing them under full load, since every job has a unique performance profile. Spark programs often face bottlenecks in CPU, network bandwidth, and memory usage, which stem from Spark's in-memory computation model.
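When memory is the bottleneck, one common lever is the storage level used for cached data. Below is a minimal sketch (the input path, object name, and app name are invented for illustration) that persists a DataFrame with MEMORY_AND_DISK so that partitions which do not fit in memory spill to disk instead of being dropped and recomputed:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object StorageLevelSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("storage-level-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical input path; any sizeable dataset works here.
    val events = spark.read.json("/data/events.json")

    // MEMORY_AND_DISK spills partitions that don't fit in memory to disk
    // rather than dropping and recomputing them, easing memory pressure
    // at the cost of extra I/O.
    events.persist(StorageLevel.MEMORY_AND_DISK)

    println(events.count()) // the first action materialises the cache
    spark.stop()
  }
}
```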

In this webinar, we will deal with the problem of how to perform your job operations in Apache Spark optimally. We will address common performance problems, including -

  • Inadequate transformations when working with the RDD API, where optimization is the developer's responsibility, unlike in a declarative SQL-style language (see the sketch after this list)
  • Proper partitioning of data so that Spark can perform tasks optimally
  • Why DataFrames perform better than RDDs
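To make the first two points concrete, here is a minimal sketch with invented toy data. With the RDD API, choosing groupByKey instead of reduceByKey shuffles every value across the network, and the partition count of 8 below is an arbitrary stand-in for a value tuned to your data and cluster:

```scala
import org.apache.spark.sql.SparkSession

object RddOptimisationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-optimisation-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Invented toy data standing in for a real keyed dataset.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("c", 1)))

    // Inadequate: groupByKey ships every (key, value) pair across the
    // network before any aggregation happens.
    val slow = pairs.groupByKey().mapValues(_.sum)

    // Better: reduceByKey pre-aggregates on each partition (map-side
    // combine), so far less data is shuffled. With RDDs, picking the
    // right transformation is entirely the developer's job.
    val fast = pairs.reduceByKey(_ + _)

    // Partitioning: 8 is an arbitrary stand-in; tune the partition count
    // to the data volume and cluster size so each task gets a fair slice.
    val repartitioned = fast.repartition(8)

    repartitioned.collect().foreach(println)
    spark.stop()
  }
}
```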

Here's the agenda of the webinar -

  • Spark Execution Model
  • Optimizing Shuffle Operations
  • Optimizing Functions
  • SQL vs RDD
  • Logical & Physical Plans
  • Optimizing Joins (the last two items are previewed in the sketch after this list)
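As a small preview of those last two agenda items, the sketch below (with invented toy tables) joins two DataFrames using a broadcast hint and prints the logical and physical plans that Catalyst produces:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object PlanAndJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("plan-and-join-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Invented toy tables: a "large" fact side and a small dimension side.
    val orders = Seq((1, "laptop"), (2, "phone"), (1, "mouse")).toDF("userId", "item")
    val users  = Seq((1, "Alice"), (2, "Bob")).toDF("userId", "name")

    // Broadcasting the small side avoids shuffling the large side at all:
    // every executor receives a full copy of `users` and joins locally.
    val joined = orders.join(broadcast(users), Seq("userId"))

    // explain(true) prints the parsed, analyzed and optimized logical
    // plans plus the physical plan, where a BroadcastHashJoin appears
    // instead of a SortMergeJoin.
    joined.explain(true)

    joined.show()
    spark.stop()
  }
}
```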

Sarfaraz Hussain

Software Consultant

Sarfaraz Hussain is a Big Data enthusiast working as a Software Consultant with 1+ years of experience. He works with technologies such as Spark, Scala, Java, Hive & Sqoop, and has completed his Master of Engineering with a specialization in Big Data & Analytics. He loves to teach, is a huge fitness freak, and loves to hit the gym when he's not coding.

