DAG of Stages in Apache Spark

Learn how Apache Spark provides efficient fault tolerance.

As explained in the previous lesson, the driver examines the lineage graph of the application code and builds a DAGDirected Acyclic Graph of stages to execute.

DAG scheduler of stages

A DAG of Stages is shown in the following illustration:

  • Each stage contains as many pipelined transformations with narrow dependencies (one-to-one) as possible.
  • The boundaries of each stage correspond to operations with wide dependencies (one-to-many) that require a data shuffle between partitions or any previously computed partitions that have been persisted. And it can short-circuit the computation of ancestor RDDs.

The driver launches tasks to compute missing partitions from each stage until it has computed the target RDD.

Note: Spark is based on the concept of Resilient Distributed Datasets (RDD), which is a distributed memory abstraction used to perform in-memory computations on large clusters of machines in a fault-tolerant way.

The tasks are assigned to executors based on data locality. If a task needs to process a partition in memory on a node, it’s submitted to that node. If a task processes a partition for which the containing RDD provides preferred locations (e.g., an HDFS file), it’s submitted to these nodes.

For wide dependencies that require data shuffling, nodes holding parent partitions materialize intermediate records locally that are later pulled by nodes from the next stage, similar to MapReduce.

Note: This graph is the basic building block for efficient fault tolerance.

Tolerating fault

When an executor fails for some reason, any tasks running on it are re-scheduled on another executor. Along with this, tasks are scheduled for any parent RDDs required to calculate the RDD of this task. Consequently, wide dependencies are much more inefficient than narrow dependencies when recovering from failures, as shown in the following illustration:

Slow recovery

Long lineage graphs can make a recovery very slow since many RDDs will need to be recomputed in a potential failure near the end of the graph.

Fast recovery process

Spark provides a checkpointing capability to make fast recovery.

Checkpointing capability

Checkpointing capability is used to store RDDs from specified tasks to stable storage (e.g., a distributed file system). In this way, checkpointed RDDs can be read from stable storage during recovery, thus only having to rerun smaller portions of the graph. Users can call a persist() method to indicate which RDDs need to be stored on the disk.

Note: The checkpoint capability is very useful for interactive applications, where a specific RDD is calculated and used in multiple ways to explore a dataset without having to calculate the whole lineage each time.

Get hands-on with 1400+ tech skills courses.