Apache Spark is a data processing system that was initially developed at the University of California by Zaharia et al. (M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster Computing with Working Sets,” Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010) and then donated to the Apache Software Foundation.

Note that Apache Spark was developed in response to some of the limitations of MapReduce.

Limitations of MapReduce

The MapReduce model allowed developers to build and run embarrassingly parallel computations on a large cluster of machines. However, every job had to read its input from disk and write its output to disk, so there was a lower bound on the latency of job execution determined by disk speeds. As a result, MapReduce was not a good fit for:

  • Iterative computations, where a single job was executed multiple times or data were passed through multiple jobs.
  • Interactive data analysis, where a user wants to run multiple ad hoc queries on the same dataset.

Note that Spark addresses both of the above use cases.

Foundation of Spark

Spark is based on the concept of Resilient Distributed Datasets (RDD).

Resilient Distributed Datasets (RDD)

An RDD is a distributed memory abstraction used to perform in-memory computations on large clusters of machines in a fault-tolerant way. More concretely, an RDD is a read-only, partitioned collection of records.

RDDs can be created through operations on data in stable storage or other RDDs.

Types of operations performed on RDDs

The operations performed on an RDD can be one of the following two types:

Transformations

Transformations are lazy operations that define a new RDD. Some examples of transformations are map, filter, join, and union.

Actions

Actions trigger a computation to return a value to the program or write data to external storage. Some examples of actions are count, collect, reduce, and save.
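
To make the distinction concrete, here is a minimal sketch, assuming an existing SparkContext named sc; the data is made up for illustration:

val numbers = sc.parallelize(Seq(1, 2, 3, 4))  // create an RDD from an in-memory collection
val doubled = numbers.map(_ * 2)               // transformation: defines a new RDD, nothing runs yet
val large = doubled.filter(_ > 4)              // transformation: still lazy
val total = large.reduce(_ + _)                // action: triggers the computation and returns 14 (6 + 8)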

Creating an RDD

A typical Spark application will create an RDD by reading some data from a distributed file system. It will then process the data by calculating new RDDs through transformations and storing the results in an output file.

For example, an application that reads some log files from HDFS and counts the number of lines that contain the phrase “sale completed” would look like the following:

// `spark` here refers to a SparkContext; the HDFS path is elided.
val lines = spark.textFile("hdfs://...")
// Transformation: lazily defines the RDD of matching lines; nothing is read yet.
val completed_sales = lines.filter(_.contains("sale completed"))
// Action: triggers reading the file, filtering the lines, and counting them.
val number_of_sales = completed_sales.count()

The above program can either be submitted as an individual application running in the background, or each command can be executed interactively in the Spark interpreter.

Note that a Spark program is executed from a coordinator process, called the driver.

Architecture of Spark

The Spark cluster contains a cluster manager node, and a set of worker nodes as shown in the following illustration:

[Illustration: Architecture of Spark]

The responsibilities between Spark components are split in the following ways:

Cluster manager

The cluster manager manages the resources of the cluster (i.e. the worker nodes) and allocates resources to clients that need to run applications.

Worker nodes

The worker nodes are the nodes of the cluster that wait to receive applications/jobs to execute.

Request broker

Spark also contains a request broker that requests resources in the cluster and makes them available to the driver.

Note: Spark supports both a standalone mode and clustering modes that use third-party cluster management systems, such as YARN, Mesos, and Kubernetes. In the standalone mode, the request broker also performs the functions of the cluster manager. In other clustering modes, such as Mesos and YARN, the request broker and the cluster manager are separate processes.

Driver

The driver is responsible for:

  • Requesting the required resources from the request broker.
  • Starting a Spark agent process, called the executor, on each node; the executor runs for the entire lifecycle of the application.
  • Analyzing the user’s application code into a directed acyclic graph (DAG) of stages.
  • Partitioning the associated RDDs.
  • Assigning the corresponding tasks to the executors available to compute them.

The driver is also responsible for managing the overall execution of the application, e.g., receiving heartbeats from executors and restarting failed tasks.
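
As a rough sketch of where the driver fits in, the process that creates the SparkContext becomes the driver. The application name and master URL below are hypothetical examples:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical configuration: "spark://..." points at a standalone cluster manager,
// while "yarn", "mesos://...", or "k8s://..." would select a third-party cluster manager.
val conf = new SparkConf()
  .setAppName("sales-counter")
  .setMaster("spark://cluster-manager-host:7077")
val sc = new SparkContext(conf) // this process is now the driver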

Note: In the previous example, the statement val completed_sales = lines.filter(_.contains("sale completed")) executes without any data being read or processed yet, since filter is a transformation. The data is read from HDFS, filtered, and then counted only when the last statement, which contains the count operation (an action), is executed. To achieve this, the driver maintains the relationships between the various RDDs in a lineage graph. When an action is performed, it triggers the computation of that RDD and all of its ancestors.
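
As a small illustration of the lineage graph, the following sketch (assuming a SparkContext named sc and a hypothetical HDFS path) prints an RDD’s lineage before triggering any computation; toDebugString is a standard RDD method that renders this graph:

val lines = sc.textFile("hdfs://namenode/logs")                  // transformation: nothing is read yet
val completedSales = lines.filter(_.contains("sale completed"))  // transformation: still lazy
println(completedSales.toDebugString)                            // prints the lineage graph of this RDD
val numberOfSales = completedSales.count()                       // action: reads, filters, and counts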

RDD operations

RDDs provide the following basic operations:

  • partitions(): Returns a list of partition objects. For example, an RDD representing an HDFS file has a partition for each block of the file by default. The user can specify a custom number of partitions for an RDD if needed.

  • partitioner(): Returns metadata determining whether the RDD is hash- or range-partitioned. This is relevant to transformations that partition key-value RDDs based on their keys, such as join or groupByKey. In these cases, hash partitioning is used on the keys by default, but the application can specify a custom partitioner (e.g., a range partitioner) if needed. An example is provided in Zaharia’s paper, which shows that the execution of the PageRank algorithm on Spark can be optimized by providing a custom partitioner that groups all the URLs of a single domain in the same partition.

  • preferredLocations(p): Lists the nodes where partition p can be accessed faster due to data locality. This might return the nodes that contain the blocks of an HDFS file corresponding to that partition, or a node that already holds that partition in memory.

  • dependencies(): Returns a list of dependencies on parent RDDs. These dependencies can be classified into two major types: narrow and wide dependencies:

    • A narrow dependency is where a partition of the parent RDD is used by at most one partition of the child RDD, such as map, filter, or union.
    • A wide dependency is where multiple child partitions may depend on a single parent partition, such as a join or groupByKey.

Note that the join of two RDDs can lead to two narrow dependencies if both RDDs are partitioned with the same partitioner (see the sketch at the end of this section).

  • iterator(p, parentIters): Computes the elements of a partition p given iterators for its parent partitions.

Note that these operations are mainly used by the framework to orchestrate the execution of Spark applications. Applications are not supposed to use these operations directly; they should use the transformations and actions presented previously.
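
As a concrete illustration of the co-partitioned join mentioned above, the following sketch partitions two key-value RDDs with the same partitioner before joining them, so the join involves only narrow dependencies. The dataset names and values are made up, and the master URL assumes a local run:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("copartition-sketch").setMaster("local[*]"))

val partitioner = new HashPartitioner(4)
val visits = sc.parallelize(Seq(("a.com/x", 3), ("b.com/y", 5))).partitionBy(partitioner).persist()
val pageNames = sc.parallelize(Seq(("a.com/x", "Home"), ("b.com/y", "Cart"))).partitionBy(partitioner).persist()

// Both RDDs share the same partitioner, so matching keys already live in matching
// partitions and the join does not need to re-shuffle the pre-partitioned data.
val joined = visits.join(pageNames)
joined.collect().foreach(println)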
