Introduction to Apache Spark
Let's have an introduction to Apache Spark and its architecture.
Apache Spark is a data processing system that was initially developed at the University of California, Berkeley, and later became a top-level project of the Apache Software Foundation.
Note that Apache Spark was developed in response to some of the limitations of MapReduce.
Limitations of MapReduce
The MapReduce model allowed developing and running embarrassingly parallel computations on a big cluster of machines. However, every job had to read its input from disk and write its output to disk. As a result, there was a lower bound on the latency of job execution, determined by disk speeds. MapReduce was therefore not a good fit for:
- Iterative computations, where a single job is executed multiple times or data is passed through multiple jobs.
- Interactive data analysis, where a user wants to run multiple ad hoc queries on the same dataset.
Note that Spark addresses both of these use cases.
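As a minimal sketch of how Spark serves these use cases (assuming a SparkContext named sc and a placeholder HDFS path, neither of which appears in the original text), an RDD can be kept in memory and reused across multiple computations:
// Load a dataset once and keep it in memory, so that repeated iterative or
// interactive computations do not have to re-read it from disk each time.
val events = sc.textFile("hdfs://...").cache()

val errors = events.filter(_.contains("ERROR")).count()    // first pass reads from HDFS
val warnings = events.filter(_.contains("WARN")).count()   // subsequent passes reuse the cached data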
Foundation of Spark
Spark is based on the concept of Resilient Distributed Datasets (RDD).
Resilient Distributed Datasets (RDD)
RDD is a distributed memory abstraction used to perform in-memory computations on large clusters of machines in a fault-tolerant way. More concretely, an RDD is a read-only, partitioned collection of records.
RDDs can be created through operations on data in stable storage or other RDDs.
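For illustration (a sketch that assumes a SparkContext named sc, which the original text does not define), both creation paths look like this:
// From data in stable storage, e.g. a file in HDFS.
val lines = sc.textFile("hdfs://...")

// From another RDD, via a transformation.
val upperCased = lines.map(_.toUpperCase)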
Types of operations performed on RDDs
The operations performed on an RDD can be one of the following two types:
Transformations
Transformations are lazy operations that define a new RDD. Some examples of transformations are map, filter, join, and union.
Actions
Actions trigger a computation to return a value to the program or write data to external storage. Some examples of actions are count, collect, reduce, and save.
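The following sketch (with hypothetical RDD names and an assumed SparkContext sc) shows the difference: transformations only describe new RDDs, while actions actually run the computation.
val lines = sc.textFile("hdfs://...")

// Transformations: lazily define new RDDs; nothing is computed at this point.
val lengths = lines.map(_.length)
val nonEmpty = lines.filter(_.nonEmpty)
val combined = nonEmpty.union(lines)

// Actions: trigger the actual computation and return results to the program.
val totalChars = lengths.reduce(_ + _)   // reduce is an action
val allRecords = combined.collect()      // collect brings the records to the driver
val lineCount = lines.count()            // count returns the number of records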
Creating an RDD
A typical Spark application will create an RDD by reading some data from a distributed file system. It will then process the data by calculating new RDDs through transformations and storing the results in an output file.
For example, an application that reads some log files from HDFS and counts the number of lines that contain the phrase “sale completed” would look like the following:
lines = spark.textFile("hdfs://...")                          // creates an RDD from the HDFS file
completed_sales = lines.filter(_.contains("sale completed"))  // transformation: lazily defines a new RDD
number_of_sales = completed_sales.count()                     // action: triggers the computation
The above program can either be submitted as an individual application that runs in the background, or each of the commands can be executed interactively in the Spark interpreter.
Note that a Spark program is executed from a coordinator process, called the driver.
Architecture of Spark
A Spark cluster contains a cluster manager node and a set of worker nodes, as shown in the following illustration:
The responsibilities are split between the Spark components in the following way:
Cluster manager
The cluster manager manages the resources of the cluster (i.e. the worker nodes) and allocates resources to clients that need to run applications.
Worker nodes
The worker nodes are the nodes of the cluster that wait to receive applications/jobs to execute.
Request broker
Spark also contains a request broker that requests resources in the cluster and makes them available to the driver.
Note: Spark supports both a standalone mode and clustering modes that use third-party cluster management systems, such as YARN, Mesos, and Kubernetes. In the standalone mode, the request broker also performs the functions of the cluster manager, while in some of the other clustering modes, such as Mesos and YARN, they are separate processes.
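For reference, which of these modes an application runs under is usually determined by the master URL its configuration points at. The sketch below only lists the URL formats; all host names and ports are placeholders.
import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("example")
// conf.setMaster("local[*]")                         // run locally, without a cluster
// conf.setMaster("spark://cluster-manager:7077")     // Spark standalone mode
// conf.setMaster("yarn")                             // Hadoop YARN
// conf.setMaster("mesos://mesos-master:5050")        // Apache Mesos
// conf.setMaster("k8s://https://k8s-apiserver:443")  // Kubernetes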
Driver
The driver is responsible for:
- Requesting the required resources from the request broker.
- Starting a Spark agent process, called the executor, on each node; the executors run for the entire lifecycle of the application.
- Analyzing the user’s application code and breaking it into a directed acyclic graph (DAG) of stages.
- Partitioning the associated RDDs.
- Assigning the corresponding tasks to the available executors to compute them.
The driver is also responsible for managing the overall execution of the application, e.g., receiving heartbeats from executors and restarting failed tasks.
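Putting this together, the driver is simply the process that creates the SparkContext. The following is a rough sketch, assuming the standalone mode and placeholder application and host names:
import org.apache.spark.{SparkConf, SparkContext}

// Creating the SparkContext makes the driver request resources from the
// cluster manager, which launches executors on the worker nodes.
val conf = new SparkConf()
  .setAppName("sales-analysis")                // hypothetical application name
  .setMaster("spark://cluster-manager:7077")   // placeholder standalone master URL

val sc = new SparkContext(conf)

// ... define RDDs and run actions here; the driver builds the DAG of stages
// and assigns the resulting tasks to the executors ...

sc.stop()  // releases the executors when the application finishes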
Note: In the earlier log-counting example, the second line
completed_sales = lines.filter(_.contains("sale completed"))
is executed without any data being read or processed yet, since filter is a transformation. The data is read from HDFS, filtered, and then counted only when the third line, which contains the count operation (an action), is processed. To achieve this, the driver maintains the relationships between the various RDDs in a lineage graph. When an action is performed, it triggers the calculation of the corresponding RDD and all of its ancestors.
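One way to observe this laziness and the lineage graph (a small sketch reusing the earlier example and assuming a SparkContext named sc) is the toDebugString helper of the Scala RDD API, which prints an RDD together with the parent RDDs it depends on:
val lines = sc.textFile("hdfs://...")
val completed_sales = lines.filter(_.contains("sale completed"))  // nothing has been read yet

// Prints the lineage of completed_sales, i.e. the chain of parent RDDs
// going back to the RDD that represents the HDFS file.
println(completed_sales.toDebugString)

val number_of_sales = completed_sales.count()  // the action triggers reading and filtering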
RDD operations
RDDs provide the following basic operations:
- partitions(): Returns a list of partition objects. For example, an RDD representing an HDFS file has a partition for each block of the file by default. The user can specify a custom number of partitions for an RDD if needed.
- partitioner(): Returns metadata determining whether the RDD is hash- or range-partitioned. This is relevant to transformations that join multiple key-value RDDs based on their keys, such as join or groupByKey. In these cases, hash partitioning is used on the keys by default, but the application can specify a custom range partitioning if needed. An example is provided in Zaharia’s paper, which showed that the execution of the PageRank algorithm on Spark can be optimized by providing a custom partitioner that groups all the URLs of a single domain in the same partition.
- preferredLocations(p): Lists the nodes where partition p can be accessed faster due to data locality. This might return the nodes that contain the blocks of an HDFS file corresponding to that partition, or a node that already holds in memory a partition that needs to be processed.
- dependencies(): Returns a list of dependencies on parent RDDs. These dependencies can be classified into two major types:
  - A narrow dependency, where a partition of the parent RDD is used by at most one partition of the child RDD, such as map, filter, or union.
  - A wide dependency, where multiple child partitions may depend on a single parent partition, such as join or groupByKey.
  Note that a join of two RDDs can lead to two narrow dependencies if both RDDs are partitioned with the same partitioner, as shown in the following illustration.
- iterator(p, parentIters): Computes the elements of a partition p given iterators for its parent partitions.
Note that these operations are mainly used by the framework to orchestrate the execution of Spark applications. Applications are not supposed to use them directly; they should use the transformations and actions presented previously.
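That said, these members are visible on the Scala RDD API (as fields and methods rather than the exact signatures listed above), so they can be inspected from a driver program for learning or debugging purposes. The sketch below uses hypothetical data and assumes a SparkContext named sc:
import org.apache.spark.HashPartitioner

// Build a key-value RDD and repartition it with an explicit hash partitioner.
val pairs = sc.textFile("hdfs://...")
  .map(line => (line.split(",")(0), line))   // hypothetical "key,rest-of-record" format
  .partitionBy(new HashPartitioner(8))

println(pairs.partitions.length)                        // number of partitions (8 here)
println(pairs.partitioner)                              // Some(HashPartitioner) after partitionBy
println(pairs.dependencies)                             // dependencies on the parent RDD
println(pairs.preferredLocations(pairs.partitions(0)))  // locality hints for the first partition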