Apache Spark is a data processing system that was initially developed at the University of California by Zaharia et al. (M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster Computing with Working Sets,” Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010) and then donated to the Apache Software Foundation.

Note that Apache Spark was developed in response to some of the limitations of MapReduce.

Limitations of MapReduce

The MapReduce model allowed developers to build and run embarrassingly parallel computations on a large cluster of machines. However, every job had to read its input from disk and write its output to disk, so there was a lower bound on the latency of job execution determined by disk speeds. As a result, MapReduce was not a good fit for:

  • Iterative computations, where a single job was executed multiple times or data were passed through multiple jobs.
  • Interactive data analysis, where a user wants to run multiple ad hoc queries on the same dataset.

Note that Spark addresses both of the above use cases.

Foundation of Spark

Spark is based on the concept of Resilient Distributed Datasets (RDD).

Resilient Distributed Datasets (RDD)

An RDD is a distributed memory abstraction used to perform in-memory computations on large clusters of machines in a fault-tolerant way. More concretely, an RDD is a read-only, partitioned collection of records.

RDDs can be created through operations on data in stable storage or other RDDs.

Types of operations performed on RDDs

The operations performed on an RDD can be one of the following two types:

Transformations

Transformations are lazy operations that define a new RDD. Some examples of transformations are map, filter, join, and union.

Actions

Actions trigger a computation to return a value to the program or write data to external storage. Some examples of actions are count, collect, reduce, and save.
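
To make the distinction concrete, here is a minimal sketch, assuming an existing SparkContext named sc; the data is made up for illustration:

val numbers = sc.parallelize(Seq(1, 2, 3, 4))  // create an RDD from an in-memory collection
val doubled = numbers.map(_ * 2)               // transformation: defines a new RDD, nothing runs yet
val large = doubled.filter(_ > 4)              // transformation: still lazy
val total = large.reduce(_ + _)                // action: triggers the computation and returns 14 (6 + 8)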

Creating an RDD

A typical Spark application will create an RDD by reading some data from a distributed file system. It will then process the data by calculating new RDDs through transformations and storing the results in an output file.

For example, an application that reads some log files from HDFS and counts the number of lines that contain the phrase “sale completed” would look like the following:

// `spark` here refers to a SparkContext; the HDFS path is elided.
val lines = spark.textFile("hdfs://...")
// Transformation: lazily defines the RDD of matching lines; nothing is read yet.
val completed_sales = lines.filter(_.contains("sale completed"))
// Action: triggers reading the file, filtering the lines, and counting them.
val number_of_sales = completed_sales.count()

The above program can either be submitted as an individual application running in the background, or each command can be executed interactively in the Spark interpreter.

Note that a Spark program is executed from a coordinator process, called the driver.

Architecture of Spark

The Spark cluster contains a cluster manager node, and a set of worker nodes as shown in the following illustration:

[Illustration: Architecture of Spark]

The responsibilities between Spark components are split in the following ways:

Cluster manager

The cluster manager manages the resources of the cluster (i.e. the worker nodes) and allocates resources to clients that need to run applications.

Worker nodes

The worker nodes are the nodes of the cluster that wait to receive applications/jobs to execute.

Request broker

Spark also contains a request broker that requests resources in the cluster and makes them available to the driver.

Note: Spark supports both a standalone mode and clustering modes that use third-party cluster management systems, such as YARN, Mesos, and Kubernetes. In the standalone mode, the request broker also performs the functions of the cluster manager. In other clustering modes, such as Mesos and YARN, the request broker and the cluster manager are separate processes.

Driver

The driver is responsible for:

  • Requesting the required resources from the request broker.
  • Starting a Spark agent process, called the executor, on each node; the executor runs for the entire lifecycle of the application.
  • Analyzing the user’s application code into a directed acyclic graph (DAG) of stages.
  • Partitioning the associated RDDs.
  • Assigning the corresponding tasks to the executors available to compute them.

The driver is also responsible for managing the overall execution of the application, e.g., receiving heartbeats from executors and restarting failed tasks.
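
As a rough sketch of where the driver fits in, the process that creates the SparkContext becomes the driver. The application name and master URL below are hypothetical examples:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical configuration: "spark://..." points at a standalone cluster manager,
// while "yarn", "mesos://...", or "k8s://..." would select a third-party cluster manager.
val conf = new SparkConf()
  .setAppName("sales-counter")
  .setMaster("spark://cluster-manager-host:7077")
val sc = new SparkContext(conf) // this process is now the driver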

Note: In the previous example, the statement val completed_sales = lines.filter(_.contains("sale completed")) executes without any data being read or processed yet, since filter is a transformation. The data is read from HDFS, filtered, and then counted only when the last statement, which contains the count operation (an action), is executed. To achieve this, the driver maintains the relationships between the various RDDs in a lineage graph. When an action is performed, it triggers the computation of that RDD and all of its ancestors.
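
As a small illustration of the lineage graph, the following sketch (assuming a SparkContext named sc and a hypothetical HDFS path) prints an RDD’s lineage before triggering any computation; toDebugString is a standard RDD method that renders this graph:

val lines = sc.textFile("hdfs://namenode/logs")                  // transformation: nothing is read yet
val completedSales = lines.filter(_.contains("sale completed"))  // transformation: still lazy
println(completedSales.toDebugString)                            // prints the lineage graph of this RDD
val numberOfSales = completedSales.count()                       // action: reads, filters, and counts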

RDD operations

RDDs provide the following basic operations:

  • partitions(): Returns a list of partition objects. For example, an RDD representing an HDFS file has a partition for each block of the file by default. The user can specify a custom number of partitions for an RDD if needed.

  • partitioner(): Returns metadata determining whether the RDD is hash- or range-partitioned. This is relevant to transformations that partition key-value RDDs based on their keys, such as join or groupByKey. In these cases, hash partitioning is used on the keys by default, but the application can specify a custom partitioner (e.g., a range partitioner) if needed. An example is provided in Zaharia’s paper, which shows that the execution of the PageRank algorithm on Spark can be optimized by providing a custom partitioner that groups all the URLs of a single domain in the same partition.

  • preferredLocations(p): Lists the nodes where partition p can be accessed faster due to data locality. This might return the nodes that contain the blocks of an HDFS file corresponding to that partition, or a node that already holds that partition in memory.

  • dependencies(): Returns a list of dependencies on parent RDDs. These dependencies can be classified into two major types: narrow and wide dependencies:

    • A narrow dependency is where a partition of the parent RDD is used by at most one partition of the child RDD, such as map, filter, or union.
    • A wide dependency is where multiple child partitions may depend on a single parent partition, such as a join or groupByKey.

Note that the join of two RDDs can lead to two narrow dependencies if both RDDs are partitioned with the same partitioner (see the sketch at the end of this section).

  • iterator(p, parentIters): Computes the elements of a partition p given iterators for its parent partitions.

Note that these operations are mainly used by the framework to orchestrate the execution of Spark applications. Applications are not supposed to use these operations directly; they should use the transformations and actions presented previously.
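
As a concrete illustration of the co-partitioned join mentioned above, the following sketch partitions two key-value RDDs with the same partitioner before joining them, so the join involves only narrow dependencies. The dataset names and values are made up, and the master URL assumes a local run:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("copartition-sketch").setMaster("local[*]"))

val partitioner = new HashPartitioner(4)
val visits = sc.parallelize(Seq(("a.com/x", 3), ("b.com/y", 5))).partitionBy(partitioner).persist()
val pageNames = sc.parallelize(Seq(("a.com/x", "Home"), ("b.com/y", "Cart"))).partitionBy(partitioner).persist()

// Both RDDs share the same partitioner, so matching keys already live in matching
// partitions and the join does not need to re-shuffle the pre-partitioned data.
val joined = visits.join(pageNames)
joined.collect().foreach(println)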
