Perks of Apache Spark

Keeps application running

Spark provides graceful degradation in cases where memory is not enough so that the application does not fail but keeps running with decreased performance.

For instance, Spark can recalculate any partitions on demand when they don’t fit in memory or spill them to disk.

Increasing performance

Wide dependencies cause more data to be exchanged between nodes compared to narrow dependencies, so performance is increased significantly by reducing wide dependencies, or the amount of data that needs to be shuffled. One way to do this is by pre-aggregating data, also known as map-side reduction.

Note: As explained previously, the map-side reduction is a capability provided in the MapReduce framework through combiners.

Example

The following code performs a word count in two different ways:

  • The first one will send multiple records of value 1 for each word across the network
  • The second one will send one record for each word containing the number of occurrences.
// word count without pre-aggregation
sparkContext.textFile("hdfs://...")
    .flatMap(line => line.split(" "))
    .map(word => (word,1))
    .groupByKey()
    .map((x,y) => (x,sum(y)))
// word count with pre-aggregation
sparkContext.textFile("hdfs://...")
    .flatMap(line => line.split(" "))
    .map(word => (word,1))
    .reduceByKey((x,y)=> (x+y))

Resilient to failures of the request broker

We can configure Spark in a way that is resilient to failures of the request broker. This is achieved via Zookeeper.

Zookeeper

In Zookeeper, all the managers perform leader election, and one of them is elected as the leader, with the rest remaining in standby mode.

When a new worker node is added to the cluster, it registers with the manager node. If failover occurs, the new manager will contact all previously registered worker nodes to inform them of the change in leadership. So, only the scheduling of new applications is affected during a failover; already running applications are unaffected.

When submitting a job to the Spark cluster, the user can specify through a --supervise option that the driver needs to be automatically restarted by the manager if it fails with a non-zero exit code.

Ensuring data persistence

Spark supports multiple, different systems for data persistence.

Commit protocol

Spark has a commit protocol that aims to provide exactly-once guarantees on the job’s output under specific conditions. It means that no matter how many times worker nodes fail and tasks are rerun, there will be no duplicate or corrupt data in the final output file if a job completes. This might not be the case for every supported storage system, and it’s achieved differently depending on the available capabilities.

For instance, when using HDFS, each task writes the output data to a unique, temporary file (e.g., targetDirectory/_temp/part-XXXX_attempt-YYYY). When the write is complete, the file is moved to the final location (e.g., targetDirectory/part-XXXX) using an atomic rename operation provided by HDFS. As a result, even if a task is executed multiple times due to failures, the final file will contain its output exactly once.

Furthermore, no matter which execution was completed successfully, the output data will be the same as long as the transformations used were deterministic and idempotent. It is true for any transformations that act solely on the data provided by the previous RDDs using deterministic operations.

Note: However, this is not the case if these transformations use data from other systems that might change between executions or if they use non-deterministic actions (e.g., mathematical calculations based on random number generation). Furthermore, if transformations perform side-effects on external systems that are not idempotent, no guarantee is provided since Spark might execute each side-effect more than once.

Get hands-on with 1400+ tech skills courses.