Introduction

Let's study distributed data processing systems.

This chapter will examine distributed systems used to process large amounts of data that would be impossible or very inefficient to process using only a single machine.

Categories of distributed data processing systems

Distributed data processing systems can be classified into the following two main categories:

Batch processing systems

Batch processing systems collect individual data items into groups called batches and process one batch at a time. These batches can be quite large (e.g., all items for a day), so the main goal of these systems is usually high throughput, sometimes at the cost of higher latency.
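As a rough single-machine illustration of the batch model, the sketch below waits until a whole batch is available and then processes it in one pass. The function name process_batch, the (user, amount) event shape, and the sample data are assumptions made for this example; they do not come from any particular framework.

```python
from collections import defaultdict
from typing import Iterable


def process_batch(events: Iterable[tuple[str, int]]) -> dict[str, int]:
    """Aggregate a complete batch of (user, amount) events in one pass."""
    totals: dict[str, int] = defaultdict(int)
    for user, amount in events:
        totals[user] += amount
    return dict(totals)


# Hypothetical batch: every event for one day, collected before processing starts.
daily_batch = [("alice", 5), ("bob", 3), ("alice", 2)]

# No result is available until the whole batch has been processed.
daily_totals = process_batch(daily_batch)
print(daily_totals)  # {'alice': 7, 'bob': 3}
```

The result for the first event only becomes visible after the last event of the day has been handled, which is where the higher latency comes from.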

Stream processing systems

Stream processing systems receive and process data continuously, as a stream of data items. Because each item is handled as soon as it arrives, the main goal of these systems is very low latency, sometimes at the cost of decreased throughput.
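The stream model can be sketched on a single machine with a Python generator that emits an updated result for every incoming item. The names event_source and running_totals and the (user, amount) event shape are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator


def event_source() -> Iterator[tuple[str, int]]:
    """Stand-in for an unbounded source; here it yields three sample events."""
    yield ("alice", 5)
    yield ("bob", 3)
    yield ("alice", 2)


def running_totals(
    events: Iterable[tuple[str, int]],
) -> Iterator[tuple[str, int]]:
    """Emit the updated total for a user as soon as each event arrives."""
    totals: dict[str, int] = {}
    for user, amount in events:
        totals[user] = totals.get(user, 0) + amount
        yield user, totals[user]


# Each event produces an output immediately, without waiting for later events.
stream_results = list(running_totals(event_source()))
print(stream_results)  # [('alice', 5), ('bob', 3), ('alice', 7)]
```

Every item is handled individually, so results appear quickly, but the per-item overhead is paid once per event rather than once per batch.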

The following illustration helps us differentiate between a batch and a stream processing system:

There is also a form of processing that is essentially a hybrid between these two categories, called micro-batch processing. This approach processes data in batches, but these are kept very small to achieve a balance between throughput and latency.
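A minimal sketch of the micro-batch idea follows, assuming batches are bounded by item count. Systems that use this approach often bound batches by a time interval instead; a count is used here only to keep the example deterministic. The helper micro_batches and the sample events are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator


def micro_batches(
    events: Iterable[tuple[str, int]], batch_size: int
) -> Iterator[list[tuple[str, int]]]:
    """Split an incoming sequence of events into small fixed-size batches."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


events = [("alice", 5), ("bob", 3), ("alice", 2), ("carol", 4), ("bob", 1)]

# Each small batch is processed as a unit; here we simply sum the amounts.
batch_sums = [
    sum(amount for _, amount in batch)
    for batch in micro_batches(events, batch_size=2)
]
print(batch_sums)  # [8, 6, 1]
```

A smaller batch_size moves the behavior toward stream processing (lower latency), while a larger one moves it toward batch processing (higher throughput).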

In this chapter, we will study the following three distributed data processing systems in detail:

  • MapReduce
  • Apache Spark
  • Apache Flink
