Introduction
Let's study distributed data processing systems.
This chapter examines distributed systems that process data volumes too large to handle on a single machine, either because doing so would be impossible or because it would be very inefficient.
Categories of distributed data processing systems
Distributed data processing systems can be classified into the following two main categories:
Batch processing systems
Batch processing systems collect individual data items into groups called batches and process one batch at a time. These batches can be quite large (e.g., all items for a day), so the main goal of these systems is usually high throughput, sometimes at the cost of higher latency.
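As a rough single-machine sketch of the idea (not how any particular system implements it), the snippet below groups events into one batch per day and then processes each batch as a whole. The names `group_by_day` and `process_batch`, and the sample events, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from datetime import date


def group_by_day(events):
    """Group (day, value) events into one batch per day."""
    batches = defaultdict(list)
    for day, value in events:
        batches[day].append(value)
    return dict(batches)


def process_batch(values):
    """Process a whole batch at once; here we simply sum its values."""
    return sum(values)


# Hypothetical input: (day, value) pairs collected over two days.
events = [
    (date(2024, 1, 1), 5),
    (date(2024, 1, 1), 7),
    (date(2024, 1, 2), 3),
]

# No result for a day is available until that day's batch has been processed.
daily_totals = {
    day: process_batch(values)
    for day, values in group_by_day(events).items()
}
```

Each total becomes available only after its entire batch has been collected, which is where the higher latency comes from.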
Stream processing systems
Stream processing systems receive and process data continuously as a stream of data items. Because each item is handled as soon as it arrives, the main goal of these systems is very low latency, sometimes at the cost of decreased throughput.
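A minimal sketch of item-at-a-time processing, assuming the stream can be modeled as a Python iterable: the hypothetical `running_total` generator below emits an updated result right after each item arrives instead of waiting for a group of items to accumulate.

```python
def running_total(stream):
    """Emit an updated total immediately after each item arrives."""
    total = 0
    for item in stream:
        total += item
        yield total


# Hypothetical stream of three items; a real stream would be unbounded.
results = list(running_total(iter([5, 7, 3])))
```

Here a result is produced per item, so latency stays low, but the per-item overhead is paid on every single element.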
The following illustration helps us differentiate between a batch and a stream processing system:
There is also a hybrid of these two categories, called micro-batch processing. This approach still processes data in batches but keeps them very small to balance throughput and latency.
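The micro-batch idea can be sketched as cutting an incoming stream into small chunks and processing each chunk as a unit. This is only an illustration: the helper `micro_batches` is hypothetical, and it bounds batches by item count for simplicity, whereas a real system might bound them by a time interval instead.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield consecutive small batches of up to batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # stream exhausted
        yield batch


# Process a hypothetical stream two items at a time.
totals = [sum(batch) for batch in micro_batches([5, 7, 3, 1, 4], batch_size=2)]
```

A larger `batch_size` amortizes per-batch overhead over more items (favoring throughput), while a smaller one returns results sooner (favoring latency).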
In this chapter, we will study the following three distributed data processing systems in detail:
- MapReduce
- Apache Spark
- Apache Flink