Apache Flink
Let's introduce Apache Flink and look into its architecture.
Apache Flink is an open-source stream-processing framework developed by the Apache Software Foundation. It processes incoming data in a true streaming fashion, operating on each record as soon as it arrives instead of first grouping records into batches.
Note: This is the main differentiator between Flink and Spark. Flink processes incoming data as it arrives, which provides sub-second latency that can go down to single-digit milliseconds. Spark also provides a streaming engine, called Spark Streaming. However, it runs a form of micro-batch processing, where the input data stream is split into batches that are then processed to generate the final results, with the associated latency trade-off.
Basic constructs in Flink
The basic constructs in Flink are streams and transformations.
Stream
A stream is an unbounded flow of data records.
Note: Flink can also execute batch processing applications, where data is treated as a bounded stream. As a result, the mechanisms described in this section are used in this case also with slight variations.
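To illustrate this note, recent Flink versions (1.12 and later) let the same streaming program run over a bounded input in batch execution mode. The following Java snippet is only an assumed sketch: the class name and sample data are invented for illustration, and only the setRuntimeMode call relates to the bounded-stream behavior described above.

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedStreamSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Treat the bounded input as a batch job while keeping the streaming constructs.
        // (Illustrative sketch; the sample data is made up.)
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromElements(3, 1, 4, 1, 5)
           .map(x -> x * 2)
           .print();

        env.execute("bounded stream sketch");
    }
}
```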
Transformation
A transformation is an operation that takes one or more streams as input and produces one or more output streams as a result.
Flink provides many different APIs that are used to define these transformations. For example:
- The ProcessFunction API is a low-level interface that allows the user to define imperatively what each transformation should do by providing the code that will process each record.
- The DataStream API is a high-level interface that allows the user to define declaratively the various transformations by reusing basic operations, such as map, flatMap, filter, union, etc. (a short sketch of both APIs follows this list).
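For a concrete feel of these two styles, here is a minimal Java sketch; the class name, sample log lines, and chosen operations are assumptions made purely for illustration and do not come from the original text.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class TransformationSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical input stream of raw log lines.
        DataStream<String> lines =
            env.fromElements("error: disk full", "info: all good", "error: timeout");

        // DataStream API: declarative composition of built-in operations.
        DataStream<String> errors = lines
            .filter(line -> line.startsWith("error"))
            .map(String::toUpperCase);

        // ProcessFunction API: imperative, per-record logic.
        DataStream<String> alerts = errors.process(new ProcessFunction<String, String>() {
            @Override
            public void processElement(String value, Context ctx, Collector<String> out) {
                out.collect("ALERT -> " + value);
            }
        });

        alerts.print();
        env.execute("transformation sketch");
    }
}
```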
Additional Flink constructs
Since streams can be unbounded, the application has to produce some intermediate, periodic results. For this purpose, Flink provides some additional constructs, such as windows, timers, and local storage for stateful operators.
Flink provides a set of high-level operators that specify the windows over which data from the stream will be aggregated. These windows can be time-driven (e.g., time intervals) or data-driven (e.g., number of elements). The timer API allows applications to register callbacks to be executed at specific points in time in the future.
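As an assumed illustration of the windowing API (the key field, sample data, and window size below are invented for this sketch and are not from the original text), a keyed, time-driven window in the Java DataStream API can look like this; timers follow a similar spirit through the TimerService that process functions expose.

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical stream of (page, visits) pairs; a real job would read from
        // an unbounded source such as Kafka, so that windows keep firing.
        env.fromElements(Tuple2.of("/home", 1), Tuple2.of("/docs", 1), Tuple2.of("/home", 1))
           .keyBy(t -> t.f0)
           // Time-driven window: aggregate per key over 5-second intervals.
           .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
           .sum(1)
           // A data-driven alternative would be .countWindow(100).sum(1)
           .print();

        env.execute("windowed aggregation sketch");
    }
}
```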
Flow of data in a Flink data processing application
A data processing application in Flink is represented as a directed acyclic graph (DAG), where nodes represent computing tasks and edges represent data subscriptions between tasks.
Note: Flink also supports cyclic dataflow graphs, which can be used for use-cases such as iterative algorithms.
Flink is responsible for translating the logical graph that corresponds to the application code into the physical graph that is actually executed. This translation includes logical dataflow optimizations, such as the fusion of multiple operations into a single task (e.g., a combination of two consecutive filters). It also includes partitioning each task into multiple instances that can be executed in parallel on different compute nodes. This process is shown in the illustration below.
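To make the chaining and parallelism aspects concrete in code, the sketch below (class name, parallelism values, and operations are all illustrative assumptions) sets a default parallelism for the job and overrides it for one operator; the two consecutive filters are eligible to be fused (chained) into a single task.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PhysicalGraphSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Each task of the physical graph runs as 4 parallel instances by default.
        env.setParallelism(4);

        env.fromElements("alpha", "beta", "", "gamma")
           // Two consecutive filters: Flink chains (fuses) them into a single task by default.
           .filter(s -> !s.isEmpty())
           .filter(s -> s.length() > 3)
           // Parallelism can also be overridden per operator.
           .map(String::toUpperCase).setParallelism(2)
           .print();

        env.execute("physical graph sketch");
    }
}
```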
The architecture of Flink
The high-level architecture of Flink consists of three main components: the client, the Job Manager, and the Task Managers, as shown in the following illustration:
The Flink client receives the program code, transforms it into a dataflow graph, and submits it to the Job Manager, which coordinates the distributed execution of the dataflow.
The Job Manager schedules tasks on Task Managers, tracks their progress, and coordinates checkpoints and recovery from possible task failures.
Each Task Manager executes one or more tasks that run user-specified operators, which can produce other streams. The Task Managers report the status of these tasks to the Job Manager, along with heartbeats that are used for failure detection. When processing unbounded streams, these tasks are expected to be long-lived; if they fail, the Job Manager is responsible for restarting them.
Avoid making the Job Manager a single point of failure
To avoid making the Job Manager a single point of failure, multiple instances can run in parallel. One of them is elected leader via ZooKeeper and is responsible for coordinating the execution of applications, while the rest stand by, ready to take over if the leader fails. To make this failover possible, the leader Job Manager stores critical metadata about every application in ZooKeeper, so that it is accessible to a newly elected leader.