Time and Watermarks in Flink
Learn about the concept of time and the use of watermarks in Apache Flink.
As explained in the previous lesson, time is a crucial element that is commonly used to define boundaries on unbounded streams.
Like other stream processing systems, such as
Note: Flink has a third notion of time called the ingestion time which corresponds to the time an event enters Flink. But this section will focus on event and processing time only.
Processing time
Processing time refers to the system time of the machine that is executing an operation.
Event time
Event time is the time that each event occurred on its producing device.
We can use processing time or event time with some trade-offs, which are following.
Processing time trade-offs
When a streaming program runs on processing time, all time-based operations (e.g., time windows) will use the system clock of the machines that runs the respective operation. It is the simplest notion of time and requires no coordination between the nodes of the system. It also provides good performance and reliably low latency on the produced results.
However, all this comes at the cost of consistency and non-determinism. The system clocks of different machines will differ, and the various nodes of the system will process data at different rates. As a consequence, different nodes might assign the same event to different windows depending on timing.
Event time trade-offs
When a streaming program runs on event time, all time-based operations will use the event time embedded within the stream records to track time, instead of system clocks. It brings consistency and determinism to the execution of the program since nodes will now have a common mechanism to track the progress of time and assign events to windows.
However, it requires some coordination between the various nodes, as we will see below. It also introduces some additional latency since nodes might have to wait for out of order or late events.
Tracking progress in event time
The main mechanism to track progress in event time in Flink is watermarks.
Watermarks
Watermarks are control records that flow as part of a data stream and carry a timestamp t
.
A Watermark(t)
record indicates that event time has reached time t
in that stream, which means there should be no more elements with a timestamp t' ≤ t
. Once a watermark reaches an operator, the operator can advance its internal event time clock to the value of the watermark.
Generating watermarks
Watermarks are generated either directly in the data stream source or by a watermark generator at the beginning of a Flink pipeline.
The operators later in the pipeline are supposed to use the watermarks for their processing (e.g., to trigger calculation of windows) and then emit them downstream to the next operators.
Example
There are many different strategies for generating watermarks. For example the BoundedOutOfOrdernessGenerator
, which assumes that the latest elements for a certain timestamp t
will arrive at most n
milliseconds after the earliest elements for timestamp t
. Of course, there could be elements that do not satisfy this condition and arrive after the associated watermark has been emitted and the corresponding windows have been calculated. These are called late elements.
Flink provides different ways to deal with late elements, such as discarding them or re-triggering the calculation of the associated window.
The following illustration shows the flow of watermarks and the progress of event time in a flink pipeline.
Get hands-on with 1400+ tech skills courses.