Kafka Streams Topologies
Learn about processor topology in Kafka Streams applications.
The brain of every Kafka Streams application is the processor topology. A processor topology defines the stream processing logic of the application and is structured as a directed acyclic graph. Each node in the graph is a stream processor, which represents a processing step, and each edge in the graph represents a stream of data flowing between processors.
Processor types
Processors in Kafka Streams can be divided into three types.
Source processors
Source processors do not have any upstream processors. They consume records directly from Kafka topics and pass them to downstream processors. A topology can have one or more source processors.
Stream processors
A stream processor receives one input record from its upstream processors and applies an operation to that record. The operation can produce one or more records for downstream processors, or none at all. Kafka Streams provides a declarative high-level DSL that includes processing operations such as filter, map, and join out of the box. These operations are also called operators. The high-level DSL is recommended for most users, but when more flexibility and control are required, we can use the imperative low-level Processor API, which is also part of the Kafka Streams library. We will heed the recommendation and focus on the high-level DSL.
Sink processors
Sink processors are used to write the processed records back to Kafka topics. A topology does not necessarily have sink processors; in some use cases, we only need to pass the processed records to an external resource (for example, saving them to a database) without writing them back to Kafka.
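The three processor types come together in code as follows. This is a minimal sketch built with the high-level DSL; the topic names (orders, processed-orders), the string serdes, and the transformations are assumptions made for illustration, not part of any particular application.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class OrderTopology {

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Source processor: consumes records from the "orders" topic.
        KStream<String, String> orders =
                builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()));

        // Stream processors: operators applied to each record as it flows through.
        KStream<String, String> processed = orders
                .filter((key, value) -> value != null)    // drop records with no value
                .mapValues(value -> value.toUpperCase()); // transform each value

        // Sink processor: writes the processed records back to a Kafka topic.
        processed.to("processed-orders", Produced.with(Serdes.String(), Serdes.String()));

        return builder.build();
    }
}

Calling describe() on the resulting Topology prints the graph of source, stream, and sink nodes, which is a convenient way to inspect the topology Kafka Streams will actually run.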
One record at a time
In Kafka Streams applications, data flows through the topology using a depth-first strategy. When Kafka Streams pulls a new record from an upstream Kafka topic, that record flows through the entire topology, and only then does a new record start its journey through the topology.
If an error occurs during the processing of a record, the record is not lost. It remains in the source topic, its offset is not committed, and the application will try to process it again.
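This retry behavior follows from the at-least-once processing guarantee, which Kafka Streams uses by default: offsets are committed only after a record has been processed. Below is a minimal sketch of running a topology with that guarantee spelled out explicitly; the application id, the bootstrap server, and the OrderTopology class from the earlier sketch are placeholder assumptions.

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;

public class StreamsRunner {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-app");        // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        // At-least-once is the default guarantee: offsets are committed only after
        // processing succeeds, so a record that fails will be processed again.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.AT_LEAST_ONCE);

        Topology topology = OrderTopology.build(); // topology from the earlier sketch
        KafkaStreams streams = new KafkaStreams(topology, props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}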
Stateless and stateful topologies
Kafka Streams topologies can be either stateless or stateful. If each record in our application is processed independently of other records, then the topology is considered stateless. For example, the map operator is stateless because it does not require memory of any other record apart from the one that it is currently processing.
Stateful topologies are topologies that use at least one stateful operator. For example, count is a stateful operator because it has to keep track of how many previous records were processed. Using a stateful operator increases the complexity of our applications and raises additional considerations when it comes to fault tolerance, scalability, and maintenance. We will discuss each of the topology types in its own dedicated chapter.
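To make the contrast concrete, here is a hedged sketch of a stateful topology that counts records per key with count. The topic names and the Long serde on the output are assumptions for the example; the important part is that count needs a state store to remember how many records it has seen for each key.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class OrderCountTopology {

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Stateful operator: count keeps a running total per key,
        // so Kafka Streams backs it with a local state store.
        KTable<String, Long> countsPerKey = builder
                .stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey()
                .count();

        // Emit each updated count downstream as a record.
        countsPerKey.toStream()
                .to("order-counts", Produced.with(Serdes.String(), Serdes.Long()));

        return builder.build();
    }
}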