Reliable Communication Layers

In this lesson, we discuss the basics of a reliable communication protocol such as TCP.

To build a reliable communication layer, we need some new mechanisms and techniques to handle packet loss. Let us consider a simple example in which a client is sending a message to a server over an unreliable connection. The first question we must answer: how does the sender know that the receiver has actually received the message?

Acknowledgment (ack)

The technique that we will use is known as an acknowledgment, or ack for short. The idea is simple: the sender sends a message to the receiver; the receiver then sends a short message back to acknowledge its receipt. The figure below depicts the process.

When the sender receives an acknowledgment of the message, it can then rest assured that the receiver did indeed receive the original message. However, what should the sender do if it does not receive an acknowledgment?

Timeout

To handle this case, we need an additional mechanism, known as a timeout. When the sender sends a message, it also sets a timer to go off after some period of time. If no acknowledgment has been received by the time the timer expires, the sender concludes that the message has been lost. The sender then simply retries the send, transmitting the same message again in the hope that this time it will get through. For this approach to work, the sender must keep a copy of the message around in case it needs to send it again. The combination of the timeout and the retry has led some to call the approach timeout/retry; pretty clever crowd, those networking types, no? The figure below shows an example.
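The timeout/retry loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `send_with_retry`, `make_lossy_channel`, and the `max_retries` limit are hypothetical names introduced here, and the "channel" is a simulation that drops a fixed number of transmissions rather than a real network with a real timer.

```python
from typing import Callable, Optional


def send_with_retry(message: bytes,
                    transmit: Callable[[bytes], Optional[bytes]],
                    max_retries: int = 5) -> int:
    """Send `message` until an ack comes back; return the number of attempts.

    `transmit` is a hypothetical stand-in for "send, then wait up to the
    timeout for an ack": it returns the ack, or None if the timer expired.
    """
    for attempt in range(1, max_retries + 1):
        ack = transmit(message)  # the sender still holds its copy of `message`
        if ack is not None:
            return attempt
        # Timer went off with no ack: assume loss and resend the same message.
    raise TimeoutError(f"no ack after {max_retries} attempts")


def make_lossy_channel(drops: int) -> Callable[[bytes], Optional[bytes]]:
    """Simulated channel that loses the first `drops` transmissions."""
    remaining = drops

    def transmit(message: bytes) -> Optional[bytes]:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            return None  # models a timeout: nothing came back
        return b"ack"

    return transmit


attempts = send_with_retry(b"hello", make_lossy_channel(drops=2))
print(attempts)  # 3: two simulated losses, then a successful send
```

Giving up after `max_retries` is a design choice of this sketch, not something the text prescribes; a real sender has to decide for itself how long to keep trying.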

Unfortunately, timeout/retry in this form is not quite enough. The figure below shows an example of packet loss which could lead to trouble.

In this example, it is not the original message that gets lost, but the acknowledgment. From the perspective of the sender, the situation seems the same: no ack was received, and thus a timeout and retry are in order. But from the perspective of the receiver, it is quite different: now the same message has been received twice! While there may be cases where this is OK, in general, it is not; imagine what would happen when you are downloading a file and extra packets are repeated inside the download. Thus, when we are aiming for a reliable message layer, we also usually want to guarantee that each message is received exactly once by the receiver.

To enable the receiver to detect duplicate message transmission, the sender has to identify each message in some unique way, and the receiver needs some way to track whether it has already seen each message before. When the receiver sees a duplicate transmission, it simply acks the message, but (critically) does not pass the message to the application that receives the data. Thus, the sender receives the ack but the message is not received twice, preserving the exactly-once semantics mentioned above.

There are myriad ways to detect duplicate messages. For example, the sender could generate a unique ID for each message; the receiver could track every ID it has ever seen. This approach could work, but it is prohibitively costly, requiring unbounded memory to track all IDs.
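To make the cost of the track-every-ID approach concrete, here is a minimal sketch. The class name `IdTrackingReceiver` and the string message IDs are assumptions made for illustration. The receiver acks every message but only delivers the ones it has not seen, and its `seen` set grows by one entry per distinct message, forever.

```python
from typing import List, Set, Tuple


class IdTrackingReceiver:
    """Hypothetical receiver that remembers every message ID it has seen."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()      # grows without bound over time
        self.delivered: List[str] = []   # what the application actually gets

    def receive(self, msg_id: str, payload: str) -> Tuple[str, str]:
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.delivered.append(payload)  # first time: pass to application
        # Duplicate or not, always ack so the sender can stop retrying.
        return ("ack", msg_id)


receiver = IdTrackingReceiver()
receiver.receive("msg-a", "first chunk")
receiver.receive("msg-a", "first chunk")   # retransmission after a lost ack
receiver.receive("msg-b", "second chunk")
print(receiver.delivered)  # ['first chunk', 'second chunk']
print(len(receiver.seen))  # 2
```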

Sequence counter

A simpler approach, requiring little memory, solves this problem, and the mechanism is known as a sequence counter. With a sequence counter, the sender and receiver agree upon a start value (e.g., 1) for a counter that each side will maintain. Whenever a message is sent, the current value of the counter is sent along with the message; this counter value (N) serves as an ID for the message. After the message is sent, the sender then increments the value (to N + 1).

The receiver uses its counter value as the expected value for the ID of the incoming message from that sender. If the ID of a received message (N) matches the receiver’s counter (also N), it acks the message and passes it up to the application; in this case, the receiver concludes this is the first time this message has been received. The receiver then increments its counter (to N + 1), and waits for the next message.

If the ack is lost, the sender will time out and resend message N. This time, the receiver’s counter is higher (N + 1), and thus the receiver knows it has already received this message. It therefore acks the message but does not pass it up to the application. In this simple manner, sequence counters can be used to avoid duplicates.
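The sequence-counter scheme above can be sketched as follows. The `Sender` and `Receiver` classes, the `acks_to_lose` knob that simulates lost acks, and the `max_attempts` limit are all hypothetical names chosen for this example. The sketch assumes only one message is outstanding at a time, so an ID above the receiver's counter should never occur; ignoring such a message is an assumption of this sketch, not something the text specifies.

```python
from typing import List, Optional


class Receiver:
    def __init__(self, start: int = 1) -> None:
        self.expected = start            # ID the receiver expects next
        self.delivered: List[str] = []   # messages passed to the application

    def receive(self, seq: int, payload: str) -> Optional[int]:
        if seq == self.expected:
            self.delivered.append(payload)  # first arrival: deliver it
            self.expected += 1
            return seq                      # ack
        if seq < self.expected:
            return seq                      # duplicate: ack, do not deliver
        return None                         # unexpected ID: ignored (assumption)


class Sender:
    def __init__(self, start: int = 1) -> None:
        self.counter = start

    def send(self, payload: str, receiver: Receiver,
             acks_to_lose: int = 0, max_attempts: int = 5) -> int:
        """Send one message, retrying on lost acks; return attempts used."""
        seq = self.counter
        self.counter += 1
        for attempt in range(1, max_attempts + 1):
            ack = receiver.receive(seq, payload)
            if acks_to_lose > 0:
                acks_to_lose -= 1
                continue                    # ack lost: time out and resend
            if ack == seq:
                return attempt
        raise TimeoutError(f"no ack for message {seq}")


sender, receiver = Sender(), Receiver()
first = sender.send("a", receiver, acks_to_lose=1)
second = sender.send("b", receiver)
print(first, second)        # 2 1
print(receiver.delivered)   # ['a', 'b']
```

Although message "a" reaches the receiver twice, it is delivered to the application only once, and the receiver needs just a single integer of state per sender to make that decision.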

The most commonly used reliable communication layer is known as TCP/IP, or just TCP for short. TCP has a great deal more sophistication than we describe above, including machinery to handle congestion in the network (Jacobson, SIGCOMM ’88), multiple outstanding requests, and hundreds of other small tweaks and optimizations. Read more about it if you’re curious; better yet, take a networking course and learn that material well.

Reference: “Congestion Avoidance and Control” by Van Jacobson. SIGCOMM ’88. A pioneering paper on how clients should adjust to perceived network congestion; definitely one of the key pieces of technology underlying the Internet, and a must read for anyone serious about systems, and for Van Jacobson’s relatives because, well, relatives should read all of your papers.

TIP: BE CAREFUL SETTING THE TIMEOUT VALUE

As you can probably guess from the discussion, setting the timeout value correctly is an important aspect of using timeouts to retry message sends. If the timeout is too small, the sender will resend messages needlessly, thus wasting CPU time on the sender and network resources. If the timeout is too large, the sender waits too long to resend, and thus perceived performance at the sender is reduced. The “right” value, from the perspective of a single client and server, is thus to wait just long enough to detect packet loss but no longer.

However, there are often more than just a single client and server in a distributed system, as we will see in future chapters. In a scenario with many clients sending to a single server, packet loss at the server may be an indicator that the server is overloaded. If true, clients might retry in a different adaptive manner; for example, after the first timeout, a client might increase its timeout value to a higher amount, perhaps twice as high as the original value. Such an exponential back-off scheme, pioneered in the early Aloha network and adopted in early Ethernet (Abramson, 1970), avoids situations where resources are being overloaded by an excess of re-sends. Robust systems strive to avoid overload of this nature.

Reference: “The ALOHA System — Another Alternative for Computer Communications” by Norman Abramson. The 1970 Fall Joint Computer Conference. The ALOHA network pioneered some basic concepts in networking, including exponential back-off and retransmit, which formed the basis for communication in shared-bus Ethernet networks for years.
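As a rough illustration of the doubling idea, the sketch below computes the timeout a client would use on each successive attempt. The function name `backoff_timeouts` and its parameters are hypothetical; the optional `cap` on the timeout is an added assumption that the text does not discuss.

```python
from typing import List, Optional


def backoff_timeouts(initial: float, retries: int,
                     factor: float = 2.0,
                     cap: Optional[float] = None) -> List[float]:
    """Return the timeout (in seconds) to use for each successive attempt.

    Each timeout is `factor` times the previous one. `cap` is an optional
    upper bound, an assumption of this sketch rather than part of the text.
    """
    timeouts = []
    current = initial
    for _ in range(retries):
        timeouts.append(current)
        current *= factor
        if cap is not None:
            current = min(current, cap)
    return timeouts


print(backoff_timeouts(0.5, 5))           # [0.5, 1.0, 2.0, 4.0, 8.0]
print(backoff_timeouts(0.5, 5, cap=3.0))  # [0.5, 1.0, 2.0, 3.0, 3.0]
```

Because every client that times out waits longer before its next attempt, the aggregate retry rate seen by an overloaded server falls instead of rising.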
