The Concept of Causality
Learn the concept of causality in general and also in distributed systems.
Total ordering in natural phenomena
As humans, we grow accustomed to total ordering, because most of the natural phenomena around us appear to be subject to it.
When we go shopping, we are placed in a queue to be served in a total order.
Similarly, cars waiting for the signal to change at an intersection are also ordered in the same way.
However, there are scenarios especially prevalent in software systems where a total ordering is not really necessary.
Causal ordering using logical clocks
For instance, look at some of the social media platforms people use nowadays, where they can create posts and add comments to the posts of other people. Do we really care about the order in which two unrelated posts are shown to us? Probably not. As a result, the system could potentially leverage a partial ordering, where posts that can’t really be ordered are displayed in an arbitrarily chosen order.
However, there is still a need to preserve the order of some events that are tightly linked. For example, if a comment is a reply to a comment , then we would most probably like to see after . Otherwise, a conversation could end up being confusing and hard to follow.
Causality
What we just described is the notion of causality, where one event contributes to the production of another event.
Looking back at one of the introductory lessons, consistency models, we can find the description of a consistency model, called the causal consistency model. This model ensures that events that are causally related are observed by the various nodes in a single order, where causes precede the effects.
Violating causality can lead to behaviors that are really hard to understand by the users of a system.
Fortunately, as we will explore in the next lessons of this chapter, it’s possible to track causality without the need for physical time.
The notion of causality is also present in real life. We subconsciously use causality when planning or determining the feasibility of a plan.
Determining causality
Causality is determined based on a set of loosely synchronized clocks (i.e., wristwatches, wall clocks, etc.) under the illusion of a global clock. This appears to work in most cases because the time duration of events is much more coarse-grained in real life, and information “flows” much more slowly than in software systems.
For instance, compare the time a human needs to go from London to Manchester and the time needed for 10 kilobytes to travel the same distance via the Internet. So, small differences between clocks do not create significant problems in most real-life scenarios.
However, in distributed computing systems, events happen at a much higher rate and higher speed, and their duration is several orders of magnitude smaller. Consequently, if the physical clocks of the various nodes in the system are not precisely synchronized, the causality relation between events may not be accurately captured.
Causality in distributed systems
Causality can be leveraged in the design of distributed systems with the following two main benefits:
- Increasing concurrency
- Replacing real-time with the notion of logical time, which can be tracked with less infrastructure and costs
As we have seen so far, distributed systems are inherently asynchronous. By introducing coordination and synchronization between them, we essentially reduce the level of concurrency and consequently their performance. The notion of causality enables us to let these systems remain asynchronous while also supporting the causal consistency model. This prevents a big set of counterintuitive behaviors that stem from weak consistency.
Furthermore, keeping physical clocks synchronized is a task that requires hardware infrastructure with the associated costs, where the costs increase in proportion to how accurate the synchronization needs to be.
Logical clocks rely on the existing messaging exchanged between nodes of a system, which makes them less expensive to implement. Of course, logical clocks have their own pitfalls, so they are definitely not a silver bullet.
Event causality using logical clocks
The following lessons will present some kinds of logical clocks. It is important to highlight in advance that they all share some common
. characteristics M. Raynal and M. Singhal, “Logical time: capturing causality in distributed systems,” Computer, Volume 29, Issue 2, Feb 1996, 1996.
The abstraction of logical clocks consist of two main parts:
- A data structure local to every node used to represent logical time
- A protocol to update the data structures accordingly as events happen and time passes by
Each node maintains data structures that provide the following capabilities:
- A local logical clock that helps a node measure its own progress
- A global logical clock that is a good representation of a node’s view of the logical global time
Similarly, the protocol consists of two main rules:
- R1: A rule that governs how the local logical clock is updated by a node when it executes an event.
- R2: A rule that governs how a node updates its global logical clock to update its view of the global time and progress. This determines what information about the logical time needs to be piggybacked in the messages exchanged and how the receiving node uses this information.
Different types of logical clocks have the same core parts and the protocol described above, but they might differ in the actual data structures used to represent logical time, or the logic in the rules of the protocol.
Events in distributed systems
The events that happen in a distributed system can be classified into three fundamental categories:
- Local events that happen at a node and changing its state
- Send events that represent a node sending a message to another node to inform about a change
- Receive events that represent a node receiving a message from another node about a change
These events exchanged between nodes to propagate information can also propagate time changes between nodes.
The notion of causality is built on top of the happened-before relation (). This is a strict, partial order on the aforementioned events so that:
- If events and are two events happening at the same node, the relation holds if the occurrence of event preceded the occurrence of event .
Note that this is easy to determine for a single node, as shown before.
- If event is the event of a node corresponding to sending a message and event is the event of a different node corresponding to the receipt of the same message, then .
- For three events , , and , if and , then .
We say that event causally precedes event (or these two events are causally related) if . We say that event is not causally related to (), if neither of the relations or holds.
The following illustration demonstrates how this would work in a distributed system of three nodes.
Here are some of the causal relationships: , , , , and .
Note that and even though these events are temporally distant, because there was no information exchanged between the nodes that could help the logical clocks track any relation between them. This means that these events should be considered concurrent by the system and could have happened in any order. This is because could have happened at any point in time from the moment right after to just before , preserving all the causal relationships intact. The same holds for , which could have happened at any point in time until .
Potential causality
In general, the concept of causality in distributed systems (and as defined in the
As a result, when we say that , we don’t mean that has caused or affected . Instead, we mean that could have caused or affected . This is because this concept is generic and does not have an application-specific context to infer actual cause-and-effect relationships.
However, applications can leverage the algorithms presented in this chapter, and enrich them with additional metadata that makes it possible to track actual causality instead of potential causality.
Note that even the ability to track potential causality is still very useful to prevent system behaviors that will confuse users. It just means that the system might be storing and transmitting more information than necessary to achieve this purpose.
Create a free account to view this lesson.
Continue your learning journey with a 14-day free trial.
By signing up, you agree to Educative's Terms of Service and Privacy Policy