Recap of the Course
Let's look at what we have learned so far in this course.
We'll cover the following
This course has helped you understand:
- How distributed systems can be useful
- What challenges you may face when building distributed systems
- How to overcome these challenges
Building or using a distributed system is a serious undertaking that one should take only when necessary.
We will recap some key learnings from this course and will also highlight topics that we did not cover. In this way, those willing to dive deeper into some areas will have some starting points to do so.
Recap
First of all, we introduced some of the basic areas where distributed systems can help: performance, scalability, and availability.
We analyzed basic mechanisms that can help in these areas throughout the course, such as partitioning and replication. It also became evident that these mechanisms introduce tension between the characteristics mentioned above and other properties, such as consistency. This tension is formalized by basic theorems, such as the CAP theorem and the FLP result.
This tension manifests in various ways. For example, the decision on whether replication operates synchronously or asynchronously can be a trade-off between performance and availability or durability.
We explained the difference between liveness and safety properties and provided an overview of the basic consistency models that help formalise the behaviour of a distributed system and facilitate reasoning about its interaction with other systems.
Note: However, there are many more models we omitted in this course in the interest of time and simplicity, such as read-your-writes, monotonic reads and monotonic writes consistency models by
. Viotti et al. P. Viotti and M. Vukoliundefined, “Consistency in Non-Transactional Distributed Storage Systems,” ACM Computing Surveys, Volume 49, No. 1, 2016.
We also introduced the concept of failure detection and gave an example of a simplistic failure detector that makes use of heartbeats and timeouts.
In reality, failure detectors deal with many more practical problems and need to be quite more complex. This mean they:
- Avoid using timeouts in order to be applicable to quiescent algorithms.
See the
by Kawazoe et al. who achieved this.paper A. M. Kawazoe, W. Chen, and S. Toueg, “Heartbeat: A timeout-free failure detector for quiescent reliable communication,” 11th International Workshop on Distributed Algorithms, ’97, 1997. - Operate using a gossip protocol to improve scalability and fault-tolerance. See the
by Renesse et al. who did this.paper R. van Renesse, Y. Minsky, and M. Hayden, “A Gossip-Style Failure Detection Service,” Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing, ’98, 1998.
Output a suspicion level on a continuous scale instead of a binary value. See the
Note: Gossip protocols were also not covered in this course and they can be considered a whole topic on their own. For the details about gossip protocols, you are advised to read
paper. this K. Birman, “The Promise, and Limitations, of Gossip Protocols,” ACM SIGOPS Operating Systems Review, October 2007, 2007.
After that, we explored several partitioning techniques.
Next, we introduced the problem of consensus and explained the two major algorithms that solve it, Paxos and Raft.
The topic of consensus is very old, but still, very useful research is conducted in this area. For example, the original Paxos algorithm operated under the assumption that all quorums need to be majority quorums to maintain the algorithm’s safety properties. However, Howard et al. actually demonstrated that this is not necessary.
The algorithm just needs to ensure quorums from the first phase overlap with quorums from the second phase. This allows one to size the quorums of each phase accordingly depending on the performance requirements and fault tolerance during the steady state or during recovery.
As described by
The next two chapters introduced the notions of time and order and their relationship.
First, we explained the difference between total and partial order, which is important in distributed systems. While consensus can be considered the problem of establishing total order amongst all events of a system, some systems do not need such strict requirements and can also operate successfully under a partial order. Vector clocks is one mechanism outlined in the course that allows a system to keep track of such a partial order that preserves causality relationships between events.
Then, we covered the topics of networking and security.
Networking is core to how the various nodes of a distributed system communicate. So, it is important to understand the various networking protocols and technologies available when designing a distributed system. The breadth of knowledge can help you see the big picture and make the right design decision, but depth of knowledge is still valuable to get the small details right and troubleshoot issues efficiently when things go wrong. The analysis of this article shows a good example of how such knowledge can be useful in mitigating and preventing issues.
We gave you a quick but complete overview of the basic protocols and technologies in this course, and we would urge you to study further the areas that you didn’t feel familiar with.
Networks themselves can be designed following the
Security is also a crucial aspect of distributed systems that is frequently neglected, unfortunately. That is why we dedicated two chapters to cover the basic concepts and techniques you can use to build secure distributed systems. However, we only discussed the main idea of the basic cryptographic primitives and did not go into much detail since it’s a vast field.
Lastly, we will advise you to always stick with existing standards and cryptanalysis libraries and battle-tested solutions instead of rolling out your own.
We believe it is a lot easier for someone to understand theory when it is put into context by demonstrating how it is used in practical systems. This is the reason we included several chapters of case studies about real systems and how they use algorithms and techniques presented in the course.
The last chapters on practices and patterns were written under the same spirit and subject to similar time constraints.We will advise our readers to study the papers by Nishtala et al., Bronson et al., Sigelman et al, Verbitski et al, and Verbitski et al. for a deeper understanding of how theory can be put into practice.
Get hands-on with 1400+ tech skills courses.