Recap of the Course

Let's look at what we have learned so far in this course.

We'll cover the following

This course has helped you understand:

  • How distributed systems can be useful
  • What challenges you may face when building distributed systems
  • How to overcome these challenges

Building or using a distributed system is a serious undertaking that one should take only when necessary.

We will recap some key learnings from this course and will also highlight topics that we did not cover. In this way, those willing to dive deeper into some areas will have some starting points to do so.

Recap

First of all, we introduced some of the basic areas where distributed systems can help: performance, scalability, and availability.

Press + to interact

We analyzed basic mechanisms that can help in these areas throughout the course, such as partitioning and replication. It also became evident that these mechanisms introduce tension between the characteristics mentioned above and other properties, such as consistency. This tension is formalized by basic theorems, such as the CAP theorem and the FLP result.

This tension manifests in various ways. For example, the decision on whether replication operates synchronously or asynchronously can be a trade-off between performance and availability or durability.

We explained the difference between liveness and safety properties and provided an overview of the basic consistency models that help formalise the behaviour of a distributed system and facilitate reasoning about its interaction with other systems.

Press + to interact

Note: However, there are many more models we omitted in this course in the interest of time and simplicity, such as read-your-writes, monotonic reads and monotonic writes consistency models by Viotti et al.P. Viotti and M. Vukoliundefined, “Consistency in Non-Transactional Distributed Storage Systems,” ACM Computing Surveys, Volume 49, No. 1, 2016..

We also introduced the concept of failure detection and gave an example of a simplistic failure detector that makes use of heartbeats and timeouts.

Press + to interact

In reality, failure detectors deal with many more practical problems and need to be quite more complex. This mean they:

  • Avoid using timeouts in order to be applicable to quiescent algorithms. See the paperA. M. Kawazoe, W. Chen, and S. Toueg, “Heartbeat: A timeout-free failure detector for quiescent reliable communication,” 11th International Workshop on Distributed Algorithms, ’97, 1997. by Kawazoe et al. who achieved this.
  • Operate using a gossip protocol to improve scalability and fault-tolerance. See the paperR. van Renesse, Y. Minsky, and M. Hayden, “A Gossip-Style Failure Detection Service,” Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing, ’98, 1998. by Renesse et al. who did this.

Output a suspicion level on a continuous scale instead of a binary value. See the paperN. Hayashibara, X. Défago, R. Yared, and T. Katayama, “The phi accrual failure detector,” 23rd IEEE International Symposium on Reliable Distributed Systems, 2004. by Hayashibara et al.

Note: Gossip protocols were also not covered in this course and they can be considered a whole topic on their own. For the details about gossip protocols, you are advised to read thisK. Birman, “The Promise, and Limitations, of Gossip Protocols,” ACM SIGOPS Operating Systems Review, October 2007, 2007. paper.

After that, we explored several partitioning techniques.

Press + to interact

Next, we introduced the problem of consensus and explained the two major algorithms that solve it, Paxos and Raft.

Press + to interact

The topic of consensus is very old, but still, very useful research is conducted in this area. For example, the original Paxos algorithm operated under the assumption that all quorums need to be majority quorums to maintain the algorithm’s safety properties. However, Howard et al. actually demonstrated that this is not necessary.

The algorithm just needs to ensure quorums from the first phase overlap with quorums from the second phase. This allows one to size the quorums of each phase accordingly depending on the performance requirements and fault tolerance during the steady state or during recovery.

As described by Castro et al.M. Castro and B. Liskov, “Practical Byzantine Fault Tolerance,” Proceedings of the 3rd Symposium on Operating Systems Design and Implementation, 1999. and LamportL. Lamport, “Byzantizing paxos by refinement,” Proceedings of the 25th International Conference on Distributed Computing, 2011., solving consensus under the presence of byzantine failures is a much more challenging task and is subject to different constraints.

The next two chapters introduced the notions of time and order and their relationship.

Press + to interact

First, we explained the difference between total and partial order, which is important in distributed systems. While consensus can be considered the problem of establishing total order amongst all events of a system, some systems do not need such strict requirements and can also operate successfully under a partial order. Vector clocks is one mechanism outlined in the course that allows a system to keep track of such a partial order that preserves causality relationships between events.

Then, we covered the topics of networking and security.

Press + to interact

Networking is core to how the various nodes of a distributed system communicate. So, it is important to understand the various networking protocols and technologies available when designing a distributed system. The breadth of knowledge can help you see the big picture and make the right design decision, but depth of knowledge is still valuable to get the small details right and troubleshoot issues efficiently when things go wrong. The analysis of this article shows a good example of how such knowledge can be useful in mitigating and preventing issues.

We gave you a quick but complete overview of the basic protocols and technologies in this course, and we would urge you to study further the areas that you didn’t feel familiar with.

Networks themselves can be designed following the same principlesB. Lebiednik, A. Mangal, and N. Tiwari, “A Survey and Evaluation of Data Center Network Topologies,” arXiv:1605.01701, 2016. outlined in this course, so that distributed systems can run on top of them at scale.

Security is also a crucial aspect of distributed systems that is frequently neglected, unfortunately. That is why we dedicated two chapters to cover the basic concepts and techniques you can use to build secure distributed systems. However, we only discussed the main idea of the basic cryptographic primitives and did not go into much detail since it’s a vast field.

Lastly, we will advise you to always stick with existing standards and cryptanalysis libraries and battle-tested solutions instead of rolling out your own.

We believe it is a lot easier for someone to understand theory when it is put into context by demonstrating how it is used in practical systems. This is the reason we included several chapters of case studies about real systems and how they use algorithms and techniques presented in the course.

The last chapters on practices and patterns were written under the same spirit and subject to similar time constraints.We will advise our readers to study the papers by Nishtala et al., Bronson et al., Sigelman et al, Verbitski et al, and Verbitski et al. for a deeper understanding of how theory can be put into practice.

Get hands-on with 1400+ tech skills courses.