Cassandra is a distributed datastore that combines ideas from the Dynamo (G. DeCandia et al., “Dynamo: Amazon’s Highly Available Key-value Store,” in Proceedings of the Twenty-First ACM SIGOPS Symposium on Operating Systems Principles (SOSP), 2007) and Bigtable (F. Chang et al., “Bigtable: A Distributed Storage System for Structured Data,” in Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006) papers.

Note: Besides Dynamo, there is also a separate distributed system called DynamoDB. It is commercially available, but details about its internal architecture have not been shared publicly yet. However, this system has a lot of similarities with Cassandra, such as the data model and tunable consistency.

Cassandra (A. Lakshman and P. Malik, “Cassandra — A Decentralized Structured Storage System,” ACM SIGOPS Operating Systems Review, 2010) was originally developed by Facebook, but it was then open-sourced and became an Apache project. During this period, it has evolved significantly from its original implementation.

Note: The information in this chapter refers to the state of this project at the time of writing this course.

Design goals of Cassandra

The main design goals of Cassandra are:

  • Extremely high availability
  • Performance (high throughput and low latency, with an emphasis on write-heavy workloads) with unbounded, incremental scalability

Note: In order to achieve these goals Cassandra trades off some other properties, such as strong consistency.

Data model

The data model is relatively simple: it consists of keyspaces at the highest level, each of which can contain multiple tables.

Table

Each table stores data in sets of rows and is characterised by a schema.

Schema

The schema defines the structure of each row, which consists of the various columns and their types. It also determines the primary key.

Primary key

The primary key is a column or a set of columns that have unique values for each row. The primary key can have two components:

  • The first component is the partition key, and it’s mandatory
  • The second component contains the clustering columns and is optional

If both of these components are present, then the primary key is called a compound primary key.

Note: Furthermore, if the partition key is composed of multiple columns, it’s called a composite partition key.

The following illustration contains two tables: one has a simple primary key, and the other has a compound primary key.

Tables having different keys

Simple Primary Key:

CREATE TABLE Employees (
    employee_id uuid,
    first_name text,
    last_name text,
    PRIMARY KEY (employee_id)
);

Compound Primary Key:

CREATE TABLE ProductCatalog (
    product_id uuid,
    size int,
    price decimal,
    PRIMARY KEY (product_id, size)
);

The primary key of a table is one of the most important parts of the schema because it determines how data is distributed across the system and also how it is stored in every node.

Partition key component

The first component of the primary key, the partition key, determines the distribution of data. The rows of a table are conceptually split into different partitions, where each partition contains only rows with the same value for the defined partition key. All the rows belonging to a single partition are guaranteed to be stored together on the same nodes, while rows belonging to different partitions can be distributed across different nodes.

Clustering columns component

The second component of the primary key, the clustering columns, determines how rows of the same partition are stored on disk. Specifically, rows of the same partition are stored in ascending order of the defined clustering columns, unless specified otherwise. The following illustration elaborates on the previous example, showing how data from the two tables would be split into partitions and stored in practice:

Partitioning is performed via consistent hashing, so the partition key determines the partition in which a row is placed.
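To make this concrete, here is a minimal sketch in plain Python (not Cassandra internals) that groups rows of the ProductCatalog table from the earlier example into partitions by their partition key and sorts each partition by its clustering column. The row values are made up for illustration:

from collections import defaultdict

# Rows of ProductCatalog: product_id is the partition key,
# size is the clustering column.
rows = [
    {"product_id": "p2", "size": 38, "price": 30.0},
    {"product_id": "p1", "size": 42, "price": 50.0},
    {"product_id": "p1", "size": 40, "price": 45.0},
]

# Group rows into partitions by partition key.
partitions = defaultdict(list)
for row in rows:
    partitions[row["product_id"]].append(row)

# Within each partition, rows are kept in ascending clustering order.
for key, partition in partitions.items():
    partition.sort(key=lambda r: r["size"])
    print(key, [r["size"] for r in partition])
# Output:
# p2 [38]
# p1 [40, 42]

All rows with the same product_id end up in the same partition (and therefore on the same nodes), sorted by size within the partition.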

Distributing the partitions of the table over the available nodes

Cassandra distributes the partitions of a table across the available nodes using consistent hashing. It also uses virtual nodes to provide balanced, fine-grained partitioning. All the virtual nodes of a Cassandra cluster form a ring.

Each virtual node corresponds to a specific value in the ring, called the token, which determines which partitions will belong to this virtual node.

Each virtual node contains all the partitions whose partition key (when hashed) falls in the range between its token and the token of the previous virtual node in the ring, as shown in the following illustration:

Cassandra also supports some form of range partitioning via the ByteOrderedPartitioner. However, this is available mostly for backward compatibility reasons, and it’s not recommended since it can cause issues with hot spots and imbalanced data distribution.

Every Cassandra node can be assigned multiple virtual nodes.
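Here is a minimal Python sketch of this token-ring lookup. The ring contents, token space, and hash function below are illustrative assumptions; Cassandra itself defaults to Murmur3-based tokens:

import bisect
import hashlib

# Illustrative token ring: each (token, node) pair is a virtual node.
# Tokens and node names are made up for this example.
ring = sorted([
    (10, "node-A"), (35, "node-B"), (60, "node-A"),
    (85, "node-C"), (95, "node-B"),
])
tokens = [t for t, _ in ring]

def token_for(partition_key: str, ring_size: int = 100) -> int:
    """Hash a partition key onto the ring's token space."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % ring_size

def owner(partition_key: str) -> str:
    """A virtual node owns the keys whose hash falls in the range
    (previous vnode's token, its own token], wrapping around the ring."""
    t = token_for(partition_key)
    idx = bisect.bisect_left(tokens, t) % len(ring)
    return ring[idx][1]

print(owner("p1"), owner("p2"))  # each key maps to exactly one vnode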

Replicating partitions across the nodes

Each partition is replicated across N nodes, where N is a number that is configurable per keyspace, and it’s called the replication factor. There are multiple available replication strategies that determine how the additional N-1 nodes are selected.

The most straightforward strategy selects the subsequent nodes clockwise in the ring. More complicated strategies also take into account the network topology of the nodes for the selection.
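Continuing the previous sketch (reusing its ring, tokens, and token_for), the straightforward strategy can be sketched as walking clockwise from the owning virtual node and collecting distinct physical nodes. This mirrors the spirit of Cassandra’s SimpleStrategy, not its actual code:

def replicas(partition_key: str, replication_factor: int) -> list[str]:
    """Walk clockwise from the owning vnode, collecting distinct
    physical nodes until replication_factor replicas are found."""
    t = token_for(partition_key)
    start = bisect.bisect_left(tokens, t) % len(ring)
    result = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if node not in result:
            result.append(node)
        if len(result) == replication_factor:
            break
    return result

print(replicas("p1", 2))  # the owning node plus the next distinct node clockwise

Topology-aware strategies, such as Cassandra’s NetworkTopologyStrategy, additionally take racks and datacenters into account when picking these nodes.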

Storage engines for nodes

The storage engine for each node is inspired by Bigtable. It is based on a commit log containing all the mutations and a memtable that is periodically flushed to SSTables, which are also periodically merged via compactions.
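A heavily simplified Python sketch of this write path follows; the class and method names are invented for illustration and omit details such as crash recovery, reads, and tombstones:

# Illustrative LSM-style write path: mutations are appended to a commit log
# for durability, applied to an in-memory memtable, and periodically
# flushed to immutable, sorted SSTables that are later compacted.
class ToyStorageEngine:
    def __init__(self, flush_threshold: int = 4):
        self.commit_log = []      # durable, append-only log of mutations
        self.memtable = {}        # in-memory view of recent writes
        self.sstables = []        # immutable, sorted "files"
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. durability first
        self.memtable[key] = value            # 2. then the memtable
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Flush the memtable as a new sorted, immutable SSTable.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def compact(self):
        # Merge all SSTables into one, keeping the newest value per key
        # (later SSTables hold newer writes and overwrite older ones).
        merged = {}
        for table in self.sstables:
            merged.update(dict(table))
        self.sstables = [sorted(merged.items())]

Because writes only append to the log and update memory, this design favors write-heavy workloads, which matches Cassandra’s stated performance goals.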
