Cassandra Performing Queries Efficiently
Look into how Cassandra performs queries efficiently.
In Cassandra, performing a query that does not use the primary key is inherently inefficient, because it requires a full table scan that queries all the nodes in the cluster.
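To make the problem concrete, here is a minimal sketch using a hypothetical users table partitioned by user_id. A query that filters on a non-key column such as country has to be allowed explicitly, precisely because it forces every node to scan its local data:

```
-- Hypothetical table, partitioned by user_id.
CREATE TABLE users (
    user_id uuid PRIMARY KEY,
    name    text,
    country text
);

-- Efficient: the partition key routes the query to the replicas that own it.
SELECT * FROM users WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- Inefficient: no partition key, so every node must scan its local data.
-- Cassandra refuses to run this unless ALLOW FILTERING is added explicitly.
SELECT * FROM users WHERE country = 'DE' ALLOW FILTERING;
```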
Methods to perform queries efficiently
Two alternatives can be used to solve this problem:
- Secondary indexes
- Materialized views
Secondary indexes
A secondary index can be defined on one or more columns of a table. Each node then indexes its part of the table locally using the specified columns. A query that filters on these columns still has to contact all the nodes in the cluster, but each node can use its local index to retrieve the matching rows without scanning all of its data.
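As a sketch of how this looks in CQL, reusing the hypothetical users table from above, a secondary index on country lets the earlier query run without ALLOW FILTERING:

```
-- Each node builds a local index over the country column of its data.
CREATE INDEX IF NOT EXISTS users_by_country_idx ON users (country);

-- The query still contacts all nodes, but each node answers from its
-- local index instead of scanning its entire copy of the table.
SELECT * FROM users WHERE country = 'DE';
```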
Materialized views
A materialized view is defined as a query on an existing table with a newly defined partition key. The view is maintained as a separate table, and any changes to the original table are eventually propagated to it.
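For illustration, the hypothetical users table could be repartitioned by country through a materialized view. CQL requires every primary key column of the view to be restricted with IS NOT NULL, and recent Cassandra versions require materialized views to be explicitly enabled in the cluster configuration:

```
-- The same data as in users, repartitioned by country.
CREATE MATERIALIZED VIEW IF NOT EXISTS users_by_country AS
    SELECT * FROM users
    WHERE country IS NOT NULL AND user_id IS NOT NULL
    PRIMARY KEY (country, user_id);

-- Reads only the replicas that own the 'DE' partition of the view.
SELECT * FROM users_by_country WHERE country = 'DE';
```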
Trade-offs with secondary indexes and materialized views
These two approaches are subject to the following trade-offs:
- Secondary indexes are more suitable for low cardinality columns, while materialized views are more suitable for high cardinality columns: an index on a high cardinality column forces a query to contact every node for only a handful of matching rows, while a materialized view is stored as a regular table, so a low cardinality partition key would concentrate its data in a few very large partitions.
- Materialized views are more efficient during read operations than secondary indexes, because only the nodes that contain the corresponding partition are queried.
- Secondary indexes are guaranteed to be strongly consistent, while materialized views are eventually consistent.
Denormalizing data for efficiency
Cassandra does not provide join operations, since they would be inefficient given how data is distributed across the cluster. Instead, users are encouraged to denormalize their data by potentially including the same data in multiple tables, so that each query can be answered efficiently by reading from a minimum number of nodes. This means that any update to this data needs to update multiple tables, but writes in Cassandra are cheap, so this is expected to be quite efficient.
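As a sketch of this approach with hypothetical tables, the same order data can be written to two tables, each partitioned by the key used in a different query:

```
-- The same order data is duplicated under two different partition keys.
CREATE TABLE orders_by_user (
    user_id  uuid,
    order_id timeuuid,
    amount   decimal,
    PRIMARY KEY (user_id, order_id)
);

CREATE TABLE orders_by_product (
    product_id uuid,
    order_id   timeuuid,
    user_id    uuid,
    amount     decimal,
    PRIMARY KEY (product_id, order_id)
);

-- Each query hits a single partition, i.e., only the few replicas that own it.
-- (? marks a prepared-statement placeholder.)
SELECT * FROM orders_by_user    WHERE user_id = ?;
SELECT * FROM orders_by_product WHERE product_id = ?;
```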
Updating multiple tables
Cassandra provides two flavors of batch operations that can update multiple partitions and tables: logged and unlogged batches.
Logged and unlogged batches
Logged batches provide the additional guarantee of atomicity: either all of the statements in the batch take effect, or none of them do. This can help keep all the tables that share denormalized data consistent with each other. However, this is achieved by first writing the batch as a single unit to a replicated system table (the batch log) and only then performing the operations, which makes logged batches less efficient than unlogged ones.
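For instance, a logged batch can keep the two hypothetical order tables from the earlier sketch in sync; dropping the batch log via UNLOGGED trades the atomicity guarantee for lower overhead:

```
-- Logged batch (the default): either both tables are updated, or neither is.
BEGIN BATCH
    INSERT INTO orders_by_user    (user_id, order_id, amount)
        VALUES (?, ?, ?);
    INSERT INTO orders_by_product (product_id, order_id, user_id, amount)
        VALUES (?, ?, ?, ?);
APPLY BATCH;

-- Unlogged batch: same statements, lower overhead, no atomicity guarantee.
BEGIN UNLOGGED BATCH
    INSERT INTO orders_by_user    (user_id, order_id, amount)
        VALUES (?, ?, ?);
    INSERT INTO orders_by_product (product_id, order_id, user_id, amount)
        VALUES (?, ?, ?, ?);
APPLY BATCH;
```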
Note: Neither logged nor unlogged batches provide isolation, so concurrent requests might temporarily observe the effects of only some of the operations in a batch.