Apache Cassandra: The NoSQL Scalable Database

May 9, 2017

13611

This article introduces readers to the Apache Cassandra NoSQL database, and provides them with use cases for which it is suitable.

Apache Cassandra is a free and open source distributed, massively scalable database management system designed to handle large amounts of data across many commodity servers, while providing highly available service and no single point of failure. Apache Cassandra offers capabilities like continuous availability, linear scale performance, operational simplicity and easy data distribution across multiple data centres and cloud availability zones.

History
Apache Cassandra was originally developed by Avinash Lakshman (one of the authors of Amazon’s Dynamo) and Prashant Malik at Facebook for inbox search. Cassandra was published as an open source project in Google Code in July 2008. It was accepted into Apache Incubator in March 2009. Since February 2010, Cassandra has been an ‘Apache top-level project’.

Figure 1: Apache Cassandra

Figure 2: Cassandra supports multiple data centre and cloud deployments

Features of Apache Cassandra
The following are the key features of Apache Cassandra.
Decentralised: There are no single points of failure and no network bottlenecks. Every node in the cluster is identical.

Supports replication and multiple data centre replication: Cassandra supports replication across multiple data centres (in multiple geographies) and multi-cloud availability zones for writes/reads.
High scalability: Read and write throughput increase linearly as new machines are added, with no downtime or interruption to applications.

Fault-tolerant: Data is automatically replicated to multiple nodes for fault-tolerance. Replication across multiple data centres is supported. Failed nodes can be replaced with no downtime.

Tunable data consistency: Cassandra supports configured consistency levels to manage availability versus data accuracy. We can configure consistency on a Cassandra cluster, data centre, or as per individual read or write operations.

Tunable consistency is one of the strongest features of Cassandra. There are two types of consistency—strong and eventual. To ensure that data is written and read correctly, Cassandra extends the concept of eventual consistency by offering tunable consistency. Tunable data consistency allows individual read or write operations to be as strongly consistent as required by the client application. The consistency level of each read or write operation can be set, so that the data returned is more or less consistent, based on need.

Data compression: Data can be compressed up to 80 per cent without any performance overhead.
Cassandra query language: Cassandra provides a query language that is very similar to the SQL language. It helps developers moving from a relational database to Cassandra.

Figure 3: Cassandra’s masterless ‘ring’ architecture

Architecture of Apache Cassandra
Apache Cassandra’s architecture offers the ability to scale, perform and provide continuous uptime. Rather than using a legacy master-slave or a manual and difficult-to-maintain sharded architecture, Apache Cassandra has a masterless ‘ring’ design that is elegant, easy to set up, and easy to maintain. It has a peer-to-peer distributed system across its nodes, and data is distributed among all the nodes in a cluster.
All nodes play an identical role—there is no concept of a master node, with all nodes communicating with each other equally.
In an Apache Cassandra cluster, each node is capable of handling large amounts of data and thousands of concurrent users or operations per second — even across multiple data centres. This is done with as much ease as when managing much smaller amounts of data and user traffic.
Apache Cassandra’s architecture also ensures that, unlike other master-slave or sharded systems, it has no single point of failure and therefore is capable of offering true continuous availability and uptime — users need to simply add new nodes to an existing cluster without having to take it down.

Key structures
The various structures of Apache Cassandra, along with a brief description of each, follows.

Node: This is where you store your data. It is the basic infrastructure component of Cassandra.
Data centre: This is a collection of related nodes. This could be a physical data centre or a virtual one. Different workloads should use separate data centres, either physical or virtual. Replication is set by the data centre. Using separate data centres prevents Cassandra transactions from being impacted by other workloads and keeps requests close to each other for lower latency. Depending on the replication factor, data can be written to multiple data centres. However, data centres should never span physical locations.

Cluster: A cluster contains one or more data centres. It can span physical locations.
Commit log: All data is written first to the commit log for durability. After all its data has been flushed to SSTables, it can be archived, deleted or recycled.

Table: This is a collection of ordered columns fetched by rows. A row consists of columns and has a primary key. The first part of the key is a column name.

SSTable: This is a sorted string table (SSTable) and is an immutable data file to which Cassandra writes memtables periodically. SSTables are append-only, stored on disk sequentially and maintained for each Cassandra table.

Figure 4: Data model: Rows in a column family (CF)

The Apache Cassandra data model
The data model of Cassandra is different from the relational DBMS. Cassandra does not support joins or sub-queries like in RDBMS. Instead, Cassandra emphasises denormalisation through features like collections.

Cassandra is basically a key-value and a column-oriented (or tabular) database. Rows are organised into tables — the first component of a table’s primary key is the partition key. Within a partition, rows are clustered by the remaining columns of the key. Other columns can be indexed separately from the primary key.

The Cassandra data model consists of keyspace, column families, columns and rows.
Keyspace: This is the outermost container for your application data. It is similar to the schema in a relational database. The keyspace can include operational elements, such as the replication factor and data centre awareness. Keyspace is a group of many column families.

Column family: A column family is a container for an ordered collection of rows, each of which is itself an ordered collection of columns. A column family is similar to a table in an RDBMS and is a logical separation of similar data.

Column: This is a basic data structure of Cassandra with three values—name, value and timestamp.
Super column: The super column stores a map of sub-columns.

Row: This is a collection of columns labelled with a name.

Different use cases of Apache Cassandra
Apache Cassandra can be used for various applications. Here are some use cases where Apache Cassandra is the best choice compared to other NoSQL databases.

Internet of Things applications: Cassandra is the right choice for applications where data is travelling at very high speeds between different devices or sensors.
In activity-tracking and monitoring applications: Numerous entertainment and media organisations use Cassandra to monitor user activity based on parameters such as movies, music, album, artist, etc.
In heavy write systems or in time-series based applications: Cassandra is perfect for the very heavy write system—for example, in Web analytics where data is logged for each and every request based on hits, by type of browser, traffic sources, location, behaviour, technology, devices, etc.
Social media analytics and recommendation engines: Cassandra is used by many social media providers to analyse data and provide suggestions to their customers.

Product catalogues and retail applications: One very popular use case of Cassandra is to quickly display product catalogue inputs and lookups in retail applications.

Messaging: Cassandra serves as the database backbone for numerous mobile phone and messaging providers’ applications.
Apache Cassandra is one of the most popular open source distributed database systems available. It provides a more flexible data model than what’s offered in the relational database world. You can scale it up to any number of concurrent user connections and/or data volume. It can easily distribute data among multiple geographies, data centres and the cloud. So if your application has a large amount of data, and if you are planning to scale it, then Cassandra will definitely help you.