Sunday, 11 September 2016

Cassandra Data Model


     When creating a data model for your keyspace, the most important thing to do is to forget everything you know about relational data modeling. Relational data models are designed for efficient storage, relational lookups, and associations between concerns. The Cassandra data model is designed for raw performance and storage of vast amounts of data.


     Unlike relational databases, the data model for Cassandra is based on the query patterns required. This means that you have to know the read/write patterns before you create your data model.
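
     As a minimal sketch of what query-first modeling can look like, assume two hypothetical read patterns: "list every animal in a family" and "look up one animal by name." The table names animals_by_family and animals_by_name are illustrative assumptions, not part of the original post. Each query gets its own table, keyed by the column that the query filters on:

```cql
-- Hypothetical query 1: list every animal in a given family.
CREATE TABLE animals_by_family (
    family TEXT,
    name TEXT,
    species TEXT,
    PRIMARY KEY (family, name)
);

-- Hypothetical query 2: look up a single animal by its name.
CREATE TABLE animals_by_name (
    name TEXT PRIMARY KEY,
    family TEXT,
    species TEXT
);

-- Each read filters on the partition key of the table built for it.
SELECT name, species FROM animals_by_family WHERE family = 'Canidae';
SELECT family, species FROM animals_by_name WHERE name = 'dog';
```

     The same data is written to both tables. Duplicating data this way is a deliberate trade: extra storage in exchange for reads that never need a join.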

Cassandra Data Model
     To understand how to model in Cassandra, you must first understand how the Cassandra data model works. When creating a table using CQL, you are not only telling Cassandra the names and types of your columns, you are also telling it how to store and distribute your data. This is done via the PRIMARY KEY clause. The PRIMARY KEY tells the Cassandra storage system to distribute the data based on the value of this key; this is known as the partition key. When there are multiple fields in the PRIMARY KEY, as is the case with compound keys, the first field is the partition key (how the data is distributed across nodes) and the subsequent fields are known as the clustering keys (how the data is sorted on disk within a partition).

     Clustering keys allow you to pregroup your data by the values in the keys. Tables that use compound keys in Cassandra are commonly said to have “wide rows.” The term “wide rows” refers to the rows that Cassandra stores on disk, one per partition, rather than the rows that are returned to you when you make a query.
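
     To make the idea of a wide row more concrete, here is a small sketch using a hypothetical sensor_readings table (the table, its columns, and the sample values are assumptions made for illustration only). Every reading for one sensor lives in the same partition, sorted by reading_time:

```cql
-- Hypothetical table: one partition per sensor, one clustered row per reading.
CREATE TABLE sensor_readings (
    sensor_id TEXT,
    reading_time TIMESTAMP,
    temperature DOUBLE,
    PRIMARY KEY (sensor_id, reading_time)
);

INSERT INTO sensor_readings (sensor_id, reading_time, temperature)
VALUES ('s-1', '2016-09-11 10:00:00+0000', 21.5);
INSERT INTO sensor_readings (sensor_id, reading_time, temperature)
VALUES ('s-1', '2016-09-11 10:05:00+0000', 21.7);

-- Range scan over the clustering key inside a single partition.
SELECT reading_time, temperature
FROM sensor_readings
WHERE sensor_id = 's-1'
  AND reading_time >= '2016-09-11 10:00:00+0000'
  AND reading_time < '2016-09-11 11:00:00+0000';
```

     A query sees two separate rows here, but on disk both readings belong to the single partition for sensor 's-1', which is what makes the stored row "wide."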

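     The following figure shows how the data might be stored in a five-node cluster using a single-column PRIMARY KEY. Because name is the partition key, each animal is placed independently and may land on a different node.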

CREATE TABLE animals (
    name TEXT PRIMARY KEY,
    species TEXT,
    subspecies TEXT,
    genus TEXT,
    family TEXT
);
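
     To reproduce the result set shown below, the table could be populated with statements such as the following. The values are taken from that result set; leaving subspecies out of an INSERT is what produces the null entries.

```cql
-- Only the dog row has a subspecies; the other rows leave it unset.
INSERT INTO animals (name, family, genus, species, subspecies)
VALUES ('dog', 'Canidae', 'Canis', 'C. lupus', 'C. l. familiaris');

INSERT INTO animals (name, family, genus, species)
VALUES ('cat', 'Felidae', 'Felis', 'F. catus');

INSERT INTO animals (name, family, genus, species)
VALUES ('duck', 'Anatidae', 'Anas', 'A. platyrhynchos');

INSERT INTO animals (name, family, genus, species)
VALUES ('wolf', 'Canidae', 'Canis', 'C. lupus');
```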

SELECT * FROM animals;

name | family   | genus | species          | subspecies
-------------------------------------------------------------
dog  | Canidae  | Canis | C. lupus         | C. l. familiaris
cat  | Felidae  | Felis | F. catus         | null
duck | Anatidae | Anas  | A. platyrhynchos | null
wolf | Canidae  | Canis | C. lupus         | null

     The following figure shows how the data might be stored in a five-node cluster using a compound key.

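CREATE TABLE animals (
    name TEXT,
    species TEXT,
    subspecies TEXT,
    genus TEXT,
    family TEXT,
    -- family is the partition key; genus and name are clustering columns.
    -- name is included so that dog and wolf, which share a family and
    -- genus, remain separate rows instead of overwriting each other.
    PRIMARY KEY (family, genus, name)
);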

SELECT * FROM animals;

name | family   | genus | species          | subspecies
-------------------------------------------------------------
dog  | Canidae  | Canis | C. lupus         | C. l. familiaris
wolf | Canidae  | Canis | C. lupus         | null
cat  | Felidae  | Felis | F. catus         | null
duck | Anatidae | Anas  | A. platyrhynchos | null

     When we use a compound key, the data for wolf and for dog is stored on the same server. This is because “family” is now the partition key and “genus” is a clustering column. In practice, this means that all rows for a given family are stored on the same replica set and presorted, or clustered, by genus. This allows very fast lookups when the family and genus of an animal are known.
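
     As a sketch of the lookups this layout is meant to serve, the queries below run against the compound-key animals table defined above. The specific filter values are chosen for illustration:

```cql
-- Partition key plus clustering column: reads one partition and jumps
-- straight to the Canis rows inside it.
SELECT name, species, subspecies
FROM animals
WHERE family = 'Canidae' AND genus = 'Canis';

-- Partition key only: returns every row in the Canidae partition,
-- already sorted by genus.
SELECT name, genus, species
FROM animals
WHERE family = 'Canidae';

-- Filtering on genus alone skips the partition key, so Cassandra rejects
-- the query unless ALLOW FILTERING is appended, which forces a scan
-- across partitions.
-- SELECT * FROM animals WHERE genus = 'Canis';
```

     The last query is left commented out on purpose. It shows the kind of read this table was not designed for; serving it efficiently would normally call for a second table keyed by genus.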

1 comment:

  1. Hello Ashok,
    Can you explain a bit about garbage collection (tombstones) in Cassandra and the algorithm used behind the scenes?

    Regards,
    Tejas Udeshi
