Recent Posts

Tuesday, 5 July 2016

Cassandra Architecture

Cassandra Architecture
* Cassandra is designed to handle big data. 

* Cassandra main feature is to store data on multiple nodes with no single point of failure. 
* The reason for this kind of Cassandra’s architecture was that the hardware failure can occur at any time. Any node can be down. In case of failure data stored in another node can be used. Hence, Cassandra is designed with its distributed architecture.

* Cassandra has peer-to-peer distributed system across its nodes, and data is distributed among all the nodes in a cluster.

* All the nodes in a cluster play the same role. Each node is independent and at the same time interconnected to other nodes.

* Each node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster.

* When a node goes down, read/write requests can be served from other nodes in the network.

* All the nodes exchange information with each other using Gossip protocol. Gossip is a protocol in Cassandra by which nodes can communicate with each other. 

* A Node is an instance of Cassandra. A Cassandra cluster is made up of many nodes.
* Node is the basic component of Cassandra.
* Node is the place where data is stored. 

Data Center
* Collection of nodes are refered as a data center. Many nodes are categorized as a data center.

* Cluster is the collection of many data centers.

Commit Log
* Commit log is used for crash recovery.

* When you perform a write operation, it’s immediately written to the commit log. The commit log is a crash-recovery mechanism that supports Cassandra’s durability goals

* After data written to the commit log, the value is written to a memory-resident data structure called the memtable.

* Data is written in Mem-table temporarily.

* When the number of objects stored in the memtable reaches a threshold, the contents of the memtable are flushed to disk in a file called an SSTable. A new memtable is then created.

Cassandra Data Replication
     In a distributed system like Cassandra, data replication enables high availability and durability. As hardware problem can occur or link can be down at any time during data process, a solution is required to provide a backup when the problem has occurred. So data is replicated for assuring no single point of failure. Cassandra places replicas of data on different nodes based on these two factors.

1. Where to place next replica is determined by the Replication Strategy.
2. While the total number of replicas placed on different nodes is determined by the Replication Factor.
     One Replication factor means that there is only a single copy of data while three replication factor means that there are three copies of the data on three different nodes.

     For ensuring there is no single point of failure, replication factor must be three. There are two kinds of replication strategies in Cassandra.

1. SimpleStrategy
     SimpleStrategy is used when you have just one data center. SimpleStrategy places the first replica on the node selected by the partitioner. After that, remaining replicas are placed in clockwise direction in the Node ring. Here is the pictorial representation of the SimpleStrategy. It is commonly used when nodes are in a single data center.

2. NetworkTopologyStrategy
     NetworkTopologyStrategy is used when you have more than two data centers. In NetworkTopologyStrategy, replicas are set for each data center separately. NetworkTopologyStrategy places replicas in the clockwise direction in the ring until reaches the first node in another rack. This strategy tries to place replicas on different racks in the same data center. This is due to the reason that sometimes failure or problem can occur in the rack. Then replicas on other nodes can provide data. Here is the pictorial representation of the Network topology strategy.

Write Operation
     The coordinator sends a write request to replicas. If all the replicas are up, they will receive write request regardless of their consistency level. Consistency level determines how many nodes will respond back with the success acknowledgement. The node will respond back with the success acknowledgement if data is written successfully to the commit log and memTable.

     For example, in a single data center with replication factor equals to three, three replicas will receive write request. If consistency level is one, only one replica will respond back with the success acknowledgement, and the remaining two will remain dormant. Suppose if remaining two replicas lose data due to node downs or some other problem, Cassandra will make the row consistent by the built-in repair mechanism in Cassandra.Here it is explained, how write process occurs in Cassandra,
1. When write request comes to the node, first of all, it logs in the commit log.

2. Then Cassandra writes the data in the mem-table. Data written in the mem-table on each write request also writes in commit log separately. Mem-table is a temporarily stored data in the memory while Commit log logs the transaction records for back up purposes.

3. When mem-table is full, data is flushed to the SSTable data file.

Read Operation
     There are three types of read requests that a coordinator sends to replicas.
1. Direct request
2. Digest request
3. Read repair request

     The coordinator sends direct request to one of the replicas. After that, the coordinator sends the digest request to the number of replicas specified by the consistency level and checks whether the returned data is an updated data. After that, the coordinator sends digest request to all the remaining replicas. If any node gives out of date value, a background read repair request will update that data. This process is called read repair mechanism.

Next Tutorial  Cassandra Data Model

No comments:

Post a Comment