Saturday, 4 June 2016

Apache Cassandra Tutorial

Apache Cassandra

* Cassandra is an open-source distributed database management system.

* Apache Cassandra is a highly scalable, distributed, high-performance NoSQL database.

* Cassandra is designed to handle a huge amount of data across multiple data centers with no single point of failure.

* Cassandra handles this volume of data through its distributed architecture. Data is placed on different machines with a replication factor greater than one, so every piece of data is stored on more than one machine. This provides high availability and removes any single point of failure.
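
To make the idea of a replication factor concrete, here is a minimal Python sketch of placing copies of a key on several machines arranged in a ring. It is a simplified illustration, not Cassandra's actual partitioner or replication strategy: the node names, the `token` and `replicas_for` helpers, and the use of MD5 are all assumptions made for this example.

```python
import bisect
import hashlib
from typing import List


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is used only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def replicas_for(key: str, nodes: List[str], replication_factor: int) -> List[str]:
    """Return the nodes that would hold a copy of `key` (hypothetical helper)."""
    if not 1 <= replication_factor <= len(nodes):
        raise ValueError("replication_factor must be between 1 and len(nodes)")
    # Place every node on the ring, ordered by its token.
    ring = sorted((token(node), node) for node in nodes)
    tokens = [t for t, _ in ring]
    # The first replica is the first node clockwise from the key's token;
    # the remaining replicas are the next nodes around the ring.
    start = bisect.bisect_right(tokens, token(key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]


nodes = ["node-a", "node-b", "node-c", "node-d"]
owners = replicas_for("user:42", nodes, replication_factor=3)
print(owners)  # three distinct nodes, so losing any one of them still leaves two copies
```

With a replication factor of 3 on four nodes, any single machine can fail and the key is still readable from the other two owners.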

    Let us first understand what a NoSQL database does.

* A NoSQL database (sometimes read as "Not Only SQL") is a database that provides a mechanism to store and retrieve data modeled in ways other than the tabular relations used in relational databases.

* These databases are typically schema-free, support easy replication, have a simple API, are eventually consistent, and can handle huge amounts of data.

* NoSQL databases use different data structures from relational databases, which makes some operations faster in NoSQL.

* NoSQL databases include MongoDB, HBase, and Cassandra. NoSQL databases have the following properties:
1. Design Simplicity
2. Horizontal Scaling
3. High Availability

* The data structures used in Cassandra are more specialized than those used in relational databases, which can make certain operations faster than they would be on relational database structures.

[Figure: NoSQL Cassandra database vs. relational databases]

History of Cassandra
* Apache Cassandra was originally developed at Facebook to power Facebook’s Inbox Search feature, which launched in 2008.

* The original authors were Avinash Lakshman, who is also one of the authors of the Amazon Dynamo paper, and Prashant Malik.

* After being in production at Facebook for a while, Cassandra was released as an open-source project on Google Code in July of 2008.

* In March of 2009, it was accepted into the Apache Incubator at the Apache Software Foundation.

* In February of 2010, it became a top-level Apache project.

Who Uses Cassandra?
     Cassandra is in wide use around the world, and usage is growing all the time. Companies like Netflix, eBay, Twitter, Reddit, and Ooyala all use Cassandra to power pieces of their architecture, and it is critical to the day-to-day operations of those organisations.

     Before we get too deep into Cassandra, it is important to understand some of the basic concepts that surround databases so you know what concessions you may have to make when choosing a system. There are three main sets of properties that define what database systems are capable of. Those are ACID, CAP, and BASE.

     ACID stands for Atomicity, Consistency, Isolation, and Durability. In order to understand ACID and how it relates to databases, we need to talk about transactions. When it comes to databases, a transaction is defined as a single logical operation. For example, if you are shopping online, every time you add an item to your shopping cart, that item and its quantity make up the database transaction. Even if you add multiple items or multiple quantities of the same item with a single click, that entire shopping cart addition is just a single transaction.

     Atomicity means that each transaction either works or it doesn’t. This is to say that if any single part of the transaction fails, the entire transaction fails. This should hold true for every situation related to a transaction that could cause a failure. Network failure, power outage, or even a node outage occurring at transaction time should cause a complete transaction failure in an atomic system.
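
The shopping cart example can be sketched with Python's built-in `sqlite3` module. SQLite stands in here as a small ACID-compliant database (it is not Cassandra), and the `cart` table and `add_to_cart` helper are made up for this illustration. The point is that a multi-item addition either lands completely or not at all.

```python
import sqlite3


def add_to_cart(conn, items):
    """Insert all (item, quantity) rows as one transaction: all or nothing."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for name, quantity in items:
                conn.execute(
                    "INSERT INTO cart (item, quantity) VALUES (?, ?)",
                    (name, quantity),
                )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cart (item TEXT PRIMARY KEY, quantity INTEGER NOT NULL)")

ok = add_to_cart(conn, [("book", 1), ("pen", 3)])
# The second row repeats "book", which violates the primary key,
# so the "lamp" row inserted just before it is rolled back as well.
failed = add_to_cart(conn, [("lamp", 1), ("book", 2)])

rows = conn.execute("SELECT item, quantity FROM cart ORDER BY item").fetchall()
print(ok, failed, rows)  # True False [('book', 1), ('pen', 3)]
```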

     Consistency ensures that when a transaction is complete, whether it is successful or not, the database is still in a valid state. This means that any data written to the database must also be valid. Writing data to the database also triggers any database-level or application-level rules such as constraints, cascades, triggers, or stored procedures. The application of those rules should also leave the data in a valid state.
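
As a rough illustration of a rule that keeps data valid, the sketch below uses a `CHECK` constraint in SQLite (again via Python's `sqlite3` module, standing in for any ACID database). The `cart` table, the "quantity must be positive" rule, and the `set_quantity` helper are assumptions chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cart ("
    " item TEXT PRIMARY KEY,"
    " quantity INTEGER NOT NULL CHECK (quantity > 0))"
)


def set_quantity(conn, item, quantity):
    """Try to store a row; the database rejects it if the rule is broken."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO cart (item, quantity) VALUES (?, ?)",
                (item, quantity),
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = set_quantity(conn, "book", 2)   # valid row
rejected = set_quantity(conn, "pen", -5)   # violates CHECK (quantity > 0)

rows = conn.execute("SELECT item, quantity FROM cart").fetchall()
print(accepted, rejected, rows)  # True False [('book', 2)]
```

Whether the write succeeds or fails, the table only ever contains rows that satisfy the rule.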

     Isolation is a property that ensures that all transactions that are run concurrently appear as if they were executed serially (one right after the other). Each transaction must be run in a vacuum (isolation). This is to say that if two transactions are run at the same time, they remain independent of each other during the transaction. Locks (table, row, column, etc.) are a common way of enforcing isolation, while dirty reads and deadlocks are examples of the problems that appear when isolation is too weak or when locks conflict. These matter because of concurrency: multiple changes can be attempted on the same data or set of data, and knowing which version of the data is the correct one is important for keeping the entire system in a sane state.

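     Durability means that once a transaction has been committed, its changes are saved permanently to the database, even if the system crashes immediately afterward. For example, after a bank deposit commits, the amount deposited must stay added to your previous balance. Databases commonly achieve this by first writing each transaction to a log, which stores the entries to be finalized in the database, so that a transaction that has survived all hurdles can be replayed after a failure.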
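
The idea of a log that stores entries before they are finalized can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular database implements its log: the file name `wal.log`, the JSON entry format, and the `append_entry` and `replay` helpers are all invented for the example.

```python
import json
import os
import tempfile


def append_entry(log_path, entry):
    """Append one entry to the log and force it to disk before returning."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
        log.flush()
        os.fsync(log.fileno())  # ask the OS to flush its buffers to the storage device


def replay(log_path):
    """Rebuild the balance from the log, as a restart after a crash would."""
    balance = 0
    if not os.path.exists(log_path):
        return balance
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            balance += json.loads(line)["deposit"]
    return balance


log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
append_entry(log_path, {"deposit": 100})
append_entry(log_path, {"deposit": 50})

recovered = replay(log_path)
print(recovered)  # 150
```

Because each deposit is on disk before it is acknowledged, the balance can be reconstructed from the log alone.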

     CAP stands for Consistency, Availability, and Partition tolerance. Although the C in CAP also stands for “consistency” (similar to the C in ACID), the meaning is different. Consistency means that all nodes in a grouping see the same data at the same time. In other words, any particular query hitting any node in the system will return the same result for that specific query. Consistency further implies that when a query updates a value on one node, every node will reflect the new value before the next query is served.

     The availability of a system speaks to the guarantee that regardless of the success or failure of a request, the requestor will receive a response. This means that system operations will be able to continue even if part of the system is down, whatever the reason. Availability is what lets the software attempt to cope with and compensate for externalities such as hardware failures, network outages, power failures, and the like.

     Partition tolerance refers to the capability of a distributed system to keep functioning when a network failure splits its nodes into groups that cannot communicate with each other (a network partition). This implies that even if messages between some nodes are lost or a few nodes are unreachable, the system will continue to function. Partition tolerance should not be confused with data partitioning, which distributes load (data or queries) across multiple nodes. Sharding is a commonly used management technique for distributing load across a cluster. Sharding, which is similar to horizontal partitioning, is a way of splitting data into separate parts and moving them to another server or physical location, generally for performance improvements.
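
Sharding as described above can be sketched as a function that maps each key to one of several shards. This is a minimal hash-based example, assuming four shards and made-up `user:N` keys; the `shard_for` and `split_into_shards` helpers are hypothetical and real systems use more elaborate schemes.

```python
import hashlib
from collections import defaultdict


def shard_for(key, shard_count):
    """Pick a shard index for `key` from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def split_into_shards(keys, shard_count):
    """Group keys by the shard that would store them."""
    shards = defaultdict(list)
    for key in keys:
        shards[shard_for(key, shard_count)].append(key)
    return dict(shards)


keys = [f"user:{i}" for i in range(1000)]
shards = split_into_shards(keys, shard_count=4)

for index in sorted(shards):
    print(index, len(shards[index]))  # each shard holds roughly a quarter of the keys
```

Because the hash is stable, the same key always maps to the same shard, so a query for that key only needs to reach one server.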

     BASE stands for Basically Available, Soft state, and Eventual consistency. Having a system be basically available means that the system will respond to any request. The caveat is that the response may be a failure to get the data or that the data may be in an inconsistent or changing state.

     The idea of a soft-state system means the system is always changing. This is typically due to eventual consistency. It is common for soft-state systems to undergo changes even when there is no additional input to them.

     Eventual consistency refers to the concept that once a system stops receiving input, the data will propagate to wherever else it needs to be in the system sooner or later. The beauty of this is that the system does not check for consistency on every transaction as is expected in an ACID-compliant system.
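
Eventual consistency can be illustrated with a toy model of three replicas that accept writes independently and later exchange data in the background. This sketch assumes a simple "newest timestamp wins" rule and hand-picked integer timestamps. The `Replica` class and `sync` function are invented for the example and do not describe Cassandra's actual repair mechanisms.

```python
class Replica:
    """A toy replica that stores key -> (timestamp, value)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, timestamp):
        # Keep the incoming value only if it is newer than what is stored.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Exchange entries in both directions; the newest timestamp wins."""
    for key, (timestamp, value) in list(a.data.items()):
        b.write(key, value, timestamp)
    for key, (timestamp, value) in list(b.data.items()):
        a.write(key, value, timestamp)


r1, r2, r3 = Replica("r1"), Replica("r2"), Replica("r3")
r1.write("cart", ["book"], timestamp=1)
r2.write("cart", ["book", "pen"], timestamp=2)

# Right after the writes, the replicas disagree.
before = [r.read("cart") for r in (r1, r2, r3)]

# No new input arrives; background syncs spread the newest value.
sync(r1, r2)
sync(r2, r3)
sync(r1, r3)
after = [r.read("cart") for r in (r1, r2, r3)]

print(before)  # [['book'], ['book', 'pen'], None]
print(after)   # [['book', 'pen'], ['book', 'pen'], ['book', 'pen']]
```

Between the writes and the last sync, a read could return a stale value or nothing at all, which is the soft state described above.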
