Hadoop on its own supports only batch processing, and data is accessed only sequentially. That means even the simplest job has to scan the entire dataset. Processing a huge dataset also produces another huge dataset, which again has to be processed sequentially. What is needed is a way to reach any piece of data in a single unit of time (random access).
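The difference between sequential and random access can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Hadoop or any database is implemented: the record list, the `sequential_lookup` and `random_lookup` helpers, and the `user...` keys are all made up for the example.

```python
# Toy dataset: 100,000 (row_key, value) records, as a batch job would see them.
records = [(f"user{i:06d}", i * 2) for i in range(100_000)]


def sequential_lookup(records, key):
    """Scan records in order until the key is found; return (value, steps)."""
    steps = 0
    for row_key, value in records:
        steps += 1
        if row_key == key:
            return value, steps
    return None, steps


# A key-indexed structure stands in for a random-access store (an assumption
# made for illustration only).
index = dict(records)


def random_lookup(index, key):
    """Fetch the value directly by key, without touching other records."""
    return index.get(key)


value, steps = sequential_lookup(records, "user099999")
print(value, steps)  # the scan has to visit every record to reach the last key
print(random_lookup(index, "user099999"))
```

The sequential version does work proportional to the size of the dataset, while the keyed lookup goes straight to the requested record.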
Hadoop Random Access Databases
Databases such as HBase, Cassandra, CouchDB, Dynamo, and MongoDB store huge amounts of data and provide random access to it.
What is HBase
HBase is an open-source, distributed, column-oriented database built on top of the Hadoop Distributed File System (HDFS). It is developed by the Apache Software Foundation and is written primarily in Java. Its data model is similar to Google's Bigtable, and it is designed to provide quick random access to huge amounts of structured data. HBase can store massive amounts of data, from terabytes to petabytes.
HBase is part of the Hadoop ecosystem and sits on top of HDFS, providing random, real-time read/write access to the data stored there. Data can be stored in HDFS either directly or through HBase. Data consumers then read and access the data in HDFS randomly using HBase.
HBase is a column-oriented database, and its tables are sorted by row. The table schema defines only column families, which hold key-value pairs. A table can have multiple column families, and each column family can have any number of columns. Subsequent column values are stored contiguously on disk. Each cell value in the table has a timestamp. In short, in HBase:
- Table is a collection of rows.
- Row is a collection of column families.
- Column family is a collection of columns.
- Column is a collection of key-value pairs.
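The hierarchy above can be pictured as nested maps. The following is a minimal in-memory sketch under that assumption; `ToyTable` and its `put`, `get`, and `scan` methods are hypothetical names for illustration and are not the HBase client API. The family, column, and cell values are invented.

```python
import time
from collections import defaultdict


class ToyTable:
    """In-memory sketch: row key -> column family -> column -> {timestamp: value}."""

    def __init__(self, column_families):
        # Only column families are fixed up front; columns are added freely.
        self.column_families = set(column_families)
        self.rows = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

    def put(self, row_key, family, column, value, timestamp=None):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        ts = timestamp if timestamp is not None else time.time_ns()
        # Each cell keeps its values keyed by timestamp.
        self.rows[row_key][family][column][ts] = value

    def get(self, row_key, family, column):
        """Return the value with the newest timestamp, or None if absent."""
        versions = self.rows.get(row_key, {}).get(family, {}).get(column, {})
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self):
        """Return row keys in sorted order, since tables are sorted by row."""
        return sorted(self.rows)


table = ToyTable(["personal", "professional"])
table.put("row2", "personal", "name", "example-name", timestamp=1)
table.put("row1", "personal", "city", "old-city", timestamp=1)
table.put("row1", "personal", "city", "new-city", timestamp=2)

print(table.get("row1", "personal", "city"))  # the newest timestamp wins
print(table.scan())
```

Note that the columns `name` and `city` were never declared: only the column families belong to the schema, which matches the description above.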
Row-oriented data stores
- Data is stored and retrieved one row at a time and hence could read unnecessary data if only some of the data in a row is required.
- Easy to read and write records
- Well suited for OLTP systems
- Not efficient in performing operations applicable to the entire data set and hence aggregation is an expensive operation
- Typical compression mechanisms provide less effective results than those on column-oriented data stores
Column-oriented data stores
- Data is stored and retrieved in columns and hence can read only relevant data if only some data is required
- Reading and writing individual records are typically slower operations
- Well suited for OLAP systems
- Can efficiently perform operations applicable to the entire data set and hence enables aggregation over many rows and columns
- Permits high compression rates due to few distinct values in columns
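The trade-off between the two layouts can be made concrete with a toy example. This is a rough sketch with invented data, and the helper names (`total_from_rows`, `total_from_columns`, `record_from_columns`) are assumptions for illustration; real stores work with disk blocks rather than Python lists.

```python
# The same three records in two physical layouts (values are made up).
row_store = [
    {"id": 1, "name": "a", "amount": 10},
    {"id": 2, "name": "b", "amount": 20},
    {"id": 3, "name": "c", "amount": 30},
]

column_store = {
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "amount": [10, 20, 30],
}


def total_from_rows(rows, field):
    """Aggregate by visiting whole records, including fields that are not needed."""
    return sum(record[field] for record in rows)


def total_from_columns(columns, field):
    """Aggregate by reading a single column and nothing else."""
    return sum(columns[field])


def record_from_columns(columns, position):
    """Rebuild one record; this has to touch every column."""
    return {name: values[position] for name, values in columns.items()}


print(total_from_rows(row_store, "amount"))
print(total_from_columns(column_store, "amount"))
print(record_from_columns(column_store, 1))
```

Aggregating over `amount` in the column layout reads one list, whereas fetching a single complete record from it has to visit every column. That is the reason row stores suit OLTP and column stores suit OLAP.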
Features of HBase
- Strictly consistent reads and writes
- HBase is used extensively for random read and write operations
- HBase stores large amounts of data in tables
- Provides linear and modular scalability across a cluster
- HBase is built for low latency operations
- Automatic and configurable sharding of tables
- Automatic failover support between Region Servers
- Convenient base classes for backing Hadoop MapReduce jobs with HBase tables
- Easy to use Java API for client access
- Block cache and Bloom Filters for real-time queries
- Query predicate push-down via server-side filters
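To give a feel for why Bloom filters help with real-time queries, here is a generic sketch of the data structure. It is not HBase's implementation: the `BloomFilter` class, its parameters, and the hashing scheme are assumptions chosen for readability. The idea is that a small in-memory filter can say "this key is definitely not here", so a read can skip data that cannot contain the requested row.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive several bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means the key was never added; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bloom = BloomFilter()
for row_key in ("row1", "row2", "row3"):
    bloom.add(row_key)

print(bloom.might_contain("row2"))  # always True for a key that was added
```

A `False` answer is exact, so the lookup can be skipped entirely. A `True` answer still requires checking the underlying data, because unrelated keys can occasionally map to the same bits.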
Where to Use HBase
A table for a popular web application may consist of billions of rows. If we want to find a particular row in such a huge amount of data, HBase is a good choice because the query fetch time is low. Many online analytics applications use HBase. Traditional relational data models fail to meet the performance requirements of very big databases, and HBase can overcome these performance and processing limitations.
Applications of HBase
- HBase is used whenever we need to provide fast random access to available data.
- HBase is used for write-heavy applications.
- Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.