HBase is a non-relational database that runs on top of HDFS. This article gives an overview of HBase, discussing its benefits and limitations.
HBase is an open source, distributed, non-relational database developed under the Apache Software Foundation, which runs on top of HDFS. It is modelled on Google's Bigtable design and is written mainly in Java. It is designed to provide quick random access to large amounts of structured data.
One can store data in HDFS either directly or through HBase. HBase is natively integrated with Hadoop and works alongside other data access engines through YARN.
Features of HBase
1. HBase tables are distributed across the cluster as regions, which are automatically split and redistributed as the data grows.
2. HBase scales linearly and supports automatic failover.
3. It integrates with Hadoop, both as a source and as a destination.
4. HBase supports an easy-to-use Java API for programmatic access.
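The automatic splitting described in point 1 can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not HBase code: the `ToyTable` class, the row-count threshold and the midpoint split are all hypothetical, and HBase itself decides when to split based on region size rather than row count.

```python
from bisect import bisect_right


class ToyTable:
    """Toy model of a table whose rows are divided into key-ordered regions."""

    def __init__(self, max_rows_per_region=4):
        self.max_rows = max_rows_per_region
        # Each region is a sorted list of row keys; regions are kept in key order.
        self.regions = [[]]

    def _region_index(self, row_key):
        # The first region has no lower bound; every later region starts at its first key.
        start_keys = [region[0] for region in self.regions[1:]]
        return bisect_right(start_keys, row_key)

    def put(self, row_key):
        idx = self._region_index(row_key)
        region = self.regions[idx]
        if row_key not in region:
            region.append(row_key)
            region.sort()
        # Split the region in two at its midpoint once it grows past the threshold.
        if len(region) > self.max_rows:
            mid = len(region) // 2
            self.regions[idx:idx + 1] = [region[:mid], region[mid:]]


table = ToyTable(max_rows_per_region=4)
for i in range(10):
    table.put(f"row{i:02d}")
print([len(region) for region in table.regions])
```

No region ever holds more than the threshold number of rows, and the regions together still cover every row key in sorted order, which is what lets a table keep growing across servers.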
HBase's architecture is composed of three types of components: the client library, a master server and region servers. Region servers can be added or removed as requirements change.
Master server: This monitors all region server instances in the cluster and acts as the interface for all metadata changes. It maintains the state of the cluster by assigning regions to region servers and balancing the load across them. It is responsible for schema changes and other metadata operations such as the creation of tables and column families.
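To give a feel for what balancing the load means, here is a minimal Python sketch of a greedy assignment. The `balance` function and the server and region names are hypothetical, and this is a deliberately simplified stand-in, not the balancing algorithm HBase actually uses.

```python
def balance(regions, servers):
    """Assign each region to whichever server currently holds the fewest regions."""
    assignment = {server: [] for server in servers}
    for region in regions:
        # Pick the least-loaded server; ties go to the earliest server in the list.
        target = min(servers, key=lambda s: len(assignment[s]))
        assignment[target].append(region)
    return assignment


regions = ["region-1", "region-2", "region-3", "region-4", "region-5"]
print(balance(regions, ["rs-a", "rs-b"]))
# If one server fails, the same regions can be spread over the survivors.
print(balance(regions, ["rs-a"]))
```

Calling the function again with a shorter server list mimics, very roughly, how regions are reassigned when a region server drops out of the cluster.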
Regions: Regions are the basic building blocks of an HBase cluster. Each region holds a contiguous range of a table's rows, and a table's regions are spread across the region servers. A region contains multiple stores, one for each column family, and each store is made up mainly of two components: the MemStore and HFiles.
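The relationship between the MemStore and HFiles inside a store can be sketched in Python as follows. This is a toy model: the `ToyStore` class and its row-count flush threshold are assumptions for illustration, real flushes are triggered by memory size, and details such as the write-ahead log and compaction are left out.

```python
class ToyStore:
    """Toy model of one store, i.e. one column family within a region."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.memstore = {}  # in-memory write buffer: row key -> value
        self.hfiles = []    # immutable sorted "files", oldest first

    def put(self, row_key, value):
        self.memstore[row_key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffer out as a sorted, read-only file and start a fresh buffer.
        if self.memstore:
            self.hfiles.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, row_key):
        # The newest value wins: check the memstore first, then files newest to oldest.
        if row_key in self.memstore:
            return self.memstore[row_key]
        for hfile in reversed(self.hfiles):
            for key, value in hfile:
                if key == row_key:
                    return value
        return None
```

Writes only ever touch the in-memory buffer, while reads merge the buffer with the flushed files, so an updated value in memory hides an older copy on disk.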
Region server: When a region server receives read and write requests from the client, it passes each request to the specific region where the relevant column family resides. The client can contact region servers directly; it does not need the master's permission to communicate with them. The client needs the master's help only for operations that involve metadata and schema changes.
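Because each region covers a contiguous range of row keys, a client can work out which region server to contact from the row key alone. The Python sketch below illustrates the idea with a hard-coded, hypothetical region map; the server names, start keys and the `locate_region_server` function are assumptions, and a real client obtains and caches this mapping from the cluster instead.

```python
from bisect import bisect_right

# Hypothetical region map: (start_key, region_server), sorted by start key.
# An empty start key means "from the beginning of the table".
REGION_MAP = [
    ("", "regionserver-1"),
    ("g", "regionserver-2"),
    ("p", "regionserver-3"),
]


def locate_region_server(row_key, region_map=REGION_MAP):
    """Return the server whose region's key range contains row_key."""
    start_keys = [start for start, _ in region_map]
    # The owning region is the last one whose start key is <= row_key.
    idx = bisect_right(start_keys, row_key) - 1
    return region_map[idx][1]


for key in ("apple", "grape", "zebra"):
    print(key, "->", locate_region_server(key))
```

The lookup involves no call to the master, which matches the point above that ordinary reads and writes go straight to the region servers.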
Why use HBase?
Today, many Web applications store billions of rows. Searching for a few particular rows in such a large amount of data takes a lot of time. In such a situation, HBase is a good choice, as lookups by row key are fast. Conventional relational data models often fail to meet the performance requirements of very big databases.
Limitations of HBase
1. HBase does not support SQL natively. It has no SQL-style query structure, so it does not include a query optimiser.
2. HBase cannot be expected to completely replace conventional relational models; some applications are not suited to it.
3. HBase, when integrated with Pig and Hive jobs, sometimes causes memory issues in the cluster.
You can download HBase from the Apache website; at the time of writing, the latest version is 1.2.4. The HBase team recommends installing it in a UNIX/Linux environment; if you run it on Windows, you might want to download and install Cygwin first.