What If Your HBase Seems to Be Working but Is Not Actually Working?

HBase Seems to be Working

HBase is a robust NoSQL database for handling voluminous data. It is built on top of the Hadoop platform, so it is preferred by those who already use Hadoop extensively. HBase favors Consistency and Partition tolerance (the C and P of the CAP theorem) and is well suited to data modeled as key-value pairs.

We employ HBase as the data store for several clients handling streaming data. Its distributed master-slave architecture, with an HMaster coordinating the RegionServers, serves these workloads well.
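To make the key-value model concrete, here is a minimal sketch using the HBase Java client. The table name "sensor_readings", the column family "d", and the row-key format are made up for illustration, and the table is assumed to already exist:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseKeyValueExample {
    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml from the classpath
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("sensor_readings"))) {

            // Write one key-value pair: row key -> column family:qualifier -> value
            Put put = new Put(Bytes.toBytes("device-42#2017-06-01T10:00:00"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("temperature"), Bytes.toBytes("21.5"));
            table.put(put);

            // Read it back by row key
            Get get = new Get(Bytes.toBytes("device-42#2017-06-01T10:00:00"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("d"), Bytes.toBytes("temperature"))));
        }
    }
}
```

Every cell is addressed by row key, column family, column qualifier, and timestamp, which is what makes HBase a natural fit for key-value style lookups on streaming data.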

Having worked on HBase for a while, we have encountered several scenarios in which HBase reported problems. Each case was interesting, and with our knowledge of the technology, good experience, data from the log files, and sometimes plain common sense, we were able to get out of the problem comfortably. Recently, however, we encountered a unique case in our HBase distribution: HBase seemed to be working but was not working!

I know this sounds plain stupid, but it is precisely what happened. We were working on an interesting real-time Internet of Things (IoT) project, using CDH 5.11 as the Hadoop distribution, and the status of HBase on the Cloudera Manager console showed green/amber. However, Impala queries on the HBase tables would not return any data.

Restarting HBase and the cluster did not help, since nothing appeared to be wrong. The stdout and stderr logs of the HBase Master instance did not indicate an error either. The HBase logs, however, did point to a problem: it seemed the Master could not initialize. A closer inspection of the error messages revealed that the failure occurred when HBase tried to split the log files.

A few recommended solutions were as follows –

  1. Increase the configuration parameter for xxxxxxxx from its default value of 300000 ms
  2. Check ZooKeeper for errors (a small sketch of such a check follows this list)
  3. Stop the HBase service, restart all RegionServers first, and then start the Master
  4. And lastly, another gem of a solution: delete HBase and reinstall the service!
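For the second suggestion, a quick way to see whether ZooKeeper is reachable and still holds HBase's coordination state is to list the znodes under "/hbase". The sketch below uses the plain ZooKeeper Java client; the quorum address is a placeholder, and "/hbase" is only the default parent znode (it is configurable via zookeeper.znode.parent):

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class CheckHBaseZnodes {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // "zk-host:2181" is a placeholder for the ZooKeeper quorum used by HBase
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // HBase keeps its coordination state under /hbase; a healthy cluster
        // should at least expose znodes such as "master" and "rs"
        List<String> children = zk.getChildren("/hbase", false);
        System.out.println("Znodes under /hbase: " + children);
        zk.close();
    }
}
```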

None of the above worked (we did not dare try the fourth one!), though I’m sure they might have worked in a few other cases.

We realized the error pointed to a “splitting” file on one of the RegionServers. We checked the Write Ahead Logs (WALs) and noticed that for each RegionServer there was a splitting file under the directory “/hbase/WALs”.

When none of our approaches worked, we decided to try deleting the -splitting files. Since these were part of the WAL, we had to be careful, or else we could have lost data. The steps we followed were as follows (a rough sketch of the deletion step appears after the list) –

  • Stop the data processing on HBase
  • Bring down HBase
  • Manually delete the -splitting files from the RegionServers
  • Restart HBase
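We performed the deletion by hand, but for illustration, here is a rough sketch of what the third step could look like using the Hadoop FileSystem API. It assumes the default WAL root of "/hbase/WALs" mentioned above, and it must only be run after HBase has been fully stopped, since removing these directories throws away any edits that were not yet flushed:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteSplittingWals {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml/hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // HBase keeps one WAL directory per RegionServer under the WAL root;
        // entries being recovered carry a "-splitting" suffix
        Path walRoot = new Path("/hbase/WALs");
        for (FileStatus status : fs.listStatus(walRoot)) {
            if (status.getPath().getName().endsWith("-splitting")) {
                System.out.println("Deleting leftover WAL split entry: " + status.getPath());
                // Only safe once HBase is stopped and you accept the possible
                // loss of un-flushed edits held in these WALs
                fs.delete(status.getPath(), true);
            }
        }
        fs.close();
    }
}
```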

And voila! HBase was up and running and started accepting queries and processing them.

Why does splitting happen?

Before understanding the split policy of HBase, it is essential to know how the HBase write process works: an incoming write is first appended to the Write Ahead Log (WAL) and then added to the in-memory MemStore, which is periodically flushed to HFiles on disk. If a RegionServer goes down, its WAL is split and replayed so that the un-flushed edits can be recovered.

Source: http://bit.ly/2qUJrY8

As per the split policy of HBase, a region is split when the total data size for one of the stores (corresponding to a column family) in the region grows bigger than the configured “hbase.hregion.max.filesize”, which has a default value of 10 GB.

Source: http://bit.ly/2qM7kEK
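As a rough illustration of how this threshold comes into play, the following sketch (HBase 1.x client API, as shipped with CDH 5.x) reads the cluster-wide default and shows how the limit could be overridden for a single table at creation time; the table and column family names are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class SplitThresholdExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Cluster-wide default: a region splits when one of its stores
        // grows past this size (10 GB by default)
        long clusterDefault = conf.getLong("hbase.hregion.max.filesize",
                10L * 1024 * 1024 * 1024);
        System.out.println("hbase.hregion.max.filesize = " + clusterDefault + " bytes");

        // The threshold can also be overridden per table at creation time
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("sensor_readings"));
            desc.addFamily(new HColumnDescriptor("d"));
            desc.setMaxFileSize(20L * 1024 * 1024 * 1024); // split at 20 GB instead of the default
            admin.createTable(desc);
        }
    }
}
```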

If the cluster goes down because of some issue, the HBase services may not restart, because these temporary split files do not get cleaned up. Hence, you need to delete the leftover corrupted split files manually.

Deleting anything from the HBase system file is a tad risky.

Before trying it out, a good understanding of the architecture and various components is essential. Our team was glad that we figured out a way of taming an unexpected behavior of HBase. And our client was happy that we resolved it…!