How We Resolved the dncp_block Verification Log File Issue in HDFS, Saving Critical Time

Hadoop follows a distributed master-slave architecture. In our setup, that meant one NameNode, one Secondary NameNode, and eight DataNodes.

We were using Hadoop to handle a large amount of streaming data from smartphones for one of the leading telecom companies in India. It was an eight-node cluster with Cloudera CDH 5.3.0.

The project was at a critical stage. We were facing a problem wherein the size of “/pdfs/dn/current/BP-12345-IpAddress-123456789/dncp_block_verification.log.curr” and “dncp_block_verification.log.prev” kept growing to hundreds of GBs within hours, which slowed the machines down and eventually led to DataNode service outages.
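For anyone hitting the same symptom, a quick way to confirm it is to check the size of the two verification logs under each block pool directory. The following is a minimal sketch assuming the data directory layout quoted above; adjust the path prefix for your own cluster:

```python
import glob
import os

# Block scanner verification logs under the DataNode data directory
# (directory layout as quoted above; adjust the prefix for your cluster).
for path in glob.glob("/pdfs/dn/current/BP-*/dncp_block_verification.log.*"):
    size_gb = os.path.getsize(path) / float(1024 ** 3)
    print("%s  %.1f GB" % (path, size_gb))
```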

The root cause was an HDFS bug (HDFS-7430), and there was little guidance available on how to work around it. After a detailed discussion with the Hadoop experts at Ellicium, we narrowed the resolution down to two options.

Option 1

Stop the DataNode service and delete the dncp_block_verification files manually. This would require continuous monitoring, as the log files may start growing again on any DataNode (even on the same node after the files have been deleted). A sketch of such a cleanup is shown below.
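The cleanup itself boils down to three steps: stop the DataNode, remove the verification logs, and start the DataNode again. The sketch below assumes a package-based CDH install where the role is controlled via the init script; on a Cloudera Manager cluster the role would be stopped and started from CM instead, and the path prefix follows the article:

```python
#!/usr/bin/env python
"""Option 1 sketch: stop the DataNode, remove the verification logs, restart.

Service name and paths are assumptions for a package-based CDH install,
not the exact procedure from our cluster.
"""
import glob
import os
import subprocess

LOG_GLOB = "/pdfs/dn/current/BP-*/dncp_block_verification.log.*"

# Stop the DataNode first so the files are no longer being appended to.
subprocess.check_call(["service", "hadoop-hdfs-datanode", "stop"])

# Delete both the .curr and .prev verification logs in every block pool dir.
for path in glob.glob(LOG_GLOB):
    os.remove(path)

# Bring the DataNode back up.
subprocess.check_call(["service", "hadoop-hdfs-datanode", "start"])
```

Because the files can reappear and grow again, this cannot be treated as a one-time fix; each DataNode has to be watched (for example, by running a size check like the one earlier from cron) and the cleanup repeated when needed.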

Option 2

Although slightly drastic, the alternative was to turn off the block scanner entirely by setting the HDFS DataNode configuration key dfs.datanode.scan.period.hours to 0 (the default is 504 hours). The downside is that DataNodes would no longer automatically detect corrupted block files.
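Expressed as an hdfs-site.xml property (or the equivalent DataNode safety valve in Cloudera Manager), Option 2 would look like the snippet below. This is only an illustration of the setting named above, not an excerpt from our cluster configuration:

```xml
<!-- Option 2: turn off the DataNode block scanner (default period is 504 hours). -->
<property>
  <name>dfs.datanode.scan.period.hours</name>
  <value>0</value>
</property>
```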

After considering the pros and cons, we went ahead with Option 1. Once implemented, the DataNode services came back up and stayed up, as expected. Hopefully, this issue will be fixed in the next version, CDH 5.4.x.

It was a big relief, and I felt proud, as it saved a lot of cluster downtime.