Enemy #1 for Hadoop adoption – Bad Data Quality

Apr 2, 2021 | Blog

By Kuldeep Deshpande


“Fast is fine but accuracy is everything”  – Wyatt Earp

What Wyatt Earp said about gunfights in the Wild West strangely applies to the Big Data / Hadoop / IoT movement, which is equally wild! In their 2015 Big Data market update, Forbes reported that data quality is the biggest problem in developing big data applications.

In fact, poor data quality is the biggest challenge to Hadoop adoption at the production level.

The initial enthusiasm in enterprises about Big Data / Hadoop / IoT / M2M fizzles out quickly as stakeholders grow frustrated with the poor quality of the data and of the resulting analytics.

In my last three implementations of streaming IoT data on the HBase platform, I observed a consistent pattern of data quality issues. We implemented a data quality measurement framework for streaming data, which I will describe in this blog. The framework is built on the dimensions of data quality defined by DAMA (the Data Management Association).

1. Timeliness –

In the case of streaming data, timeliness assumes the highest significance. A particular sensor may stop sending data for a few hours and then suddenly send a backlog of data, which results in late arrival. Another case we observed is out-of-sequence data: in one IoT implementation, sensors stopped transmitting data for a few minutes, and when the issue was resolved, the latest data was sent first and the older data afterwards.

Timeliness of streaming data can be measured as follows:

Timeliness = (Number of records arriving within a given time limit, e.g. 1 minute) / (Total number of records generated in that timeframe)

Timeliness is usually a post facto measurement, since the total number of records generated can be counted only after those records are actually loaded into the database. We measure timeliness by periodically running a script that compares the timestamp of record generation at the sensor end with the timestamp of the load into the database.
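As a minimal sketch of that periodic check, assuming each record carries a generation timestamp and a load timestamp (the data layout and the one-minute limit are illustrative, not the actual script):

```python
from datetime import datetime, timedelta

def timeliness(records, limit=timedelta(minutes=1)):
    """Fraction of records loaded within `limit` of their generation time.

    `records` is a list of (generated_at, loaded_at) datetime pairs.
    """
    if not records:
        return 1.0  # nothing arrived in the window, so nothing was late
    on_time = sum(1 for gen, load in records if load - gen <= limit)
    return on_time / len(records)

records = [
    (datetime(2021, 4, 2, 10, 0, 0), datetime(2021, 4, 2, 10, 0, 30)),  # on time
    (datetime(2021, 4, 2, 10, 0, 0), datetime(2021, 4, 2, 10, 5, 0)),   # backlog, late
]
print(timeliness(records))  # 0.5
```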

2. Completeness –

Data completeness can be defined as the availability of the mandatory elements of the data. Some IoT systems transfer data in the form of text files that contain data accumulated at the sensor end over a short period (say, one minute). To support completeness checks, a header and a footer are added to each file. An incomplete transmittal results in a file arriving on the server without a header or footer, which indicates that the file in question may be lacking completeness. We have measured completeness for streaming data as follows:

Completeness = (Number of files received complete on the server) / (Total number of files received on the server)
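The header/footer check above can be sketched as follows; the `HDR`/`EOF` marker strings and the file layout are assumptions for illustration, since each system defines its own markers:

```python
HEADER = "HDR"  # assumed header marker; real systems define their own
FOOTER = "EOF"  # assumed footer marker

def is_complete(lines):
    """A file is complete when it begins with the header and ends with the footer."""
    return len(lines) >= 2 and lines[0].startswith(HEADER) and lines[-1].startswith(FOOTER)

def completeness(files):
    """`files` maps file name -> list of lines received on the server."""
    if not files:
        return 1.0
    complete = sum(1 for lines in files.values() if is_complete(lines))
    return complete / len(files)

files = {
    "sensor_0101.txt": ["HDR 2021-04-02T10:00", "23.1", "23.4", "EOF 2"],
    "sensor_0102.txt": ["23.9", "24.0"],  # transmission cut off: no header or footer
}
print(completeness(files))  # 0.5
```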

3. Integrity –

Data that is missing important linkages to other data elements (mainly the master data) has integrity issues. In multiple IoT implementations, we observed that the master data about the devices must be configured on the server before the devices can start sending data. In a telecom IoT project, the handset had to be registered first before it could send streaming data to the server for tracking user experience and network performance. Configuration of the device master data was often missed, and devices started sending data before configuration was done. This resulted in an "orphan sensor data" condition. We measured data integrity as:

Integrity = 1 – (Number of orphan / un-configured devices) / (Total number of devices in the installation)
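A sketch of that measurement, assuming we have the set of device ids in the master data and the set of ids observed in the stream; taking the installation total as the union of the two sets is my assumption, not a detail from the projects:

```python
def integrity(configured_devices, reporting_devices):
    """1 - (orphan devices / total devices in the installation).

    `configured_devices`: device ids present in the master data.
    `reporting_devices`:  device ids seen in the streaming data.
    """
    total = configured_devices | reporting_devices  # assumed definition of "installation"
    if not total:
        return 1.0
    orphans = reporting_devices - configured_devices  # sending data, never configured
    return 1 - len(orphans) / len(total)

configured = {"dev-01", "dev-02", "dev-03"}
reporting = {"dev-01", "dev-02", "dev-04"}  # dev-04 is an orphan
print(integrity(configured, reporting))  # 0.75
```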

4. Accuracy and Consistency –

Traditional data quality literature has defined accuracy and consistency as two separate dimensions. However, in streaming / sensor / IoT data, these two are closely interrelated. A sensor that records the temperature of a device as 70 degrees and one second later registers 125 degrees is most likely faulty. These kinds of accuracy and consistency rules cannot be generalized; they need to be captured as business rules by the domain experts. Records are then flagged as inaccurate based on those accuracy rules. Accuracy can be defined as:

Accuracy = (Number of records not flagged as inaccurate per the business rules) / (Total number of records)
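As one example of such a business rule, the temperature-jump case from above can be sketched like this; the 20-degree threshold is an illustrative assumption, and a real implementation would hold many such domain-specific rules:

```python
def flag_inaccurate(readings, max_jump=20.0):
    """Flag a reading as inaccurate when it jumps more than `max_jump`
    degrees from the previous reading -- one example business rule.

    `readings` is a time-ordered list of temperature values.
    """
    flags = [False]  # nothing to compare the first reading against
    for prev, cur in zip(readings, readings[1:]):
        flags.append(abs(cur - prev) > max_jump)
    return flags

def accuracy(readings, max_jump=20.0):
    """Fraction of records not flagged as inaccurate."""
    flags = flag_inaccurate(readings, max_jump)
    return flags.count(False) / len(flags)

readings = [70.0, 71.0, 125.0, 72.0]  # 71 -> 125 in one step is a suspect jump
print(accuracy(readings))  # 0.5
```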

We are piloting this data quality framework for streaming data at multiple implementations. We are also developing a combined DQ-Index, which differs for each implementation depending on the importance of each dimension of data quality. Let us stay awake to the devils of bad data quality!
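One simple way to combine the four dimension scores into such an index is a weighted average; the weights below are purely illustrative, since the post notes they differ per implementation:

```python
def dq_index(scores, weights):
    """Weighted average of the per-dimension data quality scores (all in 0..1)."""
    total_weight = sum(weights.values())
    return sum(scores[d] * weights[d] for d in scores) / total_weight

# Hypothetical scores and per-implementation weights, for illustration only.
scores = {"timeliness": 0.95, "completeness": 0.99, "integrity": 0.90, "accuracy": 0.97}
weights = {"timeliness": 4, "completeness": 2, "integrity": 1, "accuracy": 3}
print(round(dq_index(scores, weights), 3))  # 0.959
```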
