5 Things to Consider Before Streaming Data into Hadoop

Data is generated at a rapid pace these days. Thanks to Big Data technologies like Hadoop, processing that data has become much easier.

Having made it possible for several of our clients to process streaming data and gain valuable insights, we have come up with a list of points that need to be considered when streaming data into Hadoop. Some of these may seem pretty obvious, but based on experience, I can confidently say that they are easily missed for want of experience.

1. How are you going to transfer data into the Hadoop cluster –

Data can be transferred using FTP or HTTP. Unless there is a specific need for FTP, we recommend HTTP, because persistent connections and automatic compression make transfers more efficient.
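For illustration, here is a minimal sketch of sending a file over HTTP from the source side. The endpoint URL and file name are hypothetical, not part of any real setup; the point is the persistent connection and gzip compression that make HTTP the more efficient option:

```python
import gzip
import requests

# Hypothetical collection endpoint; replace with your receiver's URL.
ENDPOINT = "http://collector.example.com/ingest"

def send_file(path, session):
    """POST a gzip-compressed file over a persistent HTTP connection."""
    with open(path, "rb") as f:
        payload = gzip.compress(f.read())
    headers = {
        "Content-Encoding": "gzip",
        "Content-Type": "application/octet-stream",
    }
    resp = session.post(ENDPOINT, data=payload, headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.status_code

if __name__ == "__main__":
    # A Session reuses the underlying TCP connection across requests
    # (HTTP keep-alive), which is what makes repeated transfers cheap.
    with requests.Session() as session:
        print(send_file("sensor_readings.csv", session))
```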

What we do – We normally begin small, using a Tomcat server on Linux to receive data. Tomcat is open-source and very easy to install and configure. It also gives its best performance on Linux. Since Hadoop is also Linux-based, this augurs well. Data is received in the landing folder and then processed by shell or Python scripts.

For production deployments where data is expected to be received from a few hundred thousand sources, we go for an Apache HTTP server (or an IIS server in a few cases, where the client is a Microsoft shop).
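As a rough sketch of that handoff (the folder paths here are illustrative, not our actual layout), a small Python script can sweep the landing folder and queue new files for the processing scripts:

```python
import shutil
import time
from pathlib import Path

# Illustrative paths; in practice these map to the server's upload
# directory and a staging area that the downstream scripts read from.
LANDING = Path("/data/landing")
STAGING = Path("/data/staging")

def sweep_landing_folder():
    """Move newly received files out of the landing folder for processing."""
    STAGING.mkdir(parents=True, exist_ok=True)
    for incoming in LANDING.glob("*.csv"):
        # Move, rather than copy, so a file is never picked up twice.
        shutil.move(str(incoming), STAGING / incoming.name)
        print(f"queued {incoming.name} for processing")

if __name__ == "__main__":
    while True:
        sweep_landing_folder()
        time.sleep(10)  # poll every 10 seconds
```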

2. How much raw data needs to be stored –

This is a tricky question. The more raw data you need to store, the bigger the Hadoop cluster needs to be (remember, Hadoop uses a replication factor of 3 by default, so any data entering Hadoop is stored as 3 copies). Though Hadoop uses commodity hardware cost-effectively, our experience is that clients are very particular about every extra node they are asked to add.

What we do – We normally recommend storing raw data for no more than 7 days, as data beyond that is rarely essential. If there are any bottlenecks in processing, they can usually be addressed within 7 days, which still leaves adequate time to recover the backlog.
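To put that in perspective, a back-of-the-envelope sizing calculation (the daily ingest figure below is an assumption for illustration, not a client number) is simply daily ingest × retention days × replication factor:

```python
# Illustrative sizing only; the daily ingest figure is an assumption.
daily_ingest_gb = 500      # raw data received per day
retention_days = 7         # raw data kept for a week, as recommended above
replication_factor = 3     # Hadoop's default replication

raw_footprint_gb = daily_ingest_gb * retention_days * replication_factor
print(f"{raw_footprint_gb} GB (~{raw_footprint_gb / 1024:.1f} TB) of raw data on HDFS")
# 500 * 7 * 3 = 10,500 GB, roughly 10 TB before any processed output
```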

3. In what format is the data received –

CSV still continues to be the preferred format due to its ease of transfer, parsing, and processing for simple data models or data coming from legacy systems. However, it works well only where the data is evenly structured. So, if there is an option to receive data in CSV, we recommend going for it. Other formats like JSON or XML are better suited for complex data models.

Also, popular languages like Python and PHP offer libraries for parsing and dealing with these formats, which makes processing easy.

What we do – We normally use shell or Python scripts to parse and massage the data and convert it to CSV format. It is then passed to Flume for loading into Hadoop.
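As a minimal sketch of that massaging step (the field names are invented for illustration), a Python script might flatten newline-delimited JSON records into CSV before the file is handed to Flume:

```python
import csv
import json

# Illustrative field names; real feeds will have their own schema.
FIELDS = ["device_id", "timestamp", "temperature", "status"]

def json_lines_to_csv(in_path, out_path):
    """Convert a file of newline-delimited JSON records to CSV."""
    with open(in_path) as src, open(out_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        for line in src:
            record = json.loads(line)
            # Keep only the fields we care about; missing ones become blanks.
            writer.writerow({f: record.get(f, "") for f in FIELDS})

if __name__ == "__main__":
    json_lines_to_csv("readings.jsonl", "readings.csv")
    # The resulting CSV is what gets dropped into the Flume spooling directory.
```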

4. Have you considered pre-processing the data –

This is an important aspect that needs special attention. Data may need to be reformatted, or unwanted fields skipped. Trimming the data and carrying only the requisite fields forward to Hadoop is always worthwhile.

What we do – We always use multi-threading in shell scripts on the Linux servers to speed up the processing multi-fold. Wherever possible, we also distribute the processing on the different nodes to make the data available to the Flume agents running on the data nodes, thereby improving throughput.
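A rough Python equivalent of that parallel pre-processing (the paths and the column-trimming rule are assumptions for the example) might look like this:

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

# Illustrative locations; in practice these point at the staging area
# and the directory watched by the local Flume agent.
STAGING = Path("/data/staging")
OUTBOX = Path("/data/flume-spool")
KEEP_COLUMNS = [0, 1, 3]  # assumed: only these CSV columns go forward

def trim_file(path):
    """Drop unwanted columns so only the requisite fields move to Hadoop."""
    out_path = OUTBOX / path.name
    with open(path) as src, open(out_path, "w") as dst:
        for line in src:
            cols = line.rstrip("\n").split(",")
            dst.write(",".join(cols[i] for i in KEEP_COLUMNS if i < len(cols)) + "\n")
    return out_path.name

if __name__ == "__main__":
    OUTBOX.mkdir(parents=True, exist_ok=True)
    files = list(STAGING.glob("*.csv"))
    # Process files in parallel, one worker per CPU core by default.
    with ProcessPoolExecutor() as pool:
        for name in pool.map(trim_file, files):
            print(f"trimmed {name}")
```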

5. Have you considered delayed data –

While data needs to be loaded, processed, and made available for visualization in a short time, the architecture and processing also need to take care of delayed data. This is especially important for time-series data, which is visualized and analyzed over specific time spans. Data might get delayed, but it still needs to be processed and put in the appropriate bucket.

What we do – We normally impose a time limit for the delay. For example, only data that arrives within 48 hours of being generated will be processed. If data arrives later than that, it is not processed in the regular cycle. This period will vary based on specific client requirements and use cases, but the concept remains the same: impose a restriction on the amount of delay you are willing to tolerate.
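As a minimal sketch of that cutoff (the 48-hour window and the timestamp layout are assumptions for the example), late records can be split out before the regular processing cycle:

```python
import csv
from datetime import datetime, timedelta

MAX_DELAY = timedelta(hours=48)  # assumed tolerance for late-arriving data

def split_by_delay(in_path, on_time_path, late_path, now=None):
    """Route records into the regular cycle or a late-arrivals file."""
    now = now or datetime.utcnow()
    cutoff = now - MAX_DELAY
    with open(in_path) as src, \
         open(on_time_path, "w", newline="") as ok, \
         open(late_path, "w", newline="") as late:
        ok_w, late_w = csv.writer(ok), csv.writer(late)
        for row in csv.reader(src):
            # Assumes no header row and an ISO-8601 event time in column 0.
            event_time = datetime.fromisoformat(row[0])
            (ok_w if event_time >= cutoff else late_w).writerow(row)

if __name__ == "__main__":
    split_by_delay("readings.csv", "on_time.csv", "too_late.csv")
```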

Once data is loaded into Hadoop, it can be processed as required and made available for visualization. We will cover those details in a later blog post.

I hope you found these tips useful. I look forward to your comments and experiences in this area.