Resolving Data Processing Challenges on the Hadoop Data Lake

Recently, we worked on an exciting project for our client, a leading specialty insurance company. The client needed curated information, derived from large volumes of policy and claims data, to support better decision-making and analytics.

This crucial business data was stored across multiple source systems, which made extracting actionable business insights complex and time-consuming. In such cases, it becomes imperative for businesses to have a homogeneous data system that serves as a single source of truth, so we proposed creating a data lake to play that role.

Architecture

Using Sqoop, we moved all the data from our client's source systems into Hadoop Hive tables. We then validated the ingested data using business rules and Spark-based validation checks. Handling data at such a large scale, we faced many challenges, especially while loading the data into the data lake.
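
The validation rules themselves are client-specific, so the snippet below is only a minimal sketch of what a Spark-based check of an ingested Hive table can look like; the table name staging.policy and its columns are assumptions made purely for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical validation pass over an ingested Hive table; the table and
# column names are placeholders, not the client's actual schema.
spark = SparkSession.builder.appName("ingestion-validation").enableHiveSupport().getOrCreate()

policies = spark.table("staging.policy")

# Technical checks: row count and completeness of a mandatory key column.
row_count = policies.count()
null_policy_ids = policies.filter(F.col("policy_id").isNull()).count()

# Example business rule: a policy's effective date must not be after its expiry date.
bad_date_ranges = policies.filter(F.col("effective_date") > F.col("expiry_date")).count()

print(f"rows={row_count}, null policy ids={null_policy_ids}, bad date ranges={bad_date_ranges}")
```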

Challenges

  1. We had more than 200 Hive tables, each with 500+ columns.
  2. Multiple source tables had to be read and joined to load a single target table.
  3. The volume of data we were fetching was very large, on the order of tens of terabytes.
  4. Processing and analyzing the data involved complex queries; for example, a single query could join 15 to 20 tables and include several sub-queries along with business logic.

Impact of these Challenges

  1. Data loading times increased because of the vast amount of data generated by the client; queries used to take 2 to 3 hours to execute.
  2. The increased processing time to load data into tables sometimes caused queries to fail with timeout errors.
  3. Managing the data became the utmost priority, with every join having to be checked and rechecked. An error in a single join could leave a query stuck and make it extremely difficult to debug.

Data processing took an excessive amount of time, and we could not afford to spend more time on it due to stringent timelines. We were desperate to simplify things and, at the same time, to speed up the data processing tasks. We had multiple technical discussions and tried various approaches; some worked, some did not.

I would like to walk you through a couple of the things that helped us achieve our goals.

Things that worked for us:

  1. Using the Spark engine

    Processing and loading such a large quantity of data took a long time, and loading it into Hive with the MapReduce engine proved very slow and tedious. Hence, we decided to use the Spark engine, which brought our loading times down from a few hours to 20 to 40 minutes.
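
    How the Spark engine is wired in depends on the cluster setup (Hive on Spark versus submitting the load through Spark SQL). The sketch below shows the latter, running an INSERT ... SELECT load through a Spark session with Hive support; the table and column names are placeholders rather than our actual schema.

    ```python
    from pyspark.sql import SparkSession

    # Run the Hive load through Spark SQL so it executes on the Spark engine
    # rather than MapReduce. Table and column names are placeholders.
    spark = SparkSession.builder.appName("hive-load-on-spark").enableHiveSupport().getOrCreate()

    spark.sql("""
        INSERT OVERWRITE TABLE curated.claims_summary
        SELECT c.claim_id, c.policy_id, c.claim_amount, p.line_of_business
        FROM staging.claims c
        JOIN staging.policy p ON c.policy_id = p.policy_id
    """)
    ```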

  2. Using intermediate stages

    The data loaded into the target Hive tables was populated from multiple source tables, which took an immense amount of time. We therefore introduced intermediate (staging) tables, which simplified loading the data into the target tables.
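
    As a rough, hypothetical sketch of this staging approach (table names are placeholders), the heavy multi-table join is materialised once into an intermediate table, and the target is then loaded from that much simpler pre-joined result.

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("staged-load").enableHiveSupport().getOrCreate()

    # Step 1: materialise the heavy multi-table join once into a staging table.
    # All table and column names are placeholders.
    spark.sql("""
        CREATE TABLE intermediate.policy_claims_stage
        STORED AS PARQUET
        AS
        SELECT p.policy_id, p.line_of_business, c.claim_id, c.claim_amount
        FROM staging.policy p
        JOIN staging.claims c ON c.policy_id = p.policy_id
    """)

    # Step 2: load the target from the smaller, pre-joined staging table.
    spark.sql("""
        INSERT OVERWRITE TABLE curated.claims_by_line
        SELECT line_of_business, SUM(claim_amount) AS total_claim_amount
        FROM intermediate.policy_claims_stage
        GROUP BY line_of_business
    """)
    ```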

  3. Selecting only required columns

    Our source tables had numerous columns, many exceeding 500. This meant we had to be more precise and selective about the data we loaded from source to target Hive tables. We used sub-queries to select only the required and essential columns from each source table before loading them into the target Hive table.
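
    For illustration, the sketch below selects only a handful of columns from a wide source table through a sub-query before loading the target; the table and column names are hypothetical.

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("column-pruning").enableHiveSupport().getOrCreate()

    # Select only the required columns from a wide (500+ column) source table
    # through a sub-query, instead of SELECT *. Names are placeholders.
    spark.sql("""
        INSERT OVERWRITE TABLE curated.policy_core
        SELECT s.policy_id, s.policy_status, s.effective_date, s.premium_amount
        FROM (
            SELECT policy_id, policy_status, effective_date, premium_amount
            FROM staging.policy
        ) s
    """)
    ```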

  4. Using an optimized storage format

    We used the Parquet file format to store data in our tables. Parquet stores nested data structures in a flat columnar format, which made our storage and queries roughly 3x more efficient.
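
    As a small, hypothetical example of declaring a Parquet-backed Hive table and loading it from Spark (the table definition and names are illustrative only):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-load").enableHiveSupport().getOrCreate()

    # Declare the target table with Parquet storage; names and types are placeholders.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS curated.claims_parquet (
            claim_id     STRING,
            policy_id    STRING,
            claim_amount DECIMAL(18, 2)
        )
        STORED AS PARQUET
    """)

    spark.sql("""
        INSERT OVERWRITE TABLE curated.claims_parquet
        SELECT claim_id, policy_id, claim_amount
        FROM staging.claims
    """)
    ```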

This was an exciting use case to tackle, and it underscores the importance of creating a robust data lake to derive actionable insights. There will always be roadblocks and obstacles during the implementation of any kind of solution, but the key lies in:

  1. Keeping your priorities clear about what you want to achieve
  2. Having team discussions
  3. Practically trying out various approaches rather than relying on pure theory
  4. Remembering that some things from the old school of thought always help