How We Built a Complex Data Lake On Azure

Recently, we built a data lake on Azure for a healthcare organization. Our client had crucial, confidential data stored in silos across systems like IBM DashDB, Oracle DB, and SQL Server. They needed to bring all of this structured and unstructured data into one place to empower data consumers and stakeholders across the organization with better decision-making and analytics. Azure was their preferred cloud platform for building the data lake.

Data Lake on Azure

Azure provides various options to implement a data lake. It has storage services like Blob Storage, Azure Data Lake Storage Generation 1, and Azure Data Lake Storage Generation 2.

Overview of Azure Storage Services

  1. Azure Blob Storage: Blob Storage is an object store for unstructured data such as text, audio, video, and binary files. It is highly scalable and organizes data into containers, each of which is a set of blobs.
  2. Azure Data Lake Storage Generation 1 (ADLS Gen1): ADLS Gen1 is hierarchical, file-based storage that supports nesting files within folders. It enables folder-level security and performance optimizations, and supports parallel read and write operations.
  3. Azure Data Lake Storage Generation 2 (ADLS Gen2): ADLS Gen2 combines Azure Blob Storage and ADLS Gen1, giving you the choice of object-level or file-level access. Unlike ADLS Gen1, this requires enabling a ‘hierarchical namespace’ at the storage account level. Data stored in ADLS Gen2 can be accessed through different endpoints depending on the data access patterns defined (see the sketch after this list).
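To make the hierarchical namespace idea concrete, here is a minimal sketch (not our production code) that connects to an ADLS Gen2 account and creates a nested folder path with the Python SDK. The account name, container, and folder path are placeholders.

```python
# Sketch only: connect to an ADLS Gen2 account (hierarchical namespace enabled)
# and create a nested directory. Names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

credential = DefaultAzureCredential()
service_client = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=credential,
)

# A container ("file system") is the root of the hierarchy.
file_system_client = service_client.get_file_system_client(file_system="datalake")

# With the hierarchical namespace, nested folders are real directories,
# not just name prefixes as in plain Blob Storage.
directory_client = file_system_client.create_directory("staging/pat_records/2021/06/15")
```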

In our use case, we stored health-related data such as patient details (Pat_Records department) and prescriptions given to patients (Med_Records department) in a raw layer, in its native form. Depending on the department, this data is generated on a daily, weekly, or monthly basis. The objective was to process it in a processing layer and move it into the appropriate folders.

After much research and considering client-specific requirements, we selected ADLS Gen 2.

Below are some reasons why ADLS Gen2 was the best fit for this project:

  1. One of the major requirements was the ability to trace the history of processed data without considerable cost overhead. We achieved this in ADLS Gen2 by archiving data in each layer. ADLS Gen2 supports Hot and Cool storage tiers, which helps segregate data in a cost-optimized way.
  2. As we were dealing with healthcare data, security was critical, and users had to be given access to specific folders according to their roles. ADLS Gen2 fits this scenario well, as it provides file- and directory-level security through Access Control Lists (ACLs) and Role-Based Access Control (RBAC). Also, since the data was updated daily, we needed to keep it in day-wise partitions in separate folders, which ADLS Gen2's hierarchical file system made possible (see the sketch after this list).
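The sketch below illustrates these two levers with the Python SDK: a POSIX-style ACL scoped to a department folder, and moving an archived file to the Cool access tier. Storage account, container, group ID, and paths are placeholders, not the client's actual setup.

```python
# Sketch only: directory-level ACL plus Cool-tier archiving on ADLS Gen2.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient
from azure.storage.filedatalake import DataLakeServiceClient

credential = DefaultAzureCredential()

# Grant the Med_Records AAD group read/execute on its own day-wise folder only.
dls = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=credential,
)
directory_client = dls.get_file_system_client("datalake").get_directory_client(
    "staging/med_records/2021/06/15"
)
directory_client.set_access_control(
    acl="user::rwx,group::r-x,other::---,group:<med-records-aad-group-id>:r-x"
)

# Move an already-processed archive file to the Cool tier via the blob endpoint.
blobs = BlobServiceClient(
    account_url="https://<storage-account>.blob.core.windows.net",
    credential=credential,
)
blobs.get_blob_client("datalake", "archive/med_records/2021/05/15/records.csv") \
     .set_standard_blob_tier("Cool")
```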

Implementation

We divided the implementation process into three layers:

  1. Staging layer
  2. Processing Layer
  3. Publishing Layer

Staging Layer: Being a healthcare organization, our client needs to maintain various kinds of data about patients, their medical history, prescriptions, etc. This data is owned by multiple departments (e.g., Med_Records, Pat_Records). Before Azure, it was spread across IBM DashDB, Oracle DB, and SQL Server.

This data was a mix of structured and unstructured data in text and PDF files. We ingested it into the Azure data lake in its raw form using Azure Data Factory.
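The ingestion itself was built as Azure Data Factory copy pipelines; purely as an illustration, here is a hedged sketch of triggering such a pipeline and checking its status with the Python management SDK. The subscription, resource group, factory, pipeline name, and parameter are all assumptions for the example.

```python
# Sketch only: trigger an existing ADF pipeline that lands source extracts
# into the ADLS Gen2 staging container, then poll its run status.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run = adf.pipelines.create_run(
    resource_group_name="<resource-group>",
    factory_name="<data-factory>",
    pipeline_name="copy_source_to_staging",      # hypothetical pipeline name
    parameters={"load_date": "2021-06-15"},      # hypothetical parameter
)

status = adf.pipeline_runs.get("<resource-group>", "<data-factory>", run.run_id).status
print(f"Pipeline run {run.run_id}: {status}")
```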

Processing Layer: Data stored in the staging layer is cleansed, deduplicated, and validated using Azure Data Factory, then transformed with business rules and loaded into the publishing layer.

To recall one specific challenge: data from ‘med_records’ arrived daily as three-day incremental extracts, so the same records appeared in consecutive loads and had to be deduplicated. We handled this with a dedup workflow in Azure Data Factory, and the cleansed data was stored back in ADLS Gen2.
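The actual deduplication ran inside the Azure Data Factory workflow; the sketch below only illustrates the logic, assuming a record key column and a load-date column (both names are illustrative).

```python
# Sketch only: keep the latest version of each record across overlapping
# daily extracts. Column names "record_id" and "load_date" are assumptions.
import pandas as pd

def dedupe_incremental(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Combine the overlapping daily extracts and keep the latest copy of each record."""
    combined = pd.concat(frames, ignore_index=True)
    combined = combined.sort_values("load_date")
    return combined.drop_duplicates(subset=["record_id"], keep="last")
```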

Publishing Layer: The publishing layer resides in Azure SQL Data Warehouse, which lets downstream systems fetch data within a few seconds for further analysis. The processed data was visualized on Power BI dashboards as KPIs such as patient follow-ups, average treatment charges, weekly medicine usage, available medicine stock, patient satisfaction deviation, etc.
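As an illustration of how a downstream consumer might query a KPI from the publishing layer, here is a hedged sketch using pyodbc. The server, database, table, and column names are placeholders, not the client's schema.

```python
# Sketch only: query a weekly-medicine-usage KPI from Azure SQL Data Warehouse.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server>.database.windows.net;"
    "DATABASE=<warehouse>;"
    "Authentication=ActiveDirectoryInteractive;"
)

cursor = conn.cursor()
cursor.execute(
    "SELECT medicine_name, SUM(quantity) AS weekly_usage "
    "FROM dbo.medicine_usage "                    # hypothetical table
    "WHERE usage_date >= DATEADD(day, -7, GETDATE()) "
    "GROUP BY medicine_name"
)
for medicine_name, weekly_usage in cursor.fetchall():
    print(medicine_name, weekly_usage)
```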

It was a great experience building a data lake on Azure, though it did involve a good amount of experimentation. Based on that experience, I have compiled a few recommendations that might help you while building a data lake on Azure:

Things to consider and recommendations for building a data lake on Azure

  1. Selecting the right tool for data movement: Before selecting a tool for data movement, check its constraints and compatibility with ADLS. The connectivity of the selected tool with your source systems and ADLS plays a vital role. I would highly recommend Azure Data Factory for data movement, because it is an integral part of the Azure ecosystem and works well with other components like HDInsight, SQL Server, and ADLS.
  2. Data should be queryable: Data stored in the data lake should be queryable. In our case, we kept frequently accessed data in Azure SQL Data Warehouse so that end users could derive insights by running queries and reports.
  3. Files should be sufficiently large: Another essential factor to consider while building a data lake on Azure is file size. As per the Azure documentation, files stored in ADLS should be large (greater than 256 MB), as many small files create overhead. Hence, it is important to batch data into larger files before landing it in ADLS.
  4. Make sure to implement ETL best practices on Azure: Auditing, restartability, and error logging should be implemented in the Azure data lake to trace data movement. This also helps rectify errors that occur during data movement or mapping. In our use case, we were dealing with a large number of files daily, so it was important to track vital details such as the number of records processed, the number of records rejected and the reasons for rejection, etc., for every execution.

For this, we implemented an auditing mechanism with Azure Data Factory pipelines, which helped with reconciliation. To avoid re-running the entire workflow after a failure, we also implemented a restartability mechanism that resumes execution from the last point of failure.
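A minimal sketch of these two ideas follows; it is not the client's actual pipelines, just an illustration of an audit record per stage and a checkpoint that lets a rerun skip stages that already succeeded. Paths and field names are illustrative.

```python
# Sketch only: per-stage audit records plus a checkpoint file for restartability.
import json
from datetime import datetime, timezone
from pathlib import Path

CHECKPOINT = Path("checkpoint.json")

def write_audit(stage: str, processed: int, rejected: int, reasons: list[str]) -> dict:
    """Capture the per-stage counts needed for reconciliation."""
    return {
        "stage": stage,
        "run_at": datetime.now(timezone.utc).isoformat(),
        "records_processed": processed,
        "records_rejected": rejected,
        "rejection_reasons": reasons,
    }

def run_with_restartability(stages: dict) -> list:
    """Run stages in order, skipping any stage the checkpoint marks as complete."""
    done = json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else []
    audits = []
    for name, stage_fn in stages.items():
        if name in done:
            continue  # stage already succeeded in a previous run
        processed, rejected, reasons = stage_fn()
        audits.append(write_audit(name, processed, rejected, reasons))
        done.append(name)
        CHECKPOINT.write_text(json.dumps(done))  # persist progress after each stage
    return audits
```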

Our experience with Azure

From an operational perspective, Azure is an easy-to-use cloud platform, especially for teams already aligned with the Microsoft ecosystem. It is also cost-effective and offers a wide range of tools, all under a single umbrella.

Another considerable advantage is that Azure is easy to learn and its documentation is good. I would also like to mention the ever-supportive Azure community, which has rescued me several times.

If you have questions or need to discuss anything around Azure from the article, you can get in touch with me here: [email protected]