How We Built A Complex Data Lake On Azure
About Rutuja Deshmukh
Recently we worked to build a data lake on Azure for a healthcare organization. Our client had crucial and confidential data stored in silos across systems like IBM DashDB, Oracle DB, and SQL server. The client required to get all this structured, unstructured data at one place to empower various data consumers and stakeholders across the organization for better decision making and analytics. Azure was their preferred cloud technology to build this data lake.
Data Lake on Azure:
Azure provides various options to implement a data lake. It has storage services like Blob Storage, Azure Data Lake Storage Generation 1, and Azure Data Lake Storage Generation 2.
Overview of Azure Storage Services
1. Azure Blob Storage
Azure blob storage is a storage system that can hold unstructured data like text, audio, video, and binary. Blob storage is highly scalable. It stores the data in the form of containers which are nothing but a set of blobs.
2. Azure Data Lake Storage Generation 1 (ADLS Gen1) :
ADLS Gen1 is Hierarchical file-based storage that supports nesting of files within the folders. This storage enables folder level security and performance optimizations. It also supports parallel read and writes operations.
3. Azure Data Lake Storage Generation 2(ADLS Gen2) :
ADLS Gen2 is the combination of Azure blob and ADLS Gen1. It provides the choice of object-level access or file-level access. In ADLS Gen2, we need to enable ‘hierarchical namespace’ on storage account level for using this facility which we do not need in ADLS Gen1. The data stored in ADLS Gen2 can be access through endpoints based on data access patterns defined.
In our use case, we were storing health-related data like patient details(pat_records department), prescriptions given to the patients(Med_Records department), etc in a raw layer in the native form. About the departments, this data gets generated on a daily, weekly, or monthly basis. The objective was to process the data in a processing layer and relocate it in appropriate folders.
After lots of research and considering client specific requirements, we selected ADLS Gen 2.
Below are some reasons why ADLS Gen2 fitted best for this project:
- One of the major requirements was the ability to trace the history of processed data without big cost overhead. We could achieve this in ADLS Gen 2 by archiving data in each layer. ADLS Gen2 supports Hot and Cool storage tiers to store the data. This helps to segregate the data in an optimized way.
- As we were dealing with healthcare domain data, data security was critical. Hence it was crucial to provide access to the users to a specific folder according to their role. ADLS Gen2 fits the best in such a scenario as it provides file and directory level security with the help of Access Control List(ACL) and Role-Based Access Control(RBAC). Also, the data was getting updated daily. Hence, keeping the data in the form of day-wise partitions in different folders was the need. It was possible to achieve this with ADLS Gen2 due to its hierarchical file system feature.
We used the following Azure cloud services to build the data lake: Azure data factory, Azure data lake, and Azure data warehouse.
We divided the implementation process into three layers :
- Staging layer
- Processing Layer
- Publishing Layer
1. Staging Layer :
Being a healthcare organization, our client needs to maintain various kinds of data about the patients, their medical history, prescriptions, etc. This data is owned by various departments (Ex.Med_Records, Pat_Records, etc). Before Azure, this data was stored across IBM DashDB, Oracle DB, and SQL server.
This data was a mix of structured and unstructured data in the form of text and PDF files. We ingested this data to Azure data lake in its raw form using Azure data factory.
2. Processing Layer :
Data stored in the staging layer is cleansed, deduped, and validated using the Azure data factory. This data is transformed using business rules and loaded into the publishing layer using the Azure data factory.
To recall one specific challenge, in our case data from ‘med_records’ was coming daily for 3 incremental days. This resulted in the need for deduplication of records. We achieved this deduplication of records by dedup workflow in Azure data factory. This cleansed data was stored back in ADLS Gen2.
3. Publishing Layer :
The publishing layer resides on Azure SQL data warehouse. This enables downstream systems to fetch the data from Azure SQL data warehouse within a few seconds to perform further analysis of the data. The processed data was used to visualize on Power BI dashboards in the form of KPIs like patients follow-up, average treatment charges, weekly medicine usage, available stock of medicines details, patients satisfaction deviation, etc.
It was a great experience building a data lake on Azure. At the same time, we did go through a good amount of experimentation. Based on our experience, I have compiled recommendations that might help you while building a data lake on Azure:
Things to consider and recommendations for a building a data lake on Azure:
1. Selecting the right tool for data movement
Before selecting a tool for data movement, make sure to check the constraints and compatibility of that tool with the ADLS. The connectivity of selected tool with your source data system and ADLS plays a vital role. I would highly recommend using Azure Data Factory for data movement. Because it is an integral part of the Azure ecosystem and works well with other components like HD Insights, SQL Server, and ADLS.
2. Data should be queryable data
Data stored in the data lake should be queryable. In our case, we stored data on Azure SQL data warehouse as users required frequent data access. So that end users can derive insights by running queries and reports.
3. The file should be large
Another important factor to consider while building a data lake on Azure is the file sizes of data. As per Azure documentation, the file size should be large (> 256MB ) if you want to store the data in ADLS as small file sizes can create overhead. Hence, the batching of data into larger files is important while storing it into ADLS.
4. Make sure to implement ETL best practices on Azure
Auditing, restart ability, error logging should implement in Azure data lake to trace the data movement. This also helps to rectify the error that might occur during data movement/mapping. In our use case, we were dealing with a large number of files daily. Hence, it was important to keep track of vital details like the number of records processed, the number of records rejected and reasons for rejection, etc. for every execution. For this, we implemented an auditing mechanism by designing Azure data factory pipelines which helped in reconciliation. To avoid the entire execution of the workflow after failure, we implemented the restart ability mechanism which would restart the workflow execution from the last point of failure.
Our experience with Azure
From an operational perspective, Azure is an easy to use cloud service, especially for those who are more aligned to the Microsoft ecosystem. Also, Azure is Cost-effective and offers a wide variety of tools for various (examples) all under a single umbrella.
Another huge advantage Azure has is that it is easy to learn and documentation around it is good. I would also like to mention the ever-supportive Azure community which has rescued me several times.
If you have questions or need to discuss anything around Azure from the article, you can get in touch with me here: [email protected]