Data Lake For Beverage Company In Canada
About the Client
The client is North America’s most diversified and successful private beverage company, focused on the alcoholic beverage sector.
Business Requirement
- Create a data lake to collect data from a variety of sources and make it available for analysis.
- Unstructured data in PDF/DOC/DOCX files should be queryable and searchable as well.
Our Solution
- An ingestion framework was built to import SQL Server source data and PDF/DOC files into HDFS/HBase.
- Python was used to extract valuable information from unstructured data sources.
- Data files were transformed to an efficient format using Spark in order to optimize storage.
- Cloudera Search, which is built on Apache Solr, was enabled to index the documents and make them searchable.
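
The extraction step above can be illustrated with a minimal Python sketch. It assumes the text has already been pulled out of a PDF/DOC file (the source does not name the library used for that) and shows how a few fields might be lifted from it into a flat record that a search index such as Solr could ingest. The field names, the regular expressions, the sample text, and the function `build_search_record` are all hypothetical and are not taken from the client's implementation.

```python
import hashlib
import json
import re

# Hypothetical patterns: the source does not say which fields were extracted.
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")
AMOUNT_RE = re.compile(r"\$\s?(\d+(?:,\d{3})*(?:\.\d{2})?)")


def build_search_record(filename: str, text: str) -> dict:
    """Turn raw document text into a flat record that a search index could ingest."""
    # Collapse line breaks and repeated whitespace left over from extraction.
    normalized = " ".join(text.split())
    return {
        # Stable identifier derived from the file name (an assumption for this sketch).
        "id": hashlib.sha1(filename.encode("utf-8")).hexdigest(),
        "filename": filename,
        "dates": DATE_RE.findall(normalized),
        "amounts": [float(a.replace(",", "")) for a in AMOUNT_RE.findall(normalized)],
        "body": normalized,
    }


# Made-up sample text standing in for the content of an extracted document.
sample_text = """Purchase order issued 2021-03-15.
Total due: $1,250.00 by 2021-04-14."""

record = build_search_record("po_0001.pdf", sample_text)
print(json.dumps(record, indent=2))
```

Keeping the full text in `body` alongside the extracted fields is one way to support both free-text search and structured queries over the same document.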
Solution Architecture

Business Outcomes
- Information from a variety of sources became quick and easy to access.
- Converting the data to an efficient format improved storage efficiency by 50%.
- The speed of capturing and searching the data doubled.