5 Data Warehouse Implementation Mistakes to Avoid in Big Data Projects

Data warehouse implementations are tricky. An enterprise data warehouse takes months to build, and, worse, the failure rate is high: various studies have reported that 50 to 60 percent of data warehousing projects fail.

Over the last 15 years, I have worked with dozens of clients, ranging from the world's largest banks to start-ups. My consistent experience has been that data warehouse projects do fail, and they fail for many reasons. During one of my early Big Data implementations, the client sponsor for a streaming big data project asked me: "Can you build upon your experience and tell me why we may not succeed in this initiative?" Oh yes! I know many reasons that have caused data warehouse projects to fail.

By 'Big Data,' I refer to analytical systems built on Hadoop, big data technologies used to augment existing data warehouses, systems built to analyze streaming data for predictive analytics, and other similar systems. By a successful data warehouse project, I mean one that is delivered on time and within budget, but, more importantly, one whose data warehouse business users actually use for decision making.

Here is a list of common problems I have observed in failed data warehouse implementations.

Long development cycles

Many unsuccessful data warehouse implementations are characterized by long development cycles. I have seen failed EDW projects that took two years to complete. Over the course of such a project, the team composition changed, end users were replaced, and the budget skyrocketed.

By the end, the CIO or sponsor was more concerned with justifying the cost of the implementation than with the value delivered by the data warehouse. Taking a cue from this, big data implementations should avoid long development cycles. Agile methodology is well suited to big data implementations, given the exploratory nature of these projects. Short sprints of 2-3 weeks, continuous testing and deployment, and regular reprioritization of requirements are necessary for Big Data projects.

Lack of focus on Data Quality

Poor data quality has been a major irritant that drives business users to stop using data warehouses. I have often seen data warehouses delivered on budget but with poor data quality. A lack of focus on quality testing and a poor understanding of source data are among the reasons. The same situation is very likely in any big data project.

Consider a scenario where a manufacturing company has implemented a predictive maintenance system based on Hadoop. It has been tested and is now ready to predict failures. Some incoming data from machine sensors is not loaded due to data format errors, and the system fails to alert the user about an impending failure. Incidents like this destroy users' trust in the system.
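One way to guard against this failure mode is to validate incoming sensor records before loading them, and to quarantine and flag malformed records rather than dropping them silently. The sketch below illustrates the idea; the field names, formats, and alerting mechanism are hypothetical, not taken from any particular system.

```python
def validate_record(record):
    """Return an error message if the record is malformed, else None."""
    required = ("sensor_id", "timestamp", "temperature")
    for field in required:
        if field not in record:
            return f"missing field: {field}"
    try:
        float(record["temperature"])
    except (TypeError, ValueError):
        return f"bad temperature value: {record['temperature']!r}"
    return None

def load_batch(records):
    """Split a batch into loadable records and quarantined (record, error) pairs."""
    loaded, quarantined = [], []
    for record in records:
        error = validate_record(record)
        if error is None:
            loaded.append(record)
        else:
            quarantined.append((record, error))
    if quarantined:
        # In a real system this would raise a data-quality alert to an
        # operator; here we just report the count.
        print(f"WARNING: {len(quarantined)} record(s) quarantined")
    return loaded, quarantined

batch = [
    {"sensor_id": "m-01", "timestamp": "2024-01-01T00:00:00", "temperature": "71.3"},
    {"sensor_id": "m-02", "timestamp": "2024-01-01T00:00:00", "temperature": "N/A"},
]
loaded, quarantined = load_batch(batch)
```

The key design choice is that malformed data becomes visible (quarantined and alerted on) instead of vanishing, so users learn about a data-quality gap before it masks a missed failure prediction.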

Treating data projects as pure IT projects rather than a business endeavor

Treating the project as a pure IT effort has been a common factor in the failed data warehouse implementations I have observed in very large organizations. A head of Business Intelligence claiming that his team understands business requirements better than the business users themselves is a familiar scenario. The same thing is already happening in Big Data implementations.

IT departments are implementing Hadoop clusters and pumping in data without meaningful involvement from the business community. Sooner or later, the business community will question the spend and the ROI of the implementation.

Conducting weekly user demos, releasing partial data to business users early, and running workshops on how other companies use similar data all help keep the business user community "involved" in big data initiatives.

Data silos and data proliferation

During the early 2000s, a number of data marts, data warehouses, and personal data marts were developed in mid-size and large organizations. Data was extracted from operational systems multiple times into these analytical systems. Over time, data silos and data proliferation developed. With easy access to external and social media data, I foresee similar silo and proliferation scenarios with Big Data / Hadoop systems. Nothing can stop an analyst from dumping Facebook comments, census data, and government data into her personal Hadoop datasets and running R algorithms on them.

Now imagine multiple analysts doing this in the same organization! Strong metadata management systems and data governance help avoid data silos and data proliferation, especially in large organizations. During the last decade, several metadata and data governance initiatives were launched in organizations with data warehouses. Rather than being treated as an afterthought, data governance should be part of big data implementations from the start.
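The governance idea above can be sketched as a shared dataset registry: before creating a new dataset, an analyst (or an automated check) can see whether a copy of the same source data already exists. All class and field names here are hypothetical, a minimal illustration rather than any particular metadata tool.

```python
class DatasetRegistry:
    """A toy metadata catalog mapping dataset names to their source and owner."""

    def __init__(self):
        self._datasets = {}

    def register(self, name, source, owner):
        # Refuse a second copy of the same source: the point of the
        # registry is to surface existing datasets instead of duplicating them.
        if self.find_by_source(source):
            raise ValueError(f"a dataset from source {source!r} already exists")
        self._datasets[name] = {"source": source, "owner": owner}

    def find_by_source(self, source):
        """Return the names of datasets already built from this source."""
        return [name for name, meta in self._datasets.items()
                if meta["source"] == source]

registry = DatasetRegistry()
registry.register("social_comments_raw", source="facebook_api", owner="analyst_a")
print(registry.find_by_source("facebook_api"))
```

A real implementation would live in a shared metadata service rather than an in-memory dictionary, but the principle is the same: make existing datasets discoverable so each external source is landed once, not once per analyst.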

Lack of focus on non-functional requirements

As a fallout of treating data warehouse initiatives as IT projects, non-functional requirements often receive too little attention. Long-running queries, slow report response times, badly designed UIs, and long data warehouse load cycles have all contributed to killing many data warehouse initiatives. Beyond the technology and data angles, big data projects must focus on these non-functional, or 'usability', aspects to get buy-in from business users.

In short, strong data governance, close involvement of business stakeholders, an agile approach, and a focus on user experience are mandatory for the success of a big data program.