No one likes to hear “I told you so,” but those words are ringing out across data analytics circles these days. Business leaders who thought Big Data was going to be easy are feeling the splash of cold water on their faces. But just because Big Data is hard doesn’t mean it’s going away. Data pioneers are pushing forward with new approaches that are showing tremendous promise.
Most businesses struggle with Big Data because it runs counter to classic data architecture models. Companies are accustomed to spending a tremendous amount of time, money and effort organizing disparate data from dispersed systems and synthesizing it into a simple, singular database so everyday employees can make use of it. Data standardization is ingrained in the culture.
I can remember the days when businesses invested thousands of dollars trying to define one data point, not even knowing if it would serve any useful purpose, such as predicting buying patterns. Executives engaged in endless discussions about the various ways that different audiences could interpret a single nugget of information.
In today’s world, the acceleration of data volume, velocity, variety and veracity is overwhelming traditional data management architectures. Data warehousing still has its place, but creating robust interfaces to systems and data sources is too slow and too expensive to supply data to an innovation process whose value is unknown. Information-driven decision making requires a more agile, innovation-driven approach.
A new trend called data lakes is emerging in Big Data. Forward-thinking enterprises across industries are “dumping” data for analytics into repositories, often Hadoop-based, without perfecting the data. By taking this unkempt approach, companies aren’t trying to make the data accessible to a mass audience, and they don’t always know what they’ll find. Now that data storage and technology are cheap and information is vast, discovery analytics is finally possible. Businesses have the luxury of keeping the information they collect. And they don’t have to know what they are looking for to pave a path to hidden opportunities.
With data lakes, companies employ data scientists who are capable of making sense of wild data as they trek through it. They can find correlations within the data as they get to know it. As PhDs, statisticians and business experts traverse the terrain, they leave guideposts for those who follow. Think of them as modern-day Lewis and Clarks scouting what’s up ahead to gather information for the rest of the colony to build the town and lay down track for the railroad.
Challenges and Opportunities of Data Lakes
Data lakes are such a fundamentally different way of attacking the data dilemma that enterprises are struggling to adapt their processes and culture. The following are a few challenges that businesses need to overcome:
• Political Power Struggles: Even though Hadoop makes sharing data easy, business owners can resist sharing it for political reasons. Data is power.
• Complexity of Legacy Data: Many legacy systems contain a hodgepodge of software patches, workarounds, and poor design. As a result, the raw data may provide limited value outside its legacy context. The data lake performs optimally when supplied with unadulterated data from source systems, and rich metadata built on top.
• Metadata Lifecycle Management: Data lakes require advanced metadata management methods, including machine-assisted scans, characterizations of the data files, and lineage tracking for each transformation. Should schema-on-read be the rule and a predefined schema the exception? It depends on the sources. The former is ideal for working with rapidly changing data structures, while the latter is appropriate for sub-second query response on highly structured data.
• Desolate Data Islands: Business units often discover that lakes are cheap and fast, build them on their own, and then abandon them. By circumventing the centralized IT function, business units can create a chain of desolate data islands instead of a “land of lakes” that can flow into each other.
• The Issue of Integration: The integration required to turn that data into actionable insights is a substantial challenge. While integrating the data takes place at the Hadoop layer, contextualizing the metadata (providing views of selected data, in other words) takes place at schema creation time. A secure “integration fabric” is necessary to link the lakes and centralize data from multiple sources to provide a comprehensive, rich repository of enterprise-wide information for analysis and insight.
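To make the schema-on-read idea from the metadata discussion more concrete, here is a minimal sketch in Python. It is an illustration only, not how Hadoop or any particular data lake product works: the sample records, the field names, and the read_with_schema helper are all hypothetical. The point it tries to show is that the raw data is stored untouched, and each analyst decides at read time which fields matter and how to interpret them.

```python
import json

# Hypothetical raw records as they might land in a lake: stored untouched,
# with field names that vary from one source system to the next.
RAW_EVENTS = [
    '{"customer_id": "17", "amount": "42.50", "channel": "web"}',
    '{"customer_id": "23", "amount": "19.99"}',
    '{"cust": "31", "total": "7.25", "channel": "store"}',
]

# A schema chosen by the reader, not by the storage layer (assumed layout):
# output field -> (candidate source keys, type to cast to).
PURCHASE_SCHEMA = {
    "customer_id": (("customer_id", "cust"), int),
    "amount": (("amount", "total"), float),
    "channel": (("channel",), str),
}


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines and project them onto a schema at read time."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, (source_keys, cast) in schema.items():
            # Use the first candidate key that is present; missing fields become None.
            value = next((record[k] for k in source_keys if k in record), None)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_EVENTS, PURCHASE_SCHEMA)

# A simple discovery-style question asked after the fact: spend per channel.
total_by_channel = {}
for row in rows:
    key = row["channel"] or "unknown"
    total_by_channel[key] = total_by_channel.get(key, 0.0) + row["amount"]
```

A different analyst could read the same raw lines with a different schema without anyone reloading or reshaping the stored data. A predefined schema would instead reject or transform the second and third records at load time, which costs more up front but makes queries on the cleaned data faster and more predictable.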
Virtually every industry has the potential to tap the power of data lakes. Implemented with some foresight, a data lake can be a way to gain more visibility into operations and put an end to data fiefdoms. Many companies see data lakes as an opportunity to capture a 360-degree view of their customers, to analyze social media trends and to strengthen compliance.
In the financial services industry, the Dodd-Frank regulation is a catalyst. One institution has begun centralizing multiple data warehouses into a big repository. The institution is moving reconciliation, settlement, and Dodd-Frank reporting to the new platform. In this case, the approach reduces integration overhead because data is communicated and stored in exactly the same format. The system also provides a consistent view of a customer across operational functions, business functions, and products.
Some companies have built Big Data sandboxes for analysis by data scientists. These sandboxes are somewhat similar to data lakes, albeit narrower in scope and purpose. PwC, for example, built a social media data sandbox to help clients monitor their brand health by using its Social Mind application.
Make no mistake about it; Big Data is increasingly the beating heart of a thriving company. The term Big Data might seem tired, but the practice itself is opening up new and exciting areas of analytics that are enabling companies to achieve a level of competitive advantage that was unimaginable only a few years ago. Every day, businesses are making Big Data breakthroughs that are like beacons of light for others still in the dark to follow.