Data Lakehouse and Data Hub: Going Beyond Just Data Storage
The limitations of data warehouses led to data lakes. Now data lakehouses and data hubs are taking data storage and analysis to another level.
Businesses want to do so much with data today. Of course, they need to store data so it’s accessible when needed, and they want to make sure it’s compliant and secure. At the same time, businesses increasingly want to be able to share and gain insights from the data.
Traditionally, organizations have relied on data warehouses as the central repository for their data, using them for everything from analytics and reporting to business intelligence. While data warehouses are still valuable, many organizations are finding that they aren’t enough. Data warehouses work best with structured data such as operational and transactional data, while much of the data flowing into businesses today is unstructured, in the form of email, video, social media posts, audio and sensor data.
These shortcomings have led to the development of data lakes, which can store both structured and unstructured data. Unlike data warehouses, data lakes store data in its native format, which often means the data needs cleansing and preparation before use. That can make it harder to work with and result in a disorganized collection. At the same time, data lakes have proved valuable for more advanced analytics, often involving machine learning, and they tend to be more scalable and better suited to the cloud. Many companies build their data lakes on cloud-native object storage to handle the velocity and volume of incoming data.
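As a rough illustration of that pattern, the sketch below lands raw files of different types in cloud object storage, keeping them in their native formats until they are needed. The bucket name and folder layout are hypothetical, and boto3 with Amazon S3 simply stands in for whichever object store a company actually uses.

```python
# Minimal sketch: landing raw data in a cloud object store (data lake pattern).
# Assumes AWS credentials are configured; the bucket and key layout are hypothetical.
import boto3

s3 = boto3.client("s3")
BUCKET = "acme-data-lake"  # hypothetical bucket name

# Raw files are stored as-is, in their native formats, organized by source and date.
raw_files = [
    ("exports/orders_2024-05-01.csv", "raw/transactions/2024/05/01/orders.csv"),
    ("logs/clickstream.json", "raw/clickstream/2024/05/01/events.json"),
    ("media/support_call_0415.mp3", "raw/audio/2024/05/01/support_call_0415.mp3"),
]

for local_path, key in raw_files:
    s3.upload_file(local_path, BUCKET, key)

# Cleansing, schema enforcement and transformation happen later, when the data is read for analysis.
```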
With clear benefits to each type of data environment, does it make sense for companies to have both? In many cases, that has already happened, but it can lead to even more complications. Maintaining multiple environments not only adds cost but also means storing the same data in multiple places, which can create just the kind of data silos businesses are trying to avoid. This has led some organizations to build pipelines to move the data around.
Benefits of a Data Lakehouse
The result is the somewhat humorously named “data lakehouse,” a term that describes a combination data warehouse and data lake. The goal, said Kevin Petrie, vice president of research at Eckerson Group, a research and consulting firm for data and analytics, is to provide one source of data across the environment. Vendors such as Snowflake, Databricks and Vertica are innovators in this space, although there are others. Some come at it from the data warehouse side, while others approach from the data lake side, but all have the same goal: embracing a cloud-native object storage-based combination data warehouse and data lake.
Typically, the features of a data lakehouse include direct access to source data; a standardized storage format; support for structured, semi-structured and unstructured data; schema support; and concurrent reading and writing of data.
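For a concrete, if simplified, sense of what those features look like in practice, the sketch below uses the open-source deltalake package (the delta-rs bindings for Delta Lake, one of several open table formats) to write data in a standardized Parquet-based format with a recorded schema and transactional reads and writes. The table path and data are hypothetical.

```python
# Minimal sketch of lakehouse-style table features using the open-source
# deltalake package (delta-rs). Paths and data here are hypothetical.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

table_path = "s3://acme-data-lake/curated/orders"  # could also be a local path

# Initial write: data is stored as Parquet files plus a transaction log,
# giving a standardized storage format and a recorded schema.
orders = pd.DataFrame({"order_id": [1, 2], "amount": [19.99, 5.00]})
write_deltalake(table_path, orders)

# Appends are transactional, so concurrent readers always see a consistent view.
more_orders = pd.DataFrame({"order_id": [3], "amount": [42.50]})
write_deltalake(table_path, more_orders, mode="append")

# Schema support: the table's schema can be inspected, and appends that don't
# match it are rejected rather than silently corrupting the data.
table = DeltaTable(table_path)
print(table.schema())
print(table.to_pandas())
```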
“The data lakehouse allows you to get the performance, governance and scale of a data lake to the data warehouse and, most importantly, have a single source of truth for all of your data,” explained Joel Minnick, vice president of product marketing at Databricks. “It can build on the system you already have instead of having to bring an entirely new technology into your environment, and it reduces complexity because you don’t have to move the data into all of those downstream systems.”
Businesses increasingly are seeing the benefits of the data lakehouse approach. A report earlier this year from TDWI, for example, found that 48% of analytics and data professionals believe the concept is very important, and 89% view it as an opportunity. Respondents said the biggest values include silo consolidation, getting more business value from data, expanding analytics into more advanced forms like machine learning and AI, and providing a better foundation for analysis of new and traditional data.
And Then There Is the Data Hub …
With the combined data warehouse and data lake concept, what more could you want? Some say the missing piece is a data hub, which centralizes data across applications and makes data sharing and collaboration easier. Some also define the data hub as the primary source of important data elements like master data and reference data. Others say the data hub connects data warehouses and data lakes.
“Data hubs allow data that’s locked away in the data warehouse to be exposed and used by other applications in other ways,” said Robert Lee, chief technology officer at Pure Storage. “You can say that a data warehouse is a perfect fit for one type of data or for a particular usage, [and] a data lake and some analytics package on top of it might be a perfect fit for another usage, but how do I make sure that these don’t become locked-away silos, and how do I avoid creating a bunch of sprawl along the way?”
But isn’t that exactly what a data lakehouse is supposed to do? Some think so—Gartner and Ventana Research among them. Minnick agrees.
“If the point of a data hub is to be a clearinghouse for data that makes sure that the right people have access to the right data to do their jobs, that’s what a data lakehouse does,” he said. “A central tenet of the lakehouse architecture is that there is a single source of truth that can be shared externally and internally.”
Others believe it’s not exactly the same. But in some ways, it really doesn’t matter. Instead, focus on the features and functions you need, regardless of the term attached to the technology.
“Today, there is no single way to solve enterprise data needs. The problem isn’t about choosing one; the problem is how to choose the minimal number of best-of-breed tools for each of your needs,” Lee said. That means being flexible because the best solution may be a mix of technology.
IT leaders who manage data platforms for different business units, along with business managers, should ask business users and data scientists about their requirements, pain points and what has worked for them in the past.
“They will probably say things like ‘we don’t know where the truth is’ or ‘we’re not getting data returned to us fast enough’ or ‘we can’t analyze as much data as we want to drive business decisions,’” Petrie said. “That’s the feedback you need to guide a decision about how you can modernize your architecture.”