Explaining the Data Lakehouse: Part 2
The second article in this five-part series compares the data lakehouse with the conventional data warehouse.
This is the second article in a series of five. The first article introduces the data lakehouse and explores what is new and different about it. The third article explores the viability of the data lakehouse -- and of data lakehouse architecture -- as replacements for the data warehouse and its architecture. The fourth article looks at the role of data modeling in designing, maintaining and using the lakehouse. The final article in the series assesses the differences and, just as important, the similarities between the lakehouse and the platform-as-a-service (PaaS) data warehouse.
In this article, I focus on how the architecture of the data lakehouse compares with that of the classic, or conventional, data warehouse. The article imagines data lakehouse architecture as an attempt to implement some of the core requirements of data warehouse architecture in a modern design based on cloud-native concepts, technologies and methods. It explores the advantages of cloud-native design, beginning with the ability to dynamically provision resources in response to specific events, predetermined patterns and other triggers. It likewise explores data lakehouse architecture as its own thing -- that is, as an attempt to address new or different types of practices, use cases and consumers.
How Data Lakehouse Architecture Differs from Data Warehouse Architecture
In an important sense, data lakehouse architecture is an effort to adapt the data warehouse -- and its architecture -- to cloud and, at the same time, to address a much larger set of novel use cases, practices and consumers. This is a less counterintuitive, and less daunting, claim than it might seem.
Think of data warehouse architecture as akin to a technical specification: It does not tell you how to design or implement the data warehouse; rather, it enumerates and describes the set of requirements (that is, features and capabilities) that the ideal data warehouse system must address. For all intents and purposes, then, designers are free to engineer their own novel implementations of the warehouse, which is what Joydeep Sen Sarma and Ashish Thusoo attempted to do with Apache Hive, a SQL interpreter for Hadoop, or what Google did with BigQuery, its SQL query-as-a-service offering.
The data lakehouse is a case in point. In fact, to the extent that a data lakehouse implementation addresses the set of requirements specified by data warehouse architecture, it is a data warehouse.
In the first article in this series, we saw that data lakehouse architecture eschews the monolithic design of classic data warehouse implementations, as well as the more tightly coupled designs of big data-era platforms, such as Hadoop+Hive, or platform-as-a-service (PaaS) warehouses, such as Snowflake.
Design-wise, then, data lakehouse architecture is quite different. But how is it different? And why?
Adapting Data Warehouse Architecture to Cloud
The classic implementation of data warehouse architecture is premised on a set of dated expectations, particularly with respect to how the functions and resources that make up the warehouse are to be instantiated, connected and accessed. For one thing, early implementers of data warehouse architecture expected that the warehouse would be physically instantiated as an RDBMS and that its components would connect to one another via a low-latency, high-throughput bus. Relatedly, they expected that SQL would be the sole means of accessing and manipulating data in the warehouse.
A second expectation was that the data warehouse would be online and available at all times. Moreover, its constituent functions were expected to be tightly coupled to one another -- a feature, not a bug, of its instantiation in an RDBMS. This coupling made it impracticable (and, for all intents and purposes, impossible) to scale the warehouse's resources independently of one another.
Neither of these expectations is true in the cloud, of course. And we are all quite familiar with the cloud as a metaphor for virtualization -- that is, the use of software to abstract and define different types of virtual resources -- and for the scale-up/scale-down elasticity that is cloud’s defining characteristic.
But we tend to spend less time thinking about cloud as a metaphor for the event-driven provisioning of virtualized hardware -- and, by implication, for an ability to provision software in response to events, too.
This on-demand dimension is arguably the most important practical benefit of cloud’s elasticity. It is also one of the most obvious differences between the data lakehouse and the classic data warehouse.
The Data Lakehouse as Cloud-native Data Warehouse
Event-driven design at this scale presupposes a fundamentally different set of hardware and software requirements. It is this event-driven dimension that cloud-native software engineering concepts, technologies and methods evolved to address. In place of tightly coupled, monolithic applications that expect to run atop always-on, always-available, physically instantiated hardware resources, cloud-native design presupposes an ability to provision discrete software functions (which developers instantiate as loosely coupled services) on-demand, in response to specific events. These loosely coupled services correspond to the constituent functions of an application. Applications themselves are composed of loosely coupled services, as with the data lakehouse and its layered architecture.
What makes the data lakehouse cloud-native? It is cloud-native to the degree that it decomposes most, if not all, of the software functions implemented in data warehouse architecture. There are six such functions, sketched as loosely coupled service interfaces after this list:
one or more functions capable of storing, retrieving and modifying data;
one or more functions capable of performing different types of operations (such as joins) on data;
one or more functions exposing interfaces that users/jobs can use to store, retrieve and modify data, as well as to specify different types of operations to perform on data;
one or more functions capable of managing/enforcing data access and integrity safeguards;
one or more functions capable of generating or managing technical and business metadata; and
one or more functions capable of managing and enforcing data consistency safeguards, as when two or more users/jobs attempt to modify the same data at the same time, or when a new user/job attempts to update data that is currently being accessed by prior users/jobs.
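Here is that sketch: a minimal, hypothetical rendering of the six functions as independent service interfaces in Python. The class and method names are invented for illustration; they are not any vendor's actual API.

```python
# Hypothetical sketch only: the six warehouse functions as independent,
# loosely coupled service interfaces. Names and signatures are invented.
from typing import Any, Dict, Iterable, Protocol


class StorageService(Protocol):
    """Stores, retrieves and modifies data."""
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
    def delete(self, key: str) -> None: ...


class ComputeService(Protocol):
    """Performs operations (such as joins) on stored data."""
    def run(self, operation: str, inputs: Iterable[str]) -> str: ...


class QueryInterface(Protocol):
    """Exposes an interface (here, SQL) that users/jobs call."""
    def execute(self, sql: str) -> Iterable[Dict[str, Any]]: ...


class AccessControlService(Protocol):
    """Manages and enforces data access and integrity safeguards."""
    def authorize(self, principal: str, resource: str, action: str) -> bool: ...


class MetadataService(Protocol):
    """Generates and manages technical and business metadata."""
    def register(self, dataset: str, metadata: Dict[str, Any]) -> None: ...


class ConsistencyService(Protocol):
    """Enforces consistency when concurrent users/jobs touch the same data."""
    def acquire(self, dataset: str, job_id: str) -> bool: ...
    def release(self, dataset: str, job_id: str) -> None: ...
```

In a monolithic RDBMS, these functions ship as a single, inseparable engine; in a cloud-native design, each could be provisioned, scaled and replaced on its own.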
With this as a guideline, we can say that a “pure” or “ideal” implementation of data lakehouse architecture would consist of the following components:
The lakehouse service itself. In addition to SQL query, the lakehouse might provide metadata management, data federation and data cataloging capabilities. Beyond its core role as a query service, the lakehouse doubles as a semantic layer: It creates, maintains and versions modeling logic, such as denormalized views that get applied to data in the lake.
The data lake. At minimum, the data lake provides schema enforcement capabilities, along with the ability to store, retrieve, modify and schedule operations on objects/blobs in object storage. The lake usually provides data profiling and discovery, metadata management, and data cataloging capabilities, along with data engineering and, optionally, data federation capabilities. It enforces access and data integrity safeguards across each of its constituent zones. Ideally, it also generates and manages technical metadata for the data in these zones.
An object storage service. It provides a scalable, cost-effective storage substrate. It also handles the brute-force work of storing, retrieving and modifying the data stored in file objects.
With that said, there are different ways to implement the data lakehouse. One pragmatic option is to fold all these functions into a single omnibus platform -- a data lake with its own data lakehouse. This is what Databricks, Dremio and others have done with their data lakehouse implementations.
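Whichever packaging a vendor chooses, the layered composition itself can be illustrated with a minimal sketch. The class names and defaults below are invented for illustration; they are not how any particular product is structured.

```python
# Hypothetical sketch of the layered, "pure" lakehouse composition:
# lakehouse service -> data lake -> object storage. Names are invented.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ObjectStorage:
    """Scalable, cost-effective substrate for file objects."""
    bucket: str
    objects: Dict[str, bytes] = field(default_factory=dict)  # stand-in for blobs


@dataclass
class DataLake:
    """Zoned lake over object storage: schema enforcement, cataloging, access rules."""
    storage: ObjectStorage
    zones: Tuple[str, ...] = ("raw", "staging", "curated")


@dataclass
class LakehouseService:
    """SQL query plus a semantic layer (e.g., denormalized views) over the lake."""
    lake: DataLake
    views: Dict[str, str] = field(default_factory=dict)  # view name -> SQL definition

    def query(self, sql: str) -> List[dict]:
        return []  # placeholder: parse, plan and execute against the lake's zones


# Compose the layers; in principle, each one could be swapped for an equivalent service.
lakehouse = LakehouseService(lake=DataLake(storage=ObjectStorage(bucket="analytics")))
```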
Why Does Cloud-native Design Matter?
This invites some obvious questions. First, why do this? What are the advantages of a loosely coupled architecture vs. the tightly integrated architecture of the classic data warehouse? As we have seen, one benefit of loose coupling is the ability to scale resources independently of one another -- to allocate more compute without also adding storage or network resources. Another benefit is that loose coupling eliminates some of the dependencies that can cause software to break. A change in one service, for example, will not necessarily impact, let alone break, other services. Similarly, the failure of a service will not necessarily cause other services to fail or to lose data. Cloud-native design also uses mechanisms (such as service orchestration) to detect, manage and recover from service failures.
Another benefit of loose coupling is that it has the potential to eliminate the types of dependencies that stem from an implementation’s reliance on a specific vendor’s or provider’s software. If services communicate and exchange data with one another solely by means of publicly documented APIs, it should be possible to replace a service that provides a definite set of functions (such as SQL query) with an equivalent service. This is the premise of pure or ideal data lakehouse architecture: Because each of its constituent components is, in effect, commoditized (such that equivalent services are available from all of the major cloud infrastructure providers, from third-party SaaS and/or PaaS providers, and as open source offerings), the risk of provider-specific lock-in is reduced.
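As a hedged illustration of that substitutability, the sketch below shows calling code written against a generic query interface rather than a specific engine. "EngineA" and "EngineB" are invented stand-ins for equivalent services from different providers; either satisfies the interface without any change to the caller.

```python
# Illustrative only: two interchangeable SQL query services behind one
# publicly documented interface. EngineA and EngineB are invented names.
from typing import Iterable, Protocol


class SQLQueryService(Protocol):
    def execute(self, sql: str) -> Iterable[dict]: ...


class EngineA:
    def execute(self, sql: str) -> Iterable[dict]:
        return []  # placeholder: submit the query to provider A's service


class EngineB:
    def execute(self, sql: str) -> Iterable[dict]:
        return []  # placeholder: submit the query to provider B's service


def monthly_report(engine: SQLQueryService) -> Iterable[dict]:
    # The caller depends only on the interface, not on either implementation.
    return engine.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")


# Swapping providers is a one-line change, not a rewrite.
monthly_report(EngineA())
monthly_report(EngineB())
```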
The Data Lakehouse as Event-driven Data Warehouse
Cloud-native software design also expects the provisioning and deprovisioning of the hardware and software resources that power loosely coupled cloud-native services to happen automatically. In other words, to provision a cloud-native service is to provision its enabling resources; to terminate one is to deprovision those resources. In a sense, cloud-native design wants to make hardware -- and, to a degree, software -- disappear, at least as a variable in the calculus of deploying, managing, maintaining and, especially, scaling business services.
From the viewpoints of consumers and expert users, there are only services -- that is, tools that do things.
For example, if an ML engineer designs a pipeline to extract and transform data from 100 GBs of log files, a cloud-native compute engine dynamically provisions n compute instances to process her workload. Once the engineer's workload finishes, the engine automatically terminates these instances.[i]
Ideally, neither the engineer nor the usual IT support people (DBAs, systems and network administrators, and so on) need to do anything to provision these compute instances or, crucially, the software and hardware resources on which they depend. Instead, this all happens automatically -- in response, for example, to an API call initiated by the engineer. The classic, on-premises data warehouse was just not conceived with this kind of cloud-native, event-driven computing paradigm in mind.
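A minimal sketch of that event-driven flow follows. All function and class names are invented for illustration; a real platform would hide these steps behind the engineer's API call.

```python
# Hypothetical sketch: an event handler sizes compute to the workload,
# runs it, then deprovisions. provision()/terminate() stand in for a
# cloud provider's APIs; they are not real library calls.
import math
from dataclasses import dataclass
from typing import List


@dataclass
class ComputeInstance:
    instance_id: int

    def run(self, files: List[str]) -> None:
        pass  # placeholder: extract and transform this slice of the log files


def provision(count: int) -> List[ComputeInstance]:
    return [ComputeInstance(i) for i in range(count)]  # placeholder: provider API call


def terminate(instances: List[ComputeInstance]) -> None:
    instances.clear()  # placeholder: provider API call to tear the instances down


def on_pipeline_submitted(log_files: List[str], total_gb: float) -> None:
    """Triggered by an event -- for example, the engineer's API call."""
    count = max(1, math.ceil(total_gb / 10))  # e.g., 100 GB of logs -> 10 instances
    instances = provision(count)
    try:
        for i, instance in enumerate(instances):
            instance.run(log_files[i::count])  # split the files across instances
    finally:
        terminate(instances)  # resources disappear once the workload finishes
```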
The Data Lakehouse as Its Own Thing
The data lakehouse is, or is supposed to be, its own thing. As we have seen, it provides the six functions listed above. But it depends on other services -- namely, an object storage service and, optionally, a data lake service -- to provide basic data storage and core data management functions. In addition, data lakehouse architecture implements a novel set of software functions that have no obvious parallel in classic data warehouse architecture. In theory, these functions are unique to the data lakehouse.
These are:
One or more functions capable of accessing, storing, retrieving, modifying and performing operations (such as joins) on data stored in object storage and/or third-party services. The lakehouse simplifies access to data in Amazon S3, AWS Lake Formation, Amazon Redshift, and so on.
One or more functions capable of discovering, profiling, cataloging and/or facilitating access to distributed data stored in object storage and/or third-party services. For example, a modeler creates n denormalized views that combine data stored in the data lakehouse with data in the staging zone of an AWS Lake Formation data lake. The modeler also designs a series of more advanced models that incorporate data from an Amazon Redshift sales data mart, as sketched below.
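Here is a hedged sketch of what such a federated, denormalized view might look like. The catalog, schema and table names are invented; the three-part identifiers stand in for data managed by the lakehouse, the lake's staging zone and a Redshift sales data mart, respectively.

```python
# Hypothetical sketch only: a denormalized view spanning lakehouse-managed data,
# the data lake's staging zone and a federated Redshift sales data mart.
DENORMALIZED_SALES_VIEW = """
CREATE VIEW analytics.sales_enriched AS
SELECT o.order_id,
       o.order_total,
       c.customer_segment,
       p.promo_discount
FROM   lakehouse.curated.orders        AS o  -- managed by the lakehouse service
JOIN   lake.staging.customer_profiles  AS c  -- data lake staging zone
  ON   o.customer_id = c.customer_id
JOIN   redshift.sales_mart.promotions  AS p  -- federated Redshift data mart
  ON   o.promo_id = p.promo_id
"""


def register_view(execute_sql) -> None:
    """Submit the view through whatever SQL interface the lakehouse exposes
    (passed in here as a plain callable)."""
    execute_sql(DENORMALIZED_SALES_VIEW)
```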
In this respect, however, the lakehouse is not actually all that different from a PaaS data warehouse service. The fifth and final article in this series will explore this similarity in depth.
[i] The software required to make this work is still very new. Arguably, some of it does not yet exist -- at least in a sense analogous to the RDBMS, whereby, for example, a query optimizer parses each SQL query, estimates the cost of running it and pre-allocates the necessary resources. This is not magic; rather, it is grounded in the rigor of mathematics. Under the covers, the RDBMS' query optimizer uses relational algebra to translate SQL commands into relational operations. It creates a query plan -- that is, an optimized sequence of these operations -- that the database engine uses to allocate resources to process the query. This proactive model is quite different from the reactive model that predominates in cloud-native software, which might, for example, use real-time feedback from observable components to determine the cost of running a workload and to provision sufficient resources. To the degree that cloud-native software has a proactive dimension, it comes via pre-built rules or ML models; unlike SQL queries, its workloads cannot be solved for in advance with mathematics.
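To make the contrast concrete, here is a toy sketch of that proactive, cost-based planning: estimate the cost of every join order from table statistics before doing any work, and let the cheapest estimate become the plan. The row counts and cost model are invented and grossly simplified.

```python
# Toy cost-based planner: pick the cheapest join order from estimated
# cardinalities before running anything. Figures are invented for illustration.
from itertools import permutations

ROW_COUNTS = {"orders": 1_000_000, "customers": 50_000, "promotions": 200}
JOIN_SELECTIVITY = 0.001  # assumed fraction of row pairs that survive each join


def estimated_cost(join_order) -> float:
    """Crude cost model: total size of the intermediate results produced."""
    rows, cost = ROW_COUNTS[join_order[0]], 0.0
    for table in join_order[1:]:
        rows = rows * ROW_COUNTS[table] * JOIN_SELECTIVITY
        cost += rows
    return cost


# The "query plan" is simply the join order with the lowest estimated cost.
plan = min(permutations(ROW_COUNTS), key=estimated_cost)
print(plan, estimated_cost(plan))
```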