Explaining the Data Lakehouse: Part 5
The fifth and final installment in this series compares the data lakehouse and the PaaS data warehouse.
This is the final article in a series devoted to considering the data lakehouse both on its own terms and as a possible replacement for the data warehouse. (For the first article in this series, click here.)
For the most part, previous installments have treated the data warehouse as a kind of strawman. For example, I have mostly compared the data lakehouse and its cloud-native design with classic implementations of the data warehouse -- as if cloud-native concepts and methods had not also been applied to data warehouse architecture or, put another way, as if data warehouse architecture were stuck in time.
So far, I have said next to nothing about the platform-as-a-service (PaaS) or query-as-a-service (QaaS) [i] data warehouse, nor have I discussed these schemes as novel implementations that are comparable in both capabilities and (as it were) cloud “nativity” with the no less novel data lakehouse.
This idea is implicit in prior analysis, however. In the second article, for example, I made the point that data warehouse architecture is less a technology prescription than a technical specification: Instead of telling us how to build the data warehouse, it tells us what that system is supposed to do and how it is supposed to behave. It outlines the features, capabilities, and so on that are required by that system.
One implication of this is that there are several possible ways to implement the data warehouse. Another is that the requirements of data warehouse architecture are not necessarily in conflict with those of cloud-native design. A final implication is that the cloud-native data warehouse actually has quite a few things in common with the data lakehouse, even as it departs from that implementation in critical respects.
With this as background, let’s pivot to the culminating questions of this series: What do the data lakehouse and the PaaS data warehouse have in common, and how are they different?
The PaaS Data Warehouse Sure Looks Like a Data Lakehouse
The PaaS data warehouse and the data lakehouse have quite a bit in common. Like the data lakehouse, the PaaS data warehouse:
lives in the cloud;
separates compute, storage and other resources;
can elastically expand/contract to suit spikes in demand, seasonal/one-off use cases, and so on;
is event-driven and can provision compute or (if necessary) temporary storage resources in response to event triggers and deprovision these resources once they are no longer needed;
is co-local with (intra-cloud adjacent to) other cloud services, including the data lake;
writes data to and reads data from inexpensive cloud object storage, just as the data lake does;
can query/facilitate federated access to data stored in any of the data lake’s zones;
can eschew all but basic data modeling, as in flat or OBT schemas;
can ingest, manage and perform operations on semi- and multi-structured data;
can query across multiple data models (for example, time-series, document, graph and text);
can expose denormalized views (models) to support specific use cases, applications and so on;
exposes different types of RESTful endpoints, in addition to SQL; and
supports -- via discrete APIs or language-specific SDKs -- GraphQL, Python, R, Java and more.
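The "basic data modeling" point in the list above -- flat or one-big-table (OBT) schemas -- can be made concrete with a small sketch. All table and column names here are hypothetical, and Python's built-in sqlite3 merely stands in for a cloud warehouse engine:

```python
import sqlite3

# In-memory database stands in for a cloud data warehouse (hypothetical schema).
conn = sqlite3.connect(":memory:")

# One-big-table (OBT) design: facts and denormalized dimension attributes
# live side by side in a single flat table -- no joins needed at query time.
conn.execute("""
    CREATE TABLE sales_obt (
        order_id      INTEGER,
        order_date    TEXT,
        customer_name TEXT,   -- would be its own dimension table in 3NF
        product_name  TEXT,   -- likewise
        region        TEXT,
        amount        REAL
    )
""")
conn.executemany(
    "INSERT INTO sales_obt VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "2023-01-05", "Alan Smithee", "Widget", "East", 19.99),
        (2, "2023-01-06", "Wang Ermazi", "Gadget", "West", 24.50),
    ],
)

# A typical analytic query runs against the single flat table, join-free.
total_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM sales_obt GROUP BY region ORDER BY region"
).fetchall()
print(total_by_region)  # [('East', 19.99), ('West', 24.5)]
```

The trade-off is the usual one: the flat schema avoids join overhead at query time at the cost of redundant storage and weaker referential guarantees.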
The PaaS Data Warehouse Is (More) Tightly Coupled Where It Matters
Compared to the data lakehouse, the cloud-native data warehouse seems like a tightly coupled stack.[ii]
The advantage of this is that the cloud-native warehouse is able to manage and control the software functions that read and write data, as well as schedule, distribute and perform operations on data; manage dependencies between these operations; implement consistency, uniformity and replicability safeguards; and so on. The cloud-native data warehouse is able to enforce strict ACID safeguards.
The “ideal” data lakehouse implementation is cobbled together out of discrete, fit-for-purpose services. So, for example, an ideal data lakehouse implementation consists of a SQL query service superimposed on a data lake service, which itself sits on top of a cloud object storage service. This is consistent with a trend in software design that aims to decompose large, complex programs into smaller, function-specific programs, which are instantiated as discrete services. These services are “decoupled” from one another in the sense that they have very little knowledge about how the complementary services with which they interact are supposed to work. By assembling multiple services together, you can approximate the behavior and performance of a large (monolithic) application. You also realize a few of the benefits that derive from this design. (For more on this, see the second article in this series.)
The disadvantage of this is that it poses problems vis-à-vis concurrent computing, especially when it comes to coordinating concurrent access to shared resources. As I showed in the third article in this series, the data warehouse solves this problem by having the RDBMS kernel enforce strict ACID safeguards.
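A minimal illustration of what those ACID safeguards buy, again using sqlite3 as a stand-in for an RDBMS kernel (the account names and balances are invented): a multi-statement transaction that violates a consistency rule is rolled back atomically, so no reader ever observes a half-applied change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    # Atomic transfer: both updates succeed, or neither does.
    with conn:  # sqlite3 wraps this block in a transaction
        conn.execute(
            "UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
        # Enforce a consistency rule; violating it aborts the transaction.
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()
        if balance < 0:
            raise ValueError("overdraft -- abort")
        conn.execute(
            "UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
except ValueError:
    pass  # the rollback already happened; the database is unchanged

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 0}
```

In a tightly coupled warehouse, one kernel owns this logic for every reader and writer. In a stack of decoupled services, no single component is positioned to enforce it.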
It is not obvious how to solve this problem in an ideal data lakehouse implementation. One solution is to follow Databricks’ lead -- that is, to couple the data lake to the data lakehouse in a single platform.
If a data lakehouse were its own data lake -- and if it could also supervise concurrent access to the data in this lake -- a data lakehouse service might be able to enforce ACID-like safeguards. However, in this model, the data lakehouse and the data lake would be tightly coupled, creating dependency on a single software platform -- and a single provider.
A Data Warehouse Is a Data Warehouse … Is a Data Lakehouse?
Let’s pivot to a provocative question: Can the PaaS data warehouse do all of the things the data lakehouse can do? Possibly. Think about it: What is the difference between a SQL query service that queries against data in the curated zone of a data lake and a PaaS data warehouse that lives in the same cloud context, has access to the same underlying cloud object storage service and is able to do the same thing? What is the difference between a SQL query service that facilitates access to data in the lake’s archival, staging and other zones and a PaaS data warehouse that is able to do the same thing?
The data lake and the data warehouse seem to have been converging toward one another for a long time. So, on the one hand, the lakehouse looks like a textbook example of lake-to-warehouse convergence. On the other hand, the warehouse’s support for multiple data models and its retrofitting with data federation and multi-structured query capabilities -- that is, the ability to query files, objects or arbitrary data structures -- are arguably examples of a warehouse-to-lake convergence trend.
Let’s look at a few of the claimed differences between the data lakehouse and the data warehouse and see if these, too, have been obviated by convergence. Here are a few obvious ones to consider:
Has the ability to enforce safeguards to ensure the uniformity and replicability of results
I discussed this above. The PaaS data warehouse wins this one easily.
Has the ability to perform core data warehousing workloads
As I also said in the third article in this series, it is less important that a query platform provide fast results than that these results are uniform and replicable. In fact, the trick is to balance all three requirements against one another. This means the platform is able to achieve query results that are fast enough while at the same time maintaining a consistent state and ensuring the uniformity and replicability of results. The PaaS data warehouse is just a much faster query-processing platform than a SQL query service. It wins this one easily.
Eliminates the requirement to model and engineer data structures prior to storage
The fourth article in this series debunked this claim. The practice of modeling and engineering data structures prior to instantiating them in storage is not specific to the data warehouse; rather, it is performed to promote data intelligibility, data governance and data reuse.
The upshot is that organizations will opt to model and engineer data for both the PaaS data warehouse and the data lakehouse. Organizations that eschew all but basic data modeling for their PaaS data warehouses could realize significant improvements in query performance, relative not only to normalized (3NF) schemas but also to denormalized schemas.
Protects against cloud-service-provider lock-in
Advocates argue that ideal data lakehouse architecture helps insulate subscribers against the risk of service provider-specific lock-in. So, to cite one example, if a subscriber is dissatisfied with its existing SQL query service, it can swap this service out for another, comparable one.
How practicable is this, however? Will the modeling instantiated in the subscriber’s semantic layer also transfer? Will experts have to refactor modeling logic to work with the new service? And, if a subscriber wants to switch to a different data lake -- say, from AWS Lake Formation to a fit-for-purpose data lake service, also hosted in AWS -- how simple is it, really, to swap out one service for another? Quite aside from the question of data movement, there is also that of feature and function migration. For example, will developers have to refactor the data engineering and data modeling logic instantiated in their code to work with the new service?
On balance, an ideal data lakehouse implementation still has an advantage here. However, this is less true of an implementation that couples the data lakehouse to a specific lake service.
Has the ability to support a diversity of practices, use cases and consumers
The data warehouse has improved significantly in this area. However, the data lake has the advantage. It is cheaper and more convenient to ingest, store, engineer and experiment with data in the lake than in the data warehouse. Fewer impediments -- internal stumbling blocks in the form of policies, controls and processes -- constrain the use and the usefulness of the data lake for data engineering, ML/AI engineering, data science and other experimental use cases. The data lake wins this one going away.
Has the ability to query against/across multiple data models
The data lakehouse sits atop the data lake, which is designed to ingest, store and manage data of any type. Ergo, if an organization puts time series, graph, document and other data into the data lake, this data will be available to, and queryable by, its data lakehouse. Right?
Usually. For example, a SQL query service such as Presto can use a connector to access, say, MongoDB, a NoSQL document database. In this scheme, Presto accesses MongoDB as if it were one or more external tables. Presto has connectors for other repositories and data producers as well. Similarly, commercial SQL query-as-a-service providers typically offer data source connectors. (Several of these services are based on Presto, which simplifies things.)
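In Presto (and its fork Trino), each connector is mounted as a catalog, and tables are addressed with catalog.schema.table names, so a document store can be queried -- and even joined to lake data -- with ordinary SQL. The catalog, schema, table and column names below are hypothetical; this sketch just assembles the query text to show the addressing pattern.

```python
# Sketch of Presto/Trino-style federated SQL. Each catalog prefix is backed
# by a connector (here, a MongoDB connector and a connector over the data
# lake's curated zone). All names are hypothetical.
federated_query = """
SELECT c.cust_name,
       o.order_total
FROM   mongodb.crm.customers AS c   -- document store, via connector
JOIN   lake.curated.orders   AS o   -- data lake's curated zone
  ON   c.cust_id = o.cust_id
"""
print(federated_query)
```

The point of the pattern is that the query engine, not the application, hides the difference between the backing stores.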
The real trick is to link information across data models -- that is, across relational, semi-, and multi-structured data (for example, to relate CUST "Alan Smithee," who has loyalty CARD number “8086486DX2,” in a relational data model to client “Wang Ermazi,” who lives at “2500 West End Ave” in ZIP code “37203,” in a document database). Again, this gets at the difficulty of supporting use cases that require or benefit from resource sharing in any ideal implementation in which software functions are supposed to be decomposed into discrete, decoupled services.
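To make the linkage problem concrete, here is a toy sketch in which the two models happen to share a loyalty-card key, so the "join" is trivial. All names, keys and values are invented; the hard part in practice is that, in a decomposed architecture, no single decoupled service owns both datasets or the key-matching logic.

```python
# Toy cross-model linkage: relational rows (tuples) and documents (dicts)
# share a loyalty-card number, which serves as the join key.
# All names, keys, and values here are hypothetical.
relational_customers = [
    # (cust_name, loyalty_card)
    ("Alan Smithee", "8086486DX2"),
    ("Wang Ermazi", "80287XL"),
]

document_clients = [
    {"client": "Wang Ermazi", "address": "2500 West End Ave",
     "zip": "37203", "loyalty_card": "80287XL"},
]

# Index the documents by the shared key, then "join" the two models.
by_card = {doc["loyalty_card"]: doc for doc in document_clients}
linked = [
    {"name": name,
     "address": by_card[card]["address"],
     "zip": by_card[card]["zip"]}
    for name, card in relational_customers
    if card in by_card
]
print(linked)
# [{'name': 'Wang Ermazi', 'address': '2500 West End Ave', 'zip': '37203'}]
```

Without such a shared key, this becomes an entity-resolution problem, which is harder still to farm out across discrete services.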
How does one do this in ideal data lakehouse architecture -- that is, with just a SQL query service?
As for the PaaS data warehouse, it, too, can use Presto (or its own federated query services) to get at data stored in NoSQL repositories and other sources. More importantly, many relational databases can also ingest, manage, query against and perform operations on time series, graph, document, text and other data. They can link information across data models, too.
Let’s call this one a draw, although evidence suggests a multi-model RDBMS will do a better job with this if it also stores and manages the requisite time series, graph and other data internally.
Final Thoughts: The Complementary Data Lakehouse
This is not to dismiss the data lakehouse as a useful innovation. The use cases described in the first article in this series are indisputably compelling. Moreover, it is arguably easier -- in the sense that it is possible both to move quickly and at the same time to bypass internal impediments -- to use the data lakehouse to support time-sensitive, unpredictable, one-off and other workloads and use cases.
The data warehouse is (as it should be) a strictly governed system: it does not turn, and cannot be changed, on a dime. But this is to the data lakehouse’s advantage, in that it constitutes a less strictly governed, more agile alternative to the warehouse. In other words, the lakehouse can be seen as a complement to, not a replacement for, the data warehouse.
The problems I explore in this article and its companion pieces stem from the drive to replace the data warehouse with the data lakehouse. In this specific respect, the data lakehouse falls short. The upshot is that it is difficult, if not impossible, to square the circle -- to reconcile the design requirements of an ideal data lakehouse implementation with the technical requirements of data warehouse architecture.
[i] Basically, Google BigQuery, as distinct from the type of SQL query services used with the data lakehouse.
[ii] Under the covers, most PaaS data warehouse services are probably architected along similar lines. Their constituent software functions -- that is, services -- are typically tightly coupled, however, such that they cannot be exchanged for equivalent services. So, for example, a subscriber cannot take Amazon Redshift’s query optimizer and use it with Snowflake’s PaaS data warehouse. Neither AWS nor Snowflake exposes API endpoints to permit anything like this use case.