Data Warehouse Automation and the Hybrid, Multi-Cloud

Data warehouse automation tools can simplify the design, management and maintenance of data warehouse systems in multi-cloud and hybrid cloud deployments.

Stephen Swoyer, Contributor

October 29, 2021


A data warehouse automation (DWA) tool can simplify cloud-to-cloud migration scenarios, accelerating the initial work of the migration process. In the same way, a DWA tool can simplify the design, management and maintenance of a data warehouse system in a multi-cloud deployment (that is, spanning multiple cloud service providers), as well as in hybrid (on-premises and cloud) deployments.

However, in most migration scenarios, IT and business experts will be required to adapt an existing warehouse to the new cloud IaaS or PaaS environment. A DWA tool can help with this, too--not by automating the re-engineering of the data warehouse, but, instead, by consolidating the functions associated with data warehouse design, testing, deployment and maintenance into a single tool.

First, the usual caveats apply: DWA is most useful when an organization standardizes on a tool as the sole means of designing, deploying and maintaining the data warehouse. The DWA tool captures and manages all of the work performed on the warehouse, which enables it to bring this work forward to the new data warehouse target. See the first and second installments in this series for more about this.

DWA and Cloud-to-Cloud Migration

A DWA tool is less a means of migrating the data warehouse than of translating that system to run in a different context. In effect, a DWA tool recreates, or regenerates, the data warehouse from scratch.

In the second installment in this series, I likened a DWA tool to a kind of compiler that translates the native structures (such as tables, primary and foreign keys, indexes and constraints) that are specific to one data warehouse platform into equivalent structures that are specific to another, dissimilar platform.

So, at a high level, using a DWA tool to re-engineer a data warehouse designed for Amazon’s Redshift PaaS to run in Snowflake’s PaaS is analogous to porting C code from one Unix-like platform (say, HP-UX) to another (AIX). However, as anyone who has ever ported software between platforms will tell you, the devil lies in the details--details that include incompatible features and functions, unmet dependencies, dissimilar API implementations, unpredictable performance, and so on. Most of the work of porting software lies in troubleshooting problems. This is true in data warehouse migration, too.
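To make the compiler analogy concrete, here is a minimal sketch, in Python, of the kind of structural translation involved. The mapping rules and function names here are hypothetical--a real DWA tool works from a parsed model of the warehouse, not regular expressions--but the underlying idea of mapping one platform's constructs (Redshift's DISTKEY/SORTKEY) to a target's rough equivalents (Snowflake's CLUSTER BY) is the same:

```python
import re

# Hypothetical sketch: translate a Redshift CREATE TABLE statement into a
# rough Snowflake equivalent. Real tools parse the DDL into a model first;
# regexes are used here only to keep the example short.

TYPE_MAP = {
    "BPCHAR": "CHAR",    # Redshift alias -> more portable type name
    "FLOAT4": "FLOAT",
    "FLOAT8": "DOUBLE",
}

def redshift_to_snowflake(ddl: str) -> str:
    # Snowflake micro-partitions data automatically, so DISTSTYLE/DISTKEY
    # have no direct equivalent and are simply dropped.
    ddl = re.sub(r"\s+DISTSTYLE\s+\w+", "", ddl, flags=re.I)
    ddl = re.sub(r"\s+DISTKEY\s*\([^)]*\)", "", ddl, flags=re.I)
    # A Redshift SORTKEY maps loosely onto a Snowflake clustering key.
    ddl = re.sub(r"\s+SORTKEY\s*\(([^)]*)\)", r" CLUSTER BY (\1)", ddl, flags=re.I)
    for src, dst in TYPE_MAP.items():
        ddl = re.sub(rf"\b{src}\b", dst, ddl, flags=re.I)
    return ddl

print(redshift_to_snowflake(
    "CREATE TABLE sales (id INT, region BPCHAR(8)) DISTKEY(id) SORTKEY(id)"
))
# -> CREATE TABLE sales (id INT, region CHAR(8)) CLUSTER BY (id)
```

Even this toy example hints at where the hard work lives: the constructs that do not map cleanly (here, DISTKEY) are exactly the details a human must re-engineer.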

To recap, an organization can use a DWA tool to quickly regenerate its existing data warehouse--along with all of its assets--for the new target platform. The DWA tool automates the following tasks (a sketch of how these steps might be chained together appears after the list):

  1. Regenerate SQL code to create data structures optimized for the target platform

  2. Move all data from the old to the new data warehouse target

  3. Bring forward (updating and/or deprecating as necessary) all native objects

  4. Port and schedule ELT jobs, data conditioning jobs, and associated logic

  5. Related: regenerate and/or re-sequence ELT code for optimal parallel execution

  6. Update, delete, and/or create metadata and documentation

  7. Test and validate steps 1-6 in the migration process
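Here is that sketch: a minimal, hypothetical orchestration of the seven-step checklist. The step names and the MigrationPlan structure are illustrative, not any vendor's actual API--the point is only to show the shape of the automation (generate, move, validate):

```python
# Hypothetical orchestration of the seven-step checklist above. Each step
# would be a call into a DWA tool's API; here they are stubbed out so the
# control flow is visible.

from dataclasses import dataclass, field

@dataclass
class MigrationPlan:
    source: str
    target: str
    completed: list = field(default_factory=list)

STEPS = [
    "regenerate_ddl",        # 1. platform-optimized SQL structures
    "move_data",             # 2. bulk-copy warehouse data
    "port_native_objects",   # 3. views, UDFs, stored procedures
    "port_elt_jobs",         # 4. pipelines and their schedules
    "resequence_elt",        # 5. re-order jobs for parallel execution
    "refresh_metadata",      # 6. lineage, docs
]

def run_migration(plan: MigrationPlan) -> None:
    for step in STEPS:
        print(f"{plan.source} -> {plan.target}: {step}")
        plan.completed.append(step)
    # 7. Test and validate. A real tool would diff row counts, schemas
    # and sample query results between source and target here.
    assert len(plan.completed) == len(STEPS), "validation failed"

run_migration(MigrationPlan(source="Redshift", target="Snowflake"))
```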

From IT’s perspective, the DWA tool regenerates existing views, indexes, models, stored procedures, UDFs, metadata--including technical metadata for data lineage--documentation, etc., for the cloud data warehouse target. The DWA tool also brings forward all of the data engineering pipelines that feed the warehouse. Most DWA tools shift the data engineering workload from a separate platform (for example, an extract, transform, load, or ETL, platform) to run in the target data warehouse.

This extract, load, transform (ELT) technique centralizes the work of managing and maintaining data engineering pipelines. In regenerating an existing cloud data warehouse for a dissimilar cloud target, the DWA tool tests, validates and schedules ELT jobs in the destination IaaS or PaaS data warehouse. Nearly all DWA tools interoperate with OS-level scheduling software, such as cron (on Unix-like platforms) or Task Scheduler (on Windows); some have built-in schedulers, too.
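As an illustration, here is a minimal ELT job of the kind a DWA tool might generate and hand off to cron. It assumes the snowflake-connector-python package; the credentials, schema and SQL are placeholders:

```python
# Minimal sketch of a generated ELT job: the transform runs as SQL inside
# the target warehouse (here Snowflake). Invoked by cron, e.g. nightly:
#   0 2 * * * /usr/bin/python3 /opt/dwa/jobs/load_sales.py

import os
import snowflake.connector

TRANSFORM_SQL = """
    INSERT INTO mart.daily_sales
    SELECT order_date, region, SUM(amount)
    FROM staging.orders
    GROUP BY order_date, region
"""

def main() -> None:
    conn = snowflake.connector.connect(
        account=os.environ["SF_ACCOUNT"],
        user=os.environ["SF_USER"],
        password=os.environ["SF_PASSWORD"],
        warehouse="ELT_WH",   # illustrative virtual warehouse name
    )
    try:
        conn.cursor().execute(TRANSFORM_SQL)
    finally:
        conn.close()

if __name__ == "__main__":
    main()
```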

This is the easy part. The last section of this article touches on the hard work of re-engineering the data warehouse to run in a dissimilar cloud IaaS or PaaS environment. A DWA tool can help with this, too.

DWA and Hybrid/Multi-cloud Deployments

The following is not intended as a step-by-step tutorial on creating and maintaining the hybrid/multi-cloud data warehouse, but, rather, as a cursory analysis of the issues involved.

More specifically, it explores the benefits of using a multi-target DWA tool to assist with moving a data warehouse system between and among dissimilar cloud PaaS services and using a DWA tool to assist with creating and maintaining the hybrid--that is, on-premises and cloud--data warehouse.

1. Decomposing the functions of a large enterprise data warehouse into smaller instances.

One trend among enterprises that move large, on-premises data warehouses to cloud infrastructure is to break up these systems into smaller units--for example, by subdividing them according to discrete business subject areas and/or practices.

IT experts can use a DWA tool to accelerate this task--for example, by subdividing a complex enterprise data model into several subject-specific data marts and using the DWA tool to instantiate them as separate virtual data warehouse instances, or by using the tool to create new tables that encapsulate different kinds of dimensional models and instantiating those in virtual data warehouse instances. In most cases, the DWA tool can use the APIs exposed by the PaaS data warehouse service to create a new virtual data warehouse instance or to modify an existing one. The tool populates each instance with data, replicates the necessary data engineering jobs and performs the rest of the operations in the migration checklist described above.
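A hedged sketch of that provisioning step follows, using Snowflake's CREATE WAREHOUSE DDL as the example; the mart names and warehouse sizing are illustrative, not a recommendation:

```python
# Hypothetical sketch: spin up one virtual warehouse and one database per
# subject-specific data mart. The DDL shown is standard Snowflake syntax;
# everything else here is illustrative.

MARTS = {"finance": "SMALL", "sales": "MEDIUM", "hr": "XSMALL"}

def provision_statements(marts: dict) -> list[str]:
    stmts = []
    for name, size in marts.items():
        stmts.append(
            f"CREATE WAREHOUSE IF NOT EXISTS {name}_wh "
            f"WITH WAREHOUSE_SIZE = '{size}' AUTO_SUSPEND = 300"
        )
        stmts.append(f"CREATE DATABASE IF NOT EXISTS {name}_mart")
    return stmts

for stmt in provision_statements(MARTS):
    print(stmt)   # in practice, executed via the platform's connector/API
```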

2. Moving data warehouse instances (data marts) between cloud infrastructure providers.

If an organization uses the same cloud data warehouse across dissimilar cloud infrastructure environments, this should be relatively straightforward. As before, the DWA tool runs through the migration checklist (see above) and regenerates the existing data warehouse/data marts for the new cloud infrastructure context. In this scenario, the catch comes by way of likely data egress fees--the not-insignificant cost of moving data out of one infrastructure context (Snowflake running in AWS) and into another (Snowflake running in Azure).
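A quick back-of-the-envelope calculation shows why egress matters. The per-GB rate below is purely illustrative--actual rates vary by provider, region and volume tier:

```python
# Illustrative egress arithmetic; the rate is an assumption, not a quote.

WAREHOUSE_TB = 40
EGRESS_PER_GB = 0.09          # assumed blended $/GB rate for illustration

cost = WAREHOUSE_TB * 1024 * EGRESS_PER_GB
print(f"One-time egress for {WAREHOUSE_TB} TB: ~${cost:,.0f}")
# -> One-time egress for 40 TB: ~$3,686
```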

Of course, Snowflake, like most PaaS platforms, offers tools to help automate cloud-to-cloud migration, as well as to help manage workloads across dissimilar cloud infrastructure contexts.

The real value of a DWA tool in multi-cloud deployments lies in its ability to facilitate migration--and to centralize warehouse design and maintenance--across dissimilar cloud data warehouse services.

3. Moving data warehouse workloads between on-premises and cloud contexts.

In the same way, for the same reasons, a DWA tool can simplify hybrid data warehouse deployments--especially when the on-premises data warehouse software (say, IBM DB2) is of a different type than the cloud data warehouse software (AWS Redshift or Snowflake).

You can use a DWA tool to break up the enterprise data warehouse and to move parts of it (for example, specific business subject areas or workloads such as disaster recovery) to one or more cloud data warehouse targets--or to regenerate the entire data warehouse for a single cloud target.

The process is similar to that described in the migration checklist, above. Depending on the requirements of the organization, it may take several forms--for example, moving or replicating the data warehouse in toto from the on-premises environment to the cloud; breaking up the data warehouse data model into several subject-specific models; moving specific use cases or workloads (such as disaster recovery) from a recovery data center to the cloud; or relocating specific practices (such as BI discovery, data science and data engineering) to the cloud context.

4. Terraforming new data warehouse or data mart instances.

Organizations can use a DWA tool to rapidly create--to “terraform,” in the argot of the cloud--data warehouse or data mart configurations for hybrid deployments, as well as for multi-cloud deployments that span dissimilar IaaS/PaaS data warehouse services and/or dissimilar cloud infrastructure environments.

Example scenarios include:

●      Different kinds of subject-specific Snowflake data marts can be deployed across dissimilar cloud infrastructure services, such as AWS, GCP and Azure. Ditto for IaaS instances of DB2, Oracle, SQL Server, etc. running in AWS, GCP or Azure.

●      Dissimilar types of data warehouses and/or subject-specific marts can be created for and deployed across dissimilar cloud infrastructure services. So, for example, a sales data mart engineered for AWS Redshift can be re-engineered for an IaaS Oracle data warehouse in Azure or for Google BigQuery. A data warehouse engineered for IaaS Oracle running in GCP can be re-engineered for Redshift running on AWS.

●      Dissimilar types of data warehouses and/or subject-specific data marts can be created for and deployed across the on-premises environment and (dissimilar) cloud IaaS or PaaS services. Data warehouses/marts can likewise span multiple, dissimilar cloud infrastructure providers. So, for example, you can use the same DWA tool to manage an on-premises Oracle data warehouse, several subject-specific data marts in AWS Redshift and Snowflake instances running in Azure that support BI discovery.

These are just a few possible scenarios; the total number of possible combinations is staggering. The value-adds in this case are convenience and, especially, portability. Need a new data mart for a specific business use case? What about a new sandbox for a one-off business analysis or an experimental use case? You can easily reuse an existing data mart and regenerate it for the appropriate cloud IaaS/PaaS data warehouse. This is roughly analogous to creating different kinds of customized virtual data warehouses that you can quickly deploy to address predictable (or unpredictable) needs.
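Here is a sketch of what that reuse might look like in practice: a declarative mart spec that is regenerated for whichever platform/cloud pairing is requested. All of the names, targets and file paths are hypothetical:

```python
# Illustrative sketch of "terraforming" a data mart from a declarative
# spec. The same mart definition can be regenerated for any supported
# platform/cloud pairing.

MART_SPEC = {
    "name": "bi_sandbox",
    "subject_area": "sales",
    "source_model": "models/sales_mart.yml",   # hypothetical model file
}

SUPPORTED_TARGETS = {
    ("snowflake", "azure"),
    ("redshift", "aws"),
    ("bigquery", "gcp"),
}

def deploy(spec: dict, platform: str, cloud: str) -> None:
    if (platform, cloud) not in SUPPORTED_TARGETS:
        raise ValueError(f"no code generator for {platform} on {cloud}")
    # A real DWA tool would regenerate DDL and ELT for the target here.
    print(f"Deploying {spec['name']} ({spec['subject_area']}) "
          f"to {platform} on {cloud}")

deploy(MART_SPEC, "snowflake", "azure")   # sandbox for BI discovery
deploy(MART_SPEC, "bigquery", "gcp")      # same mart, different target
```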

True, almost all cloud PaaS offerings boast similar capabilities, and it is also simple to clone IaaS and PaaS data warehouse instances. This is one of the selling points of the cloud, after all. However, none of these cloud services promotes portability between and among competitive IaaS/PaaS offerings--that is, you can’t take a virtual data warehouse created for IaaS Oracle and re-deploy it in Redshift. A DWA tool can. It likewise supports versioning to manage (and, if appropriate, merge) changes between dissimilar cloud data warehouse services.

This makes it practicable to take advantage of differences in pricing between providers--for example, spot discounts on pricing--or to opportunistically exploit other efficiencies.

DWA in Practice: Creating, Maintaining and Migrating the Hybrid/Multi-Cloud Data Warehouse

So, imagine that you use a multi-target DWA tool to design and engineer an enterprise data warehouse for Azure Synapse Analytics. To make things even easier, imagine that you use a DWA tool to design and engineer several subject-specific data marts for Azure Synapse.

On the one hand, you can use the DWA tool to modify and maintain these data marts, changing them over time; all of the changes you make are automatically captured and versioned by the tool itself. On the other hand, you can use the DWA tool to regenerate your Azure Synapse data marts for a dissimilar platform--say, Snowflake--running in dissimilar infrastructure--say, GCP. You can also use the tool to perform (and capture) the changes you make as you re-engineer your data marts for Snowflake.

Once the Snowflake data marts are up and running, you can turn off your Azure Synapse marts. The thing is, you can still keep your Azure Synapse marts (that is, their different versions) in the DWA tool. If you make useful modifications to your Snowflake data marts, you can merge these into your archived Azure Synapse marts. If need be, you can redeploy these marts in Azure Synapse and switch off your Snowflake instances.

The point is that once you use a DWA tool to design or customize a warehouse or data mart for a particular IaaS or PaaS, the tool simplifies the task of re-customizing those assets for dissimilar IaaS or PaaS data warehouses and marts, running in the same or in dissimilar cloud infrastructure environments. It likewise simplifies maintaining and versioning those assets. By analogy, think of this process as akin to checking out a data warehouse branch from Git, making changes to it, creating a new branch and, if you opt to deploy it in production, recompiling it for the new environment.
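Taken literally, if a DWA tool stores warehouse definitions as files under version control, the branch-and-merge workflow looks something like the following. The repository path, branch names and commit message are hypothetical:

```python
# The Git analogy, taken literally: warehouse definitions as versioned
# files, with branching per target platform. Paths/names are illustrative.

import subprocess

def git(*args: str) -> None:
    subprocess.run(["git", "-C", "/opt/dwa/warehouse-repo", *args], check=True)

git("checkout", "-b", "snowflake-port", "synapse-prod")  # branch the Synapse model
# ... re-engineer the model files for Snowflake, then:
git("add", "-A")
git("commit", "-m", "Regenerate finance marts for Snowflake on GCP")
# Later, fold useful Snowflake-era changes back into the Synapse line:
git("checkout", "synapse-prod")
git("merge", "snowflake-port")
```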

In the same way, you can use a DWA tool to deploy parts of a large on-premises or cloud enterprise data warehouse to dissimilar cloud infrastructure or IaaS/PaaS contexts. So, if you want to move your finance functions from Azure Synapse to Snowflake, a DWA tool can assist with this. And it can radically simplify the process of reverting from Snowflake back to Azure Synapse.

In this way, a DWA tool makes it practicable--though not in any sense simple--to manage data warehouse instances across dissimilar clouds. It makes it practicable to exploit different kinds of aleatory advantages in the cloud--pricing disparities, performance benefits, intra-cloud data movement--and it gives you more options for managing the disaster recovery use case. It can also act as a hedge against both cloud infrastructure provider and cloud service provider lock-in.

It does require that you put in the work--the admittedly hard work--of custom-engineering your data warehouse and/or data marts to take advantage of the strengths, or to work around the weaknesses, of each specific IaaS or PaaS data warehouse. The role of a DWA tool in multi-cloud or hybrid cloud deployments is to function as a kind of version control system for data warehouse development, centralizing the work of design, testing, deployment, maintenance and versioning.

For a summary of the work involved in custom-fitting the data warehouse for its (on-premises or cloud) environment, see the final section of the second article in this series. Some takeaway notes:

Right now, DWA tool support for cloud PaaS data warehouse services is limited. Generally, a multi-target DWA tool will work with the same targets in both on-premises and cloud IaaS contexts. So, for example, a tool that supports DB2, Oracle, and SQL Server in the on-premises environment should work with all three platforms in the IaaS cloud.

Official support for PaaS and query-as-a-service (QaaS) data warehouses is still extremely limited, however.

Among PaaS data warehouses, Amazon Redshift, Azure Synapse Analytics, and Snowflake are supported by several tools. Among QaaS warehouses, Google BigQuery also enjoys cross-tool support. However, PaaS data warehouses from Alibaba, IBM (DB2 Warehouse on Cloud), Oracle (Autonomous Data Warehouse), SAP (SAP Data Warehouse Cloud), Teradata (Vantage), and Yellowbrick, among others, do not enjoy extensive support.

This will almost certainly change. Several DWA tools vendors have long-standing partnerships with Oracle and Teradata, especially. Officially, however, multi-cloud portability between PaaS and QaaS data warehouses is currently limited.

The simpler the warehouse data model, the simpler the migration. Over the last two decades, especially, data management experts have gravitated toward relatively simple data models, such as denormalized one-big-table (OBT) schemas, for the data warehouse.

One benefit of this approach is simplified portability: some databases are just better than others at handling complex data models. In this case, modeling that would otherwise have been instantiated in the warehouse’s data model is, instead, instantiated in database views, in a BI semantic layer, in a data virtualization layer/data fabric, or in data pipelines (queries) that write data to and retrieve data from the warehouse. This makes it easier for the DWA tool to regenerate the warehouse for dissimilar target platforms. It also reduces the engineering required to custom-fit the new warehouse.
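As a sketch of the pattern: the physical model stays a denormalized OBT, and the dimensional logic lives in a view that is comparatively easy for a DWA tool to regenerate per platform. The table and column names below are illustrative:

```python
# Illustrative sketch: dimensional logic expressed as a view over a
# denormalized one-big-table, rather than baked into the physical model.

OBT_TABLE = "analytics.sales_obt"   # hypothetical one-big-table

DIMENSIONAL_VIEW = f"""
CREATE OR REPLACE VIEW analytics.sales_by_region AS
SELECT region, product_category,
       DATE_TRUNC('month', order_date) AS order_month,
       SUM(amount) AS revenue
FROM {OBT_TABLE}
GROUP BY region, product_category, DATE_TRUNC('month', order_date)
""".strip()

# ANSI-ish SQL like this ports across most targets with few edits
# (BigQuery, notably, reverses the DATE_TRUNC argument order).
print(DIMENSIONAL_VIEW)
```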

 

About the Author

Stephen Swoyer

Contributor

Stephen Swoyer is a researcher and analyst with 25 years of experience. His research has focused on business intelligence (BI), data warehousing, analytics, data science, machine learning and the larger information management segment for nearly 20 years. He also enjoys researching and writing about software development and software architecture--or technology architecture of any kind, for that matter. Swoyer is a recovering philosopher, with an abiding focus on ethics, philosophy of ideas and philosophy of science.
