Executing a Zero-Downtime Storage Hardware Refresh
Refreshing storage hardware without downtime or data loss requires careful planning. Here’s a real-world example.
February 22, 2024
In a recent article, I explained how I planned a storage refresh in my environment. I outlined five basic requirements that my refresh had to meet:
1. Increase storage capacity to meet my needs for the next five years.
2. Complete the storage upgrade without any downtime.
3. Perform the storage upgrade without experiencing data loss.
4. Ensure that the new storage maintains or improves upon the current level of resilience.
5. Ensure that the new storage matches the performance of my current setup.
Given these requirements, I would like to discuss how I executed the storage refresh to ensure zero downtime and prevent data loss (meeting requirements 2 and 3).
The Production Environment’s Setup
Before the hardware refresh, my production environment consisted of two Hyper-V hosts connected to a dedicated NAS. I have a single, very large virtual machine that contains all my data. The VM is replicated across both servers via the Hyper-V replication feature.
I built my production environment this way (instead of creating a failover cluster) to achieve genuine shared-nothing redundancy. The replication process occurs automatically every 30 seconds. As such, in the event of a critical failure, I could activate the standby replica and, in theory, never lose more than 30 seconds’ worth of data.
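To make the 30-second figure concrete, here is a minimal sketch of the arithmetic. It is not part of Hyper-V. It assumes changes only reach the replica at each replication cycle and that a cycle completes instantly. The function name and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Replication interval quoted in the article.
REPLICATION_INTERVAL = timedelta(seconds=30)

def unreplicated_window(last_replication: datetime, failure_time: datetime) -> timedelta:
    """Return the span of changes that exist only on the primary at failure_time."""
    if failure_time < last_replication:
        raise ValueError("failure_time precedes last_replication")
    return failure_time - last_replication

# Hypothetical example: the primary fails 17 seconds after the last cycle.
last_sync = datetime(2024, 2, 22, 9, 0, 0)
failure = last_sync + timedelta(seconds=17)
window = unreplicated_window(last_sync, failure)

print(window.total_seconds())          # 17.0
print(window <= REPLICATION_INTERVAL)  # True
```

As long as cycles keep completing on schedule, the window stays within the interval. A stalled or slow cycle would widen it, which is one reason to check replication health before relying on the replica.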
A Redundancy-Driven Approach
I decided to maintain this type of redundancy since it has always worked well for me. For the hardware refresh, my plan involved creating an offline backup, which would act as a last line of defense if something went horribly wrong. From there, I would:
1. Verify that the replicas are in sync and then break the replica pair.
2. Shut down the replica NAS and the replica host.
3. Remove and replace the replica NAS, bring it online, and re-enable Hyper-V replication.
4. Once all data has replicated to the new NAS, perform a lossless failover to the replica server, making it host the running copy of the production virtual machine.
5. Break the replica pair again, replace the other NAS, bring it back online, and then reestablish the replication process.
6. Finally, perform one more lossless failover to return the running copy of the VM to its original host.
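The plan above can be sketched as a small simulation that tracks where the VM runs and whether a standby replica exists after each action. This is only an illustrative model, not a script that touches Hyper-V. The host names `host_a` and `host_b`, the `Environment` class, and the step labels are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Environment:
    running_on: str = "host_a"       # host running the production VM
    standby_available: bool = True   # True while an in-sync replica exists
    refreshed: set = field(default_factory=set)  # hosts whose NAS was replaced

def simulate_refresh():
    """Walk the refresh plan and log (step, running host, standby available)."""
    env = Environment()
    log = []

    def record(step):
        log.append((step, env.running_on, env.standby_available))

    # The same three actions run twice: first against the original replica
    # side, then against the side that originally ran the VM.
    for side in ("host_b", "host_a"):
        env.standby_available = False
        record(f"break pair, power off {side} and its NAS")
        env.refreshed.add(side)
        env.standby_available = True
        record(f"replace NAS on {side}, replicate again")
        env.running_on = side
        record(f"planned failover to {side}")
    return env, log

env, log = simulate_refresh()
for step, host, standby in log:
    print(f"{step:45} VM on {host}, standby: {standby}")
```

The model makes two things visible: the VM ends up back on its original host with both NAS units replaced, and there are exactly two stretches during which no standby replica exists.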
Verifying a replica’s health in Hyper-V is simple. Just open the Hyper-V Manager, right-click on the virtual machine, and select the Replication | View Replication Health commands from the shortcut menus.
It’s a good idea to perform this check on both replication partner hosts. In rare circumstances, two replication partners can report completely contradictory health data. Given that one replication partner will be taken offline, it’s important to confirm the replication’s health.
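If you collect the health data from each host programmatically (for example, with the Hyper-V PowerShell module's `Measure-VMReplication` cmdlet), the cross-check can be automated. The sketch below only compares two reports that have already been gathered. The dictionary keys and the sample values are assumptions made for illustration, not the cmdlet's actual output format.

```python
def compare_health(primary: dict, replica: dict) -> list:
    """Return a list of problems; an empty list means both reports agree and look healthy."""
    problems = []
    for side, report in (("primary", primary), ("replica", replica)):
        if report["health"] != "Normal":
            problems.append(f"{side} reports health {report['health']}")
    if primary["health"] != replica["health"]:
        problems.append("hosts disagree on replication health")
    if primary["last_replication"] != replica["last_replication"]:
        problems.append("hosts disagree on last replication time")
    return problems

# Hypothetical reports, one per replication partner.
primary_report = {"health": "Normal", "last_replication": "2024-02-22T09:00:00"}
replica_report = {"health": "Warning", "last_replication": "2024-02-22T09:00:00"}

for problem in compare_health(primary_report, replica_report):
    print(problem)
```

An empty result from both perspectives is the signal to proceed. Anything else is worth investigating before taking a replication partner offline.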
Figure 1. It’s important to verify that Hyper-V replication is healthy.
After verifying the replication health and confirming that all data has been replicated between the two hosts, the next step is to disable replication. In the Hyper-V Manager, right-click the virtual machine and select the Replication | Remove Replication commands from the shortcut menus. This action needs to be performed on both Hyper-V hosts. The process does not delete the virtual machine copy (the replica) but does stop further data replication.
Figure 2. You can use the Remove Replication menu option to terminate the replication partnership.
Downtime and Data Loss
When migrating a virtual machine, consider these two critical points:
1. Downtime and data loss
As previously noted, my refresh requirements included zero downtime and no data loss. Technically, you cannot accomplish the migration I am performing with both zero downtime and zero data loss. A lossless failover (Microsoft refers to it as a "planned failover") requires powering down the virtual machine during the process. It ensures no data loss but involves a brief period of downtime (usually lasting a minute or so). The alternative, an "unplanned failover," avoids downtime but risks losing data that hasn't been replicated.
2. Minimized downtime with risks
The migration method I used requires a very short downtime but relies on the primary Hyper-V host remaining operational during the storage refresh. During this phase, no standby virtual machine replica is available. Even so, there is some hardware-level redundancy that will help mitigate the risk of a failure. For example, my Hyper-V host servers have redundant power supplies, while my existing NAS appliances are configured with redundancy to protect against disk failures.