AlwaysOn Availability Groups and SQL Server Jobs, Part 12: Issues with Enabling and Disabling Batch Jobs
An in-depth look at some of the concerns involved with enabling/disabling entire SQL Server jobs based upon whether they should be run, or not—and then look at options for addressing those concerns.
October 30, 2014
In my previous post, we re-evaluated whether Batch Jobs (see Post #3 – Defining Batch Jobs, for more on that term) should simply be configured or injected with If/Else logic that helps them determine if Logic should be executed or not at run time, or whether entire Batch Jobs should simply be Enabled/Disabled based upon whether or not they're hosted on the same server (or instance) that is hosting the Primary Replica for the AlwaysOn Availability Group database being targeted.
In this, and the next few posts, we'll start looking more in-depth at some of the concerns involved with Enabling/Disabling entire jobs based upon whether they should be run, or not—and then look at options for addressing those concerns.
Examining the Concerns
While the idea of Disabling or Enabling SQL Server Agent Jobs is fairly straightforward, there are some wrinkles. First, we need to remember that this approach or technique doesn't—or shouldn't—apply to Instance/Server-level jobs, as those should be enabled on all servers hosting your AlwaysOn Availability Group. The same thing goes for backups—they're a whole other beast. (Again, see Post #3 for more insight into these concerns).
Otherwise, the entire idea of Enabling/Disabling jobs is that if we have a Batch Job that's designed to target MyDatabase, and that database is being hosted in a simple 2-node Availability Group such that the MyDatabase database is straddling two different Servers/Instances, we only want the logic in the batch job in question to be run against the primary replica, such that the SQL Server Agent Job on the Primary Server should be Enabled, while it's duplicated/cloned counterpart on the Secondary Server should be Disabled (so that it won't run and try to make changes to its own recovering copy of the database, and throw errors).
Implementation of the Logic defined above, though, can easily run afoul of a few practicalities:
There obviously has to be some sort of Logic running behind the scenes that will Enable/Disable jobs as needed (and especially during/after/around failovers).
We obviously need some way of keeping track of which Jobs should be Enabled/Disabled based upon their affinity with the Availability Group database being targeted. Or, in other words, if we have 18 SQL Server Agent jobs deployed on a server but only three of them are Batch Jobs that target one of our Availability Group (AG) databases, we can't just blindly assume that we can Enable/Disable all 18 jobs on a server based upon whether or not the Server in question is hosting the Primary Replica we're targeting. Instead, we only want to toggle the Enabled/Disabled status of our three AG jobs.
Further, just because we've been able to differentiate our three AG-level jobs from the other 15 jobs on the server doesn't mean that we can make blind assumptions about which of those jobs should be enabled on a server after a failover occurs. For example, assume that of our three AG-Level jobs, one of them has been intentionally disabled by administrators for, say, a few days. Assume, likewise, that Server1 is currently hosting the Primary Replica of our target database. Then, assume that a day after we disable one of those three jobs, we fail over. We CAN logically assume that once we failover that all three AG-level jobs on Server1 should be disabled. But, we can't blindly assume that all three AG-Level jobs should be enabled on Server2. (Or, if we did assume that we'd obviously run into issues.) Granted, we might try to query Server1 to see what the state of those three jobs were, but the whole idea of Availability Groups and failover is the notion that Server1 most likely won’t be available to query if/when a failover occurs. So, we need some other means of tracking the state of AG-Level Jobs (i.e., whether they've been explicitly Disabled or not by Admins so that they can REMAIN disabled after a failover).
Addressing the Concerns
To address the three concerns listed above, we’ll use two techniques.
Identifying AG-Level Databases. First, to identify which SQL Server Agent Jobs on a given server are batch jobs (or target an Availability Group database), we’ll use a pattern or technique from SQL Server Database Mirroring—exposed by the SQL Server Customer Advisor Team—of creating custom SQL Server Agent Job Categories—where the name of the Category matches the Availability Group name itself, and (then) where Jobs that target the Availability Group (or databases within it) are linked or associated with the custom Category in question.So, for example, if I have an Availability Group called SSV (with a single database in it, also called SSV4), and I have a Batch Job that I want to associate with the SSV4 database, I'd start by right clicking on the SQL Server Agent > Jobs node, selecting "Manage Job Categories," and create a new Job Category entitled SSV (i.e., targeting the name of the Availability Group NOT the database). I'd then go into the batch job in question and specify that it should belong to the SSV category, as shown below: And, from this point on, it's very easy to tell that this SQL Server Agent Job is associated with an Availability Group.
Identifying Enabled/Disabled AG-Level Jobs. While the SQLCAT article mentioned above does provide logic (or code) that showcases how to enable/disable Jobs based upon what SQL Server Agent Job Category they belong to, it doesn't address the need to account for SQL Server Agent Jobs that might've been temporarily—but explicitly—Disabled by administrators for whatever reason. To address this issue, we'll end up with an approach that involves a state table—or a simple table that will keep track of Job Names and their Enabled/Disabled status. This, of course, will impose some additional complexity, and cause a number of additional sub-concerns that will need to be addressed as well. The good news, though, is that storage of this table is pretty simple—as we can put it into one of our Availability Group databases (i.e., in this case I'd put it in the SSV4 database) and that means that whatever Server is currently hosting the SSV Availability Group will have access to this state data.
In my next post, we'll look further into what it takes to set up a state table and address some of the potential pitfalls that might entail as well.
Part 1: Introduction
Part 2: Putting AlwaysOn into Context
Part 3: Defining Batch Jobs
Part 4: Synchronizing Server-Level Details
Part 5: Setting Up Failover Alerts
Part 6: High-Level Options and Rationale for Handling Batch Jobs
Part 7: Detecting Primary Replica Ownership
Part 8: Dynamic Detection Gotcha #1
Part 9: Dynamic Detection Gotcha #2
Part 10: Dynamic Detection Gotcha #3
Part 11: A Fork in the Road
About the Author
You May Also Like