Troubleshooting Windows Server 2012 Failover Clusters

How to get to the root of the problem with Windows Server 2012 failover clusters

John Marlin

June 28, 2013


In "Troubleshooting Windows Server 2008 R2 Failover Clusters," I discussed troubleshooting failover clusters—specifically, the locations and tips for where you can go to get the data you need to troubleshoot a problem. Now I'll discuss some of the improvements made to the troubleshooting tools for Windows Server 2012 failover clusters and show you how to take advantage of those tools.

Introducing the New Event Channels

There are some new event channels for failover clustering to help with troubleshooting. Figure 1 shows all the available channels.

Note that the events are specific to the node you're on.

Knowing the purpose of each event channel can help you find the errors more quickly, which in turn will help you troubleshoot the problem more quickly. Here's an explanation of each channel:

  • FailoverClustering
      o  Diagnostic. This is the main log that's circular in nature and runs anytime the cluster service starts. Events can be read in the Event Viewer if logging is disabled. They can also be converted to text file format.
      o  Operational. Any informational cluster events are registered in this log, such as groups moving, going online, or going offline.
      o  Performance-CSV. This channel is used to collect information pertaining to the functionality of Cluster Shared Volumes (CSVs).

  • FailoverClustering-Client
      o  Diagnostic. This channel collects Cluster API trace logging. This log can be useful in troubleshooting the Create Cluster and Add Node Cluster actions.

  • FailoverClustering-CsvFlt (new in Server 2012)
      o  Diagnostic. This channel collects trace logging for the CSV Filter Driver (CsvFlt.sys) that is mounted only on the coordinator node for a CSV. This channel provides information regarding metadata operations and redirected I/O operations.

  • FailoverClustering-CsvFs (new in Server 2012)
      o  Diagnostic. This channel collects trace logging for the CSV File System Driver (CsvFs.sys), which is mounted on all nodes in the cluster. This channel provides information regarding direct I/O operations.

  • FailoverClustering-Manager
      o  Admin. This channel collects errors associated with dialog boxes and pop-up warnings that are displayed in Failover Cluster Manager.

  • FailoverClustering-WMIProvider
      o  Admin. This channel collects events associated with the Failover Cluster WMI provider.
      o  Diagnostic. This channel collects trace logging associated with the Failover Cluster WMI provider. It can be useful when troubleshooting Windows Management Instrumentation (WMI) scripts or Microsoft System Center applications.

Using the FailoverClustering-Client/Diagnostic Channel

Because administrators often encounter problems when creating clusters and joining nodes, I want to show you how to use the FailoverClustering-Client/Diagnostic channel. This channel is disabled by default, so it won't be collecting any data. To enable it, you need to right-click the channel and choose Enable Log. The Diagnostic channel will then start collecting data relevant to a join or create operation.

For example, suppose you previously enabled the Diagnostic channel and you're having a problem creating a cluster. To view the data collected, you need to right-click the channel and choose Disable Log. In the FailoverClustering-Client/Diagnostic event log, you see the following events:

Event ID: 2
Level: Error
Description: CreateCluster (1883): Create cluster failed with exception. Error = 8202, msg: Failed to create cluster name CLUSTER on DC \\DC.CONTOSO.COM. Error 8202.

Event ID: 2
Level: Error
Description: CreateClusterNameCOIfNotExists (6879): Failed to create computer object CLUSTER on DC \\DC.CONTOSO.COM with OU ou=Clusters,dc=contoso,dc=com. Error 8202.

Because you have errors, you can use the Net.exe command to see what their status code (8202) means:

NET HELPMSG 8202

The command returns the message: The specified directory service attribute or value does not exist. With the new features of Server 2012 Failover Clustering, the cluster will be created in the same organizational unit (OU) as the nodes. For the cluster name to be created, the logged-on user must have at least Read and Create Computer Objects permissions. If the user doesn't have those rights, the name won't be created and you'll receive this type of error.

Now suppose you're trying to add a node to the existing cluster and the operation fails. You review the events in the FailoverClustering-Client/Diagnostic log, and see the following:

Event ID: 56
Level: Warning
Description: AsyncNotificationCallback (1463): ApiGetNotify on hNotify=0x0000000021EBCDC0 returns 1 with rpc_error 0

Event ID: 2
Level: Error
Description: SCMStateNotify (837): Repost of NotifyServiceStatusChange failed for node 'NodeX': status = 1168

Although their wording is a bit on the cryptic side, the descriptions give you the information that you need. The description for the first event tells you that a remote procedure call (RPC) error occurred. The description for the second event gives you a status code of 1168. Once again, you can use the Net.exe command to see what that status code means:

NET HELPMSG 1168

This time, the command returns the message: Element not found. When a node tries to join a cluster, the running cluster node needs to make an RPC connection to the node being added. In this case, it couldn't find the node.

So, from the information returned by the two events, you can deduce that the running cluster node can't make an RPC connection to the node being added because it can't find that node. After further investigation, you discover that the DNS server has an incorrect IP address for the node being added. After you correct the IP address, the node successfully joins the cluster.
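As a rough illustration of the workflow above, the sketch below pulls numeric status codes out of event descriptions and looks them up in a small table. The function names (`extract_status_codes`, `explain`) and the lookup table are hypothetical helpers written for this example; the table holds only the three codes that appear in this article and is no substitute for NET HELPMSG.

```python
import re

# Messages for the status codes that appear in this article only.
# This is a hand-built lookup (an assumption of this sketch),
# not a complete Windows error table.
KNOWN_STATUS = {
    87: "The parameter is incorrect.",
    1168: "Element not found.",
    8202: "The specified directory service attribute or value does not exist.",
}

# Matches forms such as "Error = 8202", "Error 8202", "status = 1168",
# "status is 1168", and "status 87". The leading \b keeps "rpc_error 0"
# from matching, because "_" counts as a word character.
STATUS_RE = re.compile(r"\b(?:error|status)\s*(?:=|is)?\s*(\d+)", re.IGNORECASE)


def extract_status_codes(description):
    """Return the distinct status codes in an event description, in order."""
    codes = []
    for match in STATUS_RE.finditer(description):
        code = int(match.group(1))
        if code not in codes:
            codes.append(code)
    return codes


def explain(description):
    """Pair each status code with its message, if this sketch knows it."""
    return [
        (code, KNOWN_STATUS.get(code, "unknown; run NET HELPMSG"))
        for code in extract_status_codes(description)
    ]


if __name__ == "__main__":
    event = "SCMStateNotify (837): Repost of NotifyServiceStatusChange failed for node 'NodeX': status = 1168"
    print(explain(event))
```

Running a whole exported event list through a helper like this can show at a glance which codes keep recurring before you look each one up.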

Introducing the New Tests in the Validate a Configuration Wizard

Another helpful troubleshooting tool that you can use is the Validate a Configuration Wizard in Failover Cluster Manager. Several new clustering tests have been added in Server 2012. All the new tests for Server 2012 clustering are in bold:

  • Hyper-V (available only if the Hyper-V Role is installed)
      o  List Hyper-V Virtual Machine Information
      o  List Information About Servers Running Hyper-V
      o  Validate Compatibility of Virtual Fibre Channel SANs for Hyper-V
      o  Validate Firewall Rules for Hyper-V Replica Are Enabled
      o  Validate Hyper-V Integration Services Version
      o  Validate Hyper-V Memory Resource Pool Compatibility
      o  Validate Hyper-V Network Resource Pool and Virtual Switch Compatibility
      o  Validate Hyper-V Processor Pool Compatibility
      o  Validate Hyper-V Role Installed
      o  Validate Hyper-V Storage Resource Pool Compatibility
      o  Validate Hyper-V Virtual Machine Network Configuration
      o  Validate Hyper-V Virtual Machine Storage Configuration
      o  Validate Matching Processor Manufacturers
      o  Validate Network Listeners Are Running
      o  Validate Replica Server Settings

  • Cluster Configuration (available only if a cluster is running)
      o  List Cluster Core Groups
      o  List Cluster Network Information
      o  List Cluster Resources
      o  List Cluster Volumes
      o  List Clustered Roles
      o  Validate Quorum Configuration
      o  Validate Resource Status
      o  Validate Service Principal Name
      o  Validate Volume Consistency

  • Inventory
      o  Storage
            -  List Fibre Channel Host Bus Adapters
            -  List iSCSI Host Bus Adapters
            -  List SAS Host Bus Adapters
      o  System
            -  List BIOS Information
            -  List Environment Variables
            -  List Memory Information
            -  List Operating System Information
            -  List Plug and Play Devices
            -  List Running Processes
            -  List Services Information
            -  List Software Updates
            -  List System Drivers
            -  List System Information
            -  List Unsigned Drivers

  • Network
      o  List Network Binding Order
      o  Validate Cluster Network Configuration
      o  Validate IP Configuration
      o  Validate Network Communications
      o  Validate Windows Firewall Configuration

  • Storage
      o  List Disks
      o  List Potential Cluster Disks
      o  Validate CSV Network Bindings
      o  Validate CSV Settings
      o  Validate Disk Access Latency
      o  Validate Disk Arbitration
      o  Validate Disk Failover
      o  Validate File System
      o  Validate Microsoft MPIO-Based Disks
      o  Validate Multiple Arbitration
      o  Validate SCSI device Vital Product Data (VPD)
      o  Validate SCSI-3 Persistent Reservation
      o  Validate Simultaneous Failover
      o  Validate Storage Spaces Persistent Reservation

  • System Configuration
      o  Validate Active Directory Configuration
      o  Validate All Drivers Signed
      o  Validate Memory Dump Settings
      o  Validate Operating System Edition
      o  Validate Operating System Installation Option
      o  Validate Operating System Version
      o  Validate Required Services
      o  Validate Same Processor Architecture
      o  Validate Service Pack Levels
      o  Validate Software Update Levels

Except for the Storage tests, all the tests can be run at any time because they aren't disruptive to the cluster.

Using the Validate a Configuration Wizard

Let's explore how to take advantage of the Validate a Configuration Wizard. Using the previous example of the problem related to adding a node, let's say that the DNS server had the proper IP address and you can connect between the nodes outside the cluster. In this case, you can run the Validate a Configuration Wizard.

When you run the wizard, the Network/Validate Windows Firewall Configuration test fails. This test specifically looks at the Windows Firewall settings to ensure that port 3343, which is used by the cluster, hasn't been disabled. When this port is disabled, all communications on that port are blocked and you get errors in the Diagnostic channel.

Introducing the New Get-ClusterLog Command Switch

The Windows PowerShell command Get-ClusterLog lets you convert the events in a channel (e.g., FailoverClustering/Diagnostic) to a text file format. PowerShell will name the text file Cluster.log and place it in the C:\Windows\Cluster\Reports folder. If you run the command by itself, each node will have its own Cluster.log file. You can use switches to change this default behavior. Here are the switches, including the new -UseLocalTime switch:

  • -Cluster <name>, where <name> is the name of the cluster you want to run the command against. This allows you to specify a remote cluster. If you omit the switch, it will run against the cluster you're currently on.

  • -Node <name>, where <name> is the name of the node you want to run the command against. You use this switch when you want to generate the Cluster.log file for a certain node only.

  • -Destination <folder>, where <folder> is the folder to which you want to copy the Cluster.log files. If you include this switch, PowerShell will not only create a Cluster.log in each node's C:\Windows\Cluster\Reports folder but also copy all of the log files to the specified destination folder. This switch will add the node's name as part of the filename (e.g., Node1_Cluster.log, Node2_Cluster.log) for the log files copied to the destination folder. This way, each node's log files are easily identifiable.

  • -TimeSpan <minutes>. You use this switch if you just want to get a log file that spans the last specified number of minutes, where <minutes> is that number (e.g., 5). This will give you a much smaller log file to review. You can use this switch if you're trying to reproduce an error. For example, you can reproduce the error you think might be occurring, then generate the log for the last 5 minutes to see if that's the case.

  • -UseLocalTime. As mentioned previously, this switch is new in Server 2012. Clusters write all their information in GMT. For example, if you have a cluster that's in the GMT-5 time zone and your local time is 13:00 (1:00 p.m.), Cluster.log will show a time of 18:00 (6:00 p.m.) by default. With this switch, the local time is used, so the log will show a time of 13:00. When you use the -UseLocalTime switch, the times returned by the Get-ClusterLog command can easily be matched with the Event Log times.
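If you already have a log that was generated without -UseLocalTime, the same conversion can be done after the fact. The following is a minimal sketch, assuming timestamps in the yyyy/mm/dd-hh:mm:ss.mmm form that Cluster.log uses and a fixed offset from GMT; the function name `log_time_to_local` is made up for this example, and daylight saving changes are not handled.

```python
from datetime import datetime, timedelta, timezone

# Timestamp layout assumed for Cluster.log entries, e.g. 2013/05/15-18:00:00.000
LOG_TIME_FORMAT = "%Y/%m/%d-%H:%M:%S.%f"


def log_time_to_local(stamp, utc_offset_hours):
    """Convert a GMT Cluster.log timestamp to a fixed-offset local time."""
    gmt_time = datetime.strptime(stamp, LOG_TIME_FORMAT).replace(tzinfo=timezone.utc)
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    return gmt_time.astimezone(local_zone)


if __name__ == "__main__":
    # The example from the text: 18:00 GMT seen from a GMT-5 time zone.
    local = log_time_to_local("2013/05/15-18:00:00.000", -5)
    print(local.strftime("%Y-%m-%d %H:%M"))
```

A fixed offset keeps the sketch short; for logs that span a daylight saving change, a real time zone database would be the safer choice.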

Now that you know how to get Cluster.log files, it's time to learn how to read and search through them.

Reading Cluster.log Files

Reading Cluster.log files takes a long time to master, because they contain a lot of information that can be confusing. However, I'll give you some tips that can help you get started.

The first thing you need to understand is the anatomy of a Cluster.log file. Every entry has the same basic structure. Here's an entry for an IP address resource coming online:

00000bb8.000001d4::2013/05/15-01:13:24.852 INFO [RES] IP Address :Online: Opened object handle for netinterface 353c85ee-7ea7-4b2a-927d-1538dffcdecd

Let's break this entry down into smaller pieces to make better sense of it:

00000bb8. This is the process ID in hexadecimal notation. Typically, the process is the Resource Hosting Subsystem (RHS). You can see what resources the process is using by sorting or searching for the lines that include this process ID. This is useful when debugging an RHS dump if you have multiple dump files present. Each of these dumps is identified by a process ID, so knowing what the process ID is will ensure that you're working with the correct process dump. If you have a complete memory dump, there will be multiple RHS processes. Each is identified by the ID, so you can get to the correct one.

000001d4. This is the thread ID in hex notation. You can see what the thread is doing by sorting or searching for lines that include this thread ID. When you're debugging an RHS process that has 100 threads, you can jump right to the correct one using this ID.

2013/05/15-01:13:24.852. This is the date and time in GMT (unless the -UseLocalTime switch was used to generate the log). So if you're using GMT-5, the local time in this case is May 14, 2013, at 8:13 p.m. The time goes down to milliseconds.

INFO. This is the level of the entry. The level can be INFO (informational), WARN (warning), ERR (error), or DBG (debug). There are a few others, but these levels are what you'll see the majority of the time. Generally, a line with ERR in it indicates a problem with a resource. When you open a Cluster.log file after a failure, you can search for a specific level to try to get to the problem area quicker.

[RES] IP Address. This is the resource type. A resource will always identify its type in the log. With this information, you can more quickly follow the resource going online when there are multiple types of resources all coming online at the same time.

The name that follows the resource type. This is the actual resource, as shown in Failover Cluster Manager.

Online: Opened object handle for netinterface 353c85ee-7ea7-4b2a-927d-1538dffcdecd. This is a description of what's going on with the resource. What's going on here is that the resource is opening a handle to the network card driver in order to bind the IP address to it. If it fails here, it's most likely a problem with the network card driver not accepting anything, which means it's bad. Alternatively, the network card might have died. Your next step would be to review the System event log entries to check for any network type events, such as the network going down or a card failing. With many of the descriptions, the more you see them, the more you'll understand what they mean. A description can be particularly helpful if it's describing the last action that occurred before a failure.
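The breakdown above maps naturally onto a small parser. The sketch below is one possible way to split an entry into its pieces, assuming every line follows the layout of the sample entry; the names `LogEntry` and `parse_entry` are hypothetical, and lines that don't fit the pattern are simply skipped.

```python
import re
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    process_id: int   # decimal value of the hex process ID
    thread_id: int    # decimal value of the hex thread ID
    timestamp: str    # left as text, e.g. 2013/05/15-01:13:24.852
    level: str        # INFO, WARN, ERR, DBG, ...
    text: str         # resource type, resource, and description


# Layout assumed from the sample entry: <pid>.<tid>::<date>-<time> <LEVEL> <text>
ENTRY_RE = re.compile(
    r"^(?P<pid>[0-9a-fA-F]{8})\.(?P<tid>[0-9a-fA-F]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s*"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<text>.*)$"
)


def parse_entry(line: str) -> Optional[LogEntry]:
    """Split one Cluster.log line into its parts, or return None if it doesn't fit."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    return LogEntry(
        process_id=int(match["pid"], 16),
        thread_id=int(match["tid"], 16),
        timestamp=match["ts"],
        level=match["level"],
        text=match["text"],
    )


if __name__ == "__main__":
    sample = (
        "00000bb8.000001d4::2013/05/15-01:13:24.852 INFO [RES] IP Address :Online: "
        "Opened object handle for netinterface 353c85ee-7ea7-4b2a-927d-1538dffcdecd"
    )
    entry = parse_entry(sample)
    print(entry.process_id, entry.thread_id, entry.level)
```

With entries parsed this way, filtering on `level == "ERR"` or on a single process or thread ID becomes a one-line list comprehension.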

Searching Cluster.log Files

When reviewing Cluster.log files, it helps to search for keywords. Table 1 provides a list of keywords that I use when searching for resources.

Table 1: Keywords to Use When Searching for Resources

  • -->OnlinePending

  • -->OfflinePending

  • -->Offline

  • -->Online

  • -->ProcessingFailure

Note that you should type these keywords exactly as you see them. In other words, include the two hyphens and the greater-than symbol (-->) and don't include any spaces.

You can also use these keywords to quickly determine how long a resource took to go offline or come online. For example, suppose that a group is taking longer than normal to come online. You can use -->OfflinePending as a starting point, then use -->Offline for all resources in the group. Alternatively, you can use -->OnlinePending followed by -->Online. For each resource, add up all the times to see how long it took to come online. After you've done that for all the resources, you can compare the resources' total times to see which resource took the longest amount of time. You can then review the entries in Cluster.log to determine why. For example, if a group took 30 seconds total to come online on one node and 3 minutes total to come online on another node, you should generate Cluster.log files for both nodes and compare them.
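The timing comparison described above can be scripted. The sketch below assumes that state changes appear as TransitionToState(resource) Old-->New lines, the form shown in the walkthrough later in this article, and that each line carries the usual timestamp prefix. The sample lines and the resource name FS Disk are invented for illustration, and `online_durations` is a hypothetical helper.

```python
import re
from datetime import datetime

# Timestamp prefix followed, somewhere later on the line, by a state transition.
TRANSITION_RE = re.compile(
    r"::(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3}).*?"
    r"TransitionToState\(\s*(?P<res>.+?)\s*\)\s*\w+-->(?P<new>\w+)"
)


def online_durations(lines):
    """Return seconds from -->OnlinePending to -->Online for each resource."""
    pending = {}
    durations = {}
    for line in lines:
        match = TRANSITION_RE.search(line)
        if match is None:
            continue
        when = datetime.strptime(match["ts"], "%Y/%m/%d-%H:%M:%S.%f")
        resource, state = match["res"], match["new"]
        if state == "OnlinePending":
            pending[resource] = when
        elif state == "Online" and resource in pending:
            durations[resource] = (when - pending.pop(resource)).total_seconds()
    return durations


if __name__ == "__main__":
    # Invented sample lines, for illustration only.
    log = [
        "00000bb8.000001d4::2013/05/15-01:13:24.100 INFO [RCM] TransitionToState(IP Address 1.1.1.1) Offline-->OnlinePending.",
        "00000bb8.000001d8::2013/05/15-01:13:24.200 INFO [RCM] TransitionToState(FS Disk) Offline-->OnlinePending.",
        "00000bb8.000001d4::2013/05/15-01:13:26.600 INFO [RCM] TransitionToState(IP Address 1.1.1.1) OnlinePending-->Online.",
        "00000bb8.000001d8::2013/05/15-01:13:54.200 INFO [RCM] TransitionToState(FS Disk) OnlinePending-->Online.",
    ]
    times = online_durations(log)
    slowest = max(times, key=times.get)
    print(times, slowest)
```

Running the same function over the Cluster.log files from two nodes gives per-resource numbers that can be compared side by side. Swapping the two state names for OfflinePending and Offline would measure the offline path instead.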

You can use the same keywords for groups, except that there must be a space after the greater-than symbol. For example, if a group goes offline, you would first use --> OfflinePending, followed by --> Offline. The only other difference between the resource entry and the group entry is that the group failure is --> Failed, whereas the resource failure is -->ProcessingFailure.

Putting It All Together

To see how all the information presented fits together, let's walk through solving a problem. Suppose that you have a two-node cluster configured with multiple file servers using different networks and a Fibre Channel-connected SAN. Here's the setup for the networks:

  • Cluster Network 1 = IP scheme 192.168.0.0/24

  • Cluster Network 2 = IP scheme 1.0.0.0/8

  • Cluster Network 3 = IP scheme 172.168.0.0/16

In the nodes' network connections, the network adapters are identified as:

  • CORPNET = IP scheme 192.168.0.0/24

  • MGMT = IP scheme 1.0.0.0/8

  • BACKUP = IP scheme 172.168.0.0/16

FILESERVER1 is using Cluster Network 1, which is running on NODE1. FILESERVER2 is using Cluster Network 2, which is running on NODE2.

Let's say that there was a failure overnight and a file server group named FILESERVER2 was moved from NODE2 to NODE1. You need to find out why the failure occurred.

The first thing you do is open Failover Cluster Manager, right-click the FILESERVER2 group, and select Show Critical Events. This brings up the following events:

Event ID: 1069
Description: Cluster Resource 'IP Address 1.1.1.1' of type 'IP Address' in Clustered Role 'FILESERVER' failed.

Event ID: 1205
Description: The Cluster service failed to bring clustered service or application 'FILESERVER2' completely online or offline. One or more resources may be in a failed state.

The first event tells you that IP Address 1.1.1.1 had a failure. So, you right-click this resource in Failover Cluster Manager and choose Show Critical Events. You see these events:

Event ID: 1077
Description: Health check for IP Interface 'IP Address 1.1.1.1' (address 1.1.1.1) failed (status is 1168). Run the Validate a Configuration wizard to ensure that the network adapter is functioning properly.

Event ID: 1069
Description: Cluster Resource 'IP Address 1.1.1.1' of type 'IP Address' in Clustered Role 'FILESERVER' failed.

Based on the description in the first event (event 1077), you decide to use the Validate a Configuration Wizard. You want to run only the Network/Validate Network Communication test because that test will check the adapters and all network paths between the nodes.

After you run the Network/Validate Network Communication test, you check the test report. You don't see any errors or warnings, so you put it aside.

There are event channels you can review, so you go into the FailoverClustering/Operational channel, where you see this event:

Event ID: 1153
Description: The Cluster service is attempting to failover the clustered service or application 'FILESERVER2' from node 'NODE2' to node 'NODE1'

Because of this description, you go into the FailoverClustering/Diagnostic channel, where you see these events:

Event ID: 2051
Description: [RCM] rcm::RcmResource::HandleFailure: (IP Address 1.1.1.1)

Event ID: 2051
Description: [RES] IP Address :Failed to query properties of adapter id F3EDD1C8-6984-82BC-498806B841CA, status 87.

Based on this information, you generate a Cluster.log file for this node. In the log, you search for -->ProcessingFailure and find these entries:

[RES] IP Address : IP Interface 3600A8C0 failed LooksAlive check, status 1168.
[RES] IP Address : IP Interface 3600A8C0 failed IsAlive check, status 1168.
[RHS] Resource IP Address 1.1.1.1 has indicated failure.
[RCM] Res IP Address 1.1.1.1: Online -> ProcessingFailure( State Unknown )
[RCM] TransitionToState( IP Address 1.1.1.1)Online-->ProcessingFailure.

A bit later in Cluster.log, you see the entries documenting that the group was being moved. This is a good indication that the entries found with the -->ProcessingFailure search are related to the problem that caused the group to be moved. Because of the errors seen in those entries, you know for sure that the IP address resource failed. To find out what the errors' status code means, you use the Net.exe command:

NET HELPMSG 1168

The command returns the message: Element not found. After looking more closely at the entries, it appears as though the actual problem might be with the network adapter. So, you run some hardware tests against the adapters and find that one adapter is faulty and not even showing up in Windows anymore. Replacing the faulty adapter is the course of action to fix the problem.

But there's still the question of why the Network/Validate Network Communication test results didn't show any errors when everything else did. This test checks all network adapters, going from one node to another, whether they're on the same network or not. It does this so that it knows all the routes it can take to get to the other nodes. So, there are some expected failures just because of the way the networks between the nodes are cabled or segmented.

You decide to look more closely at the test report. That's when you spot the output shown in Figure 2.

You notice that NODE1 doesn't have a network adapter defined as MGMT. This is basically saying the same thing as the events, which is that NODE1 has two networks and NODE2 has three networks. So, the lesson here is that you need to do more than just look at the errors or warnings at the top of the report. You also need to look at the test results.
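The comparison you just made by eye can also be expressed as a set difference. This is only a sketch: the adapter names per node are taken from the walkthrough (NODE1 lacking MGMT), and `missing_adapters` is a hypothetical helper, not something the validation report provides.

```python
def missing_adapters(adapters_by_node):
    """For each node, list adapter names that exist on some other node but not on it."""
    all_names = set().union(*adapters_by_node.values())
    gaps = {}
    for node, names in adapters_by_node.items():
        absent = sorted(all_names - set(names))
        if absent:
            gaps[node] = absent
    return gaps


if __name__ == "__main__":
    # Adapter names as described in the walkthrough; NODE1's list is inferred
    # from the statement that it has two networks and no MGMT adapter.
    nodes = {
        "NODE1": ["CORPNET", "BACKUP"],
        "NODE2": ["CORPNET", "MGMT", "BACKUP"],
    }
    print(missing_adapters(nodes))
```

A node that appears in the result is a candidate for a closer look, even when the validation report shows no errors or warnings at the top.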

Get to the Root of the Problem

Troubleshooting a cluster is like troubleshooting just about anything. There are different ways to troubleshoot and multiple things to look at in order to get to a problem's root cause. I presented one way to get to the root cause, and I hope you're able to use it when troubleshooting problems in your clusters. For more information pertaining to failover clustering, check out the Ask the Core Team blog site and the Clustering and High Availability blog site.
