AlwaysOn Availability Group Does Not Fail Over to Second Node Upon Stopping SQL Server Services


When SQL Server AlwaysOn Availability Groups are set up in SQL Server 2012 or SQL Server 2014, DBAs typically test them in a couple of ways. One approach is to shut down one node and verify that the Availability Group fails over to the second node. Another is to manually stop the SQL Server services and expect the Availability Group to fail over to the second node. Sometimes, however, the failover does not occur, and the AlwaysOn AG and IP address resources go into a failed state instead. Manually failing over the AlwaysOn AG to the second node does succeed, bringing the resources and the Availability Group online and making them accessible.

This issue generally occurs when failover has been attempted several times and the attempts exceed the maximum number of failures allowed for the availability group within a given time period. It is not specific to AlwaysOn; it can also happen with SQL Server Failover Cluster Instances.

The default value for the maximum number of failures during this period is n-1, where n is the number of WSFC nodes, and the default time period is six hours. If an availability group exceeds its WSFC failure threshold, the WSFC cluster will not attempt an automatic failover for the availability group. Furthermore, the WSFC resource group of the availability group remains in a failed state until either the cluster administrator manually brings the failed resource group online or the database administrator performs a manual failover of the availability group. Ref
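To make the n-1 rule concrete, here is a minimal Python sketch of the default threshold described above. The function names are hypothetical and exist only for illustration; this is a simplified model of the rule as stated, not WSFC's actual implementation.

```python
DEFAULT_PERIOD_HOURS = 6  # default time period stated above


def default_max_failures(node_count: int) -> int:
    """Default failure threshold: n - 1, where n is the number of WSFC nodes."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    return node_count - 1


def automatic_failover_attempted(failures_in_period: int, node_count: int) -> bool:
    """Return True if the cluster would still try an automatic failover.

    failures_in_period counts every failure in the current period,
    including the one being handled right now (an assumption of this model).
    """
    return failures_in_period <= default_max_failures(node_count)


# Two-node cluster: the threshold is 1, so only the first failure
# in the period leads to an automatic failover attempt.
first_failure = automatic_failover_attempted(1, node_count=2)
second_failure = automatic_failover_attempted(2, node_count=2)
```

Under this model, a two-node cluster handles the first failure automatically and leaves the group in a failed state on the second, which matches the behaviour seen during repeated failover tests.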

With the default values on a two-node cluster, automatic failover happens only once in six hours. So, the solution to this problem is to change the failover threshold values for the given availability group. We can use the Failover Cluster Manager console to increase “Maximum Failures in the specified period” to a higher value and reduce the “Period (Hours)”, based on our requirements. For example, we can set “Maximum Failures in the specified period” to 10 and “Period (Hours)” to 1, which means that within one hour the cluster will attempt to fail over the AlwaysOn Availability Group up to 10 times. If it fails 10 times, the group is left in a failed state, and the DBA has to investigate the cause of the failure and fix it, after which the group can be brought online manually. Set these values according to your requirements; they may vary between environments.
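The effect of changing these two settings can be illustrated with a small Python simulation. This is a simplified sketch under stated assumptions: the function name is hypothetical, failures are given as hours since the start of testing in ascending order, and the period is treated as a sliding window ending at each failure, which may not match how WSFC tracks failures internally.

```python
from typing import List


def simulate_failovers(
    failure_hours: List[float], max_failures: int, period_hours: float
) -> List[bool]:
    """For each failure, report whether an automatic failover would be attempted.

    failure_hours must be sorted ascending. A failure is handled automatically
    while the number of failures inside the window (including the current one)
    does not exceed max_failures.
    """
    results = []
    for i, current in enumerate(failure_hours):
        # Failures that fall inside the window ending at the current failure.
        recent = [h for h in failure_hours[: i + 1] if current - h < period_hours]
        results.append(len(recent) <= max_failures)
    return results


# Three test failovers within one hour of each other.
test_failures = [0.0, 0.5, 1.0]

# Default settings on a two-node cluster: 1 failure per 6 hours.
default_settings = simulate_failovers(test_failures, max_failures=1, period_hours=6)

# Tuned settings from the example above: 10 failures per 1 hour.
tuned_settings = simulate_failovers(test_failures, max_failures=10, period_hours=1)
```

With the default settings only the first test triggers an automatic failover in this model, while the tuned settings allow all three.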

This applies to the following versions of SQL Server:

SQL Server 2012
SQL Server 2014

Hope this was helpful.

SQLServerF1 Team