Important SQL Server Alerts for Monitoring Incidents and Performance

SQLServerF1

It is important for SQL Server DBA to configure alerting for the SQL Server instances so that an E-Mail or a alert is sent to the DBA team in case of any problems which are critical or warnings. Handling and fixing the warnings will avoid any critical events from happening. It is not possible to monitor all the servers and instances 24/7, so there should be either a custom solution or third party monitoring tool to detect and send alerts when certain condition is met.

When preparing a custom solution or enabling alerts on 3rd party tools, we need to identify the list of important alerts which are to be monitored and how frequent these checks are to be made and the threshold on which the alerts are to be sent to the DBA team. Also, need to understand which of these are critical alerts and which can be warning level alerts.

Below are some of the common and important alerts with their description, Severity and what should be the general threshold for these alerts to be chekecd on.

Critical Alerts – These alerts generally cause the system to go down or can cause an outage. These alerts should be identified and resolved quickly and in case the issue is not resolved, this should keep sending the alerts. Also there should be option to temporarily snooze or turnoff the alerts which DBA team is working on them and they expect that it may take some time to fix the issue, this will avoid unnecessary alerts to be sent continuously.

Below are some of the Critical Alerts which need to be monitored and resolved immediately.

SQL Server Service State – This check should occur every 5 minutes and should check weather the SQL Server service is running or stopped. This should check for all the SQL Server services and SQL Server Agent services, etc

SQL Server Connectivity – This check should occur every 5 minutes to check if connections can be made to the SQL Server instance.

SQL Server Cluster Failover Event – On cluster servers, we need to have check which can detect any failover events with the SQL Server cluster resources and also should provide the state of SQL Server cluster resources. This check should occur every 5 minutes.

SQL Server Errorlog Scanner – SQL Server Errorlog consists of various errors, warning messages and informational messages. As this contains a mix of critical, warning and informational messages, we need to have this check run every 5 minutes. There should be customization allowed to list the messages which can be ignored, so that the alert for informational and some warning messages can be suppressed and alerts need not be sent for those.

CPU Check – This is a windows level check which need to identify the CPU usage on the server and raise alert if the CPU condition is consistently high like above 90%. This check is to be made every 5 minutes.

Memory Check – This is a windows memory check to identify if there is any low memory condition on the server which can affect all the services running on that server. This check needs to be run every 5 minutes.

Low disk space – This check needs to identify the free disk space going down to very low like less than 10% free disk space. This can soon cause an outage as there will not be any more room for the database files too grow on the disk.

Blocking – Blocking in SQL Server is considered critical as it can cause disruption to the application and the users will face issues. This needs to be checked every 5 minutes and to be alerted. There should be mechanism to snooze the alert which DBA is looking into the issue.

Warning Alerts – These alerts generally do not cause any outage or no immediate action is required, but is important to troubleshoot and fix it as soon as possible to avoid it from escalating to an critical alert. The threshold for these alerts can be little relaxed compared to critical alerts.

Below are some of the Warning Alerts which need to be monitored and resolved in a timely fashion to avoid escalating in to Critical alerts.

Disk space below 20% Threshold – This should check of the free disk space on any drive falls below 20% free, then alert needs to be sent to DBA team. DBA can check and identify the cause of low disk space and try to resolve immediately and it can go further low and can raise a critical alert or can cause an outage if there is no more free space left.

Backup Checks – It is important for the DBAs to make sure there are regular backups of all the databases. This check can be made daily once or twice to check of the Full and Log backups are happening regularly. Also, separate check can be used to just check the log backups and the check can be scheduled to run more frequently on important systems.

Job Failures – This check should check for any failures with any jobs. This check can be configured to check every 5 minutes. Depending on the importance of the job, DBAs can look into it in case of any job failure.

Long Running Queries – This check should send list of long running queries which are taking more time like more than 15 seconds or any other threshold based on the baseline.

Suspended Sessions – Although it is not common, but some environments can have too many suspended sessions thus causing exhaustion of all the available worker threads thus leading to an outage. So, it would be good to check every 15 minutes to see if there are too many suspended sessions.

Errors or Warnings from EventsLogs – There can be some critical or warning events that may have been raised in the windows Application or System eventlogs, so it is important to send alerts for such errors. This can be configured to send the list of errors or warnings every day once or few times in a day.

There can be many other alerts that can be configured and may be specific to each environment, but above should be monitored on all the server running SQL Server instances.

This is applicable on below versions of SQL Server

SQL Server 2005
SQL Server 2008 R2
SQL Server 2012
SQL Server 2014

Hope this was helpful.

Thanks,
SQLServerF1 Team
In-Depth Blogs on SQL Server, Information about SQL Server Conferences and Events, SQL Server Frequently asked questions, SQL Server Trainings.

 

Authentication Mode Changed After Applying Service Pack 4 for SQL Server 2005

I recently worked on an issue wherein after applying Service Pack 4(SP4) for SQL Server 2005 on a cluster, we were unable to connect to the SQL Server instance using SQL Server authentication.

To explain it better, let’s say, there are 2 nodes in a cluster with names nodes Node1 & Node2, SQL Server 2005 instance was active on Node1. SP4 was run from Node1 which first applies the SP4 on passive node and then applies on active node. From Management studio both windows and SQL authentication works fine on Node1, but once the SQL instance is failed over to Node2 it comes online without any issue, but when we try to connect to the SQL instance using management studio using SQL Server authentication, it gives login failed error message.

Based on our initial research we found that the connectivity issue occurred because the Authentication Mode of the SQL instance was changed to WINDOWS ONLY on passive node whereas on active node the authentication mode was still MIXED. We changed the following registry key value to 2 and now both the windows and SQL authentication works fine.

SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.INST1\MSSQLServer\LoginMode

Now, the million dollar question was why the authentication mode was changed to WINDOWS ONLY on Node2?

Based on our research, we found that in the SQL Server 2005 setup code when patch/reinstall scenario runs on secondary cluster node the authentication mode is always set to NT authentication (WINDOWS ONLY).
This registry key which has LoginMode is part of SQL Server cluster registry checkpoint. Once the failover happens the login mode on the secondary node is updated automatically from the primary node to the correct value. There is no manual work required in this process and is transparent to the user. The following are the Engine registry sub-hives that are checkpointed under the InstanceID registry hive in SQL Server 2005:

“Cluster” , “MSSQLServer” , “Replication” , “SQLServerAgent” , “PROVIDERS”, “SQLServerSCP” , “CPE”

In my clustered environment, I found that only CPE has checkpoint, all other registry hives do not have checkpoint. That means any change applied to any registry key value which belong to registry hives other than CPE will not get updated automatically from primary node to the correct value. And LoginMode registry key is one such key which belong to MSSQLServer which do not have checkpoint on it. Hence, during SP4 installation once the value of this key is changed to 1, it will remain one until it is changed manually to 2. To resolve this issue, we have to make sure that the above registry hives should have checkpoint. KB article http://support.microsoft.com/kb/953504 will help you in adding the checkpoint to the registry hives and the precautions you need to take in order to execute the steps. However, I have mentioned the steps below to add checkpoint to the above registry hives:

Run the following commands using Command Prompt:

Imp: Beware that change to checkpointed registry keys are saved only when the associated resources are online (In this case, SQL Server). When the resources are not online, changes to the checkpointed keys will be overwritten by the last saved values from the checkpointed file.
If the instance is default, please replace the InstanceName with MSSQLSERVER

cluster res "SQL Server ()" /addcheck: "Software\Microsoft\Microsoft SQL Server\MSSQL.1\Cluster"
cluster res "SQL Server ()" /addcheck: "Software\Microsoft\Microsoft SQL Server\MSSQL.1\MSSQLSERVER"
cluster res "SQL Server ()" /addcheck: "Software\Microsoft\Microsoft SQL Server\MSSQL.1\Replication"
cluster res "SQL Server ()" /addcheck: "Software\Microsoft\Microsoft SQL Server\MSSQL.1\SQLServerAgent"
cluster res "SQL Server ()" /addcheck: "Software\Microsoft\Microsoft SQL Server\MSSQL.1\PROVIDERS"

Once these changes are performed, try failover failback.

Happy Reading,
SQLServerF1 Team