It is important for SQL Server DBA to configure alerting for the SQL Server instances so that an E-Mail or a alert is sent to the DBA team in case of any problems which are critical or warnings. Handling and fixing the warnings will avoid any critical events from happening. It is not possible to monitor all the servers and instances 24/7, so there should be either a custom solution or third party monitoring tool to detect and send alerts when certain condition is met.
When preparing a custom solution or enabling alerts on 3rd party tools, we need to identify the list of important alerts which are to be monitored and how frequent these checks are to be made and the threshold on which the alerts are to be sent to the DBA team. Also, need to understand which of these are critical alerts and which can be warning level alerts.
Below are some of the common and important alerts with their description, Severity and what should be the general threshold for these alerts to be chekecd on.
Critical Alerts – These alerts generally cause the system to go down or can cause an outage. These alerts should be identified and resolved quickly and in case the issue is not resolved, this should keep sending the alerts. Also there should be option to temporarily snooze or turnoff the alerts which DBA team is working on them and they expect that it may take some time to fix the issue, this will avoid unnecessary alerts to be sent continuously.
Below are some of the Critical Alerts which need to be monitored and resolved immediately.
SQL Server Service State – This check should occur every 5 minutes and should check weather the SQL Server service is running or stopped. This should check for all the SQL Server services and SQL Server Agent services, etc
SQL Server Connectivity – This check should occur every 5 minutes to check if connections can be made to the SQL Server instance.
SQL Server Cluster Failover Event – On cluster servers, we need to have check which can detect any failover events with the SQL Server cluster resources and also should provide the state of SQL Server cluster resources. This check should occur every 5 minutes.
SQL Server Errorlog Scanner – SQL Server Errorlog consists of various errors, warning messages and informational messages. As this contains a mix of critical, warning and informational messages, we need to have this check run every 5 minutes. There should be customization allowed to list the messages which can be ignored, so that the alert for informational and some warning messages can be suppressed and alerts need not be sent for those.
CPU Check – This is a windows level check which need to identify the CPU usage on the server and raise alert if the CPU condition is consistently high like above 90%. This check is to be made every 5 minutes.
Memory Check – This is a windows memory check to identify if there is any low memory condition on the server which can affect all the services running on that server. This check needs to be run every 5 minutes.
Low disk space – This check needs to identify the free disk space going down to very low like less than 10% free disk space. This can soon cause an outage as there will not be any more room for the database files too grow on the disk.
Blocking – Blocking in SQL Server is considered critical as it can cause disruption to the application and the users will face issues. This needs to be checked every 5 minutes and to be alerted. There should be mechanism to snooze the alert which DBA is looking into the issue.
Warning Alerts – These alerts generally do not cause any outage or no immediate action is required, but is important to troubleshoot and fix it as soon as possible to avoid it from escalating to an critical alert. The threshold for these alerts can be little relaxed compared to critical alerts.
Below are some of the Warning Alerts which need to be monitored and resolved in a timely fashion to avoid escalating in to Critical alerts.
Disk space below 20% Threshold – This should check of the free disk space on any drive falls below 20% free, then alert needs to be sent to DBA team. DBA can check and identify the cause of low disk space and try to resolve immediately and it can go further low and can raise a critical alert or can cause an outage if there is no more free space left.
Backup Checks – It is important for the DBAs to make sure there are regular backups of all the databases. This check can be made daily once or twice to check of the Full and Log backups are happening regularly. Also, separate check can be used to just check the log backups and the check can be scheduled to run more frequently on important systems.
Job Failures – This check should check for any failures with any jobs. This check can be configured to check every 5 minutes. Depending on the importance of the job, DBAs can look into it in case of any job failure.
Long Running Queries – This check should send list of long running queries which are taking more time like more than 15 seconds or any other threshold based on the baseline.
Suspended Sessions – Although it is not common, but some environments can have too many suspended sessions thus causing exhaustion of all the available worker threads thus leading to an outage. So, it would be good to check every 15 minutes to see if there are too many suspended sessions.
Errors or Warnings from EventsLogs – There can be some critical or warning events that may have been raised in the windows Application or System eventlogs, so it is important to send alerts for such errors. This can be configured to send the list of errors or warnings every day once or few times in a day.
There can be many other alerts that can be configured and may be specific to each environment, but above should be monitored on all the server running SQL Server instances.
This is applicable on below versions of SQL Server
SQL Server 2005
SQL Server 2008 R2
SQL Server 2012
SQL Server 2014
Hope this was helpful.
In-Depth Blogs on SQL Server, Information about SQL Server Conferences and Events, SQL Server Frequently asked questions, SQL Server Trainings.