Getting better alerts

When I work with MSPs, I almost always find that they apply the "Agent Offline" monitor to servers with a 5-minute (or shorter) threshold. They also tell us that false alarms are one of their biggest challenges. Changing how server availability is monitored is one of the easiest ways to reduce false alerts.

The first thing to recognize is that the monitor for Agent Check-In has nothing to do with whether a server is functioning. The agent is simply a service-based application on the server. Its job is to communicate with the VSA on a regular schedule and determine whether it should perform tasks or deliver information. This is considered a low-priority service, and - by design - the agent will relinquish resources when the server is under heavy load. That can prevent the agent software from checking in with VSA for several minutes. When this triggers an alarm, it clearly isn't because the server has crashed or become otherwise unavailable. It's just busy. A typical example: servers that start running backups shortly after midnight trigger a rash of agent offline alerts, waking up the on-call team member, who finds everything working just fine.

So - how can this be improved? The first step is to set the agent check-in threshold to a longer period, so that the alarm identifies a real problem with the agent software rather than a busy server. Our default is one hour.

Next, use an out-of-band monitor to check server health. Kaseya Network Monitor is built into VSA and can handle the job nicely. Don't use ping, because a smart NIC will reply even if the OS has crashed. We check for the Server service instead. That service typically must be running for the server to function, but more importantly, the operating system has to be functional for the check to get a response at all. We set the time on this monitor long enough for the server to reboot without triggering an alarm, but short enough that a failure will be detected and reported in a timely manner. The default alarm time we use is 15 minutes for most servers, and 7 minutes for critical systems.
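To make the timing rule concrete, here is a minimal Python sketch of the "failed continuously for N minutes" logic. It is not Kaseya Network Monitor configuration or its API. The function name, the tier names, and the shape of the check results are assumptions made for illustration, and only the 15 and 7 minute thresholds come from the text above.

```python
from datetime import datetime, timedelta

# Thresholds from the text; the tier names are hypothetical.
ALARM_AFTER = {
    "standard": timedelta(minutes=15),
    "critical": timedelta(minutes=7),
}

def should_alarm(results, now, tier="standard"):
    """Return True when the service check has failed continuously
    for at least the tier's threshold.

    results: list of (timestamp, ok) tuples, oldest first.
    """
    failing_since = None
    for ts, ok in results:
        if ok:
            failing_since = None   # any success ends the outage window
        elif failing_since is None:
            failing_since = ts     # first failure of the current streak
    if failing_since is None:
        return False
    return now - failing_since >= ALARM_AFTER[tier]
```

A server that reboots and answers again within the window never raises an alarm, because the first successful check resets the failure streak. Only an outage that outlasts the threshold is reported.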

We then add a set of monitors that report when the server has rebooted during business hours or booted into a system recovery mode. These are Smart Monitors that run at startup and check for these specific conditions. Alarms for business-hours reboots fire immediately, while recovery mode detection is delayed for a brief period so an engineer can boot, perform a quick recovery operation, and reboot again.
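The decision those startup monitors make can be sketched in a few lines of Python. This is an illustration only, not the Smart Monitor implementation. The business-hours window (Monday to Friday, 08:00 to 18:00) and the 20-minute recovery grace period are assumed values, and all names are hypothetical.

```python
from datetime import datetime, time, timedelta

# Assumed values for illustration; the text does not specify them.
BUSINESS_START = time(8, 0)
BUSINESS_END = time(18, 0)
RECOVERY_GRACE = timedelta(minutes=20)

def in_business_hours(ts):
    """True for Monday-Friday between BUSINESS_START and BUSINESS_END."""
    return ts.weekday() < 5 and BUSINESS_START <= ts.time() < BUSINESS_END

def alarm_time(boot_ts, recovery_mode):
    """Return when an alarm should fire for this boot, or None for no alarm."""
    if recovery_mode:
        # Delay so an engineer can finish a quick recovery and reboot.
        return boot_ts + RECOVERY_GRACE
    if in_business_hours(boot_ts):
        return boot_ts             # business-hours reboot fires immediately
    return None                    # after-hours normal boot raises nothing
```

In this sketch the caller is expected to cancel a pending recovery-mode alarm if the server boots normally before the grace period ends.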

Finally, we configure the agent to generate alarms for certain crash conditions. This might result in two alarms for the same event - one for the work-day reboot and one for the reason why.

With this method, our customers get better information, faster and more appropriate alerting, and virtually zero false alarms for server-down conditions.
