If you’ve worked in IT for more than 10 minutes, you know that things go wrong. In fact, there’s no end of job security in IT precisely because things go wrong. But with the advent of IT monitoring and automation (building systems that automatically mind the shop, raise a flag when things start to go south, and tell you what happened and when so you can fix it), the future seems a little brighter.
After over a decade implementing monitoring systems at organizations large and small, I’ve become all too familiar with what might be called monitoring grief. This is what often occurs when you are tasked with monitoring something, anything, and the people making the request ask you to do things you know are going to cause problems. It involves a series of behaviors I’ve grouped into five stages. Get it? The five stages of (IT monitoring) grief?
While agencies often go through these stages when rolling out monitoring for the first time, they can also occur when a group or department starts to seriously implement an existing solution, when new capabilities are added to a current monitoring suite, or simply when it’s Tuesday.
Spoiler alert: If you’re at all familiar with the standard Kübler-Ross model of the five stages of grief, acceptance is not on this list.
Stage One: Monitor Everything
This is the initial monitoring non-decision, a response to the simple and innocent question, “What do I need to monitor?” The favorite choice of managers and teams who won’t actually get the ticket is to simply open the fire hose wide and ask you to monitor “everything.” This choice is also frequently made by admins with a “hair-on-fire” problem in progress. It assumes that all information is good information and that the monitoring can be “tuned up” later. I guess everyone is in denial that there’s about to be an alert storm.
Stage Two: The Prozac Moment
This stage follows closely on the heels of the first, when the recipient of 734 monitoring alert emails in five minutes comes to you and exclaims, “All these things can’t possibly be going wrong!” While this may be correct in principle, it ignores the fact that a computer only defines “going wrong” as specifically as the humans who requested the monitors in the first place. So, you ratchet things down to reasonable levels, but “too much” is still showing red and the reaction remains the same.
Worse, because the team is overloaded, they get angry and feel that monitoring must be wrong again. Except this time it isn’t wrong. It’s catching all the stuff that’s been going up and down for weeks, months, or years, but which nobody noticed. Either the failures self-corrected quickly enough, users never complained, or someone somewhere was jumping in and fixing it before anybody knew about it.
It’s at this moment you wish you could give the system owner Prozac so they will chill out and realize that knowing about outages is the first step to avoiding them in the future.
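To make the point about a computer defining “going wrong” only as specifically as the humans who asked for the monitor, here is a minimal Python sketch. It is not taken from any particular monitoring product; the function names, the 80 percent threshold, and the sample readings are all made up for illustration. It compares a vague rule (alert on every reading over the line) with a more specific one (alert only once the reading has stayed over the line for several checks in a row).

```python
def naive_alerts(samples, threshold):
    """Vague rule: one alert for every single reading above the threshold."""
    return sum(1 for s in samples if s > threshold)


def sustained_alerts(samples, threshold, min_consecutive):
    """More specific rule: one alert per run of at least `min_consecutive`
    readings in a row above the threshold."""
    alerts = 0
    run = 0  # length of the current streak of readings above the threshold
    for s in samples:
        if s > threshold:
            run += 1
            if run == min_consecutive:
                alerts += 1  # fire once when the streak becomes long enough
        else:
            run = 0
    return alerts


# Hypothetical CPU readings (percent), one per polling interval.
cpu_samples = [85, 40, 90, 30, 88, 91, 95, 97, 50, 82]

naive_count = naive_alerts(cpu_samples, threshold=80)
sustained_count = sustained_alerts(cpu_samples, threshold=80, min_consecutive=3)
print(naive_count, sustained_count)
```

With these made-up readings, the vague rule fires seven times and the specific rule fires once. Neither one is “wrong”; each does exactly what it was told to do.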
Stage Three: Painting the Roses Green
The next stage occurs when too many things are still showing as “down” and no amount of tweaking is making them show “up” because, ahem, they are down.
In a fit of stubborn pride, the system owner often admits something like: “They’re not down-down, they’re just, you know, a little down-ish right now.” And so they demand that you do whatever it takes to show the systems as up/good/green. This behavior characterizes the bargaining stage.
And I mean they’ll ask you to do anything, from changing alert thresholds to impossible levels (“Only alert if it’s been down for 30 hours. No, make that a full week.”) to disabling alerts entirely. I can understand the pressure to adjust reporting to senior management, but let’s not defeat the purpose of monitoring, especially on critical systems.
What makes this stage even more embarrassing for all concerned is that the work involved is often greater than the work to actually fix the issue.
Stage Four: An Inconvenient Truth
If issues are suppressed, sometimes for weeks or months, they will eventually reach a point where there’s a critical error that can’t be glossed over. At that point, you and the system owner find yourselves on a service restoration team phone call with about a dozen other engineers and a few IT directors, where everything is analyzed, checked, and restarted in real time.
This is about the time someone asks to see the performance data for the system — the one that’s been down for a month and a half, but hasn’t shown up on reports. For a system owner who has been avoiding dealing with the real issues, there is nowhere left to run or hide.
Stage Five: Finding the Right Balance
Assuming the system owner has managed through stage four with his or her job intact, stage five involves trying to get it right. Agencies need to make the investment to get their alerting thresholds set correctly, and vary them based on the criticality of the systems. There’s also a lot that smart tools can do to correlate alerts, and reduce the number of alerts the IT team has to manage. And you’ll just have to migrate some of your unreliable systems and fix the issues that are causing network or systems management problems, as time and budget allow.
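As a rough illustration of the two ideas in this stage, thresholds that vary with criticality and correlation that cuts down the alert count, here is a small Python sketch. Everything in it is an assumption made for the example: the tier names, the number of seconds each tier tolerates, the `upstream` field, and the device names. Real monitoring tools do this with far more context.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical policy: how long a system may be down before anyone is alerted,
# by criticality tier. The numbers are placeholders, not recommendations.
DOWN_SECONDS_BEFORE_ALERT = {"critical": 60, "standard": 300, "low": 1800}


@dataclass
class Failure:
    system: str        # name of the failing system
    tier: str          # one of the keys in DOWN_SECONDS_BEFORE_ALERT
    upstream: str      # device this system depends on (hypothetical field)
    down_seconds: int  # how long the check has been failing


def should_alert(failure):
    """Alert only once the outage exceeds the tolerance for the system's tier."""
    return failure.down_seconds >= DOWN_SECONDS_BEFORE_ALERT[failure.tier]


def correlate(failures):
    """Group alert-worthy failures by shared upstream device, so the team
    gets one notification per likely cause instead of one per system."""
    groups = defaultdict(list)
    for failure in failures:
        if should_alert(failure):
            groups[failure.upstream].append(failure.system)
    return dict(groups)


# Made-up failures for the example.
failures = [
    Failure("web01", "critical", "switch-a", 90),
    Failure("web02", "critical", "switch-a", 120),
    Failure("test-db", "low", "switch-b", 600),
    Failure("mail01", "standard", "switch-b", 400),
]

notifications = correlate(failures)
print(notifications)
```

In this example, four failing checks turn into two notifications: one covering both web servers behind `switch-a`, and one for `mail01`. The low-priority test database stays quiet because it has not been down long enough for its tier.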
And what of the system owners who started off by demanding that you “monitor everything”? Don’t worry, they’ll be back after the next system outage to give you more grief.
Looking for good resources to thwart the stages of IT monitoring grief? Here’s the 101 from the SolarWinds Lab team: