Monitoring Headaches?

When infrastructure monitoring works, the business rarely notices: the monitoring just works, and the process around it just works…

But when faults arise and affect the business, all hell can break loose…

Over the years I have found that many companies don’t have a clear view of what they monitor in their IT estate, how they monitor it (IT?), or whether they monitor everything they should be monitoring in the first place…

This can create a headache… just one of those niggling little aches we all ignore from time to time and hope will just go away…

Or they might have multiple tools, multiple teams, multiple managers – all putting their multiple opinions and requirements into the monitoring melting pot…

And this can create a real headache that won’t go away without intervention…

The pain can be alleviated though of course… with the right diagnosis… and the right treatment…

And it’s not always the monitoring system itself that’s to blame…

Common symptoms I often see include:

No alerts were received when the fault occurred…
Alerts were received and weren’t actioned properly…
Alerts were simply ignored…
Maybe there were too many alerts and operators were overwhelmed – they “couldn’t see the wood for the trees”…
Maybe they just didn’t know what to do with the alerts…
The tools couldn’t monitor what was required effectively…
Maybe nobody asked for that fault condition to be monitored in the first place…
Maybe the processes & procedures failed…

Sometimes it is one or two of these symptoms; often it is many, or even all of them…

And that can cause a real migraine

Many organisations of course evolve and end up with multiple disparate or overlapping systems, and only a few I have spoken to actually realise that the problem is often not the tools at all; it is the monitoring function itself that is the true cause of the problems they are facing…

Blaming one particular tool and simply “buying more” doesn’t fix the underlying issues

Some organisations add to their toolsets and then don’t (or can’t or won’t) consolidate the remaining tools, often leaving them with a bigger headache than before…

It’s like putting a band-aid on when you have a splitting headache

Diagnosis is critical

Correct treatment is essential

Solving the underlying issues and establishing firm requirements is often what is required here:

Assess the current tools, monitoring, people, processes & procedures…
Identify quick-wins to relieve the immediate pressure…
Identify monitoring gaps and plug them as fast as possible, where possible…
Define the monitoring requirements…
Determine if the current environment can meet the requirements… and plan accordingly…
Refine the current monitoring environment to meet the requirements and re-identify any gaps…
Plan a consolidation and/or migration strategy if required…
Ensure monitoring is defined, implemented, documented, supported and trusted by all…
Ensure tools, people, processes and procedures support the monitoring function at all times…
Re-assess continually
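
As a rough illustration of the "identify monitoring gaps" step above, here is a minimal sketch that compares what the business requires to be monitored against what the current tools actually check. The system names, check names and the find_gaps helper are all made up for the example; in practice both inventories would come from your own requirements and tool exports.

```python
# Hypothetical inventory: what the business REQUIRES to be monitored.
required = {
    "web-01": {"cpu", "disk", "http"},
    "db-01": {"cpu", "disk", "replication"},
    "queue-01": {"depth"},
}

# Hypothetical export of what the current tools ACTUALLY check.
monitored = {
    "web-01": {"cpu", "disk"},
    "db-01": {"cpu", "disk", "replication"},
}


def find_gaps(required, monitored):
    """Return {system: missing checks} for every system with a gap."""
    gaps = {}
    for system, checks in required.items():
        # A system absent from the tool export has no checks at all.
        missing = checks - monitored.get(system, set())
        if missing:
            gaps[system] = missing
    return gaps


gaps = find_gaps(required, monitored)
for system, missing in sorted(gaps.items()):
    print(f"{system}: missing {', '.join(sorted(missing))}")
```

Even something this simple makes the point: the gap list is only as good as the requirements it is compared against, which is why defining the requirements matters as much as the tooling.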

“Oh but we’re all in the cloud now!” I sometimes hear…

This is another issue I have seen – a reliance on cloud or cloud-based tools…

Just because it’s “in the cloud” doesn’t mean it can’t fail…

Just because a cloud provider might keep your server running in the cloud doesn’t mean it’s monitored how you need it monitored…

Even the monitoring capabilities supplied by a cloud provider might not be sufficient for the needs of your business…

You might have additional requirements, additional “things” to monitor…

Things that have been missed by vendors, bespoke requirements, additional monitoring you would like in place to prevent outages, prevent chaos, avoid headaches…

I’ve also never been a fan of “top-down” views in isolation… something green on a dashboard is useless unless its meaning is understood and complete…

I believe a 2-way approach is needed – top-down and bottom-up
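
One way to picture the two-way approach is a dashboard status that is built bottom-up and refuses to show green for anything it cannot actually see. The sketch below is only an illustration under my own assumptions: the Component class, the rollup function and the example service are hypothetical, not taken from any particular tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Component:
    name: str
    # True = healthy, False = faulty, None = nothing is monitoring it.
    healthy: Optional[bool] = None
    children: List["Component"] = field(default_factory=list)


def rollup(component: Component) -> str:
    """Return 'green', 'red' or 'unknown' for a component and everything under it."""
    statuses = [rollup(child) for child in component.children]
    if component.healthy is not None:
        statuses.append("green" if component.healthy else "red")
    elif not component.children:
        # An unmonitored leaf must never be counted as healthy.
        statuses.append("unknown")
    if "red" in statuses:
        return "red"
    if "unknown" in statuses:
        return "unknown"
    return "green"


service = Component("online-shop", children=[
    Component("web", healthy=True),
    Component("database", healthy=True),
    Component("payment-gateway"),  # nobody monitors this one
])

print(rollup(service))  # unknown
```

The design choice that matters here is treating "not monitored" as its own state rather than quietly defaulting it to green.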

As I remember saying to a CTO years ago (back in 2006)…

“There’s no point in having a nice shiny dashboard with green traffic lights on it if those indicators are not a true and complete reflection of all underlying connected systems”

This is still true today, some 16 years later

Monitoring must be in place, complete, tested, and trusted 100% by all

Yes – 100%

If it’s not, then it’s not a matter of IF you will get a headache, but WHEN

Rip off the band-aid and diagnose the root of the problem now, not when things fail

Avoiding pain is far easier than pain management in the long-run, and will save you valuable time, effort and money.

#protocol #itmonitoring #itinfrastructuremonitoring