Top 10 Monitoring & Observability Issues You Want to Avoid

To be honest, I wrote these notes a while ago, but when I re-checked them (with a view to writing this article) a lot of these issues still seemed to exist in companies, so I thought I would publish them anyway to stimulate thought & discussion.

I’ve been an IT Monitoring specialist & freelance IT contractor for more than 25 years, initially focussed on infrastructure monitoring, but more recently also involved in observability tools and cloud monitoring technologies.

Here are the top 10 problems I’ve seen with IT monitoring environments over the years – in the hope that, being aware of them, you can make sure each area has been addressed and isn’t a problem for you and your organisation…

1. Monitoring & Observability Platform Issues

Obviously this is less of an issue if you are using a SaaS solution for your monitoring & observability, but if you’re hosting your monitoring solutions on-prem then make sure you don’t have these issues:

  • Unstable platforms or systems
  • Old tools, or tools which have not been upgraded, patched or maintained regularly
  • Cutting corners with free software that doesn’t deliver the required functionality
  • Lack of adequate support

2. Unmanageable Alert Volumes

Virtually everywhere I go I see the problem of “we get too many alerts”. These are the key things you should try to avoid:

  • Too many alerts so that Operations teams viewing the alerts can’t see the wood for the trees (overwhelm) – this then means you are more likely to miss critical alerts
  • Not enough resources to deal with the alert volumes
  • Sometimes “too many alerts” can be due to incorrectly configured monitoring (which leads onto the next problem)

3. Message Noise

So how is this different to the “too many alerts” issue above?

This is more about the quality of the alerts – are they required, are they meaningful, and are they being presented to the right people?

Address these key issues where possible:

  • Stop alerting on things you don’t care about
  • Review alerts that are incorrectly classified – these should be reclassified, suppressed or marked as informational, for example (and perhaps removed from the Operations view)
  • Review alerts that only require a trouble-ticket, and investigate whether integration into your ticketing system can be automated for these alerts so they can be hidden from the Operators’ view (see the sketch after this list)
  • Again, this “noise” can cause overwhelm, which can lead to operational mistakes – missing genuinely critical alerts, for example
  • Often this is a result of poorly configured monitoring, or monitoring that is not aligned to the monitoring requirements
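
To make the automation point concrete, here is a minimal sketch of the kind of routing I mean – the alert fields, the suppression list and the create_ticket() helper are all hypothetical placeholders, not any particular product’s API:

```python
# Minimal sketch: route alerts so that only actionable ones reach the Ops view.
# Alert fields, the suppression list and create_ticket() are illustrative only.

SUPPRESSED = {"disk-tmp-90pct-dev"}  # agreed noise – review this list regularly

def create_ticket(alert: dict) -> None:
    """Placeholder for an automated ticketing-system integration."""
    print(f"Ticket raised for {alert['name']} on {alert['host']}")

def route(alerts: list[dict]) -> list[dict]:
    ops_view = []
    for alert in alerts:
        if alert["name"] in SUPPRESSED:
            continue  # agreed noise – never shown to Ops
        if alert.get("handling") == "ticket-only":
            create_ticket(alert)  # automated ticket, hidden from the Ops view
            continue
        ops_view.append(alert)  # everything else needs human attention
    return ops_view

if __name__ == "__main__":
    incoming = [
        {"name": "disk-tmp-90pct-dev", "host": "dev01", "handling": "ticket-only"},
        {"name": "batch-job-late", "host": "app02", "handling": "ticket-only"},
        {"name": "db-cluster-down", "host": "db01", "handling": "callout-24x7"},
    ]
    for alert in route(incoming):
        print("OPS:", alert["name"], "on", alert["host"])
```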

4. Unclear Actions

If you have configured alerts, they need to be clear, and it is imperative that people viewing the alerts know exactly what is expected of them. Avoid these issues:

  • Unclear alerts that make it hard to decipher what the real issue is
  • Lack of instructions or procedures – what are Ops meant to do with these alerts?
  • Unclear actions cause delay, uncertainty, mistakes & ambiguity

I have seen this example in many organisations:

“I called Bob earlier and he said not to call him out again for this problem”

but for how long? For the next hour, day, week, month or year? You get the problem.

And should we ignore this issue for all servers or just the one that alerted first?

CONFUSION!

Alerts should have clear instructions for Ops, or there should be a clear set of processes & procedures for how alerts are handled. I prefer explicit & clear alert handling instructions, for example:

CALLOUT 24X7

CALLOUT 9X5

TICKET-ONLY (which ideally should not be seen by Ops at all if you have integrated your ticketing system accordingly, as previously mentioned)
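
As a rough illustration of how such instructions can be enforced at source – assuming alerts are simple records carrying a free-form handling field, rather than any particular tool’s schema – anything outside the agreed set is rejected before it ever reaches Ops:

```python
# Minimal sketch: enforce an explicit, agreed handling instruction on every alert.
# The handling values mirror the examples above; the alert record is hypothetical.

ALLOWED_HANDLING = {"CALLOUT 24X7", "CALLOUT 9X5", "TICKET-ONLY"}

def tag_alert(alert: dict, handling: str) -> dict:
    """Attach a handling instruction, rejecting anything outside the agreed set."""
    if handling not in ALLOWED_HANDLING:
        raise ValueError(f"Unknown handling instruction: {handling!r}")
    return {**alert, "handling": handling}

if __name__ == "__main__":
    alert = tag_alert({"name": "db-cluster-down", "host": "db01"}, "CALLOUT 24X7")
    print(alert)  # Ops can see at a glance exactly what is expected of them
```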

5. No Alerts

Of course, the other side of getting too many alerts is not getting enough alerts when there IS a fault in the environments that are being monitored.

Things to avoid & address:

  • Missing alerts – when there’s an IT issue, but no corresponding alert (this should be addressed via incident post-mortems to investigate why the alert was not picked up or configured in the first place)
  • Witch-hunts – I have seen so much time wasted on finding out “who” was to blame. Usually it was a case of there being no alert relating to the issue, or a problem with the alerting processes & procedures – whatever it is, identify it and fix it fast

Normally missing alerts are due to either a monitoring gap (more on this later) or monitoring not being configured properly.

If monitoring has not been configured properly, this may highlight inadequate testing.

Fix the issue, and employ better testing processes to ensure the same situation can’t happen again.

If the monitoring was not configured at all, treat it as an opportunity to improve your monitoring iteratively, making it better step by step…

People do make mistakes, but it’s what you do with the information gained that is important – use these situations wisely as opportunities to learn and improve.

6. Monitoring Gaps

As mentioned in point 5, maybe there were no alerts because you’re NOT monitoring that component!

If so then do the following:

  • Full monitoring review & gap analysis (see the sketch after this list)
  • Ensure all areas are monitored
  • Ensure teams are aware of monitoring – they are responsible for ensuring their infrastructure / area is monitored sufficiently
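
A gap analysis can start very simply: compare your source of record against what the monitoring platform actually knows about. The sketch below assumes two hypothetical exports – a host list from a CMDB or inventory, and a host list from the monitoring tool:

```python
# Minimal sketch of a monitoring gap analysis: anything in the inventory that the
# monitoring platform does not know about is a gap. Both host lists are assumed
# to come from exports (CMDB/inventory and monitoring tool respectively).

def find_gaps(inventory: set[str], monitored: set[str]) -> set[str]:
    return inventory - monitored

if __name__ == "__main__":
    inventory = {"web01", "web02", "db01", "db02", "backup01"}
    monitored = {"web01", "web02", "db01"}
    for host in sorted(find_gaps(inventory, monitored)):
        print(f"GAP: {host} is in the inventory but not monitored")
```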

7. Problems with Alert Handling Procedures

Sometimes monitoring HAS been configured correctly and alerts WERE received, so what could have gone wrong here?

  • Ops not alerting & notifying teams correctly?
  • Alert noise issues? Are alerts configured correctly with clear instructions?
  • Are there well-defined Ops processes & procedures for handling alerts?
  • Are there “ignore” lists? These are VERY dangerous – lists of alerts that Ops see on a regular basis and are told to ignore for a day, and then the problem is fixed but the alert remains on the ignore list (see the sketch after this list)
  • Unclear callout processes – what if Ops are told to ignore an alert? What if it happens again an hour later, at 3.30am? Should they call it out again? What if they called it out and the person on the end of the phone was really grumpy or annoyed at being woken up? What should they do?
  • Clear processes & procedures can help to remediate all of this, but only if they are agreed and actioned religiously.
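
If you really must keep an ignore list, give every entry an owner and an expiry date so that nothing lingers forever. A minimal sketch (the entry format is my own invention, not any particular tool’s):

```python
# Minimal sketch: an ignore list whose entries expire automatically. The point is
# that every suppression has an owner and an end date, so nothing stays ignored
# by accident once the underlying problem has been fixed.

from datetime import date

IGNORE_LIST = {
    "san-path-flap": {"owner": "storage-team", "expires": date(2024, 6, 1)},
}

def is_ignored(alert_name: str, today: date | None = None) -> bool:
    today = today or date.today()
    entry = IGNORE_LIST.get(alert_name)
    if entry is None:
        return False
    if today > entry["expires"]:
        print(f"REVIEW: ignore entry for {alert_name} expired on {entry['expires']}")
        return False  # expired entries no longer suppress anything
    return True

if __name__ == "__main__":
    print(is_ignored("san-path-flap", today=date(2024, 5, 1)))  # True – still suppressed
    print(is_ignored("san-path-flap", today=date(2024, 7, 1)))  # False – flagged for review
```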

8. Lack of Visibility

Often there can be a real lack of visibility of alerts and problems for other interested teams.

  • Teams should be able to see alerts for their domain during the working day, so they can remediate issues before they even get a phone call from Ops. This can be achieved by ensuring teams have adequate access to the monitoring tools themselves, perhaps via large screens if appropriate.
  • Teams should be made aware of alerts/problems via ticketing plus notification rules (for example, auto-emails to all members of the trouble-ticket’s assigned group) – and also “interested parties”: they might not be the owners of the server, for example, but they have a vested interest if the server is a mission-critical component of their service (see the sketch after this list)
  • Increased awareness regarding monitoring can help – roadshow presentations & management passing on information to divisions/teams
  • Adequate procedures for go-lives or new projects can also help – all teams must engage with monitoring, for example, and every go-live must have monitoring sign-off before the service is allowed to go live.
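
As a sketch of the kind of notification rule I mean – the group membership data and the send_email() helper are placeholders for whatever your ticketing system and mail gateway actually provide – every member of the assigned group, plus any registered interested parties, gets a copy when a ticket is raised:

```python
# Minimal sketch: notify the assigned group and any interested parties when a
# ticket is raised. Group membership, interested-party registrations and
# send_email() are all illustrative placeholders.

GROUP_MEMBERS = {
    "unix-team": ["alice@example.com", "bob@example.com"],
}
INTERESTED_PARTIES = {
    "db01": ["payments-service-owner@example.com"],  # mission-critical dependency
}

def send_email(recipient: str, subject: str) -> None:
    print(f"EMAIL to {recipient}: {subject}")

def notify(ticket: dict) -> None:
    recipients = set(GROUP_MEMBERS.get(ticket["assigned_group"], []))
    recipients |= set(INTERESTED_PARTIES.get(ticket["host"], []))
    for recipient in sorted(recipients):
        send_email(recipient, f"[{ticket['id']}] {ticket['summary']}")

if __name__ == "__main__":
    notify({"id": "INC0012345", "assigned_group": "unix-team",
            "host": "db01", "summary": "db-cluster-down on db01"})
```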

9. Disaster Recovery (DR) & Failover Scenarios

Often DR is not considered at all, and the processes & procedures that do exist are inadequate or untested. Avoid these common issues:

  • Failover issues / delays – these imply bad design
  • Failover process not clear or understood. Are there manual steps to be performed? What is automated and what is not?
  • Many organisations with on-prem hardware run six-monthly “role-swap” activities, which give them a chance to test failover processes & procedures
  • Many organisations don’t consider DR & failover scenarios until a real failure happens – which is often too late

10. People

Sometimes issues are not with the tools themselves, but with the people (or lack thereof). Avoid these potential problem areas:

  • No BAU team handling day-to-day monitoring requests and activities
  • No “engineering” team working on additional functionality and improvements (for example integration with ticketing systems etc.)
  • Not enough people to support the tools
  • Not enough people watching for alerts (Ops teams, for example)
  • Inadequate leadership, management, and support internally for the tools and function of the monitoring service