The importance of clear communications during a crisis is something emergency planners and first responders live with every day. They spend countless hours planning for contingencies, working out worst-case scenarios, and making sure they never lose the ability to coordinate. In contrast, Information Technology professionals rarely engage in this type of planning, although our experience makes clear they should.
Crisis Policy Development
When we’re asked to develop crisis policies for our clients, it isn’t uncommon to find they lack any plan whatsoever. Nor is it uncommon to find, during a mass outage affecting a company’s most critical operations, that the IT staff is improvising, trying to make the best of a bad situation. Simultaneously, a large number of subject matter experts are brought in, each looking at different alerts and different toolsets, and each coming to (understandably) different conclusions about the root cause of the problem. The result: recovery inevitably takes far more time and effort than anyone anticipated.
Any organization that relies on IT for daily operations must spend time developing a crisis communications and alerting policy. What elements comprise a good alerting policy? A good starting point is to define “what” variables should be monitored, “who” should be alerted when a problem occurs, “how” those personnel should be notified, and “when” notification is appropriate. Of course, the biggest question to ask when setting up an alerting policy is “why” we are sending alerts and deploying an NMS (network management system) at all. The answer to the “why” question will frame your answers to the others. The resulting document provides a baseline for identifying capability gaps and can serve as a blueprint for improving outage response and reducing mean time to repair.
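One way to keep the “what/who/how/when” answers from living only in someone’s head is to capture the policy as data that tooling and audits can check. The sketch below is illustrative only; every service, team, and channel name is a hypothetical assumption, not a recommendation.

```python
# A minimal sketch of an alerting policy captured as data rather than tribal
# knowledge. All names (services, teams, channels) are hypothetical examples.
ALERTING_POLICY = {
    "why": "Restore revenue-critical services quickly; minimize mean time to repair.",
    "monitors": [
        {
            "what": "order-api latency p95",         # variable to monitor
            "when": "p95 > 2s for 5 minutes",        # condition that warrants an alert
            "who": ["oncall-platform"],              # who gets notified
            "how": ["sms", "ticket"],                # notification methods
        },
        {
            "what": "core-switch reachability",
            "when": "3 consecutive failed pings",
            "who": ["oncall-network", "it-manager"],
            "how": ["phone-call", "ticket"],
        },
    ],
}

def capability_gaps(policy):
    """Return monitored items missing an owner or a notification method."""
    return [
        m["what"]
        for m in policy["monitors"]
        if not m.get("who") or not m.get("how")
    ]
```

A simple check like `capability_gaps(ALERTING_POLICY)` turns the policy document into the baseline the text describes: any monitor with no owner or no notification path is a gap to close.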
Why Not Monitor Everything?
The question “why not monitor everything?” comes up often. In our experience, monitoring everything only creates noise: over-monitoring masks real issues and causes operators to ‘tune out’ alerts, or even to configure filters that block them. Instead, focus monitoring on mission-critical services, applications, and infrastructure. While collecting information on many aspects of your systems and networks can be beneficial from a forensic or troubleshooting standpoint, notifications should be strictly limited to actionable items – things that will cause an operator or engineer to take corrective action. Everything else should be limited to on-demand retrieval or scheduled reports.
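The actionable-versus-forensic split above can be expressed as a simple routing rule: actionable events interrupt a human, everything else is stored quietly. This is a sketch under assumed severity names, not any particular product’s API.

```python
# Sketch: route only actionable events to operators; record everything else
# for on-demand retrieval. The severity labels are assumptions for illustration.
ACTIONABLE_SEVERITIES = {"critical", "major"}  # events demanding corrective action

def route_event(event):
    """Return 'notify' for actionable events, 'store' for forensic data."""
    if event["severity"] in ACTIONABLE_SEVERITIES:
        return "notify"   # page an operator or engineer
    return "store"        # keep for troubleshooting and reports, but stay quiet
```

The design point is that the default path is “store”: an event only interrupts someone when it is explicitly classified as requiring action.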
Once you have identified what to monitor, centralize notifications and workflow through a ticketing system or alert manager. There are two primary benefits: first, it reduces the number of disparate points of administration to maintain as personnel change; second, it provides a single source of IT service-level metrics, which are crucial for tracking the efficiency of your Infrastructure & Operations team. Whatever system you select, it must integrate easily both upstream and downstream with other IT components, offer flexible notification options, and provide intelligent filtering that groups related notifications or alarms together to avoid a flood of alerts during a mass outage. It also helps greatly if it automates keeping everyone updated as people take ownership of issues or begin working the problem.
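The “intelligent filtering” idea can be sketched concretely: collapse related alarms into one grouped notification keyed by the affected resource and a time window, so a mass outage produces a handful of notifications instead of hundreds. The field names and window size below are illustrative assumptions.

```python
# Sketch of alert grouping: alarms for the same resource within a time window
# become one notification. Field names and the 5-minute window are assumptions.
from collections import defaultdict

def group_alerts(alerts, window_secs=300):
    """Collapse alerts sharing a resource and time bucket into single groups."""
    groups = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        bucket = alert["ts"] // window_secs          # coarse time bucket
        groups[(alert["resource"], bucket)].append(alert)
    # Emit one summary notification per group instead of one per alert.
    return [
        {"resource": resource, "count": len(items), "first_ts": items[0]["ts"]}
        for (resource, _bucket), items in groups.items()
    ]
```

Real alert managers add smarter correlation (topology, dependency suppression), but even this crude bucketing shows why grouping matters: one failed core switch should yield one page, not one per downstream alarm.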
Internally hosted email is not sufficient as a method of communication during a crisis. Hosted email providers are generally better because they are less dependent on your organizational infrastructure; however, they still require Internet access. Therefore, thought must be given to building redundancy into the communications channels used during a major incident. Redundant Internet connectivity, or a true off-the-wire backup (such as a 4G access point), goes a long way toward creating more options during a crisis. Make sure your monitoring and automation systems can easily integrate with whatever platform you choose.
Group communications with ad-hoc organization are a powerful way to improve your crisis response, provided your team can still access them when the network is dark. A number of these services are available now, ranging from the very simple (e.g., Jabber) to much more advanced and complex options that offer voice or video and can integrate with your organization’s unified communications system.
No matter which method is chosen, several requirements must be met: the system must be clearly understood, well documented, communicated internally, thoroughly monitored, and redundantly connected. The last and most important requirement is that the system actually get used in crisis situations. If your organization does not regularly run crisis drills, consider adding one to the operational plan. Practice makes perfect, and it is always preferable to be surprised in a controlled situation rather than in the heat of an actual outage.