7 Habits of Highly Effective Network and Systems Administrators

Introduction

Network & Systems Administrators are the backbone of any IT organization.

They provide critical monitoring and support for all IT resources—be they networks, SD-WANs, servers, apps, or virtual and cloud resources.

After working with hundreds of Network & Sys Admins, we’ve identified 7 crucial habits of the most effective ones.

1. Avoid Data Deluge

A typical Network or Systems Admin receives as many as 200 alerts/day. As many as 80% of these can be triggered during regular work hours. That means as many as 160 alerts need triaging in a 10-12 hour work day. This comes to roughly 1 alert every 4 minutes!
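Working through the figures quoted above (200 alerts, 80% in work hours, a 10-12 hour day) gives a gap of about four minutes between alerts. The short sketch below just reproduces that arithmetic; the variable and function names are ours.

```python
# Figures taken from the paragraph above.
alerts_per_day = 200
percent_in_work_hours = 80
work_day_hours = (10, 12)

# 80% of 200 alerts arrive during the work day.
alerts_in_work_hours = alerts_per_day * percent_in_work_hours // 100  # 160


def minutes_per_alert(hours: float) -> float:
    """Average gap between alerts, in minutes, for a work day of this length."""
    return hours * 60 / alerts_in_work_hours


for hours in work_day_hours:
    print(f"{hours}-hour day: one alert every {minutes_per_alert(hours):.2f} minutes")
```

This prints 3.75 minutes for a 10-hour day and 4.50 minutes for a 12-hour day.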

Most of these alerts are either redundant or lower priority, and can be triaged easily. However, it still takes time and manual effort to triage them.

Not only is the task of triaging redundant and low priority alerts overwhelming, it also has sinister side effects on a Network Admin’s work.

  • The sheer volume of alerts and notifications crowds out the most important tasks a Network Admin has to perform. This, in turn, leads to slow response times, missed deadlines and unhappy customers.
  • The Network or Systems Admin’s quality of work suffers.
  • Their quality of life suffers even more.

The most effective Network and Systems Admins have realized that, for them to be productive and useful to their teams, they have to deal with this problem head on. Their solution:

  1. Reduce alerts (see this blog post for helpful suggestions for reducing alerts: The Oxymoron Attack on Network and Systems Management).
  2. Automate triaging of the remaining alerts. In other words, take an inventory of all the alerts received over the course of a week or so and then build out business rules in your monitoring/alerting platform which silence the redundant and low priority ones.
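To make the second step concrete, here is a minimal sketch of what such business rules might look like if written out by hand. It assumes a simplified alert record and a hypothetical rule set (the `Alert` fields and the `LOW_PRIORITY` set are ours, not any particular platform's API); a real monitoring platform would express the same idea in its own rule syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str    # device or application that raised the alert
    check: str     # what was being checked, e.g. "cpu" or "link"
    severity: str  # "critical", "warning", or "info"


# Hypothetical rule: severities treated as low priority and silenced.
LOW_PRIORITY = {"info"}


def triage(alerts: list[Alert]) -> list[Alert]:
    """Return only the alerts that need a human, in arrival order."""
    seen: set[tuple[str, str]] = set()
    kept = []
    for alert in alerts:
        if alert.severity in LOW_PRIORITY:
            continue  # silence low-priority alerts
        key = (alert.source, alert.check)
        if key in seen:
            continue  # silence repeats of an alert already surfaced
        seen.add(key)
        kept.append(alert)
    return kept
```

Feeding a week's inventory of alerts through a function like this is one way to see how much of the volume the rules would remove before you commit them to the platform.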

2. Deploy a Single Pane of Glass Dashboard

At any given time, Network or Systems Admins have to monitor at least 10 to 12 different types of resources.

As we mentioned above, these include networks, SD-WANs, server workloads, apps, and virtualized and cloud resources.

On top of that, each resource will have its own tools. For example, networks use Network Management Systems like Zabbix and Xymon, virtualized server workloads use tools like vCenter, applications use APM (New Relic and AppDynamics), and cloud resources like Meraki have their own separate view into their hardware.

Each tool comes with its own dashboard.

Many claim to provide a “Single Pane Of Glass” (SPOG) view, but not all live up to the claim. Ideally, a Single Pane Of Glass should demonstrate these three characteristics:

  1. Clear and unambiguous status of monitored elements.
  2. Quick and easy drill down into problem conditions.
  3. Minimal to no “care and feeding” requirements.
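The first two characteristics can be illustrated with a small sketch. It assumes a three-level status model and a "worst status wins" roll-up, which is one common convention rather than how any specific product works; the names `Status`, `rollup` and `drill_down` are ours.

```python
from enum import IntEnum


class Status(IntEnum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


def rollup(elements: dict[str, Status]) -> Status:
    """Overall status of a group is the worst status of any element in it."""
    return max(elements.values(), default=Status.OK)


def drill_down(elements: dict[str, Status]) -> list[str]:
    """Names of the elements that are not OK, worst first."""
    problems = [name for name, status in elements.items() if status != Status.OK]
    return sorted(problems, key=lambda name: elements[name], reverse=True)
```

A single roll-up value gives the unambiguous top-level status, and the drill-down list answers "which element is causing it?" in one step.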

For example, here is a sample screenshot of what a real SPOG should be able to produce.

There are only a handful of tools that can claim a real SPOG. Among the products prevalent in the field are IBM Tivoli, EMC Smarts and Netreo OmniCenter. (We’re currently working on a handy blog post to help you determine if your SPOG is really a SPOG. So be on the lookout for that in the near future.)

3. Automate Repeatable Tasks

Ask any Network or Systems Admin how they spend the majority of their time, and the most common answer you’ll get is: fighting fires.

These folks also acknowledge that the most pressing problems boil down to only three things:

  1. A high number of redundant alerts.
  2. An overwhelming number of new resources to be managed.
  3. Human error.

We’ve already addressed the redundant alert problem above.

For the remaining two, automation is the key.

The most effective Network and Systems Admins work ruthlessly at automating all the tasks they can. However, before they start, they need to figure out two fundamental things:

  1. What to automate?
  2. How to automate it?

Surprisingly, determining what to automate can be almost as complex as how to do it.

So, how do the most effective Network and Systems Admins build their prioritized list for automation? Well, it starts with an understanding that their daily activities can be broken down into four broad categories:

  • Important and Urgent Tasks – These are the items typical of a day in the life of any Network or Systems Admin, such as responding to alerts.
  • Not Important, but Urgent Tasks – The most common activity here, by far, is dealing with redundant and false-positive alerts from the myriad equipment and applications Network and Systems Admins are responsible for.
  • Important, but Not Urgent Tasks – In this category are items that aren’t “drop everything” tasks, but are still things that, under most circumstances, Network and Systems Admins are the most qualified to address, such as capacity planning, deployment of new tools and upgrades, generating reports for management decision support, and managing the infrastructure.
  • Not Important and Not Urgent Tasks – Last on the list are activities that usually fall under the Network or Systems Admin’s purview, but are pushed down the priority list when other infrastructure related fires flare into existence. Examples here include tracking equipment metadata (such as serial numbers) and support contract status, as well as patch management and new equipment provisioning.
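As a starting point for building that list, each recurring task can be tagged as important or not and urgent or not, then grouped into the four categories above. The sketch below is only an illustration: the task records and function names are hypothetical, and the example tasks are drawn from the descriptions above.

```python
from collections import defaultdict


def quadrant(important: bool, urgent: bool) -> str:
    """Name the category a task falls into, using the labels from the list above."""
    if important and urgent:
        return "Important and Urgent"
    if urgent:
        return "Not Important, but Urgent"
    if important:
        return "Important, but Not Urgent"
    return "Not Important and Not Urgent"


def group_by_quadrant(tasks: list[tuple[str, bool, bool]]) -> dict[str, list[str]]:
    """Group (name, important, urgent) task records by category."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name, important, urgent in tasks:
        groups[quadrant(important, urgent)].append(name)
    return dict(groups)


# Example tasks taken from the categories described above.
tasks = [
    ("respond to alerts", True, True),
    ("dismiss false-positive alerts", False, True),
    ("capacity planning", True, False),
    ("track equipment serial numbers", False, False),
]

for category, names in group_by_quadrant(tasks).items():
    print(f"{category}: {', '.join(names)}")
```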

4. Use Templates

Today, new resources come online every minute. To ensure that they comply with your information systems policies, all resources should:

  • Be configured the same way.
  • Follow the same monitoring rules.
  • Notify stakeholders in a uniform manner.
  • Alert using consistent thresholds and conditions.
  • Report in a homogeneous way.

The most effective Network and Systems Admins recognize the importance of these directives and deploy a template-based solution to address them.

An excellent example here would be the monitoring of memory utilization on Microsoft SQL servers. It is well known that MS SQL systems will use all of the memory allocated to them. Therefore, it follows that you’d want a different template for your SQL Server infrastructure than for your MS Windows servers running middleware applications. Your SQL servers have special operating parameters not present elsewhere.
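A minimal sketch of the idea, assuming templates are plain dictionaries with a shared base and per-role overrides. The role names, field names and threshold values here are hypothetical and illustrative, not recommendations.

```python
# Hypothetical base template applied to every server.
BASE_TEMPLATE = {
    "version": 1,
    "cpu_warn_pct": 85,
    "memory_warn_pct": 90,
    "notify": "ops-team",
}

# Hypothetical per-role overrides layered on top of the base.
ROLE_OVERRIDES = {
    "windows-middleware": {},
    # SQL Server tends to use the memory it is given, so a plain
    # utilization threshold is disabled for this role.
    "mssql": {"memory_warn_pct": None},
}


def template_for(role: str) -> dict:
    """Merge the base template with any role-specific overrides."""
    return {**BASE_TEMPLATE, **ROLE_OVERRIDES.get(role, {})}


print(template_for("mssql"))
print(template_for("windows-middleware"))
```

Keeping the shared settings in one base template means every role still configures, notifies and reports the same way, with only the genuinely different parameters overridden.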

Templatization isn’t easy. You have to understand the pattern for each templatized resource/workflow. You also have to build your corporate policies and expectations into the developed templates. Finally, all templates must be versioned, saved and backed-up automatically.

5. Accelerate Root Cause Analysis

If you’re already using the techniques listed up to this point, chances are you’ve already eliminated 50-60% of your potential problems. You’re also most likely already recognized as a highly effective Network or Systems Admin, and your peers likely come to you for help and advice.

But while you’re advising them, and discussing the philosophy of changes IT will go through over the next 5 years, yet another resource failure is detected. Its status turns red and the alerts start to come in. But, because you have already eliminated 80% of the redundant alerts, and automated a huge chunk of triage work an average administrator has to perform, you know that this problem is no false alarm.

The whole team is now racing to find the root cause. Every minute spent on decoding the problem means one more minute of disruption. You don’t like it. Your boss doesn’t like it. And, above all, your customers don’t like it.

But, as an effective Network or Systems Administrator, you have one more ace up your sleeve.

Within minutes, you know exactly how and where to look for the problem and perform a root cause analysis.

Along with automation, the most effective Network and Systems Administrators understand that the key to success and happy customers (and a happy manager too) is access to the right tools: ones that not only show a consolidated dashboard but also come with one-click drill-downs.

Combined with a proper Single Pane Of Glass, these tools are so effective that they can save you 10-15 hours each week! (That’s like having one free day every work week!)

Here’s an example of how a one click drill down capability should work.

6. Say No to Tools with Large Care and Feeding Requirements

As we discussed earlier, a typical Network or Systems Admin has about 10-12 tools available to them to monitor their IT resources.

At least a few of these can take months to configure, deploy and customize. And vendors typically don’t mention the large and complex infrastructures required to support their tools.

The most effective Network and Systems Administrators understand two important things about such tools:

  1. They have a high cost of deployment and management.
  2. For all their bells and whistles, no more than 20-30% of their capabilities will ever be used!

So how do they pare this large list of tools down to the most productive ones?

Stated simply, the most effective Network and Systems Admins ask vendors one key question, “What’s the cost of managing your management system?”

They then ruthlessly kick out any offending products.

An “offending product” is any IT management software that violates any of the 4 principles of modern management platforms:

  1. Zero cost deployment.
  2. No management/maintenance cost.
  3. All in one solution.
  4. Provides a genuine Single Pane Of Glass view.

7. Use Predictive and Prescriptive Reports and Analytics

For all the tools Network and Systems Administrators have, failures are still often reported by the users.

This tends to happen because all monitoring tools work off of thresholds set by someone other than the end-user of the resource. Even most Network or Systems Admins don’t have full control over all the thresholds across all the tools they have to use.

This results in one of two things:

  1. If a threshold is set too high, at least some users will experience a problem before an alert goes out.
  2. If a threshold is set too low, a system can generate hundreds of meaningless alerts.

The most effective Network and Systems Administrators understand that the solution is not to simply tweak alert thresholds.

Instead, they develop a comprehensive strategy to go from reporting failures to predicting them.

How do they do that?

By deploying solutions with built-in predictive reporting capabilities.

Using data from the past load, thresholds and available resources, the predictive reporting engine applies machine learning (ML) algorithms to determine if there is a real possibility of failure or not.
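To give a feel for the idea, the sketch below uses a plain least-squares linear trend as a stand-in for the ML algorithms mentioned above. It is far simpler than what a real predictive engine would do. It assumes one utilization sample per day, and the function names are ours.

```python
from typing import Optional


def fit_line(ys: list[float]) -> tuple[float, float]:
    """Least-squares slope and intercept for samples taken at x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(daily_usage_pct: list[float], threshold_pct: float) -> Optional[float]:
    """Days after the last sample until the trend reaches the threshold.

    Returns None when the trend is flat or falling, and 0.0 when the
    trend line has already reached the threshold.
    """
    if len(daily_usage_pct) < 2:
        raise ValueError("need at least two samples to fit a trend")
    slope, intercept = fit_line(daily_usage_pct)
    if slope <= 0:
        return None
    last_day = len(daily_usage_pct) - 1
    crossing_day = (threshold_pct - intercept) / slope
    return max(0.0, crossing_day - last_day)


# Disk usage growing 2 points per day, currently at 68%.
print(days_until_threshold([60, 62, 64, 66, 68], threshold_pct=90))
```

For the sample series, the trend reaches 90% eleven days after the last reading, which is the kind of early warning a fixed threshold alone cannot give.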

Here’s a good example of a set of predictive reports.

We’ve covered predictive reporting before—and even proposed a solution. However, be mindful that this is still an evolving technology.

Conclusion

That was a lot of information. But, if put into practice, these seven habits can take your team from just being good to being great!

If you know more techniques, or would like us to explain anything in more detail, leave a comment below.
