Growth is exciting for an enterprise, but it often presents a unique challenge for IT professionals, who run into common roadblocks when trying to scale up an IT management environment. In this second blog of our Managing IT Infrastructure at Scale series, we discuss how to find the happy medium between monitoring software scalability and ease of use.
The two types of infrastructure management software
Infrastructure management software has traditionally fallen into one of two categories: easy-to-use software that’s slick and fast but was probably never designed to truly scale, and big-4 legacy systems that are expensive, confusing, and difficult to customize.
Windows-based systems for small environments
The first category often consists of Windows-based systems designed for small environments. These are very common in small to medium businesses, and organizations try to hold onto them for as long as possible as they grow and develop. These solutions often scale by adding servers, but usually run into issues with performance and admin overhead. For example, Windows patching, licensing, anti-virus, and maintenance requirements all scale up at the same rate as we add servers, and performance often suffers, limiting the maximum scale we can really manage with this kind of solution. They also sometimes fall victim to their target audience: the UI that works great for managing a hundred servers or routers suddenly becomes a huge pain when we are trying to manage thousands, or even tens of thousands, of devices. Any software solution that requires you to implement changes one device at a time has no chance of being able to handle a large global network.
Many of these solutions advertise ‘unlimited’ licenses, which seems appealing because it should allow you to add as many devices as you need. In practice, you can only add as many devices as one server can handle, and in a lot of cases this is a lot fewer than you’d think. One software solution we tested started to have serious issues at only 1500 devices on a single server, and that was without adding a lot of complex service or application checks or even running NetFlow. One design guide recommends a maximum of only 500 devices per server!
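To make the per-server ceiling concrete, here is a minimal sketch of the arithmetic. It assumes the two per-server limits quoted above (1500 and 500 devices) and a few illustrative device counts; the function name and the device counts are our own and not taken from any particular product.

```python
def servers_needed(device_count: int, devices_per_server: int) -> int:
    """Return how many monitoring servers are required for a device count.

    Uses integer ceiling division: a partially filled server still has to
    be licensed, patched, and maintained like a full one.
    """
    if device_count < 0:
        raise ValueError("device_count must not be negative")
    if devices_per_server <= 0:
        raise ValueError("devices_per_server must be positive")
    return (device_count + devices_per_server - 1) // devices_per_server


if __name__ == "__main__":
    # Per-server limits quoted in the text; device counts are illustrative.
    for limit in (1500, 500):
        for devices in (1_000, 10_000, 50_000):
            count = servers_needed(devices, limit)
            print(f"{devices} devices at {limit} per server: {count} servers")
```

Every one of those servers brings its own patching, licensing, and anti-virus overhead, which is why the admin burden grows in step with the device count.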
Old school ‘big-4’ solutions
The second category of management systems, the old-school legacy ‘big-4’ solutions, are designed to handle very large environments, but they have user interfaces that are ‘arcane and cumbersome’, and small changes or customizations to these systems can be expensive and require specialized training or consultants. It’s important to understand that this is actually by design, since most of those companies make a lot more money from their service and consulting operations than they do from selling the software. This means a lot of companies end up with a platform that never gets fully implemented before it has to be upgraded. One of our customers spent three years and over two million dollars trying to implement one of these solutions, having to start over twice due to annual upgrades, before finally having to abandon it and change direction. And that’s not even including the ongoing licensing costs and the excessive administrative burden.
Netreo’s approach to solving this challenge is to keep administration, reporting, and configuration centralized, to make scaling simpler. By making sure there’s a single user interface for these tasks – one that’s designed to scale – we remove the requirement to create custom web pages, access multiple UIs, or learn scripting languages to manage and maintain the platform. This allows you to scale faster by deploying service engines as needed with just a few clicks, while everything is managed and accessed from the same centralized web interface you’re used to.
Netreo’s infrastructure management software update process is designed so that you can either fully automate and schedule it, or download and install the latest versions with a single click, without disrupting the service. These updates are then automatically pushed out to all the different service and remote collection engines, making this process completely painless.
Netreo also takes a distributed approach to the problem, making sure remote collectors – what we call “service engines” – are doing much of the heavy computational and database work, so that while you can still get all your data from the single central console, the heavy lifting is spread around, again helping to scale to the largest network environments in the world.
The appliance-based approach means there are no external databases to administer or provision, and no operating system or antivirus requirements to keep up with.
The built-in automation is designed to make provisioning simple, without manual intervention, even in highly dynamic environments, so the solution can be deployed in days instead of months or years, and can stay in sync even in DevOps-paced environments.
Overwhelmed with Alerts
Alert overload is something most administrators have dealt with at one time or another, but it becomes a more intense issue as the environment scales up. This is the most common complaint: inboxes, phones, and chat apps are all exploding with notifications, sometimes hundreds or more a day, and no one can possibly keep up with them, much less make progress on solving the underlying issues. As a result, your operators and engineers start ignoring – or even filtering out – these alerts, leaving you waiting for user complaints when a production application fails.
Ignoring alerts like this is asking for trouble, because someone, someday, is inevitably going to ignore the wrong alert, and you’ll have a major outage on your hands that could easily have been prevented. When the users call the help desk because the LAST member of the database cluster fails, you’re already deep in the weeds. This can be even worse than having no monitoring in place, because management might feel secure that things are being watched closely when in reality the team is as good as blind. This means that even organizations with multiple advanced monitoring systems in place are often stuck in fire-fighting mode, reacting to alarms as they occur without ever really getting ahead of them and preventing outages before they affect users.
So how do you keep everything well-monitored, while not flooding your command center with alarms? As a general rule, you should reserve active alerts for actionable items only. If you’re not going to react to a notification as soon as it comes in, you shouldn’t be alerting on it. For example, if you’re getting alerts when hard drives reach 90% utilization, and then just ignoring them because it’s not urgent, two things will happen. First, sooner or later you’re going to get a server failure because you forgot to circle back on that alert before the drive filled up. Second, you’re conditioning yourself and your teams to ignore alerts, which can be costly. Instead, you should use detailed reports that tell you key metrics about each server.
A key point for making this approach work, though, is to schedule and automate these reports so nothing gets missed. With Netreo’s platform, anything you can view in the web interface can be turned into an automated report with just a few clicks, without any special training or SQL knowledge, and without having to use external reporting tools that can be a nuisance.
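The split between actionable alerts and report-only metrics can be sketched in a few lines. This is a generic illustration, not Netreo’s implementation: the `Reading` type, the function names, and the 98% "act now" threshold are all assumptions made for the example. Only the 90% figure comes from the disk example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    host: str
    disk_used_pct: float


# 90% comes from the example in the text; 98% is a hypothetical
# "someone must act now" threshold chosen for illustration.
REPORT_THRESHOLD = 90.0
ALERT_THRESHOLD = 98.0


def route(reading: Reading) -> str:
    """Decide whether a reading pages someone, goes in a report, or is fine."""
    if reading.disk_used_pct >= ALERT_THRESHOLD:
        return "alert"   # actionable right now: notify immediately
    if reading.disk_used_pct >= REPORT_THRESHOLD:
        return "report"  # worth tracking, but not worth interrupting anyone
    return "ok"


def build_report(readings: list[Reading]) -> list[str]:
    """Collect report-only readings, fullest disks first, for a scheduled report."""
    flagged = [r for r in readings if route(r) == "report"]
    flagged.sort(key=lambda r: r.disk_used_pct, reverse=True)
    return [f"{r.host}: {r.disk_used_pct:.1f}% used" for r in flagged]


if __name__ == "__main__":
    sample = [
        Reading("db-01", 99.2),
        Reading("web-03", 91.5),
        Reading("web-07", 94.0),
        Reading("app-02", 40.0),
    ]
    for r in sample:
        if route(r) == "alert":
            print(f"ALERT {r.host}: disk at {r.disk_used_pct:.1f}%")
    print("\n".join(build_report(sample)))
```

The design choice is that only the first branch ever interrupts a person. Everything in the second branch waits for the scheduled report, where it can be reviewed in one pass instead of as a stream of notifications.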
Some of these solutions can be put in place before you lay the foundation of your IT infrastructure. When you create a new enterprise, you often put procedures in place that allow for seamless growth toward the organization’s goals. Likewise, it is important that your IT team envisions growth for itself.