Monitoring your servers is important. Not only does it help you catch issues day to day, but it also helps with tasks like scaling and capacity planning. But no matter how advanced your monitoring is, it always starts with a simple server health indication.
Actually, maybe “simple” isn’t the best word here.
“Server health” usually gives you a “healthy/not healthy” indication. But that doesn’t mean that the underlying logic is also simple. In this post, you’ll learn what it actually takes to check server health.
What Is Server Health?
It may sound straightforward, but determining server health isn’t actually that easy. You need to take multiple metrics into account.
For starters, healthy doesn’t only mean that the server is running. A server may be up and running, but there can be multiple issues with it. CPU use can be at a constant 100%, disks can be (almost) full, or network throughput can be really low.
These issues are relatively easy to spot. But there are also cases where everything may look OK, but issues will come up occasionally. This is especially the case when it comes to disks and networking.
Intermittent disk or network issues are difficult to spot, but they’ll still cause problems for whatever runs on the server. Therefore, server health monitoring isn’t as simple as tracking resource consumption and uptime.
So, what should good server health indications actually include? Let’s dive into it.
It definitely takes more than the basic metrics to properly assess the health of the server. However, that doesn’t mean you should skip these basics. In fact, in most cases, they’ll give you a good indication of server health.
What are these basics, then?
Server Status and Uptime
Let’s start with something that (in theory) is the most direct indication of server health—the server status.
If the server is up and running, that means it’s healthy. If the server is down or not responding, then it’s not healthy. But is that really the case?
Think about it. Most companies have moved to cloud environments, which has made things a little more complicated. One of the perks of the cloud is flexibility and autoscaling. Your infrastructure in the cloud may automatically start and stop servers based on current needs. Therefore, if a server is down, it doesn’t automatically mean there’s something wrong. It may mean that an autoscaler stopped it because it’s not needed at this time. For the same reason, server uptime or restart count is no longer a reliable indication of server health.
Resource consumption is the other basic metric. Does it give you a clear answer? Well, not necessarily.
You may think that it doesn’t matter whether a server uses 10% or 90% of the CPU. If it uses any amount of CPU, then that means it’s up and running. So it should be “healthy,” right?
Depending on the situation, very high or very low resource consumption can be an indication of issues. It all depends on the context and patterns. Let’s say you have a server that has had steady 40% to 60% usage over the past year, and suddenly it spikes to 100%. That tells you there’s probably something wrong.
Imagine you have servers that are doing the heavy lifting in your company. Their typical usage is close to 100%. (For example, maybe they do batch processing of large amounts of data or GPU-powered graphics rendering.) If suddenly one of these servers drops to near 0% of usage, there’s definitely something wrong. So, what’s the problem? Maybe the software that’s doing important processing crashed.
Ideally, you should include resource consumption in your server health indicators—but only when you’re able to compare it with a baseline, and only for servers that have steady, predictable usage patterns.
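The baseline comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `deviates_from_baseline`, the `tolerance` and `min_spread` parameters, and the sample numbers are all hypothetical, and a simple mean and standard deviation stand in for whatever baseline your monitoring tool actually computes.

```python
from statistics import mean, stdev


def deviates_from_baseline(history, current, tolerance=3.0, min_spread=1.0):
    """Return True when `current` is far outside the historical usage pattern.

    `history` is a list of past CPU usage percentages (at least two values).
    `tolerance` and `min_spread` are assumed tuning knobs, not standard values.
    """
    baseline = mean(history)
    # Guard against a near-zero spread on very steady servers.
    spread = max(stdev(history), min_spread)
    return abs(current - baseline) > tolerance * spread


# A server with steady 40% to 60% usage that suddenly spikes to 100%.
steady_history = [42, 55, 48, 60, 51, 45, 58, 50]
print(deviates_from_baseline(steady_history, 100))  # True
print(deviates_from_baseline(steady_history, 52))   # False

# A batch-processing server that normally runs close to 100%.
batch_history = [97, 99, 98, 100, 96, 99]
print(deviates_from_baseline(batch_history, 2))     # True
```

Note that the same check flags both the unexpected spike and the unexpected drop, which is the point: the absolute number matters less than the distance from the usual pattern.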
What Else Should You Consider?
As mentioned at the beginning of this post, a server may be up and running, but that may not mean it’s healthy. We covered the basics, but even those metrics may not always tell you the whole truth. Here are a few extra things to consider when assessing server health.
Network and Storage
Both network and storage are very important for assessing server health. However, it’s not as simple as with CPU or memory usage. Here, we aren’t really interested in plain usage numbers. Instead, it makes sense to look at some specific metrics that can directly indicate health issues.
For networking, instead of looking at throughput or network saturation, you should be looking (for example) at latency and packet loss count. Of course, latency can also vary during the day, depending on the overall network traffic. But if it’s really off, then you know something’s wrong.
Latency counted in seconds instead of milliseconds is something to look into. Sometimes it may be just a software issue, but very high latency can also indicate a general networking issue on the server.
Actually, combining latency with packet loss count can help you determine if it’s a hardware or software issue. If you see very high latency and a large amount of packet loss, then it makes sense to mark a server as unhealthy.
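As a rough sketch of the rule above, the following Python function combines latency and packet loss into one verdict. The function name `network_health`, the three status labels, and both thresholds are assumptions made for illustration. The 1,000 ms limit simply mirrors the "seconds instead of milliseconds" remark, and the 5% loss limit is an arbitrary placeholder you would tune for your own network.

```python
def network_health(latencies_ms, packets_sent, packets_lost,
                   max_latency_ms=1000.0, max_loss_ratio=0.05):
    """Classify network health from latency samples and packet loss.

    Thresholds are illustrative assumptions, not recommended values.
    """
    # No latency samples at all means no probe came back; treat as worst case.
    if latencies_ms:
        avg_latency = sum(latencies_ms) / len(latencies_ms)
    else:
        avg_latency = float("inf")
    loss_ratio = packets_lost / packets_sent if packets_sent else 0.0

    high_latency = avg_latency > max_latency_ms
    high_loss = loss_ratio > max_loss_ratio

    if high_latency and high_loss:
        return "unhealthy"      # both signals agree: mark the server unhealthy
    if high_latency or high_loss:
        return "investigate"    # one signal only: worth a closer look
    return "healthy"


print(network_health([1800, 2500, 2100], packets_sent=100, packets_lost=20))
print(network_health([1500, 1700], packets_sent=100, packets_lost=1))
print(network_health([12, 15, 11], packets_sent=100, packets_lost=0))
```

The middle state is a design choice of this sketch: a single bad signal is treated as a reason to look closer rather than as proof that the server is unhealthy.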
When it comes to storage, we have a similar situation. Disk throughput isn’t that important for server health, although very slow writes or reads can indicate disk issues.
More interesting is I/O wait time. If you often see high I/O wait times, then I would consider such a server to be unhealthy. That doesn’t necessarily mean there’s something wrong with the disk. As with the network, it can indicate that the disk can’t handle the load. But it can also point to issues with the actual hardware.
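To make "often see high I/O wait times" concrete, here is a minimal Python sketch. It assumes a Linux server, where the aggregate `cpu` line of `/proc/stat` lists cumulative tick counters in the order user, nice, system, idle, iowait, and so on. The helper names, the 20% threshold, and the "half of the samples" rule are hypothetical choices for illustration, and the two sample lines are made-up values rather than real measurements.

```python
def parse_cpu_line(line):
    """Parse the aggregate "cpu" line of /proc/stat into a list of tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Share of CPU time spent waiting on I/O between two samples, in percent."""
    # Only the first eight counters are summed: user, nice, system, idle,
    # iowait, irq, softirq, steal. Guest time is already included in user time.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def often_high_iowait(samples, threshold=20.0, min_fraction=0.5):
    """True when at least `min_fraction` of the samples exceed `threshold` percent."""
    if not samples:
        return False
    high = sum(1 for value in samples if value > threshold)
    return high / len(samples) >= min_fraction


# Two made-up readings of the "cpu" line, taken some seconds apart.
before = parse_cpu_line("cpu  1000 0 500 8000 100 0 0 0 0 0")
after = parse_cpu_line("cpu  1100 0 550 8300 650 0 0 0 0 0")
print(iowait_percent(before, after))                 # 55.0
print(often_high_iowait([55.0, 3.0, 41.0, 38.0]))    # True
print(often_high_iowait([2.0, 1.0, 30.0, 4.0]))      # False
```

On a real server you would read the first line of `/proc/stat` at regular intervals and feed consecutive pairs into `iowait_percent`. A single high reading is deliberately not enough to flag the server here.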
How can you put this knowledge into practice? Let’s find out.
Now that you have a general idea of how to assess server health, it’s time to discuss how to actually perform a health check. There are a few ways to do so—it’ll mainly depend on the monitoring tool of your choice. But the general idea’s the same for all of them.
One option is to derive server health from the indicators your monitoring system already collects. For example, you could create logic that takes into account all the metrics mentioned above and, based on that, create a “healthy/not healthy” entry in your monitoring tool.
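A minimal sketch of such combined logic might look like the Python below. Everything in it is an assumption for illustration: the `Indicators` fields, the `assess` function, and the rule that any single problem makes the server unhealthy. A real monitoring tool will have its own way of expressing this.

```python
from dataclasses import dataclass


@dataclass
class Indicators:
    """Hypothetical per-server signals, each computed elsewhere."""
    expected_running: bool    # False if, say, an autoscaler stopped it on purpose
    responding: bool          # result of a basic up/down check
    usage_off_baseline: bool  # resource usage far from its usual pattern
    network_degraded: bool    # very high latency together with packet loss
    storage_degraded: bool    # frequent high I/O wait


def assess(ind):
    """Return (healthy, problems) for one server."""
    if not ind.expected_running:
        # A server that was stopped on purpose is not a health problem.
        return True, []
    problems = []
    if not ind.responding:
        problems.append("not responding")
    if ind.usage_off_baseline:
        problems.append("resource usage off baseline")
    if ind.network_degraded:
        problems.append("network degraded")
    if ind.storage_degraded:
        problems.append("storage degraded")
    return not problems, problems


print(assess(Indicators(True, True, False, False, True)))
# (False, ['storage degraded'])
print(assess(Indicators(False, False, False, False, False)))
# (True, [])
```

Returning the list of problems alongside the yes/no answer keeps the simple indicator while still telling you why a server was marked unhealthy.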
Another option is to perform a remote health check ad hoc. That means you send some sort of call to the server and wait for the response. Based on the response, you assess the server health.
This call can be in many forms, from simple ping (ICMP) or TCP packets to advanced HTTP calls directly to the software running on the server. Simple calls will tell you only if the server is up or not. More advanced, HTTP-based calls can not only tell you whether the server is running but also if it’s doing the job that it’s supposed to.
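The difference between a simple call and an HTTP-based call can be sketched with Python's standard library. The function names `tcp_check` and `http_check` are hypothetical, and so is the idea that the application exposes a health URL that answers with a 2xx status when it is working. Adjust both to whatever your software actually provides.

```python
import http.client
import socket
import urllib.request


def tcp_check(host, port, timeout=2.0):
    """Simple check: can a TCP connection be opened at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def http_check(url, timeout=2.0):
    """Advanced check: does the application answer with a 2xx status?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # urllib raises HTTPError (an OSError subclass) for 4xx/5xx responses.
        return False
```

For example, `tcp_check("app01.example.internal", 443)` would only tell you that something is listening on that port, while `http_check("https://app01.example.internal/health")` would also tell you whether the application considers itself able to do its job. Both the host name and the `/health` path are placeholders.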
Summing Up and Finding an Advanced Monitoring Solution
As you can see, fully understanding server health isn’t as simple as knowing whether a server is up and running. However, many companies still treat server health just like that. Such a simple indicator can often report that a server’s “healthy” when there are actually some problems with it.
If you don’t want to be one of these companies, take a look at advanced monitoring solutions that can help you build better server health assessments. One such tool is Netreo’s Server Management Software. No matter what operating systems you’re running on your server, Netreo can help you avoid being bombarded with useless alerts. Even better, it can provide you with helpful insights into the health of your servers. Netreo has a free trial and a useful, searchable blog.
This post was written by Dawid Ziolkowski. Dawid has 10 years of experience, starting as a Network/System Engineer, then working in DevOps, and most recently as a Cloud Native Engineer. He’s worked for an IT outsourcing company, a research institute, a telco, a hosting company, and a consultancy company, so he’s gathered a lot of knowledge from different perspectives. Nowadays he’s helping companies move to the cloud and/or redesign their infrastructure for a more Cloud Native approach.