The word observability has its root in control theory. R.E. Kálmán in 1960 defined it as a measure of how well you can infer the internal states of a system from knowledge of its external outputs. Observability is such a powerful concept because it allows you to understand the internal state of a system without the complexity of the inner workings. In other words, you can figure out what’s going on just by looking at the output.
Software and architectures have evolved over the past decade from monolithic applications running in a single process to complex architectures comprising hundreds, if not thousands, of services distributed across numerous nodes. This evolution calls for different tools and techniques to reason about the disparate components that make up modern software.
When you adapt observability to software, it allows you to interact with and reason about the code you write in a new way. An observable system allows you to answer open-ended questions and understand:
- The inner workings of your application
- The current state of your system, no matter how extreme or unusual
- Issues users are currently experiencing
- Why a system behaves a certain way—without guessing
People have varied levels of understanding of what observability entails. For some, it’s just old-fashioned monitoring disguised as a new buzzword. But what exactly is observability, and how does it differ from monitoring?
How Observability Differs From Monitoring
Many companies have long used metrics-based monitoring tools to reason about their systems and troubleshoot problems. IT operations teams in these organizations aggregate metrics and display them on numerous dashboards across massive screens hung around their operations rooms.
Before the era of distributed architectures, this style of monitoring worked well enough for traditional applications. However, monitoring has a flaw: it never lets you see or understand your systems completely. Monitoring forces you to be reactive, speculating about what’s wrong.
Monitoring is based on many assumptions that no longer apply to modern applications. For example, monitoring assumes that:
- Your application is monolithic.
- You have a static number of nodes or hosts to monitor.
- You observe a system only when disasters strike.
- The application runs on virtual machines or bare metal, giving you full access to system metrics.
- You can cover operations engineers’ needs through dashboards and telemetry.
These assumptions are no longer valid in reality for modern architectures for the following reasons:
- There are many services to manage.
- There are many data stores, built on different technologies, to observe.
- Infrastructure is highly dynamic, with capacity appearing and disappearing on a whim, depending on demand.
- The number of hosts or nodes to monitor is always changing and unpredictable.
- Automatic instrumentation is insufficient for understanding what’s happening in complex systems.
- Many disparate and loosely coupled services are under management, many of which are out of the site reliability engineering team’s direct control.
Failure modes in distributed systems are unpredictable. They are numerous, and each one repeats infrequently enough that most teams are unable to set up appropriate and relevant dashboards to monitor them in advance.
This is where observability becomes crucial. It enables engineering teams to collect telemetry in flexible ways, allowing them to diagnose issues without first having to foresee how errors might occur.
Why Observability Is Important
An observable system is easier to understand (both at a high level and in detail), monitor, update with new code, and repair than an opaque one. But beyond that, there are even more reasons to make your system observable. Observability will help you:
- Uncover and resolve “unknown unknowns,” or problems you aren’t aware of: One of the most significant limitations of monitoring systems is that they only look for “known unknowns,” unusual circumstances that you’re already aware of. Observability surfaces circumstances you wouldn’t be aware of or wouldn’t think to search for, then ties them to specific performance issues, providing the context needed for root cause identification and resolution.
- Identify and address concerns early in the development process: Observability gives you insight into the early stages of the software development process. Development teams can spot and fix issues in new code before they impair the user experience or violate service-level agreements (SLAs).
- Save time: Without observability, developers are left to make wild guesses about why X or Y happens. This is unproductive and costly in terms of time and resources.
- Provide deep insight: Observability lets you understand root causes.
To summarize, you need observability to empower your DevOps teams to investigate any system, no matter how complex, without relying on experience or intimate system knowledge to analyze root causes.
Pillars of Observability
Observability provides unparalleled visibility into a system’s state. But this visibility comes with some guiding pillars or principles.
Dissected properly, observability has two key elements: first, the people who need to understand a complex system, and second, the data that aids that understanding. You can’t have proper observability without acknowledging people, technology, and the interactions between them.
With this understanding come two questions:
- How does one gather data and assemble it for inspection to provide the insight needed?
- What are the technical requirements for processing and transmitting the data?
This is where the three pillars of observability known as metrics, logs, and traces come into play. Let’s look at them one by one.
Metrics (also known as time series metrics) are basic indicators of application and system health over time. A metric could be how much memory or CPU an application consumes over a specific period. A metric includes a time stamp, a name, and a field representing some value.
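As a minimal sketch (the class and field names are illustrative, not from any particular metrics library), a time-series data point is just the three parts described above: a name, a value, and a time stamp.

```python
import time
from dataclasses import dataclass, field


@dataclass
class MetricPoint:
    """One time-series sample: a name, a value, and when it was captured."""
    name: str
    value: float
    timestamp: float = field(default_factory=time.time)


# Record how much memory an application is using at this moment.
sample = MetricPoint(name="app.memory.used_bytes", value=512 * 1024 * 1024)
```

A metrics backend then stores millions of such points and aggregates them by name over time windows.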
Metrics are an obvious place to start with monitoring because they’re useful for describing resource status. That is, you can ask questions based on known issues, such as “Is the system live or dead?”
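To make that kind of liveness question concrete, here is a hedged sketch of an alert rule over a health-check metric (the function, threshold, and encoding are illustrative, not taken from any monitoring product):

```python
# Hypothetical alert rule over a health-check metric (1 = alive, 0 = dead):
# declare the service dead when several consecutive samples report 0.
def is_dead(health_samples, window=3):
    """Return True if the last `window` samples all report 0."""
    recent = health_samples[-window:]
    return len(recent) == window and all(v == 0 for v in recent)
```

A real monitoring system evaluates rules like this continuously against incoming samples and alerts someone when one fires.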
Metrics are designed to provide visibility into known problems. For unknown problems, you need more than metrics. You need context—and a valuable source of information for that context is logs.
Logs are detailed, immutable, time-stamped records of application events. Developers can use logs to create a high-fidelity, millisecond-by-millisecond record of every event, complete with context, that they can “play back” for troubleshooting and debugging, among other things. This makes event logs particularly useful for detecting emergent and unanticipated behavior in a distributed system, where failures rarely result from a single event in a single component.
A system should record information about what it’s doing at any given time through logs. Hence, logs are possibly the second most significant item in the DevOps team toolbox. In addition, logs provide more detailed information about resources than metrics. If metrics indicate that a resource is no longer operational, logs help you to figure out why.
The key to getting the most out of logs is to keep your collection reasonable by restricting what you gather. Also, where possible, focus on common fields so you can find the needles in the haystack more quickly.
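One common way to get those shared, searchable fields is structured logging: emitting each event as a single JSON object instead of free-form text. A minimal sketch (the field names and helper are illustrative, not a specific logging library’s API):

```python
import json
import time


def log_event(level, message, **fields):
    """Emit one event as a single JSON line with common, searchable fields."""
    record = {"ts": time.time(), "level": level, "msg": message, **fields}
    line = json.dumps(record)
    print(line)
    return line


# Every service tags events with the same field names (service, order_id, ...),
# so one query can follow an order across the whole system.
log_event("ERROR", "payment failed", service="checkout",
          order_id="A-1042", error="card_declined")
```

Because every service uses the same field names, a log aggregator can index them and answer queries like “show all ERROR events for order A-1042.”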
A trace is a representation of a series of causally associated distributed events that encodes a distributed system’s end-to-end request flow. A trace starts when a request enters an application. As user requests move from service to service, a trace makes the behavior and state of the entire system more visible and understandable.
Because a trace carries rich context, it gives you a holistic view of what happened in every part of the system as a request passed through, a view that’s traditionally hidden from the DevOps team. Traces provide vital visibility into an application’s overall health.
Traces also make it possible to profile and monitor systems, such as containerized apps, serverless architectures, and microservices architectures. They are, however, mainly concerned with the application layer and provide only a limited view of the underlying infrastructure’s health. So, even if you collect traces, metrics and logs are still required to gain a complete picture of your environment.
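The mechanics behind a trace can be sketched in a few lines. This is a simplified model, not a real tracing library such as OpenTelemetry: every span in a request shares one trace ID, and each child span records its parent’s span ID, which is what encodes the causal chain between services.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One unit of work in a trace, linked to its caller via parent_id."""
    name: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.time)


def start_trace(name):
    """Root span: created when a request first enters the application."""
    return Span(name=name, trace_id=uuid.uuid4().hex)


def start_child(parent, name):
    """Child span: same trace_id as the parent, causally linked by parent_id."""
    return Span(name=name, trace_id=parent.trace_id, parent_id=parent.span_id)


# A request enters the checkout service, which then calls the payment service.
root = start_trace("GET /checkout")
payment = start_child(root, "payment-service.charge")
```

A tracing backend collects all spans sharing a trace ID and reassembles them, via the parent links, into the end-to-end request tree you see in a trace viewer.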
What DevOps Teams Need
Although the term observability was coined decades ago, its adaptation to software systems provides a new way to think about the software we build. As software and systems have grown more complicated, teams encounter issues that are difficult to predict, debug, or plan for. To troubleshoot issues and build reliable systems, DevOps teams must now be able to continuously gather telemetry in flexible ways that allow them to diagnose problems without first needing to foresee how errors may occur.
While logs, metrics, and traces are important, they aren’t enough to have visibility unless you use them the right way.
To generate an understanding that helps with troubleshooting and performance tuning, observability requires combining this data with rich context. In short, a system is observable if its current state can be determined using only information from its outputs, for every possible evolution of its state and control vectors (in physical systems, this generally corresponds to information obtained by sensors).
This post was written by Samuel James. Samuel is an AWS solutions architect, offering five years of experience building large applications with a focus on PHP, Node.js, and AWS. He works well with Serverless, Docker, Git, Laravel, Symfony, Lumen, and Vue.js.