A Journey Through Time From My Perspective
I've been in the observability market since long before it even had that name. Over the years, the field has undergone a significant transformation, and as someone who has witnessed these changes firsthand, I can attest to how dynamic it has been.
In the early days, it was largely about basic monitoring: tracking system metrics, collecting logs, and firing simple alerts. A typical setup would use Nagios or Zabbix (sometimes both), Cacti for charting system metrics, and lots of grep and vi for searching and reading the log files spilling out of your monolith.
This led, in the mid-2000s, to the founding of various cloud-based SaaS platforms to help engineers do the job: Splunk for logging; AppDynamics, New Relic, and Dynatrace for APM; Zabbix for infrastructure; and later, in the early 2010s, Datadog with a clear infrastructure and data consolidation mindset.
Those products democratized access to the data and offered solutions to manage the various aspects of what Honeycomb would later call Observability. As a fun fact from the early days, I still remember when vendors would refer customers to each other: Datadog to New Relic, New Relic to Sumo Logic, and so on. It wasn’t the all-in-one, billion-dollar platform business it is today. Each vendor had a particular angle: logging, APM, infrastructure monitoring, traces...
The Rise of Cloud and Distributed Systems
However, the rise of cloud-native architectures, microservices, and distributed systems made data volumes explode. Most of the previously mentioned providers evolved into all-in-one platforms to make correlating the different observability pillars (metrics, events, logs, and traces, MELT for short) easier for engineers.
The CNCF was founded and incubated projects to help engineers get the best out of this new cloud revolution and run “systems that are resilient, manageable, and observable.”
The vendors' mission became to consolidate all MELT data in one platform and break the silos between the different engineering teams and functions: network engineers, sysadmins, SREs, developers, and so on. These observability systems and practices became essential for maintaining the health and performance of modern applications as they grew in complexity and broke into smaller pieces orchestrated by Greek captains (Kubernetes, from the Greek for “helmsman”).
The advent of DevOps and Site Reliability Engineering (SRE) further accelerated the adoption of observability practices. Companies worldwide that wanted to move fast recognized the importance of continuous integration and continuous delivery (CI/CD) and the need for real-time visibility into their systems' operations. This shift was marked by a growing emphasis on automation, in which having “everything”-as-code became a vital necessity to cope with the ever-growing sophistication of workloads.
The Unfairness of Coming from the Right
Despite the advancements in observability tooling and practices, both open source and commercial, there remains, now in the early 2020s, a significant disparity in how these tools are leveraged across the different stages of the Software Development Life Cycle (SDLC) and the different engineering functions.
Many vendors over-rotated toward the Ops side of the SDLC (all the way to the right), focusing on post-deployment production problems and troubleshooting. This operations-centric approach often neglects the needs of developers, who require visibility into the impact of their code throughout the development process. They care about what they can actually change, and that is hard to see when looking from the right.
I've met hundreds of developers who, more often than not, lacked the context to understand the information available in observability tools, especially when paged in the middle of the night. There’s just too much data to look at, too much noise, and too much toil to comprehend what’s going on.
This disparity is evident in the way many observability platforms are designed. They often emphasize brute-force data collection and analysis, which can lead to data overload and increased costs without necessarily providing actionable insights for developers.
In most cases, brute force is at the root of a billion-dollar business, so the incentive to increase signal and reduce noise by rationalizing the volume of data ingested just isn't there. And we all know Greek captains can generate an insane amount of data. Do you really need it all? All the time? Most probably not, but that’s the way this was designed.
Bridging the Gap with Monitoring as Code
There’s a latent need to bridge the gap between monitoring and development. Too much has been written in the last two years about “shifting left”. I’m a fierce proponent of shifting left, yet I honestly consider most of that literature a fallacy. Let me explain why.
Shifting left observability clearly looks like the way to go, but in my opinion, it is not economically efficient if you are shifting left while coming from the right.
Well… it works, but it requires deep pockets, as the volumes of data just keep growing exponentially. It reminds me of the lift-and-shift strategies used to move infrastructure to the cloud. Some say AI will solve the problem of too much data, too much noise, lack of context, and self-remediation.
But I personally think AI is still too expensive to run at scale in production; at least the implementations I’ve seen are cost-prohibitive. AI may eventually solve the problem, but I’m skeptical about the short term. And again, I truly believe there is a different, more efficient way of doing things at hand today. It is called Monitoring as Code.
Monitoring as Code (MaC) is a term we at Checkly started using a couple of years ago for the practice of integrating observability directly into the development workflow in search of high signal and low noise. It is a practice built around control and efficiency.
MaC ensures that monitoring is not an afterthought but a fundamental aspect of the SDLC. So, as opposed to the observability platforms founded on the Ops side of the SDLC, MaC is founded on the Dev side and is part of what you do daily as an engineer.
That gives engineers control over what they choose to observe, and how, from the end-user perspective, with levels of efficiency never seen before. Once you understand MaC’s approach, traditional observability vendors will look like incandescent bulbs, while MaC will look like LEDs.
Benefits of Monitoring as Code
By embedding monitoring into the codebase, engineers can define and maintain their monitoring configurations as part of their version control systems. This approach provides the same benefits as Infrastructure as Code, but for monitoring (a minimal sketch follows the list below):
- Consistency and Reproducibility: Monitoring configurations are versioned alongside the application code, ensuring that changes are tracked and can be rolled back if necessary.
- Collaboration: Engineers and operations teams can collaborate more effectively, as monitoring configurations are treated as code and reviewed through pull requests.
- Automation: Automated checks and alerts can be integrated into CI/CD pipelines, providing real-time feedback on the application's health and performance before it reaches production.
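To make this concrete, here is a minimal sketch of a check defined as code, loosely based on the TypeScript constructs of the Checkly CLI. The endpoint URL, logical ID, frequency, and thresholds are illustrative assumptions, and exact construct names and options may differ between CLI versions:

```ts
// An API check that lives next to the application code, is reviewed in
// pull requests, and is versioned like any other source file.
// The URL, frequency, and thresholds below are illustrative assumptions.
import { ApiCheck, AssertionBuilder, Frequency } from 'checkly/constructs'

new ApiCheck('products-api', {
  name: 'Products API returns 200',
  frequency: Frequency.EVERY_5M,
  request: {
    method: 'GET',
    url: 'https://api.example.com/v1/products', // hypothetical endpoint
    assertions: [AssertionBuilder.statusCode().equals(200)],
  },
  degradedResponseTime: 2000, // ms: mark the check as degraded past 2s
  maxResponseTime: 5000, // ms: fail the check outright past 5s
})
```

Because this file sits in the same repository as the application, a change to the API and a change to its monitoring can ship in the same pull request.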
Up until now, customers adopting Monitoring as Code have been extremely efficient at detecting issues much earlier in the SDLC. It has proven to be a very simple, well-understood approach for developers to know whether their APIs or browser applications work as expected.
Observability as Code: Detect and Answer—See What Things Are Breaking, 10x Faster
The challenge we faced at Checkly with MaC until last week was that we were super fast at detecting issues, but we couldn’t tell users why things broke. With the launch of Checkly Traces, that paradigm changed completely. We can now be the first to tell engineers what breaks and why, at lightning speed. We were glad our approach not only appealed to engineers but also caught the investment community's attention and helped us secure our Series B round.
When used contextually with synthetic checks and combined with Playwright in a Monitoring as Code environment, Traces are extremely powerful. They help engineers understand what happened in the backend while their code was executing and a failure or increased response time occurred, quickly pointing to the culprit.
All without the hassle of having to be an observability expert, learn a new query language, or dig through a pile of logs. Just like we like it: simple, efficient, and high signal.
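For illustration, a synthetic check of this kind can be a plain Playwright test; the URL and selectors here are hypothetical:

```ts
// A hypothetical Playwright test run as a scheduled browser check. When a
// step fails or slows down, a correlated backend trace can point to the
// downstream call that caused it.
import { test, expect } from '@playwright/test'

test('checkout page loads and shows the cart total', async ({ page }) => {
  await page.goto('https://shop.example.com/checkout') // placeholder URL
  await expect(page.getByRole('heading', { name: 'Checkout' })).toBeVisible()
  await expect(page.getByTestId('cart-total')).toBeVisible()
})
```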
Checkly's Traces leverage open standards like OpenTelemetry to ensure interoperability. We support out-of-the-box integrations with platforms such as Grafana, Honeycomb, Coralogix, and New Relic.
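As a rough sketch of what that interoperability looks like in practice, this is a minimal OpenTelemetry setup for a Node.js backend exporting traces over OTLP/HTTP. The endpoint URL and auth header are placeholders, not real Checkly configuration; use the values your trace backend documents:

```ts
// Minimal OpenTelemetry setup for a Node.js service, exporting traces over
// OTLP/HTTP. Endpoint and header values are placeholders, not real config.
import { NodeSDK } from '@opentelemetry/sdk-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'https://otel.example.com/v1/traces', // placeholder OTLP endpoint
    headers: { authorization: `Bearer ${process.env.OTEL_API_KEY ?? ''}` },
  }),
  instrumentations: [getNodeAutoInstrumentations()],
})

sdk.start() // auto-instrumented HTTP, DB, etc. spans now flow to the backend
```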
In conclusion, while the observability market is evolving, fundamental challenges remain. By focusing on Monitoring as Code and integrating observability into the development process, Checkly aims to create a more balanced, efficient, and effective approach, empowering teams to deliver high-quality software quickly and reliably.
At Checkly we are already sitting on the left, so I guess we might be one of the few that, instead of shifting left, are shifting right with Observability as Code. Time will tell, but I truly believe this is the way.