Table of contents
- The Need for Observability
- Challenges with Traditional Observability
- Shifting Observability Left
- What Is Observability as Code?
- The Evolution of Infrastructure as Code (IaC) and Observability as Code
- Why You Need Observability as Code
- Meet Monitoring as Code
- How to Get Started With Observability as Code and Checkly
- Observability as Code for Modern Software Development
Traditional monitoring has become insufficient for managing complex systems. Modern infrastructures consist of numerous interconnected services, and simply monitoring individual metrics and logs fails to provide a comprehensive view. This is where observability becomes crucial.
The Need for Observability
As systems become more complex, moving from monolithic applications to distributed microservices and cloud resources, deeper insights are required. Observability collects and analyzes data from various system components, offering a detailed view of the system's health. It helps predict behavior, identify root causes, and optimize performance.
Challenges with Traditional Observability
Traditional observability practices often focus on production environments and are driven by operations teams. This approach leads to several problems:
- Limited Scope: Observing only production environments misses critical insights from staging and testing, leading to unexpected issues when code moves to production.
- Operational Silos: Operations teams handle observability, while developers may lack understanding of the tools, resulting in suboptimal instrumentation and delayed issue resolution.
- Reactive Approach: Addressing issues only after they occur in production can lead to costly downtime and poor user experience.
- Reduced Signal-to-Noise Ratio: Operations teams often have limited insight into how the services they run work internally, so they tend to collect every available piece of data around a service. At scale this is counterproductive: brute-force collection drives up costs (observability vendor fees, cloud data egress, etc.) and, worst of all, floods your data with noise, lowering the signal-to-noise ratio.
Shifting Observability Left
To overcome these issues, observability is shifting left, integrating into early development stages.
Key reasons for shifting left include:
- Proactive Issue Detection: Shifting observability left means incorporating it during the development and testing phases, rather than waiting until production fails. This helps developers catch and resolve issues early, reducing the risk of downtime and performance problems later on.
- Integration with Development Practices: By making observability part of the development process, teams can continuously monitor their applications in the different environments (testing, staging, production), gain real-time insights, and make data-driven decisions. This proactive approach aligns well with modern DevOps practices, fostering better collaboration and efficiency.
- Team Accountability: More teams are accountable for operating their systems in production. They need autonomy in deploying and adapting observability artifacts to their needs, including alerts, dashboards, and Service Level Objectives (SLOs).
- Complexity and Data Management: As IT infrastructures grow more complex and the volume of data increases, traditional reactive methods become less effective. Advanced observability tools can automatically analyze data, detect anomalies, and provide detailed diagnostics, making it easier to manage and maintain system health.
- Business Impact: Enhancing observability early in the development cycle not only improves system reliability but also supports business objectives. It allows IT leaders to focus on critical issues that impact the bottom line, ensuring better overall performance and user experience.
By automating the setup and management of observability tools, Observability as Code (OaC) ensures consistent and efficient monitoring across development stages. This synergy between shifting left and OaC helps teams maintain high standards of performance and reliability in their applications.
Incorporating observability early and automating it through code allows for seamless integration with development workflows, promoting better collaboration and faster issue resolution.
By shifting observability left, we see two major changes:
- Empowering Engineers: Developers control how they instrument their code, ensuring observability is built-in from the start, improving collaboration between development and operations.
- Early Detection: Observing staging and testing environments helps identify and fix issues before they hit production, reducing surprises and enhancing system stability.
What Is Observability as Code?
Observability as Code (OaC) shifts observability left by automating the setup and management of observability tools using code. It simplifies tasks like monitoring, alerting, and dashboard creation to ensure consistent and efficient insights. This approach lets you configure and deploy observability artifacts alongside your cloud resources, extending the principles of Infrastructure as Code (IaC).
Imagine a large company running multiple microservices. Each service has its own complexities and requires monitoring to catch issues before they escalate. With OaC, the company can create standardized, reusable artifacts that set up observability tools for each microservice.
These artifacts can include SLOs, dashboards, browser checks, log-based metrics, notification channels, alerts, etc., providing a comprehensive view of the system's health. When a new microservice is developed, the team can quickly apply these artifacts, ensuring that the new service is monitored just like the existing ones.
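As a rough illustration of what a "standardized, reusable artifact" could look like, here is a minimal TypeScript sketch. The artifact shapes, the `standardArtifacts` function, the default target, and the service names are all hypothetical and not tied to any particular vendor's schema; real tools define their own formats.

```typescript
// Hypothetical artifact shapes, for illustration only.
type Artifact =
  | { kind: 'slo'; service: string; target: number }
  | { kind: 'alert'; service: string; condition: string; channel: string }
  | { kind: 'dashboard'; service: string; panels: string[] };

interface ServiceSpec {
  name: string;
  team: string;
  availabilityTarget?: number; // falls back to an assumed default below
}

// One template applied to every service, so each gets the same baseline.
function standardArtifacts(spec: ServiceSpec): Artifact[] {
  const target = spec.availabilityTarget ?? 0.995; // assumed default
  return [
    { kind: 'slo', service: spec.name, target },
    {
      kind: 'alert',
      service: spec.name,
      condition: `availability < ${target}`,
      channel: `#${spec.team}-alerts`,
    },
    {
      kind: 'dashboard',
      service: spec.name,
      panels: ['latency', 'error rate', 'throughput'],
    },
  ];
}

const services: ServiceSpec[] = [
  { name: 'checkout', team: 'payments', availabilityTarget: 0.999 },
  { name: 'search', team: 'discovery' },
];

// Onboarding a new microservice is one more entry in the list above.
let allArtifacts: Artifact[] = [];
for (const spec of services) {
  allArtifacts = allArtifacts.concat(standardArtifacts(spec));
}
```

Because the template lives in code, changing the baseline (say, adding a panel) updates every service on the next deploy rather than requiring edits in each tool's UI.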
The Evolution of Infrastructure as Code (IaC) and Observability as Code
Over a decade ago, Infrastructure as Code (IaC) revolutionized IT by automating infrastructure setup, making it faster and more consistent. "As code" means managing infrastructure configurations like software code, tracking changes, and applying them consistently.
With modern distributed systems, outages have become more frequent, and identifying the root cause is challenging. Observability helps by using outputs like traces, logs and metrics to understand system states and diagnose problems.
However, traditional operational practices haven't evolved much, leading to inconsistent and overwhelming alert management. Observability as Code automates observability configurations, ensuring consistency and reducing manual effort. This new approach treats observability settings as code, making them easier to manage and audit.
Why You Need Observability as Code
Now, let’s take a look at some of the benefits of Observability as Code:
Automation
Observability as Code automates configuration tasks, ensuring they are applied consistently and rapidly across the entire infrastructure. This method eliminates manual errors and accelerates deployment. It's especially beneficial in dynamic and rapidly evolving environments where manual configurations would be too slow and error-prone. By treating observability configurations like code, teams can maintain high standards of performance and reliability, even as systems scale and change.
Collaboration
Observability as Code ensures that all team members have access to the same configurations and insights. This promotes clear communication and alignment across development, operations, and DevOps teams. Shared tools and practices reduce silos, streamline troubleshooting, and lead to faster issue resolution. By working together, teams can maintain consistent observability standards and innovate more effectively.
Consistency and Reproducibility
By automating observability configurations, OaC ensures that the same setup is applied uniformly across all environments. This eliminates manual errors and discrepancies. Version control tracks changes, making it easy to reproduce and audit configurations. Consistent configurations lead to reliable performance and simplified troubleshooting, as teams can rely on a standardized setup. This approach also supports scalability, ensuring that observability practices grow seamlessly with the infrastructure.
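One way to picture "the same setup applied uniformly across all environments" is a single definition rendered once per environment. The sketch below assumes hypothetical base URLs and a made-up `CheckDefinition` shape; only the environment-specific fields differ between the rendered results.

```typescript
type Environment = 'testing' | 'staging' | 'production';

interface CheckDefinition {
  name: string;
  path: string;          // path probed on each environment's base URL
  maxResponseMs: number; // threshold shared by all environments
}

interface RenderedCheck extends CheckDefinition {
  environment: Environment;
  url: string;
}

// Hypothetical base URLs, for illustration only.
const baseUrls: Record<Environment, string> = {
  testing: 'https://test.example.com',
  staging: 'https://staging.example.com',
  production: 'https://www.example.com',
};

// Render one definition for every environment.
function renderForAll(def: CheckDefinition): RenderedCheck[] {
  const envs: Environment[] = ['testing', 'staging', 'production'];
  return envs.map((environment) => ({
    ...def,
    environment,
    url: baseUrls[environment] + def.path,
  }));
}

const rendered = renderForAll({
  name: 'health',
  path: '/healthz',
  maxResponseMs: 500,
});
```

Since the threshold is defined once, staging and production cannot drift apart unless someone changes the definition itself, and that change is visible in version control.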
Resource Recovery
With Observability as Code, when failures or changes occur, configurations can be quickly restored, minimizing downtime. By leveraging version-controlled backups, teams can easily revert to previous states, ensuring reliable recovery. This consistency enhances system resilience and supports continuous operations, even in dynamic environments.
Flexibility
Defining observability as code allows teams to swiftly adjust and update observability configurations to meet changing needs. Code-based setups facilitate rapid modifications across different environments, ensuring observability remains effective. This adaptability supports evolving infrastructure, helping maintain optimal performance.
Scalability
As infrastructure grows, OaC ensures observability configurations can scale accordingly. This approach maintains effective monitoring and diagnostics across expanding environments. By adapting to increased demands, OaC helps sustain system performance and reliability as the infrastructure evolves.
Security
Observability as Code boosts security by managing configurations through code. This ensures that security best practices are uniformly applied across all environments. Automated reviews and version control help detect and address vulnerabilities early. Treating observability settings as code enables thorough auditing and compliance, significantly reducing security risks.
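An automated review can be as simple as a script that runs in CI and rejects risky configuration before it is merged. The following sketch flags alert channels whose webhook URL is written out literally instead of referencing a secret. The `ChannelConfig` shape and the `{{NAME}}` placeholder syntax are assumptions made for this example, not a convention of any specific tool.

```typescript
interface ChannelConfig {
  name: string;
  // Either a literal URL or a placeholder such as "{{SLACK_WEBHOOK_URL}}".
  webhookUrl: string;
}

// Assumed placeholder syntax for this sketch: "{{UPPER_SNAKE_CASE}}".
const PLACEHOLDER = /^\{\{[A-Z0-9_]+\}\}$/;

// Return one finding per channel that embeds its webhook URL directly.
function findInlineSecrets(channels: ChannelConfig[]): string[] {
  return channels
    .filter((c) => !PLACEHOLDER.test(c.webhookUrl))
    .map((c) => `${c.name}: webhook URL is hard-coded; reference a secret instead`);
}

const findings = findInlineSecrets([
  { name: 'slack-oncall', webhookUrl: '{{SLACK_WEBHOOK_URL}}' },
  { name: 'legacy-hook', webhookUrl: 'https://hooks.example.com/abc123' },
]);
```

A CI job could fail the build whenever `findings` is non-empty, which keeps credentials out of the repository history.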
Speed of Deployment
Automated configurations allow for rapid setup and updates of observability tools. This acceleration minimizes the time needed to deploy new monitoring solutions or make changes. By streamlining the deployment process, OaC ensures that observability keeps pace with fast development cycles and quick infrastructure changes.
Version Control
By managing observability configurations with version control systems (like Git), teams can track changes, roll back to previous versions, and maintain a clear history of modifications. This ensures consistency and accountability, as every change is documented and can be reviewed. OaC's version control capabilities enhance transparency and facilitate collaborative troubleshooting and auditing.
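Git already provides line-level diffs, but a key-level comparison of two configuration versions can make reviews and audits easier to read. This is a minimal sketch under the assumption that a configuration is a flat key-value object; the `Config` type, the `diffConfig` function, and the sample keys are hypothetical.

```typescript
type Value = string | number | boolean;
type Config = Record<string, Value>;

interface Change {
  key: string;
  before: Value | undefined; // undefined means the key was added
  after: Value | undefined;  // undefined means the key was removed
}

// List every key whose value differs between two versions, sorted by key.
function diffConfig(previous: Config, current: Config): Change[] {
  const keys = Object.keys(previous).concat(
    Object.keys(current).filter((k) => !(k in previous))
  );
  return keys
    .filter((k) => previous[k] !== current[k])
    .sort()
    .map((k) => ({ key: k, before: previous[k], after: current[k] }));
}

const v1: Config = { frequencyMinutes: 10, alertOnDegraded: false, region: 'eu-west-1' };
const v2: Config = { frequencyMinutes: 5, alertOnDegraded: false, retries: 2 };

const changes = diffConfig(v1, v2);
```

Here `changes` contains three entries: `frequencyMinutes` changed from 10 to 5, `region` was removed, and `retries` was added. Rolling back amounts to redeploying the earlier version.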
Meet Monitoring as Code
The true power of Observability as Code lies in how it brings together the various elements of system health monitoring. By integrating Monitoring as Code (MaC) into the broader OaC strategy, teams can ensure that their observability efforts are both comprehensive and consistent.
Monitoring as Code has risen as one of the hottest trends in observability, allowing you to define monitoring rules, alerts, and dashboards as code.
These are the key elements of Monitoring as Code:
- Configuration Files: These files (check out Checkly’s CLI) specify monitoring parameters, thresholds, and alerts.
- Version Control: Configuration files are stored in version control systems, enabling change tracking, collaboration, and historical analysis.
- Automation: Essential to MaC, automation tooling deploys and updates monitoring configurations across different environments, ensuring efficient and consistent observability.
As infrastructure expands, Monitoring as Code (MaC) scales programmatically to cover new services and systems. It promotes collaboration among development, operations, and QA teams by storing monitoring configurations as code, making it easier for everyone to contribute. MaC also automates repetitive tasks, reducing manual effort and errors, and freeing up resources for other critical functions.
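To make the idea of a configuration file that "specifies monitoring parameters, thresholds, and alerts" concrete, here is a small tool-agnostic sketch. The `MonitorConfig` shape, the threshold values, and the function names are hypothetical; an actual MaC tool such as the Checkly CLI has its own configuration format.

```typescript
type Severity = 'ok' | 'degraded' | 'failed';

interface MonitorConfig {
  name: string;
  degradedAfterMs: number; // response time at which the check counts as degraded
  failedAfterMs: number;   // response time at which the check counts as failed
  alertOn: Severity[];     // severities that should trigger an alert
}

// The configuration is plain data, so it can be reviewed and versioned.
const homepageMonitor: MonitorConfig = {
  name: 'homepage',
  degradedAfterMs: 1000,
  failedAfterMs: 3000,
  alertOn: ['failed'],
};

// Map a measured response time to a severity using the configured thresholds.
function classify(config: MonitorConfig, responseMs: number): Severity {
  if (responseMs >= config.failedAfterMs) return 'failed';
  if (responseMs >= config.degradedAfterMs) return 'degraded';
  return 'ok';
}

// Decide whether a measurement should notify anyone.
function shouldAlert(config: MonitorConfig, responseMs: number): boolean {
  return config.alertOn.indexOf(classify(config, responseMs)) !== -1;
}
```

With this configuration, a 1500 ms response is classified as degraded but does not alert, while a 3500 ms response is classified as failed and does. Tightening the policy is a one-line change that goes through code review like any other.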
How to Get Started With Observability as Code and Checkly
With Checkly's CLI workflow, getting started with Observability as Code is straightforward. You can be up and running quickly and start verifying that your critical web apps and sites perform to spec. The Checkly CLI provides two main workflows:
- Coding: Constructs (such as `ApiCheck`, `BrowserCheck`, or `SlackAlertChannel`) that you write in JavaScript/TypeScript. They are deployed to and executed on the Checkly cloud backend.
- Command: The core commands for running your monitoring code. The `test` command runs monitoring checks locally or in continuous integration, while the `deploy` command pushes your monitoring code to the Checkly cloud backend.
Let's jump into the step-by-step process.
Step 1: Setting Up a Checkly Account
Ensure you have an active Checkly account. If not, sign up. The default free Team trial lasts for 14 days. To access advanced features beyond the trial, subscribe to a pricing plan based on your needs.
Step 2: Setting Up Checkly in Repository
With your account ready, open a terminal in your project directory and run the following command:
npm create checkly
This command will bootstrap your repository with the basics needed to start using Checkly MaC in your project.
In your project directory, you will find a folder named “__checks__” containing the following check templates:
|__checks__
|- api.check.ts
|- heartbeat.check.ts
|- homepage.spec.ts
Once this setup is complete, log in to your Checkly account via the CLI using the following command:
npx checkly login
You can choose to log in from the browser or in your terminal. After logging in, you'll be able to update Checkly Checks from your local machine as long as you're connected to the internet.
Step 3: Writing Your Monitoring Scripts
In your development environment, write JavaScript/TypeScript tests for your code updates, similar to unit tests. We typically use the Playwright testing framework in the `.spec.ts` or `.check.ts` file.
Consider a scenario where you want to monitor the title of the Checkly documentation and take a screenshot of the page. To do this, replace the code in the `homepage.spec.ts` with the following:
import { test, expect } from '@playwright/test';

test('Checkly Docs', async ({ page }) => {
  const response = await page.goto('https://www.checklyhq.com/docs/browser-checks/');

  // Ensure the page is loaded successfully
  expect(response?.status()).toBeLessThan(400);

  // Check if the page title is as expected
  const pageTitle = await page.title();
  const expectedTitle = 'Introduction to Checkly | Checkly';
  expect(pageTitle).toBe(expectedTitle);

  // Optionally, you can take a screenshot if needed
  await page.screenshot({ path: 'homepage.jpg' });
});
This test uses the `page.goto` method to navigate to the specified URL (`https://www.checklyhq.com/docs/browser-checks/`). The method returns a response object, which is stored in the `response` variable.
Then we use the `expect` function to assert that the HTTP status code of the response is less than 400. This is a way to ensure that the page is loaded successfully without any HTTP errors.
`page.title()` retrieves the title of the page and compares it with the expected title ('Introduction to Checkly | Checkly') using the `expect` function. This ensures that the page title matches the expected value.
Finally, we take a screenshot of the page and save it as 'homepage.jpg'.
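The exact string comparison on the title is strict by design and will fail whenever the page is retitled. If that is stricter than you need, one option is to assert on a stable part of the title instead. The helper below is a hypothetical, framework-independent sketch: `titleMatches` is not part of Playwright or Checkly, and the pattern is only an example.

```typescript
// Compare a page title against either an exact string or a pattern.
function titleMatches(actual: string, expected: string | RegExp): boolean {
  const title = actual.trim();
  return typeof expected === 'string'
    ? title === expected
    : expected.test(title);
}

// Example pattern: any title that ends with the site suffix.
// Avoid the `g` flag here, since global regexes keep state between calls.
const endsWithSiteName = /\| Checkly$/;

const strictResult = titleMatches('Getting started | Checkly', 'Introduction to Checkly | Checkly');
const looseResult = titleMatches('Getting started | Checkly', endsWithSiteName);
```

Inside the test, this could be used as `expect(titleMatches(pageTitle, endsWithSiteName)).toBe(true)`. The trade-off is that a looser match will not notice when the first part of the title changes, so pick whichever behavior you actually want to be alerted on.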
Step 4: Running Test Sessions
Now that we have our test scripts ready, let’s execute them. We can use the Checkly CLI to execute our monitoring pipeline in our staging environment, recording the results for inspection if something fails. Run the following command in the terminal of your project repository to execute the test:
npx checkly test --record
The `--record` flag is optional. Use it when you want to record a test session with Git info, full logging, videos, and traces; recorded sessions can be reviewed in the Checkly web interface.
Here is the result of the test we just executed:
There are also links to the detailed summary of the test at the end of the result in the terminal. Here is an example of the test summary:
As the result shows, the test failed: the page at https://www.checklyhq.com/docs/browser-checks/ is titled “Getting started | Checkly”, not “Introduction to Checkly | Checkly” as the test case expects.
If we update the test case to expect “Getting started | Checkly”, the test passes. Here is the result of the test after correcting the title:
If you check the detailed summary, it should show a passed test too:
Step 5: Deploying Checks
Now that you've reviewed and updated your tests, you can proceed to deploy your MaC workflow and related resources, such as alerts and dashboards. Run the following command in your project terminal to deploy the tests to your Checkly account:
npx checkly deploy
Once the deployment is complete, you'll see a success message in your terminal, indicating that the project has been deployed to your Checkly account.
To verify this, navigate to the home section on the left side of the Checkly UI, and you'll find the project with the name of the test script from your local repository.
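In a CI pipeline, a common arrangement is to run the checks first and deploy only if all of them pass. The sketch below models that gating logic in plain TypeScript. The `runCheck` functions and the `deploy` callback are stand-ins for `npx checkly test` and `npx checkly deploy`; the code does not invoke the CLI, and it is synchronous only to keep the example short.

```typescript
interface CheckResult {
  name: string;
  passed: boolean;
}

type Check = () => CheckResult;

interface PipelineOutcome {
  deployed: boolean;
  failures: string[]; // names of the checks that failed
}

// Run every check, then deploy only when none of them failed.
function testThenDeploy(checks: Check[], deploy: () => void): PipelineOutcome {
  const failures = checks
    .map((runCheck) => runCheck())
    .filter((result) => !result.passed)
    .map((result) => result.name);

  if (failures.length > 0) {
    return { deployed: false, failures };
  }
  deploy();
  return { deployed: true, failures: [] };
}

// Hypothetical checks standing in for real test runs.
const deployLog: string[] = [];
const outcome = testThenDeploy(
  [
    () => ({ name: 'homepage.spec.ts', passed: true }),
    () => ({ name: 'api.check.ts', passed: true }),
  ],
  () => { deployLog.push('deployed'); }
);
```

The same shape maps onto most CI systems: the test step must succeed before the deploy step is allowed to run.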
Step 6: Setting up Alerts
Checkly offers alert services to notify you whenever a check fails. Various alert channels are available, including Slack, SMS, webhook, phone call, email, Opsgenie, PagerDuty, etc.
To set up alerts for your check, go to the specific project, in this case, "homepage.spec.ts." At the top right corner of the code editor, click the "Settings" button. In the revealed side panel, access "Alert Settings" under "Retries & alerting."
Here, configure monitoring parameters according to your needs, including check frequency, locations, retries and alerting. You can also set up your preferred alert channel using the Checkly CLI. Learn more about the alert channel from the official documentation.
With the appropriate alert channels set up, your team no longer needs to check the dashboard regularly. Instead, the right people are notified as soon as a check fails and can react immediately.
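Alert channels can also be defined in the repository next to the checks. The fragment below is a sketch of that, assuming it lives in a Checkly CLI project (for example in the `__checks__` folder) where the `checkly` package is installed. The file name, logical IDs, e-mail address, frequency, and locations are placeholders chosen for illustration, not values from this walkthrough; check the official constructs documentation for the options your CLI version supports.

```typescript
// __checks__/homepage.check.ts (hypothetical file name)
import * as path from 'path'
import { BrowserCheck, EmailAlertChannel, Frequency } from 'checkly/constructs'

// Placeholder address; replace it with a real inbox.
const emailChannel = new EmailAlertChannel('email-alerts-1', {
  address: 'oncall@example.com',
})

// Attach the channel to a browser check that runs the Playwright spec.
new BrowserCheck('homepage-browser-check', {
  name: 'Checkly Docs title',
  frequency: Frequency.EVERY_10M,
  locations: ['us-east-1', 'eu-west-1'],
  alertChannels: [emailChannel],
  code: {
    entrypoint: path.join(__dirname, 'homepage.spec.ts'),
  },
})
```

Running `npx checkly deploy` again would then create or update the alert channel together with the check, so alerting changes go through the same review and version history as the checks themselves.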
Why Monitoring as Code With Checkly?
- Enhanced Efficiency and Reliability: Checkly ensures alerts are for genuine issues only, reducing noise and false positives. This proactive approach helps detect and resolve problems early, reducing Mean Time to Recovery (MTTR) and meeting Service Level Agreements (SLAs), thus securing uptime and client satisfaction.
- Seamless Integration and Automation: Checkly’s code-first workflow integrates smoothly into your stack, defining monitoring as code. This supports shift-left monitoring, allowing immediate issue resolution and ensuring app reliability. Automated setup and maintenance save time and eliminate errors, enhancing efficiency.
- Scalability and Future-Proofing: Checkly’s MaC approach scales monitoring programmatically as infrastructure grows, maintaining effective observability. Integrated with CI/CD pipelines, it ensures consistent monitoring from pre-production to production, supporting dynamic environments.
- Support for a Wide Range of Observability Tools: Checkly supports modern synthetic monitoring and integrates with tools like Playwright, Grafana, Coralogix, and Prometheus, ensuring thorough and accurate observability across all systems.
- Cost-Effective Solution: Checkly offers significant savings, potentially up to 80% compared to legacy providers. An all-inclusive subscription covers features, users, implementation, support, and maintenance, avoiding hidden fees and ensuring precise budget management.
Observability as Code for Modern Software Development
Observability as Code (OaC) is essential for modern software development, providing consistency, reliability, and scalability in monitoring practices.
Automating and standardizing observability configurations enhances efficiency, supports collaboration, and accommodates infrastructure growth.
With tools like Checkly, implementing Monitoring as Code (MaC) is straightforward, offering comprehensive monitoring solutions while reducing costs and manual effort.
Ready to improve your monitoring practices?
Get started with Monitoring as Code and Checkly today. Create your free account.