We all know that Checkly is a ‘secret weapon’ for engineering teams who want to shorten their mean time to detection (MTTD). With Checkly, you can know within minutes if your service is unavailable to users or behaving unexpectedly. In this article we’ll talk about how Checkly Traces can help you expand on those benefits, adding insights that help you diagnose root causes and further reduce your mean time to resolution (MTTR) for outages and other incidents.
Why synthetic monitoring is critical to maintaining uptime
Your users don’t care about the individual components of your service; they just want it to work. That’s why frontend monitoring is critical. Frontend monitoring isn’t just about checking how your page composition, design, and JavaScript are running: for those components to load correctly, they rely on backend services. Frontend monitoring means testing your whole service, top to bottom, in the ways your real users will. Set up correctly, and using modern tools like Playwright, frontend monitoring with synthetic users (synthetic monitoring) can simulate full user paths through your application, revealing everything down to integration problems with third-party services.
After detection, diagnosis
While synthetic monitoring is key to reducing your MTTD, detection is only the first part of the timeline for actually resolving issues for your users.
Let’s talk about how backend data can help you cut another large chunk of time from your MTTR.
Integrating backend data
Synthetic monitoring with Checkly helps us answer the basic question:
“Is our service working for our users?”
And while that’s the most important question, and the building block of our time to detection, to get the problem fixed we have to answer the next one:
“If we’re not working, what’s going wrong?”
When it comes to answering the ‘why,’ it would be ideal to gather data on backend performance and correlate it with what our synthetic monitors have seen.
This is where Checkly Traces comes in: an efficient, high-resolution view into your application’s backend, capturing trace information every time a request from the Checkly service kicks off a backend transaction.
Checkly Traces: how it works
Checkly Traces lets you send OpenTelemetry traces to Checkly’s OpenTelemetry collector.
With filter settings, you’ll only send trace data for requests that were kicked off by Checkly’s synthetic monitoring checks, giving you a light, efficient way to get deep insights into your service’s performance.
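To make that concrete, here’s a minimal sketch of what that filtering can look like in a Node.js backend instrumented with the OpenTelemetry JavaScript SDK. It assumes that Checkly marks traffic from its checks with an entry in the propagated W3C tracestate header; the `checkly` key used here is an illustrative assumption, so follow the Checkly Traces docs for the exact configuration to use.

```ts
// checkly-only-sampler.ts -- hypothetical file name
import { Attributes, Context, Link, SpanKind, trace } from '@opentelemetry/api';
import { Sampler, SamplingDecision, SamplingResult } from '@opentelemetry/sdk-trace-base';

// Sketch of a sampler that only records spans whose trace was started by a
// Checkly check, identified here by a `checkly` entry in the propagated
// tracestate. The key name is an assumption for illustration.
export class ChecklyOnlySampler implements Sampler {
  shouldSample(
    context: Context,
    _traceId: string,
    _spanName: string,
    _spanKind: SpanKind,
    _attributes: Attributes,
    _links: Link[]
  ): SamplingResult {
    // The parent span context carries the tracestate propagated by the caller.
    const parent = trace.getSpanContext(context);
    const isChecklyTrace = parent?.traceState?.get('checkly') !== undefined;

    return {
      decision: isChecklyTrace
        ? SamplingDecision.RECORD_AND_SAMPLED
        : SamplingDecision.NOT_RECORD,
    };
  }

  toString(): string {
    return 'ChecklyOnlySampler';
  }
}
```

Because everything else is dropped at the sampling stage, the filter keeps the exported data small without touching your application code.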
How to get started
If you’re already using OpenTelemetry on your backend services, sending relevant traces to Checkly is as easy as editing some configuration. If you haven’t tried OpenTelemetry yet, you can still get started in just a few minutes. Full documentation for either path is on our Checkly Traces documentation site.
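For a Node.js backend, that configuration could look roughly like the sketch below. The environment variable names are placeholders (use the endpoint and API key shown in your Checkly account and the Traces docs), and `ChecklyOnlySampler` is the illustrative sampler sketched above.

```ts
// tracing.ts -- hypothetical bootstrap file, loaded before your app starts.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { ChecklyOnlySampler } from './checkly-only-sampler';

const sdk = new NodeSDK({
  serviceName: 'my-backend-service', // placeholder service name
  traceExporter: new OTLPTraceExporter({
    // Placeholder env vars: point these at the OTLP endpoint and API key
    // shown in your Checkly account.
    url: process.env.CHECKLY_OTEL_ENDPOINT,
    headers: { authorization: process.env.CHECKLY_OTEL_API_KEY ?? '' },
  }),
  // Only export traces that were started by a Checkly check.
  sampler: new ChecklyOnlySampler(),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

If you already run OpenTelemetry, the only new pieces are the exporter destination and the sampler; the rest of your setup stays as it is.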
A key benefit of Checkly Traces is that the trace data sent to Checkly is highly focused and filtered: since only traces related to Checkly checks are sent, there’s no significant network or storage overhead for adding deep insights to your Checkly dashboard.
Accelerate your time to resolution with Checkly Traces
A few years back I was the on-call engineer for the SRE team when a late-night outage affected a large number of customers. The issue was easy enough to detect, since affected users couldn’t use most of the app, but it proved very difficult to find the root cause. While digging through logs, the SREs found many failures related to Postgres: various versions of ‘Postgres not available,’ ‘export to Postgres not found,’ and so on. The SRE in charge asked me if some recent updates might have introduced a Postgres dependency. I checked with the team lead who had last pushed out updates, and they said no, definitely not. So we stayed up for several hours in the middle of the night trying to hunt down the problem, finally settled on rolling back all recent updates, and came back the next day to try to find the root cause.
In the end it turned out that a new engineer, having heard that a Postgres service was available for a separate microservice, had thought he could add Postgres dependencies to his own service. The release had passed all pre-deployment checks because the database wasn’t used on every request.
Imagine this scenario if we’d been using Checkly. First off, our users wouldn’t have had to report the outage, meaning our time to detection would have been much shorter. That alone would have saved 20 minutes, but the hours of worry and headache after detection could have been prevented with Checkly Traces.
With traces enhanced by backend instrumentation, the services each trace relied on are easy to read.
The failed Postgres dependencies would have stood out like a sore thumb in the Checkly trace viewer. That would mean no digging through logs to find our first clues, and no back-and-forth with the dev teams about how these requests were really being handled.
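To show what that would look like in code, here’s a hedged sketch of a manually instrumented Postgres call in a Node.js service; the service, span, and query names are invented for illustration, and in practice OpenTelemetry’s auto-instrumentation for the `pg` client would capture most of this automatically. The point is that a failing database call ends up as an error span inside the trace, exactly where the trace viewer surfaces it.

```ts
import { trace, SpanStatusCode } from '@opentelemetry/api';
import { Pool } from 'pg';

const pool = new Pool(); // connection settings come from the usual PG* env vars
const tracer = trace.getTracer('billing-service'); // placeholder tracer name

export async function loadInvoices(customerId: string) {
  return tracer.startActiveSpan('postgres.load_invoices', async (span) => {
    try {
      const result = await pool.query(
        'SELECT * FROM invoices WHERE customer_id = $1', // illustrative query
        [customerId]
      );
      return result.rows;
    } catch (err) {
      // A "Postgres not available" failure is recorded on the span itself,
      // so it shows up as a failed span in the trace viewer instead of
      // being buried in the logs.
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```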
Instead of our CEO having to write an apology post, we could have gotten to bed at a reasonable hour.
Using Checkly to share insights
One of the challenges of incident response is simply communicating what’s happening to everyone on the team. It can be a struggle both to show that the problem is real and to capture what we know about the issue. But Checkly’s dashboards and trace viewer give your whole team a clear, concise view of the key details:
- When the failure was detected
- How broad the problem is (you can see which geographic regions are affected)
- Details from the synthetic client like browser tracing and console errors
- With Checkly Traces: backend instrumentation from requests
When using Playwright to simulate a browser, you can even see a video of the failure!
Conclusions: Checkly Traces and your SLA
We all want to maintain our SLA with our users. But just trying hard, staying up late, and wishful thinking won’t ensure that incidents are handled quickly and effectively. Worse, long incident response times hurt goodwill with your users and stress out your team. With Checkly Traces you can go from a short time to detection to a faster time to resolution, all with minimal resource demands on network and storage.
If you’d like to see a demo in action, take a look at a video tour:
And check out our getting started guide. Happy tracing!