Checkly enables engineers to automate the monitoring of their production services. Using the automation framework Playwright, you can run an end-to-end test on a regular cadence to make sure every feature is working for your users. But once you’ve set up your check, whether with Playwright scripting, a Terraform template, or an OpenAPI spec, the question becomes how frequently you should run it. Should you be checking every few minutes, or every hour? Surely you have other ways to detect problems, so maybe a daily check of your API and main page would be enough, right? While there’s no single right answer for everyone, this post breaks down how you can find the right cadence for your site checks.
A key metric for any Operations or IT team is the mean time to detect (MTTD) failures and outages. Minimizing the time it takes to detect issues makes it that much easier to identify root causes and fix the problem. The result is a lower mean time to resolution (the all-important MTTR), which helps your entire team meet its service level agreements.
In general, the process starts with defining your SLA, then looking at your past incidents to determine a time budget for each incident. After identifying an optimal time to detection, you can tune your Checkly settings to make sure alerts reach you in time without producing unnecessary noise.
How to decide on a frequency for automated site checks
The story begins with your SLA
The story of check frequency begins with your service level agreement with your users. Even if you don’t have an explicit contract with your users about uptime, you’ll still have expectations for how much downtime is acceptable. If you’re aiming for 99% uptime, you have about 7 hours of acceptable downtime in a month before you break SLA. If you want three nines (99.9% uptime), you can only afford 43 minutes of downtime per month. How well you meet this SLA depends on the overall reliability of your service (how many failures you have) and your mean time to resolve those failures. And the first, and often largest, chunk of the time to resolve an issue is the time it takes to detect it.
At 99% uptime with very few failures a month, it may be safe to rely on users to report problems on your service. With just 1-2 failures per month, you’ll be giving your users an hour to report the problem and still have a couple hours to get it fixed before breaking SLA.
However, if you’re shooting for 99.9% uptime, and have more than two failures per month, now you only have a few minutes to handle each incident. In that case, there’s no way users can report the problem before you’ve already broken SLA.
In order to meet uptime expectations above 99%, we’ll need a monitoring tool that detects outages before our users report them. At Checkly, it’s no problem to accurately simulate your users with our synthetic monitoring. Playwright makes it easy to script complex user journeys through your application, so when your tests pass you know everything works as expected for your users. With the right check frequency, we can get our mean time to detect down to minutes or even seconds.
Look at your rate of incidents to determine check frequency
If you know how many failures your system experiences per month, it’s pretty easy to know what rate your automated checks should happen. The calculation looks like this:
(Allowed downtime percentage * month length) / incidents per month = time budget for each incident
(where the allowed downtime percentage is 100% minus your uptime target)
For example, with 99.9% uptime expected and two incidents per month on average, we’d get a time budget of 22 minutes for each incident. That relatively generous time budget means we may be able to get by only running a check once every five minutes, and still give ourselves 10 minutes or so to activate the right on-call team and deploy a fix.
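If you’d rather see that arithmetic as code, here’s a minimal TypeScript sketch of the same calculation. It assumes a 30-day month, and the function and variable names are just for illustration:

```ts
// Per-incident time budget, following the formula above.
// Assumes a 30-day month for simplicity.

function downtimeBudgetMinutes(uptimeTargetPercent: number, daysInMonth = 30): number {
  const allowedDowntimeFraction = 1 - uptimeTargetPercent / 100;
  return allowedDowntimeFraction * daysInMonth * 24 * 60;
}

function perIncidentBudgetMinutes(uptimeTargetPercent: number, incidentsPerMonth: number): number {
  return downtimeBudgetMinutes(uptimeTargetPercent) / incidentsPerMonth;
}

console.log(downtimeBudgetMinutes(99.9).toFixed(1));       // ~43.2 minutes of allowed downtime per month
console.log(perIncidentBudgetMinutes(99.9, 2).toFixed(1)); // ~21.6 minutes of budget per incident
```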
When everything is going right, failures can be resolved before many users have even noticed a problem, and without breaking SLA.
Small changes can drastically change this calculation: with the same SLA (99.9%) and an average of five outages a month, we only have a budget of about 8 minutes per outage before we start to break SLA. On the timeline above, it takes nearly 10 minutes to start work on a fix, so we’re bound to take too long for each incident. This leads to a key takeaway:
Promising an SLA without setting up monitoring to detect problems in time is setting your team up for failure.
Let’s talk about how the right configuration can help or hinder our effort to meet user expectations.
Retry strategy and MTTD
On the timeline above, there’s a gap of a few minutes between the Checkly check run, which starts about five minutes after the failure, and the point where the failure is actually detected. What’s the cause of this delay? Checks need time to complete, but the other cause is the retry strategy in use. Almost any failure has a small chance of being caused by a one-time network issue or another glitch that won’t repeat, so we do want to perform an immediate retry to make sure the problem can be replicated.
The Checkly retry settings offer highly customizable strategies for verifying issues before notifying your team.
Retry strategies vary from trying again just a few seconds later, to multiple checks over several minutes.
Note: if retries inherently delay alerts, you may wonder why someone would set up an exponential backoff for repeated retries. This setting is extremely useful if your check can set off large back-end processes or delayed jobs. If your test performs a large update across many records and fails, you want to wait a while before trying again; if the update fails a second time, you want to wait even longer. The last thing we want a synthetic monitor to do is kick off a DDoS attack on our own service! Thankfully you can configure the retry strategy on each individual check, so this longer retry timing only needs to be used on selected checks.
Our timeline, including this retry time, will look something like this:
This check is set for retries with a linear strategy, with retries at one minute and two minutes. After an initial failed check run, it takes about four minutes to alert the team of a problem.
At this point, you can start to see why a check every five minutes means that at least 10 minutes will often elapse before the relevant engineering team can actually start working on a fix.
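To see how check frequency and retries stack up into a detection time, here’s a rough TypeScript model. The backoff formulas and the half-interval assumption are simplifications for illustration, not Checkly’s exact scheduling internals:

```ts
// A rough model of time to alert: how long after a failure begins until the
// final retry fails and an alert fires. All numbers are illustrative.

type RetryStrategy = 'linear' | 'exponential';

// Total time spent waiting between retry attempts.
function retryBackoffMinutes(strategy: RetryStrategy, baseMinutes: number, retries: number): number {
  let total = 0;
  for (let attempt = 1; attempt <= retries; attempt++) {
    total += strategy === 'linear'
      ? baseMinutes * attempt              // waits of 1, 2, 3, ... minutes
      : baseMinutes * 2 ** (attempt - 1);  // waits of 1, 2, 4, ... minutes
  }
  return total;
}

function estimatedTimeToAlertMinutes(
  checkIntervalMinutes: number,
  strategy: RetryStrategy,
  baseBackoffMinutes: number,
  retries: number,
  checkDurationMinutes = 0.5,
): number {
  const waitForNextScheduledRun = checkIntervalMinutes / 2; // on average, a failure starts mid-interval
  const attempts = 1 + retries;                             // initial run plus each retry
  return waitForNextScheduledRun
    + attempts * checkDurationMinutes
    + retryBackoffMinutes(strategy, baseBackoffMinutes, retries);
}

// Check every 5 minutes, linear retries after 1 and 2 minutes: ~7 minutes to alert.
console.log(estimatedTimeToAlertMinutes(5, 'linear', 1, 2).toFixed(1));
// The same retry settings with a check every minute: ~5 minutes to alert.
console.log(estimatedTimeToAlertMinutes(1, 'linear', 1, 2).toFixed(1));
```

Whatever the exact numbers, the point stands: the check interval and the retry backoff both sit in front of your alert, so together they have to fit inside the per-incident time budget.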
Increasing frequency and improving check coverage
One question to answer as we explore check frequency is how much our checks overlap. We want a wide variety of checks to simulate variations in user behavior. However, if multiple checks are all fundamentally testing the same components, it may be enough to run only one of them at high frequency while the others run less often to catch edge cases. The benefit of a high-frequency check is a shorter time to detection. Recall that delivering 99.9% uptime with fewer than ten outages a month means we only have 5-10 minutes to resolve each issue. On the timeline above, the appropriate dev team hasn’t even heard there’s a problem until 10+ minutes after a failure. If we increase the check frequency to every minute, the timeline looks much more realistic for getting a failure fixed in time:
With checks running every minute, we are setting the team up for success in meeting our SLA
To make sure that checks are prioritized and that overlapping checks are used to help ensure uptime, consider using tags and groups of checks to keep track of the areas covered by each check.
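As a sketch of what that prioritization might look like, here’s one way to express it with the Checkly CLI constructs. The endpoints, tags, and logical IDs are made up for illustration, and option shapes may differ slightly between CLI versions, so treat this as a starting point rather than a drop-in config:

```ts
import { ApiCheck, AssertionBuilder, Frequency } from 'checkly/constructs'

// The critical check that overlaps with most user paths runs every minute...
new ApiCheck('checkout-api-critical', {
  name: 'Checkout API (critical path)',
  frequency: Frequency.EVERY_1M,
  tags: ['checkout', 'critical'],
  request: {
    method: 'GET',
    url: 'https://api.example.com/health', // hypothetical endpoint
    assertions: [AssertionBuilder.statusCode().equals(200)],
  },
})

// ...while a check covering an edge-case variation of the same flow runs less often.
new ApiCheck('checkout-api-edge-case', {
  name: 'Checkout API (guest checkout edge case)',
  frequency: Frequency.EVERY_10M,
  tags: ['checkout', 'edge-case'],
  request: {
    method: 'GET',
    url: 'https://api.example.com/checkout?guest=true', // hypothetical endpoint
    assertions: [AssertionBuilder.statusCode().equals(200)],
  },
})
```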
Geographic locations and check frequency
I remember well my first year as a developer on an on-call team. There was one incident type that I dreaded: hearing that “no users in [country] can access the site.” Geo-specific issues were often the hardest to diagnose. Worse, since only a subset of our users had experienced the problem, it had likely been going on for some time before detection. With Checkly, you can run checks from multiple geographic locations, so you won’t be caught by surprise when only one region’s users are locked out.
One key decision for check frequency is whether you want round robin or parallel checks. With round robin checks, at each check only a single location will send a check:
A round-robin check set to run every five minutes, from four locations
In this example, the North American location will send a check every 20 minutes, with the other three locations each sending a check at one of the other five-minute intervals. However, if geographically specific failures are common with your service, for example if only European users see a failure, you could wait up to 20 minutes to detect the problem. To make sure that every region runs a check every time, run parallel checks:
Four locations set to run parallel checks every five minutes
When geographically specific failures are a frequent problem, parallel checks drastically decrease your MTTD.
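Here’s a hedged sketch of what a parallel, multi-region check could look like with the Checkly CLI constructs; the URL and location IDs are placeholders, and you should confirm the option names against the docs for your CLI version:

```ts
import { ApiCheck, AssertionBuilder, Frequency, RetryStrategyBuilder } from 'checkly/constructs'

new ApiCheck('homepage-parallel', {
  name: 'Homepage (all regions, in parallel)',
  frequency: Frequency.EVERY_5M,
  locations: ['us-east-1', 'eu-west-1', 'ap-southeast-1', 'sa-east-1'], // placeholder regions
  runParallel: true, // every location runs on every scheduled tick instead of round robin
  retryStrategy: RetryStrategyBuilder.linearStrategy({
    baseBackoffSeconds: 60,
    maxRetries: 2,
    sameRegion: true, // retry from the region that saw the failure
  }),
  request: {
    method: 'GET',
    url: 'https://www.example.com/', // hypothetical URL
    assertions: [AssertionBuilder.statusCode().equals(200)],
  },
})
```

Leaving `runParallel` off gives you the round-robin behavior described above, which uses fewer check runs at the cost of slower detection for region-specific failures.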
Calibrating your check frequency
Working through these ideas, there are some simple steps to help you calibrate how often you should ping your site, and what kind of checks you should use.
- Understand Service SLA and time budget: Set yourself up for success by defining how quickly you realistically need to resolve issues to meet your SLA.
- Define a goal MTTD: With a time budget for each incident, determine how quickly you need to detect issues to give your team time to find a fix.
- Set retry logic to give timely alerts with minimal noise: Retry logic adds to the time it takes to receive an outage alert, so minimize the total time spent on retries, increasing retry timing (or the total number of retries) only when alerts become too noisy.
- Schedule dynamically based on region: Review past issues to see how often outages have affected only some regions. With some intelligent logic on check timing, you can prioritize finding the status of a single region, or spread your checks around so at least one region is always checking site status. Take a look at the Checkly documentation to see some options on how to schedule checks across different regions.
- Balance with Comprehensive Monitoring: Playwright checks are just one aspect of monitoring. They should be complemented with more detailed health checks and performance metrics to provide a holistic view of service health. Checkly can integrate with backend monitoring via OpenTelemetry, and give you a more complete view of performance on every check.
- Refine with Feedback Loops: Use feedback from past incidents to refine your monitoring strategy. If your frequent checks are missing issues or causing unnecessary noise, adjust their frequency and retry settings accordingly.
Conclusions
Choosing to probe at a rate inadequate to defend your SLA is planning to fail.
Remember, the goal is not just to detect issues but to do so in a way that aligns with your operational priorities and SLA. As you refine your approach, leverage Checkly to continuously improve your monitoring strategy. Ultimately, the right cadence for your checks will set you up for success, helping you maintain uptime, meet your SLAs, and deliver a seamless experience for your users.
If you’d like to join a community of engineers trying to work out the right way to ping their site, join the Checkly Slack and say hi!