How LinkedIn modernized their monitoring infrastructure with Checkly
LinkedIn, a professional networking platform, manages a vast and complex infrastructure that supports over a billion members worldwide. Operating at this scale brings unique challenges, from running LinkedIn’s on-premises infrastructure to modernizing away from legacy components.
To navigate these complexities and improve reliability, LinkedIn partnered with Checkly, leveraging its synthetic monitoring and testing capabilities. Let’s explore the intricacies of LinkedIn’s infrastructure, the modernization initiatives underway, and how Checkly’s Monitoring as Code (MaC) plays a crucial role in enhancing system reliability, reducing incident detection times, and supporting LinkedIn's broader technological goals.
50%
More monitoring coverage
76%
Reduced monitoring cost
1 Min.
Mean Time To Detection
The Challenge
Over the years, LinkedIn developed a suite of in-house technologies, including Espresso, Venice, and Kafka. These innovations have been instrumental in supporting LinkedIn’s rapid growth. However, as the company scaled, maintaining and expanding these systems became more complex.
As a result, 60% of incidents were tied to changes made in production. There was no way to ensure that critical user flows remained functional after app or service updates went into production, and LinkedIn largely depended on post-change signals like user-reported bugs and product experience metrics (RUM).
This reactive approach led to delayed issue detection. In one incident, an end-to-end (E2E) synthetic test failure identified a problem a full hour before other user signals started showing up.
Without effective synthetic monitoring, LinkedIn teams relied primarily on logs, metrics, and traces, which make it hard to quickly understand the impact on user experience. Some services have fluctuating error rates that mask underlying issues, so their teams needed a significant error-rate deviation before a problem was detected. This pushed time to detect (TTD) to several hours in some cases, leaving LinkedIn vulnerable to service disruptions and user frustration.
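To illustrate why a fluctuating error rate can hide a real problem, here is a minimal sketch of one common style of metric alerting: flag an anomaly when the current error rate exceeds the historical mean plus a multiple of the standard deviation. This is an assumption for illustration only; the source does not describe how LinkedIn's alert thresholds were computed, and the function names and sample numbers below are hypothetical.

```typescript
// Illustrative only: hypothetical helpers and made-up sample data,
// not LinkedIn's or Checkly's actual alerting logic.

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Population standard deviation of the samples.
function stddev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) * (x - m))));
}

// Alert when the current rate exceeds mean + k standard deviations.
function alertThreshold(history: number[], k: number = 3): number {
  return mean(history) + k * stddev(history);
}

// Error rates (fraction of failed requests) over recent intervals.
const stableHistory = [0.010, 0.011, 0.009, 0.010];
const noisyHistory = [0.01, 0.05, 0.02, 0.06];

// The same degraded error rate is observed on both services.
const currentErrorRate = 0.05;

const stableAlerts = currentErrorRate > alertThreshold(stableHistory); // true
const noisyAlerts = currentErrorRate > alertThreshold(noisyHistory); // false
```

Under these assumed numbers, the service with a steady baseline alerts at a 5% error rate, while the noisy service does not, because its threshold sits far above its usual range. A synthetic check that exercises the user flow directly does not depend on such a deviation.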
A lack of clear ownership of E2E testing made it difficult to manage complex tests that spanned distributed systems, such as those for LinkedIn’s frontend APIs, which have hundreds of downstream services and dependencies. Operations teams used logs, metrics, and traces to pin down flaky errors, but this cost the team hours in detection time when things actually went wrong.
The Requirements
LinkedIn required a fully programmable platform that enabled its engineers to implement and maintain the same monitors that were alerting them in the middle of the night. They needed something that worked within their development process and could leverage the tools they use every day to build and test their applications and services.
LinkedIn needed a tool that could proactively monitor end-to-end scenarios and the uptime of their global apps and services in real time. They had previously tested user journeys only in pre-production and relied on simple uptime monitoring in production.
They wanted to leverage the existing Playwright tests in their CI/CD pipeline to monitor more complex, end-to-end transactions and user flows in production.
As part of their digital transformation, LinkedIn was looking to bring more of their tooling into the development workflow, making it easier for developers to own the operations of their code.
They also needed the ability to run tests from private locations, enabling secure testing within LinkedIn’s internal infrastructure while protecting sensitive systems from exposure to the public internet.
“The flexibility of Checkly made it easier for us to simulate real user behavior and detect failures early in the process, improving our overall monitoring. We’d detect issues within minutes of them becoming prevalent, whereas other solutions took much longer.”
Senior Staff Engineer - SRE
Because of these requirements and their longer-term goal of shifting operational excellence into their development teams, Checkly stood out as the optimal choice.
LinkedIn uses Checkly to run end-to-end Playwright automation on its key services, ensuring critical functions like loading the homepage, creating job postings, and managing ad campaigns work smoothly. Checkly runs these automated tests at regular intervals, allowing LinkedIn to catch and fix issues quickly, before they affect users. When an outage occurs, Checkly alerts LinkedIn’s engineers directly and notifies the on-call engineer.
LinkedIn also uses Checkly's automated SSL certificate monitoring to avoid disruptions from expired certificates. With alerts set 30 days before expiration, LinkedIn keeps its APIs and services secure and uninterrupted.
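The 30-day alert window can be sketched as a simple date comparison. This is a minimal, library-free illustration and not Checkly's implementation; the helper names are hypothetical, and only the 30-day figure comes from the text above.

```typescript
// Hypothetical helper names; only the 30-day window comes from the article.
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days remaining until the certificate's notAfter date
// (negative if the certificate has already expired).
function daysUntilExpiry(notAfter: Date, now: Date): number {
  return Math.floor((notAfter.getTime() - now.getTime()) / MS_PER_DAY);
}

// True once the certificate is inside the alert window (default: 30 days).
function shouldAlertOnCertificate(
  notAfter: Date,
  now: Date,
  alertWindowDays: number = 30
): boolean {
  return daysUntilExpiry(notAfter, now) <= alertWindowDays;
}

// Example: a certificate expiring on 2025-03-01, evaluated on two dates.
const expiry = new Date("2025-03-01T00:00:00Z");

// 59 days left: outside the window, no alert.
const earlyJanuary = shouldAlertOnCertificate(expiry, new Date("2025-01-01T00:00:00Z"));

// 19 days left: inside the window, alert.
const midFebruary = shouldAlertOnCertificate(expiry, new Date("2025-02-10T00:00:00Z"));
```

Alerting on "days remaining" rather than on expiry itself gives the owning team time to renew and roll out the certificate before any user sees an error.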
Additionally, LinkedIn can automate dynamic API checks by combining TypeScript code with programmatic reports. This process allows engineers to handle complex, custom use cases efficiently, ensuring smooth operations for tasks that traditional tools would struggle to manage.
LinkedIn uses Checkly’s private locations to securely monitor internal systems that are not exposed to the public internet, including staging and test servers. They can also group their data centers by region and monitor each region individually, which helps isolate regional issues.
The Results
As LinkedIn continues to evolve and modernize its infrastructure, its partnership with Checkly represents a shift in how the company approaches synthetic monitoring and testing at scale. By enabling faster, more efficient testing and incident detection, Checkly is helping LinkedIn ensure that its platform remains reliable, resilient, and ready for the future.
“It’s not just about testing functionality; Checkly allowed us to test at scale without overloading our resources, something that was critical for our infrastructure. What stood out with Checkly was how seamlessly it integrated with our systems and enabled a proactive approach to monitoring.”
With Checkly’s automated tests running at regular intervals, LinkedIn can now catch issues almost immediately, well before a change is rolled out across multiple data center facilities. Integrating Checkly into the canary validation test suite of LinkedIn’s Continuous Deployment (CD) pipeline allows even earlier detection, before code is fully released.
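As a rough sketch of what a canary validation gate might look like, the snippet below promotes a canary only when no critical check has failed. The types, function, and check names are hypothetical and are not Checkly's API or LinkedIn's pipeline; they only illustrate the idea of using synthetic check results as a release gate.

```typescript
// Hypothetical shapes for illustration; not Checkly's API.
interface CheckResult {
  name: string; // e.g. "homepage-loads"
  passed: boolean;
  critical: boolean; // whether a failure should block the rollout
}

type CanaryDecision =
  | { promote: true }
  | { promote: false; blockedBy: string[] };

// Promote the canary only if no critical check failed.
function evaluateCanary(results: CheckResult[]): CanaryDecision {
  const blockedBy = results
    .filter((r) => r.critical && !r.passed)
    .map((r) => r.name);
  return blockedBy.length === 0
    ? { promote: true }
    : { promote: false, blockedBy };
}

// Example run against the canary: one critical flow fails,
// one non-critical flow fails, one critical flow passes.
const decision = evaluateCanary([
  { name: "homepage-loads", passed: true, critical: true },
  { name: "create-job-posting", passed: false, critical: true },
  { name: "ad-campaign-report", passed: false, critical: false },
]);
// decision is { promote: false, blockedBy: ["create-job-posting"] }
```

Separating critical from non-critical checks is one possible design choice: it keeps a flaky, low-impact check from blocking every release while still stopping a rollout when a key user flow breaks.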
By empowering engineers to take responsibility for their monitoring processes, Checkly is playing a key role in the end goal of shifting E2E test ownership at LinkedIn from the SRE team to individual engineering teams. Even for non-native services like MySQL, LinkedIn can create specific checks, expanding testing beyond front-end and public APIs to critical infrastructure like internal Kubernetes clusters.
“Checkly enabled us to move from an SRE-driven model to team-based ownership, so teams are now more accountable for their monitors in production.”
Checkly’s blend of synthetic monitoring and testing, once rare, is now a game-changer for LinkedIn. It ensures comprehensive testing at every stage, maintaining system integrity from front-end experience to back-end data flow during new feature rollouts. Targeted end-to-end tests during canary deployments help catch mid-tier and back-end service issues early, preventing broader failures.
"Checkly allowed our test coverage to grow by a large margin, which helped us uncover issues that we otherwise couldn’t catch."
LinkedIn achieved significant cost savings and efficiency by adopting Checkly’s streamlined approach. By eliminating redundant tests across 32 locations and focusing on key functionality, LinkedIn drastically reduced testing expenses. Checkly’s usage-based pricing reduced costs by 76% with 51% more tests. Checkly’s Monitoring as Code enabled tests to be deployed about 99% faster, boosting agility and operational efficiency.
“It became significantly quicker to set up a test, and with monitoring as code, we were able to increase our coverage efficiently. This streamlined process allowed us to cut down on redundant tests and focus on key functionality, all while maintaining cost-effectiveness.”