Following in the footsteps of our popular DevOps Glossary, we're excited to bring you a fresh glossary dedicated to Site Reliability Engineering (SRE) and its synergies with proactive monitoring. This time, we've not only dug deeper into the SRE world but also reached out to the Reddit community to gather real-life insights and suggestions. Big thanks to everyone who chimed in!
This glossary is more than just definitions: it's a collection of concepts, tools, and practices that SRE folks talk about, use, and refine. Whether you're just starting in SRE or looking to brush up on the latest terms, we've got you covered with a guide that makes complex SRE ideas a bit easier and quicker to grasp. Dive in and explore the key aspects of SRE, with a focus on essentials like monitoring, testing, and observability, which are crucial for keeping systems up and running smoothly.
Admittedly, these terms do include some of the buzzwords we're all tired of hearing. But part of what makes a term a buzzword is that we've lost track of its original meaning. Hopefully this article helps remind us of what these terms are supposed to mean.
Deployment and Release Strategies
- Blue-Green Deployment: A strategy to minimize downtime and risk by maintaining two identical production environments and switching traffic between them.
- Canary Release: Incrementally rolling out changes to a small subset of users, allowing for real-time monitoring and testing.
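To make the canary idea concrete, here is a minimal sketch of how a small, stable subset of users might be selected. The function name `route_to_canary` and the hash-based bucketing are assumptions for illustration, not a description of any particular tool.

```python
import hashlib


def route_to_canary(user_id: str, canary_percent: int) -> bool:
    """Return True if this user should be served by the canary version.

    Hashing the user ID gives each user a stable bucket from 0 to 99, so the
    same user keeps seeing the same version while the rollout percentage grows.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < canary_percent


# Hypothetical user IDs, used only to show the approximate traffic split.
users = [f"user-{i}" for i in range(10_000)]
canary_share = sum(route_to_canary(u, 5) for u in users) / len(users)
print(f"share of users on the canary: {canary_share:.1%}")
```

Because a user's bucket never changes, raising `canary_percent` only adds users to the canary group; nobody flips back and forth between versions during the rollout.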
Kubernetes and Containerization
- Horizontal Pod Autoscaler (HPA): Automatically adjusts Kubernetes pod numbers based on observed metrics, essential for performance and availability monitoring.
- Pod: The smallest deployable unit in Kubernetes, made up of one or more containers that share networking and storage. Pods can be individually monitored to ensure the optimal performance of containerized applications.
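As a rough illustration of how the HPA turns an observed metric into a pod count, the sketch below follows the scaling rule described in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric). The function name, the replica bounds, and the simplified tolerance handling are assumptions made for this example; the real controller also deals with missing metrics, pod readiness, and stabilization windows.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Simplified version of the HPA scaling calculation."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Never scale outside the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# Four pods averaging 90% CPU against a 60% target: scale out.
print(desired_replicas(current_replicas=4, current_metric=90, target_metric=60))
```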
Monitoring and Observability
- Golden Signals (From Reddit): Vital metrics that provide insights into system health and performance, including latency, traffic, errors, and saturation.
- Monitoring as Code: Managing and automating monitoring configurations and alerts through code for consistent and reliable system observations.
- Observability: A comprehensive approach to understanding system states and behaviors through logs, metrics, and traces.
- Service Level Indicator (SLI): Metrics that reflect service performance, foundational for effective monitoring and alerting.
- Service Level Objective (SLO): Targets for SLIs that guide monitoring efforts to ensure service meets or exceeds reliability standards.
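The relationship between an SLI and an SLO is easiest to see with numbers. The sketch below assumes an availability SLI measured as the fraction of successful requests, and computes how much of the error budget (the unreliability the SLO allows) is left. The function names and request counts are made up for illustration.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # No traffic means nothing failed.
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo   # e.g. 0.1% for a 99.9% SLO
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


# Hypothetical month: 1,000,000 requests, 500 of which failed.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
remaining = error_budget_remaining(sli, slo=0.999)
print(f"SLI: {sli:.4%}, error budget remaining: {remaining:.0%}")
```

A negative result means the service has already failed more often than the SLO allows, which is usually the signal to prioritize reliability work over new features.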
Operational Concepts
- Chaos Engineering: Testing system resilience and monitoring effectiveness by introducing controlled disruptions.
- Incident Management: The process of detecting, responding to, and resolving incidents, in which monitoring plays a crucial role in rapid detection and resolution.
- Toil: Manual, repetitive operational work that scales with the service and adds no lasting value. SRE teams aim to identify and reduce it through improved monitoring and automation.
Service Management and Networking
- Service Mesh (From Reddit): A dedicated infrastructure layer for service-to-service communication that enhances observability and network control, offering insights into microservices communications within Kubernetes.
- Traffic Management: Monitoring and managing request distribution to maintain service performance and availability.
Testing and Validation
- End-to-End Testing: Ensures comprehensive workflow functionality, with monitoring confirming all user paths operate as intended.
- Synthetic Monitoring: Uses simulated user interactions to proactively monitor and test application performance and functionality.
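A synthetic check can be as small as a scripted request with a few assertions about the response. The sketch below is one possible shape for such a probe; `run_probe`, `ProbeResult`, and the injected `fetch` callable are hypothetical names, and the fake fetch stands in for a real HTTP client so the example runs without network access.

```python
import time
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class ProbeResult:
    ok: bool
    latency_seconds: float
    detail: str


def run_probe(
    fetch: Callable[[], Tuple[int, str]],
    expected_text: str,
    max_latency_seconds: float = 1.0,
) -> ProbeResult:
    """Run one simulated user interaction and judge the response."""
    start = time.perf_counter()
    try:
        status, body = fetch()
    except Exception as exc:  # Any failure to respond counts as a failed check.
        return ProbeResult(False, time.perf_counter() - start, f"request failed: {exc}")
    latency = time.perf_counter() - start
    if status != 200:
        return ProbeResult(False, latency, f"unexpected status {status}")
    if expected_text not in body:
        return ProbeResult(False, latency, "expected text missing")
    if latency > max_latency_seconds:
        return ProbeResult(False, latency, "response too slow")
    return ProbeResult(True, latency, "ok")


def fake_login_page() -> Tuple[int, str]:
    """Stand-in for a real HTTP call to a login page."""
    return 200, "<h1>Welcome back</h1>"


print(run_probe(fake_login_page, expected_text="Welcome back"))
```

In practice a scheduler would run a probe like this every few minutes from several locations and alert on repeated failures.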
Development and Infrastructure Management
- GitOps: Using Git as the source of truth for declarative infrastructure and application configuration, including monitoring configurations, so that deployment and monitoring stay consistent and reliable. More generally, this can be understood to be ‘kicking off automations based on git actions.’
Infrastructure and Configuration
- Immutable Infrastructure: Infrastructure components that are replaced in their entirety upon update, rather than being changed or upgraded in place. Because no state information is carried over between updates, this concept is crucial for ensuring consistency and reliability in deployment processes.
- Configuration as Code (CaC): The practice of managing application and system configuration through version-controlled, machine-readable definition files, rather than manual processes or interactive configuration tools.
Performance and Optimization
- Load Balancing: Distributing network or application traffic across multiple servers to ensure reliability and performance.
- Auto-scaling: Automatically adjusting the number of computing resources in response to service demand.
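Load balancing comes in many flavors; the simplest is round robin, which hands each new request to the next server in the list. The sketch below is a minimal illustration that also skips servers marked unhealthy. The class name `RoundRobinBalancer` and the backend names are invented for this example.

```python
from typing import List


class RoundRobinBalancer:
    """Hand out backends in rotation, skipping any marked as down."""

    def __init__(self, backends: List[str]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._next = 0

    def mark_down(self, backend: str) -> None:
        self._healthy.discard(backend)

    def mark_up(self, backend: str) -> None:
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self) -> str:
        # Try each backend at most once per call, starting from the cursor.
        for _ in range(len(self._backends)):
            candidate = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
print([balancer.pick() for _ in range(4)])
balancer.mark_down("app-2")  # e.g. after a failed health check
print([balancer.pick() for _ in range(3)])
```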
Security and Compliance
- Security Policy as Code: Defining and managing security policies through code to automate enforcement and ensure compliance across the infrastructure.
- Compliance as Code: Embedding compliance and governance controls into the codebase to automate and streamline compliance checks and audits.
Collaboration and Culture
- Incident Command System (ICS): A standardized approach to the command, control, and coordination of emergency response, which can be adapted for managing IT incidents.
- Psychological Safety: Fostering an environment where team members feel safe to take risks and be vulnerable in front of each other, which is crucial for effective postmortems and learning from failures.
Advanced Kubernetes Concepts
- Operator Pattern: Extending Kubernetes' capabilities with custom resources and controllers designed to manage specific applications or configurations.
- Service Discovery: Automatically detecting services in a network, allowing applications and microservices to locate each other and communicate.
Additional Observability and Monitoring
- Anomaly Detection: Using automated monitoring tools to identify patterns or behaviors that deviate from the established norm, which can indicate potential issues.
- Tracing: Following a transaction or workflow through various stages of a distributed system to understand system behavior and identify potential bottlenecks.
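One of the simplest forms of anomaly detection compares a new measurement against a recent baseline and flags it when it sits too many standard deviations away from the mean. The sketch below assumes that approach; the function name, the threshold of 3, and the latency values are illustrative only, and production tools typically use more robust methods that account for trends and seasonality.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(baseline: Sequence[float], value: float, threshold: float = 3.0) -> bool:
    """Flag `value` if it is more than `threshold` standard deviations from the baseline mean."""
    if len(baseline) < 2:
        raise ValueError("baseline needs at least two data points")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical request latencies in milliseconds from the last few minutes.
recent_latencies = [100, 102, 98, 101, 99, 103, 97, 100, 101]
for new_value in (102, 450):
    print(new_value, "anomalous" if is_anomalous(recent_latencies, new_value) else "normal")
```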
Scalability and Reliability
- Failover: The process of switching to a redundant or standby system, server, network, or component when the currently active one fails.
- Disaster Recovery (DR): Strategies and processes for quickly reestablishing access to applications, data, and IT resources after an outage.
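At the application level, failover can be sketched as "try the primary, and if it fails, try the standby." The example below assumes each endpoint is represented by a callable; `call_with_failover` and the two fake endpoints are hypothetical names for illustration, and real systems usually add health checks, timeouts, and retry limits on top.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(endpoints: Sequence[Callable[[], T]]) -> T:
    """Call each endpoint in priority order and return the first success."""
    last_error: Optional[Exception] = None
    for endpoint in endpoints:
        try:
            return endpoint()
        except Exception as exc:  # Treat any error as a reason to fail over.
            last_error = exc
    raise RuntimeError("all endpoints failed") from last_error


def primary() -> str:
    """Stand-in for a primary database or service that is currently down."""
    raise ConnectionError("primary is unreachable")


def standby() -> str:
    """Stand-in for the redundant system that takes over."""
    return "response from standby"


print(call_with_failover([primary, standby]))
```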
Core Principles
- Site Reliability Engineering (SRE): A discipline that applies software engineering practices to operations work, incorporating rigorous monitoring and observability to uphold system reliability and adapt to changing operational conditions.