By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.
Reliability is the best metric for retaining customers. Knowing this, Google spun up Site Reliability Engineering (SRE), a philosophy similar to DevOps (and oftentimes referred to as a subset or sibling of DevOps), that focuses on leveraging aspects of software engineering and applying them to infrastructure and operations problems.
Even today in most traditional on-premises environments, operations management is typically handled by an IT operations team in charge of infrastructure provisioning, capacity management, cost control, performance, and security for all of the organization’s assets. If you’ve ever worked in traditional IT roles or you’ve worked as a developer alongside IT ops team members, you know how difficult it can be to partner with another team to manage an application’s operations. IT ops usually doesn’t have full context of the applications they support, since they are traditionally focused solely on ops and not development. Often, the only way they can gauge the relative importance and value of an application is by how it’s classified in the asset database. Each application service tier usually carries availability requirements, and IT ops folks do their best to meet those service level objectives (SLOs).
When it comes to issues, it’s a total blame game, with teams deflecting responsibility as much as possible. The development team opens a ticket, IT ops investigates and blames it on a bug, and development blames it on IT ops. Meanwhile, the infrastructure is not equipped to provide full visibility into an issue so that teams can arrive at a quick solution, and work is often reactive rather than proactive. Siloed teams often cause rifts in the organization. It’s similar to what occurred at the beginning of the COVID-19 outbreak in 2020, as we watched state and federal governments bicker back and forth about who was to blame for the lack of masks, ventilators, and other medical equipment, all while tens of thousands of people were dying. That may be a brutal analogy, but bickering about responsibility is a distinct possibility in the tech space as well, and as you can imagine, this may affect your service level metrics. CYA (cover your “gluteus maximus”) is alive and well in many IT organizations. In fact, some organizations have turned office politics into an art form. Hopefully, your applications are designed so that an outage will not cause any deaths. For some software teams, however, their applications are literally keeping people alive. Imagine, for example, that your software is designed for a real-time blood glucose monitoring device. What happens if you have a major outage and you are not equipped to resolve reliability issues quickly? What happens to the millions of users who depend on your device to understand and respond to their blood sugar levels? Reliability is important, and it may actually mean life or death for some software teams and their users.
A word of wisdom for everyone: creating an inclusive environment is your job, no matter where you work. Working in silos only inflates a team’s sense of self-worth, as team members start to believe that they “do everything” themselves; in reality, of course, this is not the case. So when you’re engaging with people who need your help, or when you need help from other people, think about how you can create bridges between your teams and not elevate yourself or your team as a KIA (know-it-all). Build bridges, and don’t blow them up!
DevOps and SRE exist to avoid silos. DevOps is the bridge that brings together everyone across the development life cycle, including development, testing, and operations roles, around a shared set of principles. SRE is a dedicated engineering role that is often viewed as an implementation of DevOps with some idiosyncratic extensions. SRE focuses on the “what” and “how” of improving reliability. Site reliability engineers spend half of their time doing operations-related work, working on issues, on-call situations, and manual investigations. The other half of a site reliability engineer’s role is dedicated to developing new features, scaling, or automation that will improve availability and performance requirements for an application’s architecture. DevOps-oriented software teams are typically expected to run highly automated, self-healing systems, and this includes the partnership between the developers, testers, and operations folks. This enables site reliability engineers to perform day-to-day activities and innovate to improve performance and reliability. A fundamental tenet of SRE is the concept of error budgets. Unlike traditional operation models, where the objective is to keep failures as close to zero as possible, a site reliability engineer actually has an error budget. Basically, if your objective is to provide 99.9 percent uptime, then technically you have a 0.1 percent budget to deal with risks. If you are actually running at 100 percent uptime, this tells senior management that you are not taking enough risks to improve IT velocity. That error budget is meant to be spent! So when your development teams complain about having to deploy all these releases and your error budget is not spent, then you, as a good site reliability engineer, encourage and help developers to push the releases through into production.
If these changes cause problems and consume your error budget, you can put the brakes on further changes so that as a team, you stay within your SLOs for the timeframe in question. If everyone in the organization is aligned to meeting these common objectives, then everyone will look for ways to stay within the allocated error budget while trying to rapidly innovate new business value from their applications.
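The error budget arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a Google tool: the function names, the 30-day window, and the 12 minutes of recorded downtime are all assumptions chosen for the example.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability SLO tolerates."""
    total_minutes = window_days * 24 * 60
    budget_fraction = (100.0 - slo_percent) / 100.0
    return total_minutes * budget_fraction


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Return the unspent budget in minutes (negative means the SLO is breached)."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


def can_release(slo_percent: float, downtime_minutes: float,
                window_days: int = 30) -> bool:
    """Hypothetical release gate: keep shipping while budget is left."""
    return budget_remaining(slo_percent, downtime_minutes, window_days) > 0


# A 99.9 percent SLO over an assumed 30-day window leaves about 43.2 minutes.
print(round(error_budget_minutes(99.9), 1))
# With 12 minutes of downtime already recorded, releases can continue.
print(can_release(99.9, downtime_minutes=12.0))
```

Under these assumptions, a team that has used 12 of its roughly 43 minutes still has room to ship, while a team that has used 50 minutes would pause changes until the window rolls over.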
The cloud enables teams to bring visibility into their work by building bridges between their roles to integrate while still being decoupled, to automate where they could not automate before, and to innovate in ways that could never have happened on-premises. For all that to be measurable and actionable, you need to have telemetry data that you can leverage to improve your key performance indicators and overall application reliability. This is commonly referred to as instrumentation.
Cloud Operations was formerly known as the Stackdriver suite. When you take the exam, it’s possible that the product name may appear as Stackdriver instead of Cloud Operations Suite. For example, you may see references to Stackdriver Logging, Stackdriver Monitoring, Stackdriver Trace, and so on.
Cloud Logging
The first element of the Cloud Operations stack is Cloud Logging, a real-time log management and analysis tool that enables you to store, search, analyze, monitor, and alert on log data and events. It allows for ingestion of any custom log data from any source and is a fully managed service. Cloud Logging also natively integrates into Cloud Monitoring (discussed in the next section), so that you can define alerts based on certain metrics you select. Cloud Logging also natively integrates with Amazon Web Services (AWS) and supports a logging agent that is based on the Fluentd data collector and that can run on your virtual machine (VM) instances.
The Cloud Logging agent, based on the Fluentd log data collector, collects logs from user applications and sends them to the Cloud Logging API using Fluentd configuration files. There are many preconfigured Linux and Windows logs, and you can customize your own. Cloud Logging supports many common third-party solutions such as Apache, Chef, Jenkins, MongoDB, Cassandra, MySQL, and more. In Cloud Logging, you typically view your logs in a user interface known as the Logs Viewer, and you use an API to manage your logs programmatically. You can read and write log entries, query your logs, control how your logs are routed, and create exporting sinks and log-based metrics. Log entries are recorded events that are captured from products, services, third-party applications, or even your own applications. The message that a log entry carries is known as its payload, and the collection of your log entries makes up a log; without log entries, there is no log. The Logs Viewer enables you to view and analyze your log data. In the Logs Viewer, you can build queries by using the GUI or by using its query language, and it saves your queries so that you can refer to them in the future.
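To make the relationship between a log, its entries, and their payloads concrete, here is a small standard-library sketch. It does not call the Cloud Logging API; the field names `timestamp`, `severity`, and `jsonPayload` loosely mirror the shape of a Cloud Logging entry, while `make_log_entry`, `checkout_log`, and the order IDs are hypothetical names invented for the example.

```python
import json
from datetime import datetime, timezone


def make_log_entry(message: str, severity: str = "INFO", **fields) -> dict:
    """Build a dict shaped loosely like a log entry with a structured payload."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "severity": severity,
        # The payload is the message the entry carries, plus any extra fields.
        "jsonPayload": {"message": message, **fields},
    }


# A "log" is simply the collection of its entries.
checkout_log = [
    make_log_entry("order placed", order_id="A-100"),
    make_log_entry("payment declined", severity="ERROR", order_id="A-101"),
]

# Querying a log amounts to filtering its entries, here by severity.
errors = [entry for entry in checkout_log if entry["severity"] == "ERROR"]
print(json.dumps(errors[0]["jsonPayload"]))
```

The list comprehension stands in for the kind of filter you would otherwise express in the Logs Viewer query language.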
You don’t have to know everything about the Logs Viewer for the exam, but it’s a good idea to have a very high-level understanding of using it. Log into the Cloud Console and take a look at some sample logs through the Logs Viewer. Get familiar with some very basic navigation and syntax. Where would you search for network logs? For network interface configuration changes to your GCE instances?
In Cloud Logging, your logs are stored in a logging project by default and can have a 400-day or 30-day retention period based on the log type. Some logs are customizable up to a 3,650-day retention period within a logging project. Most organizations, however, use logs in Cloud Storage for long-term retention and use BigQuery for analysis. You can also route your logs to a Cloud Pub/Sub topic, where they can be ingested by any third-party application. A typical use case for Pub/Sub forwarding is integration with a third-party security information and event management (SIEM) platform such as Splunk. Many DevOps teams like to use their own set of tools to analyze and monitor their logs; others use the native Cloud Operations tools. Either way, you must set up your log architecture accordingly. If your logs are routed to any other log storage, whether GCS, BigQuery, or Pub/Sub, the logs are automatically passed through the Cloud Logging API, where they pass through the Logs Router. The Logs Router then looks at each log entry and the rules you’ve set to determine which logs to ingest, which logs to export, and which log entries to discard to save money and ensure efficiency. You can configure your logs to be exported into an appropriate log sink storage destination. For every project, a default log sink routes all your logs into a default log bucket. You can leverage exclusions to create filters and exclude certain types of logs from being stored in Cloud Logging by default to reduce costs and minimize the number of logs you’re storing. Aggregated sinks, which collect logs from all the child resources beneath them, can be set up at the organization or folder level.
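The routing behavior described above can be modeled as a toy simulation. This is a sketch of the idea only, not the Logs Router implementation or the Cloud Logging API: the `Sink` class, the `route` function, the destination strings, and the entry fields (`log_type`, `severity`) are all hypothetical. It assumes each sink is evaluated independently and that an exclusion applies only to the sink it is attached to.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

LogEntry = Dict[str, str]
Filter = Callable[[LogEntry], bool]


@dataclass
class Sink:
    name: str
    destination: str
    include: Filter                      # which entries this sink wants
    exclusions: List[Filter] = field(default_factory=list)

    def accepts(self, entry: LogEntry) -> bool:
        """An entry is routed if it matches the filter and no exclusion."""
        if not self.include(entry):
            return False
        return not any(excluded(entry) for excluded in self.exclusions)


def route(entry: LogEntry, sinks: List[Sink]) -> List[str]:
    """Return every destination that should receive this entry."""
    return [sink.destination for sink in sinks if sink.accepts(entry)]


sinks = [
    # Default sink: keep everything except noisy DEBUG entries to cut cost.
    Sink("default", "logging-bucket:_Default", lambda e: True,
         exclusions=[lambda e: e["severity"] == "DEBUG"]),
    # Forward audit logs to a Pub/Sub topic that a SIEM subscribes to.
    Sink("siem", "pubsub:siem-topic", lambda e: e["log_type"] == "audit"),
    # Archive everything in Cloud Storage for long-term retention.
    Sink("archive", "gcs:log-archive", lambda e: True),
]

debug_entry = {"log_type": "application", "severity": "DEBUG"}
audit_entry = {"log_type": "audit", "severity": "NOTICE"}

print(route(debug_entry, sinks))
print(route(audit_entry, sinks))
```

In this model the DEBUG entry reaches only the archive, while the audit entry fans out to all three destinations, which is the kind of trade-off between cost and coverage you tune with sinks and exclusions.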
The Cloud Logging API also enables ingestion of any custom log data from any data source. Being a fully managed service, it performs at scale and can support massive environments at a phenomenal price-to-performance ratio. All this while still being able to analyze your logs in real time! You can also export data with one click to BigQuery for advanced analytics, and SQL-like querying is incredibly powerful, enabling organizations to run massive queries in little time.
Many compliance frameworks, such as the Health Insurance Portability and Accountability Act (HIPAA) and the Payment Card Industry (PCI) Data Security Standard, impose long-term archival requirements for your logs. If these logs don’t need to be analyzed, think about how you can leverage various storage classes in GCS to store your logs and save money.
Log Types
Many categories of logs are available in Cloud Logging, which receives, indexes, and stores log entries from services, instances running the Cloud Logging agent, other cloud providers, and custom log sources. Logs can be captured at every layer of the resource hierarchy, whether that’s at the organization level, folder level, or project level. Although there are many logging sources, there are a few key ones you should know about at a high level.
Admin Activity Audit Logs
Admin activity audit logs come from Cloud Identity, which operates from admin.google.com rather than from GCP. These contain administrative activity and user activity in the Cloud Identity platform and include things like account creation, deletion, authentication, configuration modifications to your identity provider (IdP), password changes, and so on.
Cloud Audit Logs
Cloud audit logs consist of administrative activity audit logs, data access audit logs, and system event audit logs.
- Administrative activity audit logs: These logs contain log entries for API calls or any other user administrative modifications to configurations or resource metadata. For example, these logs may record when an administrator makes modifications to roles and permissions inside of GCP or spins up a VM.
- Data access audit logs: These logs contain log entries for API calls that read resource configurations, metadata, or read/write user-based API calls. These logs are disabled by default, and they do not log publicly shared resources. Log entries could include reading data within a service or writing data to a service.
- System event audit logs: These logs contain log entries for administrative activity that modifies resource configurations based on activity generated by Google systems and not user-based activity. For example, log entries could indicate that a transparent maintenance event occurred.
Network Logs
Network logs consist of Virtual Private Cloud (VPC) flow logs, Domain Name System (DNS) logs, Cloud network address translation (NAT) logs, and firewall logs.
- VPC flow logs: These logs provide visibility into VPC traffic and capture TCP and UDP traffic, including internal VPC traffic, traffic over network attachments, traffic between servers and Internet endpoints, and traffic between servers and Google APIs.
- DNS logs: These logs record every DNS query received from VM instances and inbound forwarding flows within your networks.
- Cloud NAT logs: These logs provide context into NAT connections and errors.
- Firewall logs: These logs provide connection records for TCP and UDP traffic only. These records contain things like source and destination IPs, protocols, ports, times, and so on.
Access Transparency Logs
These logs show when a Google Cloud Support engineer has accessed parts of your infrastructure and for what purpose. Access Transparency is a critical feature today: it increases trust in Google Cloud by letting organizations know exactly what is going on behind the scenes of their Cloud Platform, with an audit trail to prove it. Google Cloud has consistently held the position that no customer data is ever accessed for any reason aside from fulfilling contractual obligations. With this level of transparency, Google Cloud can provide customers the evidence behind that commitment.
Don’t worry about memorizing the details of every single log type. Just know that they exist. If you see a question about troubleshooting a DNS issue, your answer probably won’t be to review the admin audit logs.
Create Your Own Logs
You can also create your own logs using supported client libraries. To create your own logs, you just need to write the log entries. There is no separate Create operation in Cloud Logging. However, don’t mix this up with identifying Create syntax in your actual logs. Such entries may refer to the point at which a certain resource was created. For instance, if you provision a network attachment to your GCE instance, you can look through the network logs to find the origin of this activity via a Create or Insert log entry.
Cloud Trace, Cloud Profiler, and Cloud Debugger
The application performance management tools included in Cloud Operations combine the capabilities of Cloud Logging and Cloud Monitoring, along with Cloud Trace, Cloud Debugger, and Cloud Profiler to help you reduce latency and cost and run more efficient applications.
Cloud Trace is a distributed tracing service that you can use to collect latency data from your applications and track how requests propagate through your applications. It can provide in-depth latency reports to surface performance issues and works across VMs, containers, or App Engine projects. Tracing refers to latency analysis between your applications or on incoming requests. In App Engine, traces are automatically submitted to Cloud Trace. For other applications, you can leverage the Cloud Trace SDK or API to send latency data for analysis. Think about scenarios where you deploy a new release and you’re getting feedback from your users about longer load times. You can use Cloud Trace to trace exactly where your request latency is higher than normal.
Cloud Profiler is a continuous CPU and heap profiling tool that is used to analyze the performance of CPU or memory-intensive functions you run across an application. While Cloud Trace is focused on latency analysis, Cloud Profiler is able to determine which aspect of your code is causing higher CPU and memory consumption.
Cloud Debugger is a real-time application debugging tool you can use to inspect your running applications and identify the behavior of your code, continuously searching for bugs in a live environment without incurring any performance impacts to your users. It integrates into your existing developer workflows, enabling developers to take snapshots directly from any area of your application and even add new logging statements on demand when you’re doing deep bug identification work.
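To illustrate what tracing means in practice, the following standard-library sketch times nested steps of a request the way a tracer records spans. It is not the Cloud Trace SDK or API; the `span` helper, the `spans` list, and the step names and sleep durations are hypothetical stand-ins for real work.

```python
import time
from contextlib import contextmanager

# Collected (name, duration in milliseconds) pairs, in completion order.
spans = []


@contextmanager
def span(name: str):
    """Time the enclosed block and record it, as a tracer would record a span."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, (time.perf_counter() - start) * 1000.0))


def handle_request() -> None:
    with span("handle_request"):
        with span("query_database"):
            time.sleep(0.05)    # simulated slow dependency
        with span("render_page"):
            time.sleep(0.001)   # simulated fast step


handle_request()

# The slowest child span points at where the request latency comes from.
children = [s for s in spans if s[0] != "handle_request"]
slowest_child = max(children, key=lambda s: s[1])
print(slowest_child[0])
```

A profiler answers a different question: rather than which step of a request was slow, it shows which functions consume the most CPU or memory over time.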
Cloud Monitoring
Cloud Monitoring is a full-stack, fully managed monitoring solution that gives you visibility into the performance, uptime, and overall health of your applications. You can define and gather metrics, events, and metadata from your GCP environment and non-GCP environments through agents, APIs, and partnerships with other third parties. Cloud Monitoring provides rich visualizations and customizable dashboards that help you analyze your data. Because it’s a managed service, you don’t have to worry about managing any infrastructure for the service. You can leverage a plethora of monitoring tools to do a variety of tasks such as gathering metrics, dashboarding, uptime monitoring, and building alerts, but Cloud Monitoring provides all of this functionality with a strong integration into many third-party products, making it a strong contender in the operational monitoring space.
Cloud Monitoring offers both white-box and black-box monitoring techniques. Black-box monitoring enables you to monitor your application as if you were an end user, without having any underlying knowledge of the internal configuration of the service. It provides this via uptime checks. White-box monitoring enables you to monitor all aspects of your service with full underlying knowledge of the internal infrastructure. You can build custom metrics based on certain indicators by using the API or using an open source library like OpenCensus. You can also build log-based metrics from the logs you collect in your Cloud Logging architecture.
When it comes to monitoring your application and environment in the cloud, some of your key goals may be to understand your application health, the load on your application, uptime, and performance of your applications. This requires collecting metrics from various sources, making them easy to view and digest, and generating alerts when metrics don’t meet your desired criteria.
Workspaces
Workspaces are created to organize and manage key monitoring data across projects. Workspaces can manage the monitoring data for one or many projects, but a project cannot be assigned to multiple workspaces. The workspace is created for a host project, which is used as the basis for the workspace that stores all of the configuration items for dashboards, alerting policies, uptime checks, notification channels, and more. If you delete the host project, you delete the workspace along with it.
You can monitor up to 100 projects inside a workspace. Workspaces are not a great solution if you need to build a centralized, organization-wide monitoring and alerting system. The main team that will typically need an organization-wide monitoring solution with centralized metrics will be your security operations teams. That is where a SIEM third-party solution comes into play. Cloud Monitoring is designed for software teams, not security teams, and is similar to New Relic, Datadog, and Splunk. When you create a workspace, you should think about how you can group together applications that share similar metrics and even organize them within teams.
Monitoring Agent
The Cloud Monitoring agent is a collectd-based daemon that gathers system and application metrics from your VMs. The Monitoring agent collects disk, CPU, network, and process metrics by default. You can also configure it to monitor third-party applications. The agent is optional, but it’s recommended that you install it on your instances, because the insights you can gain from having it on your VM instances will enable you to gather a much richer source of data than what’s provided by VMs by default. Also, having the ability to monitor third-party applications eliminates the need to install other third-party tools that do the same thing, making your deployments and configuration management quite a bit simpler.
Uptime Checks
Uptime checks are requests that are sent to a resource to see if it responds. You’d want to leverage uptime checks to monitor the availability of VM instances, App Engine services, public websites, and even AWS load balancers. Similar to a load balancer health check, an uptime check is simply a request that says, “Hey service. Are you alive?” A response may be business as usual, but the event generated when there is a lack of a response can be leveraged to kick off an incident analysis workflow. You might set an alerting policy to trigger an alert if there are three consecutive failures on the uptime check, for example. That may trigger a slew of events, including paging your on-call engineer, automatically generating a ticket with prefilled diagnostic details, failing over to another server or environment, and more.
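The "three consecutive failures" rule can be sketched as follows. This is an illustration of the logic only, not how Cloud Monitoring evaluates uptime checks internally; the function name `first_alert_index` and the sample history are made up for the example.

```python
from typing import Iterable, Optional


def first_alert_index(results: Iterable[bool], threshold: int = 3) -> Optional[int]:
    """Return the index of the check that completes `threshold` consecutive
    failures, or None if that never happens. True means the check passed."""
    consecutive = 0
    for index, passed in enumerate(results):
        # A single success resets the failure streak.
        consecutive = 0 if passed else consecutive + 1
        if consecutive >= threshold:
            return index
    return None


# Two failures followed by a success do not alert; the later run of three does.
history = [True, False, False, True, False, False, False, True]
print(first_alert_index(history))  # prints 6
```

Requiring a streak rather than a single failure is a common way to avoid paging someone for one dropped request.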
Configuring uptime checks is pretty simple, but it’s common to forget to open your firewall rules when doing so. You need to ensure that your firewalls are set up to permit incoming traffic from the uptime-check servers to avoid issues.
Metrics and Alerts
Metrics are a collection of measurements that help you understand how your application and services are performing. More than 1,500 types of metrics are available by default in Cloud Monitoring, including metrics for Google Cloud, AWS, and third-party software. Metrics could include things like latency of requests to a service, amount of disk space on a machine, and number of tables in your SQL database. Metric metadata will typically contain details about the source of the measurement, timestamps, and details about the exact values of the measurement. This is a pretty simple concept, so we don’t need to spend too much time on it, but the “TLDR” (too long, didn’t read) is that you have many predefined metrics and the ability to generate your own custom metrics at the platform, application, and service levels. You use these metrics to gather key performance indicators for things you’re looking to measure. Think about availability requirements for your application, performance requirements, using metrics to track bottlenecks in your application and optimize the performance of code, and so on. When you’re trying to figure out how to break these down into actionable metrics, start with defining your SLOs and what your requirements are. If your SLA has an availability SLO of 99.95 percent, what metrics will help you understand your system’s availability? What are the service level indicators (SLIs) telling you? Is your system at risk? In summary, the key here is to capture a number of SMART (Specific, Measurable, Attainable, Relevant, Timed) metrics that map to service indicators that demonstrate operational success or failure. These indicators then validate that you are meeting the business objectives of your organization.
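As a small worked example of the 99.95 percent question above, one common availability SLI is the share of requests that succeed. The sketch below assumes that definition; the function name and the request counts are invented for illustration.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI as the percentage of requests served successfully."""
    if total_requests == 0:
        return 100.0  # no traffic means nothing failed
    return 100.0 * good_requests / total_requests


SLO_PERCENT = 99.95

# Hypothetical measurements for two windows of one million requests each.
healthy_sli = availability_sli(good_requests=999_700, total_requests=1_000_000)
at_risk_sli = availability_sli(good_requests=999_400, total_requests=1_000_000)

print(healthy_sli >= SLO_PERCENT)  # 99.97 percent meets the objective
print(at_risk_sli >= SLO_PERCENT)  # 99.94 percent falls short of it
```

The SLI is what you measure, and the SLO is the target you compare it against; the gap between the two tells you whether the system is at risk.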
Don’t forget about SLOs, SLIs, and SLAs. You measure your SLOs, service level objectives, with your SLIs, service level indicators. Your SLA, service level agreement, is the performance level you’ve contractually guaranteed to provide to your customers. A breach of these agreements could cost you! The SLIs that are clear indicators of system degradation (failures) should be used to create an alerting policy. Alerting policies define the conditions in which one or multiple resources are in a state that requires you to take action and what actions to take upon meeting those conditions. Alerting policies consist of conditions, the indicators based on the breach of a metric threshold; notifications; and documentation that can be provided to help your support team resolve the issue. When an alerting policy triggers, Cloud Monitoring will show an incident notification in the console, and it will also kick off any notifications to people or services that you’ve defined in the policy.
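The three parts of an alerting policy (conditions, notifications, and documentation) can be modeled with a small data structure. This is a conceptual sketch, not the Cloud Monitoring API: the `AlertingPolicy` class, its fields, the channel names, and the 2.0 percent error-rate threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlertingPolicy:
    name: str
    threshold: float                  # condition: breach when the metric exceeds this
    notification_channels: List[str]  # who or what gets notified
    documentation: str                # guidance for whoever responds

    def evaluate(self, metric_value: float) -> List[str]:
        """Return one notification per channel if the condition is met."""
        if metric_value <= self.threshold:
            return []
        return [
            f"[{channel}] {self.name}: value {metric_value} exceeded "
            f"{self.threshold}. {self.documentation}"
            for channel in self.notification_channels
        ]


policy = AlertingPolicy(
    name="High error rate",
    threshold=2.0,
    notification_channels=["on-call-pager", "team-email"],
    documentation="Check the most recent release and roll back if needed.",
)

print(policy.evaluate(0.5))       # below the threshold, so no notifications
print(len(policy.evaluate(2.5)))  # above it, so one message per channel
```

Keeping the documentation alongside the condition means the responder gets the next step in the same message that wakes them up.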
Dashboards
Another powerful native capability of Cloud Monitoring is providing predefined and custom dashboards for viewing and analyzing your most important metric data. The predefined dashboards don’t require any effort to set up or configure. Custom dashboards can be configured using the Cloud Monitoring API. There is a lot of flexibility here on what you can do, from building custom charts and visualizations, to exporting that data, to sharing data with Grafana.
The Importance of Resilience
Say you noticed a big flaw in a system design. You stepped up and analyzed the flaw. You peer-reviewed it with a colleague. You finally mustered up the courage to raise the alarm with leadership, expecting them to take strong, decisive action toward remediating the vulnerability. Next thing you know, you get shot down. Budget issues, people issues, and “This is a significant risk but we have deadlines to meet in this quarter. We can visit this in a few sprint cycles.” Maybe you could’ve done a better job assessing and presenting the risk to your leadership team. But more often than not, poor leadership allows these incidents to occur in the first place. Most incidents occur because of preexisting conditions of an architecture that are already known to the hands-on technical teams working in the environment. Poor leadership is rampant in the world.
PSA to all current and future technology leaders: It costs significantly less to design your system to be resilient both from operational and security concerns today than it does to pay huge sums of money and time to recover, while demolishing your team’s morale throughout the process. Technical debt is a term cloud architects need to wrap their minds around. There is no free lunch. The act of cutting corners here or there, choosing less optimal solutions, reinventing your own wheel, just because you can, all have long-term consequences.
Business continuity. Disaster recovery. Resilience. What do all these things mean? They all sound so similar, yet they’re all focused on different scenarios. The end result of these is the same: to get your business back to baseline, and, for the most part, in the cloud, the technical solutions you implement are going to be similar.
- Business continuity planning (BCP): A plan of action for getting the business back to full functionality after a crisis
- Disaster recovery planning (DRP): A plan for getting the technology infrastructure and operations back in order after an outage
- Resilience: Your infrastructure’s ability to withstand faults and failures and continue running with little to no downtime
In short, these three planning efforts are intended to answer these questions: How do you restore your technical infrastructure back to full baseline after an outage? How can you still serve your customers during an outage? How can you design your infrastructure to absorb failures and still operate without having an outage?
Let’s forget about BCP and DRP for a minute. We know that failure will happen, and it is accepted as part of the principles of DevOps. But this is where SRE comes into play. The sole purpose of SRE is to help find opportunities to improve your infrastructure’s ability to handle fault and failure while continuing to serve your customers, and to do this, ideally, in a way that doesn’t require you to trigger your disaster recovery plan or your business continuity plan. Resilience is the number one goal. But it is imperative you have plans B and C in place to kick off BCP and DRP if a full outage were to occur (which is more common than not).
How do you know that your system is resilient? You can do all the testing in the world when it comes to your applications and your environments, but you never know how an application will handle real live data and real live users in production without experimenting with your application in production!
Chaos Engineering is an SRE discipline focused on experimenting with your systems in production by injecting real faults, ranging from small experiments to massive experiments, to see how your application can truly handle them. The goal is to validate your hypotheses of how your system will perform based on the design iterations you implement to improve resilience. Let’s say, for example, that your application was hosted in a US-east region, and if that US-east region goes down, your application goes down. You decide to replicate your application and run multiregional resiliency. In design, your team did a great job of replicating your application stack and planning all of their dependencies to be redundant across the regions. But, in reality, you still have no idea how this will work when you have 50 million daily active users and an entire region goes down. You still need to provide service to all those users. You won’t know the answer unless you actually test and validate your architecture.
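The hypothesis-then-fault-injection loop can be shown with a toy model of the multiregion example. This sketch only simulates the experiment in memory; real chaos experiments inject faults into real infrastructure. The region names, `pick_region`, and `run_experiment` are hypothetical, and the model deliberately ignores capacity, so it cannot tell you whether the surviving region could absorb 50 million users.

```python
from typing import Dict, List, Optional


def pick_region(health: Dict[str, bool], preference: List[str]) -> Optional[str]:
    """Return the first healthy region in preference order, or None."""
    for region in preference:
        if health.get(region, False):
            return region
    return None


def run_experiment(region_names: List[str], failed_region: str,
                   total_requests: int = 1000) -> float:
    """Inject a region failure and return the fraction of requests still served."""
    health = {name: True for name in region_names}
    health[failed_region] = False  # the injected fault
    served = sum(
        1 for _ in range(total_requests)
        if pick_region(health, region_names) is not None
    )
    return served / total_requests


# Hypothesis: with a second region, losing us-east should not drop any requests.
single_region = run_experiment(["us-east"], failed_region="us-east")
multi_region = run_experiment(["us-east", "us-west"], failed_region="us-east")

print(single_region)  # prints 0.0
print(multi_region)   # prints 1.0
```

The value of the real experiment lies in everything this model leaves out, such as failover latency, hidden single-region dependencies, and load on the surviving region.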
Netflix is behind the Chaos Engineering discipline. It started with a 2011 tool built to test resilience, called Chaos Monkey. Chaos Monkey intentionally disabled computers inside the Netflix production network, and then engineers assessed how their systems responded to this type of outage. The name explains itself. Just imagine a bunch of monkeys, Planet of the Apes style, rolling into your data center and “going ham.” Racks flying everywhere, NICs ripped out, cables flung around, monkeying around. That’s a Chaos Monkey. It’s one thing to test such resiliency in testing environments. It’s another thing to do this in your production environment when it is providing services to paying customers.
Additional References
If you’d like more information about the topics discussed in this chapter, check out these sources:
- Principles of Chaos Engineering: https://principlesofchaos.org/
- Site Reliability Engineering: https://landing.google.com/sre/
- Patterns for Scalable and Resilient Apps: https://cloud.google.com/solutions/scalable-and-resilient-apps