Study Guide: Storage, Databases, and Data Analytics
Source: https://www.fatskills.com/google-professional-cloud-architect-certification/chapter/storage-databases-and-data-analytics

Storage, Databases, and Data Analytics

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

Storage
In Google Cloud, there are a few types of storage options—block storage, object storage, and file storage. As a cloud architect, you’ll hear these terms used quite often, and oftentimes incorrectly. Let’s talk about each storage type.
Block storage refers to a data system in which the data is broken up into blocks and then stored across a distributed system to maximize efficiency. These storage blocks get unique identifiers so that they can be referenced in order to piece the data back together. Block storage systems are great when you need to retrieve and manipulate data quickly through a layer of abstraction (such as an operating system) that is responsible for accessing the data points across the blocks. The downside to block storage, however, is that because your data is split and chunked across blocks, there is no ability to leverage metadata. In the traditional sense, storage area networks and hard drives were associated with block storage. In Google Cloud, block storage is associated with technologies such as Persistent Disks and Local SSDs, which essentially are disk drives in Google’s data centers that are attached to your virtual machines (VMs). Think about a scenario, for example, in which your company wants to install a MySQL database on a VM because they’re not ready to fully migrate to a managed solution, or they just want to have full control. What type of storage would be the best solution for storing all of the underlying data from the database? This is a case in which an attached Persistent Disk would provide strong, durable block storage for a database stored on a VM. A Local SSD is also block storage, but it is ephemeral, so it suits scratch data such as temporary files and not the database’s primary data.
File storage refers to a data system in which your data is stored in files, and those files are organized in a folder hierarchy, providing a simple user interface to organize and sort your data and the ability to leverage metadata within your files. File storage is great for use cases such as network-attached storage or for storing files within your operating system. Think about the Windows OS, with a file system that maps where your data is, but the underlying storage is chunked across block storage. In Google Cloud, file storage is associated with technologies like Cloud Filestore, which is a network-attached file storage solution in the cloud. If a bunch of your applications need to share a file server to share and access files commonly in a safe manner, this is where a solution like Cloud Filestore would come into play.
Object storage refers to a data system with a flat structure that contains objects, and within objects are data, metadata, and a unique identifier. This type of storage is very versatile, enabling you to store massive amounts of unstructured data and still maintain simple data accessibility. You can use object storage for anything—unstructured data, large data sets, you name it. However, because of the nature of this flat system, you’ll need to manage your metadata effectively to be able to keep your objects accessible. In Google Cloud, object storage is associated with technologies such as Google Cloud Storage. For example, a mega-company like the National Football League (NFL) may have petabytes of game footage stored in an unstructured data store, but it most likely leverages a homegrown or third-party solution such as a media asset manager to add an organizational schema to make the data easily accessible, sortable, retrievable, and manageable.

When you’re taking the exam, knowing which Google Cloud storage technologies are related to file, object, and block storage may help you get to a more clear answer. Be careful, though, and don’t assume a Google-managed service is always the answer. Read through each question very carefully for the requirements.

Google Cloud Storage
Google Cloud Storage (GCS) is a globally unified, scalable, and highly durable object storage offering in GCP. GCS is often used for content delivery, data lakes, and backup. Data in GCS is encrypted by default at rest. But you can also leverage all of the other Google Cloud encryption offerings on GCS.
It offers varying availability service level objectives (SLOs), depending on the storage class, ranging from 99.0 to 99.95 percent. GCS offers Object Lifecycle Management to move your data automatically to lower-cost storage classes based on criteria you define to optimize your costs.

Durability refers to the ability for data to be protected from bit rot, degradation, or other corruption. Durability is also measured in nines, and GCS provides 11 nines of durability, or 99.999999999 percent annual durability.
In GCS, you store your objects, which are immutable pieces of data that can be any file of any format, in containers called buckets, which are associated with a project. Upon bucket creation, you select a globally unique name and a geographical location where you are going to store the bucket and the objects within it. That means your bucket name cannot be the same as any other bucket in the world, so you need to follow a strong naming convention. You also select a storage class that all of the files will be aligned to.

You should leverage the various storage classes based on your data’s availability requirements:
- Standard storage is great for data that is frequently accessed and needs the strongest availability.
- Nearline storage is a low-cost solution that is good for data that is infrequently accessed. If you are okay with a slightly lower availability and a 30-day minimum storage duration, the lower storage costs can be a greater benefit than the increased costs for accessing your data. This is ideal if you want to read or modify your data once a month or less on average.
- Coldline storage is a very-low-cost solution that is suitable for data that is infrequently accessed. The at-rest storage costs are even lower than nearline, in exchange for a 90-day minimum storage duration and higher access costs. Coldline storage is great for data that you need to read or modify only once a quarter.
- Archive storage is the lowest-cost solution. It has no availability service level objective (SLO), but its true availability is typically equivalent to nearline and coldline storage. This is good for data that needs to be accessed only once a year, and it has a 365-day minimum storage. If, for example, a compliance requirement necessitates that you retain audit logs for six years, you’d want to throw it in archive. The good thing about GCP is that, while other cloud providers offer similar storage classes, GCP can surface all of the data instantly when it is accessed. In AWS, if you store data in an equivalent archive solution, it may take hours or days to get your data out.
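
As a rough illustration of the guidance in the list above, the following Python sketch maps an expected access frequency to a storage class. The function name and the exact cut-offs (30, 90, and 365 days between accesses) are assumptions chosen to mirror the once-a-month, once-a-quarter, and once-a-year rules of thumb. A real decision should also weigh retrieval costs and minimum storage durations.

```python
def suggest_storage_class(days_between_accesses: float) -> str:
    """Suggest a GCS storage class from how often the data is read or modified.

    The thresholds are illustrative assumptions based on the rules of thumb
    above, not official limits.
    """
    if days_between_accesses < 0:
        raise ValueError("days_between_accesses must be non-negative")
    if days_between_accesses >= 365:
        return "ARCHIVE"    # accessed about once a year or less
    if days_between_accesses >= 90:
        return "COLDLINE"   # accessed about once a quarter or less
    if days_between_accesses >= 30:
        return "NEARLINE"   # accessed about once a month or less
    return "STANDARD"       # frequently accessed data


for days in (1, 45, 120, 400):
    print(days, suggest_storage_class(days))
```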

You can apply key-value pair labels to your buckets that enable you to group your buckets with other resources, such as VMs or persistent disks. You may want to use labels to classify data sensitivity according to your data classification model, to identify which team the data belongs to, or for other purposes. You can use up to 64 labels per bucket. Labels are often used for billing accounting purposes as well.
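A minimal sketch of building a label set for a bucket might look like the following. The label keys (team, data-sensitivity, cost-center) and the helper name are hypothetical. The only rule enforced here is the 64-label limit mentioned above, so any additional naming rules would need to be checked separately.

```python
from typing import Dict, Optional

MAX_LABELS_PER_BUCKET = 64  # limit quoted in the text above


def build_bucket_labels(team: str, sensitivity: str,
                        extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a label dict for a bucket, enforcing the 64-label limit.

    The keys "team" and "data-sensitivity" are hypothetical naming choices.
    """
    labels = {"team": team, "data-sensitivity": sensitivity}
    labels.update(extra or {})
    if len(labels) > MAX_LABELS_PER_BUCKET:
        raise ValueError(
            f"{len(labels)} labels exceed the limit of {MAX_LABELS_PER_BUCKET}"
        )
    return labels


labels = build_bucket_labels("payments", "confidential", {"cost-center": "cc-1234"})
print(labels)
```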
Object names are not globally unique, but because GCS is a flat storage, object names have to be unique within your buckets. For example, you can have two files named pwned.jpg in two different buckets, but they cannot be in the same bucket. You can also leverage object versioning to manage version control inside of your buckets. Object versioning is a feature that maintains old versions of files in your bucket when they are overwritten or deleted, based on parameters you set. Obviously, this will increase the cost of storage because you would be maintaining multiple versions of files, so you wouldn’t want to use this feature if it’s not needed.

You cannot recover objects from a deleted bucket, regardless of whether or not you’re using object versioning.
To manage cost effectively, you should leverage Object Lifecycle Management, which enables you to apply a configuration policy to your buckets to determine what actions to take automatically based on a condition your objects meet. For example, if objects haven’t been accessed in more than 60 days, you may want to downgrade their storage class to coldline to save money, or you may want to delete them entirely. Your lifecycle rule can be composed of two types of actions: Delete and SetStorageClass. As you can probably guess, you can either delete objects or change their storage class based on conditions such as object age, creation date, number of newer versions, and so on.
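
The two action types described above can be sketched as a small Python helper that builds a lifecycle configuration and prints it as JSON. The helper name is hypothetical, and the JSON layout (a top-level "rule" list with "action" and "condition" entries) is assumed to match the format of GCS lifecycle configuration files, so verify it against the current documentation before use. Note that the "age" condition counts days since the object was created, not days since it was last accessed.

```python
import json


def lifecycle_rule(action_type, age_days, storage_class=None):
    """Build one lifecycle rule dict (hypothetical helper, not a Google API)."""
    if action_type not in ("Delete", "SetStorageClass"):
        raise ValueError(f"unsupported action type: {action_type}")
    action = {"type": action_type}
    if action_type == "SetStorageClass":
        if storage_class is None:
            raise ValueError("SetStorageClass needs a target storage class")
        action["storageClass"] = storage_class
    return {"action": action, "condition": {"age": age_days}}


config = {
    "rule": [
        # Move objects to coldline 60 days after creation.
        lifecycle_rule("SetStorageClass", 60, "COLDLINE"),
        # Delete objects after roughly six years.
        lifecycle_rule("Delete", 6 * 365),
    ]
}
print(json.dumps(config, indent=2))
```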

When it comes to the exam, think about how you can leverage Google Cloud Storage in your architecture. It can be a great solution for both archival data, with its variety of storage classes, and for any applications that need object storage that is georedundant, with very strong service level objectives.
Imagine a scenario in which your compliance regulators require that you retain logs for six years. If your logs are not being accessed frequently, why would you want to store the data anywhere other than a long-term archival storage class in GCS? Or what about a backup data store for disaster recovery?
You can interact with GCS through the Cloud Console, through the gsutil command-line tool, using client libraries, and via the REST API.

Remember the syntax for using gsutil: gs://[BUCKET_NAME]/[OBJECT_NAME]. You can use this tool to do a variety of bucket and object management tasks, such as creating and deleting buckets; uploading and downloading files; moving, copying, and renaming objects; or editing ACLs.
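To make the gs://[BUCKET_NAME]/[OBJECT_NAME] layout concrete, here is a small Python sketch that splits such a path into its bucket and object parts. The function and type names are hypothetical, and the sketch only checks the overall shape of the path, not the bucket naming rules.

```python
from typing import NamedTuple


class GcsPath(NamedTuple):
    bucket: str
    object_name: str


def parse_gs_uri(uri: str) -> GcsPath:
    """Split gs://bucket/object into its two parts."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a gs:// path: {uri}")
    # Everything up to the first slash is the bucket; the rest is the object
    # name, which may itself contain slashes because GCS is a flat namespace.
    bucket, _, object_name = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError("missing bucket name")
    return GcsPath(bucket, object_name)


print(parse_gs_uri("gs://example-audit-logs/2021/03/app.log"))
```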
You should remember that GCS is one of the most often used staging solutions for bringing data in and out of the cloud. When you use any data transfer solutions, it typically goes into GCS as its first step into GCP. When you move data between cloud services, it often is stored in GCS as an interim placeholder. When you need to move data out of GCP, GCS is the standard staging storage location for your processes.

Cloud Filestore
Cloud Filestore is high-performance managed file storage for applications that require a file system. Using the Network File System (NFS) protocol, Filestore lets you stand up network-attached storage for your Google Compute Engine (GCE) or Google Kubernetes Engine (GKE) instances. Filestore is highly consistent, fast, fully managed, and scalable using Elastifile to grow or shrink your clusters. Filestore offers a 99.9 percent SLO.
When you create a Filestore instance, you’re creating a single NFS file share with default Unix permissions. Filestore instances are required to be created in the same project and Virtual Private Cloud (VPC) network as the GCE or GKE clients that are connected to it (unless you use a shared VPC). Basically, you’d want it to be on the same RFC 1918 address space as your clients, and you can enable Filestore to select an available IP automatically in the RFC 1918 space that you designate to create the instances.

Persistent Disk
Persistent Disk (PD) is high-performance, highly durable block storage that provides solid-state drive (SSD) and hard disk drive (HDD) storage and can be attached to GCE or GKE instances. Storage volumes can be resized and backed up, and they can support simultaneous reads. The maximum size of a single persistent disk is 64TB, but you can use more than one disk.
There are three types of persistent disks:
- Standard persistent disks: These are best for large data processing workloads that mostly leverage sequential I/Os.
- SSD persistent disks: These are best for high-performance databases and applications that require low latency and more I/O operations per second (IOPS) than standard PDs. They provide single-digit millisecond latencies.
- Balanced persistent disks: An alternative to SSD PDs, these are a great balance between performance and costs and suitable for most general-purpose applications.
If you need to modify the size of your persistent disk, it’s as easy as increasing the size in the Cloud Console. If you need to resize your mounted file system, you can use the standard resize2fs command in Linux to do online resizing. PDs are not actually physically attached to the servers that host your VMs, but they are virtually attached. You can only resize up, but not down!

The command to modify the auto-delete behavior of persistent disks attached to VMs is gcloud compute instances set-disk-auto-delete. Auto-delete is on by default, so you will need to turn this setting off if you don’t want your PD to be deleted when the instance it is attached to is deleted.
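
As a sketch of how the command above might be assembled, the following Python snippet builds the argument list without running it. The instance, disk, and zone names are placeholders, and the --disk, --zone, and --no-auto-delete flags should be confirmed against the gcloud reference for your SDK version.

```python
import shlex


def set_disk_auto_delete_command(instance, disk, zone, auto_delete):
    """Return the gcloud argument list; nothing is executed here."""
    flag = "--auto-delete" if auto_delete else "--no-auto-delete"
    return [
        "gcloud", "compute", "instances", "set-disk-auto-delete",
        instance,
        f"--disk={disk}",
        f"--zone={zone}",
        flag,
    ]


# Placeholder names: keep the data disk when the VM is deleted.
command = set_disk_auto_delete_command(
    "db-vm-1", "db-data-disk", "us-central1-a", auto_delete=False
)
print(shlex.join(command))
```

Building the list first makes it easy to review the command before passing it to something like subprocess.run.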

Local SSD
Local solid-state drives (SSDs) are high-performance, ephemeral block storage disks that are physically attached to the servers that host your VM instances. They offer superior performance, high IOPS, and ultra low latency compared to other block storage options. These are typically used for temporary storage use cases such as caching or scratch processing space. Think of workloads like high-performance computing (HPC), media rendering, and data analytics.

Local SSDs disappear when you stop an instance, whereas all three types of persistent disks persist when you stop an instance—hence the name, persistent disk.
Each Local SSD is only 375GB, but you can attach 24 Local SSDs per instance. Because of their benefits and limitations, Local SSDs make a great use case for temporary storage such as caches, processing space, or low-value data.
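
The sizing arithmetic above is simple enough to sketch in a few lines of Python. The helper name is hypothetical, and the 375GB and 24-disk figures are taken from the text.

```python
import math

LOCAL_SSD_SIZE_GB = 375           # size of one Local SSD, per the text
MAX_LOCAL_SSDS_PER_INSTANCE = 24  # attach limit, per the text


def local_ssds_needed(scratch_gb: float) -> int:
    """Return how many Local SSDs cover the requested scratch space."""
    if scratch_gb <= 0:
        raise ValueError("scratch_gb must be positive")
    count = math.ceil(scratch_gb / LOCAL_SSD_SIZE_GB)
    if count > MAX_LOCAL_SSDS_PER_INSTANCE:
        raise ValueError("request exceeds what one instance can attach")
    return count


print(local_ssds_needed(1000))
print(LOCAL_SSD_SIZE_GB * MAX_LOCAL_SSDS_PER_INSTANCE, "GB maximum per instance")
```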

Databases
There are a variety of database solutions in GCP, from installing your own database servers on virtual machines and using persistent disks to store your data, to leveraging a variety of managed services on GCP. For most cloud-native organizations, migrating to managed databases is a no-brainer, especially with the feature parity that GCP’s database offerings have been able to achieve over the last few years compared to traditional databases. Managing and scaling a database operationally have quite the overhead. Why not let your cloud provider handle all of that work for you? Databases in GCP offer flexible performance and enormous scalability. They are often highly compatible with a broad set of open source technologies, and they are strongly integrated with key analytics and ML/AI products. Google Cloud’s relational database offerings are Cloud SQL and Cloud Spanner, and its NoSQL/nonrelational database offerings are Cloud Bigtable, Cloud Firestore, Firebase Realtime Database, and Cloud Memorystore.

The atomicity, consistency, isolation, and durability properties of database transactions are commonly referred to by the acronym ACID. The sequence of database operations that satisfies the ACID properties is called a transaction. Not all database offerings fulfill all ACID requirements, nor are they intended to do so.

Cloud SQL
Cloud SQL is a fully managed relational database service for MySQL, PostgreSQL, and SQL Server that integrates simply with just about any application, including those running on GCE, GKE, and Google App Engine (GAE). You can use BigQuery to query your Cloud SQL databases directly. Cloud SQL offers a 99.95 percent availability SLO and supports standard SQL queries.
When you’re deploying a Cloud SQL instance, you get to choose between the three database servers: MySQL, PostgreSQL, and SQL Server. You can also select the region where the instance and its data are stored, as well as the zones. Ideally, you’d want to choose the same region for your data and the applications interfacing with it to minimize latency.
Replicating your instances between zones can be done by configuring Cloud SQL for high availability (HA). When you deploy an HA configuration, commonly known as a cluster, you’re providing data redundancy across zones. An HA configuration typically has a primary instance in one zone and a standby instance in another zone. All of the data from the primary instance is stored in a regional persistent disk, which synchronously replicates it to persistent disks in each zone. The standby instance is activated only if the primary instance becomes unresponsive, at which point Cloud SQL automatically fails over to the standby. After a failover happens, you’ll need to perform a failback manually to resume serving data from your primary instance’s zone once you get it back up and running.

You can also configure your instances to support read replicas to offload traffic from a Cloud SQL instance for read-heavy workloads. You cannot write to read replicas. Cloud SQL also supports sharding—it’s always a best practice to leverage many smaller Cloud SQL instances rather than one large instance to improve efficiency and scalability.
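
The read replica and sharding ideas above can be sketched in Python as two small routing functions. This is only an illustration under stated assumptions: the endpoint names are placeholders, the round-robin replica choice and the SHA-256 based shard key are design choices made for the example, and nothing here is a Cloud SQL API.

```python
import hashlib
from typing import List


def shard_for(key: str, shard_count: int) -> int:
    """Map a key (for example a customer ID) to a shard index, stably."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def pick_endpoint(is_write: bool, primary: str, replicas: List[str],
                  request_id: int) -> str:
    """Send writes to the primary and spread reads across read replicas."""
    if is_write or not replicas:
        return primary  # replicas are read-only, so writes never go there
    return replicas[request_id % len(replicas)]


print(shard_for("customer-42", 4))
print(pick_endpoint(False, "primary-a", ["replica-1", "replica-2"], request_id=3))
```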

Cloud Spanner
Cloud Spanner is a fully managed, scalable relational database for regional and globally distributed application data. It combines the benefits of a relational database structure while scaling horizontally like a nonrelational database would. This allows for strong consistency across rows, regions, and continents, with a 99.999 percent availability SLO. Cloud Spanner solved a major issue with traditional databases by eliminating the trade-off between scale and consistency through horizontal scaling, low latency, and strong consistency. Cloud Spanner is an online transaction processing database and supports standard SQL queries.

Don’t forget the term “horizontal scaling.” You may see questions about a requirement for a fully managed transactional SQL database that is also horizontally scaling like a traditional nonrelational database would be. Cloud Spanner is the only solution that would meet the criteria.
When databases claim they support standard SQL, that does not mean that they are SQL compatible. Cloud SQL supports MySQL, PostgreSQL, and SQL Server, which all claim to support standard SQL queries. Cloud Spanner also claims to support standard SQL queries. Be aware that you cannot, for example, switch from using MySQL or Microsoft SQL Server to Cloud Spanner just because you want horizontal scaling for your database and then expect that your applications will work. Assume that there will be some level of development needed to refactor your database.
When you create a Cloud Spanner instance, you select your instance configuration and node count. Your instance configuration determines where your database is geographically located and where it is replicated. You can select a regional or multiregional deployment. After choosing your instance configuration, you will be able to choose the node count, which determines the amount of serving and storage resources that are available in that instance. You can modify the node count later if you need to. Each node supports up to 2TB of storage, and its performance is based on the instance configuration, schema design, and your dataset characteristics. As an estimate, each Cloud Spanner node can provide up to 10,000 queries per second of reads and 2000 queries per second of writes. The ability to add nodes is what makes Cloud Spanner horizontally scalable, and that also makes it a much more powerful system than Cloud SQL.
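
Using the per-node figures quoted above, a back-of-the-envelope node estimate can be sketched in Python. The function name is hypothetical, and treating read and write load as fractions of a node that add together is a simplifying assumption. Actual capacity depends on instance configuration, schema design, and data characteristics, as noted above.

```python
import math

STORAGE_TB_PER_NODE = 2       # per the text
READ_QPS_PER_NODE = 10_000    # rough estimate, per the text
WRITE_QPS_PER_NODE = 2_000    # rough estimate, per the text


def estimate_spanner_nodes(storage_tb: float, read_qps: float,
                           write_qps: float) -> int:
    """Return a rough lower bound on the node count for a workload."""
    by_storage = math.ceil(storage_tb / STORAGE_TB_PER_NODE)
    # Assumption: reads and writes each consume a share of a node's capacity.
    by_throughput = math.ceil(
        read_qps / READ_QPS_PER_NODE + write_qps / WRITE_QPS_PER_NODE
    )
    return max(1, by_storage, by_throughput)


print(estimate_spanner_nodes(storage_tb=5, read_qps=25_000, write_qps=3_000))
```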

Cloud Bigtable
Cloud Bigtable is a fully managed and scalable NoSQL database for large analytical and operational workloads. It’s able to handle millions of requests per second at a consistent sub-10ms latency. Bigtable is ideal for things like personalization engines, ad-tech, digital media, and Internet of Things (IoT), and it connects easily to other database services such as BigQuery and the Apache ecosystem. Bigtable offers a 99.99 percent availability SLO. It can scale to billions of rows and thousands of columns, enabling you to store petabytes of data. It also has an extension to multiple client libraries, including the Apache HBase library for Java. Bigtable is a great solution for MapReduce operations, stream processing/analytics, and ML applications. By offering low-latency read/write access, high-throughput analytics, and native time series support, it’s commonly used to store and query the following workloads:

- Time-series data
- Marketing data, personalization, recommendations
- Financial data
- IoT data, geospatial datasets
- Graph data

Cloud Firestore
Cloud Firestore is a fully managed, fast, serverless, cloud-native NoSQL document database that is designed for mobile, web, and IoT applications at global scale. Firestore is the next generation of Google Cloud Datastore, which was the original highly scalable NoSQL database for mobile and web-based applications. Firestore offers a 99.999 percent availability SLO. Firestore is a great solution for common use cases such as user profiles, product catalogs, and game state.
EXAM TIP: If the JencoMart case study is used in the exam, question(s) will outline a requirement to leverage managed services where possible. JencoMart uses an Oracle database for storing 20TB of user profiles. Firestore would be a great solution for this use case. And don’t forget that Firestore is the next major version of Datastore and still supports existing Datastore APIs.

Cloud Memorystore
Cloud Memorystore is a scalable, secure, and highly available in-memory service for Redis and Memcached. It enables you to build application caches that provide sub-millisecond data access, and it’s entirely compatible with the open source Redis and Memcached. Memorystore provides a 99.9 percent availability SLO.

Redis and Memcached are open source distributed memory caching systems that are often used to speed up dynamic database–driven applications by caching the data and objects in RAM. You can use these tools to cache the queries for your database backend. To do this effectively, you’ll want to get the hashes of your queries and use those to build a key-value store, where your data then gets returned by the caching system rather than your database backend. Google App Engine provides a built-in memcache service by default.
Most organizations use Redis or Memcached, but Cloud Memorystore has an advantage by eliminating the burden of management from the organization. Because it is fully managed, you don’t have to worry about managing the deployment, scaling, node configurations, monitoring, and patching. Its compatibility with Redis and Memcached makes it simple for you to migrate your applications without making code changes. Memorystore provides rich metrics that enable you to scale your instances up and down easily, so that you can optimize your cache-hit ratio and your costs.
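
The query-hashing approach described above can be sketched as follows. A plain Python dict stands in for Redis or Memcached, and the backend function is a fake, so this is an illustration of the cache-aside pattern and not Memorystore client code. The class name and the choice of SHA-256 for the key are assumptions.

```python
import hashlib


class QueryCache:
    """Cache query results under a hash of the query text."""

    def __init__(self, run_query):
        self._run_query = run_query  # function that hits the real database
        self._store = {}             # stand-in for a Redis/Memcached client
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key_for(sql: str) -> str:
        return hashlib.sha256(sql.encode("utf-8")).hexdigest()

    def get(self, sql: str):
        key = self.key_for(sql)
        if key in self._store:
            self.hits += 1
            return self._store[key]   # served from the cache
        self.misses += 1
        result = self._run_query(sql)  # fall back to the database
        self._store[key] = result
        return result


def fake_backend(sql):
    # Pretend database call that always returns the same rows.
    return [("alice", 1), ("bob", 2)]


cache = QueryCache(fake_backend)
cache.get("SELECT name, id FROM users")
cache.get("SELECT name, id FROM users")
print(cache.hits, cache.misses)
```

A production version would also need expiry and invalidation so that cached results do not go stale after writes.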

Data Analytics
Building on the database technology, one of the key value propositions Google Cloud offers over other cloud providers is its powerful data analytics offerings, called Smart Analytics solutions. At the end of the day, Google is a data company, and externalizing its infrastructure as a cloud to customers was a natural business path. Building incredibly powerful analytical tools based on its 20-plus years in the data space is another no-brainer. Forrester Research named Google Cloud a leader in Data Management for Analytics. Google offers unified cloud-native data analytics platforms that provide easy access to streaming and batch processing data at an unmatched scale.

You’ll see a lot of questions on the exam that may sound like they are looking for database solutions, but don’t forget to distill the requirements down to an online transactional processing (OLTP) or online analytical processing (OLAP) type of data solution. Oftentimes, when the exam refers to analytics, your answers will involve analytical solutions such as BigQuery, Dataproc, Dataflow, and Pub/Sub (covered in the following sections). Don’t forget the power of Cloud Bigtable as an underlying database for analytical workloads that integrates with big data tools such as Hadoop, Dataflow, and Dataproc.

BigQuery
BigQuery is a highly scalable, cost-effective, serverless solution for multi-cloud data warehousing. Use it to analyze petabyte-scale data with zero operational overhead. BigQuery is one of Google Cloud’s top products and is based on the Google-developed Dremel query engine. It has a 99.99 percent availability SLO. BigQuery is an OLAP database that supports standard SQL queries. A common use case is migrating data from Teradata or other on-premises legacy data warehouses to substantially increase performance and scalability while decreasing operational burden and cost.
You can interact with BigQuery through the Cloud Console, the bq command-line tool, or the BigQuery REST API. You can also use many third-party tools to interact with BigQuery, such as visualization tools. BigQuery jobs are actions that BigQuery will run on your behalf to load, export, query, or copy data. Jobs execute asynchronously, so you can run several concurrently and use polling to check their status.
BigQuery datasets are top-level containers that are used to organize and control access to tables and views. As you can imagine, tables consist of rows and columns of data records. Like other databases, each table is defined by a table schema, which should be created in the most efficient way. Views are virtual tables defined by a query. You can create authorized views to share query results with only a particular set of users and groups without giving them access to the tables directly.
Although BigQuery can run massive queries in a fraction of the time that other data solutions can, you should still partition your tables to optimize your database. Think about how this would play out if you had petabyte-size log storage. Rather than storing the log data in a single table partition, you can consider splitting them into partitions to optimize your queries and reduce costs. You can use time-partitioned tables and add rules, such as a rule to delete a partition after 30 days. You can also use table partitioning to manage fine-grained access to a data source between multiple teams.
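
To illustrate why partitioning helps, here is a toy Python model of a day-partitioned log table with a 30-day expiration rule. It is an in-memory sketch, not BigQuery code: the class name and methods are hypothetical, and the point is only that a date-bounded query touches the matching partitions and that old partitions can be dropped as a unit.

```python
from collections import defaultdict
from datetime import date, timedelta


class DayPartitionedTable:
    """In-memory stand-in for a time-partitioned table."""

    def __init__(self, expiration_days: int):
        self.expiration_days = expiration_days
        self.partitions = defaultdict(list)  # day -> rows

    def insert(self, day: date, row) -> None:
        self.partitions[day].append(row)

    def expire(self, today: date) -> int:
        """Drop partitions older than the expiration window."""
        cutoff = today - timedelta(days=self.expiration_days)
        expired = [d for d in self.partitions if d < cutoff]
        for d in expired:
            del self.partitions[d]
        return len(expired)

    def query(self, start: date, end: date):
        """Return (rows, partitions_scanned) for an inclusive date range."""
        scanned = sorted(d for d in self.partitions if start <= d <= end)
        rows = [row for d in scanned for row in self.partitions[d]]
        return rows, len(scanned)


today = date(2021, 3, 31)
logs = DayPartitionedTable(expiration_days=30)
logs.insert(today, "login ok")
logs.insert(today - timedelta(days=10), "disk warning")
logs.insert(today - timedelta(days=45), "old entry")
print(logs.expire(today))
print(logs.query(today, today))
```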

The predefined roles created for BigQuery are important for managing the various tasks involved with standing up, migrating to, managing, and interacting with BigQuery. Get familiar with these roles at a high level for the exam.

Cloud Dataproc
Cloud Dataproc is a fully managed data and analytics processing solution that’s based on open source tools. You can build fully managed Apache Spark, Apache Hadoop, Presto, and other open source clusters. It’s pay as you go, making it a very cost-effective solution that offers per-second pricing. In Dataproc, you get the benefit of automated cluster management, which handles deployment, logging, and monitoring, so that you can focus on your data.

If you see questions about Hadoop or Spark on the exam—don’t forget about Dataproc!
When you build your jobs, you can containerize them with Kubernetes (K8s) and deploy them into any GKE cluster. Dataproc is a great use case for migrating Apache Spark jobs and Hadoop Distributed File System (HDFS) data to GCP, as well as for creating a strong data science environment by integrating with ML services and running Jupyter Notebook.

When you’re interacting with Hadoop and Apache Spark, you’ll oftentimes be working with web interfaces that you’ll leverage to manage and monitor your cluster resources and facilities. Be very careful not to expose these interfaces to the public with the wrong firewall rules; instead, create a Secure Shell (SSH) tunnel to secure the connection to your cluster’s master instance. I can’t stress enough how often this goes wrong and organizations get pwned. Dear “you-know-who,” if you’re reading this book, this paragraph was dedicated to the time you opened 0.0.0.0/0 to the world to access the cluster web UI and your organization got pwned by a cryptominer. It could’ve been worse. Moral of the story: don’t screw with your firewall rules without going through information security.

Cloud Dataflow
Cloud Dataflow is a serverless, cost-effective, unified stream and batch data processing service that is fully managed and is based on the Apache Beam SDK. Dataflow enables incredibly fast streaming data analytics and eliminates the overhead of managing clusters. Dataflow is a great use case for stream analytics, real-time AI solutions such as fraud detection, personalization, predictive analytics, and sensor and log data processing in IoT solutions.
Because Dataflow is based on the Apache Beam open source model, while being a fully serverless solution, your key goals when developing your pipelines are to figure out where your input data is stored, what your data looks like, what you want to do with your data, and where the output data will go. This serves as a great use case for extraction, transformation, and load (ETL) jobs and pure data integration. Dataflow supports batch jobs and streaming jobs.
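
The four questions above (where the input lives, what it looks like, what to do with it, and where the output goes) can be sketched with plain Python iterators. This is not the Apache Beam API. It is a stand-in that shows the same transform running over a bounded batch and over an unbounded-style stream, with hypothetical sensor data and an assumed 30-degree threshold.

```python
from typing import Iterable, Iterator


def parse_reading(line: str) -> dict:
    """What the data looks like: 'sensor-id, temperature' text lines."""
    sensor, value = line.split(",")
    return {"sensor": sensor.strip(), "celsius": float(value)}


def hot_readings(lines: Iterable[str], threshold: float = 30.0) -> Iterator[dict]:
    """What to do with the data: parse each line and keep the hot readings."""
    for line in lines:
        record = parse_reading(line)
        if record["celsius"] > threshold:
            yield record


def live_feed() -> Iterator[str]:
    """Stand-in for a streaming source that yields lines as they arrive."""
    yield "sensor-a, 29.0"
    yield "sensor-b, 33.2"


# Batch: a bounded, historical input processed in one pass.
historical = ["sensor-a, 25.0", "sensor-b, 31.5", "sensor-c, 30.0"]
print(list(hot_readings(historical)))

# Streaming: the same transform applied record by record.
for record in hot_readings(live_feed()):
    print(record)
```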

You may see questions that involve historical data or real-time data. Historical data is often aligned to batch processing, and real-time data is aligned to streaming solutions.

Cloud Pub/Sub
Cloud Pub/Sub is a global messaging and event ingestion solution that provides a simple and reliable staging location for event-based data before it gets processed, stored, and analyzed. Pub/Sub offers “at-least-once” delivery, with exactly once processing and no provisioning, and is global by default. Pub/Sub offers a 99.95 percent availability SLO. You’ve probably heard of open source tools such as RabbitMQ that offer similar messaging solutions. (In the Dress4Win case study in Chapter 1, the company runs three RabbitMQ servers for messaging, social notifications, and events.)
Messaging tools have strong use cases in an organization. Pub/Sub is short for publish/subscribe messaging, which is an asynchronous communication method to exchange messages between applications. When you think about all your GCP logs, you’ll realize that there is no 1:1 team or 1:1 application mapping between the logs and where they should go. Oftentimes you’ll have many consumers of this log data. Let’s imagine, for example, that your security team needs to ingest logs into its SIEM, your app team is using Databricks or Datadog and wants to ingest logs into their application, and other teams need the log data for other applications. Rather than designing pipelines to stream your log data across to a variety of applications, you can simply use Pub/Sub as a mechanism that enables any subscriber to consume this data.

There are five key elements in Pub/Sub:
- Publisher: The client that creates messages and publishes them to a specific topic
- Message: The data that moves through Pub/Sub
- Topic: A named resource that represents a feed of messages
- Subscription: A named resource that receives all specified messages on a topic
- Subscriber: The client that receives messages from a subscription
Pub/Sub is incredibly powerful, because it’s built on a core infrastructure component that Google uses to power all of its messaging across its entire product stack, including Ads, Search, and Gmail. Google uses this infrastructure to send more than 500 million messages per second, or more than 1TB/s of data. Pub/Sub is also well suited to streaming IoT data, distributing event notifications, balancing workloads in network clusters, and implementing asynchronous workflows.
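
The five elements listed above can be modeled with a toy in-memory example in Python. This is not the Pub/Sub client library: the class and function names are hypothetical, and the sketch ignores acknowledgments, redelivery, and ordering. It only shows that each subscription on a topic receives its own copy of every published message.

```python
from collections import deque


class Topic:
    """A named feed of messages with any number of subscriptions."""

    def __init__(self, name: str):
        self.name = name
        self.subscriptions = {}  # subscription name -> queue of messages

    def subscribe(self, subscription_name: str) -> deque:
        queue = deque()
        self.subscriptions[subscription_name] = queue
        return queue

    def publish(self, message: str) -> None:
        # Fan out: every subscription gets its own copy of the message.
        for queue in self.subscriptions.values():
            queue.append(message)


def pull(subscription: deque):
    """Subscriber side: take the next message, or None if there is none."""
    return subscription.popleft() if subscription else None


logs = Topic("audit-logs")
siem = logs.subscribe("security-siem")
apm = logs.subscribe("app-monitoring")
logs.publish("user=alice action=login")  # publisher sends one message
print(pull(siem), "|", pull(apm))
```

This mirrors the log-routing scenario above: the security team and the app team each consume the same log data without a dedicated pipeline per consumer.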

Data Security
Protecting your data is one of the pillars of migrating to the cloud. For the longest time, organizations held off migrating to the cloud because of security concerns about cloud providers potentially snooping into or stealing their data and because the attack surface seemed to become larger to the world’s malicious actors. Concerns about cloud providers’ default security have largely been mitigated by now, and most organizations know that the cloud provider isn’t at fault when a security incident occurs. In fact, most traditional on-premises security teams know that their organization’s technology infrastructure is more like Swiss cheese, and it’s only a matter of time until a breach. The focus has shifted toward mistakes made by cloud customers, because almost all breaches in the cloud happen because a customer fails to follow best practices for the elements of the shared responsibility model that fall under the customer’s purview. Every cloud service provider, particularly Google Cloud, has an insane number of security controls that it attests to and is continuously audited against to ensure that the provider is not dropping the ball with regard to protecting its customers.

Data Classification
When it comes to data security in the cloud, you should think about how you are classifying and labeling your data; the security controls you’re applying to protect it, including access controls, encryption, and data exfiltration prevention; and how you’ll monitor your classified data against misconfigurations.
How do you plan on governing your environment if you have no idea where your sensitive data is stored? Data classification is one of the most important elements of protecting your data. Virtually all major organizations use a data classification model, although it may not always be followed. (It’s like the annual security course you’re forced to take, where your sole purpose is to figure out how fast you can get through the course without actually reading anything…yeah. But let’s reserve judgment on the efficacy of mandated security courses for another book.) Data classification is the process of categorizing your data according to its type, sensitivity, and criticality to your organization so that you can secure and govern it more effectively.
Your organization’s information security team will likely have a policy that dictates what type of data falls under which classification tier. You’ll commonly see a three- or four-tier model that starts with public data, then confidential data, then confidential proprietary data, and optionally a highest-sensitivity tier. Think about the Internal Revenue Service, which gathers and stores financial data on all Americans that is quite sensitive in nature and should fall under a high classification tier. It’s one thing if my tax records are leaked and the world knows that I have a lot of IOUs, but the problem is quite different if the US president’s tax records are leaked. That’s the kind of situation where some organizations may include a fourth, highest-sensitivity tier to ensure that their most critical data is protected. Truth is, this classification exercise is going to be imposed on your company sooner or later because of either the EU General Data Protection Regulation (GDPR) or the California Consumer Privacy Act (CCPA).

For each classification category, your information security team likely provides the following security and privacy requirements:
- Data ownership (such as data owner and data custodian)
- Labeling (such as sensitivity and impact)
- Types of data (such as structured and unstructured data)
- Data life cycle (such as generation, retention, and disposal)
- Monitoring requirements
- Data anonymization
- Consent requirements

This data classification model needs to be sorted out before you start migrating all of your data into data stores in GCP so that you can label and organize the data properly from the start. Once the data is labeled, your security team will typically use those labels to detect and respond to noncompliance. You can also use Cloud Data Loss Prevention (DLP) to detect, redact, or mask your streaming data or at-rest data automatically to ensure that Social Security numbers and the like don’t end up in public buckets. You won’t see anything about data classification on the exam, but as a cloud architect, you should strike up this conversation before you migrate your data, because it is a lot easier to architect your data environments accordingly before migration than it is to do it retroactively.
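As a rough illustration of how labels let a security team detect noncompliance, the sketch below tags a few data stores with a classification tier and flags any store that is publicly readable despite carrying a non-public label. The tier names follow the model described above; the Tier numbering, the DataStore fields, and the store names are assumptions for the example and do not correspond to any GCP API.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    # Ordering is an assumption: a higher value means more sensitive data.
    PUBLIC = 0
    CONFIDENTIAL = 1
    CONFIDENTIAL_PROPRIETARY = 2
    HIGHEST_SENSITIVITY = 3


@dataclass
class DataStore:
    name: str
    tier: Tier               # classification label attached to the store
    publicly_readable: bool  # whether anyone on the internet can read it


def find_noncompliant(stores):
    """Return the names of stores that are public but labeled above PUBLIC."""
    return [s.name for s in stores if s.publicly_readable and s.tier > Tier.PUBLIC]


# Hypothetical inventory used only to exercise the check.
inventory = [
    DataStore("marketing-assets", Tier.PUBLIC, publicly_readable=True),
    DataStore("tax-records", Tier.HIGHEST_SENSITIVITY, publicly_readable=True),
    DataStore("hr-exports", Tier.CONFIDENTIAL, publicly_readable=False),
]
flagged = find_noncompliant(inventory)
```

A check like this only works if every store carries a label, which is why the classification conversation has to happen before the migration rather than after it.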

Cloud DLP
Cloud Data Loss Prevention (DLP) is a fully managed service that minimizes the risk of data exfiltration by enabling you to discover, classify, and protect your sensitive data. With Cloud DLP, you can de-identify both streaming and stored data. You can also continuously scan for environments where data does not meet the classification requirements. Despite its name, Cloud DLP is not like a traditional DLP solution: it does not explicitly prevent exfiltration by stopping data from leaving your perimeter, as a traditional DLP does.
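To show what redaction and masking mean in practice, here is a small regex-based sketch. It is not the Cloud DLP API and is far cruder than a managed detector: the pattern only matches Social Security numbers written as NNN-NN-NNNN, and the function names and replacement strings are assumptions chosen for illustration.

```python
import re

# Matches SSNs in the common NNN-NN-NNNN layout only (an illustrative simplification).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_ssns(text, replacement="[REDACTED-SSN]"):
    """Redaction: replace every detected SSN with a fixed placeholder."""
    return SSN_PATTERN.sub(replacement, text)


def mask_ssns(text, mask_char="#"):
    """Masking: hide all digits except the last four of each detected SSN."""

    def _mask(match):
        value = match.group(0)
        return re.sub(r"\d", mask_char, value[:-4]) + value[-4:]

    return SSN_PATTERN.sub(_mask, text)


record = "Applicant SSN 123-45-6789 on file"
redacted = redact_ssns(record)
masked = mask_ssns(record)
```

Redaction removes the value entirely, while masking keeps just enough of it (here, the last four digits) for a human to match records without exposing the full number.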

Encryption
When it comes to encryption, Google Cloud may offer the strongest proposition. Because Google owns its entire infrastructure stack, your cloud environment can take advantage of the same encryption that Google uses in its own corporate environment, which serves billions of users worldwide. Organizations with more stringent encryption needs have a variety of options, each of which has its pros and cons. Let’s dive right in.

Default Encryption
By default, Google encrypts all data at rest inside its infrastructure using the AES-256 encryption algorithm. Data gets chunked automatically, and each chunk is encrypted at the storage level with a data encryption key (DEK). These keys are stored near the data to provide ultra-low latency and high availability. The DEK is then wrapped by the key encryption key (KEK) as part of the standard envelope encryption process. When you use default encryption, you don’t have to worry about managing any of this because it happens automatically. All the operational burden of managing key rings, keys, rotations, and all that jazz is out the door. These keys are managed by a key management service that falls under Google’s purview outside the customer’s organization. Google has very strict policies for managing these keys, and as mentioned, these are the same exact policies that protect Google’s own production services. Default encryption is recommended as the way to go for most organizations.
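The chunk, DEK, and KEK flow described above can be sketched as follows. This only illustrates the structure of envelope encryption and says nothing about Google's actual implementation. In particular, _toy_cipher is an insecure XOR keystream used as a stand-in for AES-256 so that the example runs on the standard library alone, and the chunk size and record layout are assumptions.

```python
import hashlib
import secrets

CHUNK_SIZE = 16  # deliberately tiny for the demo; an assumption, not Google's chunk size


def _toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256-derived keystream. NOT secure: a stand-in for AES-256."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def encrypt(plaintext: bytes, kek: bytes):
    """Encrypt each chunk with its own DEK, then wrap that DEK with the KEK."""
    records = []
    for start in range(0, len(plaintext), CHUNK_SIZE):
        dek = secrets.token_bytes(32)  # fresh 256-bit data encryption key per chunk
        chunk = plaintext[start:start + CHUNK_SIZE]
        records.append({
            "ciphertext": _toy_cipher(dek, chunk),
            "wrapped_dek": _toy_cipher(kek, dek),  # stored next to the data it protects
        })
    return records


def decrypt(records, kek: bytes) -> bytes:
    """Unwrap each DEK with the KEK, then decrypt the chunk it belongs to."""
    out = bytearray()
    for record in records:
        dek = _toy_cipher(kek, record["wrapped_dek"])
        out += _toy_cipher(dek, record["ciphertext"])
    return bytes(out)


kek = secrets.token_bytes(32)  # in the real flow the KEK lives in a key management service
message = b"x" * 40
records = encrypt(message, kek)
```

The design point is that only the small wrapped DEKs depend on the KEK, so the KEK can stay inside a central key service while the bulk data is encrypted close to where it is stored.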

Cloud KMS
With Cloud Key Management Service (KMS), you can manage your cryptographic keys on Google Cloud. KMS enables you to generate and manage the KEKs that protect sensitive data. KMS supports customer-managed encryption keys (CMEKs), customer-supplied encryption keys (CSEKs), and the external key manager. KMS integrates with Cloud HSM, providing you with the ability to generate a key protected by a FIPS 140-2 Level 3 device.
Cloud HSM is a managed, cloud-hosted hardware security module (HSM) service that enables you to protect your cryptographic keys in FIPS 140-2 Level 3 certified HSMs. Cloud HSM integrates easily with Cloud KMS, and you pay for only what you use.

Customer-Managed Encryption Keys
With customer-managed encryption keys, customers can generate their own KEKs to protect their data. The benefit of doing this is not so much for customers to own their encryption keys, but to have more control over the management of the keys. At the end of the day, Google is the one generating the keying material. One reason for using CMEK would be if you require stricter key management processes, including faster rotations and revocation, and you need strong audit trails to monitor which users and services access your keys. When you use CMEK, you can also leverage Cloud HSM to protect your keys in a cloud-based HSM.

Customer-Supplied Encryption Keys
The customer-supplied encryption key service enables customers to bring their own AES-256 keys so that they can have the most control over the keys to their data. Google Cloud doesn’t permanently store CSEKs on its servers, and the keys are purged after every operation, so there is no way that Google or any government agency requesting access to a customer environment could decrypt your data. The issue with CSEK is that it was never widely adopted across Google services; it’s quite a complex engineering feat to roll this out across the entire product list. CSEK is supported by only a couple of services, namely Compute Engine (GCE) and Cloud Storage (GCS).
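For a sense of what "bringing your own key" looks like on the wire, the sketch below builds the three request headers that Cloud Storage expects when an object is written or read with a customer-supplied key: the algorithm name, the base64-encoded key, and the base64-encoded SHA-256 hash of the key. The helper name csek_headers is an assumption, and the example stops at building headers; it does not send a request.

```python
import base64
import hashlib
import secrets


def csek_headers(key: bytes) -> dict:
    """Build the Cloud Storage request headers for a customer-supplied AES-256 key."""
    if len(key) != 32:
        raise ValueError("a CSEK must be a 256-bit (32-byte) key")
    return {
        "x-goog-encryption-algorithm": "AES256",
        # The raw key travels with each request and is not stored permanently.
        "x-goog-encryption-key": base64.b64encode(key).decode("ascii"),
        # The hash lets the service check that the key arrived intact.
        "x-goog-encryption-key-sha256": base64.b64encode(
            hashlib.sha256(key).digest()
        ).decode("ascii"),
    }


my_key = secrets.token_bytes(32)  # generated and kept by the customer, never by Google
headers = csek_headers(my_key)
```

Because the key has to accompany every operation, losing it means losing the data, which is the operational price of the extra control.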

External Key Manager
To counter CSEK’s limited adoption, Google launched External Key Manager (EKM), a service that enables organizations to supply their own keys through a supported external key management partner to protect their data in the cloud. With EKM, your keys are not stored in Google; they are stored with a third party, and you would typically own and control their location, distribution, access, and management. It’s a pretty new service, and it is going to take some time before it’s supported across the spectrum and fully mature, but EKM is the most promising solution for customers that need to fully own their encryption keys. In my opinion, you should either use default encryption or EKM and nothing in between.

To quickly recap:
- If you need SQL queries via an OLTP system, use Cloud Spanner or Cloud SQL.
- If you need interactive querying via an OLAP system, use BigQuery.
- If you need a strong NoSQL database for analytical workloads such as time-series data and IoT data, use Bigtable.
- If you need to store structured data in a document database with support for ACID transactions and SQL-like queries, use Cloud Firestore.
- If you need in-memory data storage, use Memorystore.
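The recap above amounts to a lookup from workload need to service, which can be written down as a small table. The need keys (such as "oltp_sql") and the function name recommend are hypothetical labels invented for this sketch; only the service names come from the list.

```python
# Hypothetical need keys mapped to the services named in the recap above.
RECOMMENDATIONS = {
    "oltp_sql": ["Cloud Spanner", "Cloud SQL"],
    "olap_interactive": ["BigQuery"],
    "nosql_analytical": ["Bigtable"],
    "document_acid": ["Cloud Firestore"],
    "in_memory": ["Memorystore"],
}


def recommend(need: str) -> list:
    """Return the services the recap suggests for a given workload need."""
    try:
        return RECOMMENDATIONS[need]
    except KeyError:
        known = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(f"unknown need {need!r}; expected one of: {known}") from None


choice = recommend("nosql_analytical")
```

A real selection would weigh more than one dimension (scale, consistency, latency, cost), so treat the table as a memory aid for the exam rather than a decision engine.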

Data security is an important aspect of designing a secure cloud environment. Data classification is a mechanism that improves your ability to govern and secure your data by identifying key information and tagging your data accordingly. Cloud DLP offers a data discovery engine that can discover, tag, and de-identify data to minimize the threat of exfiltration. There are a variety of encryption offerings in Google Cloud, but the most important point is that everyone gets encryption at rest by default, with no configuration required, which Google can offer because it owns its entire infrastructure stack end-to-end.

Additional References
If you’d like more information about the topics discussed in this chapter, check out these sources:
- gsutil Tool https://cloud.google.com/storage/docs/gsutil
- Using the bq Command-Line Tool https://cloud.google.com/bigquery/docs/bq-command-line-tool
- Platform Overview - Data & Storage https://www.youtube.com/watch?v=tc2940Zwvyk


